MMInA: Benchmarking Multihop Multimodal Internet Agents

(* equal contributions, † corresponding author)
S-Lab, Nanyang Technological University
Teaser.

An example task from MMInA. To evaluate an Internet agent's ability to carry out complex tasks, the agent must navigate through a variety of websites to gather information and execute actions. In our proposed holistic evaluation protocol, each phase of the compositional task (defined as a hop) is assessed for performance, as is the overall task. Our benchmark includes 1,050 varied human-written multimodal tasks that require an average of 2.85 hops between websites and 12.9 actions to complete; the longest compositional task requires 10 hops.
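As a rough illustration of this task structure (not the benchmark's actual data format; all names below are hypothetical), a compositional task can be viewed as a list of hops, each pairing a website with a sub-goal:

from dataclasses import dataclass, field

@dataclass
class Hop:
    """One phase of a compositional task: a sub-goal on a single website."""
    website: str          # e.g. a shopping or travel site
    goal: str             # what the agent must find or do on this site
    succeeded: bool = False

@dataclass
class MultihopTask:
    """A human-written task composed of one or more hops across websites."""
    instruction: str                               # the natural-language user task
    hops: list[Hop] = field(default_factory=list)

    @property
    def num_hops(self) -> int:
        return len(self.hops)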

Video

Abstract

Autonomous embodied agents live on an Internet of multimedia websites. Can they hop around multimodal websites to complete complex user tasks? Existing benchmarks do not assess them in realistic, evolving environments that require embodiment across multiple websites. To answer this question, we present MMInA, a multihop and multimodal benchmark for evaluating embodied agents on compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to autonomously extract multimodal information from web pages as observations; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from, or actions on, multiple websites to solve, assessing long-range reasoning capabilities on web tasks; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks. We experiment with both standalone (multimodal) language models and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We find that agents are more likely to fail on the early hops of tasks with more hops, which lowers task success rates. To address this issue, we propose a simple memory augmentation approach that replays past action trajectories for reflection. Our method significantly improves both the single-hop and multihop web browsing abilities of agents.

MMInA Benchmark

Benchmarking multihop multimodal Internet agents.

Source websites of the 2,989 hops. We form MMInA tasks by combining hops from 14 evolving websites. The websites are real-world websites that are publicly accessible, dynamically updated, and have a variety of layouts and styles.

MMInA multihop task statistics. Left: Counts of multi-hop tasks. Right: Counts of actions in multi-hop tasks.

MMInA benchmark results. We evaluated four types of agents on the proposed MMInA benchmark: 1) LLM Agents; 2) LMM Agents; 3) Heuristic-Based Web Agents; 4) Human Baselines. The hop success rate is the percentage (%) of successful visits to the targeted websites, while the task success rate is the percentage (%) of tasks completed successfully across the whole task set.
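To make the two metrics concrete, here is a minimal sketch of how they could be computed, assuming each task is reduced to a list of per-hop success flags (a hypothetical helper, not the benchmark's evaluation code); a task is counted as successful here only if all of its hops succeed:

def hop_success_rate(results: list[list[bool]]) -> float:
    """Percentage of hops whose targeted website was successfully visited."""
    hops = [ok for task in results for ok in task]
    return 100.0 * sum(hops) / len(hops)

def task_success_rate(results: list[list[bool]]) -> float:
    """Percentage of tasks solved end to end (approximated as all hops succeeding)."""
    return 100.0 * sum(all(task) for task in results) / len(results)

# Example: two tasks, one fully solved and one failing from its second hop on.
results = [[True, True], [True, False, False]]
print(hop_success_rate(results), task_success_rate(results))  # 60.0 50.0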

Memory-augmented Agents

Augmenting Internet agents with procedural memory.

Memory-augmented agents. Our method complements LMMs by enhancing procedural memory with action trajectories from similar tasks. The memory-augmented agent replays past action trajectories to reflect on the current task, which significantly improves both its single-hop and multihop web browsing abilities.
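As a rough sketch of this idea (the function and prompt format below are illustrative assumptions, not the paper's exact implementation), retrieved trajectories from similar past tasks can simply be prepended to the agent's prompt so the model can reflect on them before choosing its next action:

def build_prompt_with_memory(task: str, observation: str, past_trajectories: list[str]) -> str:
    """Compose a prompt that replays past action trajectories before the current step."""
    replay = "\n\n".join(
        f"Past trajectory {i + 1}:\n{traj}" for i, traj in enumerate(past_trajectories)
    )
    return (
        "You have solved similar web tasks before. Reflect on these trajectories, "
        "then decide the next action for the current task.\n\n"
        f"{replay}\n\n"
        f"Current task: {task}\n"
        f"Current observation: {observation}\n"
        "Next action:"
    )

# Example usage with a single remembered trajectory.
prompt = build_prompt_with_memory(
    task="Find a one-way flight and then book a nearby hotel.",
    observation="[accessibility tree of the current web page]",
    past_trajectories=["1. type 'flights to Tokyo'\n2. click 'Search'\n3. stop"],
)
print(prompt)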

BibTeX

If you find our work useful, please consider citing our paper:

@misc{zhang2024mmina,
        title={MMInA: Benchmarking Multihop Multimodal Internet Agents}, 
        author={Ziniu Zhang and Shulin Tian and Liangyu Chen and Ziwei Liu},
        year={2024},
        eprint={2404.09992},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
}