Abstract

Collaboration is a cornerstone of society. In the real world, human teammates make use of multi-sensory data to tackle challenging tasks in ever-changing environments. It is essential for embodied agents collaborating in visually-rich environments replete with dynamic interactions to understand multi-modal observations and task specifications. To evaluate the performance of generalizable multi-modal collaborative agents, we present TeamCraft, a multi-modal multi-agent benchmark built on top of the open-world video game Minecraft. The benchmark features 55,000 task variants specified by multi-modal prompts, procedurally-generated expert demonstrations for imitation learning, and carefully designed protocols to evaluate model generalization capability. We also perform extensive analyses to better understand the limitations and strengths of existing approaches. Our results indicate that existing models continue to face significant challenges in generalizing to novel goals, scenes, and unseen numbers of agents. These findings underscore the potential for further research in this area.

Tasks

Building: build structures based on a blueprint.

Clearing: remove blocks from a designated area.

Farming: sow and harvest crops.

Smelting: obtain goal items using furnaces.
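
To make the task families concrete, the sketch below shows one way a variant from each family might be parameterized; the field names and values are illustrative assumptions rather than the benchmark's actual schema.

# Illustrative sketch of how one task variant from each family might be
# parameterized. Every field name and value here is an assumption made for
# illustration, not the benchmark's actual specification format.
example_variants = [
    {"task": "building", "goal": "blueprint_0042", "num_agents": 2},
    {"task": "clearing", "goal": "clear_5x5_area", "num_agents": 3},
    {"task": "farming",  "goal": "harvest_wheat",  "num_agents": 2},
    {"task": "smelting", "goal": "cooked_chicken", "num_agents": 2},
]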

Multi-modal VLA Model

Model Architecture

TeamCraft-VLA (Vision-Language-Action) is a multi-modal vision-language-action model designed for multi-agent collaboration. The model first encodes the multi-modal prompt specifying the task, then, at each time step, encodes the agents' visual observations and inventory information to generate actions.
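
A minimal sketch of this flow is given below, assuming a generic image encoder and an autoregressive language model that emits actions as text; all class and method names are illustrative, not the model's actual interface.

# Minimal sketch of the described flow: encode the multi-modal task prompt
# once, then at each time step encode per-agent observations and inventories
# to produce actions. All names and interfaces here are assumptions.
class TeamCraftVLASketch:
    def __init__(self, vision_encoder, language_model):
        self.vision_encoder = vision_encoder    # e.g., a CLIP-style image encoder
        self.language_model = language_model    # autoregressive LM that emits actions as text

    def encode_prompt(self, orthographic_views, instruction):
        # Encode the blueprint views plus the language instruction once per task.
        view_tokens = [self.vision_encoder(v) for v in orthographic_views]
        return {"views": view_tokens, "instruction": instruction}

    def step(self, task_context, first_person_views, inventories):
        # Encode each agent's current first-person view together with its
        # inventory, then decode one action per agent.
        actions = []
        for view, inventory in zip(first_person_views, inventories):
            obs_tokens = self.vision_encoder(view)
            actions.append(
                self.language_model.generate(
                    prompt=task_context, observation=obs_tokens, inventory=inventory
                )
            )
        return actions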

Multi-modal Prompts

Multi-Modal

Multi-modal prompts are provided for all tasks. The system prompt includes both the three orthographic views and specific language instructions. Observations consist of first-person views from different agents, along with agent-specific information.
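
One possible way to organize these prompt components is sketched below; the dictionary layout, file names, and field names are assumptions made for illustration, not the actual prompt format.

# Hypothetical organization of the multi-modal prompt pieces described above.
system_prompt = {
    "orthographic_views": ["front.png", "side.png", "top.png"],  # task specification images
    "instruction": "Harvest all mature wheat and replant the seeds.",
}

# Per-time-step observation for each agent: a first-person view plus
# agent-specific information such as the current inventory.
observations = {
    "agent_0": {"first_person_view": "agent_0_t12.png", "inventory": {"wheat_seeds": 8}},
    "agent_1": {"first_person_view": "agent_1_t12.png", "inventory": {"wheat": 3}},
}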

Diversity

Variants

More than 30 target objects and resources are used across the tasks. Objects such as fences, anvils, and stone blocks differ in shape and texture. Farm crops change in visual appearance as they pass through growth stages. The smelting task involves resources with different appearances, such as chickens and rabbits.

Experiment Results

Success rate for subgoal and task

For both the 7B and 13B models, the subgoal and task success rates fall short of optimal performance. This highlights the inherent difficulty of the designed tasks and underscores the current limitations of VLA models. The centralized setting performs significantly better across nearly all variants, highlighting the challenge of effective planning with partial information. The performance of the language model in the text-based Grid-World significantly surpasses that of VLA models in multi-modal settings. This suggests that state descriptions provided purely in text form are less challenging for models than multi-modal inputs, underscoring a notable gap in current VLA models’ ability to effectively interpret visual information.
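
For reference, the sketch below shows one straightforward way subgoal and task success rates could be computed from evaluation episodes, assuming a simple per-episode record format; the record fields are an assumption, not the benchmark's evaluation code.

# A hedged sketch of computing subgoal and task success rates from evaluation
# episodes. The episode record format is an illustrative assumption.
def success_rates(episodes):
    """episodes: list of dicts like
    {"subgoals_done": int, "subgoals_total": int, "task_done": bool}"""
    subgoal_rate = sum(e["subgoals_done"] for e in episodes) / sum(e["subgoals_total"] for e in episodes)
    task_rate = sum(e["task_done"] for e in episodes) / len(episodes)
    return subgoal_rate, task_rate

# Example: two episodes, one fully solved, one with 3 of 4 subgoals completed.
print(success_rates([
    {"subgoals_done": 4, "subgoals_total": 4, "task_done": True},
    {"subgoals_done": 3, "subgoals_total": 4, "task_done": False},
]))  # -> (0.875, 0.5)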

Success rate across models

Task success rates of the centralized and decentralized VLA models and GPT-4o. All models except GPT-4o are trained on the full dataset. TeamCraft-VLA-7B-Cen outperforms the other two methods by a significant margin across nearly all variants.

Team

Check out our paper!

@article{long2024teamcraft,
  title={TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft},
  author={Long, Qian and Li, Zhi and Gong, Ran and Wu, Ying Nian and Terzopoulos, Demetri and Gao, Xiaofeng},
  journal={arXiv preprint arXiv:2412.05255},
  year={2024}
}