GenMAC:
Compositional Text-to-Video Generation with Multi-Agent Collaboration

Kaiyi Huang¹, Yukun Huang¹, Xuefei Ning², Zinan Lin³, Yu Wang², Xihui Liu¹

¹The University of Hong Kong ²Tsinghua University ³Microsoft Research


An icy landscape. A vast expanse of snow-covered mountain peaks stretches endlessly. Beneath them is a dense forest and a colossal frozen lake. Two people are boating in one boat, and one person is boating in one boat separately. Above, a ferocious red dragon dominates the sky and commands the heavens.

A robot walking from right to left across the moon with a car driving left to right in the background.

A small mouse in a tattered waistcoat reads a tiny book by the light of a glowing mushroom, with dew drops glistening on the grass around him.

Three roses and two sunflowers.

A sailboat sails from right to left across the blue waters.

Abstract

Text-to-video generation models have shown significant progress in recent years. However, they still struggle to generate complex dynamic scenes from compositional text prompts, which involve challenges such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role-specialized MLLM agent. Multiple agents can collaborate to achieve collective intelligence for complex goals. We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation. The collaborative workflow includes three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most challenging: it aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid hallucination by a single MLLM agent, we decompose this stage into four sequentially executed MLLM-based agents: a verification agent, a suggestion agent, a correction agent, and an output structuring agent. Furthermore, to tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism that adaptively selects the proper correction agent from a collection of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GenMAC, which achieves state-of-the-art performance in compositional text-to-video generation.

Highlights

🌟 Task decomposition and role specialization for multi-agent collaboration and collective intelligence.
🌟 GenMAC for compositional text-to-video generation with multi-agent collaboration, an iterative workflow with Design, Generation, and Redesign stages.
🌟 Sequential task decomposition and adaptive self-routing for specialized agent selection.

Video

Method

Overview of GenMAC. (1) The collaborative workflow includes three stages with an iterative loop: Design, Generation, and Redesign. (2) Task decomposition splits the Redesign stage into four sub-tasks, each handled by its own agent. (3) The self-routing mechanism adaptively selects a suitable correction agent to address the diverse requirements of compositional text-to-video generation.
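The iterative loop between Generation and Redesign can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the GenMAC implementation: `Plan`, `run_workflow`, the `toy_*` stand-ins, the `max_iters` cap, and the "stop once Redesign reports alignment" rule are all hypothetical, and no real MLLM or video model is called.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Plan:
    """Inputs handed to the video generator (field names are illustrative)."""
    prompt: str
    layouts: List[list] = field(default_factory=list)  # one list of boxes per frame
    guidance_scale: float = 7.5                         # arbitrary placeholder value


def run_workflow(prompt, design, generate, redesign, max_iters=3):
    """Design once, then alternate Generation and Redesign.

    Assumption: the loop stops when Redesign reports the video as aligned,
    or after `max_iters` generation rounds. The source does not state the
    stopping rule; it is chosen here only to make the sketch terminate.
    """
    plan = design(prompt)            # Design stage
    video = generate(plan)           # first Generation round
    rounds = 1
    while rounds < max_iters:
        aligned, revised = redesign(prompt, video, plan)  # Redesign stage
        if aligned:
            break
        plan = revised               # revised prompt, layouts, guidance scale
        video = generate(plan)       # next Generation round
        rounds += 1
    return video, plan, rounds


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_design(prompt):
    return Plan(prompt=prompt)


def toy_generate(plan):
    return f"video<{plan.prompt} | scale={plan.guidance_scale}>"


def toy_redesign(prompt, video, plan):
    # Pretend the video only matches the prompt once the scale reaches 9.5.
    aligned = plan.guidance_scale >= 9.5
    revised = Plan(plan.prompt, plan.layouts, plan.guidance_scale + 1.0)
    return aligned, revised


if __name__ == "__main__":
    video, plan, rounds = run_workflow(
        "Three roses and two sunflowers.",
        toy_design, toy_generate, toy_redesign, max_iters=5,
    )
    print(rounds, plan.guidance_scale, video)
```

Passing the three stages in as callables keeps the loop independent of any particular model, so each stand-in could be swapped for a real component without touching the control flow.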

Illustration of the allocation of roles in the Redesign stage: verification agent, suggestion agent, correction agent, and output structuring agent within a sequential task breakdown, highlighting the clear responsibilities of each agent.
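The sequential hand-off between the four Redesign agents, together with self-routing to a specialized correction agent, might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the agents are plain functions rather than MLLM calls, the scenario labels `"numeracy"` and `"generic"` are invented, routing on object-count mismatches is a toy rule, and a single layout stands in for frame-wise layouts.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]


def verification_agent(state: State) -> State:
    """Stub: compare requested object counts with counts seen in the video."""
    expected = state["expected_counts"]
    observed = state["observed_counts"]
    mismatches = {
        name: (observed.get(name, 0), want)
        for name, want in expected.items()
        if observed.get(name, 0) != want
    }
    return {**state, "mismatches": mismatches}


def suggestion_agent(state: State) -> State:
    """Stub: turn each mismatch into a plain-language suggestion."""
    suggestions = [
        f"show {want} {name} instead of {got}"
        for name, (got, want) in state["mismatches"].items()
    ]
    return {**state, "suggestions": suggestions}


def correct_numeracy(state: State) -> State:
    """Hypothetical specialist: one layout slot per requested object."""
    layout = [
        {"object": name, "index": i}
        for name, want in state["expected_counts"].items()
        for i in range(want)
    ]
    return {**state, "layout": layout,
            "guidance_scale": state["guidance_scale"] + 1.0}


def correct_generic(state: State) -> State:
    """Hypothetical fallback: leave layout and guidance scale unchanged."""
    return {**state, "layout": state.get("layout", [])}


# Collection of correction agents, each specialized for one scenario.
CORRECTION_AGENTS: Dict[str, Callable[[State], State]] = {
    "numeracy": correct_numeracy,
    "generic": correct_generic,
}


def route(state: State) -> str:
    """Toy self-routing rule: pick a scenario label from earlier findings."""
    return "numeracy" if state["mismatches"] else "generic"


def output_structuring_agent(state: State) -> State:
    """Keep only the fields the next Generation round would need."""
    keys = ("prompt", "layout", "guidance_scale", "suggestions")
    return {key: state[key] for key in keys}


def redesign(state: State) -> State:
    state = verification_agent(state)
    state = suggestion_agent(state)
    state = CORRECTION_AGENTS[route(state)](state)  # self-routing step
    return output_structuring_agent(state)


if __name__ == "__main__":
    print(redesign({
        "prompt": "Three roses and two sunflowers.",
        "expected_counts": {"roses": 3, "sunflowers": 2},
        "observed_counts": {"roses": 2, "sunflowers": 2},
        "guidance_scale": 7.5,
    }))
```

Each agent receives the accumulated state and adds only its own fields, which mirrors the clear division of responsibilities described in the caption.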


Examples of GenMAC. Multi-agent collaboration and iterative refinement improve scene accuracy and text alignment.

Qualitative comparison

More examples

A detailed, whimsical illustration of a colorful, rainbow boba tea cup with a cute, smiling face, sitting on a tiki bar-style windowsill overlooking a tropical oceanfront.

In a fantasy world, two beautiful girls are wandering in the city mall chat and eating where the stall is on both side and the big castle is behind.

An icy landscape. A vast expanse of snow-covered mountain peaks stretches endlessly. Beneath them is a dense forest and a colossal frozen lake. Three people are boating in three boats separately Above, a ferocious red dragon dominates the sky and commands the heavens.

Vast field full of colorful herbs on a windy day, depth, atmospheric light.

A cute ragdoll cat is sitting on the table in the middle, and beautiful teapot on the right and a tea cup on the left.

Rabbit tailor sews fabric into a dress.

A football rolling from the left to the right on the grass.

Golden retriever wearing a blue beret, a yellow sunglasses and a red scarf.

A balloon drifts right to left above a statue in a city square.

Five apples on the table.

Eight apples on the table.

Ten apples on the table.

One rose and one sunflower.

Two roses and one sunflower.

Three roses and three sunflowers.

A blue cat and a white dog.

A black cat and an orange dog.

A purple cat and a brown dog.

A blue chair in a white room.

A black chair in an orange room.

A purple chair in a brown room.

BibTeX

@article{huang2024genmaccompositionaltexttovideogeneration,
      title={GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration}, 
      author={Kaiyi Huang and Yukun Huang and Xuefei Ning and Zinan Lin and Yu Wang and Xihui Liu},
      year={2024},
      eprint={2412.04440},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.04440}, 
}