T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
T2I-CompBench statistics and properties.
Abstract
Despite the stunning ability of recent text-to-image generation models to produce high-quality images, current approaches often fail to compose objects with different attributes and relationships into a complex and coherent scene. We propose T2I-CompBench, a comprehensive benchmark for open-world compositional text-to-image synthesis, consisting of 6,000 compositional text prompts from 3 categories (attribute binding, object relationships, and complex compositions) and 6 sub-categories (color binding, shape binding, texture binding, spatial relationships, non-spatial relationships, and complex compositions). We further propose several evaluation metrics specifically designed to evaluate compositional text-to-image generation models, and explore the potential of multimodal LLMs for evaluation. We also propose an improved baseline, Generative mOdel finetuning with Reward-driven Sample selection (GORS), to boost the compositional generation abilities of pretrained text-to-image models. We conduct extensive experiments and evaluations to benchmark previous methods on T2I-CompBench and to validate the effectiveness of our proposed evaluation metrics and the GORS approach.
Introduction
Evaluation
BLIP-VQA for attribute binding evaluation, UniDet for spatial relationship evaluation, and MiniGPT4-CoT as a potential unified metric.
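To make the detector-based spatial check more concrete, the following is a minimal sketch of how a spatial relationship could be read off two detected bounding boxes. It is an illustration under stated assumptions, not the benchmark's actual scoring rule: the box format, the center-comparison heuristic, and the names `spatial_relation` and `relation_matches` are all hypothetical, and the detector (e.g. UniDet) is assumed to have already produced the boxes.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in image coordinates,
# with y growing downward. This format is an assumption for illustration.
Box = Tuple[float, float, float, float]


def center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def spatial_relation(box_a: Box, box_b: Box) -> str:
    """Describe where object A sits relative to object B.

    Hypothetical heuristic: compare box centers and report the axis
    along which the two centers are further apart.
    """
    ax, ay = center(box_a)
    bx, by = center(box_b)
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):
        return "left of" if dx < 0 else "right of"
    return "above" if dy < 0 else "below"


def relation_matches(box_a: Box, box_b: Box, expected: str) -> bool:
    """Check whether the detected layout agrees with the prompt's relation."""
    return spatial_relation(box_a, box_b) == expected
```

For a prompt such as "a dog on the left of a bench", one would pass the detected dog box as `box_a`, the bench box as `box_b`, and `"left of"` as `expected`. A real metric would also need to handle missing detections and heavily overlapping boxes, which this sketch ignores.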
Method
GORS for Compositional Text-to-image Generation.
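The idea of reward-driven sample selection can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes each generated image already carries a scalar alignment reward from some evaluator, that selection is a simple threshold, and that selected samples contribute to the finetuning loss in proportion to their reward. The names `Sample`, `select_samples`, and `reward_weighted_loss` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    prompt: str
    image_id: str
    reward: float  # assumed text-image alignment score from an evaluator


def select_samples(samples: List[Sample], threshold: float) -> List[Sample]:
    """Keep only generated samples whose reward reaches the threshold."""
    return [s for s in samples if s.reward >= threshold]


def reward_weighted_loss(losses: List[float], rewards: List[float]) -> float:
    """Average the per-sample losses, each scaled by its reward.

    Assumption: better-aligned samples should influence finetuning more.
    """
    if len(losses) != len(rewards):
        raise ValueError("losses and rewards must have the same length")
    if not losses:
        return 0.0
    return sum(r * l for r, l in zip(rewards, losses)) / len(losses)
```

In a training loop, the per-sample losses would come from the text-to-image model's usual training objective evaluated on the selected images; here they are plain floats so that the selection and weighting logic stays visible.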
Evaluation Results
Qualitative Comparison
Bibtex