T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation

Kaiyi Huang¹, Chengqi Duan³, Kaiyue Sun¹, Enze Xie², Zhenguo Li², Xihui Liu¹

¹ The University of Hong Kong, ² Huawei Noah's Ark Lab, ³ Tsinghua University

Failure cases of Stable Diffusion v2 in compositionality.

 


 

 

 

Abstract

Despite the stunning ability of recent text-to-image models to generate high-quality images, current approaches often struggle to compose objects with different attributes and relationships into a complex and coherent scene. We propose T2I-CompBench++, an enhanced and comprehensive benchmark for compositional text-to-image generation. It consists of 8,000 compositional text prompts from 4 categories (attribute binding, object relationships, generative numeracy, and complex compositions) and 8 sub-categories (color binding, shape binding, texture binding, 2D- and 3D-spatial relationships, non-spatial relationships, numeracy, and complex compositions). We further propose several evaluation metrics specifically designed for compositional text-to-image generation and explore the potential and limitations of multimodal LLMs for evaluation. We introduce a new approach, Generative mOdel finetuning with Reward-driven Sample selection (GORS), to boost the compositional text-to-image generation abilities of pretrained text-to-image models. We conduct extensive experiments and evaluations to benchmark previous methods on T2I-CompBench++ and to validate the effectiveness of our proposed evaluation metrics and the GORS approach.

 

Evaluation and Benchmarking on T2I Models


(a) Disentangled BLIP-VQA for attribute binding evaluation, (b) UniDet for 2D/3D-spatial relationship evaluation, (c) UniDet for numeracy evaluation, and (d) MLLM as a potential unified metric.
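As a rough illustration of metrics (a) and (b), the sketch below shows one plausible way to combine per-phrase VQA answers into a single attribute-binding score and to test a 2D spatial relation from detected bounding boxes. This page does not give the exact rules, so everything here is an assumption made for illustration: the function names, the use of a product to combine probabilities, and the center-based "left of" test are hypothetical, and `yes_probability` stands in for a real VQA model call.

```python
from math import prod
from typing import Callable, Sequence, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def disentangled_score(
    questions: Sequence[str],
    yes_probability: Callable[[str], float],
) -> float:
    """Combine per-question 'yes' probabilities into one score.

    Each question is assumed to cover a single attribute-object pair
    (for example "a green bench?"), so the attributes are scored
    separately. Multiplying the probabilities is an assumption of this
    sketch, not a rule stated on the page.
    """
    if not questions:
        raise ValueError("at least one question is required")
    return prod(yes_probability(q) for q in questions)


def box_center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def is_left_of(box_a: Box, box_b: Box) -> bool:
    """Simplified check: A is left of B if A's center x is smaller."""
    return box_center(box_a)[0] < box_center(box_b)[0]
```

For example, with stand-in probabilities of 0.9 for "a green bench?" and 0.5 for "a red car?", `disentangled_score` returns their product, so an image that gets only one of the two attribute bindings right is penalized.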

Benchmarking on all categories with the proposed metrics. Bold marks the best score across the 7 models in T2I-CompBench. Red marks the best score across the 10 models in T2I-CompBench++.

 

Method


GORS for Compositional Text-to-image Generation.
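The name GORS suggests two ingredients: selecting generated samples by a reward, and finetuning on the selected samples. The sketch below is a minimal illustration of that idea under stated assumptions, not the authors' implementation. The `Sample` type, the threshold-based filter, and the reward-weighted average loss are all hypothetical choices made here, and the per-sample losses are plain numbers standing in for a real diffusion training loss.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    """A generated image with its alignment reward (hypothetical type)."""
    prompt: str
    image_id: str
    reward: float  # assumed to lie in [0, 1]; higher means better aligned


def select_samples(samples: Sequence[Sample], threshold: float) -> List[Sample]:
    """Keep only samples whose reward reaches the threshold (assumed rule)."""
    return [s for s in samples if s.reward >= threshold]


def reward_weighted_loss(
    per_sample_losses: Sequence[float],
    rewards: Sequence[float],
) -> float:
    """Average the per-sample losses, each scaled by its reward.

    Weighting by reward is an assumption of this sketch: it lets
    better-aligned samples contribute more to the finetuning signal.
    """
    if len(per_sample_losses) != len(rewards):
        raise ValueError("losses and rewards must have the same length")
    if not rewards:
        raise ValueError("at least one sample is required")
    total = sum(r * l for r, l in zip(rewards, per_sample_losses))
    return total / len(rewards)
```

In a finetuning loop, one would generate several images per prompt, score them with one of the evaluation metrics, call `select_samples` to build the training set, and then use something like `reward_weighted_loss` in place of the unweighted batch loss.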

 

Qualitative Comparison


 

BibTeX


    @article{huang2023t2icompbench,
        title={T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation},
        author={Kaiyi Huang and Kaiyue Sun and Enze Xie and Zhenguo Li and Xihui Liu},
        journal={arXiv preprint arXiv:2307.06350},
        year={2023}
    }