Imagine being asked to draw "iced cola in a teacup". Despite the unusual combination, you would naturally draw a teacup first, then add ice cubes and cola. But what happens when we ask AI artists to do the same? We experimented with this in October 2023 when large-scale AI image generation models were just emerging, and again in July 2024 using state-of-the-art models.
Even the most advanced AI artists, such as DALL-E 3, struggle to conceptualize "iced cola in a teacup", often defaulting to a transparent glass filled with iced cola instead. In the research literature, this issue is known as text-image misalignment. A recent paper titled "Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models", from Professor Dequan Wang's research group at Shanghai Jiao Tong University, explores a new branch of this problem. The paper will be published at the 18th European Conference on Computer Vision (ECCV) in October 2024.
Unlike traditional misalignment issues, which focus on the mutual influence of the two concepts in a pair, the "iced cola in a teacup" example involves a hidden variable: the "transparent glass", which appears in the image despite never being mentioned in the text prompt. The paper terms this phenomenon Latent Concept Misalignment (LC-Mis).
To explore why the teacup disappears from the generated images, the researchers designed a system that uses Large Language Models (LLMs) to quickly collect concept pairs similar to "iced cola in a teacup". They explained the underlying logic of the problem to the LLMs, organized it into categories, and had the LLMs generate further categories and concept pairs following the same logic. The resulting images were then manually rated on a scale of 1 to 5, where 5 indicates a complete failure to generate a correct image.
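The collection loop described above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' actual pipeline: `ask_llm` is an injected stand-in for a real LLM API call, and the pair format and prompt wording are assumptions.

```python
# Hypothetical sketch of the LLM-driven concept-pair collection loop.
# `ask_llm` is an injected stand-in for a real LLM API call.

def collect_pairs(ask_llm, seed_categories, per_category=3):
    """Expand seed categories into LC-Mis candidate concept pairs.

    ask_llm(prompt) -> list of ("concept A", "concept B") tuples.
    Returns a dict mapping each category to its collected pairs.
    """
    dataset = {}
    for category in seed_categories:
        prompt = (
            f"Following the logic of 'iced cola in a teacup', "
            f"give {per_category} concept pairs in the category: {category}"
        )
        # Keep at most per_category pairs per category for manual rating.
        dataset[category] = ask_llm(prompt)[:per_category]
    return dataset
```

Images generated from each collected pair would then be rated by hand on the 1-to-5 scale; only the collection step is sketched here.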
To bring back the missing teacup, the researchers proposed a method called Mixture of Concept Experts (MoCE), which incorporates the human-like sequential drawing process into the multi-step sampling of diffusion models. An LLM first suggests drawing the teacup, and that concept alone conditions the diffusion model for the first T - N sampling steps. The complete prompt "iced cola in a teacup" is then provided for the remaining N steps to produce the final image. The value of N is crucial: it is adjusted via binary search based on the alignment scores between the generated image and the two concepts, teacup and iced cola.
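The two-stage schedule and the binary search over N can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's implementation: `denoise_step`, `generate`, and the two scoring functions are hypothetical callables standing in for the diffusion sampler and the image-text alignment scorer.

```python
# Illustrative sketch of MoCE's two-stage sampling schedule.
# All names are assumptions; real diffusion samplers and alignment
# scorers would be plugged in where the callables are injected.

def moce_sample(denoise_step, latent, first_prompt, full_prompt, T, N):
    """Run T - N denoising steps conditioned only on the first concept
    (e.g. "a teacup"), then N steps on the full prompt
    ("iced cola in a teacup")."""
    for t in range(T):
        prompt = first_prompt if t < T - N else full_prompt
        latent = denoise_step(latent, prompt, t)
    return latent

def tune_switch_point(generate, score_first, score_second, T, rounds=6):
    """Binary-search N so the two concepts' alignment scores balance.

    generate(N)  -> image produced with the schedule above;
    score_first / score_second -> alignment of the image with each concept.
    """
    lo, hi = 0, T
    N = T // 2
    for _ in range(rounds):
        N = (lo + hi) // 2
        image = generate(N)
        if score_first(image) < score_second(image):
            hi = N  # first concept (teacup) underrepresented: shrink N
        else:
            lo = N  # first concept strong enough: give the full prompt more steps
    return N
```

The key design point is that the first concept gets the early, layout-defining steps, mirroring how a person would draw the teacup before pouring in the cola.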
Experiments were conducted with MoCE and various baseline models on the collected dataset, with visualizations of the "iced cola in a teacup" example and human expert evaluations across the entire dataset. MoCE substantially reduced the proportion of Level 5 (most severe) LC-Mis concept pairs compared to the baselines, even outperforming DALL-E 3 (October 2023 version) to some extent.
The researchers also highlighted the limitations of existing automated evaluation metrics on the "iced cola in a teacup" problem. They compared MoCE-generated images with carefully selected images of transparent glass cups with handles, which resemble teacups but, because of their material, are not technically teacups. Popular metrics such as CLIPScore and ImageReward gave higher scores to iced cola in transparent glasses than in teacups, indicating an inherent bias toward associating cola with glass containers.
In conclusion, this study introduces a new branch of text-image misalignment problems: Latent Concept Misalignment (LC-Mis). The researchers developed a system to collect LC-Mis concept pairs, proposed the MoCE method to alleviate the issue, and demonstrated the limitations of current text-image alignment evaluation metrics. Future work will continue to advance generative AI technologies to better meet human needs and expectations.