Culture-TRIP: Culturally-Aware Text-to-Image Generation with Iterative Prompt Refinement

Dongguk University
Accepted to NAACL 2025

Qualitative comparison of ablated C-TRIP configurations against the Base Prompt. The six columns fall into two groups: relatively UC nouns (left four columns) and RC nouns (right two columns). For the left group, C-TRIP had to introduce culture nouns that were new to the model, whereas for the right group the additional information helped the model recall what it already knew.

Abstract

Text-to-image models, including Stable Diffusion, have improved significantly at generating images that are semantically aligned with the given prompts. However, existing models may fail to produce appropriate images for cultural concepts or objects that are not well known or are underrepresented in Western cultures, such as 'hangari' (Korean utensil). In this paper, we propose a novel approach, Culturally-Aware Text-to-Image Generation with Iterative Prompt Refinement (Culture-TRIP), which refines the prompt to improve the alignment of the image with such culture nouns in text-to-image models. Our approach (1) retrieves cultural contexts and visual details related to the culture nouns in the prompt and (2) iteratively refines and evaluates the prompt using a set of cultural criteria and large language models. The refinement process uses information retrieved from Wikipedia and the Web. Our user survey, conducted with 66 participants from eight different countries, demonstrates that our proposed approach enhances the alignment between the images and the prompts. In particular, C-TRIP improves the alignment between the generated images and underrepresented culture nouns. Our code and dataset will be made publicly available upon acceptance.


Framework

C-TRIP Overview. First, we retrieve cultural contexts (cultural background, purpose) and visual details related to the culture nouns, as described in Section 3.1. Then, we refine the prompt based on the retrieved information. Finally, we iteratively evaluate and refine the prompt, as described in Section 3.2.
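The three steps in the overview (retrieve, refine, then iteratively evaluate and refine) can be pictured as a simple control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `CultureInfo`, `refine_iteratively`, `threshold`, and `max_iters` are hypothetical, and the retrieval, refinement, and evaluation steps are passed in as callables. Toy stand-ins replace the Wikipedia/Web retrieval and LLM calls so the block runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CultureInfo:
    """Retrieved knowledge about one culture noun (field names are illustrative)."""
    cultural_context: str  # cultural background and purpose
    visual_details: str    # visual appearance of the object


def refine_iteratively(
    base_prompt: str,
    noun: str,
    retrieve: Callable[[str], CultureInfo],
    refine: Callable[[str, CultureInfo, str], str],
    evaluate: Callable[[str, CultureInfo], Tuple[float, str]],
    threshold: float = 0.8,  # assumed stopping score, not taken from the paper
    max_iters: int = 3,      # assumed iteration cap, not taken from the paper
) -> Tuple[str, List[float]]:
    """Retrieve once, refine once, then alternate evaluation and refinement."""
    info = retrieve(noun)                   # step 1: retrieval
    prompt = refine(base_prompt, info, "")  # step 2: initial refinement
    scores: List[float] = []
    for _ in range(max_iters):              # step 3: evaluate, refine again
        score, feedback = evaluate(prompt, info)
        scores.append(score)
        if score >= threshold:
            break
        prompt = refine(prompt, info, feedback)
    return prompt, scores


# --- Toy stand-ins (hypothetical; real versions would call retrieval and an LLM) ---
def toy_retrieve(noun: str) -> CultureInfo:
    # Placeholder strings only; no real retrieval happens here.
    return CultureInfo("Korean utensil", "placeholder visual details")


def toy_refine(prompt: str, info: CultureInfo, feedback: str) -> str:
    # With no feedback yet, add the cultural context; otherwise add what was flagged.
    addition = feedback if feedback else info.cultural_context
    return f"{prompt}, {addition}"


def toy_evaluate(prompt: str, info: CultureInfo) -> Tuple[float, str]:
    # Score = fraction of retrieved fields present; feedback = first missing field.
    fields = [info.cultural_context, info.visual_details]
    missing = [f for f in fields if f not in prompt]
    score = 1 - len(missing) / len(fields)
    return score, (missing[0] if missing else "")


final_prompt, scores = refine_iteratively(
    "a photo of hangari", "hangari", toy_retrieve, toy_refine, toy_evaluate
)
print(final_prompt)
print(scores)
```

Passing the three stages as callables keeps the loop independent of any particular retriever or language model, so each stage can be swapped or ablated separately.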



Culture-TRIP Results

We present additional qualitative examples illustrating how the C-TRIP approach improves the alignment between culture nouns, both UC and RC, and the images generated by Stable Diffusion 2. Results are showcased across China, Germany, India, Japan, Pakistan, South Korea, USA, and Vietnam. Additionally, our approach effectively enhanced the alignment between the generated images and prompts containing UC culture nouns.

Ablation Study

Our approach performed best with the Q1 group, which had the highest median score and a narrow IQR, indicating consistent improvement across this group. This suggests that C-TRIP effectively reinforces the generation of images for the Q1 group, enhancing alignment for UC nouns. The Q2 group showed the highest upper range within the IQR and the second-highest median, suggesting effective but slightly more variable improvements. However, the Q3 group showed the lowest performance with a wide variance, which may indicate that additional information could decrease alignment for culture nouns.
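For readers less familiar with the statistics used above, the median and interquartile range (IQR) of a group's scores can be computed as in the short sketch below. The scores are made-up illustrative numbers, not the survey data, and the group names `group_a` and `group_b` are hypothetical. A narrow IQR means the middle half of the scores lie close together.

```python
from statistics import median, quantiles


def summarize(scores):
    """Return the median and interquartile range of a list of scores."""
    # quantiles(..., n=4) returns the three quartile cut points
    # (Python's default "exclusive" method).
    lower_quartile, _, upper_quartile = quantiles(scores, n=4)
    return {"median": median(scores), "iqr": upper_quartile - lower_quartile}


# Hypothetical scores for illustration only; NOT data from the user survey.
hypothetical_scores = {
    "group_a": [1, 2, 3, 4, 5, 6, 7],  # spread-out scores: wider IQR
    "group_b": [4, 4, 4, 4, 4, 4, 4],  # identical scores: zero IQR
}

summary = {name: summarize(vals) for name, vals in hypothetical_scores.items()}
for name, stats in summary.items():
    print(name, stats)
```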



Limitation

Sources like Wikipedia and general Web content contain cultural biases, which can affect the refinement process and C-TRIP's capacity to provide balanced cultural representation. Future work should focus on enhancing the information retrieval process through developing culturally diverse datasets, thereby ensuring high-quality, relevant data for effective prompt refinement. The limited scope of prompts utilized in our experiments and human evaluations presents a current limitation and suggests an important avenue for future research. Additionally, our work is constrained by the perceptual biases of human annotators from eight countries. To improve the reliability of evaluation outcomes, future work will emphasize the inclusion of annotators from a broader range of cultural backgrounds.


Citation & BibTeX

Suchae Jeong, Inseong Choi, Youngsik Yun, and Jihie Kim, "Culture-TRIP: Culturally-Aware Text-to-Image Generation with Iterative Prompt Refinement," The 2025 Annual Conference of the Nations of the Americas Chapter of the ACL (NAACL 2025).

@article{jeong2025culture,
    title={Culture-TRIP: Culturally-Aware Text-to-Image Generation with Iterative Prompt Refinement},
    author={Jeong, Suchae and Choi, Inseong and Yun, Youngsik and Kim, Jihie},
    journal={arXiv preprint arXiv:2502.16902},
    year={2025}
}