CIC: A Framework for Culturally-Aware Image Captioning

Dongguk University
Accepted to IJCAI 2024

Comparison of captions generated by Vision-Language Pre-trained models (VLPs) such as BLIP and GIT for images from the same cultural group (blue box) and from different cultural groups (red box). In the blue box, although both images belong to the Japanese cultural group, the traditional Japanese clothing (i.e., kimono) is not described in the lower image. In the red box, it is difficult to distinguish the cultural groups from the generated captions.

Abstract

Image Captioning, which generates descriptive sentences from images using Vision-Language Pre-trained models (VLPs) such as BLIP, has improved greatly. However, current methods lack detailed descriptive captions for the cultural elements depicted in images, such as the traditional clothing worn by people from Asian cultural groups. In this paper, we propose a new framework, Culturally-aware Image Captioning (CIC), that generates captions describing the cultural elements extracted from cultural visual elements in images representing cultures. Inspired by methods that combine the visual modality and Large Language Models (LLMs) through appropriate prompts, our framework (1) generates questions based on cultural categories from images, (2) extracts cultural visual elements through Visual Question Answering (VQA) using the generated questions, and (3) generates culturally-aware captions using an LLM with these prompts. Our human evaluation, conducted with 45 participants from 4 different cultural groups who have a high understanding of the corresponding culture, shows that our proposed framework generates more culturally descriptive captions than image captioning baselines based on VLPs. Our code and dataset will be made publicly available upon acceptance.


Framework

Overview of the CIC framework. First, culture questions are generated as described in Section 3.1 of the paper. Then, the cultural visual elements represented in the image are extracted through VQA as described in Section 3.2. Finally, an LLM generates culturally-aware captions as described in Section 3.3. A minimal sketch of this pipeline is given below.
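To make the three-stage pipeline concrete, here is a minimal Python sketch. The cultural categories, the question template, the choice of off-the-shelf captioning/VQA checkpoints, and the commented-out call_llm helper are illustrative assumptions for this page, not the paper's released implementation.

# Minimal sketch of the three CIC stages (Secs. 3.1-3.3); assumptions noted above.
from PIL import Image
from transformers import pipeline

# Illustrative cultural categories; the paper defines its own category set.
CULTURE_CATEGORIES = ["clothing", "food", "architecture"]

def generate_culture_questions(categories):
    """Stage 1: turn cultural categories into questions about the image."""
    return [f"What traditional {c} is shown in the image?" for c in categories]

def extract_cultural_elements(vqa, image, questions):
    """Stage 2: answer each culture question with a VQA model."""
    return {q: vqa(image=image, question=q)[0]["answer"] for q in questions}

def build_caption_prompt(base_caption, elements):
    """Stage 3: assemble an LLM prompt from the generic caption and the extracted elements."""
    facts = "\n".join(f"- {q} -> {a}" for q, a in elements.items())
    return (
        "Rewrite the caption so that it also describes the cultural elements below.\n"
        f"Generic caption: {base_caption}\n"
        f"Cultural elements:\n{facts}\n"
        "Culturally-aware caption:"
    )

if __name__ == "__main__":
    image = Image.open("example.jpg")  # any culture-representative image

    captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
    vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

    base_caption = captioner(image)[0]["generated_text"]
    questions = generate_culture_questions(CULTURE_CATEGORIES)
    elements = extract_cultural_elements(vqa, image, questions)

    prompt = build_caption_prompt(base_caption, elements)
    print(prompt)  # pass this prompt to an LLM (e.g., a hypothetical call_llm(prompt)) for the final caption

This sketch only mirrors the control flow of the framework; the exact question generation and prompt design follow Sections 3.1-3.3 of the paper.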



Culturally-aware Image Caption Results

Captions generated for images depicting the four given cultural groups by the baseline models (GIT, CoCa, BLIP-2) and our framework (CIC). The red words in each caption represent the visual elements of each culture. Compared to the existing baseline models, our framework describes the images in a more culturally aware manner.



Limitation

The results of culturally-aware image captioning for modern images from the four given cultural groups are presented. Looking at the four images, it is not difficult to distinguish between the cultural groups. However, it is challenging to identify cultural elements from the generated captions, and distinguishing the cultural groups based solely on the captions is not possible. This highlights a limitation of our five defined cultural elements, which can only differentiate unique cultures within a cultural group. For modern cultural images, considering additional cultural elements, such as race and modern city style, would be a direction for future work.



Analysis for Human Evaluation

The numbers in parentheses indicate the percentage of participants who judged that the cultural description was well done for each image. The three images below show instances where our framework received low ratings, reflecting its limitations with more contemporary cultures.

Citation & BibTeX

Youngsik Yun and Jihie Kim, "CIC: A Framework for Culturally-Aware Image Captioning." International Joint Conference on Artificial Intelligence (IJCAI), 2024.

"@article{yun2024cic,
    title={CIC: A Framework for Culturally-Aware Image Captioning},
    author={Yun, Youngsik and Kim, Jihie},
    journal={arXiv preprint arXiv:2402.05374},
    year={2024}
}
"@inproceedings{ijcai2024p180,
    title={CIC: A Framework for CUlturally-Aware Image Captioning},
    author={Yun, Youngsik and Kim, Jihie},
    booktitle={Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, {IJCAI-24}}
    publisher={International Joint Conferences on Artificial Intelligence Organization},
    editor={Kate Larson},
    pages={1625--1633},
    year={2024},
    month={8},
    note={Main Track},
    doi={10.24963/ijcai.2024/180},
    url={https://doi.org/10.24963/ijcai.2024/180},
}