ISSN : 2663-2187

Cross-Lingual Visual Understanding: A Transformer-Based Approach for Bilingual Image Caption Generation


Emran Al-Buraihy, Dan Wang
doi: 10.48047/AFJBS.6.8.2024.2990-3002

Abstract

In the evolving landscape of artificial intelligence, automatically generating image captions that are not only accurate but also culturally and linguistically nuanced remains a significant challenge, especially across languages as different as Arabic and English. This research addresses the gap in bilingual image captioning by developing a transformer-based model designed to handle cultural and linguistic diversity effectively. The proposed model combines Convolutional Neural Networks (CNNs) for visual feature extraction with a dual-language transformer architecture that incorporates a novel cultural context embedding layer, enabling the generation of culturally sensitive and linguistically accurate captions. The model was trained and evaluated on a curated dataset of culturally diverse images annotated in both target languages, and it outperformed existing models: it achieved CIDEr scores of 60.2 for English and 58.7 for Arabic, underscoring its efficacy in generating contextually and culturally coherent captions. This study not only advances multilingual image captioning but also sets a new standard for integrating cultural sensitivity into AI, with significant implications for future applications in global digital content accessibility.
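The pipeline described above, CNN visual features conditioned on a per-language cultural context embedding before caption decoding, might be sketched as follows. All names, dimensions, and the fusion-by-addition operator are illustrative assumptions; the abstract does not specify the authors' implementation, so this is a minimal stand-in, not their architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 512  # shared feature dimension (assumed, not stated in the paper)
LANGUAGES = {"en": 0, "ar": 1}

# Hypothetical stand-in for the paper's cultural context embedding layer:
# one learned vector per target language.
cultural_embeddings = rng.normal(size=(len(LANGUAGES), D))

# Fixed random projection standing in for the CNN's final layer (assumption).
projection = rng.normal(size=(3, D))

def extract_visual_features(image: np.ndarray) -> np.ndarray:
    """Stand-in for the CNN encoder: global-average-pool the pixel
    channels and project into the shared D-dimensional space."""
    pooled = image.reshape(-1, image.shape[-1]).mean(axis=0)  # (C,)
    return pooled @ projection                                # (D,)

def fuse_with_cultural_context(visual: np.ndarray, lang: str) -> np.ndarray:
    """Condition visual features on the target language by adding its
    cultural-context embedding (one plausible fusion operator; the
    paper does not specify which is used)."""
    return visual + cultural_embeddings[LANGUAGES[lang]]

# The fused vector would then be fed to the language-specific
# transformer decoder to produce the caption.
image = rng.random((224, 224, 3))
visual = extract_visual_features(image)
features_en = fuse_with_cultural_context(visual, "en")
features_ar = fuse_with_cultural_context(visual, "ar")
```

Because the two languages share the visual encoder but receive distinct context embeddings, the decoder can produce captions tailored to each language from a single image representation.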
