This study explores the development of an automated audio description (AD) framework for local cultural promotional videos using a human-machine collaborative approach. The proposed framework integrates a multimodal large language model, Doubao, with human expertise to enhance AD production, particularly for videos featuring culturally rich content. Taking the Fujian-based promotional video “Where There Are Dreams, There Is Fu” as a case study, it addresses two primary challenges in AD production: cross-frame coherence and accurate interpretation of cultural symbols. Through iterative human-machine collaboration, the model generates coherent, culturally grounded AD scripts that align with the cognitive patterns of visually impaired audiences. This research highlights the potential of GenAI-driven solutions for creating accessible content for public welfare organizations while maintaining cultural authenticity. The proposed framework offers a scalable, cost-effective approach to improving accessibility and promoting cultural heritage for visually impaired individuals.
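To make the iterative human-machine collaboration described above more concrete, the sketch below outlines one possible drafting loop: a multimodal model proposes an AD line for each scene, a human reviewer supplies cultural corrections, and accepted lines are carried into later prompts to preserve cross-frame coherence. The `Scene` structure, the `call_multimodal_llm` stub, and the prompt wording are illustrative assumptions, not the authors' implementation or the Doubao API.

```python
# Minimal sketch of a human-in-the-loop AD drafting loop.
# Assumptions: call_multimodal_llm is a placeholder for whatever multimodal
# model endpoint is actually used (e.g. Doubao); prompts and data fields are
# hypothetical and only illustrate the workflow described in the abstract.
from dataclasses import dataclass
from typing import List


@dataclass
class Scene:
    """One video segment to describe: timecodes, key frames, human notes."""
    start: str
    end: str
    frames: List[str]          # paths to representative frames
    cultural_notes: str = ""   # human-supplied context for cultural symbols


def call_multimodal_llm(prompt: str, frames: List[str]) -> str:
    """Placeholder for a multimodal LLM call; returns a draft AD sentence."""
    raise NotImplementedError("Connect this to the model API you actually use.")


def draft_ad_script(scenes: List[Scene], max_rounds: int = 3) -> List[str]:
    """Draft, review, and regenerate AD lines scene by scene, carrying recent
    narration forward so descriptions stay coherent across frames."""
    script: List[str] = []
    for scene in scenes:
        context = " ".join(script[-3:])  # recent narration for coherence
        prompt = (
            f"Previous narration: {context}\n"
            f"Cultural context: {scene.cultural_notes}\n"
            f"Describe the segment {scene.start}-{scene.end} for a visually "
            f"impaired audience in one concise sentence."
        )
        draft = call_multimodal_llm(prompt, scene.frames)
        for _ in range(max_rounds):
            feedback = input(f"Feedback on '{draft}' (blank to accept): ").strip()
            if not feedback:
                break  # reviewer accepted the line
            draft = call_multimodal_llm(
                prompt + f"\nReviewer feedback: {feedback}", scene.frames
            )
        script.append(draft)
    return script
```

In this sketch the human reviewer's feedback is appended to the prompt rather than edited into the draft directly, which mirrors the iterative regeerate-and-review pattern the abstract describes; how feedback is actually incorporated in the study is not specified here.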