This paper takes Chinese red culture resources as its research object and evaluates the Chinese-English translation quality of three major AI platforms: ChatGPT-4.0, ERNIE Bot, and DeepSeek. Through automatic quantitative evaluation, it systematically analyzes how each platform performs in translating red culture texts. The study draws on a diverse corpus comprising historical documents, red classic texts, and culturally loaded terms, and applies three automatic evaluation metrics (GLEU, METEOR, and COMET) for a comprehensive assessment.
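To make the metric setup concrete, the sketch below scores a single machine translation against a human reference with all three metrics. It is a minimal illustration, assuming the `nltk` and `unbabel-comet` Python packages and the Unbabel/wmt22-comet-da checkpoint; the source, reference, and hypothesis sentences are invented placeholders, not items from the study's corpus.

```python
# Minimal sketch of a three-metric scoring pipeline (GLEU, METEOR, COMET),
# assuming `pip install nltk unbabel-comet`. The example sentences are
# illustrative placeholders, not drawn from the study's actual corpus.
import nltk
from nltk.translate.gleu_score import sentence_gleu
from nltk.translate.meteor_score import meteor_score
from comet import download_model, load_from_checkpoint

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet for synonym matching

src = "红色文化是中国革命历史的重要组成部分。"                               # source (Chinese)
ref = "Red culture is an important part of China's revolutionary history."  # human reference
hyp = "Red culture is a key component of Chinese revolutionary history."    # machine output

ref_tok, hyp_tok = ref.split(), hyp.split()

# GLEU and METEOR are lexical metrics computed over tokenized strings;
# both accept a list of references for each hypothesis.
gleu = sentence_gleu([ref_tok], hyp_tok)
meteor = meteor_score([ref_tok], hyp_tok)

# COMET is a neural metric that also consumes the source sentence.
# The wmt22-comet-da checkpoint is one commonly used choice, assumed here.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
comet = model.predict([{"src": src, "mt": hyp, "ref": ref}],
                      batch_size=1, gpus=0)

print(f"GLEU:   {gleu:.3f}")
print(f"METEOR: {meteor:.3f}")
print(f"COMET:  {comet.system_score:.3f}")
```

In a full evaluation of this kind, the same loop would run over every segment from each platform, with segment-level scores averaged (or, for COMET, the returned system score used) to compare systems.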
Rivera-Trigueros I, 2022, Machine Translation Systems and Quality Assessment: A Systematic Review. Language Resources and Evaluation, 56(2): 593–619.
Han C, 2020, Translation Quality Assessment: A Critical Methodological Review. The Translator, 26(3): 257–273.
Lauscher S, 2014, Translation Quality Assessment: Where Can Theory and Practice Meet? Evaluation and Translation, Routledge, New York, 149–168.
Salvagno M, Taccone FS, Gerli AG, 2023, Can Artificial Intelligence Help for Scientific Writing? Critical Care, 27(1): 75.
Thorp H, 2023, ChatGPT Is Fun, But Not an Author. Science, 379(6630): 313.
Jiao W, Wang W, Huang JT, et al., 2023, Is ChatGPT a Good Translator? Yes with GPT-4 as the Engine. arXiv. https://doi.org/10.48550/arXiv.2301.08745
Ghassemiazghandi M, 2024, An Evaluation of ChatGPT’s Translation Accuracy Using BLEU Score. Theory and Practice in Language Studies, 14(4): 985–994.
Nemergut M, 2024, Machine Translation Quality Based on TER Analysis from English into Slovak. L10N Journal, 3(2): 60–86.
Wang D, Lin L, Zhao Z, et al., 2023, EvaHan2023: Overview of the First International Ancient Chinese Translation Bakeoff, Proceedings of ALT2023: Ancient Language Translation Workshop, 1–14.
Ali JKM, 2023, Benefits and Challenges of Using ChatGPT: An Exploratory Study on English Language Program. University of Bisha Journal for Humanities, 2(2): 629–641.
Sahari Y, Al-Kadi AMT, Ali JKM, 2023, A Cross-Sectional Study of ChatGPT in Translation: Magnitude of Use, Attitudes, and Uncertainties. Journal of Psycholinguistic Research, 52(6): 2937–2954.
Wang J, Wen Q, 2010, A Review of Automatic Scoring Systems at Home and Abroad and Its Implications for Chinese Students. Foreign Languages, 2010(1): 75–81.
Akhtarshenas A, Dini A, Ayoobi N, 2025, ChatGPT or A Silent Everywhere Helper: A Survey of Large Language Models. arXiv. https://doi.org/10.48550/arXiv.2503.17403
Guo D, Zhu Q, Yang D, et al., 2024, DeepSeek-Coder: When the Large Language Model Meets Programming—The Rise of Code Intelligence. arXiv. https://doi.org/10.48550/arXiv.2401.14196
Lu H, Liu W, Zhang B, et al., 2024, DeepSeek-VL: Towards Real-World Vision-Language Understanding. arXiv. https://doi.org/10.48550/arXiv.2403.05525
Wang A, Singh A, Michael J, et al., 2018, GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv. https://doi.org/10.48550/arXiv.1804.07461
Banerjee S, Lavie A, 2005, METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 65–72.
Rei R, Stewart C, Farinha AC, et al., 2020, COMET: A Neural Framework for MT Evaluation. arXiv. https://doi.org/10.48550/arXiv.2009.09025
Graham Y, Baldwin T, Moffat A, et al., 2013, Continuous Measurement Scales in Human Evaluation of Machine Translation, Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, 33–41.
Snover M, Dorr B, Schwartz R, et al., 2006, A Study of Translation Edit Rate with Targeted Human Annotation, Proceedings of the 7th Conference of the Association for Machine Translation in the Americas (AMTA), Cambridge, 223–231.
Lommel A, Uszkoreit H, Burchardt A, 2014, Multidimensional Quality Metrics (MQM): A Framework for Declaring and Describing Translation Quality Metrics. Revista Tradumàtica: Tecnologies de la Traducció, 12: 455–463.