Natural language processing course: Analysis and comparison of translation errors and biases in LLMs
-
ABOUT:
The purpose of this project is to analyze and compare translation errors and biases in large language models (LLMs). We evaluate how different models handle translation and look for common errors such as mistranslations, omissions, and cultural misinterpretations. We also examine biases that can emerge in translation, with a focus on political bias. By systematically comparing multiple LLMs, we assess translation quality using both automated metrics and human evaluation, with the broader aim of improving the fairness and accuracy of AI-driven translation systems.
-
REQUIREMENTS:
- Python 3.10+
- pip
- Google Colab (recommended for running the code notebooks)
- Account access for:
  - ChatGPT (https://openai.com/chatgpt/overview/)
  - DeepSeek (https://www.deepseek.com)
  - Hugging Face (https://huggingface.co) - for creating a token and downloading the MistralAI model
-
PROJECT FILES:
- data_for_translation/translations.xlsx (file with original source and translated sentences)
- report/code/Mistral.ipynb (code used for translation with MistralAI)
- report/code/COMET.ipynb (code used for COMET evaluation)
-
CRITERIA USED FOR TRANSLATION ANALYSIS:
-
Sentences were compared in terms of:
- lexical fidelity,
- tone shifts (emphasis or neutralization),
- addition or omission of ideological markers.
-
Translation changes were categorized by:
- neutralization (softening emotionally charged words),
- shift (reframing with political implication),
- preservation (faithful to source text),
- no answer (model did not provide a translation),
- incorrect translation (the output is incomprehensible).
-
HOW TO RUN THE PROJECT:
- Step 1: translate the text
  - Load the translation file (data_for_translation/translations.xlsx).
  - Perform the translation with each model: for ChatGPT use https://openai.com/chatgpt/overview/, for DeepSeek use https://www.deepseek.com, and for MistralAI run report/code/Mistral.ipynb in Google Colab (see the first sketch after these steps).
  - Note: to run the MistralAI notebook, you must create an access token on Hugging Face and save it as a secret in Google Colab under the name HF_TOKEN.
- Step 2: evaluate the translations
  - Run report/code/COMET.ipynb in Colab to compute COMET scores (see the second sketch after these steps).
  - Note: if installing unbabel-comet fails because of the numpy version, run the first cell (pip install "numpy<2.0.0"), restart the session, and then rerun the unbabel-comet installation cell.
  - Perform the human evaluation using the criteria listed above.
- Step 3: analyze the results
  - Based on the COMET scores and the human evaluation, draw conclusions about translation quality across models and analyze any bias that appears in the translations (see the third sketch after these steps).
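For reference, a minimal sketch of the MistralAI translation step. The actual code lives in report/code/Mistral.ipynb; the checkpoint name, the "source" column, the prompt, and the output file below are illustrative assumptions, not the notebook's exact contents.

    import pandas as pd
    from google.colab import userdata          # Colab secrets API
    from transformers import pipeline

    hf_token = userdata.get("HF_TOKEN")        # token stored as a Colab secret

    # Load the source sentences (the column name "source" is an assumption).
    df = pd.read_excel("data_for_translation/translations.xlsx")

    generator = pipeline(
        "text-generation",
        model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed checkpoint
        token=hf_token,
        device_map="auto",
    )

    translations = []
    for sentence in df["source"]:
        prompt = f"Translate the following sentence into English:\n{sentence}"
        out = generator(prompt, max_new_tokens=128, return_full_text=False)
        translations.append(out[0]["generated_text"].strip())

    df["mistral_translation"] = translations
    df.to_excel("translations_mistral.xlsx", index=False)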
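Similarly, a minimal sketch of the COMET step using the unbabel-comet package (report/code/COMET.ipynb is the reference implementation; the column names and the checkpoint choice are assumptions).

    import pandas as pd
    from comet import download_model, load_from_checkpoint

    df = pd.read_excel("data_for_translation/translations.xlsx")

    # Download a reference-based COMET checkpoint (wmt22-comet-da is a
    # common default; the notebook may use a different one).
    model_path = download_model("Unbabel/wmt22-comet-da")
    model = load_from_checkpoint(model_path)

    # COMET expects one dict per segment: source, machine translation, reference.
    data = [
        {"src": row["source"], "mt": row["translation"], "ref": row["reference"]}
        for _, row in df.iterrows()
    ]

    output = model.predict(data, batch_size=8, gpus=1)  # gpus=0 on CPU
    print("Segment scores:", output.scores)
    print("System-level score:", output.system_score)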
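Finally, a minimal sketch of the analysis step, assuming the COMET scores and human evaluation labels have been merged into one table (the file name and column names are hypothetical).

    import pandas as pd

    df = pd.read_excel("results.xlsx")  # hypothetical merged results file

    # Average COMET score per model.
    print(df.groupby("model")["comet_score"].mean())

    # Share of neutralization / shift / preservation labels per model.
    print(pd.crosstab(df["model"], df["human_category"], normalize="index"))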
-
TEAM:
Tjaša Nadoh
Urška Roblek
-
REFERENCES:
Navigli, R., Conia, S., & Ross, B. (2023). Biases in Large Language Models: Origins, Inventory, and Discussion. Journal of Data and Information Quality, 15(2), Article 10.
Barclay, P. J., & Sami, A. (2024). Investigating Markers and Drivers of Gender Bias in Machine Translations. arXiv preprint arXiv:2403.11896.