This repository holds code used in writing "CoVoSwitch: Machine Translation of Synthetic Code-Switched Text Based on Intonation Units," to appear in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. Paper was accepted on July 9, 2024 for ACL-SRW 2024 (Aug. 11-16 2024).
CoVoSwitch, the dataset I created in this paper, is available through HuggingFace. It was created using CoVoST 2, which is a speech-to-text translation dataset created in turn from Common Voice Corpus 4.0. (It's a switched dataset, hence the name...!) Details of dataset distribution can be found in the paper, but data points roughly amount to 155k~158k rows for each language once train, validation, and test sets are collated.
8/16/24 Update: Paper has been published in ACL Proceedings!
Link to paper: ACL Anthology.
7/22/24 Update: Paper has been published in arXiv!
Link to paper: arXiv.
Dataset: HuggingFace Dataset
If you'd like to use this code or paper, please cite:
@inproceedings{kang-2024-covoswitch,
title = "{C}o{V}o{S}witch: Machine Translation of Synthetic Code-Switched Text Based on Intonation Units",
author = "Kang, Yeeun",
editor = "Fu, Xiyan and
Fleisig, Eve",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-srw.40",
pages = "469--481",
}
Multilingual code-switching translation (NMT), prosodic speech segmentation (intonation unit detection) using Whisper (STT)
Keywords: code-switching, intonation units, machine translation, low-resource languages, speech-to-text, multilingual
-
Theories: Matrix-Embedded Language Framework (1997), Intonation Unit Boundary Constraint (2023)
-
Speech-to-Text Translation Dataset: CoVoST 2 (2021, specifically the En->X subset), based on Common Voice Corpus 4.0 (2020)
-
Intonation Unit Detection: PSST (2023), based on Whisper (2023)
-
Word-level text-to-text aligner: awesome-align (2021)
-
Multilingual Neural Machine Translation (MNMT) Models: Meta's M2M-100 418M (2021), NLLB-200 600M (2022)
-
Metrics: chrF++ (2017, character level), spBLEU (2022, tokenized language-agnostic metric level), COMET (2022, models human judgments of translations).
-
Phonological Hierarchy: Here is an interesting Wikipedia article on phonological hierarchy. This paper and code deals with
3. Prosodic intonation unit (IU)in theDiscourse analytical hierarchysection. A light Wikipedia introduction tointonation units (IU)can also be found here.
Papers for all of these can be found in the references section of my paper.
Arabic (ar), Catalan (ca), Welsh (cy), German (de), Estonian (et), Persian (fa), Indonesian (id), Latvian (lv), Mongolian (mn), Slovenian (sl), Swedish (sv), Tamil (ta), Turkish (tr)
These languages are from the En->X subset of CoVoST 2. Note that Chinese and Japanese were excluded because they are scriptio continua, i.e. not whitespace-separated.
The code structure mostly matches the subsections of Section 2: Synthetic Data Generation described in the paper.
If you would like an intro to how you would pull CoVoSwitch from HuggingFace Datasets, please refer to the evaluate section.
- Download Common Voice 4.0: Download here into your local workspace.
- Process CoVoST 2: generate_transcripts.py
- Since CoVoST 2 is a speech-to-text translation dataset, we need to process both speech and text in the dataset.
- Usage:
python generate_transcripts.py --subset train- --subset argument must either be
train,validation, ortest
- --subset argument must either be
- Creates a directory with 3 files per language pair:
-
- English transcriptions (text): will be used to check whether PSST generates correct IU boundary-marked transcripts (used to check against transcription error).
-
- Target translations (text): will be used in alignment step
-
- Common Voice filenames used (text): will be used as input to PSST for IU boundary-marked transcripts.
-
- Intermediate step: Move Common Voice files to Google Drive (skip this step if you're using GPU locally)
shell-scripts/tarzip_valid.sh: is an example of how to zip local files. You can subsequently upload them into your remote workspace (like Google Drive for Colab).
- Intonation Unit Detection: detect_IU.py
- With English Common Voice files, we generate English transcripts marked with intonation unit boundaries by PSST.
- Usage:
python detect_IU.py --englishfilename "some English file path" --boundaryfilename "some boundary file path". - Creates intonation unit boundary-marked English transcriptions.
- PSST performs two functionalities, details of which can be found in Table 1 of the paper.
- Transcribing
- marking IU boundaries.
--englishfilename: feed name of text file that stores all English Common Voice 4.0 filenames.--boundaryfilename: feed name of text file that will store all IU-detected transcripts, each separated by a newline.
- Descriptive Statistics: generate_statistics.py
- This code generates descriptive statistics on number of IUs and words for each set. See Table 2 in the paper.
- Alignment Generation: generate_alignments.py
- With English transcripts and non-English transcripts from CoVoST 2, we generate word-level alignments. Note that the English transcripts used here is different from the one generated by PSST in the step above. We need a file that contains each English raw transcript separated by a newline and a file that contains each target language raw transcript separated by a newline.
- Usage:
python generate_alignments.py --subsetfolder "subsetfoldername" - For purposes of this code, we assume the following file organization. Suppose you have Welsh (cy) and Catalan (ca) to analyze.
subsetfolder/ βββ en.txt # File that contains English transcripts from CoVoST2, separated by newline. βββ ca.txt # File that contains Catalan transcripts from CoVoST2, separated by newline. βββ cy.txt # File that contains Welsh transcripts from CoVoST2, separated by newline.
- Intonation Unit Replacement: replace_IU.py
- Now with alignments extracted, we proceed to replace IUs. Specifically, this code does:
-
- create code-switched text
-
- print descriptive statistics (Total # generated, src percentage, tgt percentage, CMI, SPF) on code-switched data
-
- To illustrate the generation process with a simple English-Spanish example (IU and alignment are made up here):
English: Hi, my name is Sophia and I am a student at Yale University. Spanish: Hola, mi nombre es Sophia y soy estudiante de la Universidad de Yale. Word-level Alignment: [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 6), (8, 7), (9, 7), (10, 8), (11, 12), (12, 9), (12, 10), (12, 11)] IU transcript: Hi, <|IU_Boundary|> my name is Sophia <|IU_Boundary|> and I am a student at Yale University. source {IU:token} mapping: {0: [0], 1: [1, 2, 3, 4], 2: [5, 6, 7, 8, 9, 10, 11, 12]} Replace English IU Indices: 0, 2 Code-Switched Text: Hola, my name is Sophia y soy estudiante de la Universidad de Yale.- Run
python replace_IU.py --transcriptionsfolder "some transcriptions folder" --csfolder "some code-switching folder"where--transcriptionsfolderis the folder that stores yourEnglish and non-English transcriptions filesANDalignment files.--csfolderis the new directory that the code will write code-switched data to.
- Now with alignments extracted, we proceed to replace IUs. Specifically, this code does:
- Running Translation Models:
- I ran 4 translation experiments.
csw->En,csw->X,X->EnandEn->X. - M2M-100 418M:
run_M2M100.py - NLLB-200 600M:
run_NLLB200.py
- I ran 4 translation experiments.
- Evaluating Translation Performance:
- There are two baselines, as described in section 3 of the paper.
-
- Raw Code-Switched Inputs
-
- Monolingual Translations
-
- So code in this section calculates NMT evaluation scores for raw code-switched inputs and monolingual baselines, and then calculates deltas that system translations of code-switched texts achieve from these two baselines.
chrf++ evaluation: Can be done bychrf = evaluate.load('chrf')and setting word_order=2 inresults = chrf.compute(predictions=prediction, references=reference, word_order=2). See this HuggingFace post.evaluate_spbleu.py: Evaluation code for using spBLEU. spBLEU is based on the SentencePiece tokenizer and was suggested for language-agnostic MT evaluation. spBLEU can be calculated by using theflores200tokenizer available throughsacrebleu. See the tokenizer integrated into thesacrebleurepository in this file.evaluate_comet.py: Evaluation code for using COMET scores.- COMET model:
Unbabel/wmt22-comet-damodel. - I used the pip package: https://pypi.org/project/unbabel-comet/ and
download_model,load_from_checkpoint, but note that there are alternative ways to evaluating with COMET.
- COMET model:
- There are two baselines, as described in section 3 of the paper.
All code (except for Common Voice transcript generation) were run on Google Colab with a single NVIDIA L4 GPU. Jupyter notebooks have been converted to Python scripts to facilitate future use. I have also included shell scripts I used in this project in the directory shell-scripts. tarzip_valid.sh is a shell script that I used to transfer Common Voice audio files from local to Google Drive after zipping them into tar.
Python, HuggingFace transformers & datasets, Shell scripts
HuggingFace transformers - AutoTokenizer, AutoProcessor, AutoModel, AutoModelForSpeechSeq2Seq
Audio libraries: librosa
Quick Start: pip install -r requirements.txt
Latest version of datasets is not compatible with unbabel-comet==2.0.0, so consider either using older version of datasets and unbabel-comet==2.0.0 or newer version of datasets and newer version of unbabel-comet.
There are 2 main code-switching linguistic theories I discovered while writing this paper.
- Matrix-Embedded Language Framework (MLF): keeps one language as the matrix language and includes segments from an embedded language.
- Equivalence Constraint (EC) Theory: grammatical structure present in one language must also exist in the other - as such, parsers are needed for code-switching text creation.
One difficulty that comes with using the Equivalence Constraint is that it requires the use of constituency parsers. I found two parsers I could potentially use - the Stanford Parser and the Berkeley Neural Parser, but these only support a specific subset of languages. I realized that I could not use this approach to create code-switched text in unstudied languages, but found that the speech domain had datasets covering low-resource languages I wanted to study.
So instead I paid attention to a recent prosodic constraint established in EMNLP 2023.
- Intonation Unit Boundary Constraint: code-switching is more common across intonation units than within as a result of looser syntactic relationships.
Meanwhile, I also found that OpenAI's STT model Whisper can be fine-tuned to detect English prosodic boundaries the PSST model, published in CoNLL 2023. This means that Whisper can segment an English utterance into intonation units.
Following the Matrix-Embedded Language Framework and the Intonation Unit Boundary Constraint, I synthesized code-switching text data by switching languages using intonation units as basis units.
Before I graduate in May 2025, I plan to create a fleursswitch dataset, created from Google's FLEURS. This will also be made available through my HuggingFace profile. I am also experimenting with non-English language pairs, which I am partially presenting at Interspeech YFRSW 2024 this year (Kos, Greece), but this is mainly based on the speech-to-speech modality.
I note that this project was created with an English intonation unit detection model. I'm excited to see future work for IU detection on non-English languages. (Do share!!)
Feedback I received from ACL reviewers on this paper are publicly available on OpenReview.
To those who find my repository or paper interesting, here are some further repositories that I discovered while writing this paper that may be of interest.
- GCM: GCM, which generates code-switched text based on the EC Theory. Benepar & Stanford are both available as parsers and they offer a web interface for one-time generation.
I grew up bilingual, splitting my time in the US & Korea, and my sister and I always talk to each other in a perfect mix of both languages. We have never been able to explain why we use some words in English and others in Korean, but recent research Code-Switching Metrics Using Intonation Units demonstrated that code-switching is more common across intonation units, which I found very interesting. In the future, written and spoken language systems will likely need to handle multilingual inputs, and I believe code-switching research is an interesting starting point to advancing towards them. I tested MNMT models with my synthetic dataset because I had always been interested in multimodal translation systems, but I see many other potential fields of application for future code-switching research.
I'm happy to discuss the project further in any capacity through the email (sophia.kang@yale.edu) on my GitHub profile!