Natural language processing course: Automatic generation of Slovenian traffic news for RTV Slovenija
- Janez TomΕ‘iΔ
- Ε½an PuΕ‘enjak
- Matic ZadobovΕ‘ek
This repository documents our project for the Natural Language Processing course. The task was to develop a system for the automatic generation of Slovenian traffic reports from the provided raw structured data. The end goal was to support RTV Slovenija in replacing their current manual process (students writing reports every 30 minutes) with a solution powered by large language models (LLMs).
We explore a variety of techniques, from prompting to parameter-efficient fine-tuning and retrieval-augmented generation, in order to automatically generate accurate and well-structured traffic news reports in Slovenian.
We also designed custom evaluation pipelines (both manual and automatic) and built a dedicated Streamlit app for effective human evaluation.
For a deeper dive into data cleaning, methodology, modeling choices, and the full results and discussion, please refer to our final project report in the report/ folder.
Below is an overview of the repository layout.
```
.
├── report/
├── environment.yml
├── src/
│   ├── utils/
│   ├── notebooks/
│   │   ├── dataset_preparation.ipynb
│   │   ├── exploration.ipynb
│   │   ├── evaluation.ipynb
│   │   ├── finetuning_example.ipynb
│   │   └── prompting.ipynb
│   ├── evaluation/
│   │   ├── app/
│   │   │   └── app_evaluation.py
│   │   ├── progress/
│   │   └── results/
│   └── arnes_hpc/
│       ├── archive/
│       ├── containers/
│       ├── models/
│       │   └── final_model_9b/
│       ├── scripts/
│       │   ├── finetuning.py
│       │   ├── instructions.txt
│       │   ├── run_base_model.py
│       │   ├── run_finetuned_model.py
│       │   ├── run_instructed_finetuned_model.py
│       │   ├── run_rag_model.py
│       │   ├── run_slurm_finetuning.sh
│       │   ├── run_slurm_base_eval.sh
│       │   ├── run_slurm_finetuned_eval.sh
│       │   ├── run_slurm_instructed_eval.sh
│       │   └── run_slurm_rag_eval.sh
│       └── rag/
│           ├── data_creation/
│           ├── embed.py
│           ├── rag_instructions_embeddings.npy
│           ├── rag_instructions.jsonl
│           ├── rag_roads_embeddings.npy
│           ├── rag_roads.jsonl
│           └── retrieve_example.py
└── README.md
```
- report/ – PDFs of our project write-ups and submission reports.
- environment.yml – Conda environment for local reproducibility (conda env create -f environment.yml).
- src/utils/ – helper functions for parsing, cleaning and structuring raw traffic data.
- src/notebooks/ – interactive Jupyter notebooks:
  - dataset_preparation.ipynb – building and cleaning our dataset
  - exploration.ipynb – data inspection and exploratory data analysis
  - evaluation.ipynb – automatic metrics (SloBERTa score)
  - finetuning_example.ipynb – a toy LoRA fine-tuning run prepared for Colab
  - prompting.ipynb – structured-prompt experiments
- src/evaluation/app/ – a Streamlit app (app_evaluation.py) we built to perform manual rating of test outputs.
- src/evaluation/progress/ – each member's intermediate ratings.
- src/evaluation/results/ – final aggregated outputs for all 4 scenarios (base-instructed, fine-tuned, fine-tuned + instructed, fine-tuned + instructed + RAG) on 500 examples.
- src/arnes_hpc/ – everything needed to run our full experiments on the ARNES HPC cluster:
  - archive/: old / discarded models & scripts
  - containers/: Singularity definition files for building .sif images
  - models/final_model_9b/: the GaMS-9B checkpoint & LoRA adapter
  - scripts/:
    - finetuning.py: QLoRA fine-tuning on H100
    - instructions.txt: base-prompt rules
    - run_*.py & run_slurm_*.sh: launch scripts for each of the 4 experiments
  - rag/: embedding & indexing of road chunks & instruction docs with LaBSE, plus a retrieval demo
The dataset used in this project is not included in this GitHub repository due to its size. However, you can download it from the following link:
Shared dataset folder (OneDrive)
The shared folder contains the original traffic report data as well as our processed training data.
We created a clean, structured JSONL file called train_promet.jsonl, which is used for fine-tuning the language model. Each entry is a JSON object with two keys:
- "prompt" – a system-like input containing the raw structured text
- "response" – the corresponding expected radio-ready traffic report
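For reference, here is a minimal sketch of how such an entry can be loaded and inspected with plain Python; the file name matches our dataset, while the preview logic is purely illustrative:

```python
# Minimal sketch: load train_promet.jsonl and inspect the prompt/response pairs.
import json

with open("train_promet.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

print(f"Number of examples: {len(examples)}")
first = examples[0]
print("Keys:", sorted(first.keys()))            # expected: ['prompt', 'response']
print("Prompt preview:", first["prompt"][:200])
print("Response preview:", first["response"][:200])
```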
Follow the instructions below depending on your setup (local or HPC).
You can start by creating the required environment using Conda. The provided environment.yml file will install all necessary dependencies under the environment name nlp-project. Further below we also provide separate instructions for those who prefer to install the needed dependencies manually.
```bash
conda env create -f environment.yml
conda activate nlp-project
```
All notebooks under src/notebooks/ are designed for local execution, with the exception of finetuning_example.ipynb, which is a simplified demo of LoRA fine-tuning prepared for execution on Colab.
To run the notebooks locally, make sure the following Python packages are installed (they are already included via environment.yml, but listed here if you prefer manual installation):
```bash
pip install beautifulsoup4 matplotlib numpy pandas seaborn striprtf ipython transformers torch scikit-learn
```
To run the manual evaluation Streamlit app in src/evaluation/app/app_evaluation.py, you'll need:
```bash
pip install pandas streamlit
```
You can then launch the application using:
```bash
cd src/evaluation/app
streamlit run app_evaluation.py
```
To run the full training or evaluation jobs on the ARNES HPC cluster:
- Navigate to our project folder:

  ```bash
  cd /d/hpc/projects/onj_fri/trije_konjeniki_apokalipse/
  ```

- Launch jobs using:

  ```bash
  sbatch run_slurm_<type>_eval.sh
  ```
The structure is modular: for each Slurm .sh script there is a corresponding Python script that performs the actual execution:
| SLURM script | Python script |
|---|---|
| run_slurm_base_eval.sh | run_base_model.py |
| run_slurm_finetuned_eval.sh | run_finetuned_model.py |
| run_slurm_instructed_eval.sh | run_instructed_finetuned_model.py |
| run_slurm_rag_eval.sh | run_rag_model.py |
Each of these jobs produces a .txt file containing model outputs for 500 examples. These results are stored and used later for manual and automatic evaluation.
All the files required for running are already available under the same directory on HPC, so the workflow is fully reproducible and requires no extra setup.
We explored four experimental settings to assess the effectiveness of prompt engineering, fine-tuning, and retrieval-augmented generation (RAG).
We used the original cjvt/GaMS-9B-Instruct model with structured prompting. The input format was designed according to our defined rules, and no parameter updates were performed.
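A minimal sketch of this setup, assuming the Hugging Face transformers text-generation pipeline, is shown below; the real prompt rules live in scripts/instructions.txt and the actual inference code is run_base_model.py, so the prompt contents here are placeholders:

```python
# Minimal sketch of the base-instructed setup with the transformers pipeline.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="cjvt/GaMS-9B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Placeholder structured prompt: the rules followed by the raw traffic data.
prompt = (
    "<prompt rules from scripts/instructions.txt>\n\n"
    "<raw structured traffic data for one 30-minute slot>"
)

result = generator(
    [{"role": "user", "content": prompt}],   # chat-style input for the instruct model
    max_new_tokens=256,
    do_sample=False,
)
print(result[0]["generated_text"][-1]["content"])  # the generated report
```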
We performed QLoRA fine-tuning of the cjvt/GaMS-9B-Instruct model using our processed train_promet.jsonl dataset. The dataset was split 80/20 for training and validation.
We used:
- Quantisation: 4-bit NF4 with bfloat16 compute
- LoRA config: r=8, alpha=32, dropout=0.05, targeting the attention modules
- Batching: batch_size=1 with gradient_accumulation=8
- Max length: 512 tokens
- Epochs: 3
- Scheduler: cosine with warmup
- Precision: bfloat16
- Optimizer: AdamW
The adapter and tokenizer were saved to disk for later inference.
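The sketch below shows how this configuration maps onto the transformers / peft / bitsandbytes APIs. It is a simplified illustration rather than the exact contents of scripts/finetuning.py; the output directory, warmup fraction and target module names are assumptions:

```python
# Simplified sketch of the QLoRA configuration described above.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "cjvt/GaMS-9B-Instruct"

# 4-bit NF4 quantisation with bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter on the attention projections (module names assumed)
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="final_model_9b",        # assumed output directory
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,                  # assumed warmup fraction
    bf16=True,
    optim="adamw_torch",
)
# Tokenisation to max length 512 and the Trainer loop are omitted for brevity.
```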
We used the fine-tuned model, but kept the structured prompts to guide generation, essentially combining both approaches.
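Inference in this setting can be sketched as follows, assuming the peft library; the adapter path mirrors models/final_model_9b/, but the exact directory layout is an assumption (see run_instructed_finetuned_model.py for the actual code):

```python
# Minimal sketch: base model + LoRA adapter, guided by the structured prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "cjvt/GaMS-9B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "arnes_hpc/models/final_model_9b")
tokenizer = AutoTokenizer.from_pretrained("arnes_hpc/models/final_model_9b")

prompt = "<prompt rules>\n\n<raw structured traffic data>"   # placeholder
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens (the report itself)
report = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(report)
```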
We enhanced the instructed setup with retrieval-augmented generation using dense LaBSE embeddings. We embedded and indexed road and instruction snippets, retrieved the most relevant ones based on cosine similarity to the input, and added them to the prompt.
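The retrieval step can be sketched as follows, assuming the sentence-transformers LaBSE model and the precomputed embedding files from the rag/ folder; the JSONL field layout and scoring details are simplified (see embed.py and retrieve_example.py for the actual code):

```python
# Minimal sketch of LaBSE-based retrieval over the precomputed road snippets.
import json
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")

# Precomputed snippet embeddings and the snippets themselves
road_embeddings = np.load("rag_roads_embeddings.npy")        # (num_snippets, dim)
with open("rag_roads.jsonl", encoding="utf-8") as f:
    road_snippets = [json.loads(line) for line in f]

def retrieve(query: str, k: int = 3):
    """Return the k snippets most similar to the query under cosine similarity."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    norms = np.linalg.norm(road_embeddings, axis=1) + 1e-12
    scores = (road_embeddings @ q) / norms
    top = np.argsort(-scores)[:k]
    return [road_snippets[i] for i in top]

# The retrieved snippets are then appended to the structured prompt.
```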
We used SloBERTa and cosine similarity to score the model outputs against ground-truth references across all 4 settings on 500 test samples.
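A rough sketch of such a SloBERTa-based precision/recall/F1 score, computed via greedy cosine matching of token embeddings, is shown below; our actual scoring code lives in src/notebooks/evaluation.ipynb and may differ in pooling and matching details, and the EMBEDDIA/sloberta checkpoint is an assumption:

```python
# Sketch of a BERTScore-style metric using SloBERTa token embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta")
model = AutoModel.from_pretrained("EMBEDDIA/sloberta")
model.eval()

def token_embeddings(text: str) -> torch.Tensor:
    """Contextual token embeddings from the last hidden layer, L2-normalised."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return torch.nn.functional.normalize(hidden, dim=-1)

def sloberta_score(candidate: str, reference: str):
    """Precision/recall/F1 from greedy cosine matching of token embeddings."""
    cand, ref = token_embeddings(candidate), token_embeddings(reference)
    sim = cand @ ref.T                        # pairwise cosine similarities
    precision = sim.max(dim=1).values.mean().item()
    recall = sim.max(dim=0).values.mean().item()
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```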
To get a better sense of model quality, we designed a Streamlit-based web app (app_evaluation.py) that allowed all three of us to independently rank the outputs of all 4 scenarios for each example.
We followed this process:
- Pre-evaluation calibration – we manually rated 10 calibration examples and compared our ranking differences to improve rating agreement. This helped us normalize our evaluation criteria and better understand nuances in the generated outputs.
- Final evaluation – we then independently rated 30 examples each, across all four model variants.

For each example and scenario, we:

- gave a rating from 1 to 5, assessing the overall usefulness and clarity;
- compared each output directly to the ground-truth report, noting whether the generated output was better or worse.
The final scores are computed as a global average across all three of us for both criteria.
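For intuition, here is a heavily simplified sketch of the kind of rating loop the app implements; file names and column names are hypothetical, and the real app is src/evaluation/app/app_evaluation.py:

```python
# Simplified sketch of a Streamlit rating interface for model outputs.
import pandas as pd
import streamlit as st

examples = pd.read_csv("outputs_to_rate.csv")   # hypothetical file with model outputs
idx = st.number_input("Example", min_value=0, max_value=len(examples) - 1, step=1)
row = examples.iloc[int(idx)]

st.subheader("Ground truth")
st.write(row["ground_truth"])
st.subheader("Model output")
st.write(row["model_output"])

rating = st.slider("Rating (1-5)", 1, 5, 3)
better = st.checkbox("Better than ground truth")
if st.button("Save"):
    pd.DataFrame([{"example": int(idx), "rating": rating, "better": better}]).to_csv(
        "my_ratings.csv", mode="a", header=False, index=False
    )
    st.success("Saved")
```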
The Streamlit app used for manual evaluation:
Results from both evaluations are summarized in the next section.
We evaluated all four experimental setups using two types of evaluation:
- Automatic evaluation using SloBERTa + cosine similarity on 500 test examples.
- Manual evaluation of 30 test examples.
| Model | Precision | Recall | F1-score | Length difference (in words) |
|---|---|---|---|---|
| Base instructed | 0.608 ± 0.004 | 0.683 ± 0.004 | 0.643 ± 0.004 | 1.904 ± 0.042 |
| Fine-tuned | 0.774 ± 0.003 | 0.753 ± 0.004 | 0.762 ± 0.003 | 0.818 ± 0.022 |
| Fine-tuned and instructed | 0.817 ± 0.003 | 0.752 ± 0.004 | 0.781 ± 0.003 | 0.732 ± 0.017 |
| Fine-tuned and instructed + RAG | 0.815 ± 0.003 | 0.752 ± 0.004 | 0.779 ± 0.003 | 0.752 ± 0.018 |
| Scenario | Avg. rating | % outputs better than ground truth |
|---|---|---|
| Base instructed | 1.97 ± 0.10 | 6.7% ± 2.6% |
| Fine-tuned | 3.01 ± 0.13 | 22.2% ± 4.4% |
| Fine-tuned + instructed | 3.20 ± 0.11 | 28.9% ± 4.8% |
| Fine-tuned + instructed + RAG | 3.07 ± 0.12 | 25.6% ± 4.6% |
The histogram below shows the distribution of F1 scores across the 500 test examples, highlighting performance spread per method.