Natural language processing course: Automatic generation of Slovenian traffic news for RTV Slovenija


Team members

  • Janez Tomšič
  • Žan Pušenjak
  • Matic Zadobovšek

📑 Table of contents

  • 📖 Introduction
  • 📂 Project structure
  • 📊 Data
  • 🔁 Reproducibility
  • 🧪 Experiments
  • ✅ Evaluation
  • 📊 Results

📖 Introduction

This repository documents our project for the Natural Language Processing course. The task was to develop a system for the automatic generation of Slovenian traffic reports based on the provided raw structured data. The end goal was to support RTV Slovenija in replacing their current manual process (students writing reports every 30 minutes) with a solution powered by large language models (LLMs).

We explore a variety of techniques, from prompting to parameter-efficient fine-tuning and retrieval-augmented generation, in order to automatically generate accurate and well-structured traffic news reports in Slovenian.

We also designed custom evaluation pipelines (both manual and automatic) and built a dedicated Streamlit app for effective human evaluation.

πŸ” For a deeper dive into data cleaning, methodology, modeling choices, and full results/discussion β€” please refer to our final project report in the report/ folder.


📂 Project structure

Below is an overview of the repository layout.

.
├── report/
├── environment.yml
├── src/
│   ├── utils/
│   ├── notebooks/
│   │   ├── dataset_preparation.ipynb
│   │   ├── exploration.ipynb
│   │   ├── evaluation.ipynb
│   │   ├── finetuning_example.ipynb
│   │   └── prompting.ipynb
│   ├── evaluation/
│   │   ├── app/
│   │   │   └── app_evaluation.py
│   │   ├── progress/
│   │   └── results/
│   └── arnes_hpc/
│       ├── archive/
│       ├── containers/
│       ├── models/
│       │   └── final_model_9b/
│       └── scripts/
│           ├── finetuning.py
│           ├── instructions.txt
│           ├── run_base_model.py
│           ├── run_finetuned_model.py
│           ├── run_instructed_finetuned_model.py
│           ├── run_rag_model.py
│           ├── run_slurm_finetuning.sh
│           ├── run_slurm_base_eval.sh
│           ├── run_slurm_finetuned_eval.sh
│           ├── run_slurm_instructed_eval.sh
│           ├── run_slurm_rag_eval.sh
│           └── rag/
│               ├── data_creation/
│               ├── embed.py
│               ├── rag_instructions_embeddings.npy
│               ├── rag_instructions.jsonl
│               ├── rag_roads_embeddings.npy
│               ├── rag_roads.jsonl
│               └── retrieve_example.py
└── README.md

πŸ“ report/

PDFs of our project write-ups and submission reports.


📄 environment.yml

Conda environment for local reproducibility (conda env create -f environment.yml).


📂 src/utils/

Helper functions for parsing, cleaning and structuring raw traffic data.


📂 src/notebooks/

Interactive Jupyter notebooks for:

  • dataset_preparation.ipynb – building and cleaning our dataset
  • exploration.ipynb – data inspection and exploratory data analysis
  • evaluation.ipynb – automatic metrics (SloBERTa score)
  • finetuning_example.ipynb – a toy LoRA run intended for Colab
  • prompting.ipynb – structured-prompt experiments

📂 src/evaluation/

app/

A Streamlit app (app_evaluation.py) we built to perform manual rating of test outputs.

progress/

Each member's intermediate ratings.

results/

Final aggregated outputs for all 4 scenarios (base-instructed, fine-tuned, fine-tuned + instructed, fine-tuned + instructed + RAG) on 500 examples.


📂 src/arnes_hpc/

Everything needed to run our full experiments on the ARNES HPC cluster:

  • archive/: old / discarded models & scripts
  • containers/: Singularity definition files for building .sif images
  • models/final_model_9b/: the GaMS-9B checkpoint & LoRA adapter
  • scripts/:
    • finetuning.py: QLoRA fine-tuning on H100
    • instructions.txt: base-prompt rules
    • run_*.py & run_slurm_*.sh: launch scripts for each of the 4 experiments
    • rag/: scripts to embed and index road chunks and instruction docs with LaBSE, plus a retrieval demo

📊 Data

The dataset used in this project is not included in this GitHub repository due to its size. However, you can download it from the following link:

🔗 Shared dataset folder (OneDrive)

The shared folder contains the original traffic report data as well as our processed training data.

We created a clean, structured JSONL file called train_promet.jsonl, which is used for fine-tuning the language model. Each entry is a JSON object with two keys:

  • "prompt" β€” a system-like input containing raw structured text
  • "response" β€” the corresponding expected radio-ready traffic report

πŸ” Reproducibility

Follow the instructions below depending on your setup (local or HPC):


🐍 Local setup with Conda

You can start by creating the required environment using Conda. The provided environment.yml file will install all necessary dependencies under the environment name nlp-project. Further below we also provide separate instructions for those who prefer to install the needed dependencies manually.

conda env create -f environment.yml
conda activate nlp-project

📓 Notebooks

All notebooks under src/notebooks/ are designed for local execution, with the exception of finetuning_example.ipynb, which is a simplified demo of LoRA fine-tuning prepared for execution on Colab.

To run the notebooks locally, make sure the following Python packages are installed (they are already included via environment.yml; use the command below if you prefer a manual installation):

pip install beautifulsoup4 matplotlib numpy pandas seaborn striprtf ipython transformers torch scikit-learn

🌐 Streamlit app

To run the manual evaluation Streamlit app in src/evaluation/app/app_evaluation.py, you'll need:

pip install pandas streamlit

You can then launch the application using:

cd src/evaluation/app
streamlit run app_evaluation.py
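
For context, a toy sketch of how such a rating interface can be built with Streamlit (purely illustrative; not the actual app_evaluation.py):

# Toy rating interface sketch; purely illustrative, not the project's app_evaluation.py.
import pandas as pd
import streamlit as st

outputs = pd.DataFrame({
    "scenario": ["base-instructed", "fine-tuned"],
    "generated": ["<model output A>", "<model output B>"],
})

st.title("Manual evaluation")
for i, row in outputs.iterrows():
    st.subheader(row["scenario"])
    st.write(row["generated"])
    st.slider("Rating (1-5)", 1, 5, 3, key=f"rating_{i}")
    st.checkbox("Better than ground truth", key=f"better_{i}")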

💻 ARNES HPC cluster setup

To run the full training or evaluation jobs on the ARNES HPC cluster:

  1. Navigate to our project folder:
cd /d/hpc/projects/onj_fri/trije_konjeniki_apokalipse/
  2. Launch jobs using:
sbatch run_slurm_<type>_eval.sh

The structure is modular – for each .sh (Slurm script) there's a corresponding Python script that performs the actual execution:

SLURM script                     Python script
run_slurm_base_eval.sh           run_base_model.py
run_slurm_finetuned_eval.sh      run_finetuned_model.py
run_slurm_instructed_eval.sh     run_instructed_finetuned_model.py
run_slurm_rag_eval.sh            run_rag_model.py

Each of these jobs produces a .txt file containing model outputs for 500 examples. These results are stored and used later for manual and automatic evaluation.

All the files required for running are already available under the same directory on HPC, so the workflow is fully reproducible and requires no extra setup.


🧪 Experiments

We explored four experimental settings to assess the effectiveness of prompt engineering, fine-tuning, and retrieval-augmented generation (RAG).


1️⃣ Base-instructed (prompting only)

We used the original cjvt/GaMS-9B-Instruct model with structured prompting. The input was formatted according to our defined rules, and no parameter updates were performed.
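
A rough sketch of this setting is shown below (the prompt construction, file path, and generation settings are assumptions, and we assume the model ships a chat template; see run_base_model.py for the actual code):

# Illustrative prompting sketch; not the exact contents of run_base_model.py.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cjvt/GaMS-9B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

rules = open("instructions.txt", encoding="utf-8").read()  # structured-prompt rules
raw_record = "..."  # one raw structured traffic record (placeholder)

messages = [{"role": "user", "content": f"{rules}\n\n{raw_record}"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))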


2️⃣ Fine-tuned

We performed QLoRA fine-tuning of the cjvt/GaMS-9B-Instruct model using our processed train_promet.jsonl dataset. The dataset was split 80/20 for training and validation.

We used:

  • Quantisation: 4-bit NF4 with bfloat16 compute
  • LoRA config: r=8, alpha=32, dropout=0.05, targeting attention modules
  • Batching: batch_size=1 with gradient_accumulation=8
  • Max length: 512 tokens
  • Epochs: 3
  • Scheduler: cosine with warmup
  • Precision: bfloat16
  • Optimizer: AdamW

The adapter and tokenizer were saved to disk for later inference.
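
A configuration sketch matching the hyperparameters above (the target module names, output path, and warmup amount are assumptions; finetuning.py contains the actual setup):

# QLoRA configuration sketch; module names, paths and warmup are assumptions.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "cjvt/GaMS-9B-Instruct"

bnb_config = BitsAndBytesConfig(              # 4-bit NF4 with bfloat16 compute
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(                     # LoRA on the attention projections
    r=8, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="final_model_9b",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,                        # warmup amount is an assumption
    bf16=True,
    optim="adamw_torch",
)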


3️⃣ Fine-tuned + instructed

We used the fine-tuned model, but kept the structured prompts to guide generation, essentially combining both approaches.


4️⃣ Fine-tuned + instructed + RAG

We enhanced the instructed setup with retrieval-augmented generation using dense LaBSE embeddings. We embedded and indexed road and instruction snippets, retrieved the most relevant ones based on cosine similarity to the input, and added them to the prompt.
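
A minimal retrieval sketch, assuming sentence-transformers and the precomputed files under src/arnes_hpc/scripts/rag/ (see embed.py and retrieve_example.py for the actual code):

# Minimal retrieval sketch; illustrative only, the exact logic lives in retrieve_example.py.
import json
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

# Precomputed road-chunk embeddings and the snippets they were computed from
road_embeddings = np.load("rag_roads_embeddings.npy")
with open("rag_roads.jsonl", encoding="utf-8") as f:
    road_chunks = [json.loads(line) for line in f]

def retrieve(query: str, k: int = 3):
    """Return the k road snippets most similar to the query (cosine similarity)."""
    q = model.encode([query], normalize_embeddings=True)[0]
    emb = road_embeddings / np.linalg.norm(road_embeddings, axis=1, keepdims=True)
    scores = emb @ q
    return [road_chunks[i] for i in np.argsort(-scores)[:k]]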


✅ Evaluation

📈 Automatic evaluation

We used SloBERTa and cosine similarity to score the model outputs against ground-truth references across all 4 settings on 500 test samples.
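
One way to reproduce this style of scoring is a BERTScore-style computation with a SloBERTa checkpoint (the model name and layer choice below are assumptions; evaluation.ipynb holds the actual implementation):

# BERTScore-style sketch with SloBERTa; checkpoint name and layer are assumptions.
from bert_score import score

candidates = ["<generated traffic report>"]
references = ["<ground-truth traffic report>"]

P, R, F1 = score(
    candidates,
    references,
    model_type="EMBEDDIA/sloberta",  # assumed SloBERTa checkpoint
    num_layers=9,                    # layer choice is an assumption
)
print(P.mean().item(), R.mean().item(), F1.mean().item())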

👥 Manual evaluation

To get a better sense of model quality, we designed a Streamlit-based web app (app_evaluation.py) that allows all three of us to independently rank the outputs of all 4 scenarios for each example.

We followed this process:

  1. Pre-evaluation calibration – we manually rated 10 calibration examples and compared our ranking differences to improve rating agreement. This helped us normalize our evaluation criteria and better understand nuances in generated outputs.

  2. Final evaluation – we then independently rated 30 examples each, across all four model variants.

For each example and scenario, we:

  • Gave a rating from 1 to 5, assessing the overall usefulness and clarity.

  • Compared each output directly to the ground-truth report, noting whether the generated output was better or worse.

The final scores are computed as a global average across all three of us for both criteria.

πŸ–ΌοΈ The Streamlit app used for manual evaluation:

Manual evaluation app

Results from both evaluations are summarized in the next section.


📊 Results

We evaluated all four experimental setups using two types of evaluation:

  • Automatic evaluation using SloBERTa + cosine similarity on 500 test examples.
  • Manual evaluation of 30 test examples.

🔬 Automatic evaluation (SloBERTa cosine similarity)

Model                            Precision        Recall           F1-score         Length difference (in words)
Base instructed                  0.608 ± 0.004    0.683 ± 0.004    0.643 ± 0.004    1.904 ± 0.042
Fine-tuned                       0.774 ± 0.003    0.753 ± 0.004    0.762 ± 0.003    0.818 ± 0.022
Fine-tuned and instructed        0.817 ± 0.003    0.752 ± 0.004    0.781 ± 0.003    0.732 ± 0.017
Fine-tuned and instructed + RAG  0.815 ± 0.003    0.752 ± 0.004    0.779 ± 0.003    0.752 ± 0.018

👥 Manual evaluation

Scenario                         Avg. rating    % outputs better than ground truth
Base instructed                  1.97 ± 0.10    6.7% ± 2.6%
Fine-tuned                       3.01 ± 0.13    22.2% ± 4.4%
Fine-tuned + instructed          3.20 ± 0.11    28.9% ± 4.8%
Fine-tuned + instructed + RAG    3.07 ± 0.12    25.6% ± 4.6%

📈 F1 score distribution

The histogram below shows the distribution of F1 scores across the 500 test examples, highlighting performance spread per method.

[Figure: F1 score distribution per method]

