🛣️ NLP Project: Automatic Generation of Slovenian Traffic News for RTV Slovenija

This repository contains the full implementation of an NLP project developed for the Natural Language Processing course. The objective was to automatically generate short Slovenian traffic reports for RTV Slovenija using prompt engineering and fine-tuning of a large language model.

📁 Project Structure

report/report.pdf – Final project report
data/ – Raw and processed data used for training, evaluation, and testing
evalvacija/ – Contains generated predictions and references for evaluation
extract_data.py – Converts .rtf reference files to .txt
rouge.py – Computes ROUGE scores to evaluate generated outputs
ara_2.py – Prompt-based report generation script (old)
prompts_and_responses.py – Generates reports using prompt engineering with relevant input data (old)

📝 Prompt Engineering Scripts

model_download.py – Downloads the GaMS 9B model locally (requires 35 GB of disk space)
prompt_input_preparation.py – Fetches the inputs required for prompt engineering
prompt_engineering.py – Runs the LLM with a prompt and writes the output to prompt_engineering_results/

📦 Fine-tuning Scripts

prepare_txt_files_ft.py – Converts all reference reports (year 2024) into .txt format
fine_tunning_data_collection.py – Collects training/validation data for fine-tuning
- Outputs: train_4.jsonl, val_4.jsonl
fine_tuning.py – Fine-tunes the LLM on the above dataset
llm_finetune_9/ – Folder of the fine-tuned model
evaluation_data_collection.py – Collects evaluation data from the reference reports from folder 2023/
generate_fintuned_results.py – Generates reports using the fine-tuned model on many examples
generate_finetuned.py – Generates a report on a single example with the fine-tuned model

⚙️ Setup Instructions

✅ 1. Dependencies

Install the required packages:

pip install -r requirements.txt

🔽 2. Download the Data

Download the dataset from this link and extract the contents of the RTVSlo/ folder into the data/ folder.

📑 3. Convert .RTF Files to .TXT

Run the following to convert reference .rtf reports into .txt format:

python extract_data.py

The converted files are written to a new folder, data/txts.
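For orientation, the conversion step can be pictured roughly as follows. This is a minimal sketch and not the contents of extract_data.py: the function names, the regex-based stripping, and the cp1250 code page are all assumptions made for illustration. The regex approach only copes with very simple RTF; a dedicated RTF parser would be more robust for real reports.

```python
import re
from pathlib import Path


def rtf_to_plain_text(rtf: str) -> str:
    """Crude RTF-to-text stand-in (hypothetical helper, simple documents only)."""
    # Decode \'hh hex escapes; cp1250 is an assumption for Slovenian text.
    text = re.sub(
        r"\\'([0-9a-fA-F]{2})",
        lambda m: bytes.fromhex(m.group(1)).decode("cp1250", errors="replace"),
        rtf,
    )
    # Paragraph and line breaks become newlines.
    text = re.sub(r"\\(par|line)\b ?", "\n", text)
    # Drop remaining control words such as \rtf1 or \fs24, plus one trailing space.
    text = re.sub(r"\\[a-zA-Z]+-?\d* ?", "", text)
    # Drop group braces.
    text = re.sub(r"[{}]", "", text)
    return text.strip()


def convert_folder(src_dir: str, dst_dir: str) -> int:
    """Convert every .rtf file under src_dir into a .txt file in dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for rtf_path in sorted(Path(src_dir).rglob("*.rtf")):
        # latin-1 maps every byte to a character, so reading never fails.
        raw = rtf_path.read_text(encoding="latin-1")
        out_path = dst / f"{rtf_path.stem}.txt"
        out_path.write_text(rtf_to_plain_text(raw), encoding="utf-8")
        count += 1
    return count
```

Calling convert_folder("data", "data/txts") would mirror the layout described above, assuming the .rtf files sit somewhere under data/.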

📊 4. Generate Reports

4.1 Generating Reports with Prompt Engineering

Download the model:

python model_download.py

Run the prompt engineering script on the 2023 dataset. The Excel file data/Podatki - PrometnoPorocilo_2022_2023_2024.xlsx must be present:

python prompt_engineering.py

Outputs are saved in prompt_engineering_results/.
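To make the idea of prompt-based generation concrete, the sketch below shows one plausible way to select input rows and assemble them into a prompt. It is an illustration only: the row keys ("time", "text"), the 30-minute window, the function names, and the instruction wording are hypothetical and are not taken from prompt_input_preparation.py or prompt_engineering.py.

```python
from datetime import datetime, timedelta


def select_recent(rows, report_time, window_minutes=30):
    """Keep rows whose timestamp falls in the window ending at report_time."""
    start = report_time - timedelta(minutes=window_minutes)
    return [r for r in rows if start <= r["time"] <= report_time]


def build_prompt(rows, report_time):
    """Assemble an instruction and the selected traffic events into one prompt."""
    events = "\n".join(f"- {r['text']}" for r in rows) or "- (no events)"
    return (
        "Write a short Slovenian traffic report for radio.\n"
        f"Report time: {report_time:%Y-%m-%d %H:%M}\n"
        f"Traffic events:\n{events}\n"
        "Report:"
    )
```

The resulting string would then be passed to the model. Keeping selection and formatting in separate functions makes it easier to try different prompt wordings on the same inputs.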

4.2 Generating Reports with Fine-tuning

The fine-tuned model can generate a report for a single input or for a batch of examples.

Single example generation:

Edit the data variable in generate_finetuned.py with your input.
Run the script:

python generate_finetuned.py

The report is printed to the console, or written to the job's log file when running on SLURM.

Multiple examples generation:

Prepare the evaluation data:

python evaluation_data_collection.py

This prepares the evaluation data from the reference reports in the 2023/ folder.

Generate reports:

python generate_fintuned_results.py

Outputs are saved in finetuned_results/.

🔧 5. Fine-tuning the Model

Prepare .txt reference files:

python prepare_txt_files_ft.py

Collect data for training:

python fine_tunning_data_collection.py

This creates train_4.jsonl and val_4.jsonl.

Fine-tune the model:

python fine_tuning.py

The trained model will be saved in llm_finetune_9/.
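The data collection step produces two JSONL files, one JSON object per line. A minimal sketch of such a split is shown below. The record keys ("prompt", "completion"), the 10% validation share, and the fixed seed are assumptions for illustration; the actual schema used by fine_tunning_data_collection.py may differ.

```python
import json
import random
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line, keeping Slovenian characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def split_and_save(pairs, out_dir=".", val_fraction=0.1, seed=42):
    """Shuffle (input, report) pairs and save a train/validation split."""
    # The key names below are hypothetical placeholders.
    records = [{"prompt": p, "completion": c} for p, c in pairs]
    random.Random(seed).shuffle(records)  # seeded for a reproducible split
    n_val = max(1, int(len(records) * val_fraction))
    val, train = records[:n_val], records[n_val:]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    write_jsonl(out / "train_4.jsonl", train)
    write_jsonl(out / "val_4.jsonl", val)
    return len(train), len(val)
```

Shuffling before the split avoids a validation set made up only of the most recent reports, although a time-based split is a reasonable alternative if the goal is to test on unseen dates.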

📊 Evaluation

The rouge.py script computes ROUGE scores between the files in evalvacija/predictions/ and evalvacija/references/. Before running the evaluation, make sure the generated reports and their references are placed in these folders.

Run the evaluation:

python rouge.py
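As a reminder of what the reported numbers mean, here is a small from-scratch sketch of ROUGE-N and ROUGE-L F1 on whitespace tokens. It is not the implementation in rouge.py, which may rely on a library and apply different tokenization or stemming, so the scores are not guaranteed to match.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def f1(overlap, n_pred, n_ref):
    """Harmonic mean of precision (overlap/n_pred) and recall (overlap/n_ref)."""
    if overlap == 0 or n_pred == 0 or n_ref == 0:
        return 0.0
    p, r = overlap / n_pred, overlap / n_ref
    return 2 * p * r / (p + r)


def rouge_n(prediction, reference, n=1):
    """ROUGE-N F1: clipped n-gram overlap between prediction and reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(tokenize(prediction)), ngrams(tokenize(reference))
    overlap = sum((pred & ref).values())  # Counter & keeps the minimum counts
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, using one row of DP state."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    pred, ref = tokenize(prediction), tokenize(reference)
    return f1(lcs_length(pred, ref), len(pred), len(ref))
```

For example, comparing "zastoj na primorski avtocesti" with "zastoj na štajerski avtocesti proti mariboru" shares three of four predicted unigrams and three of six reference unigrams, so precision is 0.75, recall is 0.5, and ROUGE-1 F1 is 0.6.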

🖥️ Running on SLURM

To run a script on a SLURM-based cluster, set the script you want to run inside run.sh and raise the time limit if needed. Then submit the job with:

sbatch run.sh

You can check the status of your job with:

squeue --me

Contact

For any questions or issues, please contact:

UL-FRI-NLP-Course/ul-fri-nlp-course-project-2024-2025-papagaji
