UL-FRI-NLP-Course/ul-fri-nlp-course-project-2024-2025-brainstorm

Natural language processing course: Automatic Generation of Slovenian Traffic News for RTV Slovenija

This project develops an automated system for generating Slovenian-language radio traffic reports for RTV Slovenija using Large Language Models (LLMs). The system transforms raw traffic event data into stylistically consistent and factually accurate reports suitable for radio broadcast.

Project Overview

The goal is to automate the manual process currently used by RTV Slovenija, where students check and type traffic reports every 30 minutes. The system uses fine-tuned LLMs with prompt engineering to generate short, radio-appropriate traffic reports from Excel data provided by the promet.si portal.

Key Features

  • Multi-stage Data Preprocessing: Pipeline that cleans the raw traffic data and aligns it with the compiled RTV reports
  • Fine-tuned GaMS-9B Model: Slovenian-specific model optimized for traffic report generation using LoRA (Low-Rank Adaptation)
  • HPC Cluster Integration: Distributed training and inference on high-performance computing infrastructure
  • Comprehensive Evaluation: Both quantitative metrics (BLEU, ROUGE, METEOR) and LLM-based heuristic assessments
  • Rule-based Post-processing: Automated corrections for grammar, terminology, and style consistency

Project Structure

├── data_preprocessing/          # Data cleaning and preparation scripts
│   ├── data_cleanup.py         # Main data cleaning utilities
│   ├── make_test_data.py       # Test dataset creation
│   ├── primerjava_w_sentences_BERT_w_QA.py  # BERT-based semantic matching
│   └── LLM_porocilo_extraction.py  # Event extraction using Gemini
├── FineTune/                   # Fine-tuning implementation
│   ├── finetune_GaMS_9B.py     # Main fine-tuning script
│   ├── params_GaMS_9B.py       # Configuration and hyperparameters
│   ├── data_loader.py          # Data loading utilities
│   ├── fine_tune.sh            # SLURM batch script for training
│   └── Instructions.ipynb      # Detailed setup instructions
├── Generate_Report/            # Report generation and testing
│   ├── report_generation.py    # Main generation script
│   └── run_rep_gen.sh          # SLURM batch script for testing
├── final_report_generation/    # Final optimized generation pipeline
├── gemini-api/                 # Gemini API integration for prompting
├── HPC/                        # HPC setup and configuration
│   ├── HowToHPC.ipynb          # HPC usage instructions
│   └── LLama_17B/              # Large model experiments
├── evaluation_of_generated/    # Evaluation scripts and metrics
├── report/                     # Academic report (LaTeX)
├── previous_tesing/            # Archive of early experiments and tests
└── Report_Generation/          # Additional generation utilities

Methodology

1. Data Preprocessing

  • Event Extraction: Used Gemini 2.0 Flash together with a heuristic approach to extract structured events from RTV reports
  • Semantic Matching: Embedded traffic event data with a Slovenian BERT model (rokn/slovlo-v1) and matched input events to report events by the cosine similarity of their embeddings
  • Data Alignment: Created input-output pairs for training by aggregating relevant traffic events
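
The semantic matching step can be illustrated with a minimal sketch. It assumes the embeddings have already been computed (in the project they would come from rokn/slovlo-v1; here they are toy 3-dimensional vectors), and the function names and the 0.7 threshold are hypothetical choices for illustration, not values taken from the repository.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

def match_events(report_emb: np.ndarray, input_emb: np.ndarray, threshold: float = 0.7):
    """For each report event, return the index of the most similar input
    event, or None when the best similarity is below the threshold."""
    sims = cosine_similarity_matrix(report_emb, input_emb)
    best = sims.argmax(axis=1)
    return [int(j) if sims[i, j] >= threshold else None for i, j in enumerate(best)]

# Toy embeddings standing in for real model output (assumption).
report_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
input_emb = np.array([[0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.9]])

matches = match_events(report_emb, input_emb)
print(matches)  # [0, 2]
```

Each matched index pairs a report event with the raw input event it most likely describes; unmatched report events come back as None and can be dropped or reviewed manually.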

2. Model Selection and Fine-tuning

  • Base Model: GaMS-9B-Instruct (Slovenian-specific LLM)
  • Fine-tuning Method: Parameter-Efficient Fine-Tuning (PEFT) using LoRA
  • Training Environment: HPC cluster with GPU acceleration
  • Optimization: Custom training loop with gradient accumulation and learning rate scheduling
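
To show why LoRA is parameter-efficient, the sketch below reproduces the core idea in NumPy: the pretrained weight W stays frozen and only a low-rank update B·A, scaled by alpha/r, is trained. The layer sizes, rank, and alpha are illustrative assumptions and are not the values in FineTune/params_GaMS_9B.py.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumption), not the project's configuration.
d_out, d_in, r, alpha = 64, 64, 4, 8

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init: update starts at 0

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Frozen linear layer plus the scaled low-rank update (alpha / r) * B @ A."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
full_params = W.size              # parameters a full fine-tune would update
lora_params = A.size + B.size     # parameters LoRA actually trains

print(full_params, lora_params)   # 4096 512
```

Because B starts at zero, the adapted layer initially behaves exactly like the base model, and training only has to move the much smaller A and B matrices.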

3. Prompt Engineering

  • Initial Exploration: Tested various models (Gemini, Mistral-AI, Llama, DeepSeek)
  • Prompt Evolution: From detailed API prompts to concise, tailored fine-tuning prompts
  • Style Guidelines: Incorporated RTV Slovenija's formatting and style requirements

4. Evaluation Framework

  • Quantitative Metrics: BLEU, ROUGE-1/2/L, METEOR scores
  • Qualitative Assessment: Heuristic LLM-based evaluation using Gemini 2.0 Flash
  • Parameter Optimization: Tested multiple generation configurations (temperature, repetition penalty)

Key Results

The fine-tuned GaMS-9B model performed best with the following generation settings and scores:

  • Temperature: 0.4
  • Repetition Penalty: 1.2
  • ROUGE-L F1 Score: 0.593
  • High qualitative ratings for factual accuracy and stylistic appropriateness
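
As a rough illustration of what the two generation settings do, the sketch below applies a repetition penalty and then temperature scaling to a toy logit vector. It follows the commonly used rule of dividing positive logits and multiplying negative logits of already generated tokens by the penalty; the function names, the toy logits, and the order of operations are assumptions for illustration, not the project's generation code.

```python
import numpy as np

def adjust_logits(logits, generated_ids, temperature=0.4, repetition_penalty=1.2):
    """Penalize tokens that were already generated, then apply temperature."""
    logits = np.asarray(logits, dtype=float).copy()
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty   # shrink positive scores
        else:
            logits[tok] *= repetition_penalty   # push negative scores further down
    return logits / temperature                 # temperature < 1 sharpens the distribution

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy vocabulary of three tokens; token 0 has already been generated.
logits = [2.0, 1.0, -1.0]
probs = softmax(adjust_logits(logits, generated_ids=[0]))
print(probs)
```

A temperature of 0.4 concentrates probability on the highest-scoring tokens, while the 1.2 penalty lowers the score of tokens that have already appeared, which discourages repeated phrases in the report.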

Usage

Prerequisites

  • Python 3.8+
  • PyTorch with CUDA support
  • Transformers library
  • Access to HPC cluster (for training)

Quick Start

  1. Data Preparation:

    cd data_preprocessing
    python data_cleanup.py
    python make_test_data.py
  2. Fine-tuning (on HPC):

    cd FineTune
    sbatch fine_tune.sh
  3. Generate Reports:

    cd Generate_Report
    sbatch run_rep_gen.sh

Configuration

Model and training parameters can be adjusted in FineTune/params_GaMS_9B.py.

Evaluation

The system is evaluated using:

  • Lexical Similarity: BLEU, ROUGE, METEOR metrics against reference reports
  • Content Quality: Factual correctness, location accuracy, event type identification
  • Style Assessment: Conciseness, grammatical correctness, radio suitability
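
For readers unfamiliar with ROUGE-L, the following minimal sketch computes the F1 variant from the longest common subsequence (LCS) of whitespace tokens. It is a simplified stand-in: the evaluation scripts in the repository may use a library implementation with different tokenization, and the example sentences are made up for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 on lowercased whitespace tokens (no stemming)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference and generated sentences.
reference = "zastoj na primorski avtocesti proti ljubljani"
candidate = "zastoj na avtocesti proti ljubljani"
score = rouge_l_f1(reference, candidate)
print(round(score, 3))  # 0.909
```

Here the candidate keeps five of the six reference tokens in order, so precision is 1.0, recall is 5/6, and the F1 score is 10/11.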

Future Work

  • Model Refinement: Explore advanced fine-tuning techniques and parameter optimization
  • Content Enhancement: Improve relevance filtering and domain-specific understanding
  • Real-time Integration: Develop pipeline for production deployment at RTV Slovenija
  • Multilingual Extension: Add translation capabilities for broader accessibility

Data Sources

  • Input Data: Traffic event data from promet.si portal (Excel format)
  • Target Output: RTV Slovenija radio traffic reports
  • Guidelines: Official formatting and style instructions for human reporters

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Academic Report

A comprehensive academic report detailing the methodology, evaluation, and results is available in the report/ directory.

Notes

Authors: Aljaž Justin, Edin Ćehić, Lea Briški
Advisor: Slavko Žitnik
Institution: University of Ljubljana, Faculty of Computer and Information Science
Course: Natural Language Processing (2024/2025)