
🐊 NLP Project: coccodrillo

Description

Coccodrillo is an intelligent natural language assistant. It is capable of interpreting complex requests regarding travel, weather, restaurants, events, safety, and much more. It uses advanced NLP models to understand user intents, extract relevant entities, correct errors, and generate structured responses.


🎯 Supported Intents

The system recognizes 8 main intents:

  1. 📍 Safety updates for a location – "Is it safe?"

  2. 🌪️ Weather alerts and extreme weather conditions

  3. 🗺️ Recommended places to visit

    • Data: Dataset scraped from Lonely Planet using Python scripts based on Selenium and BeautifulSoup.
    • Integration: Uses Google Maps to check the availability of locations (for example, to know if a place is temporarily closed).
  4. 🎢 Information on concerts and events

  5. 🍽️ Best restaurants to eat at

  6. 🍝 Recommendations on typical dishes or foods

    • Data: Dataset scraped from TasteAtlas using Python scripts with Selenium and BeautifulSoup.
  7. 🌀️ Future weather forecasts

  8. 🚄 Information on trains, flights, and buses


📂 How to Run the Project

0. Requirements

This project requires PyTorch 2.6.0.

✅ PyTorch 2.6.0 is compatible with the following Python versions:

  • Python 3.9
  • Python 3.10
  • Python 3.11
  • Python 3.13

⚠️ Note: The file ./setup/requirements.txt lists the libraries installed in the development environment (generated with pip freeze > requirements.txt). In practice, all the necessary libraries, in the correct versions, can be installed with the Python script to_install.py.

1. Install Dependencies

To install all the necessary libraries and download the required BERT models and spaCy language model, run the following command:

python3 ./setup/to_install.py

2. Install ChromeDriver

This project uses Chrome for browser automation, which requires ChromeDriver to be installed.

You can download the appropriate version of ChromeDriver for your operating system (Windows, macOS, or Linux) from the following link:
👉 ChromeDriver Downloads

Note: For convenience, ChromeDriver executables for Linux and Windows are already included in the ./setup/driver directory.

⚙️ Requirements

  • Make sure to download the ChromeDriver version that matches your installed version of Google Chrome.
  • If you don’t have Google Chrome installed, you can download it here:
    👉 Download Google Chrome

🐧 On Linux:

  • Place the chromedriver executable inside the bin/ directory.

🪟 On Windows:

  • Place the chromedriver_win32 directory in the root of this project.
  • Ensure that the ChromeDriver file has a .exe extension (chromedriver.exe).
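
To verify the setup, here is a minimal Selenium sketch, assuming Selenium 4 and the driver placed as described above (adjust the path for your OS):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Point Selenium at the local driver (Linux path shown; on Windows use chromedriver_win32/chromedriver.exe)
service = Service("bin/chromedriver")
options = webdriver.ChromeOptions()
options.add_argument("--headless")        # run without opening a visible browser window
driver = webdriver.Chrome(service=service, options=options)

driver.get("https://www.google.com")
print(driver.title)
driver.quit()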

3. Run the Project

To run the project, navigate to the src directory and execute the run.py script using the following command:

cd src
python3 run.py

⚙️ How It Works

  • The user enters a natural language phrase in run.py

  • The sentence-transformer model all-MiniLM-L6-v2 is used to classify the intent of the request (a minimal sketch of this step follows this list)

  • The text is analyzed to extract cities, dates, and other entities through:

    • 🧠 Question Answering with:
      from transformers import BertTokenizer, BertForQuestionAnswering
      tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
      model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
    • If information is missing (e.g., date), the system detects this and asks the user for clarification.
    • The system handles spelling errors and formatting through semantic checks.
  • The necessary data is retrieved from the downloaded datasets and from information available on the web, collected with Python libraries such as Selenium and BeautifulSoup.

  • The output is presented to the user in a well-structured format.
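
As a rough illustration of the intent-classification step, here is a minimal sketch using sentence-transformers; the intent descriptions below are illustrative placeholders, not the ones defined in the repository:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Illustrative intent descriptions (the real project defines its own 8 intents)
intents = {
    "safety": "Is it safe to travel to this location?",
    "weather_alert": "Weather alerts and extreme weather conditions",
    "places": "Recommended places to visit in a city",
    "events": "Concerts and events in a city",
    "restaurants": "Best restaurants to eat at",
    "food": "Typical dishes and local food",
    "forecast": "Future weather forecast for a city",
    "transport": "Trains, flights and buses between cities",
}

query = "Can you recommend some places to visit in Valencia for 3 days?"
query_emb = model.encode(query, convert_to_tensor=True)
intent_embs = model.encode(list(intents.values()), convert_to_tensor=True)

scores = util.cos_sim(query_emb, intent_embs)[0]
best_intent = list(intents.keys())[int(scores.argmax())]
print(best_intent)  # expected: "places"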


🧾 System Output

First Response:
An introductory sentence, generated with a bigram model trained on a small dataset, telling the user that the system is searching for the requested information.
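
A minimal sketch of how such a bigram opener can be generated; the seed sentences below are placeholders, not the project's dataset:

import random
from collections import defaultdict

# Placeholder corpus of introductory sentences (the real project uses its own dataset)
corpus = [
    "i am searching the latest information for you",
    "please wait while i am looking for the best results",
    "i am looking for the most relevant information",
]

# Build the bigram table: word -> possible next words
bigrams = defaultdict(list)
for sentence in corpus:
    words = sentence.split()
    for w1, w2 in zip(words, words[1:]):
        bigrams[w1].append(w2)

# Generate a sentence by sampling the next word from the bigram table
word = "i"
generated = [word]
while word in bigrams and len(generated) < 12:
    word = random.choice(bigrams[word])
    generated.append(word)
print(" ".join(generated).capitalize())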

Second Response:
A structured output of all the information found.

💡 Strengths

✅ Automatic Correction

The system automatically corrects small errors in city names or dates, thanks to spelling checks and semantic similarity logic.
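
A minimal sketch of the spelling-correction idea using fuzzy string matching with difflib; the actual project may combine this with embedding-based semantic similarity:

from difflib import get_close_matches

# Known city names (taken from the list of supported cities below)
KNOWN_CITIES = ["Rome", "Ljubljana", "Prague", "Vienna", "Florence", "Naples",
                "Maribor", "Paris", "Valencia", "Barcelona", "Madrid"]

def correct_city(name):
    """Return the closest known city name, or None if nothing is similar enough."""
    matches = get_close_matches(name.title(), KNOWN_CITIES, n=1, cutoff=0.75)
    return matches[0] if matches else None

print(correct_city("Lubljana"))   # -> Ljubljana
print(correct_city("Valncia"))    # -> Valencia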


📅 Support for Multiple Cities and Dates

The user can make complex requests, including multiple destinations and time periods within the same sentence.


🌍 Local News + Automatic Translation

For safety requests regarding foreign cities or countries:

  • News is searched in the local language to maximize accuracy.
  • Texts are summarized and translated using the following tools:
from transformers import pipeline, BartTokenizer

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
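
For example, a usage sketch of the summarization pipeline defined above; the article text and length limits are illustrative:

# Example usage of the summarizer defined above (the article text is a placeholder)
article = ("Heavy rain caused flooding in several districts of the city on Monday. "
           "Authorities closed two roads and advised residents to avoid non-essential travel.")

summary = summarizer(article, max_length=60, min_length=10, do_sample=False)
print(summary[0]["summary_text"])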

📎 Selecting the Most Relevant News

To filter the articles and return only those most relevant to the query, the similarity between the user query and the retrieved articles is calculated:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# corpus = list of retrieved article texts with the user query appended as the last element
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
query_vector = tfidf_matrix[-1]       # TF-IDF vector of the user query
document_vectors = tfidf_matrix[:-1]  # TF-IDF vectors of the articles
similarity = cosine_similarity(query_vector, document_vectors).flatten()

Only articles with similarity above a predefined threshold are included in the response.
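
For illustration, assuming articles holds the article texts (the first len(corpus) - 1 entries of the corpus above) and a hypothetical threshold of 0.2:

THRESHOLD = 0.2  # hypothetical value; the real threshold is defined inside the project

relevant_articles = [article for article, score in zip(articles, similarity)
                     if score >= THRESHOLD]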

🛡️ Verified Safety Sources


✨ Custom Travel Recommendations

The system constructs a graph of places to visit:

  • 🏛️ Nodes = Points of interest (museums, landmarks, squares, etc.)
  • 📏 Edges = Walking distances between each location
  • Each node includes two weights:
    • 🎨 Beauty: a value representing how iconic or scenic the place is
    • ⏱️ Visit time: estimated duration needed to explore the location
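
A minimal sketch of this structure using networkx (one of the project's libraries); the attribute names and values below are illustrative:

import networkx as nx

G = nx.Graph()

# Nodes: points of interest with beauty and visit-time weights (illustrative values)
G.add_node("Colosseum", beauty=0.95, visit_time=2.0)      # visit_time in hours
G.add_node("Roman Forum", beauty=0.90, visit_time=1.5)
G.add_node("Pantheon", beauty=0.85, visit_time=0.5)

# Edges: walking distance between locations, in kilometres (illustrative values)
G.add_edge("Colosseum", "Roman Forum", distance=0.4)
G.add_edge("Roman Forum", "Pantheon", distance=1.3)
G.add_edge("Colosseum", "Pantheon", distance=1.6)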

🔍 Route Optimization

This project implements a minimum-cost algorithm for optimizing sightseeing routes, with the goal of providing an efficient and enjoyable itinerary. It balances factors such as the beauty of the place, walking distance, and visit duration to create an ideal sightseeing plan.

Features

  • 🎨 Beauty of the place (highly weighted)
  • 🚢 Walking distance (moderately weighted)
  • ⌛ Visit duration (low or no weight)
  • ✅ Google Maps availability data: includes only locations that are open on the selected days and excludes places that are temporarily closed or under renovation.
  • 8 hours per day of sightseeing time, ensuring the itinerary respects the total time available (e.g., 3 days × 8 hours = 24 hours).
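
As an illustration only, one way such a weighted greedy selection could be written on top of the graph sketch above; the actual algorithm in the repository may differ (for example, it also applies the Google Maps availability filter):

def node_score(graph, current, candidate,
               importance_beauty=0.7, importance_edge=0.3, importance_time_visit=0.0):
    """Higher is better: reward beauty, penalize walking distance and visit time."""
    beauty = graph.nodes[candidate]["beauty"]
    distance = graph[current][candidate]["distance"]
    visit_time = graph.nodes[candidate]["visit_time"]
    return (importance_beauty * beauty
            - importance_edge * distance
            - importance_time_visit * visit_time)

def greedy_itinerary(graph, start, hours_available=24.0):
    """Greedily visit the best-scoring neighbour until the time budget runs out."""
    route = [start]
    used = graph.nodes[start].get("visit_time", 0.0)
    visited = {start}
    current = start
    while True:
        candidates = [n for n in graph.neighbors(current) if n not in visited]
        if not candidates:
            break
        best = max(candidates, key=lambda n: node_score(graph, current, n))
        if used + graph.nodes[best]["visit_time"] > hours_available:
            break
        route.append(best)
        used += graph.nodes[best]["visit_time"]
        visited.add(best)
        current = best
    return route, used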

Supported Cities

Currently, the route optimization feature supports the following cities:

  • Rome
  • Ljubljana
  • Prague
  • Vienna
  • Florence
  • Naples
  • Maribor
  • Paris
  • Valencia
  • Barcelona
  • Madrid

Future Expansion

We plan to extend the list of supported cities in the future. The dataset for additional cities has already been downloaded but requires manual formatting to meet the model's specifications.


💬 Example Query

"I would like to go to Rome for 3 days. Can you recommend the best things to visit? Start from Termini Station."

With the following configuration:

importance_time_visit = 0.0
importance_beauty     = 0.7
importance_edge       = 0.3

The output is:

{
  'Rome': ([
    ('Museo Nazionale Romano: Palazzo Massimo alle Terme', '1'),
    ('Basilica di Santa Maria Maggiore', '1'),
    ('Colosseum', '1'),
    ('Roman Forum', '1'),
    ('Pantheon', '1'),
    ('Piazza Navona', '1'),
    ('Villa Farnesina', '1'),
    ("Castel Sant'Angelo", '1'),
    ("St Peter's Basilica", '1'),
    ('Vatican Gardens', '1'),
    ('Sistine Chapel', '0'),  # ← marked as closed
    ('Vatican Museums', '1'),
    ('Gianicolo', '1'),
    ('Museo della Repubblica Romana e della Memoria Garibaldina', '1'),
    ('Basilica di Santa Maria in Trastevere', '1'),
    ('Jewish Ghetto', '1'),
    ("Campo de' Fiori", '1'),
    ('Trevi Fountain', '1'),
    ('Galleria Doria Pamphilj', '1'),
    ('Piazza di Spagna', '1'),
    ('Pincio Hill Gardens', '1'),
    ('Museo e Galleria Borghese', '1')
  ], 23)
}

🕒 Time & Availability

  • Total time used: ~23 hours
  • 🚫 Closed location excluded: Sistine Chapel (temporarily closed)

🖼️ Visual Itinerary

Below is the optimized travel route drawn on a map.
Each segment is color-coded based on the order of visitation (earliest to latest).

Optimized itinerary map for Rome


⚙️ How It Works

🔧 Input

  • 🏙️ City: Rome
  • 📅 Number of days
  • 📍 Starting location (e.g., Termini Station)

🧠 Process

  • Constructs a weighted graph of attractions
  • Filters out temporarily closed or inaccessible sites
  • Runs a route optimization algorithm with custom weights:
    • 🎨 Beauty of each place
    • ⌛ Visit time
    • 🚢 Walking distance
  • Computes total estimated time
  • Generates a visual map of the itinerary

📦 Output

  • ✅ A sorted list of recommended places to visit
  • ⏳ Estimated total visit time
  • 🗺️ A visual path connecting all selected locations

🎫 Events and Concerts

The user can search for events specifying:

  • City
  • Dates
  • Artist
  • Music genre

The system connects to public websites (e.g., Bandsintown) to show up-to-date events.
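
The repository does not document the exact scraping logic; purely as an illustration, a generic requests + BeautifulSoup sketch against a hypothetical events page could look like this (the URL and CSS classes below are made up):

import requests
from bs4 import BeautifulSoup

# Hypothetical URL and CSS selectors, for illustration only
url = "https://example.com/events?city=Ljubljana&date=2025-05-02"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

events = []
for item in soup.select(".event"):            # hypothetical CSS class
    title = item.select_one(".event-title")   # hypothetical CSS class
    date = item.select_one(".event-date")     # hypothetical CSS class
    if title and date:
        events.append((title.get_text(strip=True), date.get_text(strip=True)))

for title, date in events:
    print(f"{date} - {title}")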


🍽️ Restaurants and Local Food

  • Suggests the best restaurants based on the area
  • Recommends typical dishes based on location

☁️ Weather and Alerts

The weather module provides:

  • Detailed forecasts for cities and dates
  • Automatic detection of extreme events or abnormal conditions
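
The repository does not state which weather source it uses; purely as an illustration, a forecast lookup against the public Open-Meteo API with a naive alert threshold might look like this:

import requests

# Illustrative coordinates for Berlin; a real lookup would first geocode the requested city
params = {
    "latitude": 52.52,
    "longitude": 13.41,
    "daily": "temperature_2m_max,precipitation_sum",
    "timezone": "auto",
}
data = requests.get("https://api.open-meteo.com/v1/forecast", params=params, timeout=10).json()

for day, t_max, rain in zip(data["daily"]["time"],
                            data["daily"]["temperature_2m_max"],
                            data["daily"]["precipitation_sum"]):
    alert = "  <-- possible extreme rainfall" if rain is not None and rain > 30 else ""
    print(f"{day}: max {t_max} °C, precipitation {rain} mm{alert}")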

✈️ Transport

The system provides up-to-date information on:

  • Trains
  • Buses

⚠️ Weaknesses

While the system is robust and flexible, it has some technical limitations that are currently being improved:

🚌 Train and Bus Search

  • If queried too frequently, the transport website may block automated traffic, flagging it as suspicious activity.
  • In these cases, the search may fail or return incomplete results.

🌐 Article Translation

  • Long articles may cause errors in the translation phase.
  • The system splits texts into individual sentences, but sometimes even a single sentence is too long to be translated correctly.
  • In these cases, the result is provided in the original language.

📌 Examples (Testing 01/05/2025)

User:

Tell me the best typical food in Rome.

Output

Food in Rome


User:

Can you recommend some places to visit in Valencia for 3 days?

Output

Places to visit in Valencia


User:

I would like to go in Rome for 3 days, can you reccomend for me the best things to visit? Start to Termini Station

Output

Rome places to visit


User:

Hello, can you say me the current situation about the security in France, is safe?

Output

News France doc1

News France doc2


User:

Tell me the last news about the warning weather alert in Valencia.

Output

News Valencia


User:

Some concerts in Ljubljana for tomorrow.

Output

Events in Ljubljana


User:

I am going in Milan in the 1 June, there are concert by Jerry Cantrell?

Output

Events in Milan


User:

Can you write for me the best places where i can eat in Prague

Output

Restaurants in Prague


User:

What are the typical food in Naples? and in Paris?

Output

Naples food

Paris food


User:

I am going in Berlin, tell me the temperature for friday.

Output

Temperature Berlin


🔧 Technologies Used

  • Language: Python 3.10+

  • NLP Models:

    • BERT (for Question Answering and Named Entity Recognition)
    • MiniLM (for intent classification)
    • BART (for summarization and translation)
  • Main Libraries:

    • transformers, sentence-transformers
    • scikit-learn, networkx, geopy
    • nltk, spacy, pandas, requests
    • beautifulsoup4, selenium

✅ Testing

The system has been tested using real-world natural language phrases to verify its reliability and effectiveness in real scenarios. Three main types of tests were conducted to assess various aspects of the system:

  1. Test on ambiguous and incomplete requests
    Examples with incomplete or ambiguous phrases were used to verify how the system handles request interpretation and the processing of missing information.
    The results of these tests are available in the testing/query_with_error_testing folder.

  2. Intent classification test
    This test verified whether the system can correctly classify the user's intent, even in the presence of complex phrases or multiple intents.
    The results of these tests are available in the testing/intent_testing folder.

  3. Real request and output test
    In this test, real examples of requests were used, executing a complete simulation from query formulation to the system-generated output. The goal was to evaluate the quality of the responses generated and the overall system reliability.
    The results of these tests are available in the testing/final_output_testing folder.

Testing Details:

  • In some cases, spelling errors were deliberately introduced in the requests to test the system's ability to correct them automatically.
  • Multiple cities or dates were also included in a single request to verify how the system handles complex scenarios.

These tests helped identify and fix any weaknesses, improving the overall system reliability.

📬 Contact

Do you have suggestions, bugs to report, or want to contribute?

👉 Open an issue or contact us directly via Davide's GitHub or Ondrej's GitHub

