In this report, we developed a set of RAG solutions and performed a thorough evaluation. We used two LLMs, DeepSeek and Qwen3, to compare the behavior of RAG with and without reasoning, respectively. We built two methods for information retrieval and injection: one based on prompt analysis and indiscriminate prompt injection, and one based on function calling at the LLM's discretion. To make the testing more robust, we evaluated all variants of our RAG systems on a prepared question set, using two differently sized LLMs as judges -- GPT-4.1-mini and a 14-billion-parameter Qwen model -- alongside our own subjective observations. We found that both judges performed similarly on some metrics, with GPT-4.1-mini performing better on others. The results show that Qwen with the function-calling RAG system outperforms all other variants.
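As a rough illustration of how an LLM-as-judge score might be obtained (this is not the project's actual judge prompt or client code; `call_llm` is a hypothetical stand-in for whichever judge model, GPT-4.1-mini or the 14B Qwen, is queried):

```python
# Illustrative LLM-as-judge scoring sketch; the project's real judge prompts
# and client code differ. `call_llm` is a hypothetical callable that takes a
# prompt string and returns the judge model's reply as a string.
JUDGE_PROMPT = (
    "You are grading a chatbot answer about movies.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Chatbot answer: {answer}\n"
    "Reply with a single integer score from 1 (wrong) to 5 (fully correct)."
)

def judge_answer(call_llm, question: str, reference: str, answer: str) -> int:
    """Ask a judge LLM for a 1-5 score and parse the first digit in its reply."""
    reply = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 0
```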
This repository can be cloned onto the HPC, or it can be found in the shared folder at `/d/hpc/projects/onj_fri/teammlg/`.
Once you have the repository, run the following to create the container:
`./ul-fri-nlp-course-project-2024-2025-teammlg/code/sling_setup.sh`
Then navigate into the `code` directory and run:

`sbatch ./sling-run.sh`

to run the evaluation (you have to be in the `code` directory for the relative imports to work).
Run the following command to create an interactive session:
`srun --job-name "chatbot testing" --cpus-per-task 4 --mem-per-cpu 1500 --time 30:00 --gres=gpu:2 --partition=gpu --pty bash`
Then run the following (model options are: `deepseek_baseline`, `deepseek_naive`, `qwen_baseline`, `qwen_naive`):

`singularity exec --nv ../../containers/nlp-v1.sif python ./conversation_evaluation.py --model qwen_naive`
For instructions on running the advanced RAG systems and the different options, see the Codebase section.
Shard loading can take up to 30 minutes. The `>` symbol indicates that the system is waiting for your query. Response generation typically takes around 30 seconds. To terminate the current session, type `quit`.
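For orientation, the interactive session behaves roughly like the loop sketched below; this is a simplification of what `conversation_evaluation.py` actually does, and `generate_reply` is a hypothetical stand-in for the model call:

```python
def interactive_session(generate_reply):
    """Minimal sketch of the interactive loop; the real one lives in conversation_evaluation.py."""
    while True:
        query = input("> ")                  # the '>' symbol: system is waiting for your query
        if query.strip().lower() == "quit":  # typing 'quit' terminates the session
            break
        print(generate_reply(query))         # response generation typically takes ~30 seconds
```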
If you get SSL-related errors, run:
`unset SSL_CERT_FILE`
The codebase diverged slightly during development; therefore, the RAG systems are split between `code` and `code_v2`:

- `code` contains the baseline and naive RAG systems, the evaluation of those systems, and additional code for processing the results.
- `code_v2` contains the baseline and advanced RAG systems, their partial evaluation, and code for results analysis.
To run an interactive conversation with either the baseline or the naive RAG system, move into `code` and run:

`python ./conversation_evaluation.py --model qwen_naive`
To run an interactive conversation with either the baseline or the advanced RAG system, move into `code_v2` and run the command below (a sketch of the underlying command-line interface follows the notes):

`python ./main.py --rag_type {baseline,advanced} --model {qwen,deepseek} --operation converse --output_directory <optional, a string> --uses_memory`
- The braces `{}` indicate that you should choose one of the listed options.
- The `--uses_memory` flag is optional and should be used when running a conversation with the chatbot.
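For reference, the `main.py` options described above can be thought of roughly as the following `argparse` setup (a sketch for orientation only; the actual parser, defaults, and help strings in the repository may differ):

```python
import argparse

# Sketch of the argument parser behind main.py (illustrative; details may differ).
parser = argparse.ArgumentParser(description="Run a baseline or advanced RAG chatbot")
parser.add_argument("--rag_type", choices=["baseline", "advanced"], required=True)
parser.add_argument("--model", choices=["qwen", "deepseek"], required=True)
parser.add_argument("--operation", default="converse")
parser.add_argument("--output_directory", default=None)
parser.add_argument("--uses_memory", action="store_true",
                    help="keep a rolling buffer of previous turns")
args = parser.parse_args()
```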
Brief description of the main functionality of each script:
- `evaluation.py`: simultaneously evaluates multiple models on a list of questions, either `data/evaluation_questions.txt` (no ground truth) or `data/evaluation_questions.json` (ground truth).
- `conversation_evaluation.py`: evaluates one model at a time by having a real-time conversation with it.
- `scraper.py`: obtains structured data from the websites TMDB, Letterboxd, and JustWatch.
- `scraper_v2.py`: modified scraper that lets the Qwen model choose which functions to call based on the user query.
- `POStagger.py`: finds titles and names of interest in the user's prompt.
- `rag.py`: inserts scraped data (based on the POS tagger) into the model's context (see the sketch after this list).
- `summarizer.py`: summarizes scraped data for the advanced RAG.
- `memory.py`: buffer of recent queries and replies for conversational models.
- `metrics.py`: calculates evaluation metrics based on the model's output and the ground truth.
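To make the naive pipeline concrete, the interplay between `POStagger.py`, `scraper.py`, and `rag.py` can be sketched as follows; `find_titles` and `fetch_movie_data` are hypothetical stand-ins for the real helpers, and the exact prompt wording is illustrative:

```python
def build_rag_prompt(user_query: str, find_titles, fetch_movie_data) -> str:
    """Sketch of the naive pipeline: detect titles, scrape data for them, inject it into the prompt.

    `find_titles` stands in for the POS-tagger step (POStagger.py) and
    `fetch_movie_data` for the scraper (scraper.py); both are hypothetical here.
    """
    chunks = []
    for title in find_titles(user_query):    # titles and names found in the user's prompt
        data = fetch_movie_data(title)       # structured data from TMDB / Letterboxd / JustWatch
        chunks.append(f"{title}: {data}")
    context = "\n".join(chunks)
    # rag.py-style injection: prepend the scraped context to the user's question
    return (
        "Use the following scraped information when answering.\n"
        + context
        + "\n\nQuestion: " + user_query
    )
```

The advanced pipeline differs in that the scraper functions are exposed to the model as callable tools (`scraper_v2.py`) and the scraped data is summarized (`summarizer.py`) before it is injected.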