This CLI tool drives a Retrieval-Augmented Generation (RAG) pipeline for generating, evaluating, and optimizing AI answers based on knowledge sources. The pipeline provides setup and analysis functions as well as interactive use via CLI or web API.
Running the full pipeline requires:

- Python 3.8+
- All dependencies from `requirements.txt`
- Ollama with the model `llama3` available locally (for retrieval and answer generation)
- For evaluation: a valid OpenAI API key and access to the `o3` model
```sh
git clone https://github.com/MoDimGH/ReductionOfHallucinations
cd ReductionOfHallucinations
pip install -r requirements.txt
```

The CLI consists of the commands `setup`, `run`, `evaluate`, and `analyse`.
Initializes various components of the pipeline:

```sh
python -m rag_pipeline setup <step>
```

Available steps:
| Step | Description |
|---|---|
| `create_dataset` | Re-scrapes all files and creates a new training dataset |
| `create_testset` | Creates and contextualizes a test set |
| `update_database` | Populates the vector database with the generated documents |
| `qa_optimization` | Creates QA documents and populates an optimized QA database |
| `all` | Runs all of the above steps in sequence |
Example:

```sh
python -m rag_pipeline setup update_database
```

Starts the pipeline in the desired mode:
```sh
python -m rag_pipeline run <cli_chat|web_api> [--optimizations <opt1,opt2,...>]
```

Modes:

- `cli_chat` – starts the interactive CLI chat
- `web_api` – starts a FastAPI application at http://localhost:8000
Optional optimizations (comma-separated):

- `qa_sections` – use the QA-optimized database
- `hybrid_search` – combination of vector search and BM25 search
- `score_thresholding` – score-based filtering
- `prompt_engineering` – adapted prompts
Example:

```sh
python -m rag_pipeline run cli_chat --optimizations qa_sections,hybrid_search
```

Starts an automated evaluation of all combinations of architectures and optimizations:
```sh
python -m rag_pipeline evaluate
```

Runs comprehensive analyses of the evaluation results:

```sh
python -m rag_pipeline analyse
```

Among other things, the following metrics are evaluated:
- Types of hallucinations
- Use-case-specific errors
- Comparison with the baseline
- Combinations of architectures and optimizations

The generated charts can be found under `./benchmarking/evaluation/analysis_output/`.
Validation of the generated test dataset:

```sh
python -m streamlit run "./benchmarking/manual_testset_validation/validation_app.py"
```

→ available at: http://localhost:8501
Validation of the automated evaluation:

```sh
python -m streamlit run "./benchmarking/evaluation/validate_evaluation_process.py"
```

→ likewise available at: http://localhost:8501
⚠ Note: The two tools cannot run in parallel on the same port. To use them at the same time, move one of them to a different port (e.g. via Streamlit's `--server.port` option).
```
.
├── scraping/
├── benchmarking/
│   ├── testset_generation/
│   ├── manual_testset_validation/
│   └── evaluation/
├── rag_pipeline/
│   ├── query_rag.py
│   ├── build_rag_pipeline.py
│   ├── populate_database.py
│   └── ...
├── optimizations/
│   └── qa/
├── web/
│   └── backend/
│       └── api.py
```
Moses Dimmel
This README was created with the help of AI and adapted to the actual project structure and logic.
Bremen University of Applied Sciences
Faculty IV: Electrical Engineering and Computer Science
| | |
|---|---|
| Study program: | Cooperative Degree Programme in Computer Engineering B.Sc. |
| Planned submission date: | 20.07.2025 |
Dataport AöR is the IT service provider for public administration in a total of seven federal states in Germany. Eight locations have now been established across the sponsoring states. Among other things, Dataport is developing the service portal[1] for the City of Hamburg.
A chatbot is now to be developed for this portal, with which users of the site can chat to obtain general information about various city administration services. Should users pass sensitive data to the chatbot, this data must also be processed by it. To ensure the security of citizens' data, the Large Language Model (LLM) behind the chatbot is to be operated in the in-house data center. The aim is to use the most resource-efficient model possible in order to avoid excessive hardware costs.
It is essential for the added value of the application to prevent hallucinations, which occur all the more frequently in models with fewer resources. "Hallucinations in the context of LLMs usually refer to situations where the model generates content that is not based on factual or correct information. The occasional generation of outputs that appear plausible but are factually incorrect undermines the reliability of LLMs in real-world scenarios"[2] [author's translation] (Cheng Niu, 2025, p. 1). To avoid most of these hallucinations, "Retrieval-Augmented Generation" (RAG) is to be used for the chatbot: documents whose context is similar to the user's query are retrieved and passed to an LLM, so that it answers based on this external context rather than on intrinsic knowledge acquired during training.
Nevertheless, various types of hallucinations continue to occur even when RAG is used (Wan Zhang, 2025):

- The answer contradicts the source's own statements ("intrinsic hallucination")
- The LLM adds information that is not contained in the source ("extrinsic hallucination")
- The answer contains factually incorrect statements ("factual hallucination")
- The LLM deviates from the task it was given ("faithfulness hallucination")
- The LLM invents credible-sounding but false facts ("factual mirage")
- The LLM responds to an incorrect assumption in the prompt with a fictitious answer ("silver lining")
The scientific literature describes numerous strategies for reducing hallucinations, including some specific to RAG applications. In this section, these techniques are presented, scientifically classified, and then evaluated with regard to their applicability in this work.
The models for embedding can be trained on domain-specific data (Yunianto, Permanasari, & Widyawan, 2020) . This can significantly increase the relevance of the retrieved documents. Since domain-specific data is available for this work, this method is feasible.
The combination of sparse (e.g. BM25) and dense (e.g. Sentence-BERT) models is an effective method for improving precision and recall for document retrieval (Priyanka Mandikal, 2024) . Sparse models primarily extract documents in which the query occurs exactly word for word, but neglect documents with similar context or synonyms to the contents of the search query. Dense models, on the other hand, more easily overlook documents in which the search query occurs word for word, but pay close attention to the context of the documents. The strengths of both are therefore combined here, so that as many correct documents as possible are retrieved, but as few irrelevant documents as possible. This method is well suited for this work, as the identification of services often involves word-based searches, but the context often needs to be considered, as users may ask questions without domain-specific prior knowledge.
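As a minimal, illustrative sketch of this hybrid idea (not the project's actual code), the rankings of a sparse and a dense retriever can be merged with reciprocal rank fusion (RRF); the document ids and orderings below are invented:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first ranked lists of document ids into one ranking.

    The constant k dampens the influence of the very top ranks; 60 is the
    value commonly used for RRF.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Invented example: BM25 rewards exact keyword matches, the dense
# retriever rewards semantic similarity; RRF favours documents that
# appear high in both lists.
sparse_ranking = ["doc_passport", "doc_id_card", "doc_parking"]
dense_ranking = ["doc_id_card", "doc_residence", "doc_passport"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused[0])  # doc_id_card: ranked highly by both retrievers
```

In the real pipeline, the rankings produced by the BM25 index and the vector database would take the place of the two invented lists.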
Reformulating the user input can significantly improve information retrieval (Shengyu Mao, 2024). This technique can also help a RAG system increase the quality of the retrieved documents. As the implementation is very lightweight, it is well suited for the planned prototype.
To increase the quality of the retrieved documents, it is also advisable to pre-process the data (Kettunen, 2025). For this, an LLM can create, once per document or per self-contained block of information, a question-and-answer list for the content of that section. This list would contain answers to commonly asked questions, making it easier for the embedding model to match questions to the correct documents and retrieve them. This strategy is suitable for the prototype to be built.
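A sketch of how such a Q&A pre-processing prompt could be assembled per chunk; the helper name and the wording are hypothetical, and any LLM client (e.g. the local `llama3` via Ollama) could be used to send it:

```python
def build_qa_generation_prompt(chunk: str, num_pairs: int = 5) -> str:
    """Hypothetical helper: turn one self-contained document chunk into an
    instruction asking an LLM for common question-answer pairs."""
    return (
        f"Read the following excerpt from an administrative service page.\n"
        f"Write {num_pairs} question-answer pairs that citizens commonly ask, "
        f"answering strictly from the excerpt.\n\n"
        f"Excerpt:\n{chunk}"
    )

prompt = build_qa_generation_prompt(
    "Passport applications require an appointment at the district office.",
    num_pairs=3,
)
```

The returned Q&A list would then be embedded alongside (or instead of) the raw chunk.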
If the RAG system finds matching documents but their context only loosely matches the user input, the system should answer "I don't know" instead of attempting to generate an answer; if at least a few closely matching documents could be retrieved, the answer should be based only on these. Because less relevant documents are excluded, they can no longer mislead the answer generation, and the accuracy of the answers increases. This is achieved by introducing a threshold for the confidence scores of the retrieved documents (Radeva, Popchev, & Dimitrova, 2024). As this approach is also lightweight, it is likewise suitable for this work.
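A minimal sketch of such a threshold filter, assuming retrieval returns (document, similarity score) pairs where higher scores mean a better match; the names and values are illustrative:

```python
def filter_by_score(results, threshold=0.75, min_docs=1):
    """Keep only documents whose similarity score reaches the threshold.

    Returns None when too few documents survive, signalling that the
    system should answer "I don't know" rather than generate an answer
    from weak context.
    """
    kept = [(doc, score) for doc, score in results if score >= threshold]
    return kept if len(kept) >= min_docs else None

results = [("doc_a", 0.91), ("doc_b", 0.62), ("doc_c", 0.80)]
print(filter_by_score(results))  # [('doc_a', 0.91), ('doc_c', 0.80)]
```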
The answers of the RAG system can be checked for factual correctness by another system. For example, the open-source benchmark "Poly-FEVER" (Hanzhi Zhang, 2025) was developed for this purpose; it consists of 77,973 labeled factual claims in over 11 languages. It is of limited use for the system to be developed, however, since Poly-FEVER covers general topics, whereas the present use case lies in the area of government services. With some technical effort, custom benchmarks can nevertheless be created from the available data of the use case, and hallucination detection can be performed with tools such as RAGAS (Shahul Es, 2025).
Certain instructions in the prompt (Pranab Sahoo, 2024) can be used to suppress speculative statements by the LLM, so that it always prioritizes the external knowledge from the retrieved documents over the intrinsic knowledge acquired through training. It makes sense to implement this in the prototype.
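One possible shape of such an instruction, sketched as a prompt template; the exact wording is an assumption for illustration, not the prompt used in the project:

```python
# Illustrative anti-speculation prompt template: the instructions pin the
# model to the retrieved context and give it an explicit way out.
ANSWER_PROMPT = """You are an assistant for a public service portal.
Answer ONLY from the context below. If the context does not contain the
answer, reply exactly: "I don't know." Do not speculate and do not use
prior knowledge.

Context:
{context}

Question: {question}
Answer:"""

def build_answer_prompt(context: str, question: str) -> str:
    """Fill the template with the retrieved context and the user question."""
    return ANSWER_PROMPT.format(context=context, question=question)
```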
The more complex the user's prompt, the more difficult it is for the LLM to provide a correct, meaningful answer. Chain-of-thought prompting, however, enables complex reasoning by dividing the reasoning chain into individual, less complex steps; this is particularly useful for calculations. Splitting the answer-finding process into several steps also increases transparency, which gives users more confidence in the answer and makes its content easier to follow. (Takeshi Kojima, 2025) describes how this can be triggered simply by appending "Let's think step by step" to the user's prompt. This can also be integrated into the prototype of this work.
If the user's prompt has been formulated in a way that is somewhat incomprehensible or no content has been found, the LLM could ask the user questions and provide clarification (Vipula Rawte, 2024) . Special prompt templates can be used for this purpose. This is feasible in principle.
The combination of a RAG system with a knowledge graph (Nicholas Matsumoto, 2024) as an additional source of knowledge significantly reduces hallucinations. However, a complex system architecture is required for implementation, which is beyond the scope of this paper.
The model responses can be improved through user feedback (Yu Bai, 2024). On the one hand, the model can be trained using "Reinforcement Learning from Human Feedback" (RLHF). On the other hand, users can mark hallucinations in an answer, which are then added to a collection used for cross-checking future answers. This is feasible for this work.
The generation quality of RAG applications can also be improved by systematically tuning the retrieval and model parameters (Matthew Barker, 2025) . This can be implemented by means of an experimental approach in this work.
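The experimental approach can be sketched as a plain grid search over retrieval parameters; the parameter names and the scoring function below are hypothetical stand-ins for a real benchmark run:

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination of parameter values and return the
    configuration with the highest score."""
    names = list(param_grid)
    best_cfg, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        cfg = dict(zip(names, values))
        score = evaluate(cfg)  # in practice: run the benchmark suite
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical parameters; evaluate() would e.g. return an average
# faithfulness score from the evaluation pipeline.
grid = {"top_k": [3, 5, 10], "score_threshold": [0.6, 0.75]}
best_cfg, best_score = grid_search(
    grid, lambda cfg: cfg["top_k"] - cfg["score_threshold"]
)
```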
This section presents current techniques for detecting hallucinations in RAG applications and evaluating the reduction of hallucinations. The methods are scientifically classified and then evaluated based on their applicability in this work.
In order to check the generated answers for coherence, factual accuracy and consistency, a powerful language model can be used as an evaluation instance. Examples include SelfCheckGPT (Potsawee Manakul, 2023), G-Eval, Prometheus, Lynx, and the Trustworthy Language Model (TLM). A comparison of these models was carried out by (Sardana, 2025). This approach is well suited for this project, as no ground truth answers are given and evaluating the answers by hand would be too time-consuming. However, the quality of the assessments depends heavily on the assessment model used.
Tools such as ReDeEP (Zhongxiang Sun, 2024) and LibreEval (Arize Research, 2025) make it possible to create custom benchmarks based on domain-specific documents. The aim is to measure metrics such as faithfulness, answer correctness, or precision; the generated answers can then be compared with known, verified statements to evaluate hallucinations. Even though creating benchmarks is time-consuming, it allows hallucinations to be recognized, and possibly even mitigated, in a domain-specific context. This makes the approach suitable for this project.
After going live, the RAG system can continue to be optimized based on user feedback (Yu Bai, 2024) . For this purpose, a functionality can be introduced that allows end users to mark hallucinations in answers or rate the answer based on its factual accuracy. In this way, the model can be trained using RLHF. This involves some technical effort, but is very practical for business applications.
The relevance and coverage of the retrieved documents can already be assessed at the retrieval level using tools such as RAGAS (Shahul Es, 2025), ReDeEP, and reranking (e.g. with BERT (Koroteev, 2021)). This involves checking whether the respective document supports the answer and evaluating metrics such as similarity scores, coverage, or faithfulness. This can be very useful for optimizing the retrieval process of the prototype to be developed, as no manual annotations are necessary. Automatically generated retrieval benchmarks can also be used here for continuous evaluation.
The aim of this bachelor thesis is the prototypical development of a domain-specific chatbot for the city of Hamburg's service portal[3] based on the principle of "Retrieval-Augmented Generation" (RAG). The focus here is on reducing hallucinations in the generated answers. To address this challenge, selected measures to avoid hallucinations are integrated, tested and evaluated.
The first step involves the automated procurement and processing of relevant content from the publicly accessible areas of hamburg.de. This data will be merged into a knowledge database and a RAG pipeline will be implemented on the basis of this, in which a selection of easy-to-implement methods for hallucination avoidance will be integrated, including:
- Data preparation through Q&A generation
- Hybrid retrieval (sparse + dense)
- Threshold-based filtering of irrelevant documents
- Various prompt-engineering methods to avoid speculation
The aim is to investigate the effectiveness of these approaches with regard to the reduction of different types of hallucinations. For this purpose, suitable evaluation metrics are defined and automated evaluation procedures (LLM-as-a-Judge or RAGAS) are used for analysis. The evaluation is carried out using generated realistic test scenarios.
The focus of the work is on building a functional prototype that is both technically functional and methodologically meaningful enough to evaluate the effectiveness of individual hallucination reduction strategies in the context of a domain-specific use case. The aim is not a generally valid comparison of all theoretically possible approaches, but rather a proof of concept for the effectiveness of selected, practicable measures in the context of a specific application situation.
The focus of the implementation is on the technical realization of a RAG-supported chatbot for answering citizens' questions on the basis of publicly available administrative documents and on the evaluation of various methods for reducing hallucinations in the generated answers.
- Data acquisition & preparation
- Scraping relevant content from hamburg.de/service and any related pages
- Pre-processing of documents (formatting, filtering)
- Q&A generation for structured information processing
- System architecture & implementation
- Configuration and integration of a RAG system incl. vector database
- Implementation of a simple user interface for interaction
- Evaluation preparation
- Selection and implementation of suitable benchmarking methods for hallucination detection (RAGAS, LLM-as-a-Judge)
- Creation of domain-specific metrics and ground truth data
- Test runs & comparison
- Execution of test series with various optimization strategies (Q&A data preparation, hybrid search, prompt engineering, thresholds)
- Comparison of the systems based on qualitative and quantitative criteria
- Documentation & evaluation
- Analysis and presentation of the results
- Reflection on the feasibility and effectiveness of the methods used
- Composing the written Bachelor thesis
| Week | Task |
|---|---|
| 1 | Data acquisition (scraping) and pre-processing |
| 2 | Q&A creation, development of the knowledge database |
| 3 | Implementation of the RAG pipeline, initial tests |
| 4 | UI development, expansion to include optimization methods |
| 5 | Preparation of the evaluation (metrics, test data, benchmarks) |
| 6 | Test runs and comparison of the methods for hallucination reduction |
| 7 | Evaluation, analysis, creation of figures and tables |
| 8-9 | Writing, fine-tuning and submitting the Bachelor's thesis |
Hardware:
- Development on local computer
- Use of cloud instances if necessary
Software and tools:
- Python
- Postgres vector database
- Streamlit for the UI prototype
- GitHub for version control
- Access to an LLM API
Test data:
- Generation of own benchmarks for evaluation on the basis of publicly accessible content from the hamburg.de/service website
- Introduction
- Problem definition
- Objective of the work
- Structure of the work
- Theoretical foundations
- Retrieval Augmented Generation (RAG)
- Hallucinations in LLM outputs
- Methods for reducing hallucinations
- Overview of existing RAG-based systems in the context of public authorities
- Requirements analysis and system design
- Use case: Citizen inquiries based on public administration information
- Requirements for data sources, security and timeliness
- System architecture and component overview
- Implementation of the prototype
- Data acquisition and pre-processing (scraping, Q&A generation)
- Development of the RAG pipeline (retriever, LLM, vector database)
- Implementation of the user interface
- Methods for hallucination avoidance and detection
- Retrieval optimization (hybrid search, filter)
- Prompt engineering and context restrictions
- Confidence metrics and evaluation procedures (e.g. RAGAS, LLM-as-a-Judge)
- Evaluation concept and test design
- Evaluation and results
- Description of the test data and scenarios
- Comparison of the methods used
- Discussion of the results
- Boundaries and limitations
- Conclusion and outlook
- Summary of the work
- Assessment of target achievement
- Outlook on further developments and potential for practical use
- Appendix
- Illustrations, code snippets, configuration files
- List of sources
- Affidavit
Cheng Niu, Y. W. (17. 04 2025). RAGTruth: A Hallucination Corpus for Developing Trustworthy. Retrieved from https://arxiv.org/pdf/2401.00396
Hanzhi Zhang, S. A. (25. 04 2025). Poly-FEVER: A Multilingual Fact Verification Benchmark for Hallucination Detection in Large Language Models. Retrieved from https://arxiv.org/abs/2503.16541
Kettunen, N. (2025). Development of a framework for pre-processing domain-specific data using a technical language processing approach. Retrieved from https://lutpub.lut.fi/handle/10024/168926
Koroteev, M. V. (22. 04 2021). BERT: A Review of Applications in Natural Language Processing and Understanding. Retrieved from https://arxiv.org/abs/2103.11943
Matthew Barker, A. B. (25. 2 2025). Faster, Cheaper, Better: Multi-Objective Hyperparameter Optimization for LLM and RAG Systems. Retrieved from https://arxiv.org/abs/2502.18635
Nicholas Matsumoto, J. M. (3. 6 2024). KRAGEN: a knowledge graph-enhanced RAG framework for biomedical problem solving using large language models . Retrieved from https://academic.oup.com/bioinformatics/article/40/6/btae353/7687047
Potsawee Manakul, A. L. (15. 04 2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. Retrieved from https://arxiv.org/abs/2303.08896
Pranab Sahoo, A. K. (5. 2 2024). A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. Retrieved from https://rotmandigital.ca/wp-content/uploads/2024/09/A-Systematic-Survey-of-Prompt-Engineering-in-Large-Language-Models.pdf
Priyanka Mandikal, R. M. (8. 1 2024). Sparse Meets Dense: A Hybrid Approach to Enhance Scientific Document Retrieval. Retrieved from https://arxiv.org/abs/2401.04055
Radeva, I., Popchev, I., & Dimitrova, M. (2024). Similarity Thresholds in Retrieval-Augmented Generation. Retrieved from https://ieeexplore.ieee.org/abstract/document/10705214?casa_token=eQ-r5Pc63ccAAAAA:VOYjoH0fEsfbclOgfU-NBZ63l7Qb64FLHtK9hsoLMpz76obf5NmnVye8dvf8xVOmGN5fhjMVOQ
Arize Research (25. 04 2025). LibreEval: The Open-Source Benchmark for RAG Hallucination Detection. Retrieved from https://arize.com/llm-hallucination-dataset/
Sardana, A. (4 2025). Real-Time Evaluation Models for RAG: Who Detects Hallucinations Best? Retrieved from https://www.researchgate.net/publication/390247766_Real-Time_Evaluation_Models_for_RAG_Who_Detects_Hallucinations_Best
Shahul Es, J. J. (25. 04 2025). RAGAs: Automated Evaluation of Retrieval Augmented Generation. Retrieved from https://aclanthology.org/2024.eacl-demo.16/
Shengyu Mao, Y. J. (23. 5 2024). RaFe: Ranking Feedback Improves Query Rewriting for RAG. Retrieved from https://arxiv.org/abs/2405.14431
Takeshi Kojima, S. S. (25. 04 2025). Large Language Models are Zero-Shot Reasoners. Retrieved from https://arxiv.org/abs/2205.11916
Vipula Rawte, S. T. (27. 4 2024). "Sorry, Come Again?" Prompting -- Enhancing Comprehension and Diminishing Hallucination with [PAUSE]-injected Optimal Paraphrasing. Retrieved from https://arxiv.org/abs/2403.18976
Wan Zhang, J. Z. (17. 04 2025). Hallucination Mitigation for Retrieval-Augmented Large Language Models: A Review. Retrieved from https://www.mdpi.com/2227-7390/13/5/856
Yu Bai, Y. M. (21. 6 2024). Pistis-RAG: Enhancing Retrieval-Augmented Generation with Human Feedback. Retrieved from https://arxiv.org/abs/2407.00072
Yunianto, I., Permanasari, A. E., & Widyawan, W. (1. 12 2020). Domain-Specific Contextualized Embedding: A Systematic Literature Review. Retrieved from https://ieeexplore.ieee.org/abstract/document/9271752?casa_token=SpIZDtY_vkQAAAAA:mlX3j-xotQTG0Q8nkdMh7Me_Nvg8jXZ5O1CeSU0M_rdLAXvWX96p6QerkENs8Zq1WrsxexQEmQ
Zhongxiang Sun, X. Z. (15. 10 2024). ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability. Retrieved from https://arxiv.org/abs/2410.11414
[1] https://www.hamburg.de/service (Last accessed: 04/17/2025)

[2] Hallucination in the context of LLMs usually refers to a situation where the model generates content that is not based on factual or accurate information [...]. The occasional generation of outputs that appear plausible but are factually incorrect significantly undermine the reliability of LLMs [...].

[3] https://www.hamburg.de/service (Last accessed: 04/17/2025)
