
Commit 45748aa

feat: Enhance RAG system with web processing - Add web content processing support with Trafilatura - Update requirements.txt with new dependencies - Add documentation and example outputs - Improve store.py for handling web content - Add example processed content in docs/gan.json
1 parent ab75086 commit 45748aa

9 files changed: +770 −29 lines

agentic_rag/README.md

Lines changed: 36 additions & 24 deletions
@@ -5,11 +5,11 @@ An intelligent RAG (Retrieval Augmented Generation) system that uses an LLM agen
 The system has the following features:
 
 - Intelligent query routing
-- PDF processing using Docling for accurate text extraction
-- Persistent vector storage with ChromaDB
+- PDF processing using Docling for accurate text extraction and chunking
+- Persistent vector storage with ChromaDB (PDFs and websites)
 - Smart context retrieval and response generation
-- FastAPI-based REST API
-- Support for both OpenAI-based agents or local, transformer-based agents (Mistral-7B by default)
+- FastAPI-based REST API for document upload and querying
+- Support for either OpenAI-based agents or local, transformer-based agents (`Mistral-7B` by default)
 
 ## Setup

@@ -23,15 +23,15 @@ The system has the following features:
 
 2. Authenticate with HuggingFace:
 
-   The system uses Mistral-7B by default, which requires authentication with HuggingFace:
+   The system uses `Mistral-7B` by default, which requires authentication with HuggingFace:
 
-   a. Create a HuggingFace account [here](https://huggingface.co/join)
+   a. Create a HuggingFace account [here](https://huggingface.co/join), if you don't have one yet.
 
    b. Accept the Mistral-7B model terms & conditions [here](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
 
    c. Create an access token [here](https://huggingface.co/settings/tokens)
 
-   d. Create a `config.yaml` file (you can copy from `config_example.yaml`):
+   d. Create a `config.yaml` file (you can copy from `config_example.yaml`) and add your HuggingFace token:
    ```yaml
    HUGGING_FACE_HUB_TOKEN: your_token_here
    ```
@@ -42,11 +42,11 @@ The system has the following features:
    OPENAI_API_KEY=your-api-key-here
    ```
 
-4. If no API key is provided, the system will automatically download and use `Mistral-7B-Instruct-v0.2` for text generation when using the local model. No additional configuration is needed.
+   If no API key is provided, the system will automatically download and use `Mistral-7B-Instruct-v0.2` for text generation when using the local model. No additional configuration is needed.
 
 ## 1. Getting Started
 
-You can use this solution in three ways:
+You can launch this solution in three ways:
 
 ### 1. Using the Complete REST API
@@ -58,11 +58,11 @@ python main.py
 
 The API will be available at `http://localhost:8000`. You can then use the API endpoints as described in the API Endpoints section below.
 
-### 2. Using Individual Components via Command Line
+### 2. Using Individual Python Components via Command Line
 
 #### Process PDFs
 
-Process a PDF file and save the chunks to a JSON file:
+To process a PDF file and save the chunks to a JSON file, run:
 
 ```bash
 # Process a single PDF
@@ -75,23 +75,38 @@ python pdf_processor.py --input path/to/pdf/directory --output chunks.json
 python pdf_processor.py --input https://example.com/document.pdf --output chunks.json
 # sample pdf: https://arxiv.org/pdf/2203.06605
 ```
+
+#### Process Websites with Trafilatura
+
+Process a single website and save the content to a JSON file:
+```bash
+python web_processor.py --input https://example.com --output docs/web_content.json
+```
+
+Or, process multiple URLs from a file and save them into a single JSON file:
+
+```bash
+python web_processor.py --input urls.txt --output docs/web_content.json
+```
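The exact schema of the generated `docs/web_content.json` isn't shown in this diff; as a rough sketch, with a hypothetical `save_web_chunks` helper and hypothetical field names, the output file might be assembled like this:

```python
import json

def save_web_chunks(pages, path):
    """Write extracted page text into a chunks-style JSON file.

    `pages` is a list of (url, text) pairs. The field names below are
    illustrative only; the real web_processor.py may use a different schema.
    """
    chunks = [
        {"source": url, "text": text, "type": "web"}
        for url, text in pages
        if text  # skip pages where extraction returned nothing
    ]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False, indent=2)
    return chunks
```

Each entry keeps the source URL alongside the extracted text, so the vector store can later attribute retrieved context to its original page.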
 
 #### Manage Vector Store
 
-Add documents to the vector store and query them:
+To add documents to the vector store and query them, run:
 
 ```bash
-# Add documents from a chunks file
+# Add documents from a chunks file (by default, to the pdf_collection)
 python store.py --add chunks.json
+# For websites, use the --add-web flag
+python store.py --add-web docs/web_content.json
 
-# Query the vector store directly, or with local_rag_agent.py
+# Query the vector store directly, across both the pdf and web collections;
+# the LLM decides which collection to query based on your input
 python store.py --query "your search query"
 python local_rag_agent.py --query "your search query"
 ```
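For intuition, the vector-store query above boils down to ranking stored chunk embeddings by similarity to the query embedding. A minimal standard-library sketch of that ranking step (illustrative only; ChromaDB handles this internally, and real embeddings come from the embedding model):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(query_embedding, stored, k=2):
    """stored: list of (chunk_id, embedding) pairs; return the k closest ids."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]
```

The retrieved chunk ids map back to the stored text, which is then passed to the agent as context.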
 
 #### Use RAG Agent
 
-Query documents using either OpenAI or a local model:
+To query documents using either OpenAI or a local model, run:
 
 ```bash
 # Using OpenAI (requires API key in .env)
@@ -103,7 +118,7 @@ python local_rag_agent.py --query "Can you explain the DaGAN Approach proposed i
 
 ### 3. Complete Pipeline Example
 
-Here's how to process a document and query it using the local model:
+First, we process a document and add it to the vector store; then we query the knowledge base to see the full RAG pipeline in action with the local model.
 
 ```bash
 # 1. Process the PDF
@@ -114,15 +129,12 @@ python store.py --add chunks.json
 
 # 3. Query using local model
 python local_rag_agent.py --query "Can you explain the DaGAN Approach proposed in the Depth-Aware Generative Adversarial Network for Talking Head Video Generation article?"
-```
 
-Or using OpenAI (requires API key):
-```bash
-# Same steps 1 and 2 as above, then:
+# Or, using OpenAI (requires API key):
 python rag_agent.py --query "Can you explain the DaGAN Approach proposed in the Depth-Aware Generative Adversarial Network for Talking Head Video Generation article?"
 ```
 
-## API Endpoints
+## Annex: API Endpoints
 
 ### Upload PDF
 
@@ -133,7 +145,7 @@ Content-Type: multipart/form-data
 file: <pdf-file>
 ```
 
-Uploads and processes a PDF file, storing its contents in the vector database.
+This endpoint uploads and processes a PDF file, storing its contents in the vector database.
 
 ### Query
 
@@ -146,7 +158,7 @@ Content-Type: application/json
 }
 ```
 
-Processes a query through the agentic RAG pipeline and returns a response with context.
+This endpoint processes a query through the agentic RAG pipeline and returns a response with context.
 
 ## Annex: Architecture
 
@@ -164,7 +176,7 @@ The RAG Agent flow is the following:
 1. Analyzes query type
 2. Try to find relevant PDF context, regardless of query type
 3. If PDF context is found, use it to generate a response.
-4. If no PDF context is found OR if it's a general knowledge query, use the pre-trainedLLM directly
+4. If no PDF context is found OR if it's a general knowledge query, use the pre-trained LLM directly
 5. Fall back to a "no information" response only in edge cases.
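Steps 3-5 of the flow above can be condensed into a single decision function. The sketch below is illustrative, with hypothetical names, not the actual agent code:

```python
def route_query(retrieved_chunks, is_general_knowledge):
    """Condensed sketch of steps 3-5 of the agent flow.

    retrieved_chunks: context found in the vector store (may be empty)
    is_general_knowledge: result of the query-type analysis in step 1
    """
    if retrieved_chunks:
        return "answer_with_context"  # step 3: ground the answer in retrieved context
    if is_general_knowledge:
        return "answer_with_llm"      # step 4: fall back to the pre-trained LLM
    return "no_information"           # step 5: edge case only
```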

## Hardware Requirements
