Commit 79c92e8

Merge pull request game-by-virtuals#101 from MichielMAnalytics/mvoortman/RAGPlugin
added RAG (both populate and search + tg bot example)
2 parents 49c4892 + 38bd7df commit 79c92e8

13 files changed, with 2071 additions and 1 deletion.
.github/workflows/validate-new-plugin-metadata.yml

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ jobs:
         uses: actions/checkout@v4
         with:
           fetch-depth: 0
-          ref: ${{ github.event.pull_request.head.ref }}
+          ref: ${{ github.event.pull_request.head.sha }}
 
       - name: Identify New Plugin Directories
         id: find_new_plugins

plugins/RAGPinecone/README.md

Lines changed: 219 additions & 0 deletions
@@ -0,0 +1,219 @@
# RAGPinecone Plugin for GAME SDK

A Retrieval Augmented Generation (RAG) plugin using Pinecone as the vector database for the GAME SDK.

## Features

- Query a knowledge base for relevant context
- Advanced hybrid search (vector + BM25) for better retrieval
- AI-generated answers based on retrieved documents
- Add documents to the knowledge base
- Delete documents from the knowledge base
- Chunk documents for better retrieval
- Process documents from a folder automatically
- Integrate with Telegram bot for RAG-powered conversations

## Installation

### From Source

1. Clone the repository or navigate to the plugin directory:
   ```bash
   cd game-python/plugins/RAGPinecone
   ```

2. Install the plugin in development mode:
   ```bash
   pip install -e .
   ```

This will install all required dependencies and make the plugin available in your environment.

## Setup and Configuration

1. Set the following environment variables:
   - `PINECONE_API_KEY`: Your Pinecone API key
   - `OPENAI_API_KEY`: Your OpenAI API key (for embeddings)
   - `GAME_API_KEY`: Your GAME API key
   - `TELEGRAM_BOT_TOKEN`: Your Telegram bot token (if using with Telegram)

2. Import and initialize the plugin to use in your agent:

```python
from rag_pinecone_gamesdk.rag_pinecone_plugin import RAGPineconePlugin
from rag_pinecone_gamesdk.rag_pinecone_game_functions import query_knowledge_fn, add_document_fn

# Initialize the plugin
rag_plugin = RAGPineconePlugin(
    pinecone_api_key="your-pinecone-api-key",
    openai_api_key="your-openai-api-key",
    index_name="your-index-name",
    namespace="your-namespace"
)

# Add the functions to your agent's action space
agent_action_space = [
    query_knowledge_fn(rag_plugin),
    add_document_fn(rag_plugin),
    # ... other functions
]
```
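
Since the constructor takes the keys as plain arguments, one option is to read them from the environment variables listed in step 1 and fail early when one is missing. The following is a minimal sketch under that assumption; `load_rag_settings` and `REQUIRED_VARS` are hypothetical names for illustration and are not part of the plugin.

```python
import os

# Variables the plugin constructor needs, per the list in step 1.
REQUIRED_VARS = ("PINECONE_API_KEY", "OPENAI_API_KEY")


def load_rag_settings(env=None):
    """Return the required API keys, raising if any is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values could then be passed on, for example `pinecone_api_key=settings["PINECONE_API_KEY"]`.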

## Available Functions

### Basic RAG Functions

1. `query_knowledge(query: str, num_results: int = 3)` - Query the knowledge base for relevant context
2. `add_document(content: str, metadata: dict = None)` - Add a document to the knowledge base

### Advanced RAG Functions

1. `advanced_query_knowledge(query: str)` - Query the knowledge base using hybrid retrieval (vector + BM25) and get an AI-generated answer
2. `get_relevant_documents(query: str)` - Get relevant documents using hybrid retrieval without generating an answer

Example usage of advanced functions:

```python
from rag_pinecone_gamesdk.search_rag import RAGSearcher
from rag_pinecone_gamesdk.rag_pinecone_game_functions import advanced_query_knowledge_fn, get_relevant_documents_fn

# Initialize the RAG searcher
rag_searcher = RAGSearcher(
    pinecone_api_key="your-pinecone-api-key",
    openai_api_key="your-openai-api-key",
    index_name="your-index-name",
    namespace="your-namespace"
)

# Add the advanced functions to your agent's action space
agent_action_space = [
    advanced_query_knowledge_fn(rag_searcher),
    get_relevant_documents_fn(rag_searcher),
    # ... other functions
]
```

## Populating the Knowledge Base

### Using the Documents Folder

The easiest way to populate the knowledge base is to place your documents in the `Documents` folder and run the provided script:

```bash
cd game-python/plugins/RAGPinecone
python examples/populate_knowledge_base.py
```

This will process all supported files in the Documents folder and add them to the knowledge base.

Supported file types:
- `.txt` - Text files
- `.pdf` - PDF documents
- `.docx` - Word documents
- `.doc` - Word documents
- `.csv` - CSV files
- `.md` - Markdown files
- `.html` - HTML files
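
To preview which files in a folder would qualify before running the script, the extension list above can be applied with `pathlib`. This is only a sketch: `find_supported_files` is a hypothetical helper, and the recursive, case-insensitive matching is an assumption rather than documented plugin behavior.

```python
from pathlib import Path

# Extensions copied from the supported file types list.
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".docx", ".doc", ".csv", ".md", ".html"}


def find_supported_files(folder):
    """Return sorted paths of files under `folder` with a supported extension."""
    root = Path(folder)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```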

### Using the API

You can also populate the knowledge base programmatically:

```python
from rag_pinecone_gamesdk.populate_rag import RAGPopulator

# Initialize the populator
populator = RAGPopulator(
    pinecone_api_key="your-pinecone-api-key",
    openai_api_key="your-openai-api-key",
    index_name="your-index-name",
    namespace="your-namespace"
)

# Add a document
content = "Your document content here"
metadata = {
    "title": "Document Title",
    "author": "Author Name",
    "source": "Source Name",
}

status, message, results = populator.add_document(content, metadata)
print(f"Status: {status}")
print(f"Message: {message}")
print(f"Results: {results}")

# Process all documents in a folder
status, message, results = populator.process_documents_folder()
print(f"Status: {status}")
print(f"Message: {message}")
print(f"Processed {results.get('total_files', 0)} files, {results.get('successful_files', 0)} successful")
```
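
The result dictionary from `process_documents_folder()` can be condensed into a short report. The sketch below assumes the dictionary carries the `total_files` and `successful_files` keys used above, plus a `results` list whose entries have `file_path` and `status` keys, as read by the example population script in this commit. `summarize_results` is a hypothetical helper, and the value types are assumptions.

```python
import os


def summarize_results(results):
    """Build a short report from a process_documents_folder()-style result dict."""
    total = results.get("total_files", 0)
    ok = results.get("successful_files", 0)
    lines = [f"{ok}/{total} files processed successfully"]
    # One line per file, using only the file name rather than the full path.
    for item in results.get("results", []):
        name = os.path.basename(item.get("file_path", "Unknown file"))
        lines.append(f"- {name}: {item.get('status', 'Unknown status')}")
    return "\n".join(lines)
```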

## Testing the Advanced Search

You can test the advanced search functionality using the provided example script:

```bash
cd game-python/plugins/RAGPinecone
python examples/test_advanced_search.py
```

This will run a series of test queries using the advanced hybrid retrieval system.

## Integration with Telegram

See the `examples/test_rag_pinecone_telegram.py` file for an example of how to integrate the RAGPinecone plugin with a Telegram bot.

To run the Telegram bot with advanced RAG capabilities:

```bash
cd game-python/plugins/RAGPinecone
python examples/test_rag_pinecone_telegram.py
```

## Advanced Usage

### Hybrid Retrieval

The advanced search functionality uses a hybrid retrieval approach that combines:

1. **Vector Search**: Uses embeddings to find semantically similar documents
2. **BM25 Search**: Uses keyword matching to find documents with relevant terms

This hybrid approach often provides better results than either method alone, especially for complex queries.
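
How the plugin merges the two result lists is not described here. One common way to combine rankings from different retrievers is reciprocal rank fusion (RRF), sketched below purely as an illustration of the idea and not as the plugin's implementation. The function name, the constant `k=60`, and the document IDs are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs; a higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector-search order
bm25_hits = ["doc_b", "doc_d", "doc_a"]    # hypothetical BM25 order
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

In this toy input, `doc_b` ends up first because it sits near the top of both lists, while documents found by only one retriever fall to the bottom.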

### Custom Document Processing

You can customize how documents are processed by extending the `RAGPopulator` class:

```python
from rag_pinecone_gamesdk.populate_rag import RAGPopulator

class CustomRAGPopulator(RAGPopulator):
    def chunk_document(self, content, metadata):
        # Custom chunking logic
        # ...
        return chunked_docs
```
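
The body of `chunk_document` is left open above, and its expected return type is not documented here. As a starting point for the text-splitting part only, the following standalone sketch cuts a string into fixed-size character windows that overlap. `split_with_overlap`, `chunk_size`, and `overlap` are hypothetical names, and wrapping the pieces into whatever objects the populator expects is left out.

```python
def split_with_overlap(text, chunk_size=500, overlap=50):
    """Split `text` into character windows of `chunk_size` that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary at least partly visible in both neighboring chunks, at the cost of storing some text twice.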

### Custom Embedding Models

You can use different embedding models by specifying the `embedding_model` parameter:

```python
rag_plugin = RAGPineconePlugin(
    embedding_model="sentence-transformers/all-mpnet-base-v2"
)
```

## Requirements

- Python 3.9+
- Pinecone account
- OpenAI API key
- GAME SDK
- langchain
- langchain_community
- langchain_pinecone
- langchain_openai
Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
import os
import logging
import tempfile
import requests
import re
from dotenv import load_dotenv
import gdown

from rag_pinecone_gamesdk.populate_rag import RAGPopulator
from rag_pinecone_gamesdk import DEFAULT_INDEX_NAME, DEFAULT_NAMESPACE

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

def download_from_google_drive(folder_url, download_folder):
    """
    Download all files from a Google Drive folder

    Args:
        folder_url: URL of the Google Drive folder
        download_folder: Local folder to download files to

    Returns:
        List of downloaded file paths
    """
    logger.info(f"Downloading files from Google Drive folder: {folder_url}")

    # Extract folder ID from URL
    folder_id_match = re.search(r'folders/([a-zA-Z0-9_-]+)', folder_url)
    if not folder_id_match:
        logger.error(f"Could not extract folder ID from URL: {folder_url}")
        return []

    folder_id = folder_id_match.group(1)
    logger.info(f"Folder ID: {folder_id}")

    # Create download folder if it doesn't exist
    os.makedirs(download_folder, exist_ok=True)

    # Download all files in the folder
    try:
        # Use gdown to download all files in the folder
        downloaded_files = gdown.download_folder(
            id=folder_id,
            output=download_folder,
            quiet=False,
            use_cookies=False
        )

        if not downloaded_files:
            logger.warning("No files were downloaded from Google Drive")
            return []

        logger.info(f"Downloaded {len(downloaded_files)} files from Google Drive")
        return downloaded_files

    except Exception as e:
        logger.error(f"Error downloading files from Google Drive: {str(e)}")
        return []

def main():
    # Load environment variables
    load_dotenv()

    # Check for required environment variables
    pinecone_api_key = os.environ.get("PINECONE_API_KEY")
    openai_api_key = os.environ.get("OPENAI_API_KEY")

    if not pinecone_api_key:
        logger.error("PINECONE_API_KEY environment variable is not set")
        return

    if not openai_api_key:
        logger.error("OPENAI_API_KEY environment variable is not set")
        return

    # Google Drive folder URL
    google_drive_url = "https://drive.google.com/drive/folders/1dKYDQxenDkthF0MPr-KOsdPNqEmrAq1c?usp=sharing"

    # Create a temporary directory for downloaded files
    with tempfile.TemporaryDirectory() as temp_dir:
        logger.info(f"Created temporary directory for downloaded files: {temp_dir}")

        # Download files from Google Drive
        downloaded_files = download_from_google_drive(google_drive_url, temp_dir)

        if not downloaded_files:
            logger.error("No files were downloaded from Google Drive. Exiting.")
            return

        # Get the Documents folder path for local processing
        documents_folder = os.path.join(
            os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
            "Documents"
        )

        # Ensure the Documents folder exists
        if not os.path.exists(documents_folder):
            os.makedirs(documents_folder)
            logger.info(f"Created Documents folder at: {documents_folder}")

        # Initialize the RAGPopulator
        logger.info("Initializing RAGPopulator...")
        populator = RAGPopulator(
            pinecone_api_key=pinecone_api_key,
            openai_api_key=openai_api_key,
            index_name=DEFAULT_INDEX_NAME,
            namespace=DEFAULT_NAMESPACE,
            documents_folder=temp_dir,  # Use the temp directory with downloaded files
        )

        # Process all documents in the temporary folder
        logger.info(f"Processing downloaded documents from: {temp_dir}")
        status, message, results = populator.process_documents_folder()

        # Log the results
        logger.info(f"Status: {status}")
        logger.info(f"Message: {message}")
        logger.info(f"Processed {results.get('total_files', 0)} files, {results.get('successful_files', 0)} successful")

        # Get all document IDs
        ids = populator.fetch_all_ids()
        logger.info(f"Total vectors in database: {len(ids)}")

        # Print detailed results for each file
        if 'results' in results:
            logger.info("\nDetailed results:")
            for result in results['results']:
                file_path = result.get('file_path', 'Unknown file')
                status = result.get('status', 'Unknown status')
                message = result.get('message', 'No message')
                logger.info(f"File: {os.path.basename(file_path)}")
                logger.info(f"Status: {status}")
                logger.info(f"Message: {message}")
                logger.info("---")

if __name__ == "__main__":
    main()
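
The folder-ID extraction in `download_from_google_drive` can be checked without network access by isolating the regular expression. The sketch below reuses the same pattern from the script; `extract_folder_id` and `FOLDER_ID_PATTERN` are hypothetical names introduced for illustration.

```python
import re

# Same pattern as in download_from_google_drive: the ID follows "folders/".
FOLDER_ID_PATTERN = re.compile(r"folders/([a-zA-Z0-9_-]+)")


def extract_folder_id(folder_url):
    """Return the Drive folder ID from a URL, or None if none is found."""
    match = FOLDER_ID_PATTERN.search(folder_url)
    return match.group(1) if match else None
```

Because `?` is not in the character class, a trailing query string such as `?usp=sharing` is left out of the captured ID.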
