2 changes: 1 addition & 1 deletion .github/workflows/validate-new-plugin-metadata.yml
@@ -16,7 +16,7 @@ jobs:
         uses: actions/checkout@v4
         with:
           fetch-depth: 0
-          ref: ${{ github.event.pull_request.head.ref }}
+          ref: ${{ github.event.pull_request.head.sha }}
 
       - name: Identify New Plugin Directories
         id: find_new_plugins
219 changes: 219 additions & 0 deletions plugins/RAGPinecone/README.md
@@ -0,0 +1,219 @@
# RAGPinecone Plugin for GAME SDK

A Retrieval Augmented Generation (RAG) plugin using Pinecone as the vector database for the GAME SDK.

## Features

- Query a knowledge base for relevant context
- Advanced hybrid search (vector + BM25) for better retrieval
- AI-generated answers based on retrieved documents
- Add documents to the knowledge base
- Delete documents from the knowledge base
- Chunk documents for better retrieval
- Process documents from a folder automatically
- Integrate with Telegram bot for RAG-powered conversations

## Installation

### From Source

1. Clone the repository or navigate to the plugin directory:
```bash
cd game-python/plugins/RAGPinecone
```

2. Install the plugin in development mode:
```bash
pip install -e .
```

This will install all required dependencies and make the plugin available in your environment.

## Setup and Configuration

1. Set the following environment variables:
   - `PINECONE_API_KEY`: Your Pinecone API key
   - `OPENAI_API_KEY`: Your OpenAI API key (for embeddings)
   - `GAME_API_KEY`: Your GAME API key
   - `TELEGRAM_BOT_TOKEN`: Your Telegram bot token (if using with Telegram)

2. Import and initialize the plugin to use in your agent:

```python
from rag_pinecone_gamesdk.rag_pinecone_plugin import RAGPineconePlugin
from rag_pinecone_gamesdk.rag_pinecone_game_functions import query_knowledge_fn, add_document_fn

# Initialize the plugin
rag_plugin = RAGPineconePlugin(
    pinecone_api_key="your-pinecone-api-key",
    openai_api_key="your-openai-api-key",
    index_name="your-index-name",
    namespace="your-namespace"
)

# Add the functions to your agent's action space
agent_action_space = [
    query_knowledge_fn(rag_plugin),
    add_document_fn(rag_plugin),
    # ... other functions
]
```
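Step 1 stores the keys in environment variables, while the snippet above passes string literals. The following is a minimal sketch of reading the keys from the environment and failing early when one is missing. `load_required_env` is a hypothetical helper written for this illustration; it is not part of the plugin.

```python
import os


def load_required_env(names, environ=None):
    """Return {name: value} for every required variable, or raise if any is missing.

    `environ` defaults to os.environ; passing a plain dict makes the helper
    easy to exercise without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    # Treat unset and empty variables the same way.
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}
```

The returned values could then be passed to the constructor, for example `pinecone_api_key=keys["PINECONE_API_KEY"]`, where `keys` is the dictionary returned by the helper.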

## Available Functions

### Basic RAG Functions

1. `query_knowledge(query: str, num_results: int = 3)` - Query the knowledge base for relevant context
2. `add_document(content: str, metadata: dict = None)` - Add a document to the knowledge base

### Advanced RAG Functions

1. `advanced_query_knowledge(query: str)` - Query the knowledge base using hybrid retrieval (vector + BM25) and get an AI-generated answer
2. `get_relevant_documents(query: str)` - Get relevant documents using hybrid retrieval without generating an answer

Example usage of advanced functions:

```python
from rag_pinecone_gamesdk.search_rag import RAGSearcher
from rag_pinecone_gamesdk.rag_pinecone_game_functions import advanced_query_knowledge_fn, get_relevant_documents_fn

# Initialize the RAG searcher
rag_searcher = RAGSearcher(
    pinecone_api_key="your-pinecone-api-key",
    openai_api_key="your-openai-api-key",
    index_name="your-index-name",
    namespace="your-namespace"
)

# Add the advanced functions to your agent's action space
agent_action_space = [
    advanced_query_knowledge_fn(rag_searcher),
    get_relevant_documents_fn(rag_searcher),
    # ... other functions
]
```

## Populating the Knowledge Base

### Using the Documents Folder

The easiest way to populate the knowledge base is to place your documents in the `Documents` folder and run the provided script:

```bash
cd game-python/plugins/RAGPinecone
python examples/populate_knowledge_base.py
```

This will process all supported files in the Documents folder and add them to the knowledge base.

Supported file types:
- `.txt` - Text files
- `.pdf` - PDF documents
- `.docx` - Word documents
- `.doc` - Word documents
- `.csv` - CSV files
- `.md` - Markdown files
- `.html` - HTML files
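To check in advance which files in a folder would qualify, the extension list above can be applied directly. This is only a sketch: `find_supported_files` is a hypothetical helper, and whether the plugin itself recurses into subfolders or matches extensions case-insensitively is an assumption made here, not something the README states.

```python
from pathlib import Path

# Extensions copied from the supported file types list.
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".docx", ".doc", ".csv", ".md", ".html"}


def find_supported_files(folder):
    """Return the sorted paths under `folder` whose extension is supported."""
    root = Path(folder)
    return sorted(
        path
        for path in root.rglob("*")  # assumption: search subfolders too
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```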

### Using the API

You can also populate the knowledge base programmatically:

```python
from rag_pinecone_gamesdk.populate_rag import RAGPopulator

# Initialize the populator
populator = RAGPopulator(
    pinecone_api_key="your-pinecone-api-key",
    openai_api_key="your-openai-api-key",
    index_name="your-index-name",
    namespace="your-namespace"
)

# Add a document
content = "Your document content here"
metadata = {
    "title": "Document Title",
    "author": "Author Name",
    "source": "Source Name",
}

status, message, results = populator.add_document(content, metadata)
print(f"Status: {status}")
print(f"Message: {message}")
print(f"Results: {results}")

# Process all documents in a folder
status, message, results = populator.process_documents_folder()
print(f"Status: {status}")
print(f"Message: {message}")
print(f"Processed {results.get('total_files', 0)} files, {results.get('successful_files', 0)} successful")
```
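The `results` dictionary can be turned into a short report. The sketch below assumes the shape that `examples/populate_knowledge_base.py` reads: the `total_files` and `successful_files` counters, plus an optional `results` list whose entries carry `file_path`, `status`, and `message`. `summarize_results` is a hypothetical helper, and it makes no assumption about which `status` values mean success.

```python
import os


def summarize_results(results):
    """Build human-readable lines from a results dict shaped as described above."""
    lines = [
        f"Processed {results.get('total_files', 0)} files, "
        f"{results.get('successful_files', 0)} successful"
    ]
    # Per-file entries are optional, so fall back to an empty list.
    for item in results.get("results", []):
        name = os.path.basename(item.get("file_path", "Unknown file"))
        status = item.get("status", "Unknown status")
        message = item.get("message", "No message")
        lines.append(f"{name}: {status} ({message})")
    return lines
```

Calling `print("\n".join(summarize_results(results)))` after `process_documents_folder()` would print one line per processed file.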

## Testing the Advanced Search

You can test the advanced search functionality using the provided example script:

```bash
cd game-python/plugins/RAGPinecone
python examples/test_advanced_search.py
```

This will run a series of test queries using the advanced hybrid retrieval system.

## Integration with Telegram

See the `examples/test_rag_pinecone_telegram.py` file for an example of how to integrate the RAGPinecone plugin with a Telegram bot.

To run the Telegram bot with advanced RAG capabilities:

```bash
cd game-python/plugins/RAGPinecone
python examples/test_rag_pinecone_telegram.py
```

## Advanced Usage

### Hybrid Retrieval

The advanced search functionality uses a hybrid retrieval approach that combines:

1. **Vector Search**: Uses embeddings to find semantically similar documents
2. **BM25 Search**: Uses keyword matching to find documents with relevant terms

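Vector search catches paraphrases and related concepts, while BM25 catches exact terms such as names and identifiers, so the combination often retrieves better results than either method alone, especially for complex queries.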
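How the plugin merges the two result lists is not described here. One common way to combine rankings from different retrievers is reciprocal rank fusion (RRF), sketched below as an illustration of the idea rather than as the plugin's implementation. The document IDs and the constant `k=60` are assumptions chosen for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for a deterministic result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector-search order
bm25_hits = ["doc_b", "doc_d", "doc_a"]    # hypothetical BM25 order
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists (`doc_b`, `doc_a`) rise to the top, while documents found by only one retriever are still kept.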

### Custom Document Processing

You can customize how documents are processed by extending the `RAGPopulator` class:

```python
from rag_pinecone_gamesdk.populate_rag import RAGPopulator

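class CustomRAGPopulator(RAGPopulator):
    def chunk_document(self, content, metadata):
        # Custom chunking logic goes here. As a placeholder, start from
        # the default chunking inherited from RAGPopulator.
        chunked_docs = super().chunk_document(content, metadata)
        # ...
        return chunked_docs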
```
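As one example of what the custom logic might do, the sketch below splits plain text into fixed-size character chunks with a small overlap, so that sentences cut at a boundary still appear whole in one chunk. `chunk_text` and its default sizes are hypothetical, and the structure that `chunk_document` is expected to return is not shown in this README, so the helper works on strings only.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

A larger overlap keeps more context across chunk boundaries at the cost of storing more duplicated text.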

### Custom Embedding Models

You can use different embedding models by specifying the `embedding_model` parameter:

```python
rag_plugin = RAGPineconePlugin(
    # API keys, index name, and namespace omitted for brevity
    embedding_model="sentence-transformers/all-mpnet-base-v2"
)
```

## Requirements

- Python 3.9+
- Pinecone account
- OpenAI API key
- GAME SDK
- langchain
- langchain_community
- langchain_pinecone
- langchain_openai
142 changes: 142 additions & 0 deletions plugins/RAGPinecone/examples/populate_knowledge_base.py
@@ -0,0 +1,142 @@
import os
import logging
import tempfile
import requests
import re
from dotenv import load_dotenv
import gdown

from rag_pinecone_gamesdk.populate_rag import RAGPopulator
from rag_pinecone_gamesdk import DEFAULT_INDEX_NAME, DEFAULT_NAMESPACE

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

def download_from_google_drive(folder_url, download_folder):
    """
    Download all files from a Google Drive folder

    Args:
        folder_url: URL of the Google Drive folder
        download_folder: Local folder to download files to

    Returns:
        List of downloaded file paths
    """
    logger.info(f"Downloading files from Google Drive folder: {folder_url}")

    # Extract folder ID from URL
    folder_id_match = re.search(r'folders/([a-zA-Z0-9_-]+)', folder_url)
    if not folder_id_match:
        logger.error(f"Could not extract folder ID from URL: {folder_url}")
        return []

    folder_id = folder_id_match.group(1)
    logger.info(f"Folder ID: {folder_id}")

    # Create download folder if it doesn't exist
    os.makedirs(download_folder, exist_ok=True)

    # Download all files in the folder
    try:
        # Use gdown to download all files in the folder
        downloaded_files = gdown.download_folder(
            id=folder_id,
            output=download_folder,
            quiet=False,
            use_cookies=False
        )

        if not downloaded_files:
            logger.warning("No files were downloaded from Google Drive")
            return []

        logger.info(f"Downloaded {len(downloaded_files)} files from Google Drive")
        return downloaded_files

    except Exception as e:
        logger.error(f"Error downloading files from Google Drive: {str(e)}")
        return []

def main():
    # Load environment variables
    load_dotenv()

    # Check for required environment variables
    pinecone_api_key = os.environ.get("PINECONE_API_KEY")
    openai_api_key = os.environ.get("OPENAI_API_KEY")

    if not pinecone_api_key:
        logger.error("PINECONE_API_KEY environment variable is not set")
        return

    if not openai_api_key:
        logger.error("OPENAI_API_KEY environment variable is not set")
        return

    # Google Drive folder URL
    google_drive_url = "https://drive.google.com/drive/folders/1dKYDQxenDkthF0MPr-KOsdPNqEmrAq1c?usp=sharing"

    # Create a temporary directory for downloaded files
    with tempfile.TemporaryDirectory() as temp_dir:
        logger.info(f"Created temporary directory for downloaded files: {temp_dir}")

        # Download files from Google Drive
        downloaded_files = download_from_google_drive(google_drive_url, temp_dir)

        if not downloaded_files:
            logger.error("No files were downloaded from Google Drive. Exiting.")
            return

        # Get the Documents folder path for local processing
        documents_folder = os.path.join(
            os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
            "Documents"
        )

        # Ensure the Documents folder exists
        if not os.path.exists(documents_folder):
            os.makedirs(documents_folder)
            logger.info(f"Created Documents folder at: {documents_folder}")

        # Initialize the RAGPopulator
        logger.info("Initializing RAGPopulator...")
        populator = RAGPopulator(
            pinecone_api_key=pinecone_api_key,
            openai_api_key=openai_api_key,
            index_name=DEFAULT_INDEX_NAME,
            namespace=DEFAULT_NAMESPACE,
            documents_folder=temp_dir,  # Use the temp directory with downloaded files
        )

        # Process all documents in the temporary folder
        logger.info(f"Processing downloaded documents from: {temp_dir}")
        status, message, results = populator.process_documents_folder()

        # Log the results
        logger.info(f"Status: {status}")
        logger.info(f"Message: {message}")
        logger.info(f"Processed {results.get('total_files', 0)} files, {results.get('successful_files', 0)} successful")

        # Get all document IDs
        ids = populator.fetch_all_ids()
        logger.info(f"Total vectors in database: {len(ids)}")

        # Print detailed results for each file
        if 'results' in results:
            logger.info("\nDetailed results:")
            for result in results['results']:
                file_path = result.get('file_path', 'Unknown file')
                status = result.get('status', 'Unknown status')
                message = result.get('message', 'No message')
                logger.info(f"File: {os.path.basename(file_path)}")
                logger.info(f"Status: {status}")
                logger.info(f"Message: {message}")
                logger.info("---")

if __name__ == "__main__":
    main()