# Context Retrieval Persona #13
**Open**: jonahjung22 wants to merge 24 commits into `jupyter-ai-contrib:main` from `jonahjung22:data_science_persona`.
## Commits (24)
All 24 commits are by jonahjung22:

1. `57ed47a` New persona integrating jupyter ai tools
2. `5caff8f` Context Retrieval Persona
3. `461f281` test context file
4. `74ed255` updated toml and wrapper tool
5. `c2d52ed` Increased RAG chunks, specific md file naming, logging
6. `c89afb3` Removed unnecessary files
7. `cc2b99d` updated the names of the files; updated README with persona new capab…
8. `2a24e18` modified toml
9. `20262e4` building out the context persona using pocketflow
10. `772d548` new method for rag based approach using pocketflow
11. `55157b7` Merge branch 'main' into data_science_persona
12. `95b68f8` added test notebook
13. `d910e61` added greetings
14. `82834fc` Separating 1 persona for each PR
15. `e5e188b` cleaned up some code
16. `ad6a220` updated README
17. `a882287` added test files
18. `1ebe1f2` removing some lines
19. `7488430` updated persona code and removed unnecessary components
20. `af883f9` remove unnecessary comments
21. `cb68c1f` updated dependencies
22. `46d26cd` deleted unnecessary folder
23. `dd81447` Changes to the whole RAG structure implemented
24. `b7f6eac` removed unnecessary file
## Files changed

Submodule `jupyter-ai-personas` deleted from `4af5de`.
### `jupyter_ai_personas/context_retrieval_persona/README.md` (214 additions, 0 deletions)
# Context Retrieval Persona

## Overview

The Context Retrieval Persona analyzes your data science notebooks and finds relevant resources from the Python Data Science Handbook using RAG (Retrieval-Augmented Generation). It employs a three-agent system to provide comprehensive analysis and actionable recommendations.

## Features

- **Intelligent Notebook Analysis**: Extracts libraries, analysis stage, domain, and objectives from your notebooks
- **Full Notebook RAG Search**: Returns complete relevant notebooks instead of fragments for comprehensive context
- **Handbook-Only Search**: Avoids redundant searching by focusing on external handbook content only
- **Multi-Agent Coordination**: NotebookAnalyzer, KnowledgeSearcher, and MarkdownGenerator working together
- **Comprehensive Markdown Reports**: Detailed reports with code examples, explanations, and next steps
- **Optimized Search**: 1-2 complete notebooks per query with clean terminal logging
- **Automatic Report Generation**: Creates `repo_context.md` with comprehensive analysis

## Architecture

### Three-Agent System

1. **NotebookAnalyzer**: Extracts structured context from your notebook
   - Uses the `extract_rag_context` tool to read notebook content
   - Identifies libraries (pandas, numpy, sklearn, matplotlib, etc.)
   - Determines the analysis stage (data_loading, eda, preprocessing, modeling, evaluation, visualization)
   - Outputs structured JSON with path, libraries, stage, domain, and objectives

2. **KnowledgeSearcher**: Performs targeted handbook-only RAG searches
   - Generates 4-5 targeted search queries based on the notebook analysis
   - Uses `search_handbook_only` to find relevant complete notebooks
   - Each search returns the 1-2 most relevant notebooks (not fragments)
   - Provides comprehensive handbook content to MarkdownGenerator

3. **MarkdownGenerator**: Creates detailed markdown reports
   - Synthesizes the notebook analysis with RAG search results
   - Includes substantial content from the retrieved handbook notebooks
   - Creates cross-references between your work and handbook examples
   - Saves comprehensive reports as `repo_context.md`

The data flow between the three agents is sketched below.
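As an illustration only: the real persona implements each stage as an agno agent, but the three-stage flow can be sketched as plain functions. Every name and return value below is a hypothetical stand-in.

```python
# Schematic sketch of the three-agent pipeline. The real persona implements
# each stage as an agno agent; all names and bodies here are hypothetical.

def notebook_analyzer(notebook_path: str) -> dict:
    """Stage 1: extract structured context (libraries, stage, objectives)."""
    # The real agent calls the extract_rag_context tool, then has the LLM
    # summarize the result as structured JSON.
    return {
        "path": notebook_path,
        "libraries": ["pandas", "sklearn"],
        "stage": "modeling",
        "objectives": ["classification"],
    }

def knowledge_searcher(analysis: dict) -> list:
    """Stage 2: run 4-5 targeted handbook-only RAG searches."""
    queries = [f"{lib} {analysis['stage']}" for lib in analysis["libraries"]]
    # The real agent calls search_handbook_only(query) for each query and
    # gets back the 1-2 most relevant complete handbook notebooks.
    return [f"[handbook notebooks for: {q}]" for q in queries]

def markdown_generator(analysis: dict, handbook_hits: list) -> str:
    """Stage 3: synthesize analysis and search results into repo_context.md."""
    report = ["# Executive Summary",
              f"Libraries in play: {', '.join(analysis['libraries'])}"]
    report += handbook_hits
    return "\n\n".join(report)

analysis = notebook_analyzer("my_notebook.ipynb")
print(markdown_generator(analysis, knowledge_searcher(analysis)))
```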
## Core Components

### Context Retrieval Persona (`persona.py`)

- Main persona class orchestrating the three-agent system
- Handles Jupyter AI integration and message processing
- Initializes AWS Bedrock models and agent coordination
- Manages greeting detection and team workflow

### RAG Tool (`rag_tool.py`)

Core RAG system with two main classes:

- **RAG**: Loads handbook content into a ChromaDB vectorstore using HuggingFace embeddings
- **RAGTool**: Agno toolkit providing the `search_handbook_only()` function
- Returns complete notebooks (1-2 per search) instead of fragments
- Clean terminal logging showing retrieved notebook titles and stats

A sketch of how chunk-level hits can be grouped back into complete notebooks follows.
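The diff for `rag_tool.py` is not shown on this page, so the following is a minimal sketch, assuming langchain-community's `Chroma` store and chunk metadata that records each chunk's source notebook. The function name matches the README, but its body, the metadata key, and the persist directory are assumptions.

```python
# Illustrative sketch only: group chunk-level similarity hits back into
# complete notebooks via their source metadata. The real rag_tool.py is
# not shown in this diff, so the metadata key and paths are assumptions.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma(persist_directory="vector_stores/rag",
               embedding_function=embeddings)

def search_handbook_only(query: str, max_notebooks: int = 2) -> list:
    # Over-fetch chunks, then keep the first few distinct source notebooks.
    hits = store.similarity_search(query, k=10)
    sources = []
    for doc in hits:
        src = doc.metadata.get("source", "")  # e.g. "05.08-Random-Forests.ipynb"
        if src and src not in sources:
            sources.append(src)
        if len(sources) == max_notebooks:
            break
    # The real tool would then read each notebook in full and return its
    # complete content rather than just the filenames.
    return sources
```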
### Notebook Reader Tool (`file_reader_tool.py`)

- `NotebookReaderTool`: Provides the `extract_rag_context` function
- Reads complete notebook content and metadata
- Extracts context for the NotebookAnalyzer agent

## Installation & Setup

### Prerequisites

Install the context retrieval persona with its dependencies:

```bash
pip install -e ".[context_retriever]"
```

This installs:

- `agno` - Multi-agent framework
- `boto3` - AWS Bedrock integration
- `langchain`, `langchain-core`, and `langchain-community` - RAG framework
- `sentence-transformers` - Embedding models
- `chromadb` - Vector database
- `nbformat` - Jupyter notebook reading

### Set up the Python Data Science Handbook

```bash
# Clone the handbook repository
cd jupyter_ai_personas/context_retrieval_persona/
git clone https://github.com/jakevdp/PythonDataScienceHandbook.git
```

### AWS Configuration

Configure AWS credentials for Bedrock access:

```bash
aws configure
# or set environment variables:
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
```
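To verify Bedrock access before starting Jupyter, a quick check with `boto3` can help; the region here is an example, and model availability varies by account:

```python
# Sanity check that Bedrock is reachable with the configured credentials.
# The region is an example and may differ in your account.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")
models = bedrock.list_foundation_models()["modelSummaries"]
print(f"{len(models)} Bedrock foundation models available")
```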
## Usage

### Basic Usage

In Jupyter AI chat, use an @-mention to activate the persona:

```
@ContextRetrievalPersona notebook: /path/to/your/notebook.ipynb
Analyze my machine learning workflow and find relevant handbook resources
```

### Workflow Example

1. **User Request**: Provides the notebook path and a description
2. **NotebookAnalyzer**: Reads and analyzes the notebook content
3. **KnowledgeSearcher**: Performs 4-5 targeted searches in the handbook
4. **MarkdownGenerator**: Creates a comprehensive `repo_context.md` report

### Terminal Output

During processing, you'll see clean RAG search logs:

```
🔍 RAG SEARCH: 'sklearn RandomForest classification'
📚 Found 2 relevant notebooks:
1. 05.08-Random-Forests.ipynb (15 cells, 12450 chars)
2. 05.03-Hyperparameters-and-Model-Validation.ipynb (22 cells, 18920 chars)
```

### Generated Report Structure

The `repo_context.md` file includes:

- **Executive Summary**: Overview of findings and connections
- **Current Notebook Analysis**: Libraries, stage, domain, and objectives from your notebook
- **Comprehensive Handbook Resources**: Full code examples and explanations from retrieved notebooks
- **Detailed Code Examples**: Complete implementations from the handbook
- **Cross-References and Learning Paths**: Connections between your work and handbook content
- **Actionable Implementation Steps**: Specific next steps based on the analysis

## Technical Details

### RAG Implementation

- **Embedding Model**: `sentence-transformers/all-MiniLM-L6-v2`
- **Vector Store**: ChromaDB with persistent storage
- **Search Strategy**: Similarity search returning complete notebooks (not fragments)
- **Results per Search**: The 2 most relevant complete notebooks
- **Cell-Based Chunking**: Uses notebook cells as natural document boundaries

An illustrative indexing sketch follows.
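Since the indexing code is not shown on this page, here is a minimal sketch, assuming the `langchain-community` integrations listed under Installation. The persist directory matches the file structure below; everything else is illustrative.

```python
# Minimal indexing sketch, assuming langchain-community: build a persistent
# Chroma store with one Document per notebook cell (cell-based chunking).
import glob

import nbformat
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

docs = []
for path in glob.glob("PythonDataScienceHandbook/notebooks/*.ipynb"):
    nb = nbformat.read(path, as_version=4)
    for cell in nb.cells:
        if cell.source.strip():
            # Each non-empty cell becomes one document, tagged with its
            # source notebook so hits can be grouped back into full notebooks.
            docs.append(Document(page_content=cell.source,
                                 metadata={"source": path}))

store = Chroma.from_documents(
    docs,
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
    persist_directory="vector_stores/rag",
)
```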
### Optimizations

- **Handbook-Only Search**: Avoids redundant notebook content in RAG results
- **Complete Notebook Retrieval**: Returns full notebooks instead of fragments for better context
- **One-Time Loading**: Vector store loaded once per session via a `handbook_loaded` flag
- **Clean Logging**: Minimal terminal output showing only essential search information
- **JSON Validation Fix**: Uses `capture_validation_error=None` to suppress nbformat warnings (see the snippet below)
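For reference, `capture_validation_error` is a keyword accepted by `nbformat.read` in nbformat 5.x; reading a notebook with it looks like this (the path is a placeholder):

```python
# Read a notebook while passing capture_validation_error, as the persona
# does; "analysis.ipynb" is a placeholder path.
import nbformat

nb = nbformat.read("analysis.ipynb", as_version=4,
                   capture_validation_error=None)
print(len(nb.cells), "cells read")
```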
## File Structure

```
context_retrieval_persona/
├── README.md                   # This documentation
├── persona.py                  # Main persona class with three-agent system
├── rag_tool.py                 # RAG and RAGTool classes for handbook search
├── file_reader_tool.py         # NotebookReaderTool for content extraction
├── __init__.py                 # Package initialization
├── repo_context.md             # Generated markdown reports
├── PythonDataScienceHandbook/  # Cloned handbook repository
│   └── notebooks/              # 100+ handbook notebooks
└── vector_stores/              # ChromaDB vector storage
    └── rag/                    # Renamed from simple_rag
        ├── chroma.sqlite3
        └── [vector files]
```
## Troubleshooting

### Common Issues

1. **Missing Dependencies**: Install all required packages

   ```bash
   pip install -e ".[context_retriever]"
   ```

2. **Handbook Not Found**: Clone the handbook repository

   ```bash
   cd jupyter_ai_personas/context_retrieval_persona/
   git clone https://github.com/jakevdp/PythonDataScienceHandbook.git
   ```

3. **AWS/Bedrock Issues**: Configure AWS credentials

   ```bash
   aws configure
   ```

4. **JSON Validation Warnings**: These are now suppressed with `capture_validation_error=None`

5. **Vector Store Loading**: The first run builds the vector store (5-10 minutes); subsequent runs are fast

## Contributing

To extend the system:

1. **Enhance RAG Search**: Modify the `RAGTool` class in `rag_tool.py`
2. **Improve Context Extraction**: Update `NotebookReaderTool` in `file_reader_tool.py`
3. **Refine Agent Instructions**: Update the agent prompts in `persona.py`
4. **Add New Analysis Capabilities**: Extend the three-agent system workflow

A hypothetical sketch of the first extension point follows.
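As an illustration of extending the RAG search, new capabilities can be registered following the same `Toolkit` pattern used by `NotebookReaderTool` (shown in full below). The class name, method, and return values here are hypothetical:

```python
# Hypothetical extension sketch following the Toolkit/register() pattern
# visible in file_reader_tool.py; bodies are placeholder stubs.
from agno.tools import Toolkit

class ExtendedRAGTool(Toolkit):
    def __init__(self):
        super().__init__(name="rag_tool")
        self.register(self.search_handbook_only)
        self.register(self.search_by_chapter)  # newly added capability

    def search_handbook_only(self, query: str) -> str:
        return f"[search results for {query!r}]"  # placeholder body

    def search_by_chapter(self, chapter_prefix: str) -> str:
        """Hypothetical: e.g. chapter_prefix='05' for the ML chapter."""
        return f"[notebooks under chapter {chapter_prefix}]"  # placeholder
```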
### `jupyter_ai_personas/context_retrieval_persona/__init__.py` (1 addition, 0 deletions)

```python
"""Data Science Persona package for Jupyter AI."""
```
### `jupyter_ai_personas/context_retrieval_persona/file_reader_tool.py` (160 additions, 0 deletions)
```python
import json
import os
from typing import Dict, Any, List, Optional
from agno.tools import Toolkit


class NotebookReaderTool(Toolkit):
    """Tool for reading and extracting complete content from Jupyter notebooks."""

    def __init__(self):
        super().__init__(name="notebook_reader")
        self.register(self.extract_rag_context)

    def extract_rag_context(self, notebook_path: str) -> str:
        """
        Extract complete content from a Jupyter notebook for RAG context.

        Args:
            notebook_path: Path to the .ipynb notebook file

        Returns:
            str: Formatted string containing all notebook content including cells,
                outputs, markdown, and metadata
        """
        try:
            if not os.path.exists(notebook_path):
                return f"Error: Notebook file not found at {notebook_path}"

            if not notebook_path.endswith('.ipynb'):
                return f"Error: File must be a .ipynb notebook file, got {notebook_path}"

            with open(notebook_path, 'r', encoding='utf-8') as f:
                notebook = json.load(f)

            # Extract notebook metadata and cells
            context = f"=== NOTEBOOK ANALYSIS ===\n"
            context += f"File: {notebook_path}\n"
            context += f"Kernel: {notebook.get('metadata', {}).get('kernelspec', {}).get('display_name', 'Unknown')}\n"
            context += f"Language: {notebook.get('metadata', {}).get('kernelspec', {}).get('language', 'Unknown')}\n\n"
            cells = notebook.get('cells', [])
            context += f"=== NOTEBOOK CONTENT ({len(cells)} cells) ===\n\n"

            for i, cell in enumerate(cells, 1):
                cell_type = cell.get('cell_type', 'unknown')
                context += f"--- Cell {i} ({cell_type.upper()}) ---\n"
                source = cell.get('source', [])
                if isinstance(source, list):
                    source_text = ''.join(source)
                else:
                    source_text = str(source)

                context += f"SOURCE:\n{source_text}\n"

                # Get cell outputs for code cells
                if cell_type == 'code':
                    outputs = cell.get('outputs', [])
                    if outputs:
                        context += f"OUTPUTS:\n"
                        for j, output in enumerate(outputs):
                            output_type = output.get('output_type', 'unknown')
                            context += f"  Output {j+1} ({output_type}):\n"
                            if output_type == 'stream':
                                text = ''.join(output.get('text', []))
                                context += f"    {text}\n"
                            elif output_type == 'execute_result' or output_type == 'display_data':
                                data = output.get('data', {})
                                for mime_type, content in data.items():
                                    if mime_type == 'text/plain':
                                        if isinstance(content, list):
                                            content = ''.join(content)
                                        context += f"    {content}\n"
                                    elif mime_type == 'text/html':
                                        context += f"    [HTML OUTPUT]\n"
                                    elif 'image' in mime_type:
                                        context += f"    [IMAGE: {mime_type}]\n"
                            elif output_type == 'error':
                                ename = output.get('ename', 'Error')
                                evalue = output.get('evalue', '')
                                context += f"    ERROR: {ename}: {evalue}\n"

                context += "\n"

            # Extract imports and library usage
            imports = self._extract_imports(notebook)
            if imports:
                context += f"=== DETECTED LIBRARIES ===\n"
                for imp in imports:
                    context += f"- {imp}\n"
                context += "\n"

            # Extract data science context
            ds_context = self._extract_data_science_context(notebook)
            if ds_context:
                context += f"=== DATA SCIENCE CONTEXT ===\n{ds_context}\n"

            return context

        except json.JSONDecodeError:
            return f"Error: Invalid JSON in notebook file {notebook_path}"
        except Exception as e:
            return f"Error reading notebook {notebook_path}: {str(e)}"

    def _extract_imports(self, notebook: Dict[str, Any]) -> List[str]:
        """Extract import statements from notebook cells."""
        imports = []
        cells = notebook.get('cells', [])

        for cell in cells:
            if cell.get('cell_type') == 'code':
                source = cell.get('source', [])
                if isinstance(source, list):
                    source_text = ''.join(source)
                else:
                    source_text = str(source)

                lines = source_text.split('\n')
                for line in lines:
                    line = line.strip()
                    if line.startswith('import ') or line.startswith('from '):
                        imports.append(line)

        return list(set(imports))

    def _extract_data_science_context(self, notebook: Dict[str, Any]) -> str:
        """Extract data science context from notebook content."""
        context_items = []
        cells = notebook.get('cells', [])

        ds_patterns = {
            'pandas': ['pd.read_', 'DataFrame', '.head()', '.describe()', '.info()'],
            'numpy': ['np.array', 'np.mean', 'np.std', 'numpy'],
            'matplotlib': ['plt.', 'matplotlib', '.plot()', '.show()'],
            'seaborn': ['sns.', 'seaborn'],
            'sklearn': ['sklearn', 'fit()', 'predict()', 'score()'],
            'analysis': ['correlation', 'regression', 'classification', 'clustering'],
            'data_ops': ['merge', 'join', 'groupby', 'pivot', 'melt']
        }

        detected = {category: [] for category in ds_patterns.keys()}

        for cell in cells:
            if cell.get('cell_type') == 'code':
                source = cell.get('source', [])
                if isinstance(source, list):
                    source_text = ''.join(source)
                else:
                    source_text = str(source)

                for category, patterns in ds_patterns.items():
                    for pattern in patterns:
                        if pattern.lower() in source_text.lower():
                            detected[category].append(pattern)

        active_categories = {k: list(set(v)) for k, v in detected.items() if v}

        if active_categories:
            context_items.append("Analysis stage indicators:")
            for category, patterns in active_categories.items():
                context_items.append(f"  {category}: {', '.join(patterns[:3])}")

        return '\n'.join(context_items) if context_items else ""
```
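For a quick standalone check of this tool outside the agent team (the notebook path is a placeholder):

```python
# Standalone usage of the tool defined above; "analysis.ipynb" is a
# placeholder path to any local notebook.
tool = NotebookReaderTool()
print(tool.extract_rag_context("analysis.ipynb")[:500])
```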
**Review comment:** Given a `.py` file instead of a notebook `.ipynb` file, it still processed the context retrieval? Not sure why.