Changes from 22 of 24 commits:
- `57ed47a` New persona integrating jupyter ai tools (jonahjung22, Jul 10, 2025)
- `5caff8f` Context Retrieval Persona (jonahjung22, Jul 15, 2025)
- `461f281` test context file (jonahjung22, Jul 15, 2025)
- `74ed255` updated toml and wrapper tool (jonahjung22, Jul 15, 2025)
- `c2d52ed` Increased RAG chunks, specific md file naming, logging (jonahjung22, Jul 16, 2025)
- `c89afb3` Removed unnecessary files (jonahjung22, Jul 16, 2025)
- `cc2b99d` updated the names of the files; updated README with persona new capab… (jonahjung22, Jul 16, 2025)
- `2a24e18` modified toml (jonahjung22, Jul 17, 2025)
- `20262e4` building out the context persona using pocketflow (jonahjung22, Jul 19, 2025)
- `772d548` new method for rag based approach using pocketflow (jonahjung22, Jul 21, 2025)
- `55157b7` Merge branch 'main' into data_science_persona (jonahjung22, Jul 23, 2025)
- `95b68f8` added test notebook (jonahjung22, Jul 24, 2025)
- `d910e61` added greetings (jonahjung22, Jul 24, 2025)
- `82834fc` Separating 1 persona for each PR (jonahjung22, Jul 24, 2025)
- `e5e188b` cleaned up some code (jonahjung22, Jul 24, 2025)
- `ad6a220` updated README (jonahjung22, Jul 24, 2025)
- `a882287` added test files (jonahjung22, Jul 29, 2025)
- `1ebe1f2` removing some lines (jonahjung22, Jul 29, 2025)
- `7488430` updated persona code and removed unnecessary components (jonahjung22, Aug 5, 2025)
- `af883f9` remove unnecessary comments (jonahjung22, Aug 7, 2025)
- `cb68c1f` updated dependencies (jonahjung22, Aug 8, 2025)
- `46d26cd` deleted unnecessary folder (jonahjung22, Aug 8, 2025)
- `dd81447` Changes to the whole RAG structure implemented (jonahjung22, Aug 11, 2025)
- `b7f6eac` removed unnecessary file (jonahjung22, Aug 21, 2025)
1 change: 0 additions & 1 deletion jupyter-ai-personas
Submodule jupyter-ai-personas deleted from 4af5de
239 changes: 239 additions & 0 deletions jupyter_ai_personas/context_retrieval_persona/README.md
@@ -0,0 +1,239 @@
# Context Retrieval Persona

## Overview

The Context Retrieval Persona is a multi-agent system that analyzes your current data science work and uses semantic search to surface relevant resources from the Python Data Science Handbook. Three specialized agents work together to turn that context into actionable insights.

## Features

- **Notebook Analysis**: Automatically extracts context from your Jupyter notebooks including libraries, analysis stage, and objectives
- **RAG-Powered Search**: Semantic search through the entire Python Data Science Handbook repository
- **Context-Aware Recommendations**: Provides relevant code examples, best practices, and documentation based on your current work
- **Multi-Agent Architecture**: Three specialized agents for analysis, search, and report generation
- **Comprehensive Reports**: Generates detailed markdown reports with actionable next steps
- **Optimized Performance**: Improved caching and simplified logging for faster execution
- **Automatic Report Saving**: Generated reports are automatically saved as `repo_context.md`
- **Improved RAG Parameters**: Increased chunk size (1500 chars) and search results (8 chunks) for better coverage

## Architecture

### Three-Agent System

1. **NotebookAnalyzer**: Extracts context from your notebook content
- Identifies libraries being used (pandas, numpy, scikit-learn, etc.)
- Determines analysis stage (data loading, EDA, preprocessing, modeling, etc.)
- Extracts objectives and current progress

2. **KnowledgeSearcher**: Performs targeted RAG searches
- Multiple search strategies based on context
- Semantic search through 100+ handbook notebooks
- Filters for relevant code examples and explanations

3. **MarkdownGenerator**: Creates comprehensive reports
- Executive summaries of findings
- Relevant code examples with explanations
- Actionable next steps for your analysis
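The three agents above can be pictured as a simple pipeline. The sketch below is purely illustrative: the plain functions stand in for the real agents, and the `search_repository()` call assumes the Agno tool wrapper described under Core Components.

```python
# Illustrative pipeline only; the real persona wires these agents together
# through its own agent framework, and the helper names here are stand-ins.
from rag_integration_tool import create_simple_rag_tools

def analyze_notebook(path: str) -> dict:
    """NotebookAnalyzer role: summarize libraries, analysis stage, objectives."""
    # Stubbed here; the persona uses its notebook reader tool for this step.
    return {"libraries": ["pandas", "matplotlib"], "stage": "exploratory_data_analysis"}

def search_knowledge(context: dict) -> list:
    """KnowledgeSearcher role: run targeted RAG searches based on the context."""
    rag_tool = create_simple_rag_tools()
    return [rag_tool.search_repository(f"{lib} {context['stage']} examples")
            for lib in context["libraries"]]

def generate_report(context: dict, findings: list) -> str:
    """MarkdownGenerator role: assemble the markdown report."""
    sections = "\n\n".join(str(f) for f in findings)
    return f"## Executive Summary\nStage: {context['stage']}\n\n{sections}\n"

context = analyze_notebook("test_context_retrieval.ipynb")
print(generate_report(context, search_knowledge(context)))
```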

## Core Components

### Context Retrieval Persona (`context_retrieval_persona.py`)
Main persona class that orchestrates the three-agent system and handles Jupyter AI integration.

### RAG Core System (`rag_core.py`)
- Repository management for Python Data Science Handbook
- Document extraction from Jupyter notebooks
- Vector storage using ChromaDB
- Semantic search with HuggingFace embeddings

### RAG Integration Tool (`rag_integration_tool.py`)
Agno tool wrapper providing clean integration with the agent system:
- `search_repository()`: General semantic search
- `search_by_topic()`: Topic-specific searches
- `search_code_examples()`: Code-focused searches
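
A hedged usage sketch of these helpers: the function and method names come from this README, but the exact signatures (a query string as the only argument) are assumptions.

```python
# Hedged sketch: method names are documented above, argument shapes are assumed.
from rag_integration_tool import create_simple_rag_tools

rag_tool = create_simple_rag_tools()

general = rag_tool.search_repository("handling missing values in pandas")  # general semantic search
by_topic = rag_tool.search_by_topic("linear regression")                   # topic-specific search
code_only = rag_tool.search_code_examples("matplotlib subplots")           # code-focused search

print(general)
```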

### Notebook Reader Tool (`file_reader_tool.py`)
Comprehensive notebook content extraction:
- Reads all cell types (code, markdown)
- Extracts outputs and metadata
- Detects libraries and analysis patterns
- Provides structured context for search
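
The tool's internals are not shown here, but the kind of extraction it performs can be illustrated with `nbformat` (one of the listed dependencies). This is a standalone sketch, not the tool's actual code:

```python
# Standalone illustration of notebook context extraction, not NotebookReaderTool itself.
import re
import nbformat

def extract_notebook_context(path: str) -> dict:
    nb = nbformat.read(path, as_version=4)
    code = [c.source for c in nb.cells if c.cell_type == "code"]
    markdown = [c.source for c in nb.cells if c.cell_type == "markdown"]

    # Detect imported libraries from the code cells
    libraries = set()
    for src in code:
        libraries.update(re.findall(r"^\s*(?:import|from)\s+(\w+)", src, re.MULTILINE))

    return {
        "code_cells": len(code),
        "markdown_cells": len(markdown),
        "libraries": sorted(libraries),
    }

print(extract_notebook_context("test_context_retrieval.ipynb"))
```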

## Installation & Setup

### Prerequisites
```bash
# Install required packages
pip install chromadb sentence-transformers langchain nbformat gitpython
```

### Quick Setup
```bash
# Run the setup script
python setup_rag_system.py
```

This will:
1. Check dependencies
2. Clone the Python Data Science Handbook repository
3. Build the vector store (first run takes 5-10 minutes)
4. Test the system functionality

### Manual Setup
```python
from rag_core import create_handbook_rag

# Initialize the RAG system
rag = create_handbook_rag(force_rebuild=False)

# Test search functionality
results = rag.search("pandas dataframe operations", k=5)
```
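
The structure of the returned results is not documented here; since the system is built on LangChain and ChromaDB, a reasonable (but unverified) assumption is that each result is a LangChain-style `Document`:

```python
# Assumption: rag.search() returns LangChain-style Document objects.
# Adjust the attribute access if the actual return type differs.
for doc in results:
    print(doc.metadata.get("source", "unknown"), "->", doc.page_content[:120])
```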

## Usage

### Basic Usage
In Jupyter AI, activate the Context Retrieval Persona and provide a request such as:

```
I need help with data visualization using matplotlib and seaborn.
notebook: /path/to/my/analysis.ipynb
```

### Typical Workflow
1. **Context Analysis**: The system reads your notebook to understand:
- What libraries you're using
- What stage of analysis you're in
- What data you're working with

2. **Knowledge Search**: Performs multiple targeted searches:
- Library-specific examples
- Analysis stage best practices
- Problem domain patterns

3. **Report Generation**: Creates a comprehensive markdown report with:
- Executive summary of findings
- Current notebook analysis
- Relevant code examples
- Actionable next steps

### Example Output
```markdown
## Executive Summary
Based on your notebook analysis, you're in the exploratory data analysis stage
using pandas and matplotlib. Found relevant handbook content for data
visualization best practices and statistical analysis patterns.

## Current Notebook Analysis
- Libraries: pandas, matplotlib, seaborn
- Analysis Stage: exploratory_data_analysis
- Data Operations: groupby, pivot, plotting

## Relevant Resources
### Data Visualization with Matplotlib
[Code examples and explanations from the handbook]

### Statistical Analysis Patterns
[Relevant statistical methods and implementations]

## Actionable Next Steps
1. Implement correlation analysis using the patterns from Section 04.05
2. Consider using seaborn for advanced statistical plots
3. Apply dimensionality reduction techniques from Chapter 05
```

## Configuration

### Environment Variables
```bash
# Optional: Configure data paths
export RAG_REPO_PATH="/path/to/PythonDataScienceHandbook"
export RAG_VECTOR_STORE_PATH="/path/to/vector_stores"
```
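
A small sketch of how these variables might be picked up in code, with defaults matching the directory layout under File Structure; whether `rag_core.py` resolves them exactly this way is an assumption.

```python
# Illustrative only: resolving the optional environment variables with defaults
# that match the File Structure section.
import os
from pathlib import Path

repo_path = Path(os.environ.get("RAG_REPO_PATH", "PythonDataScienceHandbook"))
vector_store_path = Path(os.environ.get("RAG_VECTOR_STORE_PATH", "vector_stores"))
```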

### Customization
Modify parameters in `rag_core.py`:
```python
rag = PythonDSHandbookRAG(
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    chunk_size=1500,    # Increased chunk size
    chunk_overlap=300   # Increased overlap
)
```

> **Reviewer comment (Collaborator):** Can this be updated to take in the chosen embedding model from Jupyter-AI? The embedding model would then need to be called using the functions in Jupyter AI.

### RAG Search Parameters
- **Default Results**: 8 chunks per search (increased from 5)
- **Chunk Size**: 1500 characters (increased from 1000)
- **Chunk Overlap**: 300 characters (increased from 200)
- **Efficient Logging**: Concise search result logging with essential debugging information
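
For example, building on the `rag.search()` call shown under Manual Setup, the wider default coverage corresponds to:

```python
# Same search API as in "Manual Setup", with the increased default of 8 chunks.
results = rag.search("feature scaling for scikit-learn pipelines", k=8)
```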

## File Structure

```
context_retrieval_persona/
├── README.md                      # This file
├── context_retrieval_persona.py   # Main persona class
├── rag_core.py                    # Core RAG system
├── rag_integration_tool.py        # Agno tool wrapper
├── file_reader_tool.py            # Notebook content extraction
├── setup_rag_system.py            # Setup script
├── __init__.py                    # Package initialization
├── test_context_retrieval.ipynb   # Test notebook
├── repo_context.md                # Generated markdown reports
├── PythonDataScienceHandbook/     # Cloned repository
│   └── notebooks/                 # 100+ handbook notebooks
└── vector_stores/                 # ChromaDB vector storage
    └── python_ds_handbook/
        ├── chroma.sqlite3
        └── metadata.json
```

## Performance Notes

- **First Run**: 5-10 minutes to build vector store
- **Subsequent Runs**: <3 seconds using cached vectors and optimized code
- **Memory Usage**: ~500MB for full vector store
- **Search Speed**: <1 second for semantic queries
- **Recent Optimizations**: Simplified logging, improved caching, and reduced code complexity

## Troubleshooting

### Common Issues

1. **Import Errors**: Ensure all dependencies are installed
```bash
pip install chromadb sentence-transformers langchain
```

2. **Vector Store Issues**: Force rebuild if corrupted
```python
rag = create_handbook_rag(force_rebuild=True)
```

3. **Repository Problems**: Check git connectivity
```bash
git clone https://github.com/jakevdp/PythonDataScienceHandbook.git
```

### Debug Information
```python
# Check system status with the setup script (run from a shell):
#   python setup_rag_system.py

# Or check the RAG system manually from Python:
from rag_integration_tool import create_simple_rag_tools

rag_tool = create_simple_rag_tools()
status = rag_tool.get_system_status()
print(status)  # Detailed system diagnostics
```

## Contributing

To extend the system:

1. **Add New Search Methods**: Extend `RAGSearchTool` in `rag_integration_tool.py` (see the sketch after this list)
2. **Enhance Context Extraction**: Modify `NotebookReaderTool` in `file_reader_tool.py`
3. **Improve Agent Instructions**: Update agent prompts in `context_retrieval_persona.py`
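
A hedged sketch of the first extension point. It assumes `RAGSearchTool` is importable from `rag_integration_tool` and exposes the `search_repository()` method described above; the real class may be structured differently.

```python
# Hedged sketch: class and method names follow this README, internals are assumed.
from rag_integration_tool import RAGSearchTool

class ExtendedRAGSearchTool(RAGSearchTool):
    def search_exercises(self, topic: str):
        """Hypothetical new method focused on hands-on practice material."""
        return self.search_repository(f"exercises and worked examples about {topic}")
```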

## License

This project uses the Python Data Science Handbook, which is available under the MIT License. See the handbook repository for full license details.
1 change: 1 addition & 0 deletions jupyter_ai_personas/context_retrieval_persona/__init__.py
@@ -0,0 +1 @@
"""Data Science Persona package for Jupyter AI."""