Changes from all commits (24 commits)
57ed47a
New persona integrating jupyter ai tools
jonahjung22 Jul 10, 2025
5caff8f
Context Retrieval Persona
jonahjung22 Jul 15, 2025
461f281
test context file
jonahjung22 Jul 15, 2025
74ed255
updated toml and wrapper tool
jonahjung22 Jul 15, 2025
c2d52ed
Increased RAG chunks, specific md file naming, logging
jonahjung22 Jul 16, 2025
c89afb3
Removed unnecessary files
jonahjung22 Jul 16, 2025
cc2b99d
updated the names of the files; updated README with persona new capab…
jonahjung22 Jul 16, 2025
2a24e18
modified toml
jonahjung22 Jul 17, 2025
20262e4
building out the context persona using pocketflow
jonahjung22 Jul 19, 2025
772d548
new method for rag based approach using pocketflow
jonahjung22 Jul 21, 2025
55157b7
Merge branch 'main' into data_science_persona
jonahjung22 Jul 23, 2025
95b68f8
added test notebook
jonahjung22 Jul 24, 2025
d910e61
added greetings
jonahjung22 Jul 24, 2025
82834fc
Separating 1 persona for each PR
jonahjung22 Jul 24, 2025
e5e188b
cleaned up some code
jonahjung22 Jul 24, 2025
ad6a220
updated README
jonahjung22 Jul 24, 2025
a882287
added test files
jonahjung22 Jul 29, 2025
1ebe1f2
removing some lines
jonahjung22 Jul 29, 2025
7488430
updated persona code and removed unnecessary components
jonahjung22 Aug 5, 2025
af883f9
remove unnecessary comments
jonahjung22 Aug 7, 2025
cb68c1f
updated dependencies
jonahjung22 Aug 8, 2025
46d26cd
deleted unnecessary folder
jonahjung22 Aug 8, 2025
dd81447
Changes to the whole RAG structure implemented
jonahjung22 Aug 11, 2025
b7f6eac
removed unnecessary file
jonahjung22 Aug 21, 2025
1 change: 0 additions & 1 deletion jupyter-ai-personas
Submodule jupyter-ai-personas deleted from 4af5de
214 changes: 214 additions & 0 deletions jupyter_ai_personas/context_retrieval_persona/README.md
@@ -0,0 +1,214 @@
# Context Retrieval Persona

## Overview

The Context Retrieval Persona analyzes your data science notebooks and finds relevant resources from the Python Data Science Handbook using RAG (Retrieval-Augmented Generation). It employs a three-agent system to provide comprehensive analysis and actionable recommendations.

## Features

- **Intelligent Notebook Analysis**: Extracts libraries, analysis stage, domain, and objectives from your notebooks
- **Full Notebook RAG Search**: Returns complete relevant notebooks instead of fragments for comprehensive context
- **Handbook-Only Search**: Avoids redundant searching by focusing on external handbook content only
- **Multi-Agent Coordination**: NotebookAnalyzer, KnowledgeSearcher, and MarkdownGenerator working together
- **Comprehensive Markdown Reports**: Detailed reports with code examples, explanations, and next steps
- **Optimized Search**: 1-2 complete notebooks per query with clean terminal logging
- **Automatic Report Generation**: Creates `repo_context.md` with comprehensive analysis

## Architecture

### Three-Agent System

1. **NotebookAnalyzer**: Extracts structured context from your notebook

- Uses `extract_rag_context` tool to read notebook content
- Identifies libraries (pandas, numpy, sklearn, matplotlib, etc.)
- Determines analysis stage (data_loading, eda, preprocessing, modeling, evaluation, visualization)
- Outputs structured JSON with path, libraries, stage, domain, and objectives

2. **KnowledgeSearcher**: Performs targeted handbook-only RAG searches

- Generates 4-5 targeted search queries based on notebook analysis
- Uses `search_handbook_only` to find relevant complete notebooks
- Each search returns 1-2 most relevant notebooks (not fragments)
- Provides comprehensive handbook content to MarkdownGenerator

3. **MarkdownGenerator**: Creates detailed markdown reports
- Synthesizes notebook analysis with RAG search results
- Includes substantial content from retrieved handbooks
- Creates cross-references between user's work and handbook examples
- Saves comprehensive reports as `repo_context.md`
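
The coordination between these agents is easiest to see in code. Below is a minimal, illustrative sketch of how such a team could be assembled with agno and Bedrock; the agent names match the ones above, but the model ID, instructions, and team settings are assumptions, not the exact contents of `persona.py`.

```python
# Illustrative sketch only: the model ID, instructions, and Team settings
# are assumptions; the real prompts and wiring live in persona.py.
from agno.agent import Agent
from agno.team import Team
from agno.models.aws import AwsBedrock

from jupyter_ai_personas.context_retrieval_persona.file_reader_tool import NotebookReaderTool
from jupyter_ai_personas.context_retrieval_persona.rag_tool import RAGTool

model = AwsBedrock(id="anthropic.claude-3-5-sonnet-20240620-v1:0")  # assumed model ID

notebook_analyzer = Agent(
    name="NotebookAnalyzer",
    model=model,
    tools=[NotebookReaderTool()],
    instructions="Read the notebook with extract_rag_context and return JSON "
                 "with path, libraries, stage, domain, and objectives.",
)

knowledge_searcher = Agent(
    name="KnowledgeSearcher",
    model=model,
    tools=[RAGTool()],
    instructions="Write 4-5 targeted queries from the analysis and call "
                 "search_handbook_only for each.",
)

markdown_generator = Agent(
    name="MarkdownGenerator",
    model=model,
    instructions="Synthesize the analysis and search results into repo_context.md.",
)

team = Team(members=[notebook_analyzer, knowledge_searcher, markdown_generator])
```

In the actual persona, the instructions are far more detailed and the team run is triggered from the Jupyter AI chat handler.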

## Core Components

### Context Retrieval Persona (`persona.py`)

- Main persona class orchestrating the three-agent system
- Handles Jupyter AI integration and message processing
- Initializes AWS Bedrock models and agent coordination
- Manages greeting detection and team workflow

### RAG Tool (`rag_tool.py`)

Core RAG system with two main classes:

- **RAG**: Loads handbook content into ChromaDB vectorstore using HuggingFace embeddings
- **RAGTool**: Agno toolkit providing `search_handbook_only()` function
- Returns complete notebooks (1-2 per search) instead of fragments
- Clean terminal logging showing retrieved notebook titles and stats
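
As a rough sketch of the loading path, using the LangChain, ChromaDB, and nbformat packages from the dependency list (the directory names and the cell-per-document layout are assumptions, not a copy of `rag_tool.py`):

```python
# Minimal sketch of handbook loading; paths and document layout are
# assumptions, not a copy of rag_tool.py.
from pathlib import Path

import nbformat
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

HANDBOOK_DIR = Path("PythonDataScienceHandbook/notebooks")  # assumed location
PERSIST_DIR = "vector_stores/rag"

def load_handbook_documents() -> list[Document]:
    """One Document per notebook cell, tagged with its source notebook."""
    docs = []
    for nb_path in sorted(HANDBOOK_DIR.glob("*.ipynb")):
        nb = nbformat.read(nb_path, as_version=4, capture_validation_error=None)
        for cell in nb.cells:
            if cell.source.strip():
                docs.append(Document(
                    page_content=cell.source,
                    metadata={"notebook": nb_path.name, "cell_type": cell.cell_type},
                ))
    return docs

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(
    load_handbook_documents(), embeddings, persist_directory=PERSIST_DIR
)

# Similarity search over cells; the real tool maps hits back to whole notebooks.
hits = vectorstore.similarity_search("sklearn RandomForest classification", k=8)
```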

### Notebook Reader Tool (`file_reader_tool.py`)

- `NotebookReaderTool`: Provides `extract_rag_context` function
- Reads complete notebook content and metadata
- Extracts context for the NotebookAnalyzer agent
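
A quick usage sketch: the tool is normally invoked by the NotebookAnalyzer agent, but it can also be called directly (the notebook path below is a placeholder):

```python
from jupyter_ai_personas.context_retrieval_persona.file_reader_tool import NotebookReaderTool

reader = NotebookReaderTool()
context = reader.extract_rag_context("analysis.ipynb")  # placeholder path
print(context[:500])  # formatted cells, outputs, and detected libraries
```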

## Installation & Setup

### Prerequisites

Install the context retrieval persona with its dependencies:

```bash
pip install -e ".[context_retriever]"
```

This installs:

- `agno` - Multi-agent framework
- `boto3` - AWS Bedrock integration
- `langchain` & `langchain-core` & `langchain-community` - RAG framework
- `sentence-transformers` - Embedding models
- `chromadb` - Vector database
- `nbformat` - Jupyter notebook reading

### Setup Python Data Science Handbook

```bash
# Clone the handbook repository
cd jupyter_ai_personas/context_retrieval_persona/
git clone https://github.com/jakevdp/PythonDataScienceHandbook.git
```

### AWS Configuration

Configure AWS credentials for Bedrock access:

```bash
aws configure
# or set environment variables:
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
```

## Usage

### Basic Usage

In Jupyter AI chat, use the @ mention to activate the persona:

```
@ContextRetrievalPersona notebook: /path/to/your/notebook.ipynb
Analyze my machine learning workflow and find relevant handbook resources
```

### Workflow Example

1. **User Request**: Provides notebook path and description
2. **NotebookAnalyzer**: Reads and analyzes notebook content
3. **KnowledgeSearcher**: Performs 4-5 targeted searches in handbook
4. **MarkdownGenerator**: Creates comprehensive `repo_context.md` report

### Terminal Output

During processing, you'll see clean RAG search logs:

```
🔍 RAG SEARCH: 'sklearn RandomForest classification'
📚 Found 2 relevant notebooks:
1. 05.08-Random-Forests.ipynb (15 cells, 12450 chars)
2. 05.03-Hyperparameters-and-Model-Validation.ipynb (22 cells, 18920 chars)
```

### Generated Report Structure

The `repo_context.md` file includes:

- **Executive Summary**: Overview of findings and connections
- **Current Notebook Analysis**: Libraries, stage, domain, objectives from your notebook
- **Comprehensive Handbook Resources**: Full code examples and explanations from retrieved notebooks
- **Detailed Code Examples**: Complete implementations from handbook
- **Cross-References and Learning Paths**: Connections between your work and handbook content
- **Actionable Implementation Steps**: Specific next steps based on analysis

## Technical Details

### RAG Implementation

- **Embedding Model**: `sentence-transformers/all-MiniLM-L6-v2`
- **Vector Store**: ChromaDB with persistent storage
- **Search Strategy**: Similarity search returning complete notebooks (not fragments)
- **Results per Search**: 2 most relevant complete notebooks
- **Cell-Based Chunking**: Uses notebook cells as natural document boundaries
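
Because the index stores individual cells but searches return whole notebooks, the retrieval step has to aggregate cell-level hits back to their source notebooks. A hedged sketch of one way to do this, assuming each cell document carries a `notebook` metadata key as in the loading sketch above:

```python
from collections import Counter

def search_complete_notebooks(vectorstore, query: str, max_notebooks: int = 2) -> list[str]:
    """Rank notebooks by how many of their cells match, then return the top few."""
    hits = vectorstore.similarity_search(query, k=10)  # cell-level hits
    ranked = Counter(doc.metadata["notebook"] for doc in hits)
    return [name for name, _ in ranked.most_common(max_notebooks)]
```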

### Optimizations

- **Handbook-Only Search**: Avoids redundant notebook content in RAG results
- **Complete Notebook Retrieval**: Returns full notebooks instead of fragments for better context
- **One-Time Loading**: Vector store loaded once per session with handbook_loaded flag
- **Clean Logging**: Minimal terminal output showing only essential search information
- **JSON Validation Fix**: Uses `capture_validation_error=None` to suppress nbformat warnings

## File Structure

```
context_retrieval_persona/
├── README.md                     # This documentation
├── persona.py                    # Main persona class with three-agent system
├── rag_tool.py                   # RAG and RAGTool classes for handbook search
├── file_reader_tool.py           # NotebookReaderTool for content extraction
├── __init__.py                   # Package initialization
├── repo_context.md               # Generated markdown reports
├── PythonDataScienceHandbook/    # Cloned handbook repository
│   └── notebooks/                # 100+ handbook notebooks
└── vector_stores/                # ChromaDB vector storage
    └── rag/                      # Renamed from simple_rag
        ├── chroma.sqlite3
        └── [vector files]
```

## Troubleshooting

### Common Issues

1. **Missing Dependencies**: Install all required packages

```bash
pip install -e ".[context_retriever]"
```

2. **Handbook Not Found**: Clone the handbook repository

```bash
cd jupyter_ai_personas/context_retrieval_persona/
git clone https://github.com/jakevdp/PythonDataScienceHandbook.git
```

3. **AWS/Bedrock Issues**: Configure AWS credentials

```bash
aws configure
```

4. **JSON Validation Warnings**: These are now suppressed with `capture_validation_error=None`

5. **Vector Store Loading**: The first run builds the vector store (5-10 minutes); subsequent runs are fast

## Contributing

To extend the system:

1. **Enhance RAG Search**: Modify `RAGTool` class in `rag_tool.py`
2. **Improve Context Extraction**: Update `NotebookReaderTool` in `file_reader_tool.py`
3. **Refine Agent Instructions**: Update agent prompts in `persona.py`
4. **Add New Analysis Capabilities**: Extend the three-agent system workflow
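
For the first extension point, new search functions follow the same `Toolkit` registration pattern used in `file_reader_tool.py`. A hedged sketch (the class shape and method are hypothetical, not the current `rag_tool.py` API):

```python
# Hypothetical extension sketch; mirrors the Toolkit pattern from
# file_reader_tool.py but does not reproduce the real RAGTool internals.
from agno.tools import Toolkit

class ExtendedRAGTool(Toolkit):
    def __init__(self, vectorstore):
        super().__init__(name="rag_tool")
        self.vectorstore = vectorstore
        self.register(self.search_by_library)

    def search_by_library(self, library: str) -> str:
        """Find handbook notebooks that feature a given library."""
        hits = self.vectorstore.similarity_search(f"{library} usage examples", k=5)
        names = {doc.metadata["notebook"] for doc in hits}
        return "\n".join(sorted(names))
```
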
1 change: 1 addition & 0 deletions jupyter_ai_personas/context_retrieval_persona/__init__.py
@@ -0,0 +1 @@
"""Data Science Persona package for Jupyter AI."""
160 changes: 160 additions & 0 deletions jupyter_ai_personas/context_retrieval_persona/file_reader_tool.py
@@ -0,0 +1,160 @@
import json
import os
from typing import Dict, Any, List
from agno.tools import Toolkit


class NotebookReaderTool(Toolkit):
    """Tool for reading and extracting complete content from Jupyter notebooks."""

    def __init__(self):
        super().__init__(name="notebook_reader")
        self.register(self.extract_rag_context)

    def extract_rag_context(self, notebook_path: str) -> str:
        """
        Extract complete content from a Jupyter notebook for RAG context.

        Args:
            notebook_path: Path to the .ipynb notebook file

        Returns:
            str: Formatted string containing all notebook content including cells,
                outputs, markdown, and metadata
        """
        try:
            if not os.path.exists(notebook_path):
                return f"Error: Notebook file not found at {notebook_path}"

            if not notebook_path.endswith('.ipynb'):
                return f"Error: File must be a .ipynb notebook file, got {notebook_path}"

Comment on lines +25 to +30 (Collaborator):

1. This should be sent to the chat panel, not just printed in the logs.
2. When I tried this with a .py file instead of a notebook .ipynb file, it still processed the context retrieval? Not sure why.
3. When I gave it a non-existent file, it still processed the RAG, pulling up various pandas notebooks from the PDSH.

            with open(notebook_path, 'r', encoding='utf-8') as f:
                notebook = json.load(f)

            # Extract notebook metadata and cells
            context = "=== NOTEBOOK ANALYSIS ===\n"
            context += f"File: {notebook_path}\n"
            context += f"Kernel: {notebook.get('metadata', {}).get('kernelspec', {}).get('display_name', 'Unknown')}\n"
            context += f"Language: {notebook.get('metadata', {}).get('kernelspec', {}).get('language', 'Unknown')}\n\n"
            cells = notebook.get('cells', [])
            context += f"=== NOTEBOOK CONTENT ({len(cells)} cells) ===\n\n"

            for i, cell in enumerate(cells, 1):
                cell_type = cell.get('cell_type', 'unknown')
                context += f"--- Cell {i} ({cell_type.upper()}) ---\n"
                source = cell.get('source', [])
                if isinstance(source, list):
                    source_text = ''.join(source)
                else:
                    source_text = str(source)

                context += f"SOURCE:\n{source_text}\n"

                # Get cell outputs for code cells
                if cell_type == 'code':
                    outputs = cell.get('outputs', [])
                    if outputs:
                        context += "OUTPUTS:\n"
                        for j, output in enumerate(outputs):
                            output_type = output.get('output_type', 'unknown')
                            context += f" Output {j+1} ({output_type}):\n"
                            if output_type == 'stream':
                                text = ''.join(output.get('text', []))
                                context += f" {text}\n"
                            elif output_type in ('execute_result', 'display_data'):
                                data = output.get('data', {})
                                for mime_type, content in data.items():
                                    if mime_type == 'text/plain':
                                        if isinstance(content, list):
                                            content = ''.join(content)
                                        context += f" {content}\n"
                                    elif mime_type == 'text/html':
                                        context += " [HTML OUTPUT]\n"
                                    elif 'image' in mime_type:
                                        context += f" [IMAGE: {mime_type}]\n"
                            elif output_type == 'error':
                                ename = output.get('ename', 'Error')
                                evalue = output.get('evalue', '')
                                context += f" ERROR: {ename}: {evalue}\n"

                context += "\n"

            # Extract imports and library usage
            imports = self._extract_imports(notebook)
            if imports:
                context += "=== DETECTED LIBRARIES ===\n"
                for imp in imports:
                    context += f"- {imp}\n"
                context += "\n"

            # Extract data science context
            ds_context = self._extract_data_science_context(notebook)
            if ds_context:
                context += f"=== DATA SCIENCE CONTEXT ===\n{ds_context}\n"

            return context

        except json.JSONDecodeError:
            return f"Error: Invalid JSON in notebook file {notebook_path}"
        except Exception as e:
            return f"Error reading notebook {notebook_path}: {str(e)}"

    def _extract_imports(self, notebook: Dict[str, Any]) -> List[str]:
        """Extract import statements from notebook cells."""
        imports = []
        cells = notebook.get('cells', [])

        for cell in cells:
            if cell.get('cell_type') == 'code':
                source = cell.get('source', [])
                if isinstance(source, list):
                    source_text = ''.join(source)
                else:
                    source_text = str(source)

                lines = source_text.split('\n')
                for line in lines:
                    line = line.strip()
                    if line.startswith('import ') or line.startswith('from '):
                        imports.append(line)

        # Deduplicate; sorted for deterministic output
        return sorted(set(imports))

    def _extract_data_science_context(self, notebook: Dict[str, Any]) -> str:
        """Extract data science context from notebook content."""
        context_items = []
        cells = notebook.get('cells', [])

        ds_patterns = {
            'pandas': ['pd.read_', 'DataFrame', '.head()', '.describe()', '.info()'],
            'numpy': ['np.array', 'np.mean', 'np.std', 'numpy'],
            'matplotlib': ['plt.', 'matplotlib', '.plot()', '.show()'],
            'seaborn': ['sns.', 'seaborn'],
            'sklearn': ['sklearn', 'fit()', 'predict()', 'score()'],
            'analysis': ['correlation', 'regression', 'classification', 'clustering'],
            'data_ops': ['merge', 'join', 'groupby', 'pivot', 'melt']
        }

        detected = {category: [] for category in ds_patterns}

        for cell in cells:
            if cell.get('cell_type') == 'code':
                source = cell.get('source', [])
                if isinstance(source, list):
                    source_text = ''.join(source)
                else:
                    source_text = str(source)

                for category, patterns in ds_patterns.items():
                    for pattern in patterns:
                        if pattern.lower() in source_text.lower():
                            detected[category].append(pattern)

        active_categories = {k: list(set(v)) for k, v in detected.items() if v}

        if active_categories:
            context_items.append("Analysis stage indicators:")
            for category, patterns in active_categories.items():
                context_items.append(f" {category}: {', '.join(patterns[:3])}")

        return '\n'.join(context_items) if context_items else ""