# Context Retrieval Persona #13
**Open**: jonahjung22 wants to merge 24 commits into `jupyter-ai-contrib:main` from `jonahjung22:data_science_persona`.
## Commits (24)
All 24 commits are by jonahjung22:

1. `57ed47a` New persona integrating jupyter ai tools
2. `5caff8f` Context Retrieval Persona
3. `461f281` test context file
4. `74ed255` updated toml and wrapper tool
5. `c2d52ed` Increased RAG chunks, specific md file naming, logging
6. `c89afb3` Removed unnecessary files
7. `cc2b99d` updated the names of the files; updated README with persona new capab…
8. `2a24e18` modified toml
9. `20262e4` building out the context persona using pocketflow
10. `772d548` new method for rag based approach using pocketflow
11. `55157b7` Merge branch 'main' into data_science_persona
12. `95b68f8` added test notebook
13. `d910e61` added greetings
14. `82834fc` Separating 1 persona for each PR
15. `e5e188b` cleaned up some code
16. `ad6a220` updated README
17. `a882287` added test files
18. `1ebe1f2` removing some lines
19. `7488430` updated persona code and removed unnecessary components
20. `af883f9` remove unnecessary comments
21. `cb68c1f` updated dependencies
22. `46d26cd` deleted unnecessary folder
23. `dd81447` Changes to the whole RAG structure implemented
24. `b7f6eac` removed unnecessary file
## Files changed

Submodule `jupyter-ai-personas` deleted from `4af5de`.
### `jupyter_ai_personas/context_retrieval_persona/README.md` (214 additions, 0 deletions)
# Context Retrieval Persona

## Overview

The Context Retrieval Persona analyzes your data science notebooks and finds relevant resources from the Python Data Science Handbook using RAG (Retrieval-Augmented Generation). It employs a three-agent system to provide comprehensive analysis and actionable recommendations.

## Features

- **Intelligent Notebook Analysis**: Extracts libraries, analysis stage, domain, and objectives from your notebooks
- **Full Notebook RAG Search**: Returns complete relevant notebooks instead of fragments for comprehensive context
- **Handbook-Only Search**: Avoids redundant searching by focusing on external handbook content only
- **Multi-Agent Coordination**: NotebookAnalyzer, KnowledgeSearcher, and MarkdownGenerator working together
- **Comprehensive Markdown Reports**: Detailed reports with code examples, explanations, and next steps
- **Optimized Search**: 1-2 complete notebooks per query with clean terminal logging
- **Automatic Report Generation**: Creates `repo_context.md` with comprehensive analysis

## Architecture

### Three-Agent System

1. **NotebookAnalyzer**: Extracts structured context from your notebook
   - Uses the `extract_rag_context` tool to read notebook content
   - Identifies libraries (pandas, numpy, sklearn, matplotlib, etc.)
   - Determines the analysis stage (data_loading, eda, preprocessing, modeling, evaluation, visualization)
   - Outputs structured JSON with path, libraries, stage, domain, and objectives

2. **KnowledgeSearcher**: Performs targeted handbook-only RAG searches
   - Generates 4-5 targeted search queries based on the notebook analysis
   - Uses `search_handbook_only` to find relevant complete notebooks
   - Each search returns the 1-2 most relevant notebooks (not fragments)
   - Provides comprehensive handbook content to MarkdownGenerator

3. **MarkdownGenerator**: Creates detailed markdown reports
   - Synthesizes the notebook analysis with RAG search results
   - Includes substantial content from the retrieved handbook notebooks
   - Creates cross-references between your work and handbook examples
   - Saves comprehensive reports as `repo_context.md`

The data flow between the three agents is sketched below.
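As an illustration only: the real persona implements each stage as an agno agent, but the three-stage flow can be sketched as plain functions. Every name and return value below is a hypothetical stand-in.

```python
# Schematic sketch of the three-agent pipeline. The real persona implements
# each stage as an agno agent; all names and bodies here are hypothetical.

def notebook_analyzer(notebook_path: str) -> dict:
    """Stage 1: extract structured context (libraries, stage, objectives)."""
    # The real agent calls the extract_rag_context tool, then has the LLM
    # summarize the result as structured JSON.
    return {
        "path": notebook_path,
        "libraries": ["pandas", "sklearn"],
        "stage": "modeling",
        "objectives": ["classification"],
    }

def knowledge_searcher(analysis: dict) -> list:
    """Stage 2: run 4-5 targeted handbook-only RAG searches."""
    queries = [f"{lib} {analysis['stage']}" for lib in analysis["libraries"]]
    # The real agent calls search_handbook_only(query) for each query and
    # gets back the 1-2 most relevant complete handbook notebooks.
    return [f"[handbook notebooks for: {q}]" for q in queries]

def markdown_generator(analysis: dict, handbook_hits: list) -> str:
    """Stage 3: synthesize analysis and search results into repo_context.md."""
    report = ["# Executive Summary",
              f"Libraries in play: {', '.join(analysis['libraries'])}"]
    report += handbook_hits
    return "\n\n".join(report)

analysis = notebook_analyzer("my_notebook.ipynb")
print(markdown_generator(analysis, knowledge_searcher(analysis)))
```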
## Core Components

### Context Retrieval Persona (`persona.py`)

- Main persona class orchestrating the three-agent system
- Handles Jupyter AI integration and message processing
- Initializes AWS Bedrock models and agent coordination
- Manages greeting detection and team workflow

### RAG Tool (`rag_tool.py`)

Core RAG system with two main classes:

- **RAG**: Loads handbook content into a ChromaDB vectorstore using HuggingFace embeddings
- **RAGTool**: Agno toolkit providing the `search_handbook_only()` function
- Returns complete notebooks (1-2 per search) instead of fragments
- Clean terminal logging showing retrieved notebook titles and stats

A sketch of how chunk-level hits can be grouped back into complete notebooks follows.
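The diff for `rag_tool.py` is not shown on this page, so the following is a minimal sketch, assuming langchain-community's `Chroma` store and chunk metadata that records each chunk's source notebook. The function name matches the README, but its body, the metadata key, and the persist directory are assumptions.

```python
# Illustrative sketch only: group chunk-level similarity hits back into
# complete notebooks via their source metadata. The real rag_tool.py is
# not shown in this diff, so the metadata key and paths are assumptions.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma(persist_directory="vector_stores/rag",
               embedding_function=embeddings)

def search_handbook_only(query: str, max_notebooks: int = 2) -> list:
    # Over-fetch chunks, then keep the first few distinct source notebooks.
    hits = store.similarity_search(query, k=10)
    sources = []
    for doc in hits:
        src = doc.metadata.get("source", "")  # e.g. "05.08-Random-Forests.ipynb"
        if src and src not in sources:
            sources.append(src)
        if len(sources) == max_notebooks:
            break
    # The real tool would then read each notebook in full and return its
    # complete content rather than just the filenames.
    return sources
```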
### Notebook Reader Tool (`file_reader_tool.py`)

- `NotebookReaderTool`: Provides the `extract_rag_context` function
- Reads complete notebook content and metadata
- Extracts context for the NotebookAnalyzer agent

## Installation & Setup

### Prerequisites

Install the context retrieval persona with its dependencies:

```bash
pip install -e ".[context_retriever]"
```

This installs:

- `agno` - Multi-agent framework
- `boto3` - AWS Bedrock integration
- `langchain`, `langchain-core`, and `langchain-community` - RAG framework
- `sentence-transformers` - Embedding models
- `chromadb` - Vector database
- `nbformat` - Jupyter notebook reading

### Set up the Python Data Science Handbook

```bash
# Clone the handbook repository
cd jupyter_ai_personas/context_retrieval_persona/
git clone https://github.com/jakevdp/PythonDataScienceHandbook.git
```

### AWS Configuration

Configure AWS credentials for Bedrock access:

```bash
aws configure
# or set environment variables:
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
```
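To verify Bedrock access before starting Jupyter, a quick check with `boto3` can help; the region here is an example, and model availability varies by account:

```python
# Sanity check that Bedrock is reachable with the configured credentials.
# The region is an example and may differ in your account.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")
models = bedrock.list_foundation_models()["modelSummaries"]
print(f"{len(models)} Bedrock foundation models available")
```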
## Usage

### Basic Usage

In Jupyter AI chat, use an @-mention to activate the persona:

```
@ContextRetrievalPersona notebook: /path/to/your/notebook.ipynb
Analyze my machine learning workflow and find relevant handbook resources
```

### Workflow Example

1. **User Request**: Provides the notebook path and a description
2. **NotebookAnalyzer**: Reads and analyzes the notebook content
3. **KnowledgeSearcher**: Performs 4-5 targeted searches in the handbook
4. **MarkdownGenerator**: Creates a comprehensive `repo_context.md` report

### Terminal Output

During processing, you'll see clean RAG search logs:

```
🔍 RAG SEARCH: 'sklearn RandomForest classification'
📚 Found 2 relevant notebooks:
1. 05.08-Random-Forests.ipynb (15 cells, 12450 chars)
2. 05.03-Hyperparameters-and-Model-Validation.ipynb (22 cells, 18920 chars)
```

### Generated Report Structure

The `repo_context.md` file includes:

- **Executive Summary**: Overview of findings and connections
- **Current Notebook Analysis**: Libraries, stage, domain, and objectives from your notebook
- **Comprehensive Handbook Resources**: Full code examples and explanations from retrieved notebooks
- **Detailed Code Examples**: Complete implementations from the handbook
- **Cross-References and Learning Paths**: Connections between your work and handbook content
- **Actionable Implementation Steps**: Specific next steps based on the analysis

## Technical Details

### RAG Implementation

- **Embedding Model**: `sentence-transformers/all-MiniLM-L6-v2`
- **Vector Store**: ChromaDB with persistent storage
- **Search Strategy**: Similarity search returning complete notebooks (not fragments)
- **Results per Search**: The 2 most relevant complete notebooks
- **Cell-Based Chunking**: Uses notebook cells as natural document boundaries

An illustrative indexing sketch follows.
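Since the indexing code is not shown on this page, here is a minimal sketch, assuming the `langchain-community` integrations listed under Installation. The persist directory matches the file structure below; everything else is illustrative.

```python
# Minimal indexing sketch, assuming langchain-community: build a persistent
# Chroma store with one Document per notebook cell (cell-based chunking).
import glob

import nbformat
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

docs = []
for path in glob.glob("PythonDataScienceHandbook/notebooks/*.ipynb"):
    nb = nbformat.read(path, as_version=4)
    for cell in nb.cells:
        if cell.source.strip():
            # Each non-empty cell becomes one document, tagged with its
            # source notebook so hits can be grouped back into full notebooks.
            docs.append(Document(page_content=cell.source,
                                 metadata={"source": path}))

store = Chroma.from_documents(
    docs,
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
    persist_directory="vector_stores/rag",
)
```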
### Optimizations

- **Handbook-Only Search**: Avoids redundant notebook content in RAG results
- **Complete Notebook Retrieval**: Returns full notebooks instead of fragments for better context
- **One-Time Loading**: Vector store loaded once per session via a `handbook_loaded` flag
- **Clean Logging**: Minimal terminal output showing only essential search information
- **JSON Validation Fix**: Uses `capture_validation_error=None` to suppress nbformat warnings (see the snippet below)
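For reference, `capture_validation_error` is a keyword accepted by `nbformat.read` in nbformat 5.x; reading a notebook with it looks like this (the path is a placeholder):

```python
# Read a notebook while passing capture_validation_error, as the persona
# does; "analysis.ipynb" is a placeholder path.
import nbformat

nb = nbformat.read("analysis.ipynb", as_version=4,
                   capture_validation_error=None)
print(len(nb.cells), "cells read")
```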
## File Structure

```
context_retrieval_persona/
├── README.md                   # This documentation
├── persona.py                  # Main persona class with three-agent system
├── rag_tool.py                 # RAG and RAGTool classes for handbook search
├── file_reader_tool.py         # NotebookReaderTool for content extraction
├── __init__.py                 # Package initialization
├── repo_context.md             # Generated markdown reports
├── PythonDataScienceHandbook/  # Cloned handbook repository
│   └── notebooks/              # 100+ handbook notebooks
└── vector_stores/              # ChromaDB vector storage
    └── rag/                    # Renamed from simple_rag
        ├── chroma.sqlite3
        └── [vector files]
```
## Troubleshooting

### Common Issues

1. **Missing Dependencies**: Install all required packages

   ```bash
   pip install -e ".[context_retriever]"
   ```

2. **Handbook Not Found**: Clone the handbook repository

   ```bash
   cd jupyter_ai_personas/context_retrieval_persona/
   git clone https://github.com/jakevdp/PythonDataScienceHandbook.git
   ```

3. **AWS/Bedrock Issues**: Configure AWS credentials

   ```bash
   aws configure
   ```

4. **JSON Validation Warnings**: These are now suppressed with `capture_validation_error=None`

5. **Vector Store Loading**: The first run builds the vector store (5-10 minutes); subsequent runs are fast

## Contributing

To extend the system:

1. **Enhance RAG Search**: Modify the `RAGTool` class in `rag_tool.py`
2. **Improve Context Extraction**: Update `NotebookReaderTool` in `file_reader_tool.py`
3. **Refine Agent Instructions**: Update the agent prompts in `persona.py`
4. **Add New Analysis Capabilities**: Extend the three-agent system workflow

A hypothetical sketch of the first extension point follows.
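As an illustration of extending the RAG search, new capabilities can be registered following the same `Toolkit` pattern used by `NotebookReaderTool` (shown in full below). The class name, method, and return values here are hypothetical:

```python
# Hypothetical extension sketch following the Toolkit/register() pattern
# visible in file_reader_tool.py; bodies are placeholder stubs.
from agno.tools import Toolkit

class ExtendedRAGTool(Toolkit):
    def __init__(self):
        super().__init__(name="rag_tool")
        self.register(self.search_handbook_only)
        self.register(self.search_by_chapter)  # newly added capability

    def search_handbook_only(self, query: str) -> str:
        return f"[search results for {query!r}]"  # placeholder body

    def search_by_chapter(self, chapter_prefix: str) -> str:
        """Hypothetical: e.g. chapter_prefix='05' for the ML chapter."""
        return f"[notebooks under chapter {chapter_prefix}]"  # placeholder
```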
### `jupyter_ai_personas/context_retrieval_persona/__init__.py` (1 addition, 0 deletions)

```python
"""Data Science Persona package for Jupyter AI."""
```
### `jupyter_ai_personas/context_retrieval_persona/file_reader_tool.py` (160 additions, 0 deletions)
```python
import json
import os
from typing import Dict, Any, List, Optional
from agno.tools import Toolkit


class NotebookReaderTool(Toolkit):
    """Tool for reading and extracting complete content from Jupyter notebooks."""

    def __init__(self):
        super().__init__(name="notebook_reader")
        self.register(self.extract_rag_context)

    def extract_rag_context(self, notebook_path: str) -> str:
        """
        Extract complete content from a Jupyter notebook for RAG context.

        Args:
            notebook_path: Path to the .ipynb notebook file

        Returns:
            str: Formatted string containing all notebook content including cells,
                outputs, markdown, and metadata
        """
        try:
            if not os.path.exists(notebook_path):
                return f"Error: Notebook file not found at {notebook_path}"

            if not notebook_path.endswith('.ipynb'):
                return f"Error: File must be a .ipynb notebook file, got {notebook_path}"

            with open(notebook_path, 'r', encoding='utf-8') as f:
                notebook = json.load(f)

            # Extract notebook metadata and cells
            context = f"=== NOTEBOOK ANALYSIS ===\n"
            context += f"File: {notebook_path}\n"
            context += f"Kernel: {notebook.get('metadata', {}).get('kernelspec', {}).get('display_name', 'Unknown')}\n"
            context += f"Language: {notebook.get('metadata', {}).get('kernelspec', {}).get('language', 'Unknown')}\n\n"
            cells = notebook.get('cells', [])
            context += f"=== NOTEBOOK CONTENT ({len(cells)} cells) ===\n\n"

            for i, cell in enumerate(cells, 1):
                cell_type = cell.get('cell_type', 'unknown')
                context += f"--- Cell {i} ({cell_type.upper()}) ---\n"
                source = cell.get('source', [])
                if isinstance(source, list):
                    source_text = ''.join(source)
                else:
                    source_text = str(source)

                context += f"SOURCE:\n{source_text}\n"

                # Get cell outputs for code cells
                if cell_type == 'code':
                    outputs = cell.get('outputs', [])
                    if outputs:
                        context += f"OUTPUTS:\n"
                        for j, output in enumerate(outputs):
                            output_type = output.get('output_type', 'unknown')
                            context += f"  Output {j+1} ({output_type}):\n"
                            if output_type == 'stream':
                                text = ''.join(output.get('text', []))
                                context += f"    {text}\n"
                            elif output_type == 'execute_result' or output_type == 'display_data':
                                data = output.get('data', {})
                                for mime_type, content in data.items():
                                    if mime_type == 'text/plain':
                                        if isinstance(content, list):
                                            content = ''.join(content)
                                        context += f"    {content}\n"
                                    elif mime_type == 'text/html':
                                        context += f"    [HTML OUTPUT]\n"
                                    elif 'image' in mime_type:
                                        context += f"    [IMAGE: {mime_type}]\n"
                            elif output_type == 'error':
                                ename = output.get('ename', 'Error')
                                evalue = output.get('evalue', '')
                                context += f"    ERROR: {ename}: {evalue}\n"

                context += "\n"

            # Extract imports and library usage
            imports = self._extract_imports(notebook)
            if imports:
                context += f"=== DETECTED LIBRARIES ===\n"
                for imp in imports:
                    context += f"- {imp}\n"
                context += "\n"

            # Extract data science context
            ds_context = self._extract_data_science_context(notebook)
            if ds_context:
                context += f"=== DATA SCIENCE CONTEXT ===\n{ds_context}\n"

            return context

        except json.JSONDecodeError:
            return f"Error: Invalid JSON in notebook file {notebook_path}"
        except Exception as e:
            return f"Error reading notebook {notebook_path}: {str(e)}"

    def _extract_imports(self, notebook: Dict[str, Any]) -> List[str]:
        """Extract import statements from notebook cells."""
        imports = []
        cells = notebook.get('cells', [])

        for cell in cells:
            if cell.get('cell_type') == 'code':
                source = cell.get('source', [])
                if isinstance(source, list):
                    source_text = ''.join(source)
                else:
                    source_text = str(source)

                lines = source_text.split('\n')
                for line in lines:
                    line = line.strip()
                    if line.startswith('import ') or line.startswith('from '):
                        imports.append(line)

        return list(set(imports))

    def _extract_data_science_context(self, notebook: Dict[str, Any]) -> str:
        """Extract data science context from notebook content."""
        context_items = []
        cells = notebook.get('cells', [])

        ds_patterns = {
            'pandas': ['pd.read_', 'DataFrame', '.head()', '.describe()', '.info()'],
            'numpy': ['np.array', 'np.mean', 'np.std', 'numpy'],
            'matplotlib': ['plt.', 'matplotlib', '.plot()', '.show()'],
            'seaborn': ['sns.', 'seaborn'],
            'sklearn': ['sklearn', 'fit()', 'predict()', 'score()'],
            'analysis': ['correlation', 'regression', 'classification', 'clustering'],
            'data_ops': ['merge', 'join', 'groupby', 'pivot', 'melt']
        }

        detected = {category: [] for category in ds_patterns.keys()}

        for cell in cells:
            if cell.get('cell_type') == 'code':
                source = cell.get('source', [])
                if isinstance(source, list):
                    source_text = ''.join(source)
                else:
                    source_text = str(source)

                for category, patterns in ds_patterns.items():
                    for pattern in patterns:
                        if pattern.lower() in source_text.lower():
                            detected[category].append(pattern)

        active_categories = {k: list(set(v)) for k, v in detected.items() if v}

        if active_categories:
            context_items.append("Analysis stage indicators:")
            for category, patterns in active_categories.items():
                context_items.append(f"  {category}: {', '.join(patterns[:3])}")

        return '\n'.join(context_items) if context_items else ""
```
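For a quick standalone check of this tool outside the agent team (the notebook path is a placeholder):

```python
# Standalone usage of the tool defined above; "analysis.ipynb" is a
# placeholder path to any local notebook.
tool = NotebookReaderTool()
print(tool.extract_rag_context("analysis.ipynb")[:500])
```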
**Review comment:** Given a `.py` file instead of a notebook `.ipynb` file, it still processed the context retrieval? Not sure why.