# Notebook Processing Tools

This directory contains tools for processing Jupyter notebooks and setting up data sources for hybrid RAG pipelines.

## remove_images.py

A Python script that uses regular expressions to remove embedded base64-encoded images from Python files that were converted from Jupyter notebooks using jupytext.

### Features

- Removes base64 data URL images (e.g., `![Screenshot 1](data:image/png;base64,...)`)
- Cleans up extra empty lines left behind after image removal
- Can either overwrite the original file or create a new cleaned file
- Reports the number of images found and removed

### Usage

```bash
# Remove images from a file (overwrites the original)
python remove_images.py <input_file.py>

# Remove images and save to a new file
python remove_images.py <input_file.py> <output_file.py>
```

### Examples

```bash
# Clean the converted notebook file in place
python remove_images.py ../donor-notebooks/S3_to_Qdrant_Workflow_using_Unstructured_API.py

# Create a cleaned copy
python remove_images.py notebook.py cleaned_notebook.py
```

### Requirements

- Python 3.6+
- No external dependencies (uses only standard-library modules: `re`, `sys`, `os`, `pathlib`)

### How it works

The script uses a regular expression to identify and remove Markdown-style image references with base64 data URLs:

```python
image_pattern = r'!\[.*?\]\(data:image/[^;]+;base64,[A-Za-z0-9+/=]+\)'
```

This pattern matches:

- `![...]` – Markdown image syntax
- `(data:image/...)` – a data URL with an image MIME type
- `;base64,` – the base64 encoding indicator
- `[A-Za-z0-9+/=]+` – the base64-encoded payload
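As a sketch, the pattern can be applied with `re.subn`, followed by a second pass that collapses the blank lines removed images leave behind. `strip_base64_images` is an illustrative helper name, not necessarily the function used in the script:

```python
import re

# The pattern described above: Markdown images whose URL is a base64 data URL.
image_pattern = r'!\[.*?\]\(data:image/[^;]+;base64,[A-Za-z0-9+/=]+\)'

def strip_base64_images(text):
    """Remove embedded base64 images, then collapse the blank lines left behind."""
    cleaned, count = re.subn(image_pattern, '', text)
    cleaned = re.sub(r'\n{3,}', '\n\n', cleaned)  # tidy runs of empty lines
    return cleaned, count

sample = "# Intro\n\n![Screenshot 1](data:image/png;base64,iVBORw0KGgo=)\n\n\nprint('hi')\n"
cleaned, n = strip_base64_images(sample)  # n == 1, image gone, code kept
```

`re.subn` returns both the cleaned text and the match count, which is how the script can report how many images it removed.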

### Workflow Example

1. Convert the Jupyter notebook to a Python file using jupytext:

   ```bash
   jupytext --to py notebook.ipynb
   ```

2. Remove the embedded images from the converted file:

   ```bash
   python remove_images.py notebook.py
   ```

The result is a clean Python file without embedded base64 images, making it more readable and significantly smaller.

## elasticsearch_setup.py

A Python script that creates and populates an Elasticsearch index with NER-rich synthetic sales data for Bose products. The resulting index serves as one of the source connectors in a hybrid RAG pipeline.

### Features

- **Elastic Cloud Integration**: Connects directly to your Elasticsearch Cloud deployment
- **NER-Optimized Data**: Generates synthetic sales records rich in named entities (people, organizations, locations, prices, dates)
- **Semantic Text Support**: Uses the `semantic_text` field type for enhanced search capabilities
- **Bose Product Focus**: Covers the SoundSport, OpenAudio, and QuietComfort product lines
- **Realistic Sales Scenarios**: Creates contextual sales interactions with detailed customer information
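To make the `semantic_text` feature concrete, here is a sketch of what an index mapping along these lines could look like. The field names mirror the example record below; the exact mapping used by the script is an assumption, and `semantic_text` requires a recent Elasticsearch release with an inference endpoint configured:

```python
# Hypothetical mapping sketch: exact-match keyword fields for filtering,
# plus a semantic_text field that Elasticsearch chunks and embeds for
# semantic search. Field names follow the example record in this README.
sales_mapping = {
    "mappings": {
        "properties": {
            "customer_name": {"type": "keyword"},
            "sales_representative": {"type": "keyword"},
            "product_model": {"type": "keyword"},
            "retailer": {"type": "keyword"},
            "location_city": {"type": "keyword"},
            "price": {"type": "float"},
            "interaction_text": {"type": "text"},
            "text": {"type": "semantic_text"},
        }
    }
}
# Creating the index would then be e.g.:
# es.indices.create(index="sales-records", mappings=sales_mapping["mappings"])
```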

### Setup

1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Configure the environment:

   ```bash
   # Copy the template and add your credentials
   cp env_template.txt .env
   # Edit .env and add your ELASTIC_API_KEY
   ```

3. Run the setup:

   ```bash
   python elasticsearch_setup.py
   ```
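Conceptually, loading `ELASTIC_API_KEY` from `.env` amounts to parsing `KEY=VALUE` lines into the environment. This stdlib-only sketch is a simplified stand-in for what `python-dotenv` does (it skips quoting and multiline values, which the real library handles):

```python
def parse_env_lines(lines):
    """Return {KEY: VALUE} from KEY=VALUE lines, skipping comments and blanks."""
    env = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

creds = parse_env_lines(["# credentials", "ELASTIC_API_KEY=abc123", ""])
# The setup script would then read the key, e.g.:
# os.environ.update(creds); api_key = os.environ["ELASTIC_API_KEY"]
```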

### Generated Data Structure

Each sales record contains rich named entities suited to NER extraction:

- **PERSON**: Customer names, sales representatives
- **ORG**: Retailers (Best Buy, Target, Amazon, etc.)
- **LOCATION**: Cities and regions across the US
- **MONEY**: Product prices and revenue potential
- **DATE**: Timestamps, quarters, months
- **PRODUCT**: Bose product lines and specific models

### Example Generated Record

```json
{
  "customer_name": "Jennifer Martinez",
  "sales_representative": "Michael Chen",
  "product_model": "SoundSport Free",
  "price": 149,
  "retailer": "Best Buy",
  "location_city": "New York, NY",
  "interaction_text": "Customer Jennifer Martinez from New York, NY called to inquire about purchasing the SoundSport Free. Sales rep Michael Chen provided detailed product information and quoted $149. Customer is comparing with similar products at Best Buy.",
  "text": "Customer Jennifer Martinez from New York, NY called to inquire about purchasing the SoundSport Free. Sales rep Michael Chen provided detailed product information and quoted $149. Customer is comparing with similar products at Best Buy."
}
```
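A record of this shape could be generated along the following lines. This is a sketch, not the script itself: the real script uses Faker for names, and the models and prices other than the SoundSport Free are placeholder assumptions:

```python
import random

# Placeholder catalog; only "SoundSport Free" at $149 appears in this README.
PRODUCTS = {"SoundSport Free": 149, "QuietComfort 45": 329, "OpenAudio Frames": 199}
RETAILERS = ["Best Buy", "Target", "Amazon"]
CITIES = ["New York, NY", "Austin, TX", "Seattle, WA"]

def make_sales_record(customer, rep):
    """Build one synthetic sales record matching the example structure above."""
    model = random.choice(list(PRODUCTS))
    price = PRODUCTS[model]
    retailer = random.choice(RETAILERS)
    city = random.choice(CITIES)
    text = (
        f"Customer {customer} from {city} called to inquire about purchasing "
        f"the {model}. Sales rep {rep} provided detailed product information "
        f"and quoted ${price}. Customer is comparing with similar products "
        f"at {retailer}."
    )
    return {
        "customer_name": customer,
        "sales_representative": rep,
        "product_model": model,
        "price": price,
        "retailer": retailer,
        "location_city": city,
        "interaction_text": text,
        "text": text,  # duplicated into the semantic_text field for search
    }

record = make_sales_record("Jennifer Martinez", "Michael Chen")
# The setup script would then index it, e.g.:
# es.index(index="sales-records", document=record)
```

Embedding the entities (person, product, price, retailer, city) directly into `interaction_text` is what makes the records useful for downstream NER enrichment.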

### Integration with Unstructured Workflow

This Elasticsearch index can be used as a source connector in the Unstructured Workflow Endpoint, alongside S3 technical documentation, to create a hybrid RAG system:

1. **S3 source**: Technical manuals, troubleshooting guides, MSDS PDFs
2. **Elasticsearch source**: Synthetic sales data (this script)
3. **NER enrichment**: Extract named entities from both sources
4. **Qdrant destination**: Combined processed data for RAG queries

### Requirements

See `requirements.txt` for Python dependencies:

- `elasticsearch>=8.0.0`
- `python-dotenv>=0.19.0`
- `faker>=15.0.0`

## verify_elasticsearch_data.py

A verification script that inspects and validates the synthetic sales data in your Elasticsearch index. Run it to confirm that the data was uploaded successfully and is ready for NER processing.

### Features

- **Connection Testing**: Verifies Elasticsearch cluster connectivity
- **Index Validation**: Confirms the index exists and contains data
- **Data Statistics**: Reports document count, index size, and data distribution
- **Sample Document Display**: Shows actual records with key fields
- **NER Readiness Check**: Validates that the data contains rich named entities
- **Search Query Testing**: Runs several search patterns to confirm the data is accessible

### Usage

```bash
python verify_elasticsearch_data.py
```

### What It Checks

1. **Basic connectivity**: Tests the connection to your Elasticsearch cluster
2. **Index existence**: Confirms the `sales-records` index exists
3. **Document count**: Reports the total number of indexed documents
4. **Data distribution**: Analyzes the breakdown by:
   - Product lines (SoundSport, OpenAudio, QuietComfort)
   - Product models
   - Retailers (Best Buy, Target, Amazon, etc.)
   - Geographic regions
   - Interaction types
   - Price and revenue statistics
   - Temporal distribution (by year)
5. **NER entity validation**: Confirms the presence of:
   - Person names (customers, sales reps)
   - Organizations (retailers)
   - Locations (cities, regions)
   - Monetary values (prices)
   - Dates (timestamps)
   - Rich text content
6. **Search functionality**: Runs sample queries to verify the data is searchable
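The distribution checks above map naturally onto Elasticsearch aggregations. As a sketch, a single request body could compute several of them at once; the field names follow the example record, and the exact queries the script issues are an assumption:

```python
# Hypothetical aggregation body for the data-distribution checks: term counts
# per retailer and product model, plus summary statistics over prices.
distribution_query = {
    "size": 0,  # aggregations only, no document hits
    "aggs": {
        "by_retailer": {"terms": {"field": "retailer"}},
        "by_product": {"terms": {"field": "product_model"}},
        "price_stats": {"stats": {"field": "price"}},
    },
}
# The script would run it against the index, e.g.:
# es.search(index="sales-records", size=0, aggs=distribution_query["aggs"])
```

Setting `size` to 0 keeps the response small: the verifier only needs the bucket counts and statistics, not the matching documents.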

### Sample Output

```text
🚀 Starting Elasticsearch Data Verification
============================================================
🔧 Testing Elasticsearch connection...
✅ Connected to Elasticsearch cluster: instance-0000000000
   Version: 8.11.0
   Cluster: 2371b9a1d2ad40c590fd1e22652a8236

✅ Index 'sales-records' exists
📊 Getting statistics for index 'sales-records'...
   📄 Total documents: 500
   💾 Index size: 245,760 bytes (0.23 MB)
   🔧 Primary shards: 12

📋 Retrieving 5 sample documents...
📄 Document 1:
   🆔 ID: abc123-def456
   👤 Customer: Jennifer Martinez
   🏷️ Product: SoundSport Free
   💰 Price: $149
   🏪 Retailer: Best Buy
   📍 Location: New York, NY
   📅 Date: 2023-11-15T14:30:00
   📝 Text: Customer Jennifer Martinez from New York, NY called to inquire about purchasing the SoundSport...
```

## About

Two sources (S3, Elasticsearch) to RAG DB pipeline.
