Sources Guide

Open Notebook is the central hub for your research materials and accepts documents, media files, web content, images, and archives. This guide covers adding, managing, and organizing sources in your notebooks.

Supported File Types and Formats

Open Notebook uses the content-core library to process content and selects an extraction engine based on the content type.

📄 Documents

PDF - Research papers, reports, books
EPUB - E-books and digital publications
Microsoft Office:
- Word documents (.docx, .doc)
- PowerPoint presentations (.pptx, .ppt)
- Excel spreadsheets (.xlsx, .xls)
Text files - Plain text (.txt), Markdown (.md)
HTML - Web pages and HTML files

🎥 Media Files

Video formats:
- MP4, AVI, MOV, WMV
- Automatic transcription to text
Audio formats:
- MP3, WAV, M4A, AAC
- Speech-to-text conversion

🌐 Web Content

URLs - Any web page, blog post, or article
YouTube videos - Automatic transcript extraction
News articles - Automatic content extraction

🖼️ Images

JPG, PNG, TIFF - With OCR text recognition
Screenshots - Perfect for capturing visual information

📦 Archives

ZIP, TAR, GZ - Compressed file support

Adding Sources Step-by-Step

Method 1: Adding Links

Navigate to your notebook
Click "Add Source"
Select "Link" option
Enter the URL in the text field
Configure options (see Configuration Options below)
Click "Process"

Examples:

Research articles: https://arxiv.org/abs/2301.00001
YouTube videos: https://www.youtube.com/watch?v=dQw4w9WgXcQ
News articles: https://example.com/article
Blog posts: https://blog.example.com/post
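
The example links above fall into two groups: YouTube videos, which go through transcript extraction, and ordinary web pages, which go through content extraction. The following is a minimal sketch of how a script could tell them apart before submitting a link. The function name `classify_link` and the host list are illustrative assumptions, not part of Open Notebook.

```python
from urllib.parse import urlparse

# Assumed set of YouTube hostnames; extend it if you use other mirrors.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_link(url: str) -> str:
    """Return "youtube" or "web" for an http(s) URL (hypothetical helper)."""
    parsed = urlparse(url)
    # Reject anything that is not a full http(s) URL with a hostname.
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"Not a valid http(s) URL: {url!r}")
    # urlparse already lowercases the hostname.
    if parsed.hostname in YOUTUBE_HOSTS:
        return "youtube"
    return "web"
```

A URL without a scheme, such as `example.com/article`, raises `ValueError` here, which is a cheap way to catch typos before processing starts.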

Method 2: Uploading Files

Navigate to your notebook
Click "Add Source"
Select "Upload" option
Click "Choose File" and select your document
Configure options (see Configuration Options below)
Click "Process"

Supported formats:

Documents: PDF, DOCX, PPTX, XLSX, EPUB, TXT, MD
Media: MP4, MP3, WAV, M4A (requires a configured speech-to-text model)
Images: JPG, PNG, TIFF (with OCR)
Archives: ZIP, TAR, GZ
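
If you upload files from a script, it can help to check the extension first and avoid an "Unsupported file type" error. The sketch below maps extensions to the categories listed under Supported File Types and Formats. The helper `classify_upload` is hypothetical, the `.html` extension is an assumption, and the extensions the server actually accepts may differ.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the "Supported File Types and Formats" section.
SUPPORTED_EXTENSIONS = {
    "document": {".pdf", ".epub", ".docx", ".doc", ".pptx", ".ppt",
                 ".xlsx", ".xls", ".txt", ".md", ".html"},
    "media": {".mp4", ".avi", ".mov", ".wmv", ".mp3", ".wav", ".m4a", ".aac"},
    "image": {".jpg", ".png", ".tiff"},
    "archive": {".zip", ".tar", ".gz"},
}


def classify_upload(filename: str) -> Optional[str]:
    """Return the category for a filename, or None if the extension is not listed."""
    # Only the final suffix is checked, so "backup.tar.gz" is matched as ".gz".
    suffix = Path(filename).suffix.lower()
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None
```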

Method 3: Adding Text Content

Navigate to your notebook
Click "Add Source"
Select "Text" option
Paste or type your content in the text area
Configure options (see Configuration Options below)
Click "Process"

Use cases:

Meeting notes or transcripts
Research findings
Interview transcripts
Code snippets or documentation

Configuration Options

Transformations

Apply AI-powered transformations to extract insights from your sources:

Summary - Generate concise summaries
Key Points - Extract main ideas and takeaways
Questions - Generate questions for further research
Analysis - Provide detailed analysis of content
Custom transformations - Create your own prompts

Embedding Options

Choose how content should be embedded for vector search:

Ask every time - Prompt for each source
Always embed - Automatically embed all sources
Never embed - Skip embedding (can be done later)

Note: Embedding enables AI-powered search and context retrieval but uses tokens from your AI provider.
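
The three embedding options amount to a small decision rule. Here is a sketch of that rule, assuming a hypothetical `EmbeddingPolicy` enum and `should_embed` function. These names only illustrate the behavior described above and are not Open Notebook's internals.

```python
from enum import Enum
from typing import Optional


class EmbeddingPolicy(Enum):
    ASK = "ask every time"
    ALWAYS = "always embed"
    NEVER = "never embed"


def should_embed(policy: EmbeddingPolicy, user_choice: Optional[bool] = None) -> bool:
    """Decide whether a new source is embedded (hypothetical helper)."""
    if policy is EmbeddingPolicy.ALWAYS:
        return True
    if policy is EmbeddingPolicy.NEVER:
        # Embedding is skipped now; it can still be run later.
        return False
    # "Ask every time": the answer must come from the user for each source.
    if user_choice is None:
        raise ValueError("policy is ASK but no user choice was provided")
    return user_choice
```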

File Management

Delete after processing - Remove uploaded files from server after processing
Keep files - Retain files on server (useful for archival)

Source Management and Organization

Viewing Source Details

Click the "Expand" button on any source to view:

Full extracted content
Generated insights (transformations)
Processing metadata
Embedded chunk information

Context Configuration

Control how sources are included in AI conversations:

🚫 Not in Context - Exclude from AI context
📄 Summary - Include summary only (recommended)
📋 Full Content - Include complete content (uses more tokens)
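
To make the effect of these three settings concrete, the sketch below assembles a conversation context from a list of sources. The `Source` dataclass, the `ContextMode` enum, and the output layout are assumptions made for illustration. Open Notebook's real context format is not described in this guide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class ContextMode(Enum):
    NOT_IN_CONTEXT = "not in context"
    SUMMARY = "summary"
    FULL_CONTENT = "full content"


@dataclass
class Source:
    title: str
    summary: str
    full_content: str
    mode: ContextMode = ContextMode.SUMMARY  # summary is the recommended setting


def build_context(sources: Iterable[Source]) -> str:
    """Join the per-source text selected by each source's context mode."""
    parts = []
    for source in sources:
        if source.mode is ContextMode.NOT_IN_CONTEXT:
            continue  # excluded sources contribute nothing
        if source.mode is ContextMode.SUMMARY:
            body = source.summary
        else:
            body = source.full_content
        parts.append(f"## {source.title}\n{body}")
    return "\n\n".join(parts)
```

Because the summary is usually much shorter than the full content, switching a source from Full Content to Summary shrinks the text sent to the model, which is why the summary setting uses fewer tokens.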

Source Metadata

Each source includes:

Title - Extracted or custom title
Topics - Automatically detected or manually added tags
Created/Updated - Timestamps for tracking
Embedded chunks - Number of vector embeddings
Insights count - Number of generated insights

Searching Sources

Use the search functionality to find specific sources:

Text search - Search titles and content
Vector search - Semantic similarity search
Filter by notebook - View sources from specific notebooks
Filter by type - URLs, uploads, or text content
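
Text search matches the words you type, while vector search compares embeddings and can find passages that use different wording. The sketch below ranks embedded chunks by cosine similarity, a common measure for this kind of comparison. This guide does not state which similarity measure Open Notebook uses, so treat that choice and the function names as assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query: Sequence[float], chunks: Dict[str, Sequence[float]]) -> List[str]:
    """Return chunk names ordered from most to least similar to the query."""
    return sorted(
        chunks,
        key=lambda name: cosine_similarity(query, chunks[name]),
        reverse=True,
    )
```

In practice the vectors come from your configured embedding model and have hundreds or thousands of dimensions. The two-dimensional vectors used in tests are only for readability.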

Source Processing and Transformation

Content Extraction Engines

Open Notebook selects an extraction engine automatically from the following:

Docling - PDF and Office documents (default)
PyMuPDF - Lightweight PDF processing
Firecrawl - Enhanced web scraping
Jina - Advanced content extraction
BeautifulSoup - Standard web scraping

Processing Workflow

Upload/URL submission - Source is received
Engine selection - Best extraction method chosen
Content extraction - Text and metadata extracted
Transformation application - AI insights generated
Embedding creation - Vector embeddings for search
Storage - Content saved to database
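
The workflow above can be sketched as a fixed sequence in which transformations and embedding are optional, since both depend on the options chosen when the source is added. The `ProcessingJob` type and step names below are hypothetical and only mirror the order of the list. They do not reflect the actual implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessingJob:
    raw_input: str                                   # file path, URL, or pasted text
    transformations: List[str] = field(default_factory=list)
    embed: bool = False
    completed_steps: List[str] = field(default_factory=list)


def run_pipeline(job: ProcessingJob) -> ProcessingJob:
    """Record the steps a source passes through, in order (illustrative only)."""
    job.completed_steps.append("received")
    job.completed_steps.append("engine_selected")
    job.completed_steps.append("content_extracted")
    if job.transformations:          # skipped when no transformation is selected
        job.completed_steps.append("transformations_applied")
    if job.embed:                    # skipped when embedding is turned off
        job.completed_steps.append("embeddings_created")
    job.completed_steps.append("stored")
    return job
```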

Speech-to-Text Processing

For audio and video files:

Audio extraction - Video converted to audio
Transcription - Speech converted to text
Content processing - Standard text processing applied
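
If a video fails to process, one workaround is to extract the audio track yourself and upload that instead. This guide does not say which tool Open Notebook uses for audio extraction. The sketch below assumes you have ffmpeg installed locally and builds a command that drops the video stream (`-vn`) and writes mono (`-ac 1`), 16 kHz (`-ar 16000`) WAV audio. The function name and the chosen audio settings are assumptions.

```python
from pathlib import Path
from typing import List


def build_audio_extraction_command(video_path: str) -> List[str]:
    """Build (but do not run) an ffmpeg command that extracts audio to WAV."""
    source = Path(video_path)
    target = source.with_suffix(".wav")  # same name, .wav extension
    return [
        "ffmpeg",
        "-i", str(source),   # input file
        "-vn",               # drop the video stream
        "-ac", "1",          # mono
        "-ar", "16000",      # 16 kHz sample rate
        str(target),
    ]
```

To run the command, pass the returned list to `subprocess.run(command, check=True)`.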

Requirements:

A configured speech-to-text model (for example, OpenAI Whisper)
A supported audio or video format

Best Practices

Content Organization

Use descriptive titles - Edit auto-generated titles for clarity
Add relevant topics - Tag sources for better categorization
Group related sources - Keep related materials in same notebook
Regular cleanup - Remove outdated or irrelevant sources

Performance Optimization

Selective embedding - Only embed sources you'll search
Context management - Use summary context when possible
Batch processing - Add multiple sources at once
File cleanup - Enable automatic file deletion

Cost Management

Monitor token usage - Track embedding and transformation costs
Use summary context - Reduce token consumption in conversations
Selective transformations - Only apply needed transformations
Provider selection - Choose cost-effective AI providers

Limitations and Considerations

File Size Limits

Maximum upload size - Depends on server configuration
Processing time - Large files take longer to process
Memory usage - Very large files may cause processing issues

Format Limitations

Scanned PDFs - May require OCR processing
Password-protected files - Cannot be processed
Corrupted files - Will fail processing gracefully
Proprietary formats - Some formats may not be supported

Language Support

YouTube transcripts - Configurable preferred languages
Multi-language content - Supported by AI models
OCR accuracy - Varies by image quality and language

Privacy and Security

File storage - Uploaded files are removed after processing when "Delete after processing" is enabled
Content persistence - Extracted text stored in database
AI processing - Content sent to configured AI providers
Access control - Password protection available

Troubleshooting Source Issues

Common Problems and Solutions

"Unsupported file type" error

Solution:

Check the supported formats list above
Ensure file is not corrupted
Try converting to a supported format

"No transcript found" for YouTube videos

Solution:

Verify video has captions/subtitles
Check YouTube transcript language preferences
Try manually uploading audio if available

"Processing failed" for documents

Solution:

Ensure file is not password-protected
Check file size (try smaller files)
Verify file is not corrupted
Try a different processing engine in Settings

"Audio/video upload disabled" warning

Solution:

Configure a speech-to-text model under Models
Ensure provider API keys are set
Check model availability

Embedding fails or takes too long

Solution:

Check embedding model configuration
Verify API key and quota limits
Try processing without embedding first
Check content length (very long content may fail)

Getting Help

Check server logs - Enable debug logging for detailed error info
GitHub Issues - Report bugs or request features
Discord Community - Get help from other users
Documentation - Review setup and configuration guides

Advanced Features

Custom Transformations

Create your own AI-powered transformations:

Navigate to Settings → Transformations
Click "Create New"
Define your prompt template
Set default application preferences
Test with sample content
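
Before saving a prompt template, you can draft and test it locally. The sketch below uses Python's `string.Template` with two placeholders, `$count` and `$content`. Both placeholder names and the `$` syntax are assumptions for illustration. The template syntax Open Notebook expects is not covered in this guide.

```python
from string import Template

# Hypothetical custom transformation: extract key terms with definitions.
KEY_TERMS_PROMPT = Template(
    "List the $count most important technical terms in the text below, "
    "each with a one-sentence definition.\n\n$content"
)


def render_prompt(template: Template, content: str, **params: object) -> str:
    """Fill the template; raises KeyError if a placeholder has no value."""
    return template.substitute(content=content, **params)
```

Rendering the prompt with a short sample paragraph makes it easy to check the wording before applying the transformation to real sources.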

Bulk Operations

Multiple file upload - Select multiple files at once
Batch transformations - Apply to multiple sources
Bulk embedding - Process multiple sources for search

API Integration

Use the REST API for programmatic source management:

Create sources - POST /api/sources
List sources - GET /api/sources
Get source details - GET /api/sources/{id}
Update source - PUT /api/sources/{id}
Delete source - DELETE /api/sources/{id}
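
As a minimal sketch, the endpoints above can be called with Python's standard library. The base URL and the JSON payload fields are assumptions, since this guide lists only the routes. Check your deployment's API documentation for the actual address, request schema, and any authentication headers.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "http://localhost:5055"  # assumed address; adjust to your server


def build_source_request(
    method: str,
    source_id: Optional[str] = None,
    payload: Optional[dict] = None,
) -> Request:
    """Build a request for /api/sources or /api/sources/{id} without sending it."""
    url = f"{BASE_URL}/api/sources"
    if source_id is not None:
        url = f"{url}/{source_id}"
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(url, data=data, headers=headers, method=method)


# Hypothetical payload: the field name "url" is an assumption.
create = build_source_request("POST", payload={"url": "https://example.com/article"})
delete = build_source_request("DELETE", source_id="abc123")
```

To send a request, pass it to `urllib.request.urlopen(create)` and read the response body. Building the request separately from sending it keeps the URL and payload easy to inspect.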

Automation

Auto-embedding - Configure default embedding behavior
Default transformations - Apply specific transformations to all sources
File cleanup - Automatic deletion of temporary files
Regular processing - Schedule source updates

Integration Examples

Research Workflow

Add research papers (PDF uploads)
Include relevant articles (URL links)
Add meeting notes (text content)
Apply analysis transformation to extract insights
Enable embedding for cross-source search
Use summary context for efficient AI conversations

Content Creation Workflow

Gather reference materials (mixed formats)
Apply summary transformations for quick overviews
Extract key points for outline creation
Use full content context for detailed writing
Search across sources for specific information

Learning and Study Workflow

Upload course materials (PDFs, videos)
Add supplementary articles (web links)
Create study notes (text content)
Apply question generation for self-testing
Use vector search for concept lookup
Generate summaries for review

Experiment with different configurations to find the workflow that best fits your use case.
