# 📄 Unsiloed AI Document Data Extractor

A super simple way to extract text from documents for intelligent document processing, extraction, and chunking with multi-threaded processing capabilities.

## 🚀 Features

### 📊 Document Chunking

- **Supported File Types**: PDF, DOCX, PPTX
- **Chunking Strategies**:
- **Fixed Size**: Splits text into chunks of specified size with optional overlap
- **Page-based**: Splits PDF by pages (PDF only, falls back to paragraph for other file types)
- **Semantic**: Uses LLM to identify meaningful semantic chunks
- **Paragraph**: Splits text by paragraphs
- **Heading**: Splits text by identified headings
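
To make the fixed-size strategy concrete, here is a minimal sketch of fixed-size chunking with overlap. This is an illustration of the idea, not Unsiloed's actual implementation; the function name and signature are invented for this example.

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 100):
    """Split text into chunks of chunk_size characters, with each chunk
    sharing its first `overlap` characters with the end of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap preserves context across chunk boundaries, which helps downstream retrieval and extraction.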

### 🤖 Model Provider Support

- **OpenAI**: GPT-4 and other OpenAI models
- **Anthropic**: Claude models
- **HuggingFace**: Local inference with transformers
- **Configurable**: Easy to add new model providers
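
A provider registry is one simple way such configurability can be structured. The sketch below is hypothetical, not Unsiloed's internals; the default model identifiers are taken from examples elsewhere in this README and may need updating.

```python
PROVIDER_DEFAULTS = {
    "openai": "gpt-4o",
    "anthropic": "claude-3-opus-20240229",
    "huggingface": "mistralai/Mistral-7B-Instruct-v0.2",
}

def resolve_model(provider: str, model_config=None) -> str:
    """Pick the model to use: explicit config wins, otherwise the provider default."""
    if provider not in PROVIDER_DEFAULTS:
        raise ValueError(f"unknown provider: {provider}")
    cfg = model_config or {}
    # HuggingFace configs use "model_name"; OpenAI/Anthropic use "model"
    return cfg.get("model") or cfg.get("model_name") or PROVIDER_DEFAULTS[provider]
```

Adding a new provider then only requires a new registry entry plus the client code that calls it.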

## 🔧 Technical Details

### 🧠 LLM Integration

- Supports multiple model providers:
- OpenAI GPT-4 for semantic chunking
- Anthropic Claude models
- HuggingFace models for local inference
- Handles authentication via API keys from environment variables
- Implements automatic retries and timeout handling
- Provides structured JSON output for semantic chunks
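
The retry-and-timeout behavior can be sketched as a small wrapper like the one below. This is a simplified illustration under the constraints stated in this README (3 retries, 60-second timeout), not the library's actual code; real code would catch provider-specific exceptions rather than bare `Exception`.

```python
import time

def call_with_retries(request_fn, max_retries=3, timeout=60, base_delay=1.0):
    """Call request_fn(timeout=...), retrying on failure with exponential backoff."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return request_fn(timeout=timeout)
        except Exception as error:  # illustrative; narrow this in real code
            last_error = error
            if attempt < max_retries - 1:
                # wait base_delay, 2*base_delay, 4*base_delay, ... between attempts
                time.sleep(base_delay * (2 ** attempt))
    raise last_error
```
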

### 🔄 Parallel Processing

- Multi-threaded processing for improved performance
- Parallel page extraction from PDFs
- Distributes processing of large documents across multiple threads
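
Parallel page extraction can be sketched with a thread pool, as below. This is a hedged illustration rather than Unsiloed's implementation; `extract_page` stands in for whatever per-page extraction function is used (e.g. one built on PyPDF2).

```python
from concurrent.futures import ThreadPoolExecutor

def extract_pages_parallel(extract_page, num_pages, max_workers=4):
    """Run extract_page(page_index) for every page concurrently.

    pool.map returns results in page order even when pages finish
    out of order, so the extracted text stays correctly sequenced.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_page, range(num_pages)))
```

Threads work well here because page extraction is I/O-bound and spends most of its time waiting on reads.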

### 📝 Document Processing

- Extracts text from PDF, DOCX, and PPTX files
- Handles image encoding for vision-based models
- Generates extraction prompts for structured data extraction
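
Image encoding for vision-based models typically means base64-encoding the image bytes before embedding them in the request payload. A minimal version of that step:

```python
import base64

def encode_image_for_vision(image_bytes: bytes) -> str:
    """Base64-encode raw image bytes so they can be embedded in a
    vision-model request payload as a plain string."""
    return base64.b64encode(image_bytes).decode("utf-8")
```
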

## ⚙️ Configuration

### Environment Variables

- `OPENAI_API_KEY`: Your OpenAI API key
- `ANTHROPIC_API_KEY`: Your Anthropic API key (for Claude models)

## 🛑 Constraints & Limitations

### File Handling

- Temporary files are created during processing and deleted afterward
- Files are processed in-memory where possible

### Text Processing

- Long text (>25,000 characters) is automatically split and processed in parallel for semantic chunking
- Maximum token limit of 4000 for model responses
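
Splitting long text for parallel processing can be sketched as below. This is an illustration of the idea, assuming the splitter prefers paragraph boundaries so chunks are not cut mid-sentence; it is not the library's actual algorithm.

```python
def split_long_text(text: str, limit: int = 25000):
    """Split text into pieces no longer than `limit` characters,
    cutting at the last paragraph break before the limit when possible."""
    pieces = []
    while len(text) > limit:
        cut = text.rfind("\n\n", 0, limit)  # prefer a paragraph boundary
        if cut <= 0:
            cut = limit  # no boundary found: hard cut at the limit
        pieces.append(text[:cut])
        text = text[cut:].lstrip("\n")
    if text:
        pieces.append(text)
    return pieces
```

Each piece can then be chunked semantically in its own thread and the results merged in order.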

### API Constraints

- Request timeout set to 60 seconds
- Maximum of 3 retries for API calls

## 📋 Request Parameters

### Document Chunking Endpoint

- `document_file`: The document file to process (PDF, DOCX, PPTX)
- `strategy`: Chunking strategy to use (default: "semantic")
- Options: "fixed", "page", "semantic", "paragraph", "heading"
- `chunk_size`: Size of chunks for fixed strategy in characters (default: 1000)
- `overlap`: Overlap size for fixed strategy in characters (default: 100)

- `model_provider`: Type of model provider to use (default: "openai")
- Options: "openai", "anthropic", "huggingface"
- `model_config`: Additional configuration for the model provider (optional)
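
A request to the endpoint would combine these fields with the `document_file` upload. The helper below only assembles the non-file form fields; the field names come from the list above, but the helper itself is a hypothetical convenience, not part of Unsiloed's API.

```python
import json

def chunking_form_fields(strategy="semantic", chunk_size=1000, overlap=100,
                         model_provider="openai", model_config=None):
    """Build the non-file form fields for a chunking request.

    Form values are strings; model_config (a dict) is JSON-encoded.
    Send the document itself as the `document_file` upload alongside these.
    """
    fields = {
        "strategy": strategy,
        "chunk_size": str(chunk_size),
        "overlap": str(overlap),
        "model_provider": model_provider,
    }
    if model_config is not None:
        fields["model_config"] = json.dumps(model_config)
    return fields
```
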

## 📦 Installation

### Using pip

```bash
pip install unsiloed
```


### Requirements

Unsiloed requires Python 3.8 or higher and has the following dependencies:

- openai
- anthropic
- transformers
- torch
- PyPDF2
- python-docx
- python-pptx

## 🔑 Environment Setup

Before using Unsiloed, set up your API keys:

### Using environment variables

```bash
# Linux/macOS
export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"

# Windows (Command Prompt)
set OPENAI_API_KEY=your-openai-api-key
set ANTHROPIC_API_KEY=your-anthropic-api-key

# Windows (PowerShell)
$env:OPENAI_API_KEY="your-openai-api-key"
$env:ANTHROPIC_API_KEY="your-anthropic-api-key"
```

### Using a .env file

Create a `.env` file in your project directory:

```
OPENAI_API_KEY=your-openai-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
```

Then in your Python code:

```python
from dotenv import load_dotenv
load_dotenv() # This loads the variables from .env
```

## 📖 Usage

```python
import os
import Unsiloed

# Example 1: Semantic chunking with OpenAI (default)
result = Unsiloed.process_sync({
"filePath": "./test.pdf",
"credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "semantic",
    "chunkSize": 1000,
"overlap": 100
})

# Print the result
print(result)

# Example 2: Semantic chunking with Anthropic Claude
claude_result = Unsiloed.process_sync({
"filePath": "./test.pdf",
"credentials": {
"apiKey": os.environ.get("ANTHROPIC_API_KEY")
},
"strategy": "semantic",
"modelProvider": "anthropic",
"modelConfig": {
"model": "claude-3-opus-20240229"
}
})

# Example 3: Semantic chunking with HuggingFace
hf_result = Unsiloed.process_sync({
"filePath": "./test.pdf",
"strategy": "semantic",
"modelProvider": "huggingface",
"modelConfig": {
"model_name": "mistralai/Mistral-7B-Instruct-v0.2"
}
})

# Example 4: Fixed-size chunking
fixed_result = Unsiloed.process_sync({
"filePath": "./test.pdf",
"strategy": "fixed",
"chunkSize": 1500,
"overlap": 150
})

# Example 5: Page-based chunking (PDF only)
page_result = Unsiloed.process_sync({
"filePath": "./test.pdf",
"credentials": {
"apiKey": os.environ.get("OPENAI_API_KEY")
},
"strategy": "page"
})

# Example 6: Paragraph chunking
paragraph_result = Unsiloed.process_sync({
"filePath": "./document.docx",
"credentials": {
"apiKey": os.environ.get("OPENAI_API_KEY")
},
"strategy": "paragraph"
})

# Example 7: Heading chunking
heading_result = Unsiloed.process_sync({
"filePath": "./presentation.pptx",
"credentials": {
"apiKey": os.environ.get("OPENAI_API_KEY")
},
"strategy": "heading"
})
```

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 📝 License


This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🌐 Community and Support

- **Issues**: For bug reports and feature requests
- **Pull Requests**: For contributing to the codebase


### Staying Updated

- **Star** the repository to show support