# 📄 Unsiloed AI Document Data Extractor

A super simple way to extract text from documents for intelligent document processing, extraction, and chunking with multi-threaded processing capabilities.

## 🚀 Features

### 📊 Document Chunking

- **Supported File Types**: PDF, DOCX, PPTX
- **Chunking Strategies**:
- **Fixed Size**: Splits text into chunks of specified size with optional overlap
- **Page-based**: Splits PDF by pages (PDF only, falls back to paragraph for other file types)
- **Semantic**: Uses LLM to identify meaningful semantic chunks
- **Paragraph**: Splits text by paragraphs
- **Heading**: Splits text by identified headings
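
To make the fixed-size strategy concrete, here is a minimal sketch of fixed-size chunking with overlap. This is an illustration of the idea, not Unsiloed's actual implementation; the function name and signature are invented for this example.

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 100):
    """Split text into chunks of chunk_size characters, with each chunk
    sharing its first `overlap` characters with the end of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap preserves context across chunk boundaries, which helps downstream retrieval and extraction.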

### 🤖 Model Provider Support

- **OpenAI**: GPT-4 and other OpenAI models
- **Anthropic**: Claude models
- **HuggingFace**: Local inference with transformers
- **Configurable**: Easy to add new model providers
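
A provider registry is one simple way such configurability can be structured. The sketch below is hypothetical, not Unsiloed's internals; the default model identifiers are taken from examples elsewhere in this README and may need updating.

```python
PROVIDER_DEFAULTS = {
    "openai": "gpt-4o",
    "anthropic": "claude-3-opus-20240229",
    "huggingface": "mistralai/Mistral-7B-Instruct-v0.2",
}

def resolve_model(provider: str, model_config=None) -> str:
    """Pick the model to use: explicit config wins, otherwise the provider default."""
    if provider not in PROVIDER_DEFAULTS:
        raise ValueError(f"unknown provider: {provider}")
    cfg = model_config or {}
    # HuggingFace configs use "model_name"; OpenAI/Anthropic use "model"
    return cfg.get("model") or cfg.get("model_name") or PROVIDER_DEFAULTS[provider]
```

Adding a new provider then only requires a new registry entry plus the client code that calls it.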

## 🔧 Technical Details

### 🧠 LLM Integration

- Supports multiple model providers:
- OpenAI GPT-4 for semantic chunking
- Anthropic Claude models
- HuggingFace models for local inference
- Handles authentication via API keys from environment variables
- Implements automatic retries and timeout handling
- Provides structured JSON output for semantic chunks
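
The retry-and-timeout behavior can be sketched as a small wrapper like the one below. This is a simplified illustration under the constraints stated in this README (3 retries, 60-second timeout), not the library's actual code; real code would catch provider-specific exceptions rather than bare `Exception`.

```python
import time

def call_with_retries(request_fn, max_retries=3, timeout=60, base_delay=1.0):
    """Call request_fn(timeout=...), retrying on failure with exponential backoff."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return request_fn(timeout=timeout)
        except Exception as error:  # illustrative; narrow this in real code
            last_error = error
            if attempt < max_retries - 1:
                # wait base_delay, 2*base_delay, 4*base_delay, ... between attempts
                time.sleep(base_delay * (2 ** attempt))
    raise last_error
```
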

### 🔄 Parallel Processing

- Multi-threaded processing for improved performance
- Parallel page extraction from PDFs
- Distributes processing of large documents across multiple threads
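
Parallel page extraction can be sketched with a thread pool, as below. This is a hedged illustration rather than Unsiloed's implementation; `extract_page` stands in for whatever per-page extraction function is used (e.g. one built on PyPDF2).

```python
from concurrent.futures import ThreadPoolExecutor

def extract_pages_parallel(extract_page, num_pages, max_workers=4):
    """Run extract_page(page_index) for every page concurrently.

    pool.map returns results in page order even when pages finish
    out of order, so the extracted text stays correctly sequenced.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_page, range(num_pages)))
```

Threads work well here because page extraction is I/O-bound and spends most of its time waiting on reads.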

### 📝 Document Processing

- Extracts text from PDF, DOCX, and PPTX files
- Handles image encoding for vision-based models
- Generates extraction prompts for structured data extraction
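
Image encoding for vision-based models typically means base64-encoding the image bytes before embedding them in the request payload. A minimal version of that step:

```python
import base64

def encode_image_for_vision(image_bytes: bytes) -> str:
    """Base64-encode raw image bytes so they can be embedded in a
    vision-model request payload as a plain string."""
    return base64.b64encode(image_bytes).decode("utf-8")
```
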

## ⚙️ Configuration

### Environment Variables

- `OPENAI_API_KEY`: Your OpenAI API key
- `ANTHROPIC_API_KEY`: Your Anthropic API key (for Claude models)

## 🛑 Constraints & Limitations

### File Handling

- Temporary files are created during processing and deleted afterward
- Files are processed in-memory where possible

### Text Processing

- Long text (>25,000 characters) is automatically split and processed in parallel for semantic chunking
- Maximum token limit of 4000 for model responses
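
Splitting long text for parallel processing can be sketched as below. This is an illustration of the idea, assuming the splitter prefers paragraph boundaries so chunks are not cut mid-sentence; it is not the library's actual algorithm.

```python
def split_long_text(text: str, limit: int = 25000):
    """Split text into pieces no longer than `limit` characters,
    cutting at the last paragraph break before the limit when possible."""
    pieces = []
    while len(text) > limit:
        cut = text.rfind("\n\n", 0, limit)  # prefer a paragraph boundary
        if cut <= 0:
            cut = limit  # no boundary found: hard cut at the limit
        pieces.append(text[:cut])
        text = text[cut:].lstrip("\n")
    if text:
        pieces.append(text)
    return pieces
```

Each piece can then be chunked semantically in its own thread and the results merged in order.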

### API Constraints

- Request timeout set to 60 seconds
- Maximum of 3 retries for API calls

## 📋 Request Parameters

### Document Chunking Endpoint

- `document_file`: The document file to process (PDF, DOCX, PPTX)
- `strategy`: Chunking strategy to use (default: "semantic")
- Options: "fixed", "page", "semantic", "paragraph", "heading"
- `chunk_size`: Size of chunks for fixed strategy in characters (default: 1000)
- `overlap`: Overlap size for fixed strategy in characters (default: 100)

- `model_provider`: Type of model provider to use (default: "openai")
- Options: "openai", "anthropic", "huggingface"
- `model_config`: Additional configuration for the model provider (optional)
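
A request to the endpoint would combine these fields with the `document_file` upload. The helper below only assembles the non-file form fields; the field names come from the list above, but the helper itself is a hypothetical convenience, not part of Unsiloed's API.

```python
import json

def chunking_form_fields(strategy="semantic", chunk_size=1000, overlap=100,
                         model_provider="openai", model_config=None):
    """Build the non-file form fields for a chunking request.

    Form values are strings; model_config (a dict) is JSON-encoded.
    Send the document itself as the `document_file` upload alongside these.
    """
    fields = {
        "strategy": strategy,
        "chunk_size": str(chunk_size),
        "overlap": str(overlap),
        "model_provider": model_provider,
    }
    if model_config is not None:
        fields["model_config"] = json.dumps(model_config)
    return fields
```
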

## 📦 Installation

### Using pip

```bash
pip install unsiloed
```


### Requirements

Unsiloed requires Python 3.8 or higher and has the following dependencies:

- openai
- anthropic
- transformers
- torch
- PyPDF2
- python-docx
- python-pptx

## 🔑 Environment Setup

Before using Unsiloed, set up your API keys:

### Using environment variables

```bash
# Linux/macOS
export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"

# Windows (Command Prompt)
set OPENAI_API_KEY=your-openai-api-key
set ANTHROPIC_API_KEY=your-anthropic-api-key

# Windows (PowerShell)
$env:OPENAI_API_KEY="your-openai-api-key"
$env:ANTHROPIC_API_KEY="your-anthropic-api-key"
```

### Using a .env file

Create a `.env` file in your project directory:

```
OPENAI_API_KEY=your-openai-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
```

Then in your Python code:

```python
from dotenv import load_dotenv
load_dotenv() # This loads the variables from .env
```

## 📖 Usage

```python
import os
import Unsiloed

# Example 1: Semantic chunking with OpenAI (default)
result = Unsiloed.process_sync({
"filePath": "./test.pdf",
"credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "semantic",
    "chunkSize": 1000,
"overlap": 100
})

# Print the result
print(result)

# Example 2: Semantic chunking with Anthropic Claude
claude_result = Unsiloed.process_sync({
"filePath": "./test.pdf",
"credentials": {
"apiKey": os.environ.get("ANTHROPIC_API_KEY")
},
"strategy": "semantic",
"modelProvider": "anthropic",
"modelConfig": {
"model": "claude-3-opus-20240229"
}
})

# Example 3: Semantic chunking with HuggingFace
hf_result = Unsiloed.process_sync({
"filePath": "./test.pdf",
"strategy": "semantic",
"modelProvider": "huggingface",
"modelConfig": {
"model_name": "mistralai/Mistral-7B-Instruct-v0.2"
}
})

# Example 4: Fixed-size chunking
fixed_result = Unsiloed.process_sync({
"filePath": "./test.pdf",
"strategy": "fixed",
"chunkSize": 1500,
"overlap": 150
})

# Example 5: Page-based chunking (PDF only)
page_result = Unsiloed.process_sync({
"filePath": "./test.pdf",
"credentials": {
"apiKey": os.environ.get("OPENAI_API_KEY")
},
"strategy": "page"
})

# Example 6: Paragraph chunking
paragraph_result = Unsiloed.process_sync({
"filePath": "./document.docx",
"credentials": {
"apiKey": os.environ.get("OPENAI_API_KEY")
},
"strategy": "paragraph"
})

# Example 7: Heading chunking
heading_result = Unsiloed.process_sync({
"filePath": "./presentation.pptx",
"credentials": {
"apiKey": os.environ.get("OPENAI_API_KEY")
},
"strategy": "heading"
})
```

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 📝 License


This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🌐 Community and Support

- **Issues**: For bug reports and feature requests
- **Pull Requests**: For contributing to the codebase


### Staying Updated

- **Star** the repository to show support