- RAFT Toolkit
- Table of Contents
- Overview
- Installation
- Usage
- RAFT Training Guide
- Template System
- Advanced Configuration
- Architecture & Development
- Testing
- Command Line Tools
- Fine-tuning & Evaluation
- Deployment
- Documentation
RAFT (Retrieval Augmented Fine-Tuning) is a technique that trains language models to better utilize retrieved documents when answering questions. Unlike traditional RAG systems that rely on frozen pre-trained models, RAFT fine-tunes models specifically for document-based reasoning tasks.
The RAFT Toolkit automates the creation of training datasets by generating {question, answer, documents} triplets from your documents, enabling you to fine-tune models that excel at retrieval-augmented generation tasks.
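For orientation, a single training example pairs a question and answer with a context that mixes the source document and distractors. The sketch below is illustrative only (the field names are assumptions for demonstration, not the toolkit's exact output schema):

```python
import json

# Illustrative {question, answer, documents} record; field names are
# assumptions for demonstration, not the toolkit's exact schema.
record = {
    "question": "Which chunking strategies does the toolkit support?",
    "documents": [
        "The toolkit supports semantic, fixed, and sentence chunking strategies.",  # oracle chunk
        "Unrelated distractor text about container deployment options.",            # distractor
    ],
    "answer": "Semantic, fixed, and sentence chunking.",
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```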
graph TD
A[Input Sources<br/>Local, S3, SharePoint] --> B{RAFT Toolkit<br/>CLI or Web UI}
B --> C[Document Chunking<br/>Semantic/Fixed/Sentence]
C --> D[Question Generation<br/>LLM-powered Q&A creation]
D --> E[Answer Generation<br/>Context-based responses]
E --> F[Distractor Addition<br/>Irrelevant docs for robustness]
F --> G[Training Dataset<br/>JSONL/Parquet format]
G --> H[Model Fine-tuning<br/>OpenAI/HuggingFace/Azure]
H --> I[Fine-tuned Model<br/>Domain-optimized LLM]
G --> J{Analysis Tools}
J --> K[Dataset Evaluation<br/>eval.py]
J --> L[Answer Generation<br/>answer.py]
J --> M[PromptFlow Analysis<br/>pfeval_*.py]
K --> N[Performance Metrics]
L --> O[Model Comparison]
M --> P[Quality Assessment]
N --> Q[Production Model<br/>Optimized for RAG tasks]
O --> Q
P --> Q
style B fill:#e1f5fe,color:#000000
style J fill:#f3e5f5,color:#000000
style Q fill:#e8f5e8,color:#000000
Toolkit Components:
- Core Engine: Document processing and dataset generation
- Analysis Tools: Six evaluation and comparison utilities
- Web Interface: Visual workflow management and monitoring
- CLI Tools: Scriptable automation and batch processing
Features:
- Dual Interface: Command-line tool and modern web interface
- Analysis Tools Suite: Evaluation, answer generation, and PromptFlow analysis
- 12-Factor Architecture: Cloud-native, scalable design
- Multi-Format Support: PDF, TXT, JSON, PPTX, and API documentation
- Multiple Input Sources: Local files, Amazon S3, SharePoint Online
- Enterprise Authentication: AWS credentials, Azure AD, SharePoint integration
- Flexible Output: HuggingFace, OpenAI completion/chat, and evaluation formats
- Parallel Processing: Configurable workers for optimal performance
- Enhanced Logging: Production-ready logging with progress tracking, external service integration (Sentry, DataDog), and structured output
- Observability: Optional LangWatch integration for LLM call tracing and performance monitoring
- Comprehensive Testing: Unit, integration, API, and CLI test suites
- Container Ready: Docker support for easy deployment
- Kubernetes Ready: Complete Kubernetes deployment configurations
| Aspect | Traditional RAG | RAFT Fine-Tuning |
|---|---|---|
| Model Training | Uses frozen pre-trained models | Fine-tunes models on domain-specific data |
| Document Utilization | May ignore or misuse retrieved documents | Learns to effectively use retrieved information |
| Performance | Depends on base model's retrieval reasoning | Optimized for specific document types/domains |
| Latency | Requires runtime retrieval + inference | Faster inference with better document integration |
| Setup Complexity | Lower initial setup | Higher setup (requires training data generation) |
| Customization | Limited to prompt engineering | Deep customization through fine-tuning |
When to Use RAFT vs Traditional RAG:
Use RAFT Fine-Tuning When:
- You have consistent document types/formats
- Performance on document reasoning is critical
- You can invest time in data generation and training
- You need predictable, high-quality outputs
- Latency optimization is important
Use Traditional RAG When:
- Working with diverse, changing document types
- Quick prototyping or proof-of-concept needed
- Limited resources for training data generation
- Documents change frequently
- General-purpose question answering is sufficient
Complete Installation Guide: For detailed installation instructions, prerequisites, Docker setup, and advanced configuration options, see docs/INSTALLATION_GUIDE.md.
# Clone the repository
git clone https://github.com/your-repo/raft-toolkit.git
cd raft-toolkit
# Set up environment
cp .env.example .env
# Edit .env with your OpenAI API key
# Fast installation (core functionality only)
pip install .
# Or standard installation (recommended)
pip install .[standard]
# Test installation
python -m cli.main --datapath sample_data/sample.pdf --output ./output --preview
Choose the installation that best fits your needs:
- `pip install .`
  Includes: Basic CLI, document processing, OpenAI integration
  Use cases: Quick testing, lightweight deployments, basic CI
- `pip install .[standard]`
  Includes: Full AI/ML functionality, embeddings, LangChain ecosystem
  Use cases: Production deployments, full RAFT functionality
- `pip install .[complete]`
  Includes: Standard + cloud services + observability
  Use cases: Enterprise deployments, cloud integration
- `pip install .[all]`
  Includes: Everything + development tools
  Use cases: Contributing, local development, full testing
# Web interface with AI
pip install .[standard,web]
# Cloud deployment with tracing
pip install .[ai,langchain,cloud,tracing]
# Development with specific features
pip install .[standard,dev]
Or run everything with Docker Compose:
docker compose up -d
Performance Note: The optimized dependency structure provides 70-80% faster CI builds compared to previous versions. See the CI Optimization Guide for details.
Installation Resources:
- Complete Installation Guide - Detailed setup instructions
- Requirements Management - Dependency structure and installation patterns
CLI Documentation:
- CLI Reference Guide - Comprehensive CLI parameter documentation
- CLI Quick Reference - Quick reference card for CLI parameters
See also: Web Interface Guide for detailed documentation on all web UI features, analysis tools, and job management.
# Start the web server
python run_web.py
# Or with custom configuration
python run_web.py --host 0.0.0.0 --port 8080 --debug
# Open http://localhost:8000 in your browser
Web UI Features:
- Dataset Generation: Drag & drop file upload with visual configuration
- Analysis Tools: Six powerful evaluation and analysis tools
- Visual Configuration: Interactive forms for all settings
- Live Preview: See processing estimates before running
- Job Management: Track multiple processing jobs with real-time updates
- Download Results: Direct download of generated datasets and analysis results
- Results Visualization: Comprehensive display of metrics and statistics
Analysis Tools Available:
- Dataset Evaluation: Evaluate model performance with configurable metrics
- Answer Generation: Generate high-quality answers using various LLMs
- PromptFlow Analysis: Multi-dimensional evaluation (relevance, groundedness, fluency, coherence)
- Dataset Analysis: Statistical analysis and quality metrics
- Model Comparison: Side-by-side performance comparison
- Batch Processing: Automated workflows for multiple datasets
Complete CLI Documentation:
- CLI Reference Guide - Comprehensive documentation of all CLI parameters and options
- CLI Quick Reference - Quick reference card for common commands and use cases
The tools/ directory contains powerful standalone evaluation utilities:
# Navigate to tools directory
cd tools/
# Install tool dependencies
pip install -r requirements.txt
# Run dataset evaluation
python eval.py --question-file dataset.jsonl --answer-file answers.jsonl
# Generate answers for evaluation
python answer.py --input questions.jsonl --output answers.jsonl --workers 8
# Run PromptFlow evaluation
python pfeval_chat.py --input dataset.jsonl --output evaluation.json
See the tools/README.md for comprehensive documentation on all available tools.
Basic Workflow (a minimal code sketch follows this list):
- Chunk Generation: Document is split into chunks
- QA Generation: LLM generates N questions and answers per chunk
- Distractor Appending: Random chunks are added as distractors for each QA pair
- Dataset Export: Data is saved in the specified format for fine-tuning
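A rough sketch of these four steps in Python (illustrative only, not the toolkit's internal API; `generate_qa` stands in for the LLM-backed generator you plug in):

```python
import json
import random

def build_examples(chunks, generate_qa, questions_per_chunk=5, num_distractors=4):
    """Illustrative walk through the four workflow steps for one document."""
    examples = []
    for i, chunk in enumerate(chunks):                        # 1. chunks come from the chunker
        for q, a in generate_qa(chunk, questions_per_chunk):  # 2. LLM generates Q&A per chunk
            others = [c for j, c in enumerate(chunks) if j != i]
            distractors = random.sample(others, k=min(num_distractors, len(others)))  # 3. distractors
            examples.append({"question": q, "context": [chunk] + distractors, "answer": a})
    with open("dataset.jsonl", "w", encoding="utf-8") as f:   # 4. export for fine-tuning
        f.writelines(json.dumps(e) + "\n" for e in examples)
    return examples
```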
Tips:
- Use a `.env` file for OpenAI/Azure keys
- For Azure, set deployment names with `--completion-model` and `--embedding-model`
- Use `--chunking-strategy` and `--chunking-params` for best results on your data
You can use Ollama as a local OpenAI-compatible API for running models like Llama 3, Mistral, and others. This allows you to run RAFT without cloud API keys.
1. Start Ollama with your desired model:
ollama run llama3
2. Set the OpenAI-compatible endpoint in your environment:
export OPENAI_API_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama-anything"  # Any non-empty string
Or add these to your .env file:
OPENAI_API_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama-anything
3. Run RAFT as usual:
python3 raft.py \
--datapath sample_data/United_States_PDF.pdf \
--output ./sample_ds4 \
--distractors 4 \
--doctype pdf \
--chunk_size 512 \
--questions 5 \
--openai_key $OPENAI_API_KEY
Note:
- Ollama's API is compatible with the OpenAI API, but some advanced features may not be supported.
- You can specify different models by running `ollama run <model_name>` and setting the appropriate model in your RAFT command if needed.
See also: Complete Configuration Guide for advanced RAFT configuration options and best practices.
- Quality Over Quantity: Use high-quality, authoritative documents
- Consistent Format: Maintain consistent document structure and formatting
- Domain Relevance: Focus on documents representative of target use cases
- Optimal Length: Use documents of 1,000-10,000 tokens for best chunking results
- Diverse Question Types: Include factual, analytical, and inferential questions
- Appropriate Difficulty: Match question complexity to intended use case
- Natural Language: Generate questions that users would realistically ask
- Coverage: Ensure questions cover all important document sections
- Distractor Ratio: Use 3-5 distractor documents per training example
- Oracle Probability: Include the source document 80-100% of the time (see the sketch after this list)
- Balanced Difficulty: Mix easy, medium, and hard questions
- Size Guidelines: Aim for 1,000-10,000 training examples minimum
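As a concrete reading of the distractor-ratio and oracle-probability guidelines above, context assembly for one example might look like the following hedged sketch (parameter names are illustrative, not the toolkit's API):

```python
import random

def assemble_context(oracle_chunk, other_chunks, num_distractors=4, oracle_probability=0.9):
    """Pick 3-5 distractors and include the source (oracle) chunk ~80-100% of the time."""
    distractors = random.sample(other_chunks, k=min(num_distractors, len(other_chunks)))
    context = list(distractors)
    if random.random() < oracle_probability:
        # Insert the oracle document at a random position so its location is not a giveaway.
        context.insert(random.randrange(len(context) + 1), oracle_chunk)
    return context
```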
- Manual Review: Sample and manually verify question-answer pairs
- Consistency Checks: Ensure answers are actually derivable from context
- Bias Detection: Check for dataset biases and systematic errors
- Evaluation Split: Reserve 10-20% of data for evaluation
Effective chunking is critical for RAFT success. Choose your strategy based on document type and use case:
| Document Type | Recommended Chunk Size | Reasoning |
|---|---|---|
| Technical Documentation | 300-512 tokens | Preserves complete concepts and code examples |
| Legal Documents | 512-768 tokens | Maintains clause/section coherence |
| Medical Literature | 256-512 tokens | Balances detail with focused topics |
| Research Papers | 512-1024 tokens | Captures complete paragraphs and findings |
| FAQ/Knowledge Base | 128-256 tokens | Each chunk = one question/topic |
| News Articles | 256-512 tokens | Preserves story coherence |
| Overlap % | Use Case | Trade-offs |
|---|---|---|
| 0% | Distinct topics, FAQ | Clean separation, no redundancy |
| 10-20% | Technical docs | Minimal context preservation |
| 20-40% | Narrative content | Good context flow, some redundancy |
| 40-60% | Complex topics | Maximum context, high redundancy |
# Low overlap for distinct topics
--chunking-params '{"overlap": 0}'
# Medium overlap for connected content
--chunking-params '{"overlap": 100}' # ~20% of 512 tokens
# High overlap for complex documents
--chunking-params '{"overlap": 200}' # ~40% of 512 tokens
| Questions/Chunk | Use Case | Quality vs Quantity |
|---|---|---|
| 1-2 | High-quality, focused datasets | Maximum quality, minimal redundancy |
| 3-5 | Balanced approach (recommended) | Good quality, reasonable coverage |
| 6-10 | Comprehensive coverage | Risk of lower quality questions |
# Focused, high-quality
--questions 2 --chunk_size 512
# Balanced approach (recommended)
--questions 5 --chunk_size 384
# Comprehensive coverage
--questions 8 --chunk_size 256
| Distractors | Training Benefit | Dataset Size Impact |
|---|---|---|
| 2-3 | Basic robustness | Moderate increase |
| 4-6 | Strong robustness (recommended) | 5-7x dataset size |
| 7-10 | Maximum robustness | 8-11x dataset size |
# Recommended configuration
--distractors 4 --questions 5 --chunk_size 512
# Resource-constrained
--distractors 2 --questions 3 --chunk_size 384
# Maximum robustness
--distractors 6 --questions 3 --chunk_size 256
Semantic Chunking (Recommended)
--chunking-strategy semantic --chunk_size 512 \
--chunking-params '{"overlap": 50, "min_chunk_size": 200}'
- Best for: Most document types, preserves meaning
- Overlap: 50-100 tokens for context preservation
- Min size: 200 tokens to ensure meaningful chunks
Fixed Chunking
--chunking-strategy fixed --chunk_size 384 \
--chunking-params '{"overlap": 75}'
- Best for: Consistent processing, structured documents
- Overlap: 15-25% of chunk size
- Use when: Semantic understanding less critical
Sentence Chunking
--chunking-strategy sentence --chunk_size 256 \
--chunking-params '{"overlap": 0}'
- Best for: Natural language, narrative content
- Overlap: Usually 0 (sentence boundaries are natural breaks)
- Chunk size: Maximum tokens per chunk (actual size varies)
# Generate RAFT training dataset
python raft.py --datapath documents/ --output training_data/
- Document Chunking: Split documents into semantic chunks
- Question Generation: Create relevant questions for each chunk
- Answer Generation: Generate accurate answers using the source chunk
- Distractor Addition: Include irrelevant documents to improve robustness
- Format Conversion: Export in format suitable for fine-tuning platforms
# Example with OpenAI fine-tuning
openai api fine_tunes.create \
-t training_data.jsonl \
-m gpt-3.5-turbo \
--suffix "raft-medical-docs"
- Platform Selection: Choose fine-tuning platform (OpenAI, HuggingFace, etc.)
- Model Selection: Start with instruction-tuned base models
- Training Configuration: Set learning rate, epochs, batch size
- Validation: Monitor training metrics and validation performance
# Evaluate fine-tuned model
python tools/eval.py --model ft:gpt-3.5-turbo:suffix --question-file eval.jsonl
- Performance Testing: Compare against baseline models
- Error Analysis: Identify common failure patterns
- Data Augmentation: Generate additional training examples for weak areas
- Iterative Improvement: Refine dataset and retrain
RAFT Toolkit includes a comprehensive template system for customizing prompts used in embedding generation and question-answer pair creation. Templates can be customized to improve quality and relevance for specific domains.
No Configuration Required: RAFT Toolkit works out of the box with intelligent defaults:
- Automatically selects appropriate templates based on model type (GPT, Llama, etc.)
- Provides robust fallback mechanisms if custom templates are not found
- Includes multiple layers of default templates for different complexity levels
- Gracefully handles missing template directories or malformed template files
# Works immediately with defaults - no template configuration needed
python raft.py --datapath docs/ --output training_data/
- embedding_prompt_template.txt: Default template for embedding generation
  - Provides context and instructions for generating document embeddings
  - Supports variables: {content}, {document_type}, {metadata}
  - Customizable for domain-specific embedding optimization
- gpt_template.txt: GPT-style question-answering template with reasoning and citations
- gpt_qa_template.txt: GPT question generation template with content filtering
- llama_template.txt: Llama-style question-answering template optimized for Llama models
- llama_qa_template.txt: Llama question generation template with complexity guidelines
Environment Variables:
# Custom prompt templates
export RAFT_EMBEDDING_PROMPT_TEMPLATE="/path/to/templates/my_embedding_template.txt"
export RAFT_QA_PROMPT_TEMPLATE="/path/to/templates/my_qa_template.txt"
export RAFT_ANSWER_PROMPT_TEMPLATE="/path/to/templates/my_answer_template.txt"
# Templates directory
export RAFT_TEMPLATES="/path/to/templates/"
CLI Arguments:
# Use custom templates
python raft.py --datapath docs/ --output training_data/ \
--embedding-prompt-template "/path/to/custom_embedding.txt" \
--qa-prompt-template "/path/to/custom_qa.txt" \
--answer-prompt-template "/path/to/custom_answer.txt"
# Use custom templates directory
python raft.py --datapath docs/ --output training_data/ \
--templates "/path/to/custom/templates/"
Programmatic Configuration:
# Import path is an assumption based on the project layout (core/config.py); adjust to your install.
from raft_toolkit.core.config import RAFTConfig

config = RAFTConfig(
templates="./templates",
embedding_prompt_template="templates/my_custom_embedding.txt",
qa_prompt_template="templates/gpt_qa_template.txt",
answer_prompt_template="templates/gpt_template.txt"
)
Embedding template variables:
- {content}: The document content to be embedded
- {document_type}: File type (pdf, txt, json, pptx, etc.)
- {metadata}: Additional document metadata
- {chunk_index}: Index of the current chunk within the document
- {chunking_strategy}: The chunking method used
QA and answer template variables:
- {question}: The question to be answered (for answer templates)
- {context}: The context/chunk for question generation
- %s: Placeholder for number of questions to generate
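These placeholders behave like ordinary string-formatting fields. A minimal illustration of filling a custom embedding template (the template text below is hypothetical, not one of the shipped files):

```python
# Hypothetical template text built from the documented placeholders.
template = (
    "Generate embeddings for {document_type} content (chunk {chunk_index}, "
    "strategy: {chunking_strategy}).\n"
    "Metadata: {metadata}\n\n"
    "Content:\n{content}"
)

prompt = template.format(
    content="RAFT fine-tunes models for document-based reasoning...",
    document_type="pdf",
    metadata={"source": "overview.pdf"},
    chunk_index=3,
    chunking_strategy="semantic",
)
print(prompt)
```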
Generate embeddings for medical literature that capture:
- Clinical terminology and procedures
- Drug names and dosages
- Symptoms and diagnoses
- Treatment protocols and outcomes
Content: {content}
Generate embeddings for legal documents focusing on:
- Legal terminology and concepts
- Case citations and precedents
- Statutory references
- Contractual terms and conditions
Document Type: {document_type}
Content: {content}
Generate embeddings for technical documentation emphasizing:
- API endpoints and parameters
- Code examples and syntax
- Configuration options
- Error messages and troubleshooting
Content: {content}
Metadata: {metadata}
See the templates/README.md for comprehensive template documentation and customization examples.
The RAFT Toolkit includes comprehensive rate limiting to handle the constraints imposed by cloud-based AI services. Rate limiting is disabled by default to maintain backward compatibility, but is highly recommended for production use to avoid hitting API limits and reduce costs.
Common Issues Without Rate Limiting:
- API rate limit errors (HTTP 429) causing processing failures
- Unexpected costs from burst API usage
- Inconsistent processing times due to throttling
- Failed batches requiring expensive reprocessing
Benefits of Rate Limiting:
- Predictable Costs: Control API spending with token and request limits
- Reliable Processing: Avoid rate limit errors through intelligent throttling
- Optimized Performance: Adaptive strategies adjust to service response times
- Better Monitoring: Detailed statistics on API usage and throttling
Using Preset Configurations:
# OpenAI GPT-4 with recommended limits
python raft.py --datapath docs/ --output training_data/ \
--rate-limit --rate-limit-preset openai_gpt4
# Azure OpenAI with conservative limits
python raft.py --datapath docs/ --output training_data/ \
--rate-limit --rate-limit-preset azure_openai_standard
# Anthropic Claude with aggressive processing
python raft.py --datapath docs/ --output training_data/ \
--rate-limit --rate-limit-preset anthropic_claude
Custom Rate Limiting:
# Custom limits for your specific API tier
python raft.py --datapath docs/ --output training_data/ \
--rate-limit \
--rate-limit-strategy sliding_window \
--rate-limit-requests-per-minute 100 \
--rate-limit-tokens-per-minute 5000 \
--rate-limit-max-burst 20
# Adaptive rate limiting (adjusts based on response times)
python raft.py --datapath docs/ --output training_data/ \
--rate-limit --rate-limit-strategy adaptive \
--rate-limit-requests-per-minute 200
- Sliding Window (Recommended; sketched in code after this list)
- Best for: Most production use cases
- How it works: Tracks requests over a rolling time window
- Advantages: Smooth rate distribution, handles bursts well
- Fixed Window
- Best for: Simple rate limiting scenarios
- How it works: Resets limits at fixed intervals (every minute)
- Advantages: Simple to understand, predictable behavior
- Token Bucket
- Best for: Bursty workloads with occasional high throughput needs
- How it works: Accumulates "tokens" over time, consumes them for requests
- Advantages: Allows controlled bursts above average rate
- Adaptive
- Best for: Unknown or variable API performance
- How it works: Automatically adjusts rate based on response times
- Advantages: Self-tuning, optimizes for service performance
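To make the sliding-window idea concrete, here is a minimal self-contained sketch (illustrative only; the toolkit's built-in limiter is driven by the `--rate-limit-*` flags shown above):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most max_requests within any rolling window_seconds span."""

    def __init__(self, max_requests=100, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def acquire(self):
        now = time.monotonic()
        # Drop calls that have slid out of the rolling window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            # Wait until the oldest call leaves the window before proceeding.
            time.sleep(self.window_seconds - (now - self.timestamps[0]))
        self.timestamps.append(time.monotonic())

# Usage: call acquire() before every API request.
limiter = SlidingWindowLimiter(max_requests=100, window_seconds=60.0)
```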
| Preset | Service | Requests/min | Tokens/min | Burst | Use Case |
|---|---|---|---|---|---|
| `openai_gpt4` | OpenAI GPT-4 | 500 | 10,000 | 50 | Production GPT-4 |
| `openai_gpt35_turbo` | OpenAI GPT-3.5 Turbo | 3,500 | 90,000 | 100 | High-throughput GPT-3.5 |
| `azure_openai_standard` | Azure OpenAI | 120 | 6,000 | 20 | Standard Azure tier |
| `anthropic_claude` | Anthropic Claude | 1,000 | 100,000 | 50 | Claude API |
| `conservative` | Any service | 60 | 2,000 | 10 | Safe/cautious processing |
| `aggressive` | Any service | 1,000 | 50,000 | 100 | Fast processing |
The RAFT Toolkit features a comprehensive logging system designed for production use, debugging, and integration with external monitoring tools.
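As an illustration of what the `RAFT_LOG_LEVEL` and `RAFT_LOG_FORMAT` variables control, a generic standard-library sketch (not the toolkit's internal logger) might look like this:

```python
import json
import logging
import os

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record, similar in spirit to structured output."""
    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
if os.getenv("RAFT_LOG_FORMAT", "text") == "json":
    handler.setFormatter(JsonFormatter())
logging.basicConfig(level=os.getenv("RAFT_LOG_LEVEL", "INFO"), handlers=[handler])
logging.getLogger("raft").info("structured logging configured")
```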
Docker with Enhanced Logging:
# docker-compose.yml
version: '3.8'
services:
raft-toolkit:
environment:
RAFT_LOG_LEVEL: INFO
RAFT_LOG_FORMAT: json
RAFT_LOG_OUTPUT: both
RAFT_SENTRY_DSN: ${SENTRY_DSN}
volumes:
- ./logs:/app/logs
Kubernetes ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
name: raft-logging-config
data:
RAFT_LOG_LEVEL: "INFO"
RAFT_LOG_FORMAT: "json"
RAFT_LOG_OUTPUT: "both"
RAFT_LOG_STRUCTURED: "true"
- Split large JSONL files:
from raft_toolkit.core.utils.file_utils import split_jsonl_file
split_jsonl_file('yourfile.jsonl', max_size=50_000_000)
- Extract random rows:
from raft_toolkit.core.utils.file_utils import extract_random_jsonl_rows
extract_random_jsonl_rows('yourfile.jsonl', 100, 'sampled_output.jsonl')
raft-toolkit/
├── raft_toolkit/                  # Main package
│   ├── core/                      # Core business logic
│   │   ├── clients/               # External API clients
│   │   ├── config.py              # Configuration management
│   │   ├── formatters/            # Dataset format converters
│   │   ├── models.py              # Data models and schemas
│   │   ├── raft_engine.py         # Main orchestration engine
│   │   ├── security.py            # Security utilities
│   │   └── services/              # Business services
│   │       ├── dataset_service.py   # Dataset operations
│   │       ├── document_service.py  # Document processing
│   │       └── llm_service.py       # LLM interactions
│   ├── cli/                       # Command-line interface
│   │   └── main.py                # CLI entry point
│   ├── web/                       # Web interface
│   │   ├── app.py                 # FastAPI application
│   │   └── static/                # Frontend assets
│   ├── tools/                     # Standalone evaluation tools
│   │   ├── eval.py                # Dataset evaluation
│   │   ├── answer.py              # Answer generation
│   │   └── pfeval_*.py            # PromptFlow evaluations
│   └── templates/                 # Prompt templates
├── tests/                         # Comprehensive test suite
│   ├── unit/                      # Unit tests
│   ├── integration/               # Integration tests
│   ├── api/                       # API tests
│   └── cli/                       # CLI tests
├── docs/                          # Documentation
│   ├── WEB_INTERFACE.md           # Web UI guide
│   ├── DEPLOYMENT.md              # Deployment instructions
│   ├── CONFIGURATION.md           # Configuration reference
│   └── TEST_DIRECTORIES.md        # Test configuration guide
├── .github/                       # CI/CD workflows
│   └── workflows/
│       ├── build.yml              # Build & code quality
│       ├── test.yml               # Comprehensive testing
│       ├── release.yml            # Release automation
│       └── security.yml           # Security scanning
├── docker-compose.yml             # Multi-service orchestration
├── docker-compose.test.yml        # Testing environment
├── Dockerfile                     # Multi-stage container builds
├── requirements*.txt              # Python dependencies
├── .env.example                   # Environment template
├── .env.test.example              # Test configuration template
├── run_tests.py                   # Test runner with configurable directories
├── run_web.py                     # Web server launcher
├── raft.py                        # Legacy CLI entry point
└── README.md                      # This documentation
This toolkit follows 12-factor app principles with a modular architecture:
raft-toolkit/
├── raft_toolkit/          # Main package
│   ├── core/              # Shared business logic
│   │   ├── config.py      # Configuration management
│   │   ├── models.py      # Data models
│   │   ├── raft_engine.py # Main orchestration
│   │   └── services/      # Business services
│   ├── cli/               # Command-line interface
│   ├── web/               # Web interface & API
│   └── tools/             # Evaluation tools
├── raft.py                # CLI entry point
├── run_web.py             # Web entry point
└── docker-compose.yml     # Container orchestration
Benefits:
- Separation of Concerns: UI and business logic decoupled
- Environment Parity: Same code for dev/prod
- Configuration via Environment: 12-factor compliance
- Horizontal Scaling: Stateless design
- Container Ready: Docker & Kubernetes support
See ARCHITECTURE.md for detailed technical documentation.
The toolkit includes a comprehensive test suite covering unit tests, integration tests, API tests, and CLI tests.
# Install test dependencies
pip install -r requirements-test.txt
# Run all tests
python run_tests.py
# Run specific test categories
python run_tests.py --unit # Unit tests only
python run_tests.py --integration # Integration tests only
python run_tests.py --api # API tests only
python run_tests.py --cli # CLI tests only
# Run with coverage
python run_tests.py --coverage
# Run with verbose output
python run_tests.py --verbose
- Unit Tests: Core functionality and business logic
- Integration Tests: Service interactions and data flow
- API Tests: Web interface endpoints and responses
- CLI Tests: Command-line interface validation
Configurable Test Directories:
Configure test directories via CLI arguments or environment variables:
# Custom directories via CLI
python run_tests.py --integration \
--output-dir ./ci-results \
--temp-dir /tmp/fast-ssd \
--coverage-dir ./coverage
# Via environment variables
export TEST_OUTPUT_DIR=./my-results
export TEST_TEMP_DIR=/tmp/my-temp
export TEST_COVERAGE_DIR=./coverage
python run_tests.py --coverage
# Docker testing with custom directories
export HOST_TEST_RESULTS_DIR=/shared/test-results
docker compose -f docker-compose.test.yml up
See the Test Directories Configuration Guide for complete configuration details.
If you encounter dependency conflicts during installation:
# Run dependency checker
python scripts/check_dependencies.py
# Check for conflicts
pip check
# Clean installation
pip install -r requirements.txt --force-reinstall
See the Dependency Troubleshooting Guide for comprehensive troubleshooting steps.
# Run tests in Docker environment
docker compose -f docker-compose.test.yml up --abort-on-container-exit
# Specific test suites
docker compose -f docker-compose.test.yml run raft-test-unit
docker compose -f docker-compose.test.yml run raft-test-integration
# Install code quality tools
pip install -r requirements-test.txt
# Run linting
flake8 .
black --check .
isort --check-only .
mypy .
# Auto-format code
black .
isort .
# Install security tools
pip install bandit safety
# Run security scans
bandit -r . -f json -o security-report.json
safety scan -r requirements.txt
See TESTING.md for detailed testing documentation.
The RAFT Toolkit includes powerful command-line tools for evaluating and analyzing datasets. These tools are automatically installed as console commands when you install the package.
After installation, the following tools are available from anywhere in your terminal:
- `raft-eval` - Dataset evaluation with parallel processing
- `raft-answer` - Answer generation for evaluation datasets
- `raft-pfeval-chat` - PromptFlow chat format evaluation
- `raft-pfeval-completion` - PromptFlow completion evaluation
- `raft-pfeval-local` - Local evaluation without API calls
# Evaluate model performance on a dataset
raft-eval --question-file questions.jsonl --workers 8
# Generate answers using different models
raft-answer --input questions.jsonl --output answers.jsonl --model gpt-4
# Advanced PromptFlow evaluation
raft-pfeval-chat --input dataset.jsonl --output detailed_results.json
# 1. Generate dataset with main RAFT toolkit
raft --datapath document.pdf --output evaluation_data
# 2. Generate answers using the tools
raft-answer --input evaluation_data/questions.jsonl --output generated_answers.jsonl --workers 8
# 3. Evaluate performance
raft-eval --question-file evaluation_data/questions.jsonl --answer-file generated_answers.jsonl
# 4. Advanced PromptFlow evaluation
raft-pfeval-chat --input generated_answers.jsonl --output detailed_evaluation.json
Complete Tools Documentation: For detailed usage instructions, configuration options, and advanced workflows, see docs/TOOLS.md.
- See Deployment Guide for Azure AI Studio fine-tuning guidance
- Use generated datasets with popular fine-tuning frameworks:
- HuggingFace Transformers
- OpenAI Fine-tuning API
- Azure AI Studio
- Local training with LoRA/QLoRA
The original Python scripts are still available in the tools/ directory:
# Navigate to tools directory
cd tools/
# Basic evaluation
python eval.py --question-file YOUR_EVAL_FILE.jsonl --answer-file YOUR_ANSWER_FILE
# PromptFlow evaluations
python pfeval_chat.py --input dataset.jsonl --output results.json
python pfeval_completion.py --input dataset.jsonl --output results.json
python pfeval_local.py --input dataset.jsonl --output results.json --mode local
# Answer generation
python answer.py --input questions.jsonl --output answers.jsonl --model gpt-4
Evaluation Metrics:
- Relevance: How relevant is the answer to the question?
- Groundedness: Is the answer grounded in the provided context?
- Fluency: How fluent and natural is the language?
- Coherence: How coherent and logical is the response?
- Similarity: How similar is the answer to reference answers?
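The first four dimensions are typically scored by an LLM judge (as in the PromptFlow evaluators); similarity can also be approximated lexically. A toy approximation, not the toolkit's implementation:

```python
def token_f1(prediction: str, reference: str) -> float:
    """Rough lexical similarity between a generated answer and a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred) & set(ref))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital of France", "The capital of France is Paris"))
```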
Complete Deployment Guide: For detailed deployment instructions including Docker, Kubernetes, cloud platforms, CI/CD integration, and production configurations, see docs/DEPLOYMENT.md.
Quick Deployment Options:
- Docker: `docker compose up -d` for containerized deployment
- Kubernetes: Multi-cloud support for production scaling
- Cloud Platforms: AWS ECS, Azure Container Apps, Google Cloud Run
- CI/CD: GitHub Actions, GitLab CI, Jenkins integration
- Security: Container scanning, network policies, secret management
Local Development:
# Development mode with auto-reload
python run_web.py --debug
# Production mode
python run_web.py --host 0.0.0.0 --port 8000
See the Deployment Guide for comprehensive deployment instructions.