A high-performance Python tool for processing large batches of inputs through an Ollama-hosted LLM, with concurrent workers, real-time progress tracking, and performance metrics.
- ✅ Concurrent Processing: Process multiple requests in parallel with configurable worker count
- ✅ Real-time Progress: Live progress bar with ETA, percentage, and average response time
- ✅ Line Preservation: Output file maintains exact line correspondence with input file
- ✅ Error Handling: Automatic JSON parsing error logging to separate error file
- ✅ Performance Metrics: Detailed statistics including throughput, avg response time, and total time
- ✅ Benchmark Mode: Test different worker counts to find optimal performance
- ✅ Robust JSON Extraction: Handles LLM responses with extra text around JSON
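As a rough illustration of the first two features, here is a minimal, hypothetical sketch of how parallel workers can still preserve input order. The `process_line` function and the use of `ThreadPoolExecutor` are assumptions for illustration, not a description of main.py's actual internals:

```python
from concurrent.futures import ThreadPoolExecutor

def process_line(line: str) -> str:
    # Stand-in for a real Ollama request (see the API sketch further below).
    return line.upper()

def run_batch(lines: list[str], workers: int = 5) -> list[str]:
    # executor.map yields results in submission order, so result N always
    # corresponds to input line N even though requests run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(process_line, lines))

if __name__ == "__main__":
    print(run_batch(["alpha", "beta", "gamma"], workers=2))
```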
Install Ollama (if not already installed):
- Visit https://ollama.ai and follow installation instructions
- Start Ollama service
Pull the model (example with gemma3:1b):
ollama pull gemma3:1b
Install Python dependencies:
pip install -r requirements.txt
Edit config.py to customize settings:
# Ollama Configuration
OLLAMA_MODEL = "gemma3:1b" # Change to your preferred model
OLLAMA_BASE_URL = "http://localhost:11434"
OLLAMA_CONTEXT = 4096
OLLAMA_KEEP_ALIVE = 30 # Minutes to keep model in memory
# File Paths
PROMPT_FILE = "prompt.txt"
INPUT_FILE = "input.txt"
OUTPUT_FILE = "output.jsonl"
ERROR_FILE = "errors.log"
# Performance Settings
PARALLEL_WORKERS = 5 # Adjust based on your system
REQUEST_TIMEOUT = 120 # Seconds per request
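Assuming the standard Ollama HTTP API and the `requests` library, these settings could map onto a single generate call roughly as follows. The `query_ollama` helper is illustrative, not the actual code in main.py:

```python
import requests

OLLAMA_MODEL = "gemma3:1b"
OLLAMA_BASE_URL = "http://localhost:11434"
OLLAMA_CONTEXT = 4096
OLLAMA_KEEP_ALIVE = 30   # minutes to keep the model in memory
REQUEST_TIMEOUT = 120    # seconds per request

def query_ollama(prompt: str) -> str:
    """Send one non-streaming /api/generate request and return the raw text."""
    payload = {
        "model": OLLAMA_MODEL,
        "prompt": prompt,
        "stream": False,
        "keep_alive": f"{OLLAMA_KEEP_ALIVE}m",
        "options": {"num_ctx": OLLAMA_CONTEXT},
    }
    resp = requests.post(f"{OLLAMA_BASE_URL}/api/generate",
                         json=payload, timeout=REQUEST_TIMEOUT)
    resp.raise_for_status()
    return resp.json()["response"]
```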
Prepare your input file (input.txt):
- One input item per line
- Each line will be processed separately
Customize your prompt (prompt.txt):
- Use {INPUT} as a placeholder for each line from the input file (a minimal templating sketch follows these steps)
- Example:
You are an expert classifier. Analyze this input: {INPUT} Return valid JSON with your analysis.
Run the processor:
python main.py
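The templating sketch referenced in step 2 above: a minimal, assumed implementation of the {INPUT} substitution (the `build_prompt` name is hypothetical; main.py may do this differently):

```python
def build_prompt(template: str, line: str) -> str:
    # Replace the {INPUT} placeholder with the current input line.
    return template.replace("{INPUT}", line)

with open("prompt.txt", encoding="utf-8") as f:
    template = f.read()

with open("input.txt", encoding="utf-8") as f:
    prompts = [build_prompt(template, line.rstrip("\n")) for line in f]
```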
- output.jsonl: JSONL file with one JSON object per line (matching input line numbers)
- errors.log: JSON log entries for any failed parsing attempts
While running, you'll see:
[12/17] 70.6% | Avg: 2.34s | ETA: 0:00:12 | Elapsed: 0:00:28
- [12/17]: Current item / Total items
- 70.6%: Completion percentage
- Avg: 2.34s: Average response time per item
- ETA: 0:00:12: Estimated time to completion
- Elapsed: 0:00:28: Total time elapsed
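For reference, a minimal sketch of how such a progress line can be computed from counts and elapsed time (illustrative only; the actual formatting in main.py may differ slightly):

```python
import datetime

def progress_line(done: int, total: int, elapsed_s: float) -> str:
    avg = elapsed_s / done if done else 0.0
    eta = datetime.timedelta(seconds=round(avg * (total - done)))
    elapsed = datetime.timedelta(seconds=round(elapsed_s))
    return (f"[{done}/{total}] {done / total:.1%} | "
            f"Avg: {avg:.2f}s | ETA: {eta} | Elapsed: {elapsed}")

print(progress_line(12, 17, 28.0))  # close to the example line above
```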
Use the benchmark script to test different worker counts:
python benchmark.py
This will:
- Test with multiple worker counts (1, 3, 5, 10, 15, 20)
- Measure throughput and response times
- Recommend the optimal worker count for your system
- Save detailed results to benchmark_results.json
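A hypothetical sketch of what such a benchmark loop might look like. The `benchmark` function and its `process_batch` parameter are assumptions; `process_batch` stands for any batch runner, such as the `run_batch` sketch near the top of this README:

```python
import json
import time

def benchmark(process_batch, sample_lines, worker_counts=(1, 3, 5, 10, 15, 20)):
    results = []
    for workers in worker_counts:
        start = time.perf_counter()
        process_batch(sample_lines, workers=workers)
        elapsed = time.perf_counter() - start
        results.append({
            "workers": workers,
            "seconds": round(elapsed, 2),
            "items_per_second": round(len(sample_lines) / elapsed, 2),
        })
    with open("benchmark_results.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    # Recommend the configuration with the highest throughput.
    return max(results, key=lambda r: r["items_per_second"])
```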
System Resources:
- CPU cores available
- RAM (models need memory)
- Disk I/O speed
Model Size:
- Smaller models (1b-7b): Can handle more workers
- Larger models (13b+): Need fewer workers due to memory/CPU constraints
Ollama Configuration:
- Ensure Ollama has sufficient resources allocated
- Consider running multiple Ollama instances for extreme parallelism
- Start with 5 workers and adjust based on benchmark results
- Monitor system resources during processing
- Larger models may benefit from fewer workers (3-5)
- Smaller models can handle more workers (10-20+)
- SSD vs HDD: Faster storage helps with model loading
The included example classifies music file paths:
Input (input.txt):
C:\Users\Sam\Music\10cc 20th Anniversary\CD14\12 24 Hours (Edit).opus
Prompt (prompt.txt):
You are an expert music classifier.
Extract metadata from this file path: {INPUT}
Return JSON: {"artist": "", "album": "", "year": "", "track_number": "", "track_name": ""}
Output (output.jsonl):
{
"artist": "10cc",
"album": "20th Anniversary - CD14",
"year": "",
"track_number": "12",
"track_name": "24 Hours (Edit)"
}
Errors are logged to errors.log in JSON format:
{
"timestamp": "2025-11-13 14:23:45",
"line_number": 5,
"input": "problematic input line",
"error": "JSON parse error: Expecting value: line 1 column 1"
}
Failed items leave an empty line in output.jsonl to maintain line correspondence.
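A minimal sketch of that behaviour, assuming a handler roughly like the following (the `handle_result` name and signature are illustrative):

```python
import datetime
import json

def handle_result(raw_response: str, line_number: int, input_line: str,
                  out_file, err_file) -> None:
    try:
        parsed = json.loads(raw_response)
        out_file.write(json.dumps(parsed, ensure_ascii=False) + "\n")
    except json.JSONDecodeError as exc:
        err_file.write(json.dumps({
            "timestamp": datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "line_number": line_number,
            "input": input_line,
            "error": f"JSON parse error: {exc}",
        }) + "\n")
        out_file.write("\n")  # empty line keeps output aligned with input
```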
- Ensure Ollama is running: ollama serve
- Check Ollama URL in config matches your setup
- Verify model is pulled: ollama list
- Run benchmark.py to find optimal worker count
- Reduce PARALLEL_WORKERS if system is overloaded
- Increase REQUEST_TIMEOUT for slow responses
- Check system resources (CPU, RAM, disk)
- Check errors.log for specific failures
- Improve prompt to ensure LLM returns valid JSON
- The system automatically extracts JSON from surrounding text
The system automatically finds JSON in LLM responses:
- Searches for the first { and last }
- Extracts and parses the JSON portion
- Handles extra text before/after JSON
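That strategy is simple enough to show in a few lines; a minimal sketch (the `extract_json` name is illustrative):

```python
import json

def extract_json(text: str) -> dict:
    # Take the substring between the first '{' and the last '}' and parse it.
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("No JSON object found in response")
    return json.loads(text[start:end + 1])

print(extract_json('Sure! Here is the result: {"artist": "10cc"} Hope that helps.'))
```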
The system guarantees:
- Output line N corresponds to input line N
- Failed items produce empty lines (errors logged separately)
- JSONL format for easy line-by-line processing
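Because of these guarantees, results can be joined back to their inputs with a simple line-by-line zip; an illustrative consumer-side snippet:

```python
import json

with open("input.txt", encoding="utf-8") as inp, \
     open("output.jsonl", encoding="utf-8") as out:
    for n, (src, result) in enumerate(zip(inp, out), start=1):
        result = result.strip()
        if not result:
            print(f"line {n}: failed (see errors.log)")
        else:
            print(f"line {n}: {src.strip()!r} -> {json.loads(result)}")
```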
With the example configuration (gemma3:1b, 5 workers):
- Throughput: 2-5 items/second (depending on system)
- Response time: 0.5-2 seconds per item
- Scaling: Near-linear up to CPU core count
This project is provided as-is for batch processing tasks with Ollama.
Feel free to submit issues or pull requests for improvements!