Agentic Data Preprocessor 🤖

A modular, class-based data preprocessing pipeline powered by local LLM agents. The project demonstrates an agentic AI architecture in which three specialized agents run sequentially to analyze, clean, and evaluate a dataset.

🌟 Key Features

🌐 Interactive Web Interface: Beautiful Streamlit dashboard for easy dataset upload and visualization
🤖 Multi-Agent Architecture: Three specialized agents with clear responsibilities
🏠 Local LLM: Uses Ollama (llama3.1:8b, deepseek-r1, etc.); the default local setup needs no API keys
📦 Modular & Extensible: Clean class-based design for easy customization
📊 Comprehensive Reporting: Each agent generates detailed JSON reports
⏱️ Timestamped Outputs: Organized output folders for each pipeline run
🔄 Backward Compatible: Original script (agentic_preprocessor.py) preserved

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                    PIPELINE ORCHESTRATOR                     │
│                        (main.py)                             │
└───────────────────────┬─────────────────────────────────────┘
                        │
        ┌───────────────┼───────────────┐
        ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Preprocessor │ │   Cleaner    │ │  Evaluator   │
│    Agent     │ │    Agent     │ │    Agent     │
│  (LLM-based) │ │ (Logic-based)│ │  (LLM-based) │
└──────────────┘ └──────────────┘ └──────────────┘

Agent Responsibilities

PreprocessorAgent 🔍
- Analyzes dataset schema
- Uses Ollama LLM to decide preprocessing steps
- Outputs: schema_summary.json, decision_report.json
CleanerAgent 🧹
- Reads preprocessing decisions
- Applies recommended cleaning steps (pure logic, no LLM)
- Outputs: cleaned_report.json, <filename>_cleaned.csv
EvaluatorAgent 📊
- Compares before/after metrics
- Validates applied steps
- Uses Ollama LLM for quality assessment
- Outputs: evaluation_report.json (final report)
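
The sequential hand-off between the three agents can be sketched as below. This is a toy illustration under stated assumptions, not the project's `main.py`: the stage functions, the shared `ctx` dictionary, and the report shapes are hypothetical, rows are plain dictionaries instead of a DataFrame, and the LLM calls are replaced by simple rules.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


def _row_key(row: Row) -> tuple:
    """Hashable fingerprint of a row, used to detect duplicates."""
    return tuple(sorted(row.items()))


def preprocessor_stage(ctx: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for PreprocessorAgent: inspect the data and decide on steps."""
    rows: List[Row] = ctx["rows"]
    has_duplicates = len({_row_key(r) for r in rows}) < len(rows)
    return {"steps": ["deduplicate"] if has_duplicates else []}


def cleaner_stage(ctx: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for CleanerAgent: apply the decided steps with plain logic."""
    rows: List[Row] = ctx["rows"]
    applied: List[str] = []
    if "deduplicate" in ctx["preprocessor"]["steps"]:
        seen = set()
        unique_rows = []
        for row in rows:
            key = _row_key(row)
            if key not in seen:
                seen.add(key)
                unique_rows.append(row)
        rows = unique_rows
        applied.append("deduplicate")
    return {"rows": rows, "applied": applied}


def evaluator_stage(ctx: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for EvaluatorAgent: compare before/after and validate steps."""
    return {
        "rows_before": len(ctx["rows"]),
        "rows_after": len(ctx["cleaner"]["rows"]),
        "steps_match": ctx["preprocessor"]["steps"] == ctx["cleaner"]["applied"],
    }


def run_pipeline(rows: List[Row]) -> Dict[str, Dict[str, Any]]:
    """Run the stages in order; each stage sees the outputs of earlier ones."""
    stages: Dict[str, Stage] = {
        "preprocessor": preprocessor_stage,
        "cleaner": cleaner_stage,
        "evaluator": evaluator_stage,
    }
    ctx: Dict[str, Any] = {"rows": rows}
    results: Dict[str, Dict[str, Any]] = {}
    for name, stage in stages.items():
        results[name] = stage(ctx)
        ctx[name] = results[name]
    return results
```

Keeping each stage's output under its own key mirrors the idea that every agent produces its own report, which the next agent reads rather than recomputes.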

📂 Project Structure

agentic_data_preprocessor/
│
├── src/
│   ├── main.py                       # 🚀 Pipeline orchestrator (NEW)
│   ├── config.py                     # ⚙️ Configuration & prompts (NEW)
│   ├── streamlit_app.py              # 🌐 Streamlit web interface
│   │
│   ├── agents/                       # 🤖 Agent modules
│   │   ├── preprocessor_agent.py     # Schema analysis + LLM decisions
│   │   ├── cleaner_agent.py          # Data cleaning operations
│   │   └── evaluator_agent.py        # Quality assessment
│   │
│   ├── models/                       # 🔗 LLM integration
│   │   └── ollama_client.py          # Ollama API wrapper
│   │
│   ├── utils/                        # 🛠️ Utility modules
│   │   ├── data_loader.py            # CSV loading
│   │   ├── schema_analyzer.py        # Schema extraction
│   │   ├── report_generator.py       # Report formatting
│   │   └── file_manager.py           # File operations
│
├── data/
│   └── sample_data.csv
│
├── output/                           # Timestamped run folders
│   └── <dataset>_<timestamp>/
│       ├── 1_preprocessor/
│       ├── 2_cleaner/
│       └── 3_evaluator/
│
├── requirements.txt
└── README.md

📋 Prerequisites

1. Python 3.8+

2. Ollama (running locally)

Install Ollama: https://ollama.ai

# Pull a model (choose one or both)
ollama pull llama3.1:8b
ollama pull deepseek-r1

# Verify Ollama is running
ollama list
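
The same check can be made from Python before starting a run. The sketch below queries Ollama's REST endpoint for listing local models (`/api/tags`) using only the standard library; the helper names `parse_model_names` and `list_local_models` are hypothetical and not part of this project, and the response is assumed to carry model names under `models[].name`.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Dict, List


def parse_model_names(payload: Dict[str, Any]) -> List[str]:
    """Extract model names from a decoded /api/tags response body."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def list_local_models(
    base_url: str = "http://localhost:11434", timeout: float = 3.0
) -> List[str]:
    """Return the names of locally pulled models.

    Raises ConnectionError if the Ollama server cannot be reached.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (urllib.error.URLError, OSError) as exc:
        raise ConnectionError(f"Ollama is not reachable at {base_url}") from exc
```

A caller could compare the returned names against the configured model (for example `"llama3.1:8b"`) and suggest `ollama pull` when it is missing.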

🚀 Quick Start

1. Clone the Repository

git clone 
cd agentic_data_preprocessor

2. Create Virtual Environment & Install Dependencies

python -m venv venv

# Windows
venv\Scripts\activate

# Mac/Linux
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

3. (Optional) Configure Settings

Edit src/config.py to customize:

# Model selection
PREPROCESSOR_MODEL = "llama3.1:8b"  # or "deepseek-r1"
EVALUATOR_MODEL = "llama3.1:8b"

# Ollama connection (default is fine for local)
OLLAMA_BASE_URL = "http://localhost:11434"

# You can also customize system prompts here

4. Run the Pipeline

Option A: Streamlit Web Interface 🌐

cd src
streamlit run streamlit_app.py

Then upload your CSV file through the web interface and click "Run Pipeline".

Option B: Command Line

cd src
python main.py --input ../data/sample_data.csv --target label

💻 Usage

Streamlit Web Interface (Recommended) 🌐

The easiest way to use the pipeline is through the interactive Streamlit web interface:

cd src
streamlit run streamlit_app.py

This will open a web browser with an interactive dashboard where you can:

📁 Upload CSV Files: Drag and drop your dataset
🎛️ Configure LLM Settings: Choose between Ollama and OpenAI/OpenRouter providers
🔧 Adjust Parameters: Customize preprocessing settings (IQR multiplier, imputation strategies, etc.)
🎯 Select Target Column: Specify your target variable
▶️ Run Pipeline: Execute the full preprocessing pipeline with one click
📊 View Results: Interactive visualizations and reports
- Original vs. Cleaned data comparison
- Missing data analysis
- Quality metrics
- Agent decisions and reports
💾 Download Results: Export cleaned CSV and JSON reports

Features:

- Real-time progress updates during pipeline execution
- Side-by-side data comparison
- Interactive Plotly charts
- Download cleaned datasets and reports
- Support for multiple LLM providers (Ollama, OpenAI, OpenRouter)

Command Line Interface

# Basic usage
python main.py --input ../data/mydata.csv

# Specify target column
python main.py --input ../data/mydata.csv --target label

# Use different models for different agents
python main.py --input ../data/mydata.csv \
  --preprocessor-model llama3.1:8b \
  --evaluator-model deepseek-r1

# Get help
python main.py --help
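
A command-line interface accepting the flags shown above could be defined with `argparse` roughly as follows. This is a sketch, not the actual contents of `main.py`: the `build_parser` name, the help texts, and the choice to make `--input` required are assumptions, while the defaults are taken from the values listed for `src/config.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags used in the examples above."""
    parser = argparse.ArgumentParser(
        description="Run the preprocessing pipeline on a CSV file."
    )
    parser.add_argument("--input", required=True, help="Path to the input CSV file")
    parser.add_argument("--target", default="label", help="Name of the target column")
    parser.add_argument(
        "--preprocessor-model",
        default="llama3.1:8b",
        help="Ollama model used by the preprocessor agent",
    )
    parser.add_argument(
        "--evaluator-model",
        default="llama3.1:8b",
        help="Ollama model used by the evaluator agent",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # argparse turns "--preprocessor-model" into args.preprocessor_model.
    print(args.input, args.target, args.preprocessor_model, args.evaluator_model)
```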

Programmatic Usage

from main import Pipeline

# Create and execute pipeline
pipeline = Pipeline(
    input_csv="../data/sample_data.csv",
    target_col="label",
    preprocessor_model="llama3.1:8b",
    evaluator_model="llama3.1:8b"
)

results = pipeline.execute()

# Access results
print(f"Quality Score: {results['evaluator']['quality_score']}/100")
print(f"Cleaned CSV: {results['cleaner']['cleaned_csv_path']}")
print(f"Output Directory: {results['output_directory']}")

Using Individual Agents

from agents import PreprocessorAgent, CleanerAgent, EvaluatorAgent
from utils.data_loader import load_csv
from pathlib import Path

# Load data
df = load_csv("../data/mydata.csv")
output_dir = Path("../output/my_run")

# 1. Use PreprocessorAgent
preprocessor = PreprocessorAgent(output_dir, model_name="llama3.1:8b")
prep_results = preprocessor.run(df)

# 2. Use CleanerAgent
cleaner = CleanerAgent(output_dir)
clean_results = cleaner.run(
    df=df,
    decision_report_path=prep_results["output_paths"]["decision_report"],
    target_col="label"
)

# 3. Use EvaluatorAgent
evaluator = EvaluatorAgent(output_dir, model_name="llama3.1:8b")
eval_results = evaluator.run(
    original_df=df,
    cleaned_df=clean_results["cleaned_df"],
    preprocessor_dir=prep_results["output_dir"],
    cleaner_dir=clean_results["output_dir"]
)

📊 Output Structure

Each pipeline run creates a timestamped folder:

output/sample_data_2025-10-06_14-30-45/
│
├── 1_preprocessor/
│   ├── schema_summary.json           # Dataset analysis
│   └── decision_report.json          # LLM recommendations
│
├── 2_cleaner/
│   ├── cleaned_report.json           # Cleaning operations log
│   └── sample_data_cleaned.csv       # Cleaned data
│
└── 3_evaluator/
    └── evaluation_report.json        # Final quality assessment
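
A run folder with this layout can be created with `pathlib` and `datetime`. The sketch below reproduces the naming pattern from the example (`<dataset>_<YYYY-MM-DD_HH-MM-SS>`); the function name `make_run_dirs` and its return shape are hypothetical, not the project's `file_manager.py` API.

```python
from datetime import datetime
from pathlib import Path
from typing import Dict, Optional

STAGE_DIRS = ("1_preprocessor", "2_cleaner", "3_evaluator")


def make_run_dirs(
    input_csv: str, output_root: str, now: Optional[datetime] = None
) -> Dict[str, Path]:
    """Create <output_root>/<dataset>_<timestamp>/ with one subfolder per agent."""
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    run_dir = Path(output_root) / f"{Path(input_csv).stem}_{stamp}"
    paths: Dict[str, Path] = {"run": run_dir}
    for name in STAGE_DIRS:
        stage_dir = run_dir / name
        # parents=True also creates the run folder and the output root if needed.
        stage_dir.mkdir(parents=True, exist_ok=True)
        paths[name] = stage_dir
    return paths
```

Passing `now` explicitly makes the folder name predictable, which is convenient in tests.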

🎯 Preprocessing Steps

The pipeline can apply these steps (decided by PreprocessorAgent):

- missing_data: Impute missing values (numeric → median, categorical → mode)
- dtype_conversion: Convert data types (parse dates, convert low-cardinality columns to categories)
- deduplicate: Remove duplicate rows
- outliers: Cap outliers using the IQR method
- categorical_encoding: One-hot encode categorical columns
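
Two of these steps, `missing_data` and `outliers`, can be illustrated with pandas. The functions below are a minimal sketch of the techniques named in the list (median/mode imputation and IQR capping), not the CleanerAgent's actual code; the function names are hypothetical, and the default multiplier of 1.5 matches the `IQR_MULTIPLIER` value shown in the configuration.

```python
import pandas as pd


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode(dropna=True)
            # mode() is empty when the whole column is missing; leave it as-is.
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])
    return out


def cap_outliers_iqr(series: pd.Series, multiplier: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - m*IQR, Q3 + m*IQR] instead of dropping rows."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - multiplier * iqr, upper=q3 + multiplier * iqr)
```

Capping keeps the row count unchanged, which makes before/after comparisons easier than when outlier rows are removed.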

⚙️ Configuration

All settings are in src/config.py:

# Model Configuration
PREPROCESSOR_MODEL = "llama3.1:8b"
EVALUATOR_MODEL = "llama3.1:8b"
OLLAMA_BASE_URL = "http://localhost:11434"
TEMPERATURE = 0.2

# Preprocessing Parameters
IQR_MULTIPLIER = 1.5
CATEGORY_THRESHOLD = 50
NUMERIC_IMPUTATION_STRATEGY = "median"
CATEGORICAL_IMPUTATION_STRATEGY = "most_frequent"

# Paths
DATA_DIR = "../data"
OUTPUT_DIR = "../output"
DEFAULT_TARGET_COL = "label"

# System prompts for agents (fully customizable)
PREPROCESSOR_SYSTEM_PROMPT = """..."""
EVALUATOR_SYSTEM_PROMPT = """..."""
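
How `CATEGORY_THRESHOLD` is applied is not spelled out here. One plausible reading, sketched below as an assumption, is that text columns with at most that many distinct values are converted to the pandas `category` dtype, after which categorical columns other than the target are one-hot encoded with `pd.get_dummies`. The function names are hypothetical.

```python
from typing import List, Optional

import pandas as pd

CATEGORY_THRESHOLD = 50  # value from the config above; its exact use is assumed


def _is_text(series: pd.Series) -> bool:
    """True for object or string columns (not numeric, not already categorical)."""
    return pd.api.types.is_object_dtype(series) or pd.api.types.is_string_dtype(series)


def categorize_low_cardinality(
    df: pd.DataFrame, threshold: int = CATEGORY_THRESHOLD
) -> pd.DataFrame:
    """Convert text columns with at most `threshold` distinct values to category."""
    out = df.copy()
    for col in out.columns:
        if _is_text(out[col]) and out[col].nunique(dropna=True) <= threshold:
            out[col] = out[col].astype("category")
    return out


def one_hot_encode(df: pd.DataFrame, target_col: Optional[str] = None) -> pd.DataFrame:
    """One-hot encode text/categorical columns, leaving the target untouched."""
    cols: List[str] = [
        c
        for c in df.columns
        if c != target_col
        and (isinstance(df[c].dtype, pd.CategoricalDtype) or _is_text(df[c]))
    ]
    if not cols:
        return df.copy()
    return pd.get_dummies(df, columns=cols)
```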

🔧 Troubleshooting

Ollama Connection Error

Failed to connect to Ollama. Please ensure Ollama is running at http://localhost:11434

Solution: Start Ollama

- Windows: Open Ollama app from Start menu
- Mac: Open Ollama from Applications
- Linux: Run ollama serve

Model Not Found

Model 'llama3.1:8b' not found. Please pull it using: ollama pull llama3.1:8b

Solution: Pull the model

ollama pull llama3.1:8b

Import Errors

Import "pandas" could not be resolved

Solution: Install dependencies

pip install -r requirements.txt

🆚 Original vs. New Architecture

| Aspect | Original (`agentic_preprocessor.py`) | New (Modular) |
| --- | --- | --- |
| Structure | Single script | Multi-agent classes |
| LLM | OpenAI GPT (cloud) | Ollama (local) |
| Modularity | Monolithic | Highly modular |
| Agents | 1 (combined) | 3 (specialized) |
| Reports | 1 text file | JSON reports from each of the 3 agents |
| Reusability | Limited | High (import classes) |
| Extensibility | Difficult | Easy (add new agents) |

📚 Class Documentation

OllamaClient

class OllamaClient:
    """Wrapper for Ollama API interactions"""

    # Method summary; bodies are omitted.
    def __init__(self, model_name, base_url, temperature): ...
    def generate(self, prompt, system_prompt, temperature) -> str: ...
    def generate_json(self, prompt, system_prompt) -> dict: ...
    def parse_json_response(self, response) -> dict: ...
    def health_check(self) -> bool: ...
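
LLM replies often wrap JSON in extra prose or code fences, and reasoning models such as deepseek-r1 may emit a `<think>...</think>` section first. A step like `parse_json_response` has to cope with that. The standalone function below is one possible approach, written as a sketch; it is not the project's implementation and its behavior is an assumption.

```python
import json
import re
from typing import Any, Dict


def parse_json_response(response: str) -> Dict[str, Any]:
    """Return the first JSON object found in a model reply.

    Raises ValueError if no JSON object can be decoded.
    """
    # Drop any reasoning section so braces inside it are not mistaken for output.
    text = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    decoder = json.JSONDecoder()
    # Try to decode an object starting at each "{" until one parses.
    for match in re.finditer(r"\{", text):
        try:
            obj, _end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    raise ValueError("no JSON object found in model response")
```

Scanning from each opening brace avoids brittle assumptions about surrounding text, at the cost of accepting the first object even if the model emitted several.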

PreprocessorAgent

class PreprocessorAgent:
    """Schema analysis and preprocessing decision agent"""

    # Method summary; bodies are omitted.
    def __init__(self, output_dir, model_name, llm_client): ...
    def analyze_schema(self, df) -> dict: ...
    def decide_steps(self, schema) -> dict: ...
    def save_reports(self) -> dict: ...
    def run(self, df) -> dict: ...
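
The kind of summary that `analyze_schema` might hand to the LLM can be sketched with pandas as follows. The field names in the returned dictionary are illustrative assumptions and may differ from the real `schema_summary.json`.

```python
from typing import Any, Dict

import pandas as pd


def analyze_schema(df: pd.DataFrame) -> Dict[str, Any]:
    """Build a JSON-serializable summary of a DataFrame's structure."""
    columns: Dict[str, Dict[str, Any]] = {}
    for col in df.columns:
        series = df[col]
        columns[str(col)] = {
            "dtype": str(series.dtype),
            "missing": int(series.isna().sum()),
            "unique": int(series.nunique(dropna=True)),
        }
    return {
        "n_rows": int(len(df)),
        "n_columns": int(df.shape[1]),
        "n_duplicate_rows": int(df.duplicated().sum()),
        "columns": columns,
    }
```

Casting counts to plain `int` keeps NumPy integer types out of the result, so `json.dumps` can serialize it directly.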

CleanerAgent

class CleanerAgent:
    """Data cleaning operations agent (no LLM)"""

    # Method summary; bodies are omitted.
    def __init__(self, output_dir): ...
    def apply_dtype_conversion(self, df) -> tuple: ...
    def apply_deduplication(self, df) -> tuple: ...
    def apply_missing_data_imputation(self, df, target) -> tuple: ...
    def apply_outlier_capping(self, df, target) -> tuple: ...
    def apply_categorical_encoding(self, df, target) -> tuple: ...
    def clean(self, df, decision_report, target) -> tuple: ...
    def save_outputs(self, cleaned_df, report, filename) -> dict: ...
    def run(self, df, decision_report_path, target) -> dict: ...
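
A `clean`-style method could dispatch the requested steps to handler functions, each returning the updated DataFrame together with a small log entry. The sketch below assumes that the decision report boils down to a list of step names and that steps run in the order the methods are listed above; both are assumptions, and only one handler is implemented for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

import pandas as pd

StepResult = Tuple[pd.DataFrame, Dict[str, Any]]
Handler = Callable[[pd.DataFrame], StepResult]

# Assumed execution order, following the method listing above.
STEP_ORDER = (
    "dtype_conversion",
    "deduplicate",
    "missing_data",
    "outliers",
    "categorical_encoding",
)


def apply_deduplication(df: pd.DataFrame) -> StepResult:
    """Drop exact duplicate rows and report how many were removed."""
    out = df.drop_duplicates().reset_index(drop=True)
    return out, {"rows_removed": len(df) - len(out)}


def clean(
    df: pd.DataFrame, requested_steps: List[str], handlers: Dict[str, Handler]
) -> Tuple[pd.DataFrame, Dict[str, Any]]:
    """Apply requested steps in a fixed order and log what happened."""
    report: Dict[str, Any] = {"applied": [], "skipped": []}
    for step in STEP_ORDER:
        if step not in requested_steps:
            continue
        handler = handlers.get(step)
        if handler is None:
            # Requested but not implemented: record it instead of failing.
            report["skipped"].append(step)
            continue
        df, details = handler(df)
        report["applied"].append({"step": step, **details})
    return df, report
```

Recording skipped steps gives a later validation stage something concrete to compare against the original decisions.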

EvaluatorAgent

from pathlib import Path

class EvaluatorAgent:
    """Quality evaluation and validation agent"""

    # Method summary; bodies are omitted.
    def __init__(self, output_dir, model_name, llm_client): ...
    def load_reports(self, preprocessor_dir, cleaner_dir) -> dict: ...
    def compare_metrics(self, original_df, cleaned_df) -> dict: ...
    def validate_steps(self, decision_report, cleaning_report) -> dict: ...
    def generate_evaluation(self, comparison, validation, reports) -> dict: ...
    def save_report(self, evaluation) -> Path: ...
    def run(self, original_df, cleaned_df, preprocessor_dir, cleaner_dir) -> dict: ...
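
A before/after comparison in the spirit of `compare_metrics` might look like the sketch below. The chosen metrics (row count, column count, missing cells, duplicate rows) and the result layout are assumptions for illustration, not the contents of the real `evaluation_report.json`.

```python
from typing import Dict

import pandas as pd


def summarize(df: pd.DataFrame) -> Dict[str, int]:
    """Collect a few simple data-quality counts for one DataFrame."""
    return {
        "rows": int(len(df)),
        "columns": int(df.shape[1]),
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


def compare_metrics(
    original_df: pd.DataFrame, cleaned_df: pd.DataFrame
) -> Dict[str, Dict[str, int]]:
    """Return before/after counts and their difference (after minus before)."""
    before, after = summarize(original_df), summarize(cleaned_df)
    delta = {key: after[key] - before[key] for key in before}
    return {"before": before, "after": after, "delta": delta}
```

Negative deltas for `missing_cells` and `duplicate_rows` indicate that cleaning reduced those problems, and numbers like these can be passed to the LLM as context for its quality assessment.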

🤝 Contributing

Contributions are welcome! Areas for improvement:

- Add more preprocessing strategies
- Support additional LLM backends (Anthropic, Groq, etc.)
- Add visualization agent for data quality reports
- Implement parallel processing for large datasets

📝 License

MIT License - feel free to use for learning, portfolios, or production!

💡 Use Cases

📈 Data Science Projects: Automate preprocessing for ML pipelines
🎓 Learning: Study modular agent architecture
💼 Portfolios: Demonstrate Agentic AI expertise
🏢 Production: Adapt for enterprise data quality workflows
