chindathorn/agentic_data_preprocessor

Agentic Data Preprocessor 🤖

A modular, class-based data preprocessing pipeline powered by local LLM agents. This project demonstrates modern Agentic AI architecture with three specialized agents working sequentially to analyze, clean, and evaluate datasets.

🌟 Key Features

  • 🌐 Interactive Web Interface: Beautiful Streamlit dashboard for easy dataset upload and visualization
  • πŸ€– Multi-Agent Architecture: Three specialized agents with clear responsibilities
  • 🏠 Local LLM: Uses Ollama (llama3.1:8b, deepseek-r1, etc.) - no API keys needed
  • πŸ“¦ Modular & Extensible: Clean class-based design for easy customization
  • πŸ“Š Comprehensive Reporting: Each agent generates detailed JSON reports
  • ⏱️ Timestamped Outputs: Organized output folders for each pipeline run
  • πŸ”„ Backward Compatible: Original script (agentic_preprocessor.py) preserved

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                    PIPELINE ORCHESTRATOR                    │
│                        (main.py)                            │
└───────────────────────┬─────────────────────────────────────┘
                        │
        ┌───────────────┼───────────────┐
        ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Preprocessor │ │   Cleaner    │ │  Evaluator   │
│    Agent     │ │    Agent     │ │    Agent     │
│  (LLM-based) │ │ (Logic-based)│ │  (LLM-based) │
└──────────────┘ └──────────────┘ └──────────────┘

Agent Responsibilities

  1. PreprocessorAgent 🔍

    • Analyzes dataset schema
    • Uses Ollama LLM to decide preprocessing steps
    • Outputs: schema_summary.json, decision_report.json
  2. CleanerAgent 🧹

    • Reads preprocessing decisions
    • Applies recommended cleaning steps (pure logic, no LLM)
    • Outputs: cleaned_report.json, <filename>_cleaned.csv
  3. EvaluatorAgent 📊

    • Compares before/after metrics
    • Validates applied steps
    • Uses Ollama LLM for quality assessment
    • Outputs: evaluation_report.json (final report)

📂 Project Structure

agentic-preprocessor-demo/
│
├── src/
│   ├── main.py                       # 🚀 Pipeline orchestrator (NEW)
│   ├── config.py                     # ⚙️ Configuration & prompts (NEW)
│   │
│   ├── agents/                       # 🤖 Agent modules
│   │   ├── preprocessor_agent.py     # Schema analysis + LLM decisions
│   │   ├── cleaner_agent.py          # Data cleaning operations
│   │   └── evaluator_agent.py        # Quality assessment
│   │
│   ├── models/                       # 🔗 LLM integration
│   │   └── ollama_client.py          # Ollama API wrapper
│   │
│   ├── utils/                        # 🛠️ Utility modules
│   │   ├── data_loader.py            # CSV loading
│   │   ├── schema_analyzer.py        # Schema extraction
│   │   ├── report_generator.py       # Report formatting
│   │   └── file_manager.py           # File operations
│
├── data/
│   └── sample_data.csv
│
├── output/                           # Timestamped run folders
│   └── <dataset>_<timestamp>/
│       ├── 1_preprocessor/
│       ├── 2_cleaner/
│       └── 3_evaluator/
│
├── requirements.txt
└── README.md

📋 Prerequisites

1. Python 3.8+

2. Ollama (running locally)

Install Ollama: https://ollama.ai

# Pull a model (choose one or both)
ollama pull llama3.1:8b
ollama pull deepseek-r1

# Verify Ollama is running
ollama list
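The same check can be scripted. The following is a minimal sketch, assuming the default base URL used in this README and Ollama's /api/tags endpoint for listing locally installed models. The helper names parse_model_names and list_local_models are hypothetical and are not part of this project.

```python
import json
import urllib.request


def parse_model_names(payload: str) -> list:
    """Extract model names from the JSON body returned by /api/tags."""
    data = json.loads(payload)
    return [model["name"] for model in data.get("models", [])]


def list_local_models(base_url: str = "http://localhost:11434", timeout: float = 3.0):
    """Return installed model names, or None if Ollama cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(resp.read().decode("utf-8"))
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return None
```

Calling list_local_models() returns None when the server is down, which separates "Ollama is not running" from "Ollama is running but the model has not been pulled" (an empty or incomplete list).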

🚀 Quick Start

1. Clone the Repository

git clone <repository-url>
cd agentic_data_preprocessor

2. Create Virtual Environment & Install Dependencies

python -m venv venv

# Windows
venv\Scripts\activate

# Mac/Linux
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

3. (Optional) Configure Settings

Edit src/config.py to customize:

# Model selection
PREPROCESSOR_MODEL = "llama3.1:8b"  # or "deepseek-r1"
EVALUATOR_MODEL = "llama3.1:8b"

# Ollama connection (default is fine for local)
OLLAMA_BASE_URL = "http://localhost:11434"

# You can also customize system prompts here

4. Run the Pipeline

Option A: Streamlit Web Interface 🌐

cd src
streamlit run streamlit_app.py

Then upload your CSV file through the web interface and click "Run Pipeline".

Option B: Command Line

cd src
python main.py --input ../data/sample_data.csv --target label

💻 Usage

Streamlit Web Interface (Recommended) 🌐

The easiest way to use the pipeline is through the interactive Streamlit web interface:

cd src
streamlit run streamlit_app.py

This will open a web browser with an interactive dashboard where you can:

  • 📁 Upload CSV Files: Drag and drop your dataset
  • 🎛️ Configure LLM Settings: Choose between Ollama and OpenAI/OpenRouter providers
  • 🔧 Adjust Parameters: Customize preprocessing settings (IQR multiplier, imputation strategies, etc.)
  • 🎯 Select Target Column: Specify your target variable
  • ▶️ Run Pipeline: Execute the full preprocessing pipeline with one click
  • 📊 View Results: Interactive visualizations and reports
    • Original vs. Cleaned data comparison
    • Missing data analysis
    • Quality metrics
    • Agent decisions and reports
  • 💾 Download Results: Export cleaned CSV and JSON reports

Features:

  • Real-time progress updates during pipeline execution
  • Side-by-side data comparison
  • Interactive Plotly charts
  • Download cleaned datasets and reports
  • Support for multiple LLM providers (Ollama, OpenAI, OpenRouter)

Command Line Interface

# Basic usage
python main.py --input ../data/mydata.csv

# Specify target column
python main.py --input ../data/mydata.csv --target label

# Use different models for different agents
python main.py --input ../data/mydata.csv \
  --preprocessor-model llama3.1:8b \
  --evaluator-model deepseek-r1

# Get help
python main.py --help
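For orientation, the flags used above could be declared with argparse roughly as follows. This is a sketch, not the actual contents of main.py: the flag names come from the examples above, while the defaults are assumptions borrowed from the config values shown in this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(description="Run the preprocessing pipeline")
    parser.add_argument("--input", required=True, help="Path to the input CSV")
    # Defaults below are assumptions based on the config values in this README.
    parser.add_argument("--target", default="label", help="Target column name")
    parser.add_argument("--preprocessor-model", default="llama3.1:8b")
    parser.add_argument("--evaluator-model", default="llama3.1:8b")
    return parser


# Parse an explicit argument list instead of sys.argv to keep the demo standalone.
args = build_parser().parse_args(
    ["--input", "../data/mydata.csv", "--evaluator-model", "deepseek-r1"]
)
print(args.input, args.target, args.preprocessor_model, args.evaluator_model)
```

argparse converts dashes in flag names to underscores, so --evaluator-model is read back as args.evaluator_model.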

Programmatic Usage

from main import Pipeline

# Create and execute pipeline
pipeline = Pipeline(
    input_csv="../data/sample_data.csv",
    target_col="label",
    preprocessor_model="llama3.1:8b",
    evaluator_model="llama3.1:8b"
)

results = pipeline.execute()

# Access results
print(f"Quality Score: {results['evaluator']['quality_score']}/100")
print(f"Cleaned CSV: {results['cleaner']['cleaned_csv_path']}")
print(f"Output Directory: {results['output_directory']}")

Using Individual Agents

from agents import PreprocessorAgent, CleanerAgent, EvaluatorAgent
from utils.data_loader import load_csv
from pathlib import Path

# Load data
df = load_csv("../data/mydata.csv")
output_dir = Path("../output/my_run")

# 1. Use PreprocessorAgent
preprocessor = PreprocessorAgent(output_dir, model_name="llama3.1:8b")
prep_results = preprocessor.run(df)

# 2. Use CleanerAgent
cleaner = CleanerAgent(output_dir)
clean_results = cleaner.run(
    df=df,
    decision_report_path=prep_results["output_paths"]["decision_report"],
    target_col="label"
)

# 3. Use EvaluatorAgent
evaluator = EvaluatorAgent(output_dir, model_name="llama3.1:8b")
eval_results = evaluator.run(
    original_df=df,
    cleaned_df=clean_results["cleaned_df"],
    preprocessor_dir=prep_results["output_dir"],
    cleaner_dir=clean_results["output_dir"]
)

📊 Output Structure

Each pipeline run creates a timestamped folder:

output/sample_data_2025-10-06_14-30-45/
│
├── 1_preprocessor/
│   ├── schema_summary.json           # Dataset analysis
│   └── decision_report.json          # LLM recommendations
│
├── 2_cleaner/
│   ├── cleaned_report.json           # Cleaning operations log
│   └── sample_data_cleaned.csv       # Cleaned data
│
└── 3_evaluator/
    └── evaluation_report.json        # Final quality assessment
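A layout like this can be produced with pathlib and datetime. The sketch below is one possible way to do it, assuming the folder name is the CSV file stem followed by a timestamp in the format shown above. The function names are hypothetical, and the project's own file_manager.py may work differently.

```python
from datetime import datetime
from pathlib import Path


def run_dir_name(csv_path: str, now: datetime) -> str:
    """Build '<dataset>_<timestamp>' from the CSV stem and a timestamp."""
    return f"{Path(csv_path).stem}_{now.strftime('%Y-%m-%d_%H-%M-%S')}"


def create_run_dirs(output_root: Path, csv_path: str, now: datetime) -> Path:
    """Create the run folder with one subfolder per agent and return its path."""
    run_dir = output_root / run_dir_name(csv_path, now)
    for sub in ("1_preprocessor", "2_cleaner", "3_evaluator"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    return run_dir


print(run_dir_name("../data/sample_data.csv", datetime(2025, 10, 6, 14, 30, 45)))
# sample_data_2025-10-06_14-30-45
```

Passing the timestamp in as an argument, rather than calling datetime.now() inside the function, keeps the naming logic easy to test.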

🎯 Preprocessing Steps

The pipeline can apply these steps (decided by PreprocessorAgent):

  • missing_data: Impute missing values (numeric → median, categorical → mode)
  • dtype_conversion: Convert data types (parse dates, categorize low-cardinality)
  • deduplicate: Remove duplicate rows
  • outliers: Cap outliers using IQR method
  • categorical_encoding: One-Hot Encoding for categorical columns
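To make the outliers step concrete, here is a small sketch of IQR-based capping in plain Python. It assumes "capping" means clamping values to the range [Q1 - k·IQR, Q3 + k·IQR] with k = 1.5 (the IQR_MULTIPLIER default in this README). The project presumably does this on pandas columns; the standard library is used here only to keep the example dependency-free.

```python
import statistics


def cap_outliers_iqr(values, multiplier=1.5):
    """Clamp each value to [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between the observed data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [min(max(v, lower), upper) for v in values]


print(cap_outliers_iqr([1, 2, 3, 4, 100]))  # [1, 2, 3, 4, 7.0]
```

In the example, Q1 = 2 and Q3 = 4, so the upper bound is 4 + 1.5 × 2 = 7 and the value 100 is pulled down to 7.0 instead of being dropped.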

⚙️ Configuration

All settings are in src/config.py:

# Model Configuration
PREPROCESSOR_MODEL = "llama3.1:8b"
EVALUATOR_MODEL = "llama3.1:8b"
OLLAMA_BASE_URL = "http://localhost:11434"
TEMPERATURE = 0.2

# Preprocessing Parameters
IQR_MULTIPLIER = 1.5
CATEGORY_THRESHOLD = 50
NUMERIC_IMPUTATION_STRATEGY = "median"
CATEGORICAL_IMPUTATION_STRATEGY = "most_frequent"

# Paths
DATA_DIR = "../data"
OUTPUT_DIR = "../output"
DEFAULT_TARGET_COL = "label"

# System prompts for agents (fully customizable)
PREPROCESSOR_SYSTEM_PROMPT = """..."""
EVALUATOR_SYSTEM_PROMPT = """..."""
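As an illustration of how the two imputation strategy names might be interpreted, the sketch below maps them to standard-library statistics. The mapping and the impute helper are assumptions for illustration only; the CleanerAgent's real implementation is not shown in this README.

```python
import statistics

# Strategy names follow the config values above; the mapping itself is an assumption.
IMPUTERS = {
    "median": statistics.median,
    "most_frequent": statistics.mode,
}


def impute(values, strategy):
    """Replace None entries with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = IMPUTERS[strategy](observed)
    return [fill if v is None else v for v in values]


print(impute([1, None, 3, 10], "median"))              # [1, 3, 3, 10]
print(impute(["a", "b", None, "b"], "most_frequent"))  # ['a', 'b', 'b', 'b']
```

A column with no observed values would raise statistics.StatisticsError here, so a real implementation would need to decide how to handle fully empty columns.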

🔧 Troubleshooting

Ollama Connection Error

Failed to connect to Ollama. Please ensure Ollama is running at http://localhost:11434

Solution: Start Ollama

  • Windows: Open Ollama app from Start menu
  • Mac: Open Ollama from Applications
  • Linux: Run ollama serve

Model Not Found

Model 'llama3.1:8b' not found. Please pull it using: ollama pull llama3.1:8b

Solution: Pull the model

ollama pull llama3.1:8b

Import Errors

Import "pandas" could not be resolved

Solution: Install dependencies

pip install -r requirements.txt

🆚 Original vs. New Architecture

| Aspect        | Original (agentic_preprocessor.py) | New (Modular)               |
|---------------|------------------------------------|-----------------------------|
| Structure     | Single script                      | Multi-agent classes         |
| LLM           | OpenAI GPT (cloud)                 | Ollama (local)              |
| Modularity    | Monolithic                         | Highly modular              |
| Agents        | 1 (combined)                       | 3 (specialized)             |
| Reports       | 1 text file                        | JSON reports from each agent |
| Reusability   | Limited                            | High (import classes)       |
| Extensibility | Difficult                          | Easy (add new agents)       |

📚 Class Documentation

OllamaClient

class OllamaClient:
    """Wrapper for Ollama API interactions"""

    # Interface summary only; method bodies are omitted.
    def __init__(self, model_name, base_url, temperature): ...
    def generate(self, prompt, system_prompt, temperature) -> str: ...
    def generate_json(self, prompt, system_prompt) -> dict: ...
    def parse_json_response(self, response) -> dict: ...
    def health_check(self) -> bool: ...
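Local models often wrap JSON in extra text, so a method like parse_json_response usually needs to be forgiving. The following standalone function is a hypothetical sketch of that idea, not the project's code. It assumes the response may contain a markdown code fence and, for reasoning models such as deepseek-r1, a <think>...</think> block before the JSON object.

```python
import json
import re


def parse_json_response(response: str) -> dict:
    """Best-effort extraction of a JSON object from raw LLM text."""
    # Drop <think>...</think> blocks that reasoning models may emit (assumption).
    text = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    # Keep everything from the first '{' to the last '}' to skip fences and chatter.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in response")
    return json.loads(text[start : end + 1])


fence = "`" * 3  # built programmatically to avoid a literal fence in this example
raw = f'<think>plan</think>Here you go:\n{fence}json\n{{"steps": ["deduplicate"]}}\n{fence}'
print(parse_json_response(raw))  # {'steps': ['deduplicate']}
```

This approach fails if the model emits several separate JSON objects or prose containing braces after the object, which is one reason to keep the temperature low and the system prompt strict.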

PreprocessorAgent

class PreprocessorAgent:
    """Schema analysis and preprocessing decision agent"""

    # Interface summary only; method bodies are omitted.
    def __init__(self, output_dir, model_name, llm_client): ...
    def analyze_schema(self, df) -> dict: ...
    def decide_steps(self, schema) -> dict: ...
    def save_reports(self) -> dict: ...
    def run(self, df) -> dict: ...

CleanerAgent

class CleanerAgent:
    """Data cleaning operations agent (no LLM)"""

    # Interface summary only; method bodies are omitted.
    def __init__(self, output_dir): ...
    def apply_dtype_conversion(self, df) -> tuple: ...
    def apply_deduplication(self, df) -> tuple: ...
    def apply_missing_data_imputation(self, df, target) -> tuple: ...
    def apply_outlier_capping(self, df, target) -> tuple: ...
    def apply_categorical_encoding(self, df, target) -> tuple: ...
    def clean(self, df, decision_report, target) -> tuple: ...
    def save_outputs(self, cleaned_df, report, filename) -> dict: ...
    def run(self, df, decision_report_path, target) -> dict: ...
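One way a logic-only cleaner can turn recommended step names into actions is a dispatch table. The sketch below uses the step names listed under Preprocessing Steps and works on plain lists of dicts instead of DataFrames. The execution order, the handler registry, and the report shape are all assumptions made for illustration.

```python
# Fixed execution order (an assumption, mirroring the method order listed above).
STEP_ORDER = [
    "dtype_conversion",
    "deduplicate",
    "missing_data",
    "outliers",
    "categorical_encoding",
]


def deduplicate(rows):
    """Remove exact duplicate rows while preserving first-seen order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Only one handler is implemented here; others would be registered the same way.
HANDLERS = {"deduplicate": deduplicate}


def clean(rows, recommended_steps):
    """Apply recommended steps in STEP_ORDER and log applied and skipped steps."""
    applied, skipped = [], []
    for step in STEP_ORDER:
        if step not in recommended_steps:
            continue
        handler = HANDLERS.get(step)
        if handler is None:
            skipped.append(step)
            continue
        rows = handler(rows)
        applied.append(step)
    return rows, {"applied": applied, "skipped": skipped}


cleaned, report = clean([{"a": 1}, {"a": 1}, {"a": 2}], ["outliers", "deduplicate"])
print(cleaned, report)
```

Running steps in a fixed order, regardless of the order the LLM lists them in, keeps the cleaning deterministic for a given decision report.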

EvaluatorAgent

from pathlib import Path

class EvaluatorAgent:
    """Quality evaluation and validation agent"""

    # Interface summary only; method bodies are omitted.
    def __init__(self, output_dir, model_name, llm_client): ...
    def load_reports(self, preprocessor_dir, cleaner_dir) -> dict: ...
    def compare_metrics(self, original_df, cleaned_df) -> dict: ...
    def validate_steps(self, decision_report, cleaning_report) -> dict: ...
    def generate_evaluation(self, comparison, validation, reports) -> dict: ...
    def save_report(self, evaluation) -> Path: ...
    def run(self, original_df, cleaned_df, preprocessor_dir, cleaner_dir) -> dict: ...
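Step validation can be thought of as a set comparison between what was recommended and what was applied. The function below is a hypothetical sketch of that check. The report keys recommended_steps and applied_steps are invented for the example, since the JSON schema of the reports is not documented here.

```python
def validate_steps(decision_report: dict, cleaning_report: dict) -> dict:
    """Compare recommended steps against the steps the cleaner logged."""
    # Both key names are hypothetical; the real report schema is not shown here.
    recommended = set(decision_report.get("recommended_steps", []))
    applied = set(cleaning_report.get("applied_steps", []))
    return {
        "missing": sorted(recommended - applied),     # recommended but not applied
        "unexpected": sorted(applied - recommended),  # applied but not recommended
        "all_applied": recommended <= applied,
    }


result = validate_steps(
    {"recommended_steps": ["deduplicate", "outliers"]},
    {"applied_steps": ["deduplicate"]},
)
print(result)
# {'missing': ['outliers'], 'unexpected': [], 'all_applied': False}
```

A deterministic check like this can run before the LLM-based quality assessment, so the evaluator prompt receives facts instead of having to infer them.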

🀝 Contributing

Contributions are welcome! Areas for improvement:

  • Add more preprocessing strategies
  • Support additional LLM backends (Anthropic, Groq, etc.)
  • Add visualization agent for data quality reports
  • Implement parallel processing for large datasets

📝 License

MIT License - feel free to use for learning, portfolios, or production!

💡 Use Cases

  • 📈 Data Science Projects: Automate preprocessing for ML pipelines
  • 🎓 Learning: Study modular agent architecture
  • 💼 Portfolios: Demonstrate Agentic AI expertise
  • 🏢 Production: Adapt for enterprise data quality workflows
