A modular, class-based data preprocessing pipeline powered by local LLM agents. This project demonstrates modern Agentic AI architecture with three specialized agents working sequentially to analyze, clean, and evaluate datasets.
- Interactive Web Interface: Streamlit dashboard for easy dataset upload and visualization
- Multi-Agent Architecture: Three specialized agents with clear responsibilities
- Local LLM: Uses Ollama (llama3.1:8b, deepseek-r1, etc.) - no API keys needed
- Modular & Extensible: Clean class-based design for easy customization
- Comprehensive Reporting: Each agent generates detailed JSON reports
- Timestamped Outputs: Organized output folders for each pipeline run
- Backward Compatible: Original script (agentic_preprocessor.py) preserved
┌───────────────────────────────────────────────────────────┐
│                   PIPELINE ORCHESTRATOR                   │
│                         (main.py)                         │
└────────────────────────┬──────────────────────────────────┘
                         │
         ┌───────────────┼───────────────┐
         ▼               ▼               ▼
 ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
 │ Preprocessor │ │   Cleaner    │ │  Evaluator   │
 │    Agent     │ │    Agent     │ │    Agent     │
 │ (LLM-based)  │ │ (Logic-based)│ │ (LLM-based)  │
 └──────────────┘ └──────────────┘ └──────────────┘
- PreprocessorAgent
  - Analyzes dataset schema
  - Uses Ollama LLM to decide preprocessing steps
  - Outputs: schema_summary.json, decision_report.json
- CleanerAgent
  - Reads preprocessing decisions
  - Applies recommended cleaning steps (pure logic, no LLM)
  - Outputs: cleaned_report.json, <filename>_cleaned.csv
- EvaluatorAgent
  - Compares before/after metrics
  - Validates applied steps
  - Uses Ollama LLM for quality assessment
  - Outputs: evaluation_report.json (final report)
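The hand-off between the three agents can be sketched without an LLM or any project code. The example below is a minimal illustration under stated assumptions, not the project's implementation: `stub_llm`, the three `*_stage` functions, and the report keys are hypothetical names, rows are plain dictionaries rather than DataFrames, and the evaluator only compares metrics instead of calling a model.

```python
import json
from statistics import median


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for the Ollama call; always returns the same decision."""
    return json.dumps({"steps": ["deduplicate", "missing_data"]})


def preprocessor_stage(rows: list[dict]) -> dict:
    """Stage 1: summarize the data, then let the (stubbed) LLM choose the steps."""
    schema = {
        "n_rows": len(rows),
        "missing": {col: sum(r[col] is None for r in rows) for col in rows[0]},
    }
    decision = json.loads(stub_llm(json.dumps(schema)))
    return {"schema_summary": schema, "decision_report": decision}


def cleaner_stage(rows: list[dict], decision: dict, target_col: str) -> tuple[list[dict], list[str]]:
    """Stage 2: apply the requested steps with plain logic (no LLM)."""
    cleaned = [dict(r) for r in rows]  # copy so the original rows stay untouched
    applied = []
    if "deduplicate" in decision["steps"]:
        # Dict keys keep first-seen order, so this drops later duplicates.
        cleaned = list({tuple(r.items()): r for r in cleaned}.values())
        applied.append("deduplicate")
    if "missing_data" in decision["steps"]:
        for col in cleaned[0]:
            present = [r[col] for r in cleaned if r[col] is not None]
            if col == target_col or len(present) == len(cleaned):
                continue
            fill = median(present)  # this sketch handles numeric columns only
            for r in cleaned:
                if r[col] is None:
                    r[col] = fill
        applied.append("missing_data")
    return cleaned, applied


def count_missing(rows: list[dict]) -> int:
    return sum(value is None for r in rows for value in r.values())


def evaluator_stage(original: list[dict], cleaned: list[dict], decision: dict, applied: list[str]) -> dict:
    """Stage 3: compare before/after metrics and check that every requested step ran."""
    return {
        "rows_before": len(original),
        "rows_after": len(cleaned),
        "missing_before": count_missing(original),
        "missing_after": count_missing(cleaned),
        "all_steps_applied": set(decision["steps"]) == set(applied),
    }


rows = [
    {"age": 30, "label": 1},
    {"age": None, "label": 0},
    {"age": 30, "label": 1},  # exact duplicate of the first row
    {"age": 41, "label": 0},
]
stage1 = preprocessor_stage(rows)
cleaned, applied = cleaner_stage(rows, stage1["decision_report"], target_col="label")
report = evaluator_stage(rows, cleaned, stage1["decision_report"], applied)
print(json.dumps(report, indent=2))
```

Each stage consumes only the previous stage's output, which is what lets the middle stage stay LLM-free.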
agentic-preprocessor-demo/
│
├── src/
│   ├── main.py                    # Pipeline orchestrator (NEW)
│   ├── config.py                  # Configuration & prompts (NEW)
│   │
│   ├── agents/                    # Agent modules
│   │   ├── preprocessor_agent.py  # Schema analysis + LLM decisions
│   │   ├── cleaner_agent.py       # Data cleaning operations
│   │   └── evaluator_agent.py     # Quality assessment
│   │
│   ├── models/                    # LLM integration
│   │   └── ollama_client.py       # Ollama API wrapper
│   │
│   └── utils/                     # Utility modules
│       ├── data_loader.py         # CSV loading
│       ├── schema_analyzer.py     # Schema extraction
│       ├── report_generator.py    # Report formatting
│       └── file_manager.py        # File operations
│
├── data/
│   └── sample_data.csv
│
├── output/                        # Timestamped run folders
│   └── <dataset>_<timestamp>/
│       ├── 1_preprocessor/
│       ├── 2_cleaner/
│       └── 3_evaluator/
│
├── requirements.txt
└── README.md
Install Ollama: https://ollama.ai
# Pull a model (choose one or both)
ollama pull llama3.1:8b
ollama pull deepseek-r1
# Verify Ollama is running
ollama list
git clone
cd agentic_data_preprocessor
python -m venv venv
# Windows
venv\Scripts\activate
# Mac/Linux
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Edit src/config.py to customize:
# Model selection
PREPROCESSOR_MODEL = "llama3.1:8b" # or "deepseek-r1"
EVALUATOR_MODEL = "llama3.1:8b"
# Ollama connection (default is fine for local)
OLLAMA_BASE_URL = "http://localhost:11434"
# You can also customize system prompts here
cd src
streamlit run streamlit_app.py
Then upload your CSV file through the web interface and click "Run Pipeline".
cd src
python main.py --input ../data/sample_data.csv --target label
The easiest way to use the pipeline is through the interactive Streamlit web interface:
cd src
streamlit run streamlit_app.py
This opens an interactive dashboard in your web browser where you can:
- Upload CSV Files: Drag and drop your dataset
- Configure LLM Settings: Choose between Ollama and OpenAI/OpenRouter providers
- Adjust Parameters: Customize preprocessing settings (IQR multiplier, imputation strategies, etc.)
- Select Target Column: Specify your target variable
- Run Pipeline: Execute the full preprocessing pipeline with one click
- View Results: Interactive visualizations and reports
  - Original vs. Cleaned data comparison
  - Missing data analysis
  - Quality metrics
  - Agent decisions and reports
- Download Results: Export cleaned CSV and JSON reports
Features:
- Real-time progress updates during pipeline execution
- Side-by-side data comparison
- Interactive Plotly charts
- Download cleaned datasets and reports
- Support for multiple LLM providers (Ollama, OpenAI, OpenRouter)
# Basic usage
python main.py --input ../data/mydata.csv
# Specify target column
python main.py --input ../data/mydata.csv --target label
# Use different models for different agents
python main.py --input ../data/mydata.csv \
--preprocessor-model llama3.1:8b \
--evaluator-model deepseek-r1
# Get help
python main.py --help
from main import Pipeline
# Create and execute pipeline
pipeline = Pipeline(
input_csv="../data/sample_data.csv",
target_col="label",
preprocessor_model="llama3.1:8b",
evaluator_model="llama3.1:8b"
)
results = pipeline.execute()
# Access results
print(f"Quality Score: {results['evaluator']['quality_score']}/100")
print(f"Cleaned CSV: {results['cleaner']['cleaned_csv_path']}")
print(f"Output Directory: {results['output_directory']}")
from agents import PreprocessorAgent, CleanerAgent, EvaluatorAgent
from utils.data_loader import load_csv
from pathlib import Path
# Load data
df = load_csv("../data/mydata.csv")
output_dir = Path("../output/my_run")
# 1. Use PreprocessorAgent
preprocessor = PreprocessorAgent(output_dir, model_name="llama3.1:8b")
prep_results = preprocessor.run(df)
# 2. Use CleanerAgent
cleaner = CleanerAgent(output_dir)
clean_results = cleaner.run(
df=df,
decision_report_path=prep_results["output_paths"]["decision_report"],
target_col="label"
)
# 3. Use EvaluatorAgent
evaluator = EvaluatorAgent(output_dir, model_name="llama3.1:8b")
eval_results = evaluator.run(
original_df=df,
cleaned_df=clean_results["cleaned_df"],
preprocessor_dir=prep_results["output_dir"],
cleaner_dir=clean_results["output_dir"]
)
Each pipeline run creates a timestamped folder:
output/sample_data_2025-10-06_14-30-45/
│
├── 1_preprocessor/
│   ├── schema_summary.json      # Dataset analysis
│   └── decision_report.json     # LLM recommendations
│
├── 2_cleaner/
│   ├── cleaned_report.json      # Cleaning operations log
│   └── sample_data_cleaned.csv  # Cleaned data
│
└── 3_evaluator/
    └── evaluation_report.json   # Final quality assessment
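A run folder with this layout could be created along the following lines. This is a sketch only: `make_run_dir` and `STAGE_DIRS` are hypothetical names, and the timestamp format is inferred from the example folder name above rather than taken from the project's `file_manager.py`.

```python
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional

# Stage subfolder names as they appear in the layout above.
STAGE_DIRS = ("1_preprocessor", "2_cleaner", "3_evaluator")


def make_run_dir(input_csv: str, output_root: Path, now: Optional[datetime] = None) -> Path:
    """Create <output_root>/<dataset>_<timestamp>/ with one subfolder per agent."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
    run_dir = Path(output_root) / f"{Path(input_csv).stem}_{stamp}"
    for stage in STAGE_DIRS:
        (run_dir / stage).mkdir(parents=True, exist_ok=True)
    return run_dir


# Demo in a throwaway directory, with a fixed clock so the name is reproducible.
output_root = Path(tempfile.mkdtemp())
run_dir = make_run_dir(
    "../data/sample_data.csv",
    output_root,
    now=datetime(2025, 10, 6, 14, 30, 45),
)
print(run_dir.name)
print(sorted(p.name for p in run_dir.iterdir()))
```

Using hyphens instead of colons in the time part keeps the folder name valid on Windows as well as on Mac/Linux.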
The pipeline can apply these steps (decided by PreprocessorAgent):
- missing_data: Impute missing values (numeric → median, categorical → mode)
- dtype_conversion: Convert data types (parse dates, categorize low-cardinality columns)
- deduplicate: Remove duplicate rows
- outliers: Cap outliers using the IQR method
- categorical_encoding: One-hot encode categorical columns
All settings are in src/config.py:
# Model Configuration
PREPROCESSOR_MODEL = "llama3.1:8b"
EVALUATOR_MODEL = "llama3.1:8b"
OLLAMA_BASE_URL = "http://localhost:11434"
TEMPERATURE = 0.2
# Preprocessing Parameters
IQR_MULTIPLIER = 1.5
CATEGORY_THRESHOLD = 50
NUMERIC_IMPUTATION_STRATEGY = "median"
CATEGORICAL_IMPUTATION_STRATEGY = "most_frequent"
# Paths
DATA_DIR = "../data"
OUTPUT_DIR = "../output"
DEFAULT_TARGET_COL = "label"
# System prompts for agents (fully customizable)
PREPROCESSOR_SYSTEM_PROMPT = """..."""
EVALUATOR_SYSTEM_PROMPT = """..."""
Failed to connect to Ollama. Please ensure Ollama is running at http://localhost:11434
Solution: Start Ollama
- Windows: Open Ollama app from Start menu
- Mac: Open Ollama from Applications
- Linux: Run ollama serve
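To check connectivity from Python before running the pipeline, a small probe like the one below may help. It is a sketch using only the standard library and assumes Ollama's `/api/tags` endpoint, which lists locally pulled models; `ollama_is_up` and `local_models` are hypothetical helper names, not the project's `OllamaClient.health_check`.

```python
import json
import urllib.error
import urllib.request

DEFAULT_BASE_URL = "http://localhost:11434"  # same default as in src/config.py


def ollama_is_up(base_url: str = DEFAULT_BASE_URL, timeout: float = 2.0) -> bool:
    """Return True if the server answers on /api/tags, False on any connection error."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def local_models(base_url: str = DEFAULT_BASE_URL, timeout: float = 2.0) -> list[str]:
    """Return the names of locally available models (assumes a 'models' list in the reply)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]


if ollama_is_up():
    print("Ollama is reachable. Local models:", local_models())
else:
    print("Ollama is not reachable at", DEFAULT_BASE_URL)
```

If the probe reports the server as reachable but the model you configured is missing from the list, pulling it with `ollama pull` is the likely fix.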
Model 'llama3.1:8b' not found. Please pull it using: ollama pull llama3.1:8b
Solution: Pull the model
ollama pull llama3.1:8b
Import "pandas" could not be resolved
Solution: Install dependencies
pip install -r requirements.txt

| Aspect | Original (agentic_preprocessor.py) | New (Modular) |
|---|---|---|
| Structure | Single script | Multi-agent classes |
| LLM | OpenAI GPT (cloud) | Ollama (local) |
| Modularity | Monolithic | Highly modular |
| Agents | 1 (combined) | 3 (specialized) |
| Reports | 1 text file | JSON reports from each of the 3 agents |
| Reusability | Limited | High (import classes) |
| Extensibility | Difficult | Easy (add new agents) |
from pathlib import Path

class OllamaClient:
    """Wrapper for Ollama API interactions"""
    def __init__(self, model_name, base_url, temperature): ...
    def generate(self, prompt, system_prompt, temperature) -> str: ...
    def generate_json(self, prompt, system_prompt) -> dict: ...
    def parse_json_response(self, response) -> dict: ...
    def health_check(self) -> bool: ...

class PreprocessorAgent:
    """Schema analysis and preprocessing decision agent"""
    def __init__(self, output_dir, model_name, llm_client): ...
    def analyze_schema(self, df) -> dict: ...
    def decide_steps(self, schema) -> dict: ...
    def save_reports(self) -> dict: ...
    def run(self, df) -> dict: ...

class CleanerAgent:
    """Data cleaning operations agent (no LLM)"""
    def __init__(self, output_dir): ...
    def apply_dtype_conversion(self, df) -> tuple: ...
    def apply_deduplication(self, df) -> tuple: ...
    def apply_missing_data_imputation(self, df, target) -> tuple: ...
    def apply_outlier_capping(self, df, target) -> tuple: ...
    def apply_categorical_encoding(self, df, target) -> tuple: ...
    def clean(self, df, decision_report, target) -> tuple: ...
    def save_outputs(self, cleaned_df, report, filename) -> dict: ...
    def run(self, df, decision_report_path, target) -> dict: ...

class EvaluatorAgent:
    """Quality evaluation and validation agent"""
    def __init__(self, output_dir, model_name, llm_client): ...
    def load_reports(self, preprocessor_dir, cleaner_dir) -> dict: ...
    def compare_metrics(self, original_df, cleaned_df) -> dict: ...
    def validate_steps(self, decision_report, cleaning_report) -> dict: ...
    def generate_evaluation(self, comparison, validation, reports) -> dict: ...
    def save_report(self, evaluation) -> Path: ...
    def run(self, original_df, cleaned_df, preprocessor_dir, cleaner_dir) -> dict: ...

Contributions are welcome! Areas for improvement:
- Add more preprocessing strategies
- Support additional LLM backends (Anthropic, Groq, etc.)
- Add visualization agent for data quality reports
- Implement parallel processing for large datasets
MIT License - feel free to use for learning, portfolios, or production!
- Data Science Projects: Automate preprocessing for ML pipelines
- Learning: Study modular agent architecture
- Portfolios: Demonstrate Agentic AI expertise
- Production: Adapt for enterprise data quality workflows