mpraes/data_scout

DataScout

A "one-click" data diagnostic tool that ingests raw datasets, runs automated cleaning, computes KPIs, and uses an LLM to turn cold metrics into a concise business narrative.

Architecture

data_scout/
├── backend/          # FastAPI application
│   ├── main.py       # API routes + static file serving
│   └── services/
│       ├── loader.py       # CSV / XLSX / Parquet ingestion
│       ├── cleaner.py      # Deduplication, imputation, outlier detection
│       ├── profiler.py     # Metadata extraction & column selection
│       ├── stats.py        # Deep statistical analysis
│       ├── kpis.py         # Volume, efficiency, trend & extra KPIs
│       ├── charts.py       # Chart payloads (time series, Pareto, distribution, …)
│       ├── analyzer.py     # Orchestrates the full analysis pipeline
│       └── storytelling.py # LLM-powered narrative generation (Groq)
└── frontend/         # Static HTML/CSS/JS client
    └── index.html
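
As a rough illustration of the first pipeline stage, the sketch below maps an uploaded filename to one of the three formats that loader.py is described as handling. The function name detect_format and the format labels are hypothetical and are not taken from the repository; the real loader may dispatch differently.

```python
from pathlib import Path

# Extensions the README lists for ingestion. The labels on the right are
# illustrative only, not identifiers from the DataScout codebase.
SUPPORTED_FORMATS = {".csv": "csv", ".xlsx": "excel", ".parquet": "parquet"}


def detect_format(filename: str) -> str:
    """Return a format label for a supported file, or raise ValueError."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    try:
        return SUPPORTED_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
```

Rejecting unknown extensions before parsing keeps the later stages (cleaning, profiling, KPIs) from receiving input they cannot handle.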

API

Method  Endpoint      Description
GET     /api/health   Health check
POST    /api/analyze  Upload a file and get full analysis
GET     /app          Serves the frontend

POST /api/analyze accepts multipart/form-data:

  • file — dataset file (.csv, .xlsx, .parquet)
  • use_llm — boolean, whether to call the LLM for storytelling (default: true)

Response includes: cleaning summary, cleaning log, metadata, KPIs, chart payloads, and a story block.
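
A minimal client sketch for the upload endpoint, using only the Python standard library. It assumes the server runs at the base URL you pass in, and that the form fields are named file and use_llm as listed above. The helper name build_analyze_request is hypothetical, and the shape of the response body is not specified here, so the sketch stops at building the request.

```python
import urllib.request
import uuid


def build_analyze_request(base_url: str, filename: str, payload: bytes,
                          use_llm: bool = True) -> urllib.request.Request:
    """Build a multipart/form-data POST for /api/analyze (hypothetical helper)."""
    boundary = uuid.uuid4().hex
    flag = "true" if use_llm else "false"
    # First part: the use_llm form field. Second part: the dataset file.
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="use_llm"\r\n\r\n'
        f"{flag}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/analyze",
        data=head + payload + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

To send it, pass the returned object to urllib.request.urlopen and read the response. Setting use_llm to False should skip the LLM call, which is useful when no API key is configured.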

Setup

Requirements: Python ≥ 3.12, uv

uv sync

Set environment variables (optional — only needed for LLM storytelling):

export GROQ_API_KEY=your_key_here
export GROQ_MODEL=llama-3.3-70b-versatile  # optional override

Running

uv run uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000

  • Frontend: http://localhost:8000/app
  • API docs: http://localhost:8000/docs

Features

  • Multi-format ingestion — CSV, Excel (.xlsx), and Parquet
  • Automated cleaning — duplicate removal, missing value imputation, outlier detection
  • KPI triangulation — Volume, Efficiency, Trend + extra context KPIs
  • Six chart types — time series, Pareto, distribution, missing values heatmap, boxplot, correlation heatmap
  • LLM storytelling — headline, observations, and recommendations via Groq (falls back to a template when no API key is set)
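
To make the automated cleaning step concrete, here is a small standard-library sketch of the three operations named above. The specific methods are assumptions: this README does not say which imputation or outlier rule cleaner.py uses, so the sketch picks median imputation and the common 1.5 × IQR rule. The function names are hypothetical.

```python
from statistics import median, quantiles


def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def clean_numeric_column(values):
    """Fill missing values (None) with the median, then flag IQR outliers.

    Returns the imputed column and the indices of flagged values.
    Assumes at least two non-missing values are present.
    """
    present = [v for v in values if v is not None]
    fill = median(present)
    imputed = [fill if v is None else v for v in values]
    q1, _, q3 = quantiles(imputed, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [i for i, v in enumerate(imputed) if v < low or v > high]
    return imputed, outliers
```

Outliers are flagged rather than dropped here, which matches the wording "outlier detection"; whether the tool removes or only reports them is not stated.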
