Food Establishments Inspection Analysis
Analyzed 1.9M+ food safety inspection records across Chicago and Dallas to identify violation patterns, risk factors, and operational insights for public health departments.
Problem Statement: Public health departments struggle to prioritize inspections and allocate resources effectively due to:
- High volume of establishments requiring routine inspections
- Inconsistent violation tracking across jurisdictions
- Lack of visibility into recurring violation patterns
- Manual processes for risk assessment and resource allocation
Business Impact: Inefficient inspection scheduling leads to delayed identification of high-risk establishments, putting public health at risk and wasting limited health department resources.
Proposed Solution: A dimensional data warehouse integrating food inspection data from multiple cities, enabling:
- Standardized violation tracking and analysis
- Risk-based inspection prioritization
- Comparative analysis across jurisdictions
- Automated reporting for health department leadership
Key Findings:
Chicago (267,866 inspections analyzed):
- 27.4% of records missing violation data - data quality issue requiring investigation
- 196,048 inspection records involve high-risk establishments (Risk Categories 1 & 2) requiring priority inspections
- 73,445 violations documented across 166K+ unique violation types
- Top violation categories: Food temperature control, sanitation, pest management
Dallas (78,400 inspections analyzed):
- 72.2% missing data across violation fields - systematic data collection issue
- 42 duplicate inspection records identified and removed
- 25 violation categories consolidated into standardized format
- Average inspection score: 85/100 (passing threshold: 70)
Cross-City Insights:
- Violation patterns differ significantly between cities (coding standards inconsistent)
- Chicago has more granular violation tracking (better data quality)
- Dallas uses numeric scoring system (more quantitative assessment)
- Need for standardized national food safety violation taxonomy
Dataset Overview:
Chicago Food Inspections:
- Source: City of Chicago Open Data Portal
- Records: 267,866 inspections (2010-2024)
- Establishments: 32,165 unique food businesses
- Variables: 17 columns (Inspection ID, License, DBA Name, Address, Risk, Results, Violations, etc.)
Dallas Restaurant Inspections:
- Source: Dallas Open Data Portal
- Records: 78,400 inspections (2019-2024)
- Establishments: 9,101 unique restaurants
- Variables: 114 columns (25 violation slots, scores, addresses, etc.)
Data Quality Challenges:
- Missing values (1.8% of all cells in Chicago, 72.2% in Dallas)
- Inconsistent violation formats (free text vs structured codes)
- Duplicate records (Dallas: 42 duplicates)
- Non-standardized risk classifications
Tech Stack:
- ETL: Talend Open Studio (integration), Python (pandas)
- Database: Microsoft SQL Server
- Data Modeling: ER/Studio (dimensional design)
- Visualization: Tableau
- Data Profiling: ydata-profiling, Alteryx
Repository Structure:
Food-Establishments-Inspection/
├── README.md
├── Datasets/
│   ├── Raw/
│   │   ├── Chicago.tsv                 # Chicago food inspections (268K records)
│   │   └── Dallas.tsv                  # Dallas restaurant inspections (78K records)
│   └── Processed/
│       ├── Chicago - Cleaned.csv       # Normalized violation data
│       └── Dallas - Cleaned.csv        # Standardized format
├── Python/
│   └── Staged.py                       # Python cleaning script
├── ETL/
│   ├── Chicago/                        # Chicago ETL workflows
│   ├── Dallas/                         # Dallas ETL workflows
│   ├── Dimension/                      # Dimension table workflows
│   ├── Facts/                          # Fact table workflows
│   └── ETL_Workflow.docx               # Talend workflow documentation
├── Profiling/
│   ├── Chicago_Profiling.yxmd          # Alteryx data profiling workflow
│   ├── Dallas_Profiling.yxmd           # Alteryx data profiling workflow
│   └── Profiling_Workflows.docx        # Profiling process documentation
├── Model/
│   ├── Dimensional_Model.png           # Star schema diagram
│   └── Model_Script.docx               # DDL documentation
└── Viz/
    ├── Food Establishments Inspections # Tableau document
    ├── Master Script.py                # Python script for the combined dataset
    └── Master Load .csv                # Combined dataset
Database Schema: Star Schema Design
Fact Tables (2):
- Fact_Inspection - Core inspection events with scores and results
- Fact_Violation - Individual violations linked to inspections
Dimension Tables (7):
- Dim_Business_License - Establishment details (name, address, license)
- Dim_Date - Date hierarchy (year, quarter, month, week)
- Dim_Facility_Type - Restaurant, grocery, school cafeteria, etc.
- Dim_Inspection_Type - Routine, complaint-driven, license, reinspection
- Dim_Risk_Category - Risk levels (High, Medium, Low)
- Dim_Food_Inspections_Results - Pass, Fail, Pass with Conditions
- Dim_Violation - Violation codes, descriptions, comments
Key Relationships:
Dim_Business_License ─────────┐
Dim_Date ─────────────────────┤
Dim_Facility_Type ────────────┤
Dim_Inspection_Type ──────────┼──> Fact_Inspection ──> Fact_Violation
Dim_Risk_Category ────────────┤                              │
Dim_Food_Inspections_Results ─┘                              │
Dim_Violation ───────────────────────────────────────────────┘
Total Records in Warehouse:
- 346,266 inspections
- 1.2M+ violation records (after normalization)
- 40K+ unique establishments
Technical Implementation:
ETL Pipeline:
Phase 1: Data Extraction:
- Downloaded TSV files from city open data portals
- Initial profiling with ydata-profiling and Alteryx
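A minimal sketch of the profiling step with ydata-profiling, assuming the raw file path from the repository tree (the Alteryx workflows in Profiling/ remain the primary artifacts):

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load the raw Chicago extract (tab-separated, per the repository tree).
chicago = pd.read_csv("Datasets/Raw/Chicago.tsv", sep="\t")

# One HTML report per source: column types, missing-value counts,
# duplicate rows, and distributions feed the Phase 2 assessment below.
report = ProfileReport(chicago, title="Chicago Food Inspections", minimal=True)
report.to_file("chicago_profile.html")
```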
Phase 2: Data Quality Assessment. Profiling identified the following issues:
- Missing violation data (27% Chicago, 72% Dallas)
- Inconsistent violation formats (text blocks vs structured fields)
- Duplicate records
- Invalid date formats
- Truncated text fields
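These checks are straightforward to reproduce in pandas; a sketch against the Dallas extract, where the `Inspection Date` column name is an assumption:

```python
import pandas as pd

dallas = pd.read_csv("Datasets/Raw/Dallas.tsv", sep="\t")

# Fully duplicated rows: this kind of scan surfaced the 42 Dallas duplicates.
duplicate_rows = dallas.duplicated().sum()

# Missing-value share per column, highest first; exposes the sparse slots.
missing_share = dallas.isna().mean().sort_values(ascending=False)

# Unparseable dates are flagged rather than silently coerced away.
parsed = pd.to_datetime(dallas["Inspection Date"], errors="coerce")
invalid_dates = dallas[parsed.isna() & dallas["Inspection Date"].notna()]

print(duplicate_rows, missing_share.head(10), len(invalid_dates), sep="\n")
```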
Phase 3: Data Transformation:
Chicago Processing:
- Normalized violations column (split pipe-delimited text into separate rows)
- Extracted violation code, description, and comment using regex (see the sketch after this list)
- Added data integration (DI) metadata columns (Process ID, Current Date, Workflow name)
- Calculated risk scores and violation points
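A condensed sketch of the Chicago normalization, assuming each entry in the `Violations` column follows the published `<code>. <description> - Comments: <text>` pattern with pipe separators:

```python
import pandas as pd

chicago = pd.read_csv("Datasets/Raw/Chicago.tsv", sep="\t")

# One violation per row: split the pipe-delimited blob and explode.
exploded = (
    chicago.assign(Violations=chicago["Violations"].str.split("|"))
           .explode("Violations")
           .dropna(subset=["Violations"])
)

# Named regex groups become columns: code, description, inspector comment.
pattern = (
    r"^\s*(?P<violation_code>\d+)\.\s*(?P<violation_description>.*?)"
    r"(?:\s*-\s*Comments:\s*(?P<violation_comment>.*))?\s*$"
)
parts = exploded["Violations"].str.extract(pattern)
chicago_violations = pd.concat([exploded.drop(columns="Violations"), parts], axis=1)
```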
Dallas Processing:
- Consolidated 25 violation columns into a single normalized structure (see the sketch after this list)
- Mapped violation points to standardized scale
- Standardized facility types
- Created default risk categories
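A rough sketch of the Dallas unpivot; every column name here (slot headers included) is a placeholder for the real 114-column layout:

```python
import pandas as pd

dallas = pd.read_csv("Datasets/Raw/Dallas.tsv", sep="\t")

# Stack the 25 repeating violation slots into one long table.
frames = []
for i in range(1, 26):
    slot = dallas[
        ["Restaurant Name", "Inspection Date",
         f"Violation Description - {i}", f"Violation Points - {i}"]
    ].copy()
    slot.columns = ["name", "inspection_date", "description", "points"]
    slot["slot"] = i
    frames.append(slot)

# Empty slots drop out; one row per observed violation remains.
dallas_violations = pd.concat(frames, ignore_index=True).dropna(subset=["description"])
```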
Phase 4: Data Loading:
- Loaded staged data into SQL Server
- Populated dimensional model using Talend
- Applied surrogate keys and SCD Type 1 logic (see the sketch after this list)
- Validated referential integrity
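The Talend jobs are the system of record for the load; as an illustration of the pandas alternative mentioned in the setup steps, here is a sketch that builds one dimension with a surrogate key (connection string, table, and column names are assumptions):

```python
import pandas as pd
from sqlalchemy import create_engine

# Connection string is an assumption; adjust server, driver, and auth.
engine = create_engine(
    "mssql+pyodbc://localhost/FoodInspections"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes"
)

staged = pd.read_csv("Datasets/Processed/Chicago - Cleaned.csv")

# Build a small dimension with a surrogate key. SCD Type 1 means new
# attribute values simply overwrite old ones, so a full reload suffices.
dim_facility = staged[["Facility Type"]].drop_duplicates().reset_index(drop=True)
dim_facility.insert(0, "Facility_Type_SK", dim_facility.index + 1)
dim_facility.to_sql("Dim_Facility_Type", engine, if_exists="replace", index=False)
```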
Data Profiling Results:
Chicago Data Quality:
- 17 variables, 267,866 observations
- 84,184 missing cells (1.8%)
- 0 duplicates
- Key issues: Violations column missing in 27% of records
Dallas Data Quality:
- 114 variables, 78,400 observations
- 6.4M missing cells (72.2% - sparse violation matrix)
- 42 duplicate records removed
- Key issues: 99%+ missing in violation detail fields 11-25
SQL Capabilities Demonstrated:
- Dimensional Modeling: Star schema with 7 dimensions, 2 facts
- DDL Design: Foreign keys, constraints, indexes
- Data Validation: Integrity checks, duplicate detection (example after this list)
- ETL Logic: Staging → integration → dimensional model
- Data Quality: Profiling, missing value handling, normalization
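As an example of the validation step, the referential-integrity check can be driven from Python; the surrogate-key column names below are assumptions:

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine(
    "mssql+pyodbc://localhost/FoodInspections?driver=ODBC+Driver+17+for+SQL+Server"
)

# Violations that reference no parent inspection would indicate a broken load.
orphans = pd.read_sql(
    text("""
        SELECT COUNT(*) AS orphan_violations
        FROM Fact_Violation fv
        LEFT JOIN Fact_Inspection fi ON fv.Inspection_SK = fi.Inspection_SK
        WHERE fi.Inspection_SK IS NULL
    """),
    engine,
)
print(orphans)
```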
Visualizations:
Tableau Dashboards:
- Executive Overview - Total inspections, pass/fail rates, trend analysis
- Violation Analysis - Top violations by frequency and severity
- Geographic Distribution - Heat map of high-risk establishments by neighborhood
- Comparative Analysis - Chicago vs Dallas metrics side-by-side
For dashboard previews, see the Viz/ folder
Installation & Reproducibility:
Prerequisites:
- Python 3.8+ with pandas (re and datetime ship with the standard library)
- Microsoft SQL Server
- Tableau Desktop (for .twb file) or Tableau Public
Setup Instructions:
1. Clone the repository:
```
git clone https://github.com/lakshmi14k/Food-Establishments-Inspection.git
```
2. Run the Python cleaning script:
```
cd data/raw
python ../../python/clean_data.py
```
Output: Chicago - Cleaned.csv and Dallas - Cleaned.csv in data/processed/
3. Create the database and load the schema:
```sql
-- Run in SQL Server Management Studio
CREATE DATABASE FoodInspections;
GO
USE FoodInspections;
GO
-- Then execute the DDL script: sql/DDL_Script.sql
```
4. Load data using the ETL tool (Talend):
- Original implementation: Talend Open Studio (workflows in the etl/ folder)
- Alternative: use Python pandas to load the CSVs directly into SQL Server
5. Run validation checks:
```sql
-- Execute sql/Validation_Script.sql
```
6. Open the Tableau dashboards:
- Connect to the SQL Server database, OR
- Use the processed CSV files from data/processed/
Results:
Data Pipeline:
- 346K+ inspection records integrated
- 1.2M+ violation records normalized
- 40K+ unique establishments profiled
- 9 dimensional/fact tables populated
Insights Delivered:
- Identified systematic data quality issues (72% missing Dallas data)
- Documented top 10 violation categories per city
- Mapped high-risk establishment concentrations
- Enabled trend analysis (inspection frequency, pass rates over time)
Business Value:
- Prioritized inspection scheduling based on risk scores
- Identified establishments requiring follow-up (failed inspections)
- Enabled resource allocation optimization
- Provided comparative benchmarking across cities
Key Technical Achievements:
- ETL Pipeline: Multi-source integration (2 cities, different schemas)
- Data Quality: Handled 72% missing data without losing analytical value
- Normalization: Transformed wide format (114 columns) to normalized structure
- Dimensional Modeling: Star schema with proper grain and relationships
- Data Profiling: Comprehensive quality assessment before transformation
- Regex Parsing: Extracted structured data from unstructured text fields
Built with SQL Server, Python, Alteryx, Talend, and Tableau