Skip to content

Analyzed food safety inspection records across Chicago and Dallas to identify violation patterns, risk factors, and operational insights for public health departments.

Notifications You must be signed in to change notification settings

lakshmi14k/Food-Establishments-Inspection

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Food Establishments Inspection Analysis

Analyzed 1.9M+ food safety inspection records across Chicago and Dallas to identify violation patterns, risk factors, and operational insights for public health departments.

Problem Statement: Public health departments struggle to prioritize inspections and allocate resources effectively due to:

  • High volume of establishments requiring routine inspections
  • Inconsistent violation tracking across jurisdictions
  • Lack of visibility into recurring violation patterns
  • Manual processes for risk assessment and resource allocation

Business Impact: Inefficient inspection scheduling leads to delayed identification of high-risk establishments, putting public health at risk and wasting limited health department resources.

Proposed Solution: A dimensional data warehouse integrating food inspection data from multiple cities, enabling:

  • Standardized violation tracking and analysis
  • Risk-based inspection prioritization
  • Comparative analysis across jurisdictions
  • Automated reporting for health department leadership

Key Findings:

Chicago (267,866 inspections analyzed):

  • 27.4% of records missing violation data - data quality issue requiring investigation
  • 196,048 high-risk establishments (Risk Category 1 & 2) requiring priority inspections
  • 73,445 violations documented across 166K+ unique violation types
  • Top violation categories: Food temperature control, sanitation, pest management

Dallas (78,400 inspections analyzed):

  • 72.2% missing data across violation fields - systematic data collection issue
  • 42 duplicate inspection records identified and removed
  • 25 violation categories consolidated into standardized format
  • Average inspection score: 85/100 (passing threshold: 70)

Cross-City Insights:

  • Violation patterns differ significantly between cities (coding standards inconsistent)
  • Chicago has more granular violation tracking (better data quality)
  • Dallas uses numeric scoring system (more quantitative assessment)
  • Need for standardized national food safety violation taxonomy

Dataset Overview:

Chicago Food Inspections:

  • Source: City of Chicago Open Data Portal
  • Records: 267,866 inspections (2010-2024)
  • Establishments: 32,165 unique food businesses
  • Variables: 17 columns (Inspection ID, License, DBA Name, Address, Risk, Results, Violations, etc.)

Dallas Restaurant Inspections:

  • Source: Dallas Open Data Portal
  • Records: 78,400 inspections (2019-2024)
  • Establishments: 9,101 unique restaurants
  • Variables: 114 columns (25 violation slots, scores, addresses, etc.)

Data Quality Challenges:

  • Missing values (1.8% Chicago, 72.2% Dallas)
  • Inconsistent violation formats (free text vs structured codes)
  • Duplicate records (Dallas: 42 duplicates)
  • Non-standardized risk classifications

Tech Stack:

  • ETL: Talend Open Studio (integration), Python (pandas)
  • Database: Microsoft SQL Server
  • Data Modeling: ER/Studio (dimensional design)
  • Visualization: Tableau
  • Data Profiling: ydata-profiling, Alteryx

Project Structure

Food-Establishments-Inspection/
├── README.md
├── Datasets/
│   ├── Raw/
│   │   ├── Chicago.tsv                 # Chicago food inspections (268K records)
│   │   └── Dallas.tsv                  # Dallas restaurant inspections (78K records)
│   └── Processed/
│       ├── Chicago - Cleaned.csv       # Normalized violation data
│       └── Dallas - Cleaned.csv        # Standardized format
├── Python/
│   └── Staged.py                       # Python cleaning script
├── ETL/
│   ├── Chicago/                        # Chicago ETL workflows
│   ├── Dallas/                         # Dallas ETL workflows
│   ├── Dimension/                      # Dimension table workflows
│   ├── Facts/                          # Fact table workflows
│   └── ETL_Workflow.docx               # Talend workflow documentation
├── Profiling/
│   ├── Chicago_Profiling.yxmd          # Alteryx data profiling workflow
│   ├── Dallas_Profiling.yxmd           # Alteryx data profiling workflow
│   └── Profiling_Workflows.docx        # Profiling process documentation
├── Model/
│   ├── Dimensional_Model.png           # Star schema diagram
│   └── Model_Script.docx               # DDL documentation
├── Viz/
│   ├── Food Establishments Inspections # Tableau Document
|   ├── Master Script.py                # Python script for Combined dataset
│   └── Master Load .csv                # Dataset(Combined)
└── 

Database Schema: Star Schema Design

Fact Tables (2):

  • Fact_Inspection - Core inspection events with scores and results
  • Fact_Violation - Individual violations linked to inspections

Dimension Tables (7):

  • Dim_Business_License - Establishment details (name, address, license)
  • Dim_Date - Date hierarchy (year, quarter, month, week)
  • Dim_Facility_Type - Restaurant, grocery, school cafeteria, etc.
  • Dim_Inspection_Type - Routine, complaint-driven, license, reinspection
  • Dim_Risk_Category - Risk levels (High, Medium, Low)
  • Dim_Food_Inspections_Results - Pass, Fail, Pass with Conditions
  • Dim_Violation - Violation codes, descriptions, comments

Key Relationships:

Dim_Business_License ──┐
Dim_Date ──────────────┤
Dim_Facility_Type ─────┤
Dim_Inspection_Type ───┼──> Fact_Inspection ──> Fact_Violation
Dim_Risk_Category ─────┤                              │
Dim_Results ───────────┘                              │
                                                      │
Dim_Violation ────────────────────────────────────────┘

Total Records in Warehouse:

  • 346,266 inspections
  • 1.2M+ violation records (after normalization)
  • 40K+ unique establishments

Technical Implementation:

ETL Pipeline:

Phase 1: Data Extraction:

  • Downloaded TSV files from city open data portals
  • Initial profiling with ydata-profiling and Alteryx

Phase 2: Data Quality Assessment: Identified issues:

  • Missing violation data (27% Chicago, 72% Dallas)
  • Inconsistent violation formats (text blocks vs structured fields)
  • Duplicate records
  • Invalid date formats
  • Truncated text fields

Phase 3: Data Transformation:

Chicago Processing:

  • Normalized violations column (split pipe-delimited text into separate rows)
  • Extracted violation code, description, comment using regex
  • Added DI metadata (Process ID, Current Date, Workflow name)
  • Calculated risk scores and violation points

Dallas Processing:

  • Consolidated 25 violation columns into single normalized structure
  • Mapped violation points to standardized scale
  • Standardized facility types
  • Created default risk categories

Phase 4: Data Loading:

  • Loaded staged data into SQL Server
  • Populated dimensional model using Talend
  • Applied surrogate keys and SCD Type 1 logic
  • Validated referential integrity

Data Profiling Results:

Chicago Data Quality:

  • 17 variables, 267,866 observations
  • 84,184 missing cells (1.8%)
  • 0 duplicates
  • Key issues: Violations column missing in 27% of records

Dallas Data Quality:

  • 114 variables, 78,400 observations
  • 6.4M missing cells (72.2% - sparse violation matrix)
  • 42 duplicate records removed
  • Key issues: 99%+ missing in violation detail fields 11-25

SQL Capabilities Demonstrated:

  • Dimensional Modeling: Star schema with 7 dimensions, 2 facts
  • DDL Design: Foreign keys, constraints, indexes
  • Data Validation: Integrity checks, duplicate detection
  • ETL Logic: Staging → integration → dimensional model
  • Data Quality: Profiling, missing value handling, normalization

Visualizations:

Tableau Dashboards:

  1. Executive Overview - Total inspections, pass/fail rates, trend analysis
  2. Violation Analysis - Top violations by frequency and severity
  3. Geographic Distribution - Heat map of high-risk establishments by neighborhood
  4. Comparative Analysis - Chicago vs Dallas metrics side-by-side

For dashboard previews, see Tableau folder

Installation & Reproducibility:

Prerequisites:

  • Python 3.8+ (pandas, re, datetime)
  • Microsoft SQL Server
  • Tableau Desktop (for .twb file) or Tableau Public

Setup Instructions:

1. Clone repository:

git clone https://github.com/lakshmi14k/Food-Establishments-Inspection.git

2. Run Python cleaning script:

cd data/raw
python ../../python/clean_data.py

Output: Chicago - Cleaned.csv, Dallas - Cleaned.csv in data/processed/

3. Create database and load schema:

-- Run in SQL Server Management Studio
CREATE DATABASE FoodInspections;
GO

-- Execute DDL script
USE FoodInspections;
GO
-- Run sql/DDL_Script.sql

4. Load data using ETL tool (Talend):

  • Original implementation: Talend Open Studio (workflows in etl/ folder)
  • Alternative: Use Python pandas to load CSVs directly to SQL Server

5. Run validation checks:

-- Execute sql/Validation_Script.sql

6. Open Tableau dashboards:

  • Connect to SQL Server database OR
  • Use processed CSV files from data/processed/

Results:

Data Pipeline:

  • 346K+ inspection records integrated
  • 1.2M+ violation records normalized
  • 40K+ unique establishments profiled
  • 9 dimensional/fact tables populated

Insights Delivered:

  • Identified systematic data quality issues (72% missing Dallas data)
  • Documented top 10 violation categories per city
  • Mapped high-risk establishment concentrations
  • Enabled trend analysis (inspection frequency, pass rates over time)

Business Value:

  • Prioritized inspection scheduling based on risk scores
  • Identified establishments requiring follow-up (failed inspections)
  • Enabled resource allocation optimization
  • Provided comparative benchmarking across cities

Key Technical Achievements

  • ETL Pipeline: Multi-source integration (2 cities, different schemas)
  • Data Quality: Handled 72% missing data without losing analytical value
  • Normalization: Transformed wide format (114 columns) to normalized structure
  • Dimensional Modeling: Star schema with proper grain and relationships
  • Data Profiling: Comprehensive quality assessment before transformation
  • Regex Parsing: Extracted structured data from unstructured text fields

Built with SQL Server, Python, Alteryx, Talend, and Tableau

About

Analyzed food safety inspection records across Chicago and Dallas to identify violation patterns, risk factors, and operational insights for public health departments.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages