This project demonstrates a complete data-cleaning workflow on a real-world layoffs dataset using both MySQL and Pandas (Python). The dataset contains information on company layoffs across industries, including company names, locations, industries, funding, and layoff counts.
The goal is to transform the raw dataset (with duplicates, inconsistent formatting, and missing values) into a clean, standardized version ready for analysis or visualization.
MySQL → Data cleaning & transformation (SQL queries)
MySQL Workbench → Query execution & testing
Pandas (Python) → Alternative data-cleaning pipeline
GitHub → Project versioning & sharing
SQL: Used ROW_NUMBER() OVER (PARTITION BY …) to flag duplicate rows, then deleted them.
Pandas: Used df.drop_duplicates() to remove exact duplicate rows.
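The SQL-style ROW_NUMBER() approach can also be mirrored in Pandas with groupby().cumcount(); this is a minimal sketch with made-up sample data, where `keys` stands in for the PARTITION BY columns:

```python
import pandas as pd

# Hypothetical sample frame; column names follow the layoffs dataset.
df = pd.DataFrame({
    "company": ["Acme", "Acme", "Beta"],
    "location": ["NYC", "NYC", "SF"],
    "total_laid_off": [100, 100, 50],
})

# Mirror ROW_NUMBER() OVER (PARTITION BY company, location, total_laid_off):
# cumcount() numbers rows within each group starting at 0.
keys = ["company", "location", "total_laid_off"]
df["row_num"] = df.groupby(keys).cumcount() + 1

# Keep only the first occurrence in each duplicate group (row_num == 1),
# then drop the helper column, just as the SQL pipeline does.
deduped = df[df["row_num"] == 1].drop(columns="row_num")
```

For exact duplicates this is equivalent to df.drop_duplicates(subset=keys), but the row_num column makes the intermediate state inspectable, which is useful when validating against the SQL output.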
Trimmed extra spaces from company names.
Standardized industry names (e.g., "Crypto%" → "Crypto").
Corrected country names ("United States%" → "United States").
Converted the date column from text to a proper DATE (SQL) / datetime (Pandas) type.
UPDATE layoffs_staging SET company = TRIM(company);
UPDATE layoffs_staging SET industry = 'Crypto' WHERE industry LIKE 'Crypto%';
df["company"] = df["company"].str.strip()
df["industry"] = df["industry"].replace(r"^Crypto.*$", "Crypto", regex=True)
df["country"] = df["country"].replace(r"^United States.*$", "United States", regex=True)
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y", errors="coerce")
Replaced missing industry values using a self-join in SQL, or a per-company ffill().bfill() inside groupby().transform() in Pandas.
Deleted rows where both total_laid_off and percentage_laid_off were null.
UPDATE layoffs_staging t1
JOIN layoffs_staging t2 ON t1.company = t2.company
SET t1.industry = t2.industry
WHERE (t1.industry IS NULL OR t1.industry = '')
  AND t2.industry IS NOT NULL;
df["industry"] = df.groupby("company")["industry"].transform(lambda x: x.ffill().bfill())
df = df.dropna(subset=["total_laid_off", "percentage_laid_off"], how="all")
Dropped helper columns (e.g., row_num in SQL).
Ensured consistent column types (DATE, INT, VARCHAR / Pandas dtypes).
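On the Pandas side, the type consistency above can be enforced explicitly after cleaning; a minimal sketch, assuming the columns arrive from CSV as strings (sample values are illustrative):

```python
import pandas as pd

# Hypothetical frame as it might look straight after pd.read_csv (all object dtype).
df = pd.DataFrame({
    "date": ["03/11/2023", "04/01/2023"],
    "total_laid_off": ["100", "250"],
    "company": ["Acme", "Beta"],
})

# Coerce each column to its intended dtype, mirroring DATE / INT / VARCHAR in SQL.
# errors="coerce" turns unparseable values into NaT/NA instead of raising.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y", errors="coerce")
df["total_laid_off"] = pd.to_numeric(df["total_laid_off"], errors="coerce").astype("Int64")
df["company"] = df["company"].astype("string")
```

The nullable Int64 dtype is used here because the layoff counts contain missing values, which a plain int64 column cannot hold.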
Exported cleaned dataset to CSV.
df.to_csv("cleaned_layoffs_by_pandas.csv", index=False)
Raw data: Contained duplicates, inconsistent text, and null values.
Cleaned data: Standardized company/industry names, formatted dates, removed duplicates, minimized nulls.
Cleaned_layoffs.csv (from SQL pipeline)
cleaned_layoffs_by_pandas.csv (from Pandas pipeline)
Perform Exploratory Data Analysis (EDA) in Python or SQL.
Build dashboards in Tableau / Power BI / Looker Studio.
Analyze layoff trends by year, industry, country, funding stage.
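As a starting point for the trend analysis above, a year-by-industry breakdown could look like this in Pandas; the cleaned CSV filename comes from this project, while the inline sample data and aggregation choices are illustrative:

```python
import pandas as pd

# In the real workflow, load the cleaned output of either pipeline, e.g.:
# df = pd.read_csv("cleaned_layoffs_by_pandas.csv", parse_dates=["date"])
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-06-01", "2022-07-15", "2023-01-10"]),
    "industry": ["Crypto", "Retail", "Crypto"],
    "total_laid_off": [120, 80, 300],
})

# Total layoffs per year and industry, largest first.
trends = (
    df.assign(year=df["date"].dt.year)
      .groupby(["year", "industry"], as_index=False)["total_laid_off"]
      .sum()
      .sort_values("total_laid_off", ascending=False)
)
```

The same result feeds directly into a Tableau or Power BI bar chart, or into further grouping by country or funding stage.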
How to structure a SQL-based data-cleaning pipeline.
How to replicate the same cleaning process in Pandas.
Importance of standardization in categorical data.
Handling duplicates and null values in real-world datasets.