
Layoffs Data Cleaning (SQL + Pandas Project)

This project demonstrates a complete data-cleaning workflow on a real-world layoffs dataset using both MySQL and Pandas (Python). The dataset contains information on company layoffs across industries, including company names, locations, industries, funding, and layoff counts.

The goal is to transform the raw dataset (with duplicates, inconsistent formatting, and missing values) into a clean, standardized version ready for analysis or visualization.

Tools Used:

MySQL → Data cleaning & transformation (SQL queries)

MySQL Workbench → Query execution & testing

Pandas (Python) → Alternative data-cleaning pipeline

GitHub → Project versioning & sharing

Data Cleaning Steps:

1. Remove Duplicates

SQL: Used ROW_NUMBER() OVER (PARTITION BY …) to flag duplicate rows, then deleted them.

Pandas: Used df.drop_duplicates() to remove exact duplicate rows.
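SQL Example (a minimal sketch, not necessarily the repository's exact query: it assumes a second staging table layoffs_staging2 with the same columns plus an INT row_num, MySQL 8+ for window functions, and the column list shown, so adjust the PARTITION BY to your schema):

-- Copy rows into the working table, numbering repeats within each duplicate group
-- (layoffs_staging2 and the PARTITION BY column list are assumptions, not confirmed by the repo)
INSERT INTO layoffs_staging2
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, location, industry, total_laid_off,
                        percentage_laid_off, `date`, stage, country, funds_raised_millions
       ) AS row_num
FROM layoffs_staging;

-- Keep the first row of each group and delete the rest
DELETE FROM layoffs_staging2 WHERE row_num > 1;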

2. Standardize Data

Trimmed extra spaces from company names.

Standardized industry names (e.g., any value matching "Crypto%" → "Crypto").

Corrected country names (e.g., values matching "United States%" → "United States").

Converted date column from text → proper DATE / datetime format.

SQL Example:

UPDATE layoffs_staging SET company = TRIM(company);

UPDATE layoffs_staging SET industry = 'Crypto' WHERE industry LIKE 'Crypto%';
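The country and date fixes are not shown above; here is a minimal MySQL sketch (it assumes the offending country values carry a trailing period, e.g. "United States.", and that the raw dates are MM/DD/YYYY text, matching the Pandas format string below):

-- Strip an assumed trailing period from country values
UPDATE layoffs_staging SET country = TRIM(TRAILING '.' FROM country) WHERE country LIKE 'United States%';

-- Parse the text dates; the MM/DD/YYYY input format is an assumption
UPDATE layoffs_staging SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');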

Pandas Example:

df["company"] = df["company"].str.strip() df["industry"] = df["industry"].replace(r"^Crypto.", "Crypto", regex=True) df["country"] = df["country"].replace(r"United States.", "United States", regex=True) df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y", errors="coerce")

3. Handle Null or Missing Values

Replaced missing industry values by borrowing from other rows of the same company: a self-join in SQL, or a per-company groupby().transform() with ffill()/bfill() in Pandas.

Deleted rows where both total_laid_off and percentage_laid_off were null.

SQL Example:

UPDATE layoffs_staging t1
JOIN layoffs_staging t2
  ON t1.company = t2.company
SET t1.industry = t2.industry
WHERE (t1.industry IS NULL OR t1.industry = '')
  AND t2.industry IS NOT NULL;
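And the deletion described above, mirroring the dropna call below (column names as given in the dataset):

DELETE FROM layoffs_staging WHERE total_laid_off IS NULL AND percentage_laid_off IS NULL;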

Pandas Example:

df["industry"] = df.groupby("company")["industry"].transform(lambda x: x.ffill().bfill()) df = df.dropna(subset=["total_laid_off", "percentage_laid_off"], how="all")

4. Final Cleanup

Dropped helper columns (e.g., row_num in SQL).

Ensured consistent column types (DATE, INT, VARCHAR in SQL; matching dtypes in Pandas).

Exported cleaned dataset to CSV.

Pandas Example:

df.to_csv("cleaned_layoffs_by_pandas.csv", index=False)
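SQL Example (a sketch of the corresponding MySQL cleanup; layoffs_staging2 is the assumed working table from step 1):

-- Drop the duplicate-detection helper column mentioned above
ALTER TABLE layoffs_staging2 DROP COLUMN row_num;

-- Enforce a proper DATE type once the text dates are parsed (see step 2)
ALTER TABLE layoffs_staging2 MODIFY COLUMN `date` DATE;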

Final Output

Raw data: Contained duplicates, inconsistent text, and null values.

Cleaned data: Standardized company/industry names, formatted dates, removed duplicates, minimized nulls.

Final datasets:

Cleaned_layoffs.csv (from SQL pipeline)

cleaned_layoffs_by_pandas.csv (from Pandas pipeline)

Next Steps

Perform Exploratory Data Analysis (EDA) in Python or SQL.

Build dashboards in Tableau / Power BI / Looker Studio.

Analyze layoff trends by year, industry, country, and funding stage.

Key Learnings

How to structure a SQL-based data-cleaning pipeline.

How to replicate the same cleaning process in Pandas.

Importance of standardization in categorical data.

Handling duplicates and null values in real-world datasets.

Author: Muhammad Akbar Khan

Master’s student in Applied Mathematics | Data Enthusiast
