This project demonstrates a comprehensive Extract, Transform, Load (ETL) pipeline applied to Airbnb listings data from New York City. The objective is to extract raw data, identify and document data quality issues, apply meaningful transformations, and prepare a clean, analysis-ready dataset for business intelligence purposes.
Interactive Dashboardincludes a Streamlit web application for real-time data exploration and visualization!
Live Demo: Interactive Dashboard
The project handles a realistically-sized dataset (31,995+ rows) and showcases industry-standard data engineering practices including:
- Data profiling and quality assessment
- Intelligent data merging strategies
- Context-aware missing value imputation
- Feature engineering and enrichment
- Data type standardization and formatting
- Interactive dashboard for data visualization and analysis
Dashboard link: (https://dsa2040aetexamaime232-e34haydw2ruur4izwjrhjo.streamlit.app/)
Dataset: Airbnb Listings - New York City
Source: Inside Airbnb
Description: Detailed listings data for Airbnb properties in New York City, including pricing, location, reviews, availability, and host information.
- Raw Data: 31,995 listings
- Incremental Data: 4,116 listings (recent additions/updates)
- Merged Dataset: ~34,000+ unique listings
- Original Columns: 18
- Final Columns: 25 (after transformations)
- Listing details (name, description, room type)
- Pricing information
- Geographic data (latitude, longitude, neighborhood)
- Review metrics (count, rating, frequency)
- Host information
- Availability data
Objective: Load, profile, and validate raw data while identifying quality issues.
- Data Loading
- Loaded both
raw_data.csv(31,995 rows) andincremental_data.csv(4,116 rows) - Used pandas for efficient data manipulation
- Data Profiling
- Generated
.head(),.info(), and.describe()summaries - Analyzed structure, data types, and statistical distributions
- Examined both numerical and categorical features
- Data Quality Assessment - Identified 5 Major Issues:
Issue 1: Missing Values
license: 92.84% missing (29,703 nulls)price: 45.20% missing (14,463 nulls)last_review: 34.97% missing (11,188 nulls)reviews_per_month: 34.97% missing (11,188 nulls)host_name: 3.63% missing (1,163 nulls)
Issue 2: Duplicate Records
- Raw data: Clean (0 duplicates)
- Incremental data: 2,526 duplicate IDs (61.4%)
- No overlapping IDs between datasets (completely separate listings)
Issue 3: Data Type Inconsistencies
last_reviewstored as object (string) instead of datetime- IDs stored as integers (should be categorical strings)
Issue 4: Data Range Anomalies
- Price range: $10 - $50,104 (210 listings > $10,000)
- Minimum nights: 1 - 1,124 nights (25 listings > 365 days)
- Outliers present but potentially valid
Issue 5: Text Inconsistencies
- Potential whitespace and case inconsistencies
- Categorical values need standardization
- Data Merging
- Strategy: Concatenated both datasets and removed duplicate IDs
- Rationale: Incremental data contains NEW listings (no overlap with raw data)
- Result: ~34,000 unique listings in merged dataset
- Why Merge?
- Creates complete market view combining historical and recent data
- Demonstrates incremental loading pattern (industry best practice)
- Enables comprehensive temporal analysis
- Data Validation & Storage
- Saved
validated_data.csvfor transformation phase - Preserved original files for reference
- Created reproducible data pipeline
Objective: Clean, standardize, enrich, and prepare data for analysis.
- Dropped
licensecolumn - Reason: 92.84% missing, low business value
- Impact: Streamlined dataset from 18 to 17 columns
- Before/After: Documented column comparison
- Price Missing Values
- Strategy: Median imputation grouped by
room_typeandneighbourhood - Rationale: Prices vary significantly by location and property type
- Before: 14,463 missing | After: 0 missing
- Reviews Per Month Missing Values
- Strategy: Filled with 0
- Rationale: Null means never reviewed = 0 reviews/month
- Before: 11,188 missing | After: 0 missing
- Last Review Missing Values
- Strategy: Created
has_been_reviewedflag, kept dates as NaT - Rationale: Preserves "never reviewed" status while enabling analysis
- Before: 11,188 missing | After: 11,188 NaT (but flagged)
- Host Name Missing Values
- Strategy: Filled with "Unknown Host"
- Rationale: Minimal impact (3.6%), maintains completeness
- Before: 1,163 missing | After: 0 missing
- Listing Name Missing Values
- Strategy: Filled with "Listing {id}"
- Rationale: Only 2 records, used ID as placeholder
- Before: 2 missing | After: 0 missing
Visualization: Bar chart comparing missing values before/after cleaning
- Date Type Conversion
- Converted
last_reviewfrom object → datetime64 - Enables temporal analysis and date arithmetic
- Text Field Standardization
- Standardized 5 text columns:
neighbourhood_group,neighbourhood,room_type,host_name,name - Applied
.strip()and.title()for consistency - Before: "Manhattan ", "manhattan" | After: "Manhattan"
- ID Type Conversion
- Converted
idandhost_idfrom int64 → string - Rationale: IDs are identifiers (categorical), not quantities
Output: Displayed data type comparison before/after
- Price Category (
price_category)
- Created 5-tier classification: Budget, Economy, Mid-Range, Premium, Luxury
- Bins: $0-50, $50-100, $100-200, $200-500, $500+
- Business Value: Enables market segmentation
- Days Since Review (
days_since_review)
- Calculated days between last review and current date
- Business Value: Listing freshness/activity metric
- Review Activity Level (
review_activity)
- Categorized as High (≥2/month), Medium (0.5-2/month), Low (<0.5/month)
- Business Value: Engagement indicator
- Availability Category (
availability_category)
- 4-tier classification: Low, Medium, High, Full
- Bins: 0-90, 90-180, 180-270, 270-365 days
- Business Value: Capacity planning metric
- Host Experience Level (
host_experience)
- Professional (≥10 listings), Experienced (3-9), Beginner (1-2)
- Business Value: Host sophistication indicator
- Is New Listing (
is_new_listing)
- Boolean flag for listings with 0 reviews
- Business Value: Identifies untested properties
Visualizations:
- Price category distribution (bar chart)
- Review activity by room type (stacked bar chart)
- Host experience distribution (pie chart)
- Availability category distribution (bar chart)
- Rental Duration Type (
rental_duration_type)
- Based on
minimum_nights: Short-Stay, Weekly, Monthly, Long-Term - Bins: 0-3, 3-7, 7-30, 30+ nights
- Business Value: Rental strategy segmentation
- Review Volume Tier (
review_volume_tier)
- Quartile-based: Q1-Low, Q2-Medium, Q3-High, Q4-VeryHigh
- Business Value: Popularity ranking
Objective: Provide interactive data exploration and visualization through a web-based dashboard.
- Interactive Data Explorer: Browse and filter the transformed dataset
- Real-time Visualizations: Dynamic charts and graphs
- Geographic Analysis: Location-based insights
- Price Analytics: Price distribution and trends by various dimensions
- Room Type Analysis: Comparative insights across property types
- Review Metrics: Engagement and activity analytics
- Custom Filters: Filter data by neighborhood, price range, room type, etc.
1. Geographic Analysis
Explore listings across NYC with:
- Interactive map colored by price category
- Borough-level distribution (Manhattan, Brooklyn, Queens, Bronx, Staten Island)
- Room type breakdown by location
Comprehensive pricing analytics including:
- Overall price distribution histogram
- Box plots comparing prices across room types
- Price range filters and statistical summaries
- Streamlit: Web application framework
- Plotly: Interactive visualizations
- Pandas: Data manipulation backend
ET_Exam_[FirstName]_[ID3]/
├── data/
│ ├── raw_data.csv # Original dataset (31,995 rows)
│ ├── incremental_data.csv # Incremental updates (4,116 rows)
│ ├── validated_data.csv # Merged & validated dataset
│ ├── raw_data_validated.csv # Validated copy of raw data
│ └── incremental_data_validated.csv # Validated copy of incremental
│
├── transformed/
│ ├── transformed_full.csv # Final transformed dataset
│ ├── transformed_incremental.csv # Transformed incremental subset
│ ├── missing_values_comparison.png # Visualization: Missing data before/after
│ ├── price_category_distribution.png # Visualization: Price categories
│ ├── review_activity_by_roomtype.png # Visualization: Review activity
│ ├── host_experience_distribution.png # Visualization: Host experience
│ └── availability_category_distribution.png # Visualization: Availability
│
├── screenshots/ # Dashboard screenshots NEW!
│ ├── dashboard_home.png # KPIs and overview
│ ├── geographic_analysis.png # Map and borough distribution
│ └── pricing_insights.png # Price analysis charts
│
├── etl_extract.ipynb # Extraction phase notebook
├── etl_transform.ipynb # Transformation phase notebook
├── dashboardapp.py # Streamlit dashboard application
├── requirements.txt # Python dependencies
├── README.md # Project documentation (this file)
└── .gitignore # Git ignore file
# Install all required packages at once
pip install -r requirements.txt# Install required packages manually
pip install pandas numpy matplotlib seaborn jupyter streamlit plotly-
Clone the Repository
git clone https://github.com/[YourUsername]/DSA2040A_ET_Exam_[FirstName]_[ID3].git cd DSA2040A_ET_Exam_[FirstName]_[ID3] -
Install Dependencies
pip install -r requirements.txt
-
Prepare Data
- Place
raw_data.csvandincremental_data.csvin thedata/folder - Ensure files are properly formatted CSV with headers
- Place
-
Run Extraction Phase
jupyter notebook etl_extract.ipynb
- Execute all cells in sequence (Cell → Run All)
- Review output for data quality issues
- Verify
validated_data.csvis created indata/folder
-
Run Transformation Phase
jupyter notebook etl_transform.ipynb
- Execute all cells in sequence (Cell → Run All)
- Review transformations and visualizations
- Verify output files in
transformed/folder
-
Launch Interactive Dashboard
streamlit run dashboardapp.py
- Dashboard will open automatically in your default browser
- Default URL: http://localhost:8501
- Explore data interactively through the web interface
-
Verify Outputs
- Check
transformed/transformed_full.csvexists - Review generated visualizations in
transformed/folder - Confirm final dataset has 25 columns and ~34,000 rows
- Test dashboard functionality and filters
- Check
Validated data loaded successfully
- Shape: 31,995 rows × 18 columns
- Memory usage: 4.37 MB
| Column | Missing Count | Missing Percentage |
|---|---|---|
| license | 29,703 | 92.84% |
| price | 14,463 | 45.20% |
| last_review | 11,188 | 34.97% |
TRANSFORMATION STATISTICS:
Original columns: 18
Columns dropped: 1 (license)
Columns added: 8
Final columns: 25
Total rows: 34,111
- Missing values reduced: 46,805 → 11,188 (75% reduction)
- Data completeness: 54.8% → 96.7%
- Usable columns: 17 core + 8 derived = 25 total
The Streamlit dashboard provides comprehensive analytics across multiple dimensions:
- Real-time Filtering: Dynamic filters for neighborhood, room type, price range, and host experience
- Geographic Visualization: Interactive map showing listing distribution by price category across NYC
- Pricing Analytics: Price distribution histograms and box plots by room type
- Review Intelligence: Activity levels and engagement metrics across property types
- Host Analytics: Experience level distribution and multi-property host identification
- Market Trends: Borough-level and neighborhood-level performance comparisons
- Data Export: Download filtered datasets for further analysis
- Completeness: Increased from 54.8% to 96.7% through intelligent imputation
- Consistency: Standardized all text fields and data types
- Usability: Removed low-value column (license) and added 8 high-value features
- Price Segmentation: Enables targeted marketing by price tier
- Host Profiling: Identifies professional vs. beginner hosts for tailored support
- Activity Metrics: Review activity and freshness indicators for ranking algorithms
- Availability Intelligence: Capacity planning and inventory optimization
- Dynamic Pricing: Use price categories and availability for rate optimization
- Host Onboarding: Target beginner hosts with low review activity for training
- Search Ranking: Prioritize listings with high review activity and recent reviews
- Market Analysis: Analyze trends by neighborhood, room type, and price tier
- Predictive Modeling: Clean, feature-rich dataset ready for ML applications
- Interactive Exploration: Stakeholders can explore data without technical skills
The loading phase has been implemented and you can access it here: https://github.com/DevAime/DSA2040A_Load_AimeMuganga_232
You can contact me through my email if any inquiry.
- Email: aimmug200507@gmail.com


