Citi Bike NYC Trip Data Analysis Using Apache Spark
This project analyzes Citi Bike trip data to understand demand patterns and system behavior. The analysis focuses on net flow forecasting, trip duration prediction, station clustering, and differences between electric and classic bikes to support operational insights.
The goal of this project is to support operational and strategic decision-making for NYC's bike-sharing system, Citi Bike.
We aim to:
- Forecast net bike flow per station (1-hour ahead)
- Predict trip duration (ETA)
- Identify station usage patterns
- Compare e-bikes and classic bikes in terms of demand and behavior
The project uses:
- Big Data Frameworks: Apache Spark, PySpark
- Machine Learning Algorithms: Linear Regression, K-Means Clustering, Random Forest
- Data Preprocessing: Data cleaning, Feature Engineering, Target Construction, Feature Vector Assembly
- Programming Language: Python
- Libraries: PySpark, pandas, matplotlib, MLlib
The data is Citi Bike system data, publicly available on the Citi Bike website. This project uses all trip data available for May 2025.
The analysis follows these steps:
- Loaded the Citi Bike trip data CSV files for May 2025 into Spark.
- Removed incomplete or invalid trip records.
- Ensured correct data types for timestamps and numeric fields.
- Filtered unrealistic or corrupted trip duration values.
- Saved the cleaned dataset in Parquet format for efficient reuse.
- Extracted time-based features (hour, weekday).
- Aggregated departures and arrivals per station.
- Computed net bike flow (arrivals − departures).
- Created next-hour target variable for forecasting.
- Built a Linear Regression model to predict next-hour net flow.
- Evaluated forecasting performance using regression metrics.
- Built a Linear Regression model to estimate trip duration (ETA).
- Analyzed prediction errors and model accuracy.
- Aggregated station-level usage features.
- Applied K-Means clustering to identify station types.
- Evaluated clustering quality using Silhouette Score.
- Created distinct geographical "zones" based on start locations to determine if e-bikes dominate specific areas of the city.
- Quantified how much time an e-bike saves compared to a classic bike, controlling for trip distance and start zone.
- Determined which factors drive a user to choose an electric bike over a classic bike.
- Plotted hourly net bike flow trends to illustrate demand fluctuations.
- Visualized predicted vs. actual values for net flow forecasting.
- Analyzed error distribution for trip duration predictions.
- Created bar plots comparing electric and classic bike usage.
- Visualized station clusters to highlight different usage patterns.
- Compared key metrics across objectives to support operational insights.
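The feature-engineering steps above (hourly aggregation of departures and arrivals, net flow, and the next-hour target) can be illustrated with a small in-memory sketch. This is plain Python on a few made-up trips, not the project's Spark code. The sample records, the function names, and the choice to drop rows whose next hour has no recorded activity are all assumptions made for illustration. In a Spark pipeline the same logic would typically be expressed with `groupBy` aggregations and a `lead` window function.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical sample trips: (start_station, end_station, started_at, ended_at).
trips = [
    ("A", "B", "2025-05-01 08:05:00", "2025-05-01 08:20:00"),
    ("A", "B", "2025-05-01 08:30:00", "2025-05-01 08:50:00"),
    ("B", "A", "2025-05-01 08:40:00", "2025-05-01 09:10:00"),
    ("B", "A", "2025-05-01 09:15:00", "2025-05-01 09:35:00"),
]

FMT = "%Y-%m-%d %H:%M:%S"


def hour_bucket(ts):
    """Parse a timestamp string and truncate it to the start of its hour."""
    return datetime.strptime(ts, FMT).replace(minute=0, second=0)


def net_flow_per_station_hour(trips):
    """Return {(station, hour): arrivals - departures}."""
    departures = Counter()
    arrivals = Counter()
    for start, end, started_at, ended_at in trips:
        # A departure counts in the hour the trip starts,
        # an arrival in the hour the trip ends.
        departures[(start, hour_bucket(started_at))] += 1
        arrivals[(end, hour_bucket(ended_at))] += 1
    keys = set(departures) | set(arrivals)
    return {key: arrivals[key] - departures[key] for key in keys}


def add_next_hour_target(net_flow):
    """Pair each station-hour with the same station's net flow one hour later.

    Assumption: station-hours whose next hour has no recorded activity
    are dropped rather than given a target of 0.
    """
    rows = []
    for (station, hour), flow in sorted(net_flow.items()):
        target = net_flow.get((station, hour + timedelta(hours=1)))
        if target is None:
            continue
        rows.append({
            "station": station,
            "hour": hour.hour,          # time-based feature
            "weekday": hour.weekday(),  # Monday = 0
            "net_flow": flow,
            "target_next_hour": target,
        })
    return rows


net_flow = net_flow_per_station_hour(trips)
training_rows = add_next_hour_target(net_flow)
for row in training_rows:
    print(row)
```

A negative net flow means more bikes left the station than arrived in that hour. Whether empty hours should be dropped or filled with zeros depends on how the forecasting model is expected to treat inactive stations.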
Key findings:
- Station-level demand follows clear hourly patterns, allowing short-term net flow forecasting.
- Certain stations consistently act as high-demand commuter hubs, while others show low and stable activity.
- Trip duration can be reasonably predicted for typical rides, though extreme trips remain more variable.
- Electric and classic bikes exhibit different usage behavior, suggesting distinct operational roles within the system.
- Net flow analysis highlights periods of potential bike shortages or dock saturation, supporting rebalancing decisions.
Citi Bike data is provided under the NYC Bike Data Use Policy for lawful purposes, including academic research.
The main legal and ethical issue is user privacy, as detailed trip data could potentially allow indirect identification of users.
To address this, the project follows the Citi Bike Data Use Policy, uses the data only for academic purposes, and reports results in aggregated form without sharing individual trip records.