NYCCitiBikeShare

Citi Bike NYC Trip Data Analysis Using Apache Spark

Project Overview

This project analyzes Citi Bike trip data to understand demand patterns and system behavior. The analysis focuses on net flow forecasting, trip duration prediction, station clustering, and differences between electric and classic bikes to support operational insights.

Project Goal

The goal of this project is to support operational and strategic decision-making for Citi Bike, NYC's bike-sharing system.

We aim to:

Forecast net bike flow per station (1-hour ahead)
Predict trip duration (ETA)
Identify station usage patterns
Compare e-bikes and classic bikes in terms of demand and rider behavior

Technologies and Tools

Big Data Frameworks: Apache Spark, PySpark
Machine Learning Algorithms: Linear Regression, K-Means Clustering, Random Forest
Data Preprocessing: Data cleaning, Feature Engineering, Target Construction, Feature Vector Assembly
Programming Language: Python
Libraries: PySpark, MLlib, pandas, matplotlib

Data Source

The analysis uses Citi Bike system data, which is publicly available on the Citi Bike website. Specifically, it covers all trip data published for May 2025.

Analysis Process

The analysis follows these steps:

Data Cleaning

Loaded the Citi Bike trip data CSV files for May 2025 into Spark.
Removed incomplete or invalid trip records.
Ensured correct data types for timestamps and numeric fields.
Filtered out unrealistic or corrupted trip duration values.
Saved the cleaned dataset in Parquet format for efficient reuse.
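
The cleaning rules above can be sketched in plain Python. The project runs these steps in Spark; this standard-library version only illustrates the logic on a tiny inline sample. The column names follow the public Citi Bike CSV layout but are assumptions here, as are the 60-second and 24-hour duration bounds and the helper names `parse_ts` and `clean_trips`. Writing the result to Parquet is left out.

```python
import csv
import io
from datetime import datetime

# Assumed column names; the README does not list the schema.
REQUIRED = ("ride_id", "rideable_type", "started_at", "ended_at",
            "start_station_id", "end_station_id")

# Hypothetical plausibility bounds for a trip, in seconds.
MIN_SECONDS = 60
MAX_SECONDS = 24 * 3600


def parse_ts(text):
    """Parse 'YYYY-MM-DD HH:MM:SS[.fff]' into a datetime, or None if malformed."""
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None


def clean_trips(rows):
    """Yield typed trip dicts, skipping incomplete or implausible records."""
    for row in rows:
        if any(not row.get(col) for col in REQUIRED):
            continue  # incomplete record
        start, end = parse_ts(row["started_at"]), parse_ts(row["ended_at"])
        if start is None or end is None:
            continue  # invalid timestamp
        duration = (end - start).total_seconds()
        if not MIN_SECONDS <= duration <= MAX_SECONDS:
            continue  # unrealistic duration
        yield {**row, "started_at": start, "ended_at": end,
               "duration_s": duration}


# Made-up sample rows: one valid, one missing ended_at, one 20-second trip.
SAMPLE = """ride_id,rideable_type,started_at,ended_at,start_station_id,end_station_id
A1,electric_bike,2025-05-01 08:00:00.000,2025-05-01 08:12:30.000,6140.05,5905.12
A2,classic_bike,2025-05-01 08:05:00.000,,6140.05,5905.12
A3,classic_bike,2025-05-01 09:00:00.000,2025-05-01 09:00:20.000,5905.12,6140.05
"""

cleaned = list(clean_trips(csv.DictReader(io.StringIO(SAMPLE))))
print(len(cleaned), "of 3 sample rows kept")
```

Only the first sample row survives: the second has no end timestamp and the third is shorter than the assumed minimum duration.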

Feature Engineering

Extracted time-based features (hour, weekday).
Aggregated departures and arrivals per station.
Computed net bike flow (arrivals − departures).
Created next-hour target variable for forecasting.
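
As a rough illustration of the steps above, the following plain-Python sketch counts departures and arrivals per station and hour, derives net flow as arrivals minus departures, and attaches the next hour's net flow as the target. The field names, the helper names, and the choice to treat station-hours without trips as zero flow are assumptions, not details taken from the project notebook.

```python
from collections import Counter
from datetime import datetime, timedelta


def hour_floor(ts):
    """Truncate a datetime to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def net_flow(trips):
    """Return {(station_id, hour): arrivals - departures}."""
    flow = Counter()
    for t in trips:
        flow[(t["start_station_id"], hour_floor(t["started_at"]))] -= 1
        flow[(t["end_station_id"], hour_floor(t["ended_at"]))] += 1
    return dict(flow)


def with_next_hour_target(flow):
    """Build feature rows with the next hour's net flow as the target.

    Assumption: a station-hour with no recorded trips has net flow 0.
    """
    rows = []
    for (station, hour), value in sorted(flow.items()):
        target = flow.get((station, hour + timedelta(hours=1)), 0)
        rows.append({
            "station_id": station,
            "hour": hour.hour,
            "weekday": hour.weekday(),  # Monday is 0
            "net_flow": value,
            "target_next_hour": target,
        })
    return rows


# Three made-up trips between two hypothetical stations "A" and "B".
trips = [
    {"start_station_id": "A", "end_station_id": "B",
     "started_at": datetime(2025, 5, 1, 8, 10),
     "ended_at": datetime(2025, 5, 1, 8, 25)},
    {"start_station_id": "A", "end_station_id": "B",
     "started_at": datetime(2025, 5, 1, 8, 40),
     "ended_at": datetime(2025, 5, 1, 9, 5)},
    {"start_station_id": "B", "end_station_id": "A",
     "started_at": datetime(2025, 5, 1, 9, 10),
     "ended_at": datetime(2025, 5, 1, 9, 30)},
]

flow = net_flow(trips)
rows = with_next_hour_target(flow)
for row in rows:
    print(row)
```

In a real pipeline the last observed hour has no known successor, so its rows would normally be dropped rather than given a zero target.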

Forecasting (Objective A)

Built a Linear Regression model to predict next-hour net flow.
Evaluated forecasting performance using regression metrics.
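
The README does not name the regression metrics used. A minimal sketch, assuming the common choices MAE, RMSE, and R², and a single-feature least-squares fit standing in for the Spark MLlib model, might look like this. The data points are invented for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for paired actual and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot,
    }


# Invented example: current-hour net flow (feature) vs. next-hour net flow.
current = [-4, -2, 0, 2, 4]
next_hour = [-3, -1, 0, 1, 3]

slope, intercept = fit_line(current, next_hour)
predicted = [slope * x + intercept for x in current]
metrics = regression_metrics(next_hour, predicted)
print(slope, intercept, metrics)
```

The metrics here are computed on the same rows used for fitting, which is only acceptable for a toy example. A meaningful evaluation would score predictions on held-out hours.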

Trip Duration Modeling (Objective B)

Built a Linear Regression model to estimate trip duration (ETA).
Analyzed prediction errors and model accuracy.
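
The features of the duration model are not listed in this README. One plausible input is the straight-line distance between start and end coordinates, so the sketch below shows a haversine distance helper together with a simple error summary. The function names and the sample values are hypothetical.

```python
import math
import statistics

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def error_summary(actual, predicted):
    """Mean and median absolute error of the predictions."""
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    return {"mean_abs": sum(abs_err) / len(abs_err),
            "median_abs": statistics.median(abs_err)}


# Invented durations in minutes; the last trip is an unusually long ride.
actual_min = [10, 12, 8, 45]
predicted_min = [11, 11, 9, 20]
summary = error_summary(actual_min, predicted_min)
print(summary)
```

In this made-up sample the median error is one minute while the mean is seven, because a single long ride dominates the average. Comparing the two is a quick way to see how much extreme trips affect the reported accuracy.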

Station Clustering (Objective C)

Aggregated station-level usage features.
Applied K-Means clustering to identify station types.
Evaluated clustering quality using Silhouette Score.
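
To make the evaluation step concrete, here is a small standard-library sketch of the silhouette score for a given cluster assignment. It uses plain Euclidean distance. Spark MLlib's `ClusteringEvaluator` defaults to squared Euclidean distance, so the values it reports will not match this sketch exactly. The two-dimensional points are invented stand-ins for station-level usage features.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points; needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # Mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical features per station: (mean hourly departures, mean hourly arrivals).
stations = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(stations, [0, 0, 1, 1])
bad = silhouette_score(stations, [0, 1, 0, 1])
print(good, bad)
```

A score near 1 indicates compact, well-separated clusters, and a negative score indicates points that sit closer to another cluster than to their own.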

Bike Type Comparison (Objective D)

Grouped start locations into geographical zones to test whether e-bikes dominate specific areas of the city.
Estimated how much time an e-bike saves compared to a classic bike, controlling for trip distance and start zone.
Identified the factors associated with a rider choosing an e-bike over a classic bike.
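
The README does not say how zones were built or which model was used to control for distance and zone. The sketch below is therefore a stand-in, not the project's method: it snaps coordinates to a hypothetical grid to form zones, then compares mean durations of classic and electric trips within each zone and distance bucket. The grid size, bucket width, field names, and `rideable_type` values are assumptions.

```python
import math
from collections import defaultdict


def zone_of(lat, lng, cell_deg=0.02):
    """Hypothetical zoning: index of the square grid cell containing the point."""
    return (math.floor(lat / cell_deg), math.floor(lng / cell_deg))


def ebike_time_saving(trips, bucket_km=1.0):
    """Size-weighted mean of (classic - electric) duration across strata.

    A stratum is a (zone, distance bucket) pair; strata lacking either
    bike type are skipped. Returns None if no stratum has both types.
    """
    strata = defaultdict(lambda: {"classic_bike": [], "electric_bike": []})
    for t in trips:
        key = (t["zone"], int(t["distance_km"] // bucket_km))
        strata[key][t["rideable_type"]].append(t["duration_min"])

    total, weight = 0.0, 0
    for groups in strata.values():
        classic, electric = groups["classic_bike"], groups["electric_bike"]
        if not classic or not electric:
            continue
        n = len(classic) + len(electric)
        diff = sum(classic) / len(classic) - sum(electric) / len(electric)
        total += n * diff
        weight += n
    return total / weight if weight else None


# Invented trips; zone labels are given directly to keep the sample short.
sample = [
    {"zone": "Z1", "distance_km": 1.2, "rideable_type": "classic_bike", "duration_min": 10},
    {"zone": "Z1", "distance_km": 1.5, "rideable_type": "electric_bike", "duration_min": 7},
    {"zone": "Z1", "distance_km": 1.8, "rideable_type": "classic_bike", "duration_min": 12},
    {"zone": "Z2", "distance_km": 3.1, "rideable_type": "classic_bike", "duration_min": 20},
    {"zone": "Z2", "distance_km": 3.4, "rideable_type": "electric_bike", "duration_min": 15},
    {"zone": "Z2", "distance_km": 0.5, "rideable_type": "electric_bike", "duration_min": 3},
]

saving = ebike_time_saving(sample)
print("Estimated minutes saved per trip:", saving)
```

Stratifying is a coarse way to hold distance and zone roughly constant. A regression with a bike-type indicator plus distance and zone terms would serve the same purpose with finer control.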

Visualization & Interpretation

Plotted hourly net bike flow trends to illustrate demand fluctuations.
Visualized predicted vs. actual values for net flow forecasting.
Analyzed error distribution for trip duration predictions.
Created bar plots comparing electric and classic bike usage.
Visualized station clusters to highlight different usage patterns.
Compared key metrics across objectives to support operational insights.

Results

Station-level demand follows clear hourly patterns, allowing short-term net flow forecasting.
Certain stations consistently act as high-demand commuter hubs, while others show low and stable activity.
Trip duration can be reasonably predicted for typical rides, though extreme trips remain more variable.
Electric and classic bikes exhibit different usage behavior, suggesting distinct operational roles within the system.
Net flow analysis highlights periods of potential bike shortages or dock saturation, supporting rebalancing decisions.

Legal and Ethical Considerations

Citi Bike data is provided under the NYC Bike Data Use Policy for lawful purposes, including academic research.
The main legal and ethical issue is user privacy, as detailed trip data could potentially allow indirect identification of users. To address this, the project follows the Citi Bike Data Use Policy, uses the data only for academic purposes, and reports results in aggregated form without sharing individual trip records.
