This project analyzes the relationship between traffic congestion and collisions in Chicago using big data techniques. Leveraging large public datasets and distributed computing frameworks, we investigate how congestion levels relate to the frequency and severity of collisions across different areas of the city.
- Traffic Crash Data: https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if/about_data
- Traffic Congestion Data: https://data.cityofchicago.org/Transportation/Chicago-Traffic-Tracker-Historical-Congestion-Esti/kf7e-cur8/about_data
- Data Processing: We use PySpark, a distributed computing framework for big data processing, to handle and transform the large-scale traffic and collision datasets. This includes data cleaning, joining the datasets, and performing relevant aggregations and transformations.
- Exploratory Data Analysis: We conduct exploratory data analysis (EDA) to gain insights into the datasets, identify patterns, and visualize key variables related to traffic congestion and collisions.
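A typical first EDA cut is the distribution of crashes by hour of day. The sketch below uses pandas on a toy sample; the column names (`crash_hour`, `injuries`) are illustrative assumptions, not the exact dataset schema.

```python
import pandas as pd

# Toy crash records standing in for the full dataset.
crashes = pd.DataFrame({
    "crash_hour": [8, 8, 17, 17, 17, 23],
    "injuries":   [0, 1, 2, 0, 1, 0],
})

# Crash counts and total injuries per hour of day: a quick way to spot
# rush-hour peaks before any joins against congestion data.
by_hour = crashes.groupby("crash_hour").agg(
    crash_count=("injuries", "size"),
    total_injuries=("injuries", "sum"),
)
print(by_hour)
```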
- Spatial Analysis: By leveraging geospatial libraries such as Geopandas, we analyze the spatial distribution of traffic congestion and collisions across different neighborhoods and regions of Chicago.
- PySpark: Distributed computing framework for big data processing.
- Azure Virtual Machine: Cloud computing platform for running PySpark jobs and managing data.
- MongoDB: NoSQL database for storing and querying data.
- Jupyter Notebook: Interactive environment for data analysis and visualization.
- Geopandas: Python library for working with geospatial data.
- John Olusetire
- Timothy Obuadey
- Anand Seshadri