This project demonstrates Big Data concepts and Apache Spark (PySpark) capabilities using Amazon sales data. The notebook covers data analysis, MapReduce logic, RDD vs DataFrame comparisons, Spark SQL queries, and real-time data streaming simulations.
The project follows these steps in the amazon_sales_v2.ipynb file:
- Loading the `Amazon Sale Report.csv` dataset.
- Analyzing row and column counts (Volume).
- Inspecting data types (Variety).
- Observing data flow speed by checking the date range (Velocity).
- Data Cleaning: Handling missing values for `Amount`, `Order ID`, etc.
- Feature Engineering: Extracting `Month` and `Day Name` from dates.
- Simulating MapReduce logic using PySpark.
- Example: Calculating total sales amount for each category.
- Performing the same operation (e.g., total sales by category) using both RDD (Resilient Distributed Dataset) and DataFrame APIs.
- Demonstrating memory optimization techniques (`.cache()`, `.persist()`).
Converting the DataFrame into a temporary SQL view (amazon_sales) and running the following SQL queries:
- Total sales amount based on category.
- The state with the highest sales.
- Percentage of "Cancelled" orders.
- Creating a folder (`stream_input_folder`) to simulate real-time data flow.
- Listening to this folder and processing new incoming data instantly to update category-based sales totals.
- Bar Chart: Total Sales by Category.
- Pie Chart: Order Status Distribution.
- Final project conclusion and insights.
To run this project, you need the following technologies:
- Python
- Apache Spark (PySpark)
- Jupyter Notebook or Google Colab
- Helper libraries: `pandas`, `matplotlib`, `seaborn`
- Clone this repo or download the files.
- Ensure PySpark is installed in your environment (Local or Colab).
- Open the `amazon_sales.ipynb` file.
- Make sure to update the path of the dataset (`Amazon Sale Report.csv`) according to your environment (the notebook uses `/content/drive/...`; change this to your local path if running locally).
- Run the cells sequentially to view the analyses.
The dataset used in this analysis contains over 128,000 rows of Amazon India sales data.
- Source: You can download the dataset from Kaggle: Amazon Sales Report Dataset
- Size: ~128k rows and 24 columns.
- Note: Due to GitHub's file size limits, the raw CSV file is not included in this repository. To run the notebook, please download the CSV from the link above and place it in the `/data` directory.