Amazon Sales Data Analysis and Big Data Processing (PySpark)

This project demonstrates Big Data concepts and Apache Spark (PySpark) capabilities using Amazon sales data. The notebook covers data analysis, MapReduce logic, RDD vs DataFrame comparisons, Spark SQL queries, and real-time data streaming simulations.

📂 Project Structure

The project follows these steps in the amazon_sales_v2.ipynb file:

1. Data Loading and Inspection (Volume, Velocity, Variety)

Loading the Amazon Sale Report.csv dataset.
Analyzing row and column counts (Volume).
Inspecting data types (Variety).
Estimating how quickly data arrives by checking the date range of the orders (Velocity).
Data Cleaning: Handling missing values for Amount, Order ID, etc.
Feature Engineering: Extracting Month and Day Name from dates.
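
The inspection, cleaning, and feature-engineering steps above can be sketched in plain Python on a tiny in-memory sample. This is only an illustration, not the notebook's PySpark code: the sample rows, the MM-DD-YY date format, the Category column, and the policy of dropping rows without an Order ID while treating a missing Amount as 0.0 are all assumptions made for the example.

```python
import csv
import io
from datetime import datetime

# Hypothetical three-row sample; the real file has far more rows and columns.
SAMPLE_CSV = """Order ID,Date,Category,Amount
A-1,04-30-22,Set,406.0
A-2,05-01-22,Kurta,
,05-02-22,Set,329.0
"""

DATE_FORMAT = "%m-%d-%y"  # assumed date layout, adjust to the real data


def load_rows(text):
    """Parse CSV text into a list of dicts, one per order line."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows):
    """Drop rows with no Order ID and treat a missing Amount as 0.0."""
    cleaned = []
    for row in rows:
        if not row["Order ID"]:
            continue
        fixed = dict(row)
        fixed["Amount"] = float(row["Amount"]) if row["Amount"] else 0.0
        cleaned.append(fixed)
    return cleaned


def add_date_features(rows, date_format=DATE_FORMAT):
    """Add Month (1-12) and Day Name columns derived from Date."""
    enriched = []
    for row in rows:
        parsed = datetime.strptime(row["Date"], date_format)
        enriched.append(
            {**row, "Month": parsed.month, "Day Name": parsed.strftime("%A")}
        )
    return enriched


rows = load_rows(SAMPLE_CSV)
print("Volume:", len(rows), "rows x", len(rows[0]), "columns")

dates = [datetime.strptime(r["Date"], DATE_FORMAT) for r in rows]
print("Velocity: orders span", min(dates).date(), "to", max(dates).date())

features = add_date_features(clean_rows(rows))
print(features[0]["Month"], features[0]["Day Name"])
```

In PySpark the same ideas would typically map to `df.count()`, `df.printSchema()`, `dropna`/`fillna`, and date functions from `pyspark.sql.functions`.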

2. Hadoop & MapReduce Simulation

Simulating MapReduce logic using PySpark.
Example: Calculating total sales amount for each category.
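
To make the three MapReduce phases visible, here is a minimal plain-Python sketch of the category-total example. It is not taken from the notebook, which uses PySpark; the sample records and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (category, amount) records standing in for sales rows.
records = [
    ("Set", 406.0),
    ("Kurta", 329.0),
    ("Set", 574.0),
    ("Top", 250.0),
    ("Kurta", 171.0),
]


def map_phase(rows):
    """Map: emit one (key, value) pair per input row."""
    return [(category, amount) for category, amount in rows]


def shuffle_phase(pairs):
    """Shuffle: group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: fold each key's values into a single total."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in grouped.items()
    }


totals = reduce_phase(shuffle_phase(map_phase(records)))
print(totals)
```

On a Spark RDD the same shape is usually written as a `map` to key-value pairs followed by `reduceByKey(lambda a, b: a + b)`, with Spark handling the shuffle.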

3. RDD vs DataFrame Comparison

Performing the same operation (e.g., total sales by category) using both RDD (Resilient Distributed Dataset) and DataFrame APIs.
Demonstrating memory optimization techniques (.cache(), .persist()).

4. Spark SQL Analysis

Converting the DataFrame into a temporary SQL view (amazon_sales) and running the following SQL queries:

Total sales amount based on category.
The state with the highest sales.
Percentage of "Cancelled" orders.
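
The three queries can be sketched as plain SQL. So that the example runs without a Spark session, it uses Python's built-in sqlite3 module as a stand-in engine; in the notebook the statements would presumably be passed to `spark.sql(...)` against the amazon_sales view instead. The column names Category, State, Status, and Amount and the sample rows are assumptions, so adjust them to the real schema.

```python
import sqlite3

# In-memory stand-in for the amazon_sales temporary view.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE amazon_sales "
    "(Category TEXT, State TEXT, Status TEXT, Amount REAL)"
)
conn.executemany(
    "INSERT INTO amazon_sales VALUES (?, ?, ?, ?)",
    [
        ("Set", "MAHARASHTRA", "Shipped", 406.0),
        ("Kurta", "KARNATAKA", "Cancelled", 329.0),
        ("Set", "MAHARASHTRA", "Shipped", 574.0),
        ("Top", "TAMIL NADU", "Shipped", 250.0),
    ],
)

# Total sales amount per category, largest first.
TOTAL_BY_CATEGORY = """
    SELECT Category, SUM(Amount) AS total_sales
    FROM amazon_sales
    GROUP BY Category
    ORDER BY total_sales DESC
"""

# The single state with the highest total sales.
TOP_STATE = """
    SELECT State, SUM(Amount) AS total_sales
    FROM amazon_sales
    GROUP BY State
    ORDER BY total_sales DESC
    LIMIT 1
"""

# Share of orders whose status is 'Cancelled', as a percentage.
CANCELLED_PCT = """
    SELECT 100.0 * SUM(CASE WHEN Status = 'Cancelled' THEN 1 ELSE 0 END)
           / COUNT(*) AS cancelled_pct
    FROM amazon_sales
"""

category_totals = conn.execute(TOTAL_BY_CATEGORY).fetchall()
top_state = conn.execute(TOP_STATE).fetchone()
cancelled_pct = conn.execute(CANCELLED_PCT).fetchone()[0]

print(category_totals)
print(top_state)
print(cancelled_pct)
conn.close()
```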

5. Spark Streaming

Creating a folder (stream_input_folder) to simulate real-time data flow.
Monitoring this folder and processing each new file as it arrives to update category-based sales totals.
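
The folder-watching idea can be illustrated without Spark. The sketch below is a plain-Python stand-in, not the notebook's streaming code: it drops CSV files into a temporary stream_input_folder and folds only the unseen files into running totals, which mirrors how a file-based stream picks up new files batch by batch. The helper names `write_batch` and `process_new_files` and the sample rows are hypothetical.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def write_batch(folder, name, rows):
    """Drop one CSV file into the watched folder, like a new batch arriving."""
    with open(folder / name, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Category", "Amount"])
        writer.writerows(rows)


def process_new_files(folder, seen, totals):
    """Fold files not seen before into the running per-category totals."""
    for path in sorted(folder.glob("*.csv")):
        if path.name in seen:
            continue
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                totals[row["Category"]] += float(row["Amount"])
        seen.add(path.name)
    return dict(totals)


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "stream_input_folder"
    folder.mkdir()
    seen, totals = set(), defaultdict(float)

    # First batch arrives and is processed.
    write_batch(folder, "batch_1.csv", [("Set", 406.0), ("Kurta", 329.0)])
    first = process_new_files(folder, seen, totals)

    # Second batch arrives; only the new file is read.
    write_batch(folder, "batch_2.csv", [("Set", 574.0)])
    second = process_new_files(folder, seen, totals)

print(first)
print(second)
```

In Spark, a streaming read over a directory plays the role of `process_new_files`, and the running totals come from a streaming aggregation rather than a Python dict.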

6. Advanced Visualization & Conclusion

Bar Chart: Total Sales by Category.
Pie Chart: Order Status Distribution.
Final project conclusion and insights.

🛠 Requirements

To run this project, you need the following technologies:

Python
Apache Spark (PySpark)
Jupyter Notebook or Google Colab
Helper libraries: pandas, matplotlib, seaborn

🚀 Installation and Execution

Clone this repo or download the files.
Ensure PySpark is installed in your environment (Local or Colab).
Open the amazon_sales_v2.ipynb file.
Update the path of the dataset (Amazon Sale Report.csv) to match your environment. The notebook uses /content/drive/...; change this to your local path if running locally.
Run the cells sequentially to view the analyses.
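
One way to avoid editing the dataset path by hand in each environment is to resolve it once at the top of the notebook. The sketch below is only a suggestion: the environment variable name AMAZON_SALES_CSV, the helper `resolve_dataset_path`, and the relative data directory default are all hypothetical and not part of the notebook.

```python
import os
from pathlib import Path

DATASET_NAME = "Amazon Sale Report.csv"


def resolve_dataset_path(env_var="AMAZON_SALES_CSV", default_dir="data"):
    """Return the path from env_var if set, else default_dir/DATASET_NAME."""
    override = os.environ.get(env_var)
    if override:
        return Path(override)
    return Path(default_dir) / DATASET_NAME


csv_path = resolve_dataset_path()
print(csv_path)
```

On Colab you could set the variable to your Drive path before running the cells; locally, placing the CSV in the default directory would be enough.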

📊 Dataset

The dataset used in this analysis contains over 128,000 rows of Amazon India sales data.

Source: Kaggle, where the dataset is published as "Amazon Sales Report Dataset".
Size: ~128k rows and 24 columns.
Note: Due to GitHub's file size limits, the raw CSV file is not included in this repository. To run the notebook, please download the CSV from the link above and place it in the /data directory.

