This project demonstrates the use of Principal Component Analysis (PCA) to reduce the dimensionality of a complex dataset and visualize its inherent structure. It is a practical example of unsupervised learning and showcases a key data-science technique for exploratory data analysis and feature engineering.
- Dimensionality Reduction: Uses PCA to transform a high-dimensional dataset into a 2D or 3D space.
- Explained Variance Analysis: Generates a scree plot to show the cumulative variance explained by each principal component, helping to determine the optimal number of components to retain.
- Interactive Visualization: Creates scatter plots (2D and 3D) of the transformed data, colored by their original labels to visualize how well PCA separates the classes.
- Customizable: Allows you to specify the number of components and the dataset sample size from the command line.
The project is built with:
- Python: The core programming language for the project.
- Scikit-learn: The primary machine learning library, used here for implementing PCA.
- NumPy: Used for numerical operations and efficient handling of data arrays.
- Matplotlib: A powerful plotting library used for all visualizations.
- Seaborn: Built on Matplotlib, used for creating visually appealing statistical plots.
The project uses the MNIST handwritten digits dataset, in which each 28x28 pixel image is flattened into a 784-dimensional feature vector. Before applying PCA, the data undergoes two crucial preprocessing steps (sketched below):
- Subsampling: A subset of the data is used to speed up processing.
- Standard Scaling: The data is scaled using StandardScaler to have a mean of 0 and a standard deviation of 1. This is a critical step because PCA is sensitive to the scale of features.
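A minimal sketch of these two steps, assuming the data is fetched with scikit-learn's fetch_openml; the repository's exact loading code and sample size may differ:

```python
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler

# Fetch MNIST as a (70000, 784) array of pixel intensities plus string labels.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# Subsampling: keep a smaller subset for faster processing (10,000 is illustrative).
X, y = X[:10000], y[:10000]

# Standard scaling: each feature gets mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)
```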
The core of this project is Principal Component Analysis (PCA). PCA is an unsupervised learning algorithm that finds the directions of maximum variance in the data. These new directions, called principal components, are used to project the data into a lower-dimensional space while preserving as much variance (information) as possible.
PCA is not "trained" in the traditional sense. Instead, it is fitted to the scaled data using the .fit() method, which identifies the principal components and the amount of variance each one explains. The .fit_transform() method combines fitting and transformation in a single step, projecting the data onto the selected number of components.
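In scikit-learn terms, the fitting and projection described above might look like this, continuing from the X_scaled array in the previous sketch:

```python
from sklearn.decomposition import PCA

# Fit PCA to the scaled data and project it onto 2 components in one step.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Fraction of the total variance captured by each retained component.
print(pca.explained_variance_ratio_)
print(f"Total variance retained: {pca.explained_variance_ratio_.sum():.2%}")
```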
To run the project locally:
- Clone the repository:

```bash
git clone https://github.com/sjain2580/Loan-Approval-Prediction-with-Random-Forest
cd <repository_name>
```

- Create and activate a virtual environment (optional but recommended):

```bash
python -m venv venv
# On Windows:
.\venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
```

- Install the required libraries:

```bash
pip install -r requirements.txt
```

- Run the script:

```bash
python pca_visualization.py
```

For a 2D visualization:

```bash
python pca_visualization.py --n_components 2 --sample_size 10000
```

For a 3D visualization:

```bash
python pca_visualization.py --n_components 3 --sample_size 10000
```
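The script's command-line interface could be implemented with argparse along these lines; the flag names come from the commands above, while the defaults and help strings are assumptions:

```python
import argparse

# Hypothetical parsing of the two flags shown in the run commands above.
parser = argparse.ArgumentParser(description="PCA visualization of MNIST")
parser.add_argument("--n_components", type=int, default=2,
                    help="Number of principal components to keep (2 or 3)")
parser.add_argument("--sample_size", type=int, default=10000,
                    help="Number of MNIST samples to subsample")
args = parser.parse_args()
print(args.n_components, args.sample_size)
```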
The project generates two key plots:
- Scree Plot: This plot displays the cumulative explained variance, helping you see how much information is retained as the number of principal components increases.

- Scatter Plot: This is the final visualization of the transformed data. Each point represents a handwritten digit, and the colors correspond to the original class labels. This plot reveals whether the digits form distinct, separable clusters in the lower-dimensional space. Both plots are sketched below.
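A sketch of how the two plots could be produced with Matplotlib, reusing X_scaled, X_pca, and y from the earlier sketches; the axis labels and styling are illustrative, not the repository's exact figures:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Scree plot: cumulative explained variance across the first 50 components.
pca_full = PCA(n_components=50).fit(X_scaled)
plt.plot(np.arange(1, 51), np.cumsum(pca_full.explained_variance_ratio_), marker="o")
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.title("Scree Plot")
plt.show()

# Scatter plot: the 2D projection, colored by the original digit labels.
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y.astype(int), cmap="tab10", s=5)
plt.legend(*scatter.legend_elements(), title="Digit")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.title("MNIST in 2D PCA space")
plt.show()
```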

Feel free to fork this repository, submit issues, or open pull requests to improve the project. Suggestions for model enhancements or additional visualizations are welcome!
Questions, or just want to connect? Reach out to the author: https://github.com/sjain2580