This project demonstrates the use of Principal Component Analysis (PCA) to reduce the dimensionality of a complex dataset and visualize its inherent structure. It is a practical example of unsupervised learning and showcases a key data-science technique for exploratory data analysis and feature engineering.
- Dimensionality Reduction: Uses PCA to transform a high-dimensional dataset into a 2D or 3D space.
- Explained Variance Analysis: Generates a scree plot to show the cumulative variance explained by each principal component, helping to determine the optimal number of components to retain.
- Interactive Visualization: Creates scatter plots (2D and 3D) of the transformed data, colored by their original labels to visualize how well PCA separates the classes.
- Customizable: Allows you to specify the number of components and the dataset sample size from the command line.
The project is built with:
- Python: The core programming language for the project.
- Scikit-learn: The primary machine learning library, used here for implementing PCA.
- NumPy: Used for numerical operations and efficient handling of data arrays.
- Matplotlib: A powerful plotting library used for all visualizations.
- Seaborn: Built on Matplotlib, used for creating visually appealing statistical plots.
The project uses the MNIST handwritten digits dataset, in which each 28x28 pixel image is flattened into a 784-dimensional feature vector. Before applying PCA, the data undergoes two crucial preprocessing steps (sketched below):
- Subsampling: A subset of the data is used to speed up processing.
- Standard Scaling: The data is scaled using StandardScaler to have a mean of 0 and a standard deviation of 1. This is a critical step because PCA is sensitive to the scale of features.
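A minimal sketch of these two steps, assuming the data is fetched with scikit-learn's fetch_openml; the repository's exact loading code and sample size may differ:

```python
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler

# Fetch MNIST as a (70000, 784) array of pixel intensities plus string labels.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# Subsampling: keep a smaller subset for faster processing (10,000 is illustrative).
X, y = X[:10000], y[:10000]

# Standard scaling: each feature gets mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)
```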
The core of this project is Principal Component Analysis (PCA). PCA is an unsupervised learning algorithm that finds the directions of maximum variance in the data. These new directions, called principal components, are used to project the data into a lower-dimensional space while preserving as much variance (information) as possible.
PCA is not "trained" in the traditional sense. Instead, it is fitted to the scaled data using the .fit() method, which identifies the principal components and the amount of variance each one explains. The .fit_transform() method combines fitting and transformation in a single step, projecting the data onto the selected number of components.
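In scikit-learn terms, the fitting and projection described above might look like this, continuing from the X_scaled array in the previous sketch:

```python
from sklearn.decomposition import PCA

# Fit PCA to the scaled data and project it onto 2 components in one step.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Fraction of the total variance captured by each retained component.
print(pca.explained_variance_ratio_)
print(f"Total variance retained: {pca.explained_variance_ratio_.sum():.2%}")
```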
To run the project locally:
- Clone the repository:

```bash
git clone https://github.com/sjain2580/Loan-Approval-Prediction-with-Random-Forest
cd <repository_name>
```

- Create and activate a virtual environment (optional but recommended):

```bash
python -m venv venv
# On Windows:
.\venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
```

- Install the required libraries:

```bash
pip install -r requirements.txt
```

- Run the script:

```bash
python pca_visualization.py
```

For a 2D visualization:

```bash
python pca_visualization.py --n_components 2 --sample_size 10000
```

For a 3D visualization:

```bash
python pca_visualization.py --n_components 3 --sample_size 10000
```
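The script's command-line interface could be implemented with argparse along these lines; the flag names come from the commands above, while the defaults and help strings are assumptions:

```python
import argparse

# Hypothetical parsing of the two flags shown in the run commands above.
parser = argparse.ArgumentParser(description="PCA visualization of MNIST")
parser.add_argument("--n_components", type=int, default=2,
                    help="Number of principal components to keep (2 or 3)")
parser.add_argument("--sample_size", type=int, default=10000,
                    help="Number of MNIST samples to subsample")
args = parser.parse_args()
print(args.n_components, args.sample_size)
```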
The project generates two key plots:
- Scree Plot: This plot displays the cumulative explained variance, helping you see how much information is retained as the number of principal components increases.

- Scatter Plot: This is the final visualization of the transformed data. Each point represents a handwritten digit, and the colors correspond to the original class labels. This plot reveals whether the digits form distinct, separable clusters in the lower-dimensional space. Both plots are sketched below.
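A sketch of how the two plots could be produced with Matplotlib, reusing X_scaled, X_pca, and y from the earlier sketches; the axis labels and styling are illustrative, not the repository's exact figures:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Scree plot: cumulative explained variance across the first 50 components.
pca_full = PCA(n_components=50).fit(X_scaled)
plt.plot(np.arange(1, 51), np.cumsum(pca_full.explained_variance_ratio_), marker="o")
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.title("Scree Plot")
plt.show()

# Scatter plot: the 2D projection, colored by the original digit labels.
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y.astype(int), cmap="tab10", s=5)
plt.legend(*scatter.legend_elements(), title="Digit")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.title("MNIST in 2D PCA space")
plt.show()
```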

Feel free to fork this repository, submit issues, or open pull requests to improve the project. Suggestions for model enhancements or additional visualizations are welcome!
Questions, or just want to connect? Reach out to the author: https://github.com/sjain2580