# PCA with NumPy

PCA (Principal Component Analysis) implemented from scratch using NumPy.

For Turkish documentation and code, see the `TURKISH/` folder.
PCA is a dimensionality reduction method: it lowers the dimension of a dataset while preserving as much of its information as possible. It does this by examining the variance of the data and projecting it onto the directions that capture the most variance. The PCA-transformed data has a lower dimension than the original, yet still retains most of the information the original contained. Because the reduced data is smaller, subsequent computations on it are cheaper and require less processing power.
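As a quick illustration of this idea (a hypothetical toy example, not the repository's code or its `data.csv`), consider two correlated features whose variance lies almost entirely along one direction; a single principal component then keeps nearly all of the information:

```python
import numpy as np

# Toy data: the second feature is mostly a scaled copy of the first,
# so almost all of the variance lies along a single direction.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=200)])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)          # 2x2 covariance matrix
eigenvalues, _ = np.linalg.eigh(cov)

# Fraction of total variance captured by the largest component
ratio = eigenvalues.max() / eigenvalues.sum()
print(f"top component explains {ratio:.1%} of the variance")
```

Here one of the two components already explains nearly all of the variance, so the data can be reduced from 2 dimensions to 1 with little information loss.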
This project is built for educational purposes: instead of calling scikit-learn's PCA, you can see step by step how the algorithm works internally. The code is written with clarity in mind, using descriptive function names and comments, and is aimed at students and practitioners who want to understand the mathematics behind PCA.
## Project Structure

```
PCA-with-Numpy/
├── README.md          # English documentation
├── pcaWithNumpy.py    # Main implementation (English)
├── data.csv           # Sample dataset
├── LICENSE            # MIT License
└── TURKISH/
    ├── BENIOKU.md     # Turkish documentation
    ├── pcaWithNumpy.py# Implementation with Turkish comments
    └── data.csv       # Sample dataset
```
## Requirements

- Python 3.x
- numpy
- pandas
- matplotlib
Install dependencies:

```
pip install numpy pandas matplotlib
```

Run the main script:

```
python pcaWithNumpy.py
```
The script will:

- Load the sample data from `data.csv`
- Display the cumulative variance plot
- Apply PCA with a 95% variance threshold
- Generate scatter plots of component pairs
- Export the reduced data to `dataPcaApplied.csv`
## Usage

```python
from pcaWithNumpy import PCA, findBestVariance, plotPcaComponents, exportData
import pandas as pd

# Load your data
data = pd.read_csv("your_data.csv", header=None).values

# Find the best variance threshold (shows cumulative variance plot)
findBestVariance(data)

# Apply PCA with desired variance (default 95%)
reducedData = PCA(data, variance=95)

# Plot components
plotPcaComponents(reducedData, compOne=1, compTwo=2)

# Export results
exportData(reducedData, file_name="output.csv")
```

## Functions

| Function | Description |
|---|---|
| `PCA(data, variance)` | Main function that applies PCA to data |
| `centeringData(data, axis)` | Centers data by subtracting the mean |
| `calculateCovarianceMatrix(data)` | Calculates the covariance matrix |
| `calculateEigens(covarianceMatrix)` | Finds eigenvalues and eigenvectors |
| `sortEigens(eigenValues, eigenVectors, order)` | Sorts by eigenvalue magnitude |
| `findVarianceExplained(eigenValues)` | Calculates the variance percentage for each component |
| `findCumulativeVarExp(varianceExplained)` | Computes the cumulative sum of variances |
| `findN(cumulativeVarianceExplained, variance)` | Finds the number of components for a target variance |
| `findBestVariance(data)` | Plots cumulative variance to help choose a threshold |
| `plotPcaComponents(dataPcaApplied, compOne, compTwo)` | Scatter plot of two components |
| `plotCumulativeVarExp()` | Plots cumulative variance vs. number of components |
| `exportData(dataPcaApplied, file_name)` | Exports reduced data to CSV |
## How It Works

1. **Center the data** by subtracting the mean of each feature from the data.
2. **Calculate the covariance matrix** of the centered data X. Further information: Covariance matrix (Wikipedia).
3. **Calculate eigenvectors and eigenvalues** of the covariance matrix. Further information: Eigenvalues and eigenvectors (Wikipedia).
4. **Choose the number of components.** Calculate how much variance each eigenvalue explains (as a percentage), then take the cumulative sum of those percentages. This shows the total variance covered by the components that explain the most variance. The number of components needed to reach the target total variance (typically 95%) becomes the reduced dimension n.
5. **Project the data.** Take the dot product of the data matrix X with the eigenvector matrix, using only its first n columns, where n is the dimension we want to reduce the original data to. The resulting matrix therefore has n columns: a lower-dimensional dataset that still carries about 95 percent of the original information. Further information: Dot product (Wikipedia).
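The steps above can be sketched in plain NumPy. This is a minimal illustration, not the repository's `pcaWithNumpy.py`; the function name `pca_sketch` is made up here:

```python
import numpy as np

def pca_sketch(data, target_variance=95.0):
    """Reduce `data` (samples in rows, features in columns) via PCA."""
    # 1. Center: subtract the mean of each feature
    centered = data - data.mean(axis=0)
    # 2. Covariance matrix of the centered data
    cov = np.cov(centered, rowvar=False)
    # 3. Eigendecomposition (eigh: the covariance matrix is symmetric)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort components by descending eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # 5. Percent variance explained by each component, and its cumulative sum
    var_exp = 100.0 * eigenvalues / eigenvalues.sum()
    cum_var_exp = np.cumsum(var_exp)
    # 6. Smallest n whose cumulative variance reaches the target
    n = int(np.searchsorted(cum_var_exp, target_variance)) + 1
    # 7. Project onto the first n eigenvectors (dot product)
    return centered @ eigenvectors[:, :n]
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because the covariance matrix is symmetric, which guarantees real eigenvalues and is numerically more stable.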
The included `data.csv` contains 10 samples with 10 features each. Values are normalized to the range [0, 1]. You can replace it with your own dataset for testing.
After running PCA, the script generates:

- A cumulative variance plot showing how many components are needed
- Scatter plots comparing principal components (1 vs 2, 2 vs 3, 1 vs 3)
- `dataPcaApplied.csv` containing the dimension-reduced data
MIT License - Copyright 2021 Selcuk Senturk
Selcuk Senturk - selcuksntrk@gmail.com