
Commit aa4f45d

🎉 Added
> Dimensionality Reduction: \ PCA \ t-SNE \ UMAP
1 parent 88940a1 commit aa4f45d

8 files changed: +290 -0 lines changed
Lines changed: 55 additions & 0 deletions
"""

******************************
** Dimensionality Reduction **
******************************

Dimensionality Reduction is the process of decreasing the number
of features in a DataSet by combining them into a new, smaller
set of features called "Components".

-*-*-*-*-

To apply this technique, your DataSet must meet the following
requirements (a minimal sketch follows this docstring):

/ all Categorical Features must be Encoded, since Dimensionality
Reduction works only with Numerical Features;

/ the features must be standardized, unless you know you have a
good reason not to, such as the DataSet already being standardized
by default;

/ outliers must be treated (removed or constrained), since
they can have an undue influence on the results.

-*-*-*-*-

Situations when you can use Dimensionality Reduction:

/ when you want to check whether clusters have similar
properties and attributes;

/ when the DataSet contains a lot of features (DataSet
compression down to two or three features);

/ when the features are multicollinear (there is a significant
number of Linear Correlations between them);

/ when your goal is to apply denoising.

-*-*-*-*-

Variations of Dimensionality Reduction:

/ Principal Component Analysis (PCA): maximizes the variance;

/ t-Distributed Stochastic Neighbor Embedding (t-SNE): creates a
reduced feature space where similar samples are modeled by nearby
points and dissimilar samples are modeled by distant points with
high probability;

/ Uniform Manifold Approximation and Projection (UMAP): applies
Nearest Neighbors to cluster the data and then reduces the
dimensions.
"""
Lines changed: 124 additions & 0 deletions
"""

**********************************
** Principal Component Analysis **
**********************************

Principal Component Analysis (PCA) is used to create new Features
by combining other Features. In general, we get these new Features
by tracing diagonal lines (axes) over the scatter plot between the
two features for which we would like to calculate the PCA.

After that, the model will calculate the correlation and the
variance between these two features and return the Components
(new Features).

{ image 1.0 }

These new features are called the principal components of the
data. The weights themselves are called loadings. There will be
as many principal components as there are features in the
original dataset: if we had used ten features instead of two,
we would have ended up with ten components.
"""

# ---- Importing Libraries and Defining Functions ----
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_selection import mutual_info_regression
from sklearn.decomposition import PCA

def plot_variance(pca, width=8, dpi=100):

    # Create figure #
    fig, axs = plt.subplots(1, 2)
    n = pca.n_components_
    grid = np.arange(1, n + 1)

    # Explained variance #
    evr = pca.explained_variance_ratio_
    axs[0].bar(grid, evr)
    axs[0].set(xlabel="Component", title="% Explained Variance", ylim=(0.0, 1.0))

    # Cumulative Variance #
    cv = np.cumsum(evr)
    axs[1].plot(np.r_[0, grid], np.r_[0, cv], "o-")
    axs[1].set(xlabel="Component", title="% Cumulative Variance", ylim=(0.0, 1.0))

    # Set up figure #
    fig.set(figwidth=width, dpi=dpi)
    return axs

def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

"""
61+
We've selected four features that cover a range of properties.
62+
Each of these features also has a high MI score with the target,
63+
price. We'll standardize the data since these features aren't
64+
naturally on the same scale.
65+
66+
We say that the features are not in the same scale when their
67+
ratio are different in a highly way, such as: person's age and
68+
salary, while a person's age varies from 0 - 100, the salary can
69+
vary between 1,000 - 1,000,000. There's a huge gap between them,
70+
so we gotta scale the features in order to tthe model doesn't
71+
think that salary is more important than age just because the values
72+
are higher.
73+
"""

# ---- Reading DataSet and Treating the Features ----
df = pd.read_csv("../input/fe-course-data/autos.csv")
features = ["highway_mpg", "engine_size", "horsepower", "curb_weight"]

X = df.copy()
y = X.pop('price')
X = X.loc[:, features]

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)


# ---- Calculating PCA ----
pca = PCA(n_components=2)

X_pca = pca.fit_transform(X_scaled)
component_names = [f"PC{i+1}" for i in range(X_pca.shape[1])]
X_pca = pd.DataFrame(X_pca, columns=component_names)

X_pca.head()
print(pca.explained_variance_ratio_)  # variance ratio

# ---- Getting the Loadings ----
#
# \ loadings are the weights of each original feature in each
# created component
loadings = pd.DataFrame(
    pca.components_.T,        # transpose the matrix of loadings
    columns=component_names,  # so the columns are the principal components
    index=X.columns,          # and the rows are the original features
)
loadings

# ---- Calculating Mutual Info Scores and Plotting the Results ----
mi_scores = make_mi_scores(X_pca, y, discrete_features=False)
mi_scores

plot_variance(pca);

"""
114+
115+
{ image 1.1 }
116+
117+
This table of loadings is telling us that in the Size component,
118+
Height and Diameter vary in the same direction (same sign), but
119+
in the Shape component they vary in opposite directions (opposite
120+
sign).
121+
122+
In each component, the loadings are all of the same magnitude
123+
and so the features contribute equally in both.
124+
"""
Lines changed: 50 additions & 0 deletions
"""

*************************************************
** t-Distributed Stochastic Neighbor Embedding **
*************************************************

Suppose we had a dataset composed of 3 distinct classes in a
2D plot and we wanted to convert it to a 1D plot while keeping
the differences and distances between the clusters.

{ image 2.0 }
{ image 2.1 }
"""

# ---- Import Libraries ----
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# ---- Applying t-SNE ----
#
# \ n_components: number of components / dimensions
#
# \ verbose: logger (1 >> true / 0 >> false)
#
# \ perplexity: number of nearest neighbors used by Manifold
# Learning Algorithms. This value should be between 5 and 50,
# and the larger the DataSet, the larger it should be
#
# \ n_iter: number of iterations to run the algorithm's learning
# process
#
# \ data_subset is assumed to be an already scaled feature matrix
# defined elsewhere

tsne = TSNE(n_components=2, verbose=0, perplexity=40, n_iter=300)
X_tsne = tsne.fit_transform(data_subset)
print('Shape of X_tsne: ', X_tsne.shape)

# ---- Plotting the Result ----
#
# \ y is assumed to hold the class labels of data_subset
palette = sns.color_palette("bright", 10)

sns.scatterplot(
    x=X_tsne[:, 0]
    , y=X_tsne[:, 1]
    , hue=y
    , legend='full'
    , palette=palette
)
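
# ---- Minimal Sketch: a complete, runnable t-SNE example ----
#
# \ a hedged example with a concrete dataset, since `data_subset` and
#   `y` above are assumed to be defined elsewhere; the scikit-learn
#   digits dataset is an assumption made only for illustration
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import seaborn as sns
import matplotlib.pyplot as plt

digits = load_digits()
data_subset, y = digits.data, digits.target   # features and class labels

tsne = TSNE(n_components=2, verbose=0, perplexity=40, n_iter=300)
X_tsne = tsne.fit_transform(data_subset)

sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], hue=y,
                legend='full', palette=sns.color_palette("bright", 10))
plt.show()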
Lines changed: 61 additions & 0 deletions
"""

***************************************************
** Uniform Manifold Approximation and Projection **
***************************************************

UMAP is another Dimensionality Reduction technique that, unlike
PCA and t-SNE, applies Nearest Neighbors to cluster the data and
then reduces the dimensions.
"""

# ---- Importing Libraries ----
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from umap import UMAP  # pip install umap-learn

# ---- Applying UMAP (Unsupervised Learning) ----
reducer = UMAP(
    n_neighbors=100,           # default 15, the size of the local neighborhood (in terms of number of neighboring sample points) used for manifold approximation.
    n_components=3,            # default 2, the dimension of the space to embed into.
    metric='euclidean',        # default 'euclidean', the metric used to compute distances in high dimensional space.
    n_epochs=1000,             # default None, the number of training epochs used to optimize the low dimensional embedding. Larger values result in more accurate embeddings.
    learning_rate=1.0,         # default 1.0, the initial learning rate for the embedding optimization.
    init='spectral',           # default 'spectral', how to initialize the low dimensional embedding. Options are: {'spectral', 'random', a numpy array of initial embedding positions}.
    min_dist=0.1,              # default 0.1, the effective minimum distance between embedded points.
    spread=1.0,                # default 1.0, the effective scale of embedded points. In combination with ``min_dist`` this determines how clustered/clumped the embedded points are.
    low_memory=False,          # default False, for some datasets the nearest neighbor computation can consume a lot of memory. If UMAP is failing due to memory constraints, consider setting this to True.
    set_op_mix_ratio=1.0,      # default 1.0, should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
    local_connectivity=1,      # default 1, the local connectivity required -- i.e. the number of nearest neighbors that should be assumed to be connected at a local level.
    repulsion_strength=1.0,    # default 1.0, weighting applied to negative samples in low dimensional embedding optimization.
    negative_sample_rate=5,    # default 5, increasing this value results in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.
    transform_queue_size=4.0,  # default 4.0, larger values result in slower performance but more accurate nearest neighbor evaluation.
    a=None,                    # default None, more specific parameter controlling the embedding. If None, set automatically as determined by ``min_dist`` and ``spread``.
    b=None,                    # default None, more specific parameter controlling the embedding. If None, set automatically as determined by ``min_dist`` and ``spread``.
    random_state=42,           # default None, if int, random_state is the seed used by the random number generator.
    metric_kwds=None,          # default None, arguments to pass on to the metric, such as the ``p`` value for Minkowski distance.
    angular_rp_forest=False,   # default False, whether to use an angular random projection forest to initialise the approximate nearest neighbor search.
    target_n_neighbors=-1,     # default -1, the number of nearest neighbors used to construct the target simplicial set. If set to -1, use the ``n_neighbors`` value.
    #target_metric='categorical',  # default 'categorical', the metric used to measure distance for a target array when doing supervised dimension reduction. By default this is 'categorical', which measures distance in terms of whether categories match or are different.
    #target_metric_kwds=None,      # dict, default None, keyword arguments to pass to the target metric when performing supervised dimension reduction. If None, no arguments are passed on.
    #target_weight=0.5,            # default 0.5, weighting factor between data topology and target topology.
    transform_seed=42,         # default 42, random seed used for the stochastic aspects of the transform operation.
    verbose=False,             # default False, controls verbosity of logging.
    unique=False,              # default False, controls whether the rows of your data should be uniqued before being embedded.
)

X_trans = reducer.fit_transform(X)   # X is assumed to be a scaled feature matrix defined elsewhere
print('Shape of X_trans: ', X_trans.shape)

# ---- Applying UMAP (Supervised Learning) ----
reducer2 = UMAP(
    n_neighbors=100, n_components=3, n_epochs=1000,
    min_dist=0.5, local_connectivity=2, random_state=42,
)

# X_train, y_train and X_test are assumed to be defined elsewhere
X_train_res = reducer2.fit_transform(X_train, y_train)
X_test_res = reducer2.transform(X_test)

print('Shape of X_train_res: ', X_train_res.shape)
print('Shape of X_test_res: ', X_test_res.shape)
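
# ---- Minimal Sketch: a compact, runnable UMAP example ----
#
# \ a hedged example, since `X`, `X_train`, `y_train` and `X_test` above
#   are assumed to be defined elsewhere; the scikit-learn digits dataset
#   and a 2-component embedding are assumptions made only for illustration
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from umap import UMAP
import matplotlib.pyplot as plt

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42
)

# supervised UMAP: the training labels guide the embedding
sup_reducer = UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_train_2d = sup_reducer.fit_transform(X_train, y_train)
X_test_2d = sup_reducer.transform(X_test)

plt.scatter(X_train_2d[:, 0], X_train_2d[:, 1], c=y_train, s=5, cmap="Spectral")
plt.title("Supervised UMAP embedding of the digits dataset")
plt.show()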
4 image attachments (binary files, not shown)
