Data Mining course 2021-2022.
Master's Degree in Applied Mathematics, Sapienza University of Rome.
Exam date: July 19th 2022.
Joint work with Emanuele Ferrelli (email: e.ferrelli@hotmail.com).
Principal Component Analysis (PCA) is a data-simplification technique that applies a linear transformation to the dataset features. Its aim is dimensionality reduction with as little loss of information as possible.
In this project we analysed some data 1 on bike rentals collected during the years 2011-2012 by an American company (Capital Bikeshare), downloaded from the following link.
The dataset reports the total daily number of bikes rented by the service over a total of 731 days. Along with the total daily shares, several other variables are recorded.
Here is a brief description of the features:
- instant: record index
- dteday : date
- season : season (1: spring, 2: summer, 3: fall, 4: winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, 0 otherwise.
- weathersit :
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius; values divided by 41 (max)
- atemp : Normalized feeling temperature in Celsius; values divided by 50 (max)
- hum : Normalized humidity; values divided by 100 (max)
- windspeed : Normalized wind speed; values divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
We discarded column 2 (dteday) as it is redundant, and standardized the dataset column-wise.
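The preprocessing step can be sketched as follows. This is a minimal sketch: synthetic placeholder data stands in for the 15 retained columns, since the report does not include code or the downloaded file.

```python
import numpy as np

# Synthetic placeholder: 731 days x 15 retained columns (the real data
# would be loaded from the downloaded CSV instead).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(731, 15))

# Column-wise standardization: zero mean and unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```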
First of all, we studied the correlation among features; the next picture shows the correlation matrix.
A high positive correlation is clearly visible between columns 14 and 15 and between columns 9 and 10. By contrast, columns 8 and 15 show a negative correlation. This confirms that bad weather discourages bike sharing.
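For standardized data the correlation matrix reduces to the product Z^T Z / n. A sketch on synthetic placeholder data, with np.corrcoef as a cross-check:

```python
import numpy as np

# Correlated placeholder data in place of the real standardized dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(731, 15)) @ rng.normal(size=(15, 15))

# Standardize, then the correlation matrix is Z^T Z / n.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = Z.T @ Z / Z.shape[0]
```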
The correlation coefficient between two features x and y is defined as ρ(x, y) = cov(x, y) / (σ_x σ_y).
Given a matrix of data X, finding the first principal component amounts to finding the unit vector that maximizes the variance of the projected data, i.e. the eigenvector of the covariance matrix associated with its largest eigenvalue.
If we standardize the columns of X (i.e. column means are zero and variances are one), the covariance matrix of X coincides with its correlation matrix.
Note that the correlation matrix is symmetric and positive semidefinite, so its eigenvalues are real and nonnegative and its eigenvectors can be chosen orthonormal.
The next picture reports the eigenvalues of our correlation matrix. As we can see, 90% of the total variance is represented by the first 6 eigenvalues, while the first 5 are enough to explain 80% of the total variance.
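The explained-variance computation can be sketched as follows (again on synthetic placeholder data; with the real correlation matrix the cumulative fractions would match the percentages above):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(731, 15)) @ rng.normal(size=(15, 15))
R = np.corrcoef(X, rowvar=False)

# R is symmetric, so eigh returns real eigenvalues (in ascending order).
eigvals, eigvecs = np.linalg.eigh(R)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

# Cumulative fraction of total variance explained by the leading components.
explained = np.cumsum(eigvals) / eigvals.sum()
```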
The first two eigenvectors of the correlation matrix are those associated with its two largest eigenvalues.
These two vectors provide a criterion to establish how much each feature contributes to the first two principal components. The next picture explains this concept better.
Here, each column represents the weight (loading) of a feature in the corresponding principal component.
Another way to study feature importance for the first two principal components is the following. We know from the theory that, for standardized data, the correlation between the j-th feature and the i-th principal component equals √λ_i u_{ji}, where λ_i is the i-th eigenvalue and u_i the corresponding eigenvector.
This confirms the relation between the loadings and the correlations observed above.
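The standard loadings identity — for standardized data, the correlation between feature j and principal component i equals √λ_i · u_{ji} — can be verified numerically on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 731, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated placeholder features
Z = (X - X.mean(axis=0)) / X.std(axis=0)

R = Z.T @ Z / n
lam, U = np.linalg.eigh(R)
lam, U = lam[::-1], U[:, ::-1]  # descending eigenvalues

scores = Z @ U  # principal component scores

# Empirical correlations between each feature and each component...
C = np.corrcoef(np.hstack([Z, scores]), rowvar=False)[:p, p:]
# ...match sqrt(lambda_i) * u_ji as predicted by the theory.
pred = U * np.sqrt(lam)
```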
As we have seen before, the first two eigenvalues alone only explain 40% of the total variance of the data. Therefore it can be significant to study feature relevance over more than two principal components.
The next picture on the left shows the cumulated relevance of each feature w.r.t. the first k principal components.
More in detail, the (simple) relevance is defined as:
while the weighted relevance is given by:
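A sketch of the two relevance measures. The report's exact formulas are not reproduced here, so the definitions below are an assumption: simple relevance of a feature as its sum of squared loadings over the first k components, and weighted relevance as the same sum with each component weighted by its eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(731, 15)) @ rng.normal(size=(15, 15))
R = np.corrcoef(X, rowvar=False)
lam, U = np.linalg.eigh(R)
lam, U = lam[::-1], U[:, ::-1]  # descending eigenvalues

k = 2
# Assumed simple relevance: sum of squared loadings over the first k PCs.
simple = (U[:, :k] ** 2).sum(axis=1)
# Assumed weighted relevance: eigenvalue-weighted squared loadings, normalized.
weighted = (lam[:k] * U[:, :k] ** 2).sum(axis=1) / lam[:k].sum()
```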
It is fundamental to analyse both simple and weighted relevance. Consider for example column
The next picture puts in comparison
Starting with the last image on the bottom right,
we see the high correlation between the number of
total daily shares and
We report here an example to show that PCA can be useful for anomaly detection.
The next picture is a scatter plot for
We compared the 15 feature values of point 26 with the 15 average values over days 1 to 50, in order to obtain seasonal averages. The next picture shows that
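The anomaly-detection idea can be sketched as follows. This uses synthetic data with one injected outlier at index 25 (i.e. "point 26" in 1-based numbering), and the distance criterion is an illustrative choice, not necessarily the one used in the report:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 731, 15
X = rng.normal(size=(n, p))
X[25] += 8.0  # inject an anomaly ("point 26" in 1-based numbering)

Z = (X - X.mean(axis=0)) / X.std(axis=0)
lam, U = np.linalg.eigh(Z.T @ Z / n)
lam, U = lam[::-1], U[:, ::-1]  # descending eigenvalues

# Project onto the first two principal components and normalize each
# score by sqrt(eigenvalue); large distances flag candidate anomalies.
scores = Z @ U[:, :2]
d = np.sqrt(((scores / np.sqrt(lam[:2])) ** 2).sum(axis=1))
outlier = int(np.argmax(d))
```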
PCA is a fundamental tool for datasets with many features, as looking at the principal components provides a good criterion for choosing which features to discard. In our case one could also perform regression to predict future total daily bike rentals. The first principal component
Footnotes
- Hadi Fanaee-T and Joao Gama. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, pages 1–15, 2013. ↩