Commit abde30f: Initial Commit (1 parent: 7f2aa89)

51 files changed, +2790 −0 lines changed
Lines changed: 42 additions & 0 deletions

# read the dataset
# check how many missing values we have per column
# check the percentage of missing values per column
# check the percentage of missing values in the whole dataset

"""
Is a value missing because it wasn't recorded, or
because it doesn't exist?

Answer: If a value is missing because it doesn't exist
(like the height of the oldest child of someone who
doesn't have any children), then it doesn't make sense
to try to guess what it might be.

Those values you probably do want to keep as NaN. On the
other hand, if a value is missing because it wasn't
recorded, then you can try to guess what it might have
been based on the other values in that column and row.
"""

# Counting the Missing Values and Their Percentage #

import pandas as pd
import numpy as np

np.random.seed(0)

df = pd.read_csv('filepath')

# Counting how many missing values each column has
df.isnull().sum()

# Counting the percentage of missing values
# for each column
df.isnull().sum() / len(df)           # as a fraction
df.isnull().sum() * 100 / len(df)     # as a percentage

# Counting the percentage of missing values
# for the whole dataset
total_missing = df.isnull().sum().sum()
total_cells = np.prod(df.shape)  # np.product is deprecated; np.prod does the same

percent_missing = (total_missing / total_cells) * 100
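
# To make the keep-as-NaN vs. guess distinction from the
# docstring concrete, a minimal sketch of both strategies;
# the column names ('height_oldest_child', 'salary') and
# the 50% threshold are illustrative assumptions, not part
# of the dataset above.

# Drop columns where more than half of the values are missing
percent_per_column = df.isnull().sum() * 100 / len(df)
df = df.drop(columns=percent_per_column[percent_per_column > 50].index)

# Missing because it doesn't exist (the height of the oldest
# child of someone with no children): keep it as NaN,
# so no action for a column like 'height_oldest_child'

# Missing because it wasn't recorded: guess (impute) it,
# e.g. with the column's median
df['salary'] = df['salary'].fillna(df['salary'].median())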
Lines changed: 106 additions & 0 deletions

"""
0 - Scaling

It's used to change the RANGE of the data: after scaling,
each feature's range goes from 0 to 1.

----

About the models: you'll need to scale the data when
you're using methods based on measures of how far apart
data points are, like these models:

/ Gradient Descent Optimization
/ Support Vector Machines (SVM)
/ K-Nearest Neighbors (KNN)
"""

from sklearn.preprocessing import MinMaxScaler

scaler_1 = MinMaxScaler()
train_scaled = scaler_1.fit_transform(df_train)  # fit on the training set only
val_scaled = scaler_1.transform(df_val)          # reuse the training statistics

"""
1 - Standardization

It's like Scaling, but the resulting range doesn't go
from 0 to 1; it varies (features are recentered to mean 0
and rescaled to unit variance).

----

About the models: the same distance-based methods listed
above also need standardized data:

/ Gradient Descent Optimization
/ Support Vector Machines (SVM)
/ K-Nearest Neighbors (KNN)
"""

from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler

# Robust Scaler >> less sensitive to outliers
scaler_2 = RobustScaler()
train_robust = scaler_2.fit_transform(df_train)
val_robust = scaler_2.transform(df_val)

# Standard Scaler >> used when the mean is near 0
scaler_3 = StandardScaler()
train_standard = scaler_3.fit_transform(df_train)
val_standard = scaler_3.transform(df_val)

"""
2 - Normalization

It's used to change the DISTRIBUTION of the data.

In a nutshell, normalization reshapes the distribution of
the data in order to get a Normal Distribution (Gaussian
Distribution, or Bell Curve).

----

About the models: you'll need to normalize the data when
using:

/ Linear Discriminant Analysis (LDA)
/ Gaussian Naive Bayes

Tip: any method with "Gaussian" in the name probably
needs you to normalize the data.

Caveat: scikit-learn's Normalizer rescales each ROW to
unit norm; it does not reshape a feature toward a bell
curve (a power transform does that; see the sketch below).
"""

from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
train_normalized = normalizer.fit_transform(df_train)
val_normalized = normalizer.transform(df_val)
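
# As the caveat above says, Normalizer does unit-norm rows,
# not Gaussian features. A minimal sketch of a transform
# that actually pushes a skewed feature toward a bell
# curve, using scikit-learn's PowerTransformer
# (df_train/df_val are the same placeholder frames as above):

from sklearn.preprocessing import PowerTransformer

# 'yeo-johnson' handles zero and negative values;
# method='box-cox' requires strictly positive data
power = PowerTransformer(method='yeo-johnson')
train_gaussian = power.fit_transform(df_train)
val_gaussian = power.transform(df_val)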

#########

"""
***********
** Notes **
***********

Explanation: Scaling/Standardization

It's like converting reais (R$) to dollars (US$), where
1 dollar is worth about 5 reais nowadays. If we don't
scale, the model will treat 1 dollar as equal to 1 real,
and that's not true.

Another example is height and weight, where we have to
put the measurements on a common scale: 1 inch equals
2.54 cm, and 1 pound equals 0.45 kg.

-*-*-*-*-

Another Explanation, Just to Get the Feeling

Scaling, Standardization, and Normalization prevent the
model from considering some features more important than
others just because of their scale, like considering
salary (from 40,000 to 210,000) more important than age
(from 18 to 100).
"""
Lines changed: 73 additions & 0 deletions

import pandas as pd
import numpy as np
import seaborn as sns  # used for the distribution check at the end

np.random.seed(0)

"""
** Parsing Dates **

Transforming the 'object' dtype into a 'datetime' one.
"""

# Checking out the 'date' column
# of an imaginary dataset

df = pd.read_csv('filepath')

df['date'].head()
# > 01/05/99
# > 02/05/99
# > 03/05/99
# > 04/05/99
# > 05/05/99

df['date'].dtype
# > object

####

# Formatting to:
#
# day/month/two-digit-year
# %d/%m/%y

df['formatted_date'] = pd.to_datetime(df['date'],
                                      format='%d/%m/%y')

df['formatted_date'].dtype
# > datetime64[ns]


# When the column has more than one datetime format,
# use 'infer_datetime_format=True' so that pandas
# guesses the correct format for each row
#
# - Problem 1: pandas can't recognize the correct format
#   for all cases;
# - Problem 2: it takes more time than specifying the
#   format yourself
#
# (Note: newer pandas versions deprecate this flag and
# infer the format by default.)
df['formatted_date'] = pd.to_datetime(df['date'],
                                      infer_datetime_format=True)
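
# When pandas can't recognize a format for every row
# (Problem 1 above), errors='coerce' is one way to locate
# the offending rows: unparseable entries become NaT
# instead of raising. A minimal sketch on the same column:

parsed = pd.to_datetime(df['date'], format='%d/%m/%y',
                        errors='coerce')

# Show the raw strings that failed to parse
df.loc[parsed.isnull(), 'date']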

########

# Extracting information from the dates

df['formatted_date'].dt.day
# > 1
# > 2
# > 3
# > 4
# > 5

#######

# Checking out the day distribution in order to check
# whether pandas mistakenly parsed the months as days
#
# See "0 - Good Days Distribution.png" for an example
# of a correct distribution!!
sns.distplot(df['formatted_date'].dt.day,
             kde=False,
             bins=31)
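
# distplot was deprecated and later removed from seaborn;
# the same check with the current API (histplot) would be,
# as a sketch:
sns.histplot(df['formatted_date'].dt.day, bins=31)
# If months had been parsed as days, almost all values
# would fall between 1 and 12 and days 13-31 would be empty.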
Lines changed: 40 additions & 0 deletions

"""
************************
** Character Encoding **
************************

When you read a CSV file that's not in the 'UTF-8'
charset, you'll get an error like this one:

/ UnicodeDecodeError: 'utf-8' codec can't decode byte
  0x99 in position 7955: invalid start byte

To solve this, you've got to convert the file to UTF-8
following the steps below:

1 - find out the file's charset;
2 - read the file with the correct charset;
3 - save the file with pandas (UTF-8 is the default
    charset for pandas)
"""

import pandas as pd
import chardet  # library to guess the file's charset

# Guessing the File's Charset #

with open('filepath', 'rb') as file:

    # read the first 10,000 bytes of the file
    # to guess the charset
    guessed_charset = chardet.detect(file.read(10000))

print(guessed_charset)
# > {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
# so there's a 73% chance that the charset is Windows-1252

# Reading the File with the Correct Charset #
df = pd.read_csv('filepath', encoding='Windows-1252')

# Saving the File in UTF-8 #
df.to_csv('new_file_name')  # pandas writes UTF-8 by default
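
# A 73% confidence guess can be wrong. One option, as a
# sketch, is to feed chardet the whole file before trusting
# the result (slower, but usually more confident):

with open('filepath', 'rb') as file:
    guessed_charset = chardet.detect(file.read())

print(guessed_charset)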
