Outliers detection and processing through statistical methods

General dataframe information with 15341 IDEAM records for 25 stations

Dataframe records head sample

Fecha 15015020 15065040 23215060 25025002 25025090 25025250 25025300 25025330 28015030 28015070 28025020 28025040 28025070 28025080 28025090 28025502 28035010 28035020 28035040 28035070 28045020 28045040 29065010 29065020 29065030
1980-01-01 00:00:00 32.2 nan nan nan 33.6 nan nan nan nan 32.4 nan nan 34.8 nan nan nan nan 34.6 nan nan nan nan nan nan 33.2
1980-01-02 00:00:00 32.6 nan nan nan 33.4 nan nan nan nan 29.8 nan 29.8 34 nan nan nan nan 30.8 nan nan nan nan nan nan 32.8
1980-01-03 00:00:00 33 nan nan nan 33.8 nan nan nan 34.3 32.4 nan 30 nan 35 nan nan nan 34.4 nan nan nan nan nan nan 33.4

Dataframe records tail sample

Fecha 15015020 15065040 23215060 25025002 25025090 25025250 25025300 25025330 28015030 28015070 28025020 28025040 28025070 28025080 28025090 28025502 28035010 28035020 28035040 28035070 28045020 28045040 29065010 29065020 29065030
2021-12-29 00:00:00 nan nan nan 35.4 34.8 34.2 36.6 33.2 nan 33.2 32.6 nan 34.8 nan 36 nan 35.6 34.8 37.2 nan nan nan nan 35.2 nan
2021-12-30 00:00:00 nan nan nan 34.4 35 34 37.2 nan nan 33.4 32.4 nan 35.2 nan 37 nan 35 35 38 nan nan nan nan 34.4 nan
2021-12-31 00:00:00 nan nan nan 34.8 34.6 35.6 38 nan nan 33.6 33.6 nan 36.4 nan 36.2 nan 36.6 35 38.4 nan nan nan nan 37.6 nan

Data types and null counts per station in the initial file

Station 15015020 15065040 23215060 25025002 25025090 25025250 25025300 25025330 28015030 28015070 28025020 28025040 28025070 28025080 28025090 28025502 28035010 28035020 28035040 28035070 28045020 28045040 29065010 29065020 29065030
Dtype float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64
Nulls 6482 11876 14381 7640 6855 3583 5275 7735 14214 2696 2375 12728 1943 8265 2388 3885 6899 3205 5917 15299 14291 14557 14448 5935 4804

General statistics table - Initial file

Station count mean std min 25% 50% 75% max
15015020 8859 33.0729 1.48932 25.4 32.2 33.2 34.2 38.4
15065040 3465 34.0803 1.83791 25.1 33 34.2 35.4 39.4
23215060 960 32.7978 1.77501 27 31.6 32.8 34 37.4
25025002 7701 34.3217 2.06657 23 33 34.2 35.6 44
25025090 8486 33.5555 1.8415 26.4 32.4 33.6 34.8 40
25025250 11758 33.787 2.07519 26.6 32.4 34 35 42.6
25025300 10066 34.5617 2.3698 25.2 32.8 34.6 36.2 41.8
25025330 7606 33.2243 2.0554 23.2 31.8 33 34.6 41.6
28015030 1127 34.5657 1.85704 26.3 33.4 34.6 35.8 40.4
28015070 12645 33.7017 2.18347 23.8 32.2 33.6 35.2 42.2
28025020 12966 32.6602 2.08594 24.6 31.3 32.7 34.2 39.7
28025040 2613 29.7907 1.74453 22.4 28.6 30 31 39.2
28025070 13398 34.4325 2.35604 24.1 32.8 34.4 36.2 42.8
28025080 7076 33.5969 2.05285 26.4 32.2 33.6 35 40
28025090 12953 34.3336 2.09333 23.8 33 34.4 35.8 42.3
28025502 11456 34.6493 2.18198 24.8 33.2 34.7 36.2 41.8
28035010 8442 35.0506 2.25909 25 33.6 35.2 36.6 42.4
28035020 12136 34.7338 2.21035 25.4 33.2 34.8 36.4 41.8
28035040 9424 36.1116 2.03179 26 35 36.4 37.4 42.4
28035070 42 37.1095 1.26738 34 36.6 37.4 38 39.2
28045020 1050 33.8226 2.13222 27.2 32.25 33.8 35.4 39.2
28045040 784 34.5605 2.25236 26 33 34.6 36 41.8
29065010 893 33.5259 1.58034 29.4 32.4 33.6 34.6 37.2
29065020 9406 34.1372 1.63127 26 33 34.2 35.2 39.8
29065030 10537 33.3129 1.4911 24.8 32.4 33.4 34.2 38.6

Method 1 - Outliers processing using the interquartile range IQR (q1 = 0.175, q3 = 0.825)

Because the data do not follow a normal distribution, outliers are identified with the interquartile range (IQR) instead of the Z-score. With the IQR method, outliers are the data points that fall below Q1 - 1.5 IQR or above Q3 + 1.5 IQR, where IQR = Q3 - Q1. Q1 and Q3 are conventionally the 25th and 75th percentiles of the dataset; this analysis uses the 0.175 and 0.825 quantiles instead. [1]
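The IQR limits described above can be sketched with pandas as follows. This is a minimal illustration, not the code that produced this report: the function name `iqr_limits`, the `whisker` parameter, and the sample temperatures are assumptions made for the example.

```python
import pandas as pd


def iqr_limits(series: pd.Series, q1: float = 0.175, q3: float = 0.825,
               whisker: float = 1.5) -> tuple:
    """Return the (lower, upper) outlier limits for one station series."""
    q1_val = series.quantile(q1)  # quantile() skips NaN values
    q3_val = series.quantile(q3)
    iqr = q3_val - q1_val
    return q1_val - whisker * iqr, q3_val + whisker * iqr


# Hypothetical daily maximum temperatures for one station (not IDEAM data).
temps = pd.Series([32.2, 32.6, 33.0, 33.4, 33.8, 34.0, 34.4, 34.8, 25.0, 44.0])

lower, upper = iqr_limits(temps)
# A value is an outlier when it falls outside [lower, upper].
outliers = temps[(temps < lower) | (temps > upper)]
print(f"limits: [{lower:.2f}, {upper:.2f}]")
print(outliers)
```

Widening the quantiles from 0.25/0.75 to 0.175/0.825 enlarges the IQR, so fewer values are flagged than with the conventional quartiles.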

Outliers parameters:

  • mean: mean value
  • std: standard deviation value
  • q1: quantile 0.175
  • q3: quantile 0.825
  • IQR: interquartile range (q3-q1)
  • OlLowerLim: outlier lower limit (q1-1.5*IQR)
  • OlUpperLim: outlier upper limit (q3+1.5*IQR)
  • OlMinVal: minimum outlier value found
  • OlMaxVal: maximum outlier value found
  • OlCount: number of outliers found
  • CapLowerLim: capped lower limit for outliers replacement ( $\mu$ - 3.6 * $\sigma$ )
  • CapUpperLim: capped upper limit for outliers replacement ( $\mu$ + 3.6 * $\sigma$ )

Station mean std q1 q3 IQR OlLowerLim OlUpperLim OlMinVal OlMaxVal OlCount CapLowerLim CapUpperLim
15015020 33.0729 1.48932 31.8 34.4 2.6 35.7 38.3 25.4 38.4 17 27.7114 38.4345
15065040 34.0803 1.83791 32.4 35.8 3.4 37.5 40.9 25.1 27.1 5 27.4638 40.6967
23215060 32.7978 1.77501 31 34.4 3.4 36.1 39.5 nan nan 0 26.4078 39.1879
25025002 34.3217 2.06657 32.4 36.2 3.8 38.1 41.9 23 44 5 26.882 41.7613
25025090 33.5555 1.8415 31.8 35 3.2 36.6 39.8 26.4 40 6 26.9261 40.1849
25025250 33.787 2.07519 32 35.4 3.4 37.1 40.5 26.6 42.6 15 26.3163 41.2576
25025300 34.5617 2.3698 32.4 36.8 4.4 39 43.4 25.2 25.4 2 26.0304 43.093
25025330 33.2243 2.0554 31.4 35.2 3.8 37.1 40.9 23.2 41.6 7 25.8248 40.6238
28015030 34.5657 1.85704 33 36.4 3.4 38.1 41.5 26.3 27.9 3 27.8803 41.251
28015070 33.7017 2.18347 31.6 35.8 4.2 37.9 42.1 23.8 42.2 3 25.8412 41.5622
28025020 32.6602 2.08594 30.6 34.6 4 36.6 40.6 nan nan 0 25.1508 40.1696
28025040 29.7907 1.74453 28.2 31.2 3 32.7 35.7 22.4 39.2 4 23.5104 36.071
28025070 34.4325 2.35604 32.2 36.6 4.4 38.8 43.2 24.1 25.6 4 25.9508 42.9143
28025080 33.5969 2.05285 31.8 35.4 3.6 37.2 40.8 26.4 26.4 1 26.2066 40.9872
28025090 34.3336 2.09333 32.4 36.2 3.8 38.1 41.9 23.8 42.3 13 26.7976 41.8696
28025502 34.6493 2.18198 32.6 36.7 4.1 38.75 42.85 24.8 26.4 12 26.7942 42.5044
28035010 35.0506 2.25909 33 37.2 4.2 39.3 43.5 25 26.4 9 26.9179 43.1833
28035020 34.7338 2.21035 32.6 36.8 4.2 38.9 43.1 25.4 26 5 26.7766 42.6911
28035040 36.1116 2.03179 34.4 37.8 3.4 39.5 42.9 26 29.2 42 28.7971 43.426
28035070 37.1095 1.26738 36.235 38.2 1.965 39.1825 41.1475 nan nan 0 32.547 41.6721
28045020 33.8226 2.13222 31.8 36 4.2 38.1 42.3 nan nan 0 26.1466 41.4985
28045040 34.5605 2.25236 32.4 36.6 4.2 38.7 42.9 26 26 1 26.4519 42.669
29065010 33.5259 1.58034 32 35.2 3.2 36.8 40 nan nan 0 27.8366 39.2151
29065020 34.1372 1.63127 32.8 35.6 2.8 37 39.8 26 28.4 20 28.2646 40.0097
29065030 33.3129 1.4911 32 34.6 2.6 35.9 38.5 24.8 38.6 51 27.945 38.6809

Identification and cleaning tables for the 225 IQR outliers found

Statistical values for the capped and imputed files

IQR - General statistics table - Capped file

Station count mean std min 25% 50% 75% max
15015020 8859 33.0737 1.48617 27.7114 32.2 33.2 34.2 38.4345
15065040 3465 34.082 1.83098 27.3 33 34.2 35.4 39.4
23215060 960 32.7978 1.77501 27 31.6 32.8 34 37.4
25025002 7701 34.3222 2.06183 26.8 33 34.2 35.6 41.8
25025090 8486 33.5557 1.84131 26.9261 32.4 33.6 34.8 40.1849
25025250 11758 33.7867 2.07458 26.3163 32.4 34 35 41.2576
25025300 10066 34.5619 2.36926 25.8 32.8 34.6 36.2 41.8
25025330 7606 33.2245 2.05156 25.8 31.8 33 34.6 40.6238
28015030 1127 34.5671 1.85119 27.8803 33.4 34.6 35.8 40.4
28015070 12645 33.7018 2.18237 25.4 32.2 33.6 35.2 42
28025020 12966 32.6602 2.08594 24.6 31.3 32.7 34.2 39.7
28025040 2613 29.7902 1.73652 23.5104 28.6 30 31 36.071
28025070 13398 34.4328 2.35485 25.8 32.8 34.4 36.2 42.8
28025080 7076 33.5968 2.05295 26.2066 32.2 33.6 35 40
28025090 12953 34.3342 2.09049 26.7976 33 34.4 35.8 41.8696
28025502 11456 34.6505 2.17731 26.6 33.2 34.7 36.2 41.8
28035010 8442 35.0518 2.25426 26.8 33.6 35.2 36.6 42.4
28035020 12136 34.7342 2.2088 26.4 33.2 34.8 36.4 41.8
28035040 9424 36.1131 2.02555 28.7971 35 36.4 37.4 42.4
28035070 42 37.1095 1.26738 34 36.6 37.4 38 39.2
28045020 1050 33.8226 2.13222 27.2 32.25 33.8 35.4 39.2
28045040 784 34.561 2.25023 26.2 33 34.6 36 41.8
29065010 893 33.5259 1.58034 29.4 32.4 33.6 34.6 37.2
29065020 9406 34.1386 1.6254 28.2646 33 34.2 35.2 39.8
29065030 10537 33.3167 1.47557 27.945 32.4 33.4 34.2 38.6809

IQR - General statistics table - Imputed file

Station count mean std min 25% 50% 75% max
15015020 8859 33.0828 1.46746 28 32.2 33.2 34.2 38.2
15065040 3465 34.0915 1.81361 27.3 33 34.2 35.4 39.4
23215060 960 32.7978 1.77501 27 31.6 32.8 34 37.4
25025002 7701 34.3251 2.05309 26.8 33 34.2 35.6 41.8
25025090 8486 33.5557 1.83285 27 32.4 33.6 34.8 39.8
25025250 11758 33.7797 2.05733 27 32.4 34 35 40.4
25025300 10066 34.5636 2.3662 25.8 32.8 34.6 36.2 41.8
25025330 7606 33.2215 2.03924 25.8 31.8 33 34.6 40.6
28015030 1127 34.5849 1.81864 28 33.5 34.6 35.8 40.4
28015070 12645 33.7025 2.17901 25.4 32.2 33.6 35.2 42
28025020 12966 32.6602 2.08594 24.6 31.3 32.7 34.2 39.7
28025040 2613 29.7902 1.71904 23.8 28.6 30 31 35.4
28025070 13398 34.4354 2.35028 25.8 32.8 34.4 36.2 42.8
28025080 7076 33.5979 2.05107 26.6 32.2 33.6 35 40
28025090 12953 34.3383 2.07681 27 33 34.4 35.8 41.2
28025502 11456 34.6587 2.16239 26.6 33.2 34.7 36.2 41.8
28035010 8442 35.0605 2.23855 26.8 33.6 35.2 36.6 42.4
28035020 12136 34.7375 2.20289 26.4 33.2 34.8 36.4 41.8
28035040 9424 36.1457 1.9655 29.4 35 36.4 37.4 42.4
28035070 42 37.1095 1.26738 34 36.6 37.4 38 39.2
28045020 1050 33.8226 2.13222 27.2 32.25 33.8 35.4 39.2
28045040 784 34.5714 2.23147 26.2 33.15 34.6 36 41.8
29065010 893 33.5259 1.58034 29.4 32.4 33.6 34.6 37.2
29065020 9406 34.1511 1.60262 28.6 33 34.2 35.2 39.8
29065030 10537 33.3416 1.42724 28.2 32.4 33.4 34.2 38.4

Method 2 - Outliers processing through the empirical rule (ER) or k-sigma ( $\mu$ ± k * $\sigma$ ) with k = 3.6

The empirical rule, also known as the three-sigma rule or 68-95-99.7 rule, states that for a normal distribution almost all observed data fall within three standard deviations ( $\sigma$ ) of the mean ( $\mu$ ). In particular, about 68% of observations fall within one standard deviation ( $\mu$ ± $\sigma$ ), 95% within two ( $\mu$ ± 2 $\sigma$ ), and 99.7% within three ( $\mu$ ± 3 $\sigma$ ). This analysis flags as outliers the values outside $\mu$ ± k $\sigma$ with k = 3.6. [2]
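A minimal sketch of the k-sigma test, assuming pandas and the default sample standard deviation (`ddof=1`). The function name `k_sigma_limits` and the sample series are hypothetical and only illustrate the rule.

```python
import pandas as pd


def k_sigma_limits(series: pd.Series, k: float = 3.6) -> tuple:
    """Return (mu - k*sigma, mu + k*sigma) for one station series."""
    mu = series.mean()
    sigma = series.std()  # sample standard deviation (ddof=1), the pandas default
    return mu - k * sigma, mu + k * sigma


# Hypothetical series: 49 plausible temperatures plus one very low reading.
temps = pd.Series([33.0 + 0.2 * (i % 5) for i in range(49)] + [20.0])

lower, upper = k_sigma_limits(temps, k=3.6)
outliers = temps[(temps < lower) | (temps > upper)]
print(f"limits: [{lower:.2f}, {upper:.2f}], outliers found: {len(outliers)}")
```

Note that the mean and standard deviation are computed with the outliers included, so a very extreme value widens the band it is tested against.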

Outliers parameters:

  • mean: mean value
  • std: standard deviation value
  • OlMinVal: minimum outlier value found
  • OlMaxVal: maximum outlier value found
  • OlCount: number of outliers found
  • CapLowerLim: capped lower limit for outliers replacement ( $\mu$ - 3.6 * $\sigma$ )
  • CapUpperLim: capped upper limit for outliers replacement ( $\mu$ + 3.6 * $\sigma$ )

Station mean std OlMinVal OlMaxVal OlCount CapLowerLim CapUpperLim
15015020 33.0729 1.48932 25.4 27.6 13 27.7114 38.4345
15065040 34.0803 1.83791 25.1 27.4 7 27.4638 40.6967
23215060 32.7978 1.77501 nan nan 0 26.4078 39.1879
25025002 34.3217 2.06657 23 44 8 26.882 41.7613
25025090 33.5555 1.8415 26.4 26.8 3 26.9261 40.1849
25025250 33.787 2.07519 41.4 42.6 5 26.3163 41.2576
25025300 34.5617 2.3698 25.2 25.8 3 26.0304 43.093
25025330 33.2243 2.0554 23.2 41.6 8 25.8248 40.6238
28015030 34.5657 1.85704 26.3 27.8 2 27.8803 41.251
28015070 33.7017 2.18347 23.8 42.2 13 25.8412 41.5622
28025020 32.6602 2.08594 24.6 25.1 5 25.1508 40.1696
28025040 29.7907 1.74453 22.4 39.2 3 23.5104 36.071
28025070 34.4325 2.35604 24.1 25.8 5 25.9508 42.9143
28025080 33.5969 2.05285 nan nan 0 26.2066 40.9872
28025090 34.3336 2.09333 23.8 42.3 13 26.7976 41.8696
28025502 34.6493 2.18198 24.8 26.7 17 26.7942 42.5044
28035010 35.0506 2.25909 25 26.8 11 26.9179 43.1833
28035020 34.7338 2.21035 25.4 26.6 12 26.7766 42.6911
28035040 36.1116 2.03179 26 28.6 29 28.7971 43.426
28035070 37.1095 1.26738 nan nan 0 32.547 41.6721
28045020 33.8226 2.13222 nan nan 0 26.1466 41.4985
28045040 34.5605 2.25236 26 26.2 2 26.4519 42.669
29065010 33.5259 1.58034 nan nan 0 27.8366 39.2151
29065020 34.1372 1.63127 26 28.2 18 28.2646 40.0097
29065030 33.3129 1.4911 24.8 27.8 39 27.945 38.6809

Identification and cleaning tables for the 216 ER or k-sigma outliers found

Statistical values for the capped and imputed files

ER - General statistics table - Capped file

Station count mean std min 25% 50% 75% max
15015020 8859 33.0738 1.48604 27.7114 32.2 33.2 34.2 38.4
15065040 3465 34.082 1.83074 27.4638 33 34.2 35.4 39.4
23215060 960 32.7978 1.77501 27 31.6 32.8 34 37.4
25025002 7701 34.3222 2.06175 26.882 33 34.2 35.6 41.7613
25025090 8486 33.5557 1.84107 26.9261 32.4 33.6 34.8 40
25025250 11758 33.7866 2.07391 26.6 32.4 34 35 41.2576
25025300 10066 34.5619 2.36917 26.0304 32.8 34.6 36.2 41.8
25025330 7606 33.2245 2.05154 25.8248 31.8 33 34.6 40.6238
28015030 1127 34.5671 1.85113 27.8803 33.4 34.6 35.8 40.4
28015070 12645 33.7019 2.18138 25.8412 32.2 33.6 35.2 41.5622
28025020 12966 32.6603 2.08556 25.1508 31.3 32.7 34.2 39.7
28025040 2613 29.7902 1.73642 23.5104 28.6 30 31 36.071
28025070 13398 34.4328 2.3548 25.9508 32.8 34.4 36.2 42.8
28025080 7076 33.5969 2.05285 26.4 32.2 33.6 35 40
28025090 12953 34.3342 2.09049 26.7976 33 34.4 35.8 41.8696
28025502 11456 34.6505 2.17712 26.7942 33.2 34.7 36.2 41.8
28035010 8442 35.0519 2.25416 26.9179 33.6 35.2 36.6 42.4
28035020 12136 34.7344 2.20819 26.7766 33.2 34.8 36.4 41.8
28035040 9424 36.1134 2.02448 28.7971 35 36.4 37.4 42.4
28035070 42 37.1095 1.26738 34 36.6 37.4 38 39.2
28045020 1050 33.8226 2.13222 27.2 32.25 33.8 35.4 39.2
28045040 784 34.5614 2.24905 26.4519 33 34.6 36 41.8
29065010 893 33.5259 1.58034 29.4 32.4 33.6 34.6 37.2
29065020 9406 34.1386 1.62529 28.2646 33 34.2 35.2 39.8
29065030 10537 33.3167 1.47534 27.945 32.4 33.4 34.2 38.6

ER - General statistics table - Imputed file

Station count mean std min 25% 50% 75% max
15015020 8859 33.0816 1.47176 27.8 32.2 33.2 34.2 38.4
15065040 3465 34.0954 1.80636 27.6 33 34.2 35.4 39.4
23215060 960 32.7978 1.77501 27 31.6 32.8 34 37.4
25025002 7701 34.3241 2.04776 27 33 34.2 35.6 41.6
25025090 8486 33.558 1.83685 27 32.4 33.6 34.8 40
25025250 11758 33.7835 2.06818 26.6 32.4 34 35 41.2
25025300 10066 34.5644 2.36459 26.4 32.8 34.6 36.2 41.8
25025330 7606 33.2225 2.03746 26 31.8 33 34.6 40.6
28015030 1127 34.579 1.82951 27.9 33.5 34.6 35.8 40.4
28015070 12645 33.7038 2.16677 26 32.2 33.6 35.2 41.2
28025020 12966 32.6632 2.08033 25.2 31.3 32.7 34.2 39.7
28025040 2613 29.7926 1.72333 23.8 28.6 30 31 36
28025070 13398 34.436 2.34909 26 32.8 34.4 36.2 42.8
28025080 7076 33.5969 2.05285 26.4 32.2 33.6 35 40
28025090 12953 34.3383 2.07681 27 33 34.4 35.8 41.2
28025502 11456 34.6622 2.15595 26.8 33.2 34.7 36.2 41.8
28035010 8442 35.0625 2.23493 27.2 33.6 35.2 36.6 42.4
28035020 12136 34.7423 2.19395 26.8 33.2 34.8 36.4 41.8
28035040 9424 36.1359 1.98325 28.8 35 36.4 37.4 42.4
28035070 42 37.1095 1.26738 34 36.6 37.4 38 39.2
28045020 1050 33.8226 2.13222 27.2 32.25 33.8 35.4 39.2
28045040 784 34.582 2.21129 27 33.2 34.6 36 41.8
29065010 893 33.5259 1.58034 29.4 32.4 33.6 34.6 37.2
29065020 9406 34.1499 1.60481 28.4 33 34.2 35.2 39.8
29065030 10537 33.3366 1.43854 28 32.4 33.4 34.2 38.6

Method 3 - Outliers processing through Z-score (standard score) with |Z| >= 3.6

The Z-score, also called the standard score, indicates whether a data value is above or below the mean and how far from it, expressed in standard deviations: Z = ( x - $\mu$ ) / $\sigma$. [3]

Although this method identifies the same outliers as Method 2 (empirical rule) when the Z-score threshold equals the k-sigma value, Method 3 also produces the table of Z-score values. Use this method to compare the outliers identified with different k-sigma values.
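The Z-score table and the comparison across thresholds can be sketched as below. The two-station frame, the column names, and the list of thresholds are assumptions for illustration; the report itself uses a threshold of 3.6.

```python
import pandas as pd

# Hypothetical frame with one column per station (not IDEAM data).
data = pd.DataFrame({
    "station_a": [33.0 + 0.2 * (i % 5) for i in range(49)] + [20.0],
    "station_b": [34.0 + 0.3 * (i % 4) for i in range(48)] + [None, 45.0],
})

# Column-wise standard score: Z = (x - mu) / sigma. Missing values stay NaN.
z_scores = (data - data.mean()) / data.std()

# Count the outliers per station for several thresholds.
counts_by_threshold = {}
for threshold in (3.0, 3.6, 4.0):
    counts = (z_scores.abs() >= threshold).sum()
    counts_by_threshold[threshold] = {name: int(n) for name, n in counts.items()}

print(z_scores.tail())
print(counts_by_threshold)
```

Keeping the full `z_scores` table makes it possible to change the threshold without recomputing the statistics.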

Outliers parameters:

  • mean: mean value
  • std: standard deviation value
  • OlMinVal: minimum outlier value found
  • OlMaxVal: maximum outlier value found
  • OlCount: number of outliers found
  • CapLowerLim: capped lower limit for outliers replacement ( $\mu$ - 3.6 * $\sigma$ )
  • CapUpperLim: capped upper limit for outliers replacement ( $\mu$ + 3.6 * $\sigma$ )

Station mean std OlMinVal OlMaxVal OlCount CapLowerLim CapUpperLim
15015020 33.0729 1.48932 25.4 27.6 13 27.7114 38.4345
15065040 34.0803 1.83791 25.1 27.4 7 27.4638 40.6967
23215060 32.7978 1.77501 nan nan 0 26.4078 39.1879
25025002 34.3217 2.06657 23 44 8 26.882 41.7613
25025090 33.5555 1.8415 26.4 26.8 3 26.9261 40.1849
25025250 33.787 2.07519 41.4 42.6 5 26.3163 41.2576
25025300 34.5617 2.3698 25.2 25.8 3 26.0304 43.093
25025330 33.2243 2.0554 23.2 41.6 8 25.8248 40.6238
28015030 34.5657 1.85704 26.3 27.8 2 27.8803 41.251
28015070 33.7017 2.18347 23.8 42.2 13 25.8412 41.5622
28025020 32.6602 2.08594 24.6 25.1 5 25.1508 40.1696
28025040 29.7907 1.74453 22.4 39.2 3 23.5104 36.071
28025070 34.4325 2.35604 24.1 25.8 5 25.9508 42.9143
28025080 33.5969 2.05285 nan nan 0 26.2066 40.9872
28025090 34.3336 2.09333 23.8 42.3 13 26.7976 41.8696
28025502 34.6493 2.18198 24.8 26.7 17 26.7942 42.5044
28035010 35.0506 2.25909 25 26.8 11 26.9179 43.1833
28035020 34.7338 2.21035 25.4 26.6 12 26.7766 42.6911
28035040 36.1116 2.03179 26 28.6 29 28.7971 43.426
28035070 37.1095 1.26738 nan nan 0 32.547 41.6721
28045020 33.8226 2.13222 nan nan 0 26.1466 41.4985
28045040 34.5605 2.25236 26 26.2 2 26.4519 42.669
29065010 33.5259 1.58034 nan nan 0 27.8366 39.2151
29065020 34.1372 1.63127 26 28.2 18 28.2646 40.0097
29065030 33.3129 1.4911 24.8 27.8 39 27.945 38.6809

Identification and cleaning tables for the 216 Z-score or standard score outliers found

Statistical values for the capped and imputed files

Z-score - General statistics table - Capped file

Station count mean std min 25% 50% 75% max
15015020 8859 33.0738 1.48604 27.7114 32.2 33.2 34.2 38.4
15065040 3465 34.082 1.83074 27.4638 33 34.2 35.4 39.4
23215060 960 32.7978 1.77501 27 31.6 32.8 34 37.4
25025002 7701 34.3222 2.06175 26.882 33 34.2 35.6 41.7613
25025090 8486 33.5557 1.84107 26.9261 32.4 33.6 34.8 40
25025250 11758 33.7866 2.07391 26.6 32.4 34 35 41.2576
25025300 10066 34.5619 2.36917 26.0304 32.8 34.6 36.2 41.8
25025330 7606 33.2245 2.05154 25.8248 31.8 33 34.6 40.6238
28015030 1127 34.5671 1.85113 27.8803 33.4 34.6 35.8 40.4
28015070 12645 33.7019 2.18138 25.8412 32.2 33.6 35.2 41.5622
28025020 12966 32.6603 2.08556 25.1508 31.3 32.7 34.2 39.7
28025040 2613 29.7902 1.73642 23.5104 28.6 30 31 36.071
28025070 13398 34.4328 2.3548 25.9508 32.8 34.4 36.2 42.8
28025080 7076 33.5969 2.05285 26.4 32.2 33.6 35 40
28025090 12953 34.3342 2.09049 26.7976 33 34.4 35.8 41.8696
28025502 11456 34.6505 2.17712 26.7942 33.2 34.7 36.2 41.8
28035010 8442 35.0519 2.25416 26.9179 33.6 35.2 36.6 42.4
28035020 12136 34.7344 2.20819 26.7766 33.2 34.8 36.4 41.8
28035040 9424 36.1134 2.02448 28.7971 35 36.4 37.4 42.4
28035070 42 37.1095 1.26738 34 36.6 37.4 38 39.2
28045020 1050 33.8226 2.13222 27.2 32.25 33.8 35.4 39.2
28045040 784 34.5614 2.24905 26.4519 33 34.6 36 41.8
29065010 893 33.5259 1.58034 29.4 32.4 33.6 34.6 37.2
29065020 9406 34.1386 1.62529 28.2646 33 34.2 35.2 39.8
29065030 10537 33.3167 1.47534 27.945 32.4 33.4 34.2 38.6

Z-score - General statistics table - Imputed file

Station count mean std min 25% 50% 75% max
15015020 8859 33.0816 1.47176 27.8 32.2 33.2 34.2 38.4
15065040 3465 34.0954 1.80636 27.6 33 34.2 35.4 39.4
23215060 960 32.7978 1.77501 27 31.6 32.8 34 37.4
25025002 7701 34.3241 2.04776 27 33 34.2 35.6 41.6
25025090 8486 33.558 1.83685 27 32.4 33.6 34.8 40
25025250 11758 33.7835 2.06818 26.6 32.4 34 35 41.2
25025300 10066 34.5644 2.36459 26.4 32.8 34.6 36.2 41.8
25025330 7606 33.2225 2.03746 26 31.8 33 34.6 40.6
28015030 1127 34.579 1.82951 27.9 33.5 34.6 35.8 40.4
28015070 12645 33.7038 2.16677 26 32.2 33.6 35.2 41.2
28025020 12966 32.6632 2.08033 25.2 31.3 32.7 34.2 39.7
28025040 2613 29.7926 1.72333 23.8 28.6 30 31 36
28025070 13398 34.436 2.34909 26 32.8 34.4 36.2 42.8
28025080 7076 33.5969 2.05285 26.4 32.2 33.6 35 40
28025090 12953 34.3383 2.07681 27 33 34.4 35.8 41.2
28025502 11456 34.6622 2.15595 26.8 33.2 34.7 36.2 41.8
28035010 8442 35.0625 2.23493 27.2 33.6 35.2 36.6 42.4
28035020 12136 34.7423 2.19395 26.8 33.2 34.8 36.4 41.8
28035040 9424 36.1359 1.98325 28.8 35 36.4 37.4 42.4
28035070 42 37.1095 1.26738 34 36.6 37.4 38 39.2
28045020 1050 33.8226 2.13222 27.2 32.25 33.8 35.4 39.2
28045040 784 34.582 2.21129 27 33.2 34.6 36 41.8
29065010 893 33.5259 1.58034 29.4 32.4 33.6 34.6 37.2
29065020 9406 34.1499 1.60481 28.4 33 34.2 35.2 39.8
29065030 10537 33.3366 1.43854 28 32.4 33.4 34.2 38.6

The drop files contain the database values with the identified outliers removed.
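A minimal sketch of the drop step, assuming the detection limits are already known. The function name `drop_outliers`, the limits, and the sample values are hypothetical.

```python
import pandas as pd


def drop_outliers(series: pd.Series, lower: float, upper: float) -> pd.Series:
    """Set values outside [lower, upper] to NaN, keeping the date index intact."""
    return series.where(series.between(lower, upper))


# Hypothetical daily values for one station.
temps = pd.Series(
    [33.2, 25.4, 34.0, 38.6],
    index=pd.date_range("1980-01-01", periods=4, freq="D"),
)

dropped = drop_outliers(temps, lower=27.7, upper=38.4)
print(dropped)
```

Replacing the outliers with NaN, rather than deleting rows, keeps the daily index aligned across stations.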

The capped files contain the database values with each outlier replaced by the calculated lower or upper cap value. Because the lower limit is defined as (mean() - cap_multiplier * std()), lower outliers could be replaced with negative values. In some cases, such as temperature analysis, upper outliers could be replaced with values higher than the original ones; to avoid this, adjust the cap_multiplier parameter, which defines the width of the capping band.
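The capping behavior can be sketched as follows, assuming the detection limits come from one of the three methods and the cap values from mean() ± cap_multiplier * std(). The function name `cap_outliers`, the detection limits, and the sample series are hypothetical.

```python
import pandas as pd


def cap_outliers(series: pd.Series, lower: float, upper: float,
                 cap_multiplier: float = 3.6) -> tuple:
    """Replace values below `lower` or above `upper` with the cap limits."""
    mu, sigma = series.mean(), series.std()
    cap_lower = mu - cap_multiplier * sigma
    cap_upper = mu + cap_multiplier * sigma
    capped = series.mask(series < lower, cap_lower)
    capped = capped.mask(series > upper, cap_upper)
    return capped, cap_lower, cap_upper


# Hypothetical series and detection limits (for example, from the IQR method).
temps = pd.Series([33.0 + 0.2 * (i % 5) for i in range(49)] + [20.0])
capped, cap_lower, cap_upper = cap_outliers(temps, lower=27.0, upper=33.7)

print(f"cap limits: [{cap_lower:.2f}, {cap_upper:.2f}]")
print(f"original max: {temps.max():.2f}, capped max: {capped.max():.2f}")
```

In this sketch the upper detection limit is tighter than the upper cap, so upper outliers are raised above their original values. The same effect appears in the IQR tables of this report: station 15015020 has an upper outlier limit of 38.3, a maximum outlier of 38.4, and a capped maximum of 38.4345. Reducing `cap_multiplier` narrows the band.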

The imputation method replaces each outlier with the mean value, which is computed with the original outlier values included.
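A minimal sketch of mean imputation under the same assumptions: the detection limits are given and the mean is taken from the original series. The function name `impute_outliers` and the sample values are hypothetical.

```python
import pandas as pd


def impute_outliers(series: pd.Series, lower: float, upper: float) -> pd.Series:
    """Replace values outside [lower, upper] with the mean of the original series."""
    mu = series.mean()  # computed before masking, so it includes the outliers
    is_outlier = (series < lower) | (series > upper)
    return series.mask(is_outlier, mu)


# Hypothetical values for one station; 20.0 is the only outlier.
temps = pd.Series([33.0, 33.4, 20.0, 34.0])
imputed = impute_outliers(temps, lower=27.0, upper=38.0)
print(imputed)
```

Because the mean includes the outliers, the imputed value is pulled slightly toward them; computing the mean after masking would be an alternative design, but it is not what the text above describes.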

Footnotes

  1. Adapted from: https://careerfoundry.com/en/blog/data-analytics/how-to-find-outliers/

  2. https://www.investopedia.com/terms/e/empirical-rule.asp

  3. Adapted from: https://www.geeksforgeeks.org/z-score-for-outlier-detection-python/