Commit 8b56ad9

📝 Fill with mean values
1 parent ba42ea4 commit 8b56ad9

1 file changed (+95 -87 lines changed)


docs/clean-prep/nulls.ipynb

Lines changed: 95 additions & 87 deletions
@@ -485,9 +485,20 @@
  "cell_type": "code",
  "execution_count": 8,
  "metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(46116, 7)"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
  "source": [
- "df2 = df.dropna().shape"
+ "df.dropna().shape"
  ]
  },
  {
@@ -535,49 +546,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "### 2.3 Find all columns in which all data is present"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [],
- "source": [
- "complete_columns = list(df.columns)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['timestamp',\n",
- " 'username',\n",
- " 'temperature',\n",
- " 'heartrate',\n",
- " 'build',\n",
- " 'latest',\n",
- " 'note']"
- ]
- },
- "execution_count": 11,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "complete_columns"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 2.4 Find all columns in which most of the data is present"
+ "### 2.3 Find all columns in which most of the data is present"
  ]
  },
  {
@@ -611,7 +580,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "### 2.5 Find all columns with missing data\n",
+ "### 2.4 Find all columns with missing data\n",
  "\n",
  "With [pandas.DataFrame.isnull](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html) we can find missing values, and with [pandas.DataFrame.any](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html) we learn whether any element is true, usually per column."
  ]
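
Reviewer note: the `isnull`/`any` combination described in the hunk above can be sketched on a small frame. This is a minimal illustration, not the notebook's own cell. The column names mirror the notebook, but the sample values are invented, and `has_missing` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the notebook.
df = pd.DataFrame(
    {
        "temperature": [29.0, np.nan, 16.0],
        "heartrate": [62, 76, 63],
        "note": [None, "interval", None],
    }
)

# isnull() flags every missing cell; any() then reduces each column
# to a single boolean (True if at least one cell is missing).
has_missing = df.isnull().any()

# Keep only the column labels whose flag is True.
incomplete_columns = has_missing[has_missing].index.tolist()
print(incomplete_columns)
```

Because `any()` reduces along the index by default, the result has one entry per column, which is why the column labels can be read straight off its index.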
@@ -707,7 +676,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "### 2.6 Replacing missing data\n",
+ "### 2.5 Replacing missing data\n",
  "\n",
  "To check our changes in the `latest` column, we use [pandas.Series.value_counts](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html). The method returns a Series containing the counts of the unique values:"
  ]
@@ -720,9 +689,32 @@
  {
  "data": {
  "text/plain": [
- "latest\n",
- "0.0 75735\n",
- "1.0 38364\n",
+ "temperature\n",
+ "29.0 4688\n",
+ "26.0 4674\n",
+ "16.0 4656\n",
+ "28.0 4648\n",
+ "10.0 4632\n",
+ "13.0 4629\n",
+ "7.0 4624\n",
+ "27.0 4621\n",
+ "21.0 4585\n",
+ "9.0 4576\n",
+ "23.0 4571\n",
+ "5.0 4568\n",
+ "6.0 4563\n",
+ "19.0 4561\n",
+ "18.0 4557\n",
+ "17.0 4556\n",
+ "11.0 4529\n",
+ "15.0 4525\n",
+ "8.0 4486\n",
+ "12.0 4484\n",
+ "20.0 4473\n",
+ "25.0 4469\n",
+ "14.0 4464\n",
+ "22.0 4455\n",
+ "24.0 4446\n",
  "Name: count, dtype: int64"
  ]
  },
@@ -732,60 +724,76 @@
  }
  ],
  "source": [
- "df.latest.value_counts()"
+ "df.temperature.value_counts()"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "Now we replace the missing values in the `latest` column with `0` using [DataFrame.fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html):"
+ "Now we replace the missing values in the `temperature` column with the mean, rounded to one decimal place, using [DataFrame.fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html):"
  ]
  },
  {
  "cell_type": "code",
  "execution_count": 18,
  "metadata": {},
- "outputs": [],
- "source": [
- "df.latest = df.latest.fillna(0)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {},
  "outputs": [
  {
  "data": {
  "text/plain": [
- "latest\n",
- "0.0 108033\n",
- "1.0 38364\n",
+ "temperature\n",
+ "17.0 36913\n",
+ "29.0 4688\n",
+ "26.0 4674\n",
+ "16.0 4656\n",
+ "28.0 4648\n",
+ "10.0 4632\n",
+ "13.0 4629\n",
+ "7.0 4624\n",
+ "27.0 4621\n",
+ "21.0 4585\n",
+ "9.0 4576\n",
+ "23.0 4571\n",
+ "5.0 4568\n",
+ "6.0 4563\n",
+ "19.0 4561\n",
+ "18.0 4557\n",
+ "11.0 4529\n",
+ "15.0 4525\n",
+ "8.0 4486\n",
+ "12.0 4484\n",
+ "20.0 4473\n",
+ "25.0 4469\n",
+ "14.0 4464\n",
+ "22.0 4455\n",
+ "24.0 4446\n",
  "Name: count, dtype: int64"
  ]
  },
- "execution_count": 19,
+ "execution_count": 18,
  "metadata": {},
  "output_type": "execute_result"
  }
  ],
  "source": [
- "df.latest.value_counts()"
+ "temp_mean = round(df.temperature.mean(), 1)\n",
+ "fill_mean = df.temperature.fillna(temp_mean)\n",
+ "fill_mean.value_counts()"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "### 2.7 Replacing missing data with `backfill`\n",
+ "### 2.6 Replacing missing data with `backfill`\n",
  "\n",
  "So that the records follow one another in chronological order, we first set the index to `timestamp` with [set_index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html):"
  ]
  },
  {
  "cell_type": "code",
- "execution_count": 20,
+ "execution_count": 19,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -794,7 +802,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 21,
+ "execution_count": 20,
  "metadata": {},
  "outputs": [
  {
@@ -878,7 +886,7 @@
  " <td>29.0</td>\n",
  " <td>62</td>\n",
  " <td>122f1c6a-403c-2221-6ed1-b5caa08f11e0</td>\n",
- " <td>0.0</td>\n",
+ " <td>NaN</td>\n",
  " <td>NaN</td>\n",
  " </tr>\n",
  " <tr>\n",
@@ -905,7 +913,7 @@
  " <td>16.0</td>\n",
  " <td>76</td>\n",
  " <td>7a60219f-6621-e548-180e-ca69624f9824</td>\n",
- " <td>0.0</td>\n",
+ " <td>NaN</td>\n",
  " <td>interval</td>\n",
  " </tr>\n",
  " <tr>\n",
@@ -932,7 +940,7 @@
  " <td>NaN</td>\n",
  " <td>63</td>\n",
  " <td>e09b6001-125d-51cf-9c3f-9cb686c19d02</td>\n",
- " <td>0.0</td>\n",
+ " <td>NaN</td>\n",
  " <td>NaN</td>\n",
  " </tr>\n",
  " <tr>\n",
@@ -950,7 +958,7 @@
  " <td>22.0</td>\n",
  " <td>83</td>\n",
  " <td>03e1a07b-3e14-412c-3a69-6b45bc79f81c</td>\n",
- " <td>0.0</td>\n",
+ " <td>NaN</td>\n",
  " <td>update</td>\n",
  " </tr>\n",
  " <tr>\n",
@@ -986,7 +994,7 @@
  " <td>NaN</td>\n",
  " <td>63</td>\n",
  " <td>b60bd7de-4057-8a85-f806-e6eec1350338</td>\n",
- " <td>0.0</td>\n",
+ " <td>NaN</td>\n",
  " <td>interval</td>\n",
  " </tr>\n",
  " <tr>\n",
@@ -1004,7 +1012,7 @@
  " <td>11.0</td>\n",
  " <td>69</td>\n",
  " <td>1aef7db8-9a3e-7dc9-d7a5-781ec0efd200</td>\n",
- " <td>0.0</td>\n",
+ " <td>NaN</td>\n",
  " <td>user</td>\n",
  " </tr>\n",
  " <tr>\n",
@@ -1050,25 +1058,25 @@
  "2017-01-01T12:01:09 7256b7b0-e502-f576-62ec-ed73533c9c84 0.0 wake \n",
  "2017-01-01T12:01:34 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 0.0 NaN \n",
  "2017-01-01T12:02:09 NaN 0.0 update \n",
- "2017-01-01T12:02:36 122f1c6a-403c-2221-6ed1-b5caa08f11e0 0.0 NaN \n",
+ "2017-01-01T12:02:36 122f1c6a-403c-2221-6ed1-b5caa08f11e0 NaN NaN \n",
  "2017-01-01T12:03:04 0897dbe5-9c5b-71ca-73a1-7586959ca198 0.0 interval \n",
  "2017-01-01T12:03:51 1c07ab9b-5f66-137d-a74f-921a41001f4e 1.0 NaN \n",
- "2017-01-01T12:04:35 7a60219f-6621-e548-180e-ca69624f9824 0.0 interval \n",
+ "2017-01-01T12:04:35 7a60219f-6621-e548-180e-ca69624f9824 NaN interval \n",
  "2017-01-01T12:05:05 a8b87754-a162-da28-2527-4bce4b3d4191 1.0 NaN \n",
  "2017-01-01T12:05:41 585f1a3c-0679-0ffe-9132-508933c70343 0.0 wake \n",
- "2017-01-01T12:06:21 e09b6001-125d-51cf-9c3f-9cb686c19d02 0.0 NaN \n",
+ "2017-01-01T12:06:21 e09b6001-125d-51cf-9c3f-9cb686c19d02 NaN NaN \n",
  "2017-01-01T12:06:53 607c9f6e-2bdf-a606-6d16-3004c6958436 1.0 update \n",
- "2017-01-01T12:07:41 03e1a07b-3e14-412c-3a69-6b45bc79f81c 0.0 update \n",
+ "2017-01-01T12:07:41 03e1a07b-3e14-412c-3a69-6b45bc79f81c NaN update \n",
  "2017-01-01T12:08:08 NaN 0.0 interval \n",
  "2017-01-01T12:08:35 NaN 0.0 wake \n",
  "2017-01-01T12:09:05 b9890c1e-79d5-8979-63ae-6c08a4cd476a 0.0 NaN \n",
- "2017-01-01T12:09:48 b60bd7de-4057-8a85-f806-e6eec1350338 0.0 interval \n",
+ "2017-01-01T12:09:48 b60bd7de-4057-8a85-f806-e6eec1350338 NaN interval \n",
  "2017-01-01T12:10:23 b1dacc73-c8b7-1d7d-ee02-578da781a71e 0.0 test \n",
- "2017-01-01T12:10:57 1aef7db8-9a3e-7dc9-d7a5-781ec0efd200 0.0 user \n",
+ "2017-01-01T12:10:57 1aef7db8-9a3e-7dc9-d7a5-781ec0efd200 NaN user \n",
  "2017-01-01T12:11:34 8075d058-7dae-e2ec-d47e-58ec6d26899b 1.0 NaN "
  ]
  },
- "execution_count": 21,
+ "execution_count": 20,
  "metadata": {},
  "output_type": "execute_result"
  }
@@ -1086,7 +1094,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 22,
+ "execution_count": 21,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -1097,7 +1105,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 23,
+ "execution_count": 22,
  "metadata": {},
  "outputs": [
  {
@@ -1106,15 +1114,15 @@
  "text": [
  "number missing for column temperature: 22633\n",
  "number missing for column build: 32350\n",
- "number missing for column latest: 0\n",
+ "number missing for column latest: 32298\n",
  "number missing for column note: 48704\n"
  ]
  }
  ],
  "source": [
  "for col in incomplete_columns:\n",
  " num_missing = df[df[col].isnull() == True].shape[0]\n",
- " print(\"number missing for column {}: {}\".format(col, num_missing)) "
+ " print(f\"number missing for column {col}: {num_missing}\") "
  ]
  },
  {
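
Reviewer note: the loop in the hunk above counts missing values by filtering rows with `df[col].isnull() == True` and reading `shape[0]`. As a possible simplification (a sketch, not what the commit does), summing the boolean mask gives the same count. The sample frame is invented and `missing_counts` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the notebook.
df = pd.DataFrame(
    {
        "temperature": [29.0, np.nan, 16.0],
        "heartrate": [62, 76, 63],
        "note": [None, "interval", None],
    }
)

# Columns that contain at least one missing value.
incomplete_columns = [col for col in df.columns if df[col].isnull().any()]

# True counts as 1 when summed, so the sum of the mask is the number
# of missing cells in that column.
missing_counts = {col: int(df[col].isnull().sum()) for col in incomplete_columns}

for col, num_missing in missing_counts.items():
    print(f"number missing for column {col}: {num_missing}")
```

Summing the mask avoids building a filtered copy of the frame for every column, which may matter on a dataset with well over 100,000 rows like the one in the notebook.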
@@ -1149,7 +1157,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.11.4"
+ "version": "3.11.10"
  },
  "latex_envs": {
  "LaTeX_envs_menu_present": true,
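
Reviewer note: the mean fill introduced by this commit (`round(df.temperature.mean(), 1)` followed by `fillna`) can be tried in isolation. The sketch below assumes a tiny invented Series in place of the notebook's `df.temperature`. The names `temp_mean` and `fill_mean` are taken from the diff.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df.temperature; values are invented.
temperature = pd.Series([16.0, 17.0, 18.0, np.nan, np.nan], name="temperature")

# mean() skips NaN by default, so only the three observed values count.
temp_mean = round(temperature.mean(), 1)

# fillna() returns a new Series; the original keeps its NaN entries.
fill_mean = temperature.fillna(temp_mean)

print(fill_mean.value_counts())
```

As in the notebook output, where the count for `17.0` rises from 4556 to 36913, the filled values merge with any observed value equal to the rounded mean, so `value_counts()` no longer separates measured from imputed readings. Since the result is not assigned back to the frame, the original column still contains its missing values.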
