485 | 485 | "cell_type": "code", |
486 | 486 | "execution_count": 8, |
487 | 487 | "metadata": {}, |
488 | | - "outputs": [], |
| 488 | + "outputs": [ |
| 489 | + { |
| 490 | + "data": { |
| 491 | + "text/plain": [ |
| 492 | + "(46116, 7)" |
| 493 | + ] |
| 494 | + }, |
| 495 | + "execution_count": 8, |
| 496 | + "metadata": {}, |
| 497 | + "output_type": "execute_result" |
| 498 | + } |
| 499 | + ], |
489 | 500 | "source": [ |
490 | | - "df2 = df.dropna().shape" |
| 501 | + "df.dropna().shape" |
491 | 502 | ] |
492 | 503 | }, |
493 | 504 | { |
|
535 | 546 | "cell_type": "markdown", |
536 | 547 | "metadata": {}, |
537 | 548 | "source": [ |
538 | | - "### 2.3 Find all columns in which all data is present" |
539 | | - ] |
540 | | - }, |
541 | | - { |
542 | | - "cell_type": "code", |
543 | | - "execution_count": 10, |
544 | | - "metadata": {}, |
545 | | - "outputs": [], |
546 | | - "source": [ |
547 | | - "complete_columns = list(df.columns)" |
548 | | - ] |
549 | | - }, |
550 | | - { |
551 | | - "cell_type": "code", |
552 | | - "execution_count": 11, |
553 | | - "metadata": {}, |
554 | | - "outputs": [ |
555 | | - { |
556 | | - "data": { |
557 | | - "text/plain": [ |
558 | | - "['timestamp',\n", |
559 | | - " 'username',\n", |
560 | | - " 'temperature',\n", |
561 | | - " 'heartrate',\n", |
562 | | - " 'build',\n", |
563 | | - " 'latest',\n", |
564 | | - " 'note']" |
565 | | - ] |
566 | | - }, |
567 | | - "execution_count": 11, |
568 | | - "metadata": {}, |
569 | | - "output_type": "execute_result" |
570 | | - } |
571 | | - ], |
572 | | - "source": [ |
573 | | - "complete_columns" |
574 | | - ] |
575 | | - }, |
576 | | - { |
577 | | - "cell_type": "markdown", |
578 | | - "metadata": {}, |
579 | | - "source": [ |
580 | | - "### 2.4 Find all columns in which most of the data is present" |
| 549 | + "### 2.3 Find all columns in which most of the data is present" |
581 | 550 | ] |
582 | 551 | }, |
583 | 552 | { |
|
611 | 580 | "cell_type": "markdown", |
612 | 581 | "metadata": {}, |
613 | 582 | "source": [ |
614 | | - "### 2.5 Find all columns with missing data\n", |
| 583 | + "### 2.4 Find all columns with missing data\n", |
615 | 584 | "\n", |
616 | 585 | "With [pandas.DataFrame.isnull](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html) we can find missing values, and with [pandas.DataFrame.any](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html) we can check whether any element is `True`, by default along each column." |
617 | 586 | ] |
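The combination of `isnull` and `any` described in this cell can be sketched as follows. This is a minimal illustration, not the notebook's own code: the frame `df_demo` and its values are invented, and only the column names are borrowed from the notebook.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
df_demo = pd.DataFrame(
    {
        "username": ["a", "b", "c"],
        "temperature": [12.0, None, 17.0],
        "note": [None, "wake", None],
    }
)

# isnull() returns a boolean frame of the same shape;
# any() then reduces it column by column (axis=0 is the default).
has_missing = df_demo.isnull().any()

# Keep only the labels of the columns flagged as True.
incomplete_columns_demo = has_missing[has_missing].index.tolist()
print(incomplete_columns_demo)
```

Passing `axis=1` to `any` would instead flag the rows that contain at least one missing value.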
|
707 | 676 | "cell_type": "markdown", |
708 | 677 | "metadata": {}, |
709 | 678 | "source": [ |
710 | | - "### 2.6 Replacing missing data\n", |
| 679 | + "### 2.5 Replacing missing data\n", |
711 | 680 | "\n", |
712 | 681 | "To check our changes in the `latest` column, we use [pandas.Series.value_counts](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html). The method returns a Series containing the number of occurrences of each unique value:" |
713 | 682 | ] |
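A small sketch of how `value_counts` behaves on a column with gaps, assuming an invented Series `latest_demo` rather than the notebook's data. By default missing values are left out of the counts; `dropna=False` includes them.

```python
import pandas as pd

# Hypothetical sample; the values are invented for illustration.
latest_demo = pd.Series([0.0, 1.0, 0.0, None, 0.0], name="latest")

# Default behaviour: NaN is excluded, results are sorted by descending count.
counts = latest_demo.value_counts()

# With dropna=False the missing values get their own entry.
counts_with_nan = latest_demo.value_counts(dropna=False)

print(counts)
print(counts_with_nan)
```

Comparing the two totals is a quick way to see how many values are missing before and after a replacement.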
|
720 | 689 | { |
721 | 690 | "data": { |
722 | 691 | "text/plain": [ |
723 | | - "latest\n", |
724 | | - "0.0 75735\n", |
725 | | - "1.0 38364\n", |
| 692 | + "temperature\n", |
| 693 | + "29.0 4688\n", |
| 694 | + "26.0 4674\n", |
| 695 | + "16.0 4656\n", |
| 696 | + "28.0 4648\n", |
| 697 | + "10.0 4632\n", |
| 698 | + "13.0 4629\n", |
| 699 | + "7.0 4624\n", |
| 700 | + "27.0 4621\n", |
| 701 | + "21.0 4585\n", |
| 702 | + "9.0 4576\n", |
| 703 | + "23.0 4571\n", |
| 704 | + "5.0 4568\n", |
| 705 | + "6.0 4563\n", |
| 706 | + "19.0 4561\n", |
| 707 | + "18.0 4557\n", |
| 708 | + "17.0 4556\n", |
| 709 | + "11.0 4529\n", |
| 710 | + "15.0 4525\n", |
| 711 | + "8.0 4486\n", |
| 712 | + "12.0 4484\n", |
| 713 | + "20.0 4473\n", |
| 714 | + "25.0 4469\n", |
| 715 | + "14.0 4464\n", |
| 716 | + "22.0 4455\n", |
| 717 | + "24.0 4446\n", |
726 | 718 | "Name: count, dtype: int64" |
727 | 719 | ] |
728 | 720 | }, |
|
732 | 724 | } |
733 | 725 | ], |
734 | 726 | "source": [ |
735 | | - "df.latest.value_counts()" |
| 727 | + "df.temperature.value_counts()" |
736 | 728 | ] |
737 | 729 | }, |
738 | 730 | { |
739 | 731 | "cell_type": "markdown", |
740 | 732 | "metadata": {}, |
741 | 733 | "source": [ |
742 | | - "Now we replace the missing values in the `latest` column with `0` using [DataFrame.fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html):" |
| 734 | + "Now we replace the missing values in the `temperature` column with the mean, rounded to one decimal place, using [DataFrame.fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html):" |
743 | 735 | ] |
744 | 736 | }, |
745 | 737 | { |
746 | 738 | "cell_type": "code", |
747 | 739 | "execution_count": 18, |
748 | 740 | "metadata": {}, |
749 | | - "outputs": [], |
750 | | - "source": [ |
751 | | - "df.latest = df.latest.fillna(0)" |
752 | | - ] |
753 | | - }, |
754 | | - { |
755 | | - "cell_type": "code", |
756 | | - "execution_count": 19, |
757 | | - "metadata": {}, |
758 | 741 | "outputs": [ |
759 | 742 | { |
760 | 743 | "data": { |
761 | 744 | "text/plain": [ |
762 | | - "latest\n", |
763 | | - "0.0 108033\n", |
764 | | - "1.0 38364\n", |
| 745 | + "temperature\n", |
| 746 | + "17.0 36913\n", |
| 747 | + "29.0 4688\n", |
| 748 | + "26.0 4674\n", |
| 749 | + "16.0 4656\n", |
| 750 | + "28.0 4648\n", |
| 751 | + "10.0 4632\n", |
| 752 | + "13.0 4629\n", |
| 753 | + "7.0 4624\n", |
| 754 | + "27.0 4621\n", |
| 755 | + "21.0 4585\n", |
| 756 | + "9.0 4576\n", |
| 757 | + "23.0 4571\n", |
| 758 | + "5.0 4568\n", |
| 759 | + "6.0 4563\n", |
| 760 | + "19.0 4561\n", |
| 761 | + "18.0 4557\n", |
| 762 | + "11.0 4529\n", |
| 763 | + "15.0 4525\n", |
| 764 | + "8.0 4486\n", |
| 765 | + "12.0 4484\n", |
| 766 | + "20.0 4473\n", |
| 767 | + "25.0 4469\n", |
| 768 | + "14.0 4464\n", |
| 769 | + "22.0 4455\n", |
| 770 | + "24.0 4446\n", |
765 | 771 | "Name: count, dtype: int64" |
766 | 772 | ] |
767 | 773 | }, |
768 | | - "execution_count": 19, |
| 774 | + "execution_count": 18, |
769 | 775 | "metadata": {}, |
770 | 776 | "output_type": "execute_result" |
771 | 777 | } |
772 | 778 | ], |
773 | 779 | "source": [ |
774 | | - "df.latest.value_counts()" |
| 780 | + "temp_mean = round(df.temperature.mean(), 1)\n", |
| 781 | + "fill_mean = df.temperature.fillna(temp_mean)\n", |
| 782 | + "fill_mean.value_counts()" |
775 | 783 | ] |
776 | 784 | }, |
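The mean-based replacement in the cell above can be tried on a tiny example. This is a sketch under assumptions: `temperature_demo` and its values are invented, and the names carry a `_demo` suffix to keep them apart from the notebook's variables.

```python
import pandas as pd

# Hypothetical sample; the values are invented for illustration.
temperature_demo = pd.Series([10.0, None, 15.0, None, 21.0], name="temperature")

# mean() skips missing values, so only 10, 15 and 21 are averaged.
temp_mean_demo = round(temperature_demo.mean(), 1)

# fillna() returns a new Series; the original keeps its gaps.
filled_demo = temperature_demo.fillna(temp_mean_demo)

print(temp_mean_demo)
print(filled_demo.tolist())
```

Because every gap receives the same number, that value tends to dominate the frequency table afterwards, which matches the notebook output where `17.0` moves to the top of `value_counts()`.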
777 | 785 | { |
778 | 786 | "cell_type": "markdown", |
779 | 787 | "metadata": {}, |
780 | 788 | "source": [ |
781 | | - "### 2.7 Replacing missing data with `backfill`\n", |
| 789 | + "### 2.6 Replacing missing data with `backfill`\n", |
782 | 790 | "\n", |
783 | 791 | "So that the records follow one another in chronological order, we first set `timestamp` as the index with [set_index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html):" |
784 | 792 | ] |
785 | 793 | }, |
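As a rough sketch of the idea behind this section, the following example indexes a toy frame by `timestamp` and then fills each gap with the next valid value. The data is invented, and the explicit `sort_index()` call is an assumption added here so that "next" means "later in time"; the notebook itself may rely on the data already being ordered.

```python
import pandas as pd

# Hypothetical records, deliberately out of order; the values are invented.
df_demo = pd.DataFrame(
    {
        "timestamp": [
            "2017-01-01T12:02:09",
            "2017-01-01T12:00:23",
            "2017-01-01T12:01:09",
        ],
        "temperature": [29.0, None, 12.0],
    }
)

# Use the timestamp as index and order the rows chronologically
# (ISO 8601 strings sort in time order).
ordered = df_demo.set_index("timestamp").sort_index()

# bfill() propagates the next valid observation backwards into each gap.
backfilled = ordered.bfill()

print(backfilled)
```

A gap in the last row would stay missing, because there is no later value to copy from.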
786 | 794 | { |
787 | 795 | "cell_type": "code", |
788 | | - "execution_count": 20, |
| 796 | + "execution_count": 19, |
789 | 797 | "metadata": {}, |
790 | 798 | "outputs": [], |
791 | 799 | "source": [ |
|
794 | 802 | }, |
795 | 803 | { |
796 | 804 | "cell_type": "code", |
797 | | - "execution_count": 21, |
| 805 | + "execution_count": 20, |
798 | 806 | "metadata": {}, |
799 | 807 | "outputs": [ |
800 | 808 | { |
|
878 | 886 | " <td>29.0</td>\n", |
879 | 887 | " <td>62</td>\n", |
880 | 888 | " <td>122f1c6a-403c-2221-6ed1-b5caa08f11e0</td>\n", |
881 | | - " <td>0.0</td>\n", |
| 889 | + " <td>NaN</td>\n", |
882 | 890 | " <td>NaN</td>\n", |
883 | 891 | " </tr>\n", |
884 | 892 | " <tr>\n", |
|
905 | 913 | " <td>16.0</td>\n", |
906 | 914 | " <td>76</td>\n", |
907 | 915 | " <td>7a60219f-6621-e548-180e-ca69624f9824</td>\n", |
908 | | - " <td>0.0</td>\n", |
| 916 | + " <td>NaN</td>\n", |
909 | 917 | " <td>interval</td>\n", |
910 | 918 | " </tr>\n", |
911 | 919 | " <tr>\n", |
|
932 | 940 | " <td>NaN</td>\n", |
933 | 941 | " <td>63</td>\n", |
934 | 942 | " <td>e09b6001-125d-51cf-9c3f-9cb686c19d02</td>\n", |
935 | | - " <td>0.0</td>\n", |
| 943 | + " <td>NaN</td>\n", |
936 | 944 | " <td>NaN</td>\n", |
937 | 945 | " </tr>\n", |
938 | 946 | " <tr>\n", |
|
950 | 958 | " <td>22.0</td>\n", |
951 | 959 | " <td>83</td>\n", |
952 | 960 | " <td>03e1a07b-3e14-412c-3a69-6b45bc79f81c</td>\n", |
953 | | - " <td>0.0</td>\n", |
| 961 | + " <td>NaN</td>\n", |
954 | 962 | " <td>update</td>\n", |
955 | 963 | " </tr>\n", |
956 | 964 | " <tr>\n", |
|
986 | 994 | " <td>NaN</td>\n", |
987 | 995 | " <td>63</td>\n", |
988 | 996 | " <td>b60bd7de-4057-8a85-f806-e6eec1350338</td>\n", |
989 | | - " <td>0.0</td>\n", |
| 997 | + " <td>NaN</td>\n", |
990 | 998 | " <td>interval</td>\n", |
991 | 999 | " </tr>\n", |
992 | 1000 | " <tr>\n", |
|
1004 | 1012 | " <td>11.0</td>\n", |
1005 | 1013 | " <td>69</td>\n", |
1006 | 1014 | " <td>1aef7db8-9a3e-7dc9-d7a5-781ec0efd200</td>\n", |
1007 | | - " <td>0.0</td>\n", |
| 1015 | + " <td>NaN</td>\n", |
1008 | 1016 | " <td>user</td>\n", |
1009 | 1017 | " </tr>\n", |
1010 | 1018 | " <tr>\n", |
|
1050 | 1058 | "2017-01-01T12:01:09 7256b7b0-e502-f576-62ec-ed73533c9c84 0.0 wake \n", |
1051 | 1059 | "2017-01-01T12:01:34 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 0.0 NaN \n", |
1052 | 1060 | "2017-01-01T12:02:09 NaN 0.0 update \n", |
1053 | | - "2017-01-01T12:02:36 122f1c6a-403c-2221-6ed1-b5caa08f11e0 0.0 NaN \n", |
| 1061 | + "2017-01-01T12:02:36 122f1c6a-403c-2221-6ed1-b5caa08f11e0 NaN NaN \n", |
1054 | 1062 | "2017-01-01T12:03:04 0897dbe5-9c5b-71ca-73a1-7586959ca198 0.0 interval \n", |
1055 | 1063 | "2017-01-01T12:03:51 1c07ab9b-5f66-137d-a74f-921a41001f4e 1.0 NaN \n", |
1056 | | - "2017-01-01T12:04:35 7a60219f-6621-e548-180e-ca69624f9824 0.0 interval \n", |
| 1064 | + "2017-01-01T12:04:35 7a60219f-6621-e548-180e-ca69624f9824 NaN interval \n", |
1057 | 1065 | "2017-01-01T12:05:05 a8b87754-a162-da28-2527-4bce4b3d4191 1.0 NaN \n", |
1058 | 1066 | "2017-01-01T12:05:41 585f1a3c-0679-0ffe-9132-508933c70343 0.0 wake \n", |
1059 | | - "2017-01-01T12:06:21 e09b6001-125d-51cf-9c3f-9cb686c19d02 0.0 NaN \n", |
| 1067 | + "2017-01-01T12:06:21 e09b6001-125d-51cf-9c3f-9cb686c19d02 NaN NaN \n", |
1060 | 1068 | "2017-01-01T12:06:53 607c9f6e-2bdf-a606-6d16-3004c6958436 1.0 update \n", |
1061 | | - "2017-01-01T12:07:41 03e1a07b-3e14-412c-3a69-6b45bc79f81c 0.0 update \n", |
| 1069 | + "2017-01-01T12:07:41 03e1a07b-3e14-412c-3a69-6b45bc79f81c NaN update \n", |
1062 | 1070 | "2017-01-01T12:08:08 NaN 0.0 interval \n", |
1063 | 1071 | "2017-01-01T12:08:35 NaN 0.0 wake \n", |
1064 | 1072 | "2017-01-01T12:09:05 b9890c1e-79d5-8979-63ae-6c08a4cd476a 0.0 NaN \n", |
1065 | | - "2017-01-01T12:09:48 b60bd7de-4057-8a85-f806-e6eec1350338 0.0 interval \n", |
| 1073 | + "2017-01-01T12:09:48 b60bd7de-4057-8a85-f806-e6eec1350338 NaN interval \n", |
1066 | 1074 | "2017-01-01T12:10:23 b1dacc73-c8b7-1d7d-ee02-578da781a71e 0.0 test \n", |
1067 | | - "2017-01-01T12:10:57 1aef7db8-9a3e-7dc9-d7a5-781ec0efd200 0.0 user \n", |
| 1075 | + "2017-01-01T12:10:57 1aef7db8-9a3e-7dc9-d7a5-781ec0efd200 NaN user \n", |
1068 | 1076 | "2017-01-01T12:11:34 8075d058-7dae-e2ec-d47e-58ec6d26899b 1.0 NaN " |
1069 | 1077 | ] |
1070 | 1078 | }, |
1071 | | - "execution_count": 21, |
| 1079 | + "execution_count": 20, |
1072 | 1080 | "metadata": {}, |
1073 | 1081 | "output_type": "execute_result" |
1074 | 1082 | } |
|
1086 | 1094 | }, |
1087 | 1095 | { |
1088 | 1096 | "cell_type": "code", |
1089 | | - "execution_count": 22, |
| 1097 | + "execution_count": 21, |
1090 | 1098 | "metadata": {}, |
1091 | 1099 | "outputs": [], |
1092 | 1100 | "source": [ |
|
1097 | 1105 | }, |
1098 | 1106 | { |
1099 | 1107 | "cell_type": "code", |
1100 | | - "execution_count": 23, |
| 1108 | + "execution_count": 22, |
1101 | 1109 | "metadata": {}, |
1102 | 1110 | "outputs": [ |
1103 | 1111 | { |
|
1106 | 1114 | "text": [ |
1107 | 1115 | "number missing for column temperature: 22633\n", |
1108 | 1116 | "number missing for column build: 32350\n", |
1109 | | - "number missing for column latest: 0\n", |
| 1117 | + "number missing for column latest: 32298\n", |
1110 | 1118 | "number missing for column note: 48704\n" |
1111 | 1119 | ] |
1112 | 1120 | } |
1113 | 1121 | ], |
1114 | 1122 | "source": [ |
1115 | 1123 | "for col in incomplete_columns:\n", |
1116 | 1124 | " num_missing = df[df[col].isnull() == True].shape[0]\n", |
1117 | | - " print(\"number missing for column {}: {}\".format(col, num_missing)) " |
| 1125 | + " print(f\"number missing for column {col}: {num_missing}\") " |
1118 | 1126 | ] |
1119 | 1127 | }, |
1120 | 1128 | { |
|
1149 | 1157 | "name": "python", |
1150 | 1158 | "nbconvert_exporter": "python", |
1151 | 1159 | "pygments_lexer": "ipython3", |
1152 | | - "version": "3.11.4" |
| 1160 | + "version": "3.11.10" |
1153 | 1161 | }, |
1154 | 1162 | "latex_envs": { |
1155 | 1163 | "LaTeX_envs_menu_present": true, |
|