Commit 20a0dda

Initial Commit
1 parent a046f4e commit 20a0dda

6 files changed

+388
-0
lines changed


polars-missing-data/README.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+These files will allow you to work along with the [How to Deal With Missing Data in Polars](https://realpython.com/how-to-deal-with-polars-missing-data/) tutorial.
+
+The files are:
+
+tutorial_code.ipynb - Contains the code you see in the tutorial.
+tips.parquet - Parquet file containing tips information.
+sales_trends.csv - CSV file containing sales trend data.
+ft_exercise.parquet - Parquet file containing the data used in the consolidation exercise.
+ft_exercise_solution.csv - CSV file containing the solution to the consolidation exercise.
+
polars-missing-data/ft_exercise.parquet

1.83 KB
Binary file not shown.
polars-missing-data/ft_exercise_solution.csv

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+episode,series,title,original_date
+1,1,A Touch of Class,1975-09-19
+2,1,The Builders,1975-09-26
+3,1,The Wedding Party,1975-10-03
+4,1,The Hotel Inspectors,1975-10-10
+5,1,Gourmet Night,1975-10-17
+6,1,The Germans,1975-10-24
+7,2,Communication Problems,1979-02-19
+8,2,The Psychiatrist,1979-02-26
+9,2,Waldorf Salad,1979-03-05
+10,2,The Kipper and the Corpse,1979-03-12
+11,2,The Anniversary,1979-03-26
+12,2,Basil the Rat,1979-10-25
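The notebook's exercise solution fills the missing `original_date` with `interpolate()`. As a rough illustration of what linear interpolation means for dates, here is a plain-Python sketch that needs only the standard library. The helper name `interpolate_dates` and the choice to blank out episode 4 are assumptions made for the example; the commit does not show which value is missing from `ft_exercise.parquet`, and Polars may round intermediate values differently.

```python
from datetime import date, timedelta


def interpolate_dates(values):
    """Linearly fill runs of None that have a known date on both sides."""
    filled = list(values)
    known = [i for i, value in enumerate(filled) if value is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        if gap == 1:
            continue  # Adjacent known dates, nothing to fill.
        step_days = (filled[right] - filled[left]).days / gap
        for offset in range(1, gap):
            filled[left + offset] = filled[left] + timedelta(
                days=round(step_days * offset)
            )
    return filled


# Series 1 air dates from ft_exercise_solution.csv.
# Blanking episode 4 is a hypothetical gap chosen for illustration.
series_one = [
    date(1975, 9, 19),
    date(1975, 9, 26),
    date(1975, 10, 3),
    None,
    date(1975, 10, 17),
    date(1975, 10, 24),
]

restored = interpolate_dates(series_one)
print(restored[3])  # 1975-10-10, matching episode 4 in the solution file
```

Because the episodes aired weekly, the midpoint between the neighbouring dates lands exactly on the date listed in the solution file. Gaps at the very start or end of the list are left as `None`, since there is no second date to interpolate from.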
polars-missing-data/sales_trends.csv

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+product,last_year,current_year,next_year
+A,17,19,29.0
+B,35,35,NaN
+C,21,19,
+D,42,50,-inf
+E,23,25,inf
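This file deliberately mixes four kinds of unusable `next_year` values: `NaN`, an empty field, `-inf`, and `inf`. The notebook handles them by turning the non-finite values into nulls and then filling each null with `current_year + (current_year - last_year)`. The sketch below reproduces that arithmetic with the standard library only, so the expected numbers can be checked without Polars installed. The function name `clean_next_year` is an assumption for illustration and is not part of the tutorial code.

```python
import csv
import io
import math

# Contents of sales_trends.csv as added in this commit.
SALES_TRENDS_CSV = """product,last_year,current_year,next_year
A,17,19,29.0
B,35,35,NaN
C,21,19,
D,42,50,-inf
E,23,25,inf
"""


def clean_next_year(csv_text):
    """Return {product: next_year}, replacing unusable values with a trend."""
    cleaned = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row["next_year"]
        # An empty field is treated as missing; float() parses NaN and +/-inf.
        value = float(raw) if raw else None
        if value is None or not math.isfinite(value):
            last = int(row["last_year"])
            current = int(row["current_year"])
            # Same formula as the notebook: continue last year's change.
            value = float(current + (current - last))
        cleaned[row["product"]] = value
    return cleaned


forecast = clean_next_year(SALES_TRENDS_CSV)
print(forecast)
# {'A': 29.0, 'B': 35.0, 'C': 17.0, 'D': 58.0, 'E': 27.0}
```

Only product A keeps its original value. The other four rows are all replaced by the extrapolated figure.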

polars-missing-data/tips.parquet

3.96 KB
Binary file not shown.
polars-missing-data/tutorial_code.ipynb

Lines changed: 359 additions & 0 deletions
@@ -0,0 +1,359 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "e47899a3-0806-41d6-a71f-738e7ef9d8d3",
+   "metadata": {},
+   "source": [
+    "# Introduction"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a3c5250e-010c-4b4c-a5fa-43a3cc86df30",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!python -m pip install polars"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "22c85bb2-8b10-4075-ab58-3b212f1ed050",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import polars as pl\n",
+    "\n",
+    "tips = pl.scan_parquet(\"tips.parquet\")\n",
+    "\n",
+    "tips.null_count().collect()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c94a5e17-883a-4728-ac18-e4381b793182",
+   "metadata": {},
+   "source": [
+    "# How to Work With Missing Data in Polars"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "11bc9817-6c80-492d-8846-48451e68fcb1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import polars as pl\n",
+    "\n",
+    "tips = pl.scan_parquet(\"tips.parquet\")\n",
+    "\n",
+    "tips.filter(pl.col(\"total\").is_null() & pl.col(\"tip\").is_null()).collect()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8b7de256-b058-4b6d-b802-822019b0b7eb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "(\n",
+    "    tips.drop_nulls(\"total\")\n",
+    "    .with_columns(pl.col(\"tip\").fill_null(0))\n",
+    "    .filter(pl.col(\"tip\").is_null())\n",
+    ").collect()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c628e41c-fc20-4a56-85ea-9ff631e8d614",
+   "metadata": {},
+   "source": [
+    "# Using a More Strategic Approach"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "10fd34e7-e94e-47f1-b9da-533b0550c9b7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import polars as pl\n",
+    "\n",
+    "tips = pl.scan_parquet(\"tips.parquet\")\n",
+    "\n",
+    "tips.filter(pl.col(\"time\").is_null()).collect()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a84196c9-5032-4650-83dd-176319b6eed5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tips.filter(pl.col(\"record_id\").is_in([2, 3, 4, 14, 15, 16])).collect()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "acfdafa7-c9e0-49cc-8b1e-e4366ce2ac59",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "(\n",
+    "    tips.drop_nulls(\"total\")\n",
+    "    .with_columns(pl.col(\"tip\").fill_null(0))\n",
+    "    .with_columns(\n",
+    "        pl.when(pl.col(\"record_id\") == 2)\n",
+    "        .then(pl.col(\"time\").fill_null(strategy=\"forward\"))\n",
+    "        .otherwise(pl.col(\"time\").fill_null(strategy=\"backward\"))\n",
+    "    )\n",
+    "    .filter(pl.col(\"record_id\").is_in([3, 15]))\n",
+    ").collect()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9c007132-c939-47b7-84b6-bf89c3da74a2",
+   "metadata": {},
+   "source": [
+    "# Dealing With Nulls Across Multiple Columns"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "19504937-9a8b-48c9-b504-62db2bff178c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tips = pl.scan_parquet(\"tips.parquet\")\n",
+    "\n",
+    "(tips.filter(pl.all_horizontal(pl.col(\"total\", \"tip\").is_null()))).collect()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "91b280c0-f7f7-4874-86b6-df349b8b6927",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tips = pl.scan_parquet(\"tips.parquet\")\n",
+    "\n",
+    "(tips.filter(pl.all_horizontal(pl.col(\"total\", \"tip\").is_null()))).collect()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0d5ba705-e675-4935-8aab-958a539bd66a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "(tips.filter(~pl.all_horizontal(pl.col(\"total\", \"tip\").is_null()))).collect()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "29a6aab6-edb5-42cc-998b-7bd82f45ce8c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import polars as pl\n",
+    "\n",
+    "tips = pl.scan_parquet(\"tips.parquet\")\n",
+    "\n",
+    "(\n",
+    "    tips.filter(~pl.all_horizontal(pl.col(\"total\", \"tip\").is_null()))\n",
+    "    .with_columns(pl.col(\"tip\").fill_null(0))\n",
+    "    .with_columns(\n",
+    "        pl.when(pl.col(\"record_id\") == 2)\n",
+    "        .then(pl.col(\"time\").fill_null(strategy=\"forward\"))\n",
+    "        .otherwise(pl.col(\"time\").fill_null(strategy=\"backward\"))\n",
+    "    )\n",
+    ").null_count().collect()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "32c00cbe-e300-4fd8-9a1e-f40371528fef",
+   "metadata": {},
+   "source": [
+    "# Dealing With Nulls by Column Data Type"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2e29d50f-b9f8-4545-b954-040490e6f15c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import polars as pl\n",
+    "\n",
+    "scientists = pl.LazyFrame(\n",
+    "    {\n",
+    "        \"scientist_id\": [1, 2, 3, 4, 5],\n",
+    "        \"first_name\": [\"Isaac\", \"Louis\", None, \"Charles\", \"Marie\"],\n",
+    "        \"last_name\": [None, \"Pasteur\", \"Einstein\", \"Darwin\", \"Curie\"],\n",
+    "        \"birth_year\": [1642, 1822, None, 1809, 1867],\n",
+    "        \"death_year\": [1726, 1895, 1955, None, 1934],\n",
+    "    }\n",
+    ")\n",
+    "\n",
+    "scientists.collect()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a6a5a990-d2cf-4dd2-8021-1a59e27c64d2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import polars.selectors as cs\n",
+    "\n",
+    "(\n",
+    "    scientists.with_columns(cs.string().fill_null(\"Unknown\")).with_columns(\n",
+    "        cs.integer().fill_null(0)\n",
+    "    )\n",
+    ").collect()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f211113b-6988-4cf5-a0e9-c1c625b00148",
+   "metadata": {},
+   "source": [
+    "# Dealing With Those Pesky NaNs and infs"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8b706a22-cc6a-49c9-858c-69bb3f72cb48",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import polars as pl\n",
+    "\n",
+    "sales_trends = pl.scan_csv(\"sales_trends.csv\")\n",
+    "\n",
+    "sales_trends.collect()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5cde06c9-1a4c-45da-991d-cda5cd27542c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "(\n",
+    "    sales_trends.with_columns(\n",
+    "        pl.col(\"next_year\").replace(\n",
+    "            [float(\"inf\"), -float(\"inf\"), float(\"NaN\")], None\n",
+    "        )\n",
+    "    )\n",
+    ").collect()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "babf6ca8-101f-40f8-8224-426eeece5a81",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "(\n",
+    "    sales_trends.with_columns(\n",
+    "        pl.col(\"next_year\").replace(\n",
+    "            [float(\"inf\"), -float(\"inf\"), float(\"NaN\")], None\n",
+    "        )\n",
+    "    ).with_columns(\n",
+    "        pl.col(\"next_year\").fill_null(\n",
+    "            pl.col(\"current_year\")\n",
+    "            + (pl.col(\"current_year\") - pl.col(\"last_year\"))\n",
+    "        )\n",
+    "    )\n",
+    ").collect()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "903c4028-c3af-49ba-be08-e98afa785c09",
+   "metadata": {},
+   "source": [
+    "# Practicing Your Skills - Solution"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d564123d-42da-462b-a52a-c6a815e59b0d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import polars as pl\n",
+    "\n",
+    "episodes = pl.scan_parquet(\"ft_exercise.parquet\")\n",
+    "\n",
+    "episodes.null_count().collect()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "000b53ba-c5d3-4a75-89d7-86c36881a078",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import polars as pl\n",
+    "\n",
+    "episodes = pl.scan_parquet(\"ft_exercise.parquet\")\n",
+    "\n",
+    "episodes.with_columns(\n",
+    "    pl.when(pl.col(\"episode\") == 6)\n",
+    "    .then(pl.col(\"series\").fill_null(strategy=\"forward\"))\n",
+    "    .otherwise(pl.col(\"series\").fill_null(strategy=\"backward\"))\n",
+    ").with_columns(\n",
+    "    pl.when(pl.col(\"episode\") == 4)\n",
+    "    .then(pl.col(\"title\").fill_null(\"The Hotel Inspectors\"))\n",
+    "    .otherwise(pl.col(\"title\").fill_null(\"Waldorf Salad\"))\n",
+    ").with_columns(\n",
+    "    pl.col(\"original_date\").interpolate()\n",
+    ").null_count().collect()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
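Several notebook cells choose between `fill_null(strategy="forward")` and `fill_null(strategy="backward")` row by row. For readers reviewing this commit without running the notebook, the following standard-library sketch shows what the two strategies mean on a plain list. The helper names and the sample `times` values are assumptions for illustration; they are not taken from `tips.parquet`.

```python
def fill_forward(values):
    """Replace each None with the most recent non-None value before it."""
    filled, last_seen = [], None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


def fill_backward(values):
    """Replace each None with the next non-None value after it."""
    return fill_forward(values[::-1])[::-1]


# Hypothetical column with gaps; None plays the role of a Polars null.
times = ["Dinner", None, "Lunch", None, None, "Dinner"]

print(fill_forward(times))
# ['Dinner', 'Dinner', 'Lunch', 'Lunch', 'Lunch', 'Dinner']
print(fill_backward(times))
# ['Dinner', 'Lunch', 'Lunch', 'Dinner', 'Dinner', 'Dinner']
```

The two strategies disagree wherever the values on either side of a gap differ, which is why the notebook picks one strategy for a specific `record_id` and the other for the remaining rows. A gap with no earlier value (forward) or no later value (backward) stays unfilled.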
