Commit 1ac02fa

Update data processing pipeline to include device information and align column naming conventions
- Add loading of `devices.csv` to incorporate missing columns (`browser_type` and `os_type`).
- Perform a left join between the `events` and `devices` datasets using `device_id`.
- Rename columns (`browser_type` -> `browser_family`, `os_type` -> `os_family`) to align with the rest of the notebook's codebase.
- Improve dataset completeness and maintain consistency in column naming conventions.
1 parent b99a44b commit 1ac02fa
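
The join-then-rename steps listed in the commit message can be illustrated without Spark. The following is a minimal plain-Python sketch of the semantics only, not the notebook's code: the helper names `left_join` and `rename_columns` and the sample rows are hypothetical, invented for illustration. It shows the property a left join is meant to preserve here: every event row is kept, and events whose `device_id` has no match in the devices data get `None` in the device columns.

```python
# Plain-Python illustration of "left join on device_id, then rename columns".
# The sample rows below are invented; they are not taken from events.csv or devices.csv.

RENAMES = {"browser_type": "browser_family", "os_type": "os_family"}


def left_join(events, devices, key="device_id"):
    """Keep every event; attach device columns on a key match, else fill with None."""
    # Columns contributed by the right-hand side (everything except the join key).
    device_cols = sorted({col for d in devices for col in d if col != key})

    # Index devices by key; a list per key so duplicate keys fan out like a real join.
    by_key = {}
    for device in devices:
        by_key.setdefault(device[key], []).append(device)

    joined = []
    for event in events:
        # An unmatched event is paired with one empty "device" so it is still emitted.
        for match in by_key.get(event[key]) or [{}]:
            row = dict(event)
            for col in device_cols:
                row[col] = match.get(col)
            joined.append(row)
    return joined


def rename_columns(rows, renames):
    """Rename keys in every row; columns not listed in `renames` are left alone."""
    return [{renames.get(col, col): val for col, val in row.items()} for row in rows]


events = [
    {"device_id": "1", "url": "/a"},
    {"device_id": "2", "url": "/b"},  # no matching device row
]
devices = [{"device_id": "1", "browser_type": "Chrome", "os_type": "Linux"}]

result = rename_columns(left_join(events, devices), RENAMES)
for row in result:
    print(row)
```

The second event survives the join with `browser_family` and `os_family` set to `None`, which is why downstream cells that filter or group on those columns may need to handle nulls.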

1 file changed: +13 -9 lines changed

bootcamp/materials/3-spark-fundamentals/notebooks/event_data_pyspark.ipynb

Lines changed: 13 additions & 9 deletions
@@ -2,7 +2,7 @@
 "cells": [
 {
 "cell_type": "code",
-"execution_count": 7,
+"execution_count": null,
 "id": "81cca085-dba2-42eb-a13b-fa64b6e86583",
 "metadata": {},
 "outputs": [
@@ -46,7 +46,11 @@
 "\n",
 "spark\n",
 "\n",
-"df = spark.read.option(\"header\", \"true\").csv(\"/home/iceberg/data/events.csv\").withColumn(\"event_date\", expr(\"DATE_TRUNC('day', event_time)\"))\n",
+"events = spark.read.option(\"header\", \"true\").csv(\"/home/iceberg/data/events.csv\").withColumn(\"event_date\", expr(\"DATE_TRUNC('day', event_time)\"))\n",
+"devices = spark.read.option(\"header\",\"true\").csv(\"/home/iceberg/data/devices.csv\")\n",
+"\n",
+"df = events.join(devices,on=\"device_id\",how=\"left\")\n",
+"df = df.withColumnsRenamed({'browser_type': 'browser_family', 'os_type': 'os_family'})\n",
 "\n",
 "df.show()"
 ]
@@ -414,17 +418,17 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"outputs": [],
-"source": [
-"%%sql \n",
-"SELECT COUNT(1) FROM bootcamp.matches_bucketed.files"
-],
 "metadata": {
 "collapsed": false,
 "pycharm": {
 "name": "#%%\n"
 }
-}
+},
+"outputs": [],
+"source": [
+"%%sql \n",
+"SELECT COUNT(1) FROM bootcamp.matches_bucketed.files"
+]
 },
 {
 "cell_type": "code",
@@ -489,4 +493,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 5
-}
+}
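
The rename step in this diff calls `DataFrame.withColumnsRenamed`, which PySpark added in version 3.4. If the notebook has to run on an older Spark, one possible fallback is to chain the single-column `withColumnRenamed` call instead. This is a sketch under that assumption; the helper name `rename_columns_compat` is hypothetical and is not part of the commit.

```python
from functools import reduce

# Mapping taken from the commit: old column name -> new column name.
RENAMES = {"browser_type": "browser_family", "os_type": "os_family"}


def rename_columns_compat(df, renames):
    """Apply each rename in turn using DataFrame.withColumnRenamed(existing, new).

    `df` is expected to be a PySpark DataFrame (or any object exposing the same
    withColumnRenamed method); a new object is returned, `df` itself is not modified.
    """
    return reduce(
        lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
        renames.items(),
        df,
    )
```

In the notebook cell this would replace the `withColumnsRenamed` line with `df = rename_columns_compat(df, RENAMES)`. As with the original call, a name in the mapping that does not exist in the DataFrame is silently ignored by `withColumnRenamed`.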
