Commit 39d8b4f

Update to UC
1 parent 71428e3 commit 39d8b4f

7 files changed (+118, -143 lines changed)
01_intro.py

Lines changed: 15 additions & 10 deletions
@@ -13,20 +13,20 @@
 
 # MAGIC %md
 # MAGIC ## Overview
-# MAGIC
+# MAGIC
 # MAGIC Behind the growth of every consumer-facing product is the acquisition and retention of an engaged user base. When it comes to acquisition, the goal is to attract high quality users as cost effectively as possible. With marketing dollars dispersed across a wide array of campaigns, channels, and creatives, however, measuring effectiveness is a challenge. In other words, it's difficult to know how to assign credit where credit is due. Enter multi-touch attribution. With multi-touch attribution, credit can be assigned in a variety of ways, but at a high level it's typically done using one of two methods: `heuristic` or `data-driven`.
-# MAGIC
+# MAGIC
 # MAGIC * Broadly speaking, heuristic methods are rule-based and consist of both `single-touch` and `multi-touch` approaches. Single-touch methods, such as `first-touch` and `last-touch`, assign credit to the first channel, or the last channel, associated with a conversion. Multi-touch methods, such as `linear` and `time-decay`, assign credit to multiple channels associated with a conversion. In the case of linear, credit is assigned uniformly across all channels, whereas for time-decay, an increasing amount of credit is assigned to the channels that appear closer in time to the conversion event.
-# MAGIC
+# MAGIC
 # MAGIC * In contrast to heuristic methods, data-driven methods determine assignment using probabilities and statistics. Examples of data-driven methods include `Markov Chains` and `SHAP`. In this series of notebooks, we cover the use of Markov Chains and include a comparison to a few heuristic methods.
 
 # COMMAND ----------
 
 # MAGIC %md
 # MAGIC ## About This Series of Notebooks
-# MAGIC
+# MAGIC
 # MAGIC * This series of notebooks is intended to help you use multi-touch attribution to optimize your marketing spend.
-# MAGIC
+# MAGIC
 # MAGIC * In support of this goal, we will:
 # MAGIC * Generate synthetic ad impression and conversion data.
 # MAGIC * Create a streaming pipeline for processing ad impression and conversion data in near real-time.
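Editor's note: for readers skimming this diff, here is a minimal, self-contained sketch of the heuristic credit-assignment rules described in the hunk above (first-touch, last-touch, linear, time-decay). The helper and the channel names are hypothetical illustrations; this code is not part of the commit or of the notebooks.

# Hypothetical illustration of the heuristic attribution rules described above.
# Not part of this commit; the notebooks compute attribution with Spark SQL and Markov Chains.
def heuristic_credit(path, method="linear", decay=0.5):
    """Assign conversion credit across an ordered list of channels."""
    n = len(path)
    if n == 0:
        return {}
    if method == "first_touch":
        weights = [1.0] + [0.0] * (n - 1)
    elif method == "last_touch":
        weights = [0.0] * (n - 1) + [1.0]
    elif method == "linear":
        weights = [1.0 / n] * n
    elif method == "time_decay":
        # Channels closer to the conversion (end of the path) receive more credit.
        raw = [decay ** (n - 1 - i) for i in range(n)]
        total = sum(raw)
        weights = [w / total for w in raw]
    else:
        raise ValueError(f"unknown method: {method}")
    credit = {}
    for channel, w in zip(path, weights):
        credit[channel] = credit.get(channel, 0.0) + w
    return credit

# Example channel names are made up for illustration.
print(heuristic_credit(["Social", "Search", "Email"], method="time_decay"))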
@@ -75,16 +75,16 @@
 # MAGIC </tr>
 # MAGIC </tbody>
 # MAGIC </table>
-# MAGIC
+# MAGIC
 # MAGIC * In the following sections, you will generate this synthetic dataset and then process it using Structured Streaming. You will then apply additional transformations so that it is suitable to use with Markov Chains.
-# MAGIC
+# MAGIC
 # MAGIC * **Note:** Default settings are used to generate this data set. After working through this series of notebooks for the first time, you may want to customize these settings for additional exploration. Please note that if you do so, commentary in the notebooks may not line up with the newly generated data.
 
 # COMMAND ----------
 
 # MAGIC %md
 # MAGIC ## Step 1: Configure the Environment
-# MAGIC
+# MAGIC
 # MAGIC In this step, we will:
 # MAGIC 1. Import libraries
 # MAGIC 2. Run the `99_utils` notebook to gain access to the function `get_params`
@@ -151,7 +151,7 @@
 
 # MAGIC %md
 # MAGIC ## Step 2: Generate the Data
-# MAGIC
+# MAGIC
 # MAGIC In this step, we will:
 # MAGIC 1. Define the functions that will be used to generate the synthetic data
 # MAGIC 2. Call `data_gen` to generate the synthetic data
@@ -255,6 +255,11 @@ def data_gen(data_gen_path):
 
 # COMMAND ----------
 
+generated_data = spark.read.format('csv').option('header','true').load(raw_data_path)
+display(generated_data.groupBy('interaction','channel').agg({'conversion':'sum'}))
+
+# COMMAND ----------
+
 # MAGIC %md
 # MAGIC ## Next Steps
 # MAGIC * In the next notebook, we will load the data we generated here into [Delta](https://docs.databricks.com/delta/delta-intro.html) tables.
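Editor's note: the new cell above sums conversions by interaction and channel. A related, hedged sanity check one could run on the same `generated_data` dataframe is conversion rate per channel; the column names come from the cell above, the aggregation itself is only an illustration and not part of this commit.

# Illustrative only: conversion rate by channel, using the dataframe created in the new cell above.
from pyspark.sql import functions as F

conversion_rate_by_channel = (
    generated_data
      .groupBy('channel')
      .agg(F.avg(F.col('conversion').cast('double')).alias('conversion_rate'))
      .orderBy(F.desc('conversion_rate'))
)
display(conversion_rate_by_channel)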
@@ -263,7 +268,7 @@ def data_gen(data_gen_path):
 
 # MAGIC %md
 # MAGIC Copyright Databricks, Inc. [2021]. The source in this notebook is provided subject to the [Databricks License](https://databricks.com/db-license-source). All included or referenced third party libraries are subject to the licenses set forth below.
-# MAGIC
+# MAGIC
 # MAGIC |Library Name|Library license | Library License URL | Library Source URL |
 # MAGIC |---|---|---|---|
 # MAGIC |Matplotlib|Python Software Foundation (PSF) License |https://matplotlib.org/stable/users/license.html|https://github.com/matplotlib/matplotlib|

02_load_data.py

Lines changed: 16 additions & 59 deletions
@@ -13,7 +13,7 @@
 
 # MAGIC %md
 # MAGIC ## Overview
-# MAGIC
+# MAGIC
 # MAGIC ### In this notebook you:
 # MAGIC * Use `Databricks Autoloader` to import the ad impression and conversion data generated in the notebook `01_intro`.
 # MAGIC * Write the data out in `Delta` format.
@@ -23,7 +23,7 @@
 
 # MAGIC %md
 # MAGIC ## Step 1: Configure the Environment
-# MAGIC
+# MAGIC
 # MAGIC In this step, we will:
 # MAGIC 1. Import libraries
 # MAGIC 2. Run `utils` notebook to gain access to the function `get_params`
@@ -63,6 +63,7 @@
 # COMMAND ----------
 
 params = get_params()
+catalog_name = params['catalog_name']
 database_name = params['database_name']
 raw_data_path = params['raw_data_path']
 bronze_tbl_path = params['bronze_tbl_path']
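Editor's note: the commit threads a new `catalog_name` parameter through the config. As a hypothetical illustration of the shape `get_params` might return after this change, the keys below are taken from the diff while the values are invented placeholders; the real defaults live in `99_config` / `99_utils`.

# Hypothetical example of the dict returned by get_params() after this change.
# Keys appear in the diff; the values are placeholders, not the real defaults.
params = {
    'catalog_name': 'main',                      # Unity Catalog catalog (placeholder)
    'database_name': 'multi_touch_attribution',  # schema inside the catalog (placeholder)
    'raw_data_path': '/tmp/multi_touch_attribution/raw',
    'bronze_tbl_path': '/tmp/multi_touch_attribution/bronze',
}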
@@ -71,7 +72,7 @@
 
 # MAGIC %md
 # MAGIC ## Step 2: Load Data using Databricks Auto Loader
-# MAGIC
+# MAGIC
 # MAGIC In this step, we will:
 # MAGIC 1. Define the schema of the synthetic data generated in `01_intro`
 # MAGIC 2. Read the synthetic data into a dataframe using Auto Loader
@@ -82,9 +83,9 @@
 # MAGIC %md
 # MAGIC But, what is Auto Loader?
 # MAGIC * Auto Loader incrementally and efficiently loads new data files as they arrive in [S3](https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html) or [Azure Blob Storage](https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/auto-loader). This is enabled by providing a Structured Streaming source called `cloudFiles`.
-# MAGIC
+# MAGIC
 # MAGIC * Auto Loader internally keeps track of what files have been processed to provide exactly-once semantics, so you do not need to manage any state information yourself.
-# MAGIC
+# MAGIC
 # MAGIC * Auto Loader supports two modes for detecting when new files arrive:
 # MAGIC
 # MAGIC * `Directory listing:` Identifies new files by parallel listing of the input directory. Quick to get started since no permission configurations are required. Suitable for scenarios where only a few files need to be streamed in on a regular basis.
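Editor's note: the next hunk shows only the tail of the Auto Loader read. For context, here is a minimal sketch of a complete `cloudFiles` read of the raw CSV files. The `schema` and `raw_data_path` names follow the surrounding diff; any option values not visible in the diff are assumptions, not the notebook's exact settings.

# Minimal Auto Loader sketch; not the notebook's exact cell.
raw_data_df = (
    spark.readStream
         .format("cloudFiles")
         .option("cloudFiles.format", "csv")
         .option("cloudFiles.includeExistingFiles", "true")
         .schema(schema)
         .load(raw_data_path)
)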
@@ -118,7 +119,7 @@
 .option("cloudFiles.region", "us-west-2") \
 .option("cloudFiles.includeExistingFiles", "true") \
 .schema(schema) \
-.load(raw_data_path)
+.load(raw_data_path)
 
 # COMMAND ----------
 
@@ -134,12 +135,12 @@
 
 # MAGIC %md
 # MAGIC ## Step 3: Write Data to Delta Lake
-# MAGIC
+# MAGIC
 # MAGIC In this section of the solution accelerator, we write our data out to [Delta Lake](https://delta.io/) and then create a table (and database) for easy access and queryability.
-# MAGIC
+# MAGIC
 # MAGIC * Delta Lake is an open-source project that enables building a **Lakehouse architecture** on top of existing storage systems such as S3, ADLS, GCS, and HDFS.
 # MAGIC * Information on the **Lakehouse Architecture** can be found in this [paper](http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf) that was presented at [CIDR 2021](http://cidrdb.org/cidr2021/index.html) and in this [video](https://www.youtube.com/watch?v=RU2dXoVU8hY).
-# MAGIC
+# MAGIC
 # MAGIC * Key features of Delta Lake include:
 # MAGIC * **ACID Transactions**: Ensures data integrity and read consistency with complex, concurrent data pipelines.
 # MAGIC * **Unified Batch and Streaming Source and Sink**: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
@@ -157,58 +158,14 @@
 
 # COMMAND ----------
 
-raw_data_df.writeStream.format("delta") \
-.trigger(once=True) \
-.option("checkpointLocation", bronze_tbl_path+"/checkpoint") \
-.start(bronze_tbl_path) \
-.awaitTermination()
-
-# COMMAND ----------
-
-# MAGIC %md
-# MAGIC ## Step 4: Create Database
-
-# COMMAND ----------
-
-# Delete the old database and tables if needed
-_ = spark.sql('DROP DATABASE IF EXISTS {} CASCADE'.format(database_name))
-
-# Create database to house tables
-_ = spark.sql('CREATE DATABASE {}'.format(database_name))
-
-# COMMAND ----------
-
-# MAGIC %md
-# MAGIC ## Step 5: Create bronze-level table in Delta format
-# MAGIC
-# MAGIC * **Note:** this step will produce an exception if it is run before writeStream in step 3 is initialized.
-# MAGIC
-# MAGIC * The nomenclature of bronze, silver, and gold tables correspond with a commonly used data modeling approach known as multi-hop architecture.
-# MAGIC * Additional information about this pattern can be found [here](https://databricks.com/blog/2019/08/14/productionizing-machine-learning-with-delta-lake.html).
-
-# COMMAND ----------
-
-# Create bronze table
-_ = spark.sql('''
-CREATE TABLE `{}`.bronze
-USING DELTA
-LOCATION '{}'
-'''.format(database_name,bronze_tbl_path))
+dbutils.fs.rm(bronze_tbl_path+"/checkpoint", recurse=True)
 
 # COMMAND ----------
 
-# MAGIC %md
-# MAGIC ## Step 6: View the bronze table
-# MAGIC
-# MAGIC Using `spark.table` here enables use of Python. An alternative approach is to query the data directly using SQL. This will be shown in the `03_data_prep` notebook.
-
-# COMMAND ----------
-
-bronze_tbl = spark.table("{}.bronze".format(database_name))
-
-# COMMAND ----------
-
-display(bronze_tbl)
+raw_data_df.writeStream.format("delta") \
+.trigger(availableNow=True) \
+.option("checkpointLocation", bronze_tbl_path+"/checkpoint") \
+.toTable(f"{catalog_name}.{database_name}.bronze")
 
 # COMMAND ----------
 
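Editor's note: with the stream now writing to a managed Unity Catalog table via `toTable` instead of a path, the removed "view the bronze table" cells have a natural replacement using the three-level namespace. The sketch below is illustrative and not part of this commit.

# Illustrative read-back of the bronze table written above, using the
# catalog.schema.table namespace introduced by this commit.
bronze_tbl = spark.table(f"{catalog_name}.{database_name}.bronze")
display(bronze_tbl.limit(10))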
@@ -220,7 +177,7 @@
 
 # MAGIC %md
 # MAGIC Copyright Databricks, Inc. [2021]. The source in this notebook is provided subject to the [Databricks License](https://databricks.com/db-license-source). All included or referenced third party libraries are subject to the licenses set forth below.
-# MAGIC
+# MAGIC
 # MAGIC |Library Name|Library license | Library License URL | Library Source URL |
 # MAGIC |---|---|---|---|
 # MAGIC |Matplotlib|Python Software Foundation (PSF) License |https://matplotlib.org/stable/users/license.html|https://github.com/matplotlib/matplotlib|

03_prep_data.py

Lines changed: 25 additions & 26 deletions
@@ -13,7 +13,7 @@
 
 # MAGIC %md
 # MAGIC ## Overview
-# MAGIC
+# MAGIC
 # MAGIC ### In this notebook you:
 # MAGIC * Create a gold_user_journey table
 # MAGIC * Optimize the gold_user_journey table using z-ordering
@@ -26,7 +26,7 @@
 
 # MAGIC %md
 # MAGIC ## Step 1: Configure the Environment
-# MAGIC
+# MAGIC
 # MAGIC In this step, we will:
 # MAGIC 1. Import libraries
 # MAGIC 2. Run `utils` notebook to gain access to the functions `get_params`
@@ -58,7 +58,7 @@
 
 # MAGIC %md
 # MAGIC ##### Step 1.3: `get_params` and store values in variables
-# MAGIC
+# MAGIC
 # MAGIC * Three of the parameters returned by `get_params` are used in this notebook. For convenience, we will store the values for these parameters in new variables.
 # MAGIC * **database_name:** the name of the database created in notebook `02_load_data`. The default value can be overridden in the notebook `99_config`
 # MAGIC * **gold_user_journey_tbl_path:** the path used in `03_prep_data` to write out gold-level user journey data in delta format.
@@ -67,6 +67,7 @@
 # COMMAND ----------
 
 params = get_params()
+catalog_name = params['catalog_name']
 database_name = params['database_name']
 gold_user_journey_tbl_path = params['gold_user_journey_tbl_path']
 gold_attribution_tbl_path = params['gold_attribution_tbl_path']
@@ -79,13 +80,14 @@
 
 # COMMAND ----------
 
-_ = spark.sql("use {}".format(database_name))
+_ = spark.sql("use catalog {}".format(catalog_name))
+_ = spark.sql("use schema {}".format(database_name))
 
 # COMMAND ----------
 
 # MAGIC %md
 # MAGIC ## Step 2: Create a Gold-level User Journey Table
-# MAGIC
+# MAGIC
 # MAGIC In this step, we will:
 # MAGIC 1. Create a user journey temporary view
 # MAGIC 2. View the user journey data
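Editor's note: the switch from `USE <database>` to `USE CATALOG` / `USE SCHEMA` is the core Unity Catalog change in this notebook. An equivalent, fully qualified alternative is sketched below; it is illustrative only and uses the `catalog_name` and `database_name` variables from the hunk above.

# Illustrative alternative to setting the session catalog and schema:
# reference a table by its full three-level name instead.
bronze = spark.table(f"{catalog_name}.{database_name}.bronze")
print(bronze.count())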
@@ -101,7 +103,7 @@
 # MAGIC * `first_interaction`: the first channel that an impression for a given campaign was delivered on for a given user.
 # MAGIC * `last_interaction`: the last channel that an impression for a given campaign was delivered on for a given user.
 # MAGIC * `conversion`: boolean indicating whether the given user has converted (1) or not (0).
-# MAGIC
+# MAGIC
 # MAGIC * This query is used to create a temporary view. The temporary view will be used in `Step 2.3` to create a table.
 
 # COMMAND ----------
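Editor's note: the user-journey query itself is outside this hunk. As a hedged sketch of how fields like `first_interaction` and `last_interaction` are typically derived from the bronze impression and conversion events: the column names follow the bullets above, while the `time` column and the grouping logic are assumptions, not the notebook's exact query.

# Hypothetical sketch of a user-journey aggregation; not the notebook's exact query.
# Assumes the bronze table has columns: uid, time, channel, conversion.
user_journey_sketch = spark.sql("""
  SELECT
    uid,
    array_join(transform(array_sort(collect_list(struct(time, channel))), x -> x.channel), ' > ') AS path,
    min_by(channel, time) AS first_interaction,
    max_by(channel, time) AS last_interaction,
    max(conversion)       AS conversion
  FROM bronze
  GROUP BY uid
""")
display(user_journey_sketch)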
@@ -162,12 +164,10 @@
 
 # COMMAND ----------
 
-_ = spark.sql('''
-CREATE TABLE IF NOT EXISTS `{}`.gold_user_journey
-USING DELTA
-LOCATION '{}'
-AS SELECT * from user_journey_view
-'''.format(database_name, gold_user_journey_tbl_path))
+# MAGIC %sql
+# MAGIC CREATE TABLE IF NOT EXISTS gold_user_journey
+# MAGIC USING DELTA
+# MAGIC AS SELECT * from user_journey_view
 
 # COMMAND ----------
 
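Editor's note: note the dropped `LOCATION` clause above; under Unity Catalog the gold table becomes a managed table rather than an external table at `gold_user_journey_tbl_path`. A quick way to confirm where the data actually lives is the standard Delta command below; it is illustrative and not part of this commit.

# Shows table metadata for the new table, including its storage location and format.
display(spark.sql("DESCRIBE DETAIL gold_user_journey"))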
@@ -179,9 +179,9 @@
 # MAGIC %md
 # MAGIC ## Step 3: Optimize the gold_user_journey table
 # MAGIC * [Z-Ordering](https://docs.databricks.com/delta/optimizations/file-mgmt.html#z-ordering-multi-dimensional-clustering) is a technique used to co-locate related information into the same set of files. This co-locality is automatically used by Delta Lake's data-skipping algorithms to dramatically reduce the amount of data that needs to be read. The less data that needs to be read, the quicker query results are returned.
-# MAGIC
+# MAGIC
 # MAGIC * In practice, Z-ordering is most suitable for high-cardinality columns that you frequently want to filter on.
-# MAGIC
+# MAGIC
 # MAGIC * Please note that the data set we are using here is relatively small and Z-ordering is likely unnecessary. It has been included, however, for illustration purposes.
 
 # COMMAND ----------
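Editor's note: the `OPTIMIZE` cell itself falls outside this hunk. For reference, a hedged sketch of the kind of command this step runs is shown below; the choice of `uid` as the Z-order column is an assumption based on the table description above, not necessarily the notebook's exact choice.

# Illustrative only; the notebook's actual OPTIMIZE cell is not shown in this diff.
# Z-order on a high-cardinality column that is frequently filtered on.
spark.sql("OPTIMIZE gold_user_journey ZORDER BY (uid)")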
@@ -193,14 +193,14 @@
 
 # MAGIC %md
 # MAGIC ## Step 4: Create gold-level attribution summary table
-# MAGIC
+# MAGIC
 # MAGIC In the `gold_user_journey` table that we just created in the previous step, we captured the values for `first_interaction` and `last_interaction` in their own respective columns. With this data now in place, let's take a look at attribution using the heuristic methods `first-touch` and `last-touch`.
-# MAGIC
+# MAGIC
 # MAGIC In this step, we will:
 # MAGIC 1. Create a temporary view for first-touch and last-touch attribution metrics
 # MAGIC 2. Use the temporary view to create the gold_attribution table
 # MAGIC 3. Use the gold_attribution table to view first touch vs. last touch by channel
-# MAGIC
+# MAGIC
 # MAGIC After we build our Markov model in the next notebook, `04_markov_chains`, we will then take a look at how attribution using a data-driven method compares to these heuristic methods.
 
 # COMMAND ----------
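Editor's note: the attribution view definition sits outside this hunk. A hedged sketch of how first-touch and last-touch conversion counts per channel could be derived from `gold_user_journey` is given below; the column names follow the previous step, but the exact query in the notebook may differ.

# Hypothetical first-touch / last-touch summary; not the notebook's exact attribution_view query.
first_touch = spark.sql("""
  SELECT 'first_touch' AS attribution_model, first_interaction AS channel, count(*) AS conversions
  FROM gold_user_journey
  WHERE conversion = 1
  GROUP BY first_interaction
""")
last_touch = spark.sql("""
  SELECT 'last_touch' AS attribution_model, last_interaction AS channel, count(*) AS conversions
  FROM gold_user_journey
  WHERE conversion = 1
  GROUP BY last_interaction
""")
display(first_touch.unionByName(last_touch))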
@@ -241,12 +241,11 @@
 
 # COMMAND ----------
 
-_ = spark.sql('''
-CREATE TABLE IF NOT EXISTS gold_attribution
-USING DELTA
-LOCATION '{}'
-AS
-SELECT * FROM attribution_view'''.format(gold_attribution_tbl_path))
+# MAGIC %sql
+# MAGIC CREATE TABLE IF NOT EXISTS gold_attribution
+# MAGIC USING DELTA
+# MAGIC AS
+# MAGIC SELECT * FROM attribution_view
 
 # COMMAND ----------
 
@@ -269,7 +268,7 @@
 
 # MAGIC %md
 # MAGIC ## Appendix: Production
-# MAGIC
+# MAGIC
 # MAGIC In this appendix, we will:
 # MAGIC * Demonstrate that Delta Lake brings ACID transactions and full DML support to data lakes (e.g. delete, update, merge into)
 # MAGIC * Demonstrate how auditing and governance are enabled by Delta Lake
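Editor's note: as a concrete illustration of the DML support mentioned in this appendix, here is a hedged `MERGE INTO` sketch against `gold_user_journey`. The staging view name and the join key are assumptions made for illustration; they are not taken from the notebook.

# Illustrative MERGE INTO; 'user_journey_updates' is a hypothetical staging view.
spark.sql("""
  MERGE INTO gold_user_journey AS target
  USING user_journey_updates AS source
  ON target.uid = source.uid
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")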
@@ -337,14 +336,14 @@
 
 # MAGIC %md
 # MAGIC ## Next Steps
-# MAGIC
+# MAGIC
 # MAGIC * Create Markov Chain Attribution Model
 
 # COMMAND ----------
 
 # MAGIC %md
 # MAGIC Copyright Databricks, Inc. [2021]. The source in this notebook is provided subject to the [Databricks License](https://databricks.com/db-license-source). All included or referenced third party libraries are subject to the licenses set forth below.
-# MAGIC
+# MAGIC
 # MAGIC |Library Name|Library license | Library License URL | Library Source URL |
 # MAGIC |---|---|---|---|
 # MAGIC |Matplotlib|Python Software Foundation (PSF) License |https://matplotlib.org/stable/users/license.html|https://github.com/matplotlib/matplotlib|
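Editor's note: the "Next Steps" hunk above points to the Markov Chain attribution model built in the follow-on `04_markov_chains` notebook. For readers who stop at this commit, the toy sketch below illustrates the removal-effect idea behind that model; the states and transition probabilities are made up, and this code is not from the repository.

# Toy Markov-chain removal effect, illustrative only.
import numpy as np

states = ["Start", "Search", "Social", "Email", "Conversion", "Null"]
T = np.array([
    # Start  Search Social Email  Conv   Null
    [0.00,   0.40,  0.40,  0.20,  0.00,  0.00],  # Start
    [0.00,   0.00,  0.30,  0.30,  0.20,  0.20],  # Search
    [0.00,   0.10,  0.00,  0.40,  0.30,  0.20],  # Social
    [0.00,   0.00,  0.00,  0.00,  0.60,  0.40],  # Email
    [0.00,   0.00,  0.00,  0.00,  1.00,  0.00],  # Conversion (absorbing)
    [0.00,   0.00,  0.00,  0.00,  0.00,  1.00],  # Null (absorbing)
])

def p_conversion(T):
    # Probability of ending in Conversion when starting from Start.
    p = np.zeros(len(states))
    p[0] = 1.0
    for _ in range(200):  # iterate until the chain is effectively absorbed
        p = p @ T
    return p[states.index("Conversion")]

def removal_effect(channel):
    # Redirect all traffic through the removed channel to Null, then compare.
    i = states.index(channel)
    T_removed = T.copy()
    T_removed[i, :] = 0.0
    T_removed[i, states.index("Null")] = 1.0
    return 1 - p_conversion(T_removed) / p_conversion(T)

for ch in ["Search", "Social", "Email"]:
    print(ch, round(removal_effect(ch), 3))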

0 commit comments