|
13 | 13 |
|
14 | 14 | # MAGIC %md |
15 | 15 | # MAGIC ## Overview |
16 | | -# MAGIC |
| 16 | +# MAGIC |
17 | 17 | # MAGIC ### In this notebook you: |
18 | 18 | # MAGIC * Use `Databricks Auto Loader` to import the ad impression and conversion data generated in the notebook `01_intro`. |
19 | 19 | # MAGIC * Write the data out in `Delta` format. |
|
23 | 23 |
|
24 | 24 | # MAGIC %md |
25 | 25 | # MAGIC ## Step 1: Configure the Environment |
26 | | -# MAGIC |
| 26 | +# MAGIC |
27 | 27 | # MAGIC In this step, we will: |
28 | 28 | # MAGIC 1. Import libraries |
29 | 29 | # MAGIC 2. Run the `utils` notebook to gain access to the `get_params` function |
|
63 | 63 | # COMMAND ---------- |
64 | 64 |
|
65 | 65 | params = get_params() |
| 66 | +catalog_name = params['catalog_name'] |
66 | 67 | database_name = params['database_name'] |
67 | 68 | raw_data_path = params['raw_data_path'] |
68 | 69 | bronze_tbl_path = params['bronze_tbl_path'] |
|
71 | 72 |
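For readers who do not have the companion `utils` notebook at hand, here is a hypothetical sketch of what `get_params` might return, inferred only from the keys used above (every value below is a placeholder, not the accelerator's real configuration):

```python
def get_params() -> dict:
    # Hypothetical stand-in for the helper defined in the `utils` notebook.
    # The real function presumably derives these values from the workspace setup;
    # the literals here are placeholders for illustration only.
    return {
        "catalog_name": "main",                          # Unity Catalog catalog (assumed)
        "database_name": "ad_attribution",               # target database/schema (assumed)
        "raw_data_path": "/tmp/ad_attribution/raw",      # landing zone for the synthetic files (assumed)
        "bronze_tbl_path": "/tmp/ad_attribution/bronze", # storage path for the bronze Delta table (assumed)
    }
```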
|
72 | 73 | # MAGIC %md |
73 | 74 | # MAGIC ## Step 2: Load Data using Databricks Auto Loader |
74 | | -# MAGIC |
| 75 | +# MAGIC |
75 | 76 | # MAGIC In this step, we will: |
76 | 77 | # MAGIC 1. Define the schema of the synthetic data generated in `01_load_data` |
77 | 78 | # MAGIC 2. Read the synthetic data into a dataframe using Auto Loader |
|
82 | 83 | # MAGIC %md |
83 | 84 | # MAGIC But what is Auto Loader? |
84 | 85 | # MAGIC * Auto Loader incrementally and efficiently loads new data files as they arrive in [S3](https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html) or [Azure Blob Storage](https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/auto-loader). This is enabled by providing a Structured Streaming source called `cloudFiles`. A minimal read sketch follows this list. |
85 | | -# MAGIC |
| 86 | +# MAGIC |
86 | 87 | # MAGIC * Auto Loader internally keeps track of which files have been processed to provide exactly-once semantics, so you do not need to manage any state information yourself. |
87 | | -# MAGIC |
| 88 | +# MAGIC |
88 | 89 | # MAGIC * Auto Loader supports two modes for detecting when new files arrive: |
89 | 90 | # MAGIC |
90 | 91 | # MAGIC * `Directory listing:` Identifies new files by parallel listing of the input directory. Quick to get started since no permission configurations are required. Suitable for scenarios where only a few files need to be streamed in on a regular basis. |
|
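As a minimal, self-contained sketch of the read pattern described above (the file format, schema, and path are placeholder assumptions, not the notebook's actual values):

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Placeholder schema for illustration only; the notebook defines its own schema for the synthetic data.
example_schema = StructType([
    StructField("uid", StringType(), True),
    StructField("interaction", StringType(), True),
    StructField("impression_time", TimestampType(), True),
])

example_df = (
    spark.readStream.format("cloudFiles")                # Auto Loader source
    .option("cloudFiles.format", "json")                 # incoming file format (assumed)
    .option("cloudFiles.useNotifications", "false")      # directory-listing mode; "true" switches to file notification
    .option("cloudFiles.includeExistingFiles", "true")   # also pick up files already present in the path
    .schema(example_schema)
    .load("/tmp/example/raw_landing_zone")               # placeholder input path
)
```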
118 | 119 | .option("cloudFiles.region", "us-west-2") \ |
119 | 120 | .option("cloudFiles.includeExistingFiles", "true") \ |
120 | 121 | .schema(schema) \ |
121 | | - .load(raw_data_path) |
| 122 | + .load(raw_data_path) |
122 | 123 |
|
123 | 124 | # COMMAND ---------- |
124 | 125 |
|
|
134 | 135 |
|
135 | 136 | # MAGIC %md |
136 | 137 | # MAGIC ## Step 3: Write Data to Delta Lake |
137 | | -# MAGIC |
| 138 | +# MAGIC |
138 | 139 | # MAGIC In this section of the solution accelerator, we write our data out to [Delta Lake](https://delta.io/) and then create a table (and database) for easy access and queryability. |
139 | | -# MAGIC |
| 140 | +# MAGIC |
140 | 141 | # MAGIC * Delta Lake is an open-source project that enables building a **Lakehouse architecture** on top of existing storage systems such as S3, ADLS, GCS, and HDFS. |
141 | 142 | # MAGIC   * Information on the **Lakehouse Architecture** can be found in this [paper](http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf) that was presented at [CIDR 2021](http://cidrdb.org/cidr2021/index.html) and in this [video](https://www.youtube.com/watch?v=RU2dXoVU8hY). |
142 | | -# MAGIC |
| 143 | +# MAGIC |
143 | 144 | # MAGIC * Key features of Delta Lake include: |
144 | 145 | # MAGIC * **ACID Transactions**: Ensures data integrity and read consistency with complex, concurrent data pipelines. |
145 | 146 | # MAGIC   * **Unified Batch and Streaming Source and Sink**: A table in Delta Lake is both a batch table and a streaming source and sink (see the sketch after this list). Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box. |
|
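To make the unified batch/streaming point concrete, a brief sketch (the three-level table name is a placeholder):

```python
# The same Delta table can be read as a static batch DataFrame...
batch_df = spark.read.table("my_catalog.my_db.bronze")

# ...or as a streaming source that picks up new rows as they are committed.
stream_df = spark.readStream.table("my_catalog.my_db.bronze")
```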
157 | 158 |
|
158 | 159 | # COMMAND ---------- |
159 | 160 |
|
160 | | -raw_data_df.writeStream.format("delta") \ |
161 | | - .trigger(once=True) \ |
162 | | - .option("checkpointLocation", bronze_tbl_path+"/checkpoint") \ |
163 | | - .start(bronze_tbl_path) \ |
164 | | - .awaitTermination() |
165 | | - |
166 | | -# COMMAND ---------- |
167 | | - |
168 | | -# MAGIC %md |
169 | | -# MAGIC ## Step 4: Create Database |
170 | | - |
171 | | -# COMMAND ---------- |
172 | | - |
173 | | -# Delete the old database and tables if needed |
174 | | -_ = spark.sql('DROP DATABASE IF EXISTS {} CASCADE'.format(database_name)) |
175 | | - |
176 | | -# Create database to house tables |
177 | | -_ = spark.sql('CREATE DATABASE {}'.format(database_name)) |
178 | | - |
179 | | -# COMMAND ---------- |
180 | | - |
181 | | -# MAGIC %md |
182 | | -# MAGIC ## Step 5: Create bronze-level table in Delta format |
183 | | -# MAGIC |
184 | | -# MAGIC * **Note:** this step will produce an exception if it is run before writeStream in step 3 is initialized. |
185 | | -# MAGIC |
186 | | -# MAGIC * The nomenclature of bronze, silver, and gold tables correspond with a commonly used data modeling approach known as multi-hop architecture. |
187 | | -# MAGIC * Additional information about this pattern can be found [here](https://databricks.com/blog/2019/08/14/productionizing-machine-learning-with-delta-lake.html). |
188 | | - |
189 | | -# COMMAND ---------- |
190 | | - |
191 | | -# Create bronze table |
192 | | -_ = spark.sql(''' |
193 | | - CREATE TABLE `{}`.bronze |
194 | | - USING DELTA |
195 | | - LOCATION '{}' |
196 | | - '''.format(database_name,bronze_tbl_path)) |
| 161 | +dbutils.fs.rm(bronze_tbl_path+"/checkpoint", recurse=True) |
197 | 162 |
|
198 | 163 | # COMMAND ---------- |
199 | 164 |
|
200 | | -# MAGIC %md |
201 | | -# MAGIC ## Step 6: View the bronze table |
202 | | -# MAGIC |
203 | | -# MAGIC Using `spark.table` here enables use of Python. An alternative approach is to query the data directly using SQL. This will be shown in the `03_data_prep` notebook. |
204 | | - |
205 | | -# COMMAND ---------- |
206 | | - |
207 | | -bronze_tbl = spark.table("{}.bronze".format(database_name)) |
208 | | - |
209 | | -# COMMAND ---------- |
210 | | - |
211 | | -display(bronze_tbl) |
| 165 | +raw_data_df.writeStream.format("delta") \ |
| 166 | + .trigger(availableNow=True) \ |
| 167 | + .option("checkpointLocation", bronze_tbl_path+"/checkpoint") \ |
| 168 | + .toTable(f"{catalog_name}.{database_name}.bronze") |
212 | 169 |
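Once the `availableNow` trigger has processed the backlog and the streaming query finishes (for example, after calling `awaitTermination()` on the query returned by `toTable`), the resulting table can be read back directly; a short usage sketch using the names defined earlier in this notebook:

```python
# Read the bronze table produced by the streaming write above (assumes the stream has completed).
bronze_tbl = spark.table(f"{catalog_name}.{database_name}.bronze")
display(bronze_tbl)
```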
|
213 | 170 | # COMMAND ---------- |
214 | 171 |
|
|
220 | 177 |
|
221 | 178 | # MAGIC %md |
222 | 179 | # MAGIC Copyright Databricks, Inc. [2021]. The source in this notebook is provided subject to the [Databricks License](https://databricks.com/db-license-source). All included or referenced third party libraries are subject to the licenses set forth below. |
223 | | -# MAGIC |
| 180 | +# MAGIC |
224 | 181 | # MAGIC |Library Name|Library license | Library License URL | Library Source URL | |
225 | 182 | # MAGIC |---|---|---|---| |
226 | 183 | # MAGIC |Matplotlib|Python Software Foundation (PSF) License |https://matplotlib.org/stable/users/license.html|https://github.com/matplotlib/matplotlib| |
|