Commit 5a60392

Improved data lake tutorial
1 parent 73d7457 commit 5a60392

File tree

1 file changed

+71
-34
lines changed
  • src/content/docs/pipelines/tutorials/send-data-from-client


src/content/docs/pipelines/tutorials/send-data-from-client/index.mdx

@@ -1,9 +1,9 @@
11
---
2-
updated: 2025-03-03
2+
updated: 2025-04-06
33
difficulty: Intermediate
44
content_type: 📝 Tutorial
55
pcx_content_type: tutorial
6-
title: Sending Clickstream data from client-side to Pipelines
6+
title: Create a data lake of clickstream data
77
products:
88
- R2
99
- Workers
@@ -15,14 +15,15 @@ languages:
1515

1616
import { Render, PackageManagers, Details, WranglerConfig } from "~/components";
1717

18-
In this tutorial, you will learn how to ingest clickstream data to a R2 bucket using Pipelines. You will send this data from the client-side, that means you will make a call to the Pipelines URL directly from the client-side JavaScript code.
18+
In this tutorial, you will learn how to build a data lake of website interaction events (clickstream data), using Pipelines.
1919

20-
For this tutorial, you will build a landing page of an e-commerce website. The page will list the products available for sale. A user can click on the view button to view the product details or click on the add to cart button to add the product to their cart. The focus of this tutorial is to show how to ingest the data to R2 using Pipelines from the client-side. Hence, the landing page will be a simple HTML page with no actual e-commerce functionality.
20+
Data lakes are a way to store large volumes of raw data in an object storage service such as [R2](/r2). You can run queries over a data lake to analyze the raw events and generate product insights.
21+
22+
For this tutorial, you will build a landing page for an e-commerce website. Users can click on the website to view products or add them to the cart. As the user clicks on the page, events will be sent to a pipeline. These events are "client-side": they are sent directly from the user's browser to your pipeline. Your pipeline will automatically batch the ingested data, build output files, and deliver them to an [R2 bucket](/r2) to build your data lake.
2123

2224
## Prerequisites
2325

24-
1. Create a [R2 bucket](/r2/buckets/create-buckets/) in your Cloudflare account.
25-
2. Install [`Node.js`](https://docs.npmjs.com/downloading-and-installing-node-js-and-npm).
26+
1. Install [`Node.js`](https://docs.npmjs.com/downloading-and-installing-node-js-and-npm).
2627

2728
<Details header="Node.js version manager">
2829
Use a Node version manager like [Volta](https://volta.sh/) or
@@ -31,7 +32,7 @@ For this tutorial, you will build a landing page of an e-commerce website. The p
3132
later in this guide, requires a Node version of `16.17.0` or later.
3233
</Details>
3334

34-
## 1. Create a new project
35+
## 1. Create a new Workers project
3536

3637
You will create a new Worker project that will use [Static Assets](/workers/static-assets/) to serve the HTML file. While you can use any front-end framework, this tutorial uses plain HTML and JavaScript to keep things simple. If you are interested in learning how to build and deploy a web application on Workers with Static Assets, you can refer to the [Frameworks](/workers/frameworks/) documentation.
3738

@@ -59,9 +60,9 @@ Navigate to the `e-commerce-pipelines-client-side` directory:
5960
cd e-commerce-pipelines-client-side
6061
```
6162

62-
## 2. Create the front-end
63+
## 2. Create the website frontend
6364

64-
Using Static Assets, you can serve the frontend of your application from your Worker. To use Static Assets, you need to add the required bindings to your `wrangler.toml` file.
65+
Using [Workers Static Assets](/workers/static-assets/), you can serve the frontend of your application from your Worker. To use Static Assets, you need to add the required bindings to your `wrangler.toml` file.
6566

6667
<WranglerConfig>
6768

@@ -185,7 +186,6 @@ Next, create a `public` directory and add an `index.html` file. The `index.html`
185186
</body>
186187

187188
</html>
188-
189189
```
190190
</details>
191191

@@ -197,18 +197,22 @@ The above code does the following:
197197
- Adds a button to add a product to the cart.
198198
- Contains a `handleClick` function to handle the click events. This function logs the action and the product ID. In the next steps, you will create a pipeline and add the logic to send the click events to this pipeline.
199199

200-
## 3. Create a pipeline
200+
## 3. Create an R2 bucket
201+
We'll create a new R2 bucket to use as the sink for our pipeline. Create a bucket named `clickstream-bucket` using the [Wrangler CLI](/workers/wrangler/). Open a terminal window and run the following command:
202+
203+
```sh
204+
npx wrangler r2 bucket create clickstream-bucket
205+
```
201206

207+
## 4. Create a pipeline
202208
You need to create a new pipeline and connect it to your R2 bucket.
203209

204-
Create a new pipeline `clickstream-pipeline-client` using the [Wrangler CLI](/workers/wrangler/):
210+
Create a new pipeline `clickstream-pipeline-client` using the [Wrangler CLI](/workers/wrangler/). Open a terminal window, and run the following command:
205211

206212
```sh
207-
npx wrangler pipelines create clickstream-pipeline-client --r2-bucket <BUCKET_NAME> --compression none --batch-max-seconds 5
213+
npx wrangler pipelines create clickstream-pipeline-client --r2-bucket clickstream-bucket --compression none --batch-max-seconds 5
208214
```
209215

210-
Replace `<BUCKET_NAME>` with the name of your R2 bucket.
211-
212216
When you run the command, you will be prompted to authorize Cloudflare Workers Pipelines to create R2 API tokens on your behalf. These tokens are required by your Pipeline. Your Pipeline uses these tokens when loading data into your bucket. You can approve the request through the browser link which will open automatically.
213217

214218
:::note
@@ -220,13 +224,13 @@ These flags are useful for testing, but we recommend keeping the default setting
220224
:::
221225

222226
```output
223-
✅ Successfully created Pipeline "clickstream-pipeline-client" with ID 0a10c577652949718bc014f4efxea241
227+
✅ Successfully created Pipeline "clickstream-pipeline-client" with ID <PIPELINE_ID>
224228
225-
Id: 0a10c577652949718bc014f4efxea241
229+
Id: <PIPELINE_ID>
226230
Name: clickstream-pipeline-client
227231
Sources:
228232
HTTP:
229-
Endpoint: https://0a10c577652949718bc014f4efxea241.pipelines.cloudflare.com
233+
Endpoint: https://<PIPELINE_ID>.pipelines.cloudflare.com
230234
Authentication: off
231235
Format: JSON
232236
Worker:
@@ -245,12 +249,12 @@ Destination:
245249
246250
Send data to your Pipeline's HTTP endpoint:
247251
248-
curl "https://0a10c577652949718bc014f4efxea241.pipelines.cloudflare.com" -d '[{"foo": "bar"}]'
252+
curl "https://<PIPELINE_ID>.pipelines.cloudflare.com" -d '[{"foo": "bar"}]'
249253
```
250254

251255
Make a note of the URL of the pipeline. You will use this URL to send the clickstream data from the client-side.
252256

253-
## 4. Generate clickstream data
257+
## 5. Generate clickstream data
254258

255259
You need to send clickstream data like the `timestamp`, `user_id`, `session_id`, and `device_info` to your pipeline. You can generate this data on the client side. Add the following function in the `<script>` tag in your `public/index.html`. This function gets the device information:
256260

@@ -306,7 +310,7 @@ function extractDeviceInfo(userAgent) {
306310
}
307311
```
308312

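This diff hunk elides most of the `extractDeviceInfo` body. As a rough, illustrative sketch only (real user-agent detection is far more involved), a version consistent with the sample output shown later in the tutorial could look like:

```javascript
// Illustrative sketch of a user-agent parser; a production parser
// would handle many more browsers, operating systems, and devices.
function extractDeviceInfo(userAgent) {
  // Check Chrome before Safari: Chrome UAs also contain "Safari".
  let browser = "Unknown";
  if (/Chrome/.test(userAgent)) browser = "Chrome";
  else if (/Firefox/.test(userAgent)) browser = "Firefox";
  else if (/Safari/.test(userAgent)) browser = "Safari";

  let os = "Unknown";
  if (/Windows/.test(userAgent)) os = "Windows";
  else if (/Mac OS/.test(userAgent)) os = "macOS";
  else if (/Linux/.test(userAgent)) os = "Linux";

  const device = /Mobile/.test(userAgent) ? "Mobile" : "Desktop";
  return { browser, os, device, userAgent };
}
```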
309-
## 5. Send clickstream data to your pipeline
313+
## 6. Send clickstream data to your pipeline
310314

311315
You will send the clickstream data to the pipeline from the client side. To do that, update the `handleClick` function to make a `POST` request to the pipeline URL with the data. Replace `<PIPELINE_URL>` with the URL of your pipeline.
312316

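The updated `handleClick` body is not shown in this hunk. A minimal sketch of the described change might look like the following; the `buildClickEvent` helper is illustrative (not from the tutorial), and `<PIPELINE_URL>` is a placeholder for your pipeline's HTTP endpoint:

```javascript
// Illustrative sketch only; helper names are not from the tutorial.
// Build one clickstream event for a click on a product.
function buildClickEvent(action, productId) {
  return {
    timestamp: new Date().toISOString(),
    event_data: { event_type: action, product_id: productId },
  };
}

// The pipeline's HTTP endpoint accepts a JSON array of events.
async function handleClick(action, productId) {
  const event = buildClickEvent(action, productId);
  await fetch("<PIPELINE_URL>", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify([event]),
  });
}
```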
@@ -368,19 +372,17 @@ npm run dev
368372

369373
However, no data gets sent to the pipeline. Inspect the browser console to view the error message. The error message you see is for [CORS](https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS). In the next step, you will update the CORS settings to allow the client-side JavaScript to send data to the pipeline.
370374

371-
## 6. Update CORS settings
375+
## 7. Update CORS settings
372376

373-
By default, the Pipelines endpoint does not allow cross-origin requests. You need to update the CORS settings to allow the client-side JavaScript to send data to the pipeline. To update the CORS settings, execute the following command:
377+
By default, the HTTP ingestion endpoint for your pipeline does not allow cross-origin requests. You need to update the CORS settings to allow the client-side JavaScript to send data to the pipeline. To update the CORS settings, execute the following command:
374378

375379
```sh
376380
npx wrangler pipelines update clickstream-pipeline-client --cors-origins http://localhost:8787
377381
```
378382

379-
Now when you run the development server and open the application in the browser, you will see the clickstream data being sent to the pipeline when you click on the `View Details` or `Add to Cart` button. You can also see the data in the R2 bucket.
383+
Now, when you run the development server locally and open the website in a browser, clickstream data will be sent to the pipeline successfully. You can learn more about the CORS settings in the [Specifying CORS settings](/pipelines/build-with-pipelines/http/#specifying-cors-settings) documentation.
380384

381-
You can learn more about the CORS settings in the [Specifying CORS settings](/pipelines/build-with-pipelines/http/#specifying-cors-settings) documentation.
382-
383-
## 7. Deploy the application
385+
## 8. Deploy the application
384386

385387
To deploy the application, run the following command:
386388

@@ -406,17 +408,17 @@ Deployed e-commerce-pipelines-client-side triggers (7.60 sec)
406408
Current Version ID: <VERSION_ID>
407409
```
408410

409-
We now need to update the pipeline's CORS settings to include the URL of our newly deployed application. Run the command below, and replace `<URL>` with the URL of the application.
411+
We now need to update the pipeline's CORS settings again. This time, we'll include the URL of our newly deployed application. Run the command below, and replace `<URL>` with the URL of the application.
410412

411413
```sh
412414
npx wrangler pipelines update clickstream-pipeline-client --cors-origins http://localhost:8787 https://<URL>.workers.dev
413415
```
414416

415417
Now, you can access the application at the deployed URL. When you click on the `View Details` or `Add to Cart` button, the clickstream data will be sent to your pipeline.
416418

417-
## 8. View the data in R2
419+
## 9. View the data in R2
418420

419-
You can view the data in the R2 bucket. If you are not signed in to the Cloudflare dashboard, sign in and navigate to the R2 overview page.
421+
You can view the data in the R2 bucket. If you are not signed in to the Cloudflare dashboard, sign in and navigate to the [R2 overview](https://dash.cloudflare.com/?to=/:account/r2/overview) page.
420422

421423
Open the bucket you configured for your pipeline in Step 3. You can see files representing the clickstream data. These files are newline-delimited JSON files. Each row in a file represents one click event. Download one of the files, and open it in your preferred text editor to see the output:
422424

@@ -428,15 +430,50 @@ Open the bucket you configured for your pipeline in Step 3. You can see files, r
428430
{"timestamp":"2025-04-06T16:24:33.978Z","session_id":"1234567890abcdef","user_id":"user333","event_data":{"event_id":467,"event_type":"product_view","page_url":"https://<URL>.workers.dev/","timestamp":"2025-04-06T16:24:33.978Z","product_id":6},"device_info":{"browser":"Chrome","os":"Linux","device":"Mobile","userAgent":"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Mobile Safari/537.36"},"referrer":""}
429431
```
430432

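Because each output file is newline-delimited JSON, inspecting a downloaded file programmatically is straightforward. As a small sketch (a hypothetical helper, not part of the tutorial), you could parse a file's text into an array of event objects:

```javascript
// Sketch: turn the NDJSON text of a downloaded output file into
// an array of event objects, skipping any blank lines.
function parseNdjson(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}
```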
433+
## 10. Optional: Connect a query engine to your R2 bucket and query the data
434+
Once you have collected the raw events in R2, you might want to query them to answer questions such as "how many `product_view` events occurred?". You can connect a query engine, such as MotherDuck, to your R2 bucket.
435+
436+
You can connect the bucket to MotherDuck in several ways, which you can learn about from the [MotherDuck documentation](https://motherduck.com/docs/integrations/cloud-storage/cloudflare-r2/). In this tutorial, you will connect the bucket to MotherDuck using the MotherDuck dashboard.
437+
438+
### Connect your bucket to MotherDuck
439+
440+
Before connecting the bucket to MotherDuck, you need to obtain the Access Key ID and Secret Access Key for the R2 bucket. You can find the instructions to obtain the keys in the [R2 API documentation](/r2/api/tokens/).
441+
443+
444+
1. Log in to the MotherDuck dashboard and select your profile.
445+
2. Navigate to the **Secrets** page.
446+
3. Select the **Add Secret** button and enter the following information:
447+
448+
- **Secret Name**: `Clickstream pipeline`
449+
- **Secret Type**: `Cloudflare R2`
450+
- **Access Key ID**: `ACCESS_KEY_ID` (replace with the Access Key ID)
451+
- **Secret Access Key**: `SECRET_ACCESS_KEY` (replace with the Secret Access Key)
452+
453+
4. Select the **Add Secret** button to save the secret.
454+
455+
### Query the data
456+
In this step, you will query the data stored in the R2 bucket using MotherDuck.
457+
458+
1. Navigate back to the MotherDuck dashboard and select the **+** icon to add a new Notebook.
459+
2. Select the **Add Cell** button to add a new cell to the notebook.
460+
461+
3. In the cell, enter the following query and select the **Run** button to execute the query:
462+
463+
```sql
464+
SELECT count(*) FROM read_json_auto('r2://clickstream-bucket/**/*');
465+
```
466+
467+
The query will return a count of all the events received.
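For comparison, the same kind of question can be answered in plain JavaScript over events parsed from the downloaded files; `countEventsByType` is a hypothetical helper, not part of the tutorial:

```javascript
// Hypothetical helper: tally events by their event_data.event_type field.
function countEventsByType(events) {
  return events.reduce((counts, e) => {
    const type = (e.event_data && e.event_data.event_type) || "unknown";
    counts[type] = (counts[type] || 0) + 1;
    return counts;
  }, {});
}
```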
468+
431469
## Conclusion
432470

433471
You have successfully created a Pipeline and used it to send clickstream data from the client. Through this tutorial, you've gained hands-on experience in:
434472

435-
1. Creating a Workers project with a static frontend
473+
1. Creating a Workers project using Static Assets
436474
2. Generating and capturing clickstream data
437-
3. Setting up a Cloudflare Pipelines to ingest data into R2
475+
3. Setting up a pipeline to ingest data into R2
438476
4. Deploying the application to Workers
439-
440-
For your next steps, consider connecting your R2 bucket to MotherDuck to analyse the data. You can follow the instructions in the [Analyzing Clickstream Data with MotherDuck and Cloudflare R2](/pipelines/tutorials/query-data-with-motherduck#7-connect-the-r2-bucket-to-motherduck) tutorial to connect your R2 bucket to MotherDuck and analyse data.
477+
5. Using MotherDuck to query the data
441478

442479
You can find the source code of the application in the [GitHub repository](https://github.com/harshil1712/e-commerce-pipelines-client-side).
