Merged
24 commits
37 changes: 17 additions & 20 deletions .github/workflows/build.yml
@@ -14,44 +14,41 @@ jobs:
fail-fast: false
matrix:
include:
- example: recommendation
path: clickstream-ai-recommendation
test_commands: |
compile -c package-api.json
compile -c package-kafka.json
- example: finance
path: finance-credit-card-chatbot
test_commands: |
test -c package-analytics-local.json --snapshot snapshots-analytics
test -c package-rewards-local.json --snapshot snapshots-rewards
test -c package-analytics-local.json --snapshots snapshots-analytics
test -c package-rewards-local.json --snapshots snapshots-rewards

- example: healthcare
path: healthcare-study-monitoring
test_commands: |
test -c study_analytics_package.json study_analytics.sqrl --snapshot snapshots-study-analytics
test -c study_api_test_package.json --tests tests-api --snapshot snapshots-study-api
test -c study_stream_local_package.json study_stream.sqrl --snapshot snapshots-study-stream
compile study_create_api.sqrl
test -c study_analytics_package.json study_analytics.sqrl --snapshots snapshots-study-analytics
test -c study_api_test_package.json --tests tests-study-api --snapshots snapshots-study-api
test -c study_stream_local_package.json study_stream.sqrl --snapshots snapshots-study-stream
compile -c study_stream_kafka_package.json
compile -c study_analytics_snowflake_package.json

- example: logistics
path: logistics-shipping-geodata
test_commands: |
test logistics.sqrl --snapshot snapshots
test logistics.sqrl --snapshots snapshots

- example: iot-sensor
path: iot-sensor-metrics
test_commands: |
test sensors.sqrl --snapshot snapshots

- example: retail
path: retail-customer360-nutshop
test_commands: |
test customer360.sqrl --snapshot snapshots

- example: recommendation
path: clickstream-ai-recommendation
test_commands: |
test -c package.json --snapshot snapshots
test -c sensor-static.json --snapshots snapshots-static
test -c sensor-api.json --tests test-api --snapshots snapshots-api

Collaborator


we should split this file in 2, leave one for 0.5 and other for 0.6.

we can delete 0.5 as things progress

- example: law
path: law-enforcement
test_commands: |
test -c baseball-card-local.json --snapshot snapshots
test -c baseball-card-local.json --snapshots snapshots

- example: oil-gas
path: oil-gas-agent-automation
@@ -60,7 +57,7 @@ jobs:

env:
TZ: 'America/Los_Angeles'
SQRL_VERSION: 'v0.5.10'
SQRL_VERSION: '0.6.0'

steps:
- uses: actions/checkout@v4
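The workflow above pins the compiler via the `SQRL_VERSION` environment variable; the steps that consume it are truncated in this diff. As a hedged sketch, assuming the `datasqrl/cmd` image tag simply mirrors the variable, a step could resolve the image like this:

```shell
# Sketch only: derive the compiler image tag from the workflow's
# SQRL_VERSION variable. The tag format is an assumption; the actual
# workflow steps are not visible in this diff.
SQRL_VERSION='0.6.0'
IMAGE="datasqrl/cmd:${SQRL_VERSION}"
echo "$IMAGE"
```

Note that the PR also drops the `v` prefix (`v0.5.10` becomes `0.6.0`), so any step that hardcodes the prefix into the tag would need updating as well.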
1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@ target
h2.db.mv.db
cache
.project
*/myenv


# Created by https://www.toptal.com/developers/gitignore/api/python
41 changes: 16 additions & 25 deletions README.md
@@ -3,51 +3,42 @@
This is a repository for real world [DataSQRL](https://github.com/DataSQRL/sqrl) use cases and examples.

* **[Finance Credit Card Chatbot](finance-credit-card-chatbot/)**: Build a data pipeline that enriches and analyzes credit card transactions in real time and feeds the data into a GenAI chatbot to answer customers' questions about their transactions and spending. The extended example shows how to build a credit card rewards program and a GenAI agent that sells credit cards.
* **[Healthcare Study](healthcare-study-monitoring/)**: Build a data pipeline for enriching healthcare data and querying it in realtime through an API, for data analytics in Iceberg, and publishing it to Kafka.
* **[Law Enforcement](law-enforcement)**: Build a realtime data pipeline for capturing and tracking warrants and BOLOs.
* **[Oil & Gas IoT Automation Agent](oil-gas-agent-automation)**: Build a realtime data enrichment pipeline that triggers an agent to analyze abnormal events for automated troubleshooting.
* **[Clickstream AI Recommendation](clickstream-ai-recommendation/)**: Build a personalized recommendation engine based on clickstream data and vector content embeddings generated by an LLM.
* **[IoT Sensor Metrics](iot-sensor-metrics/)**: Build an event-driven microservice that ingests sensor metrics, processes them in realtime, and produces alerts and dashboards for users.
* **[Logistics Shipping](logistics-shipping-geodata/)**: Build a data pipeline that processes logistics data to provide real-time tracking and shipment information for customers.
* **[Retail Nutshop](retail-customer360-nutshop/)**: Build a realtime Customer 360 application for an online shop with personalized recommendations.
* **[User Defined Function](user-defined-function/)**: This small tutorial shows how to call a custom function in your SQRL script.

## Running the Examples

### Prerequisites

Running these examples requires the DataSQRL compiler. The easiest way to run the DataSQRL compiler is in Docker. This requires that you have a [recent version of Docker](https://docs.docker.com/get-docker/) installed on your machine.
To compile and run these examples with DataSQRL, you need to have a [recent version of Docker](https://docs.docker.com/get-docker/) installed on your machine.

### Compiling Examples
### Running & Compiling Examples

To run the DataSQRL compiler on Linux or MacOS, open a terminal and run the following command:
To run the examples on Linux or MacOS, open a terminal and run the following command:
```bash
docker run -it --rm -v $PWD:/build datasqrl/cmd:latest compile [ARGUMENTS GO HERE]
docker run -it --rm -p 8888:8888 -p 8081:8081 -p 9092:9092 -v $PWD:/build datasqrl/cmd:latest run [ARGUMENTS GO HERE]
```

If you are on Windows using PowerShell, you need to reference the local directory with a slightly different syntax:
```bash
docker run -it --rm -v ${PWD}:/build datasqrl/cmd:latest compile [ARGUMENTS GO HERE]
docker run -it --rm -p 8888:8888 -p 8081:8081 -p 9092:9092 -v ${PWD}:/build datasqrl/cmd:latest run [ARGUMENTS GO HERE]
```

Check the `README.md` in the respective directory for more information on how to run each example. We will be using the Unix syntax, so keep in mind that you have to adjust the commands slightly on Windows machines by using `${PWD}` instead.

### Running Examples

DataSQRL compiles all the assets for a completely integrated data pipeline. The assets are generated in the `build/deploy` folder. You can run that data pipeline with Docker:

`(cd build/deploy; docker compose up --build)`.

This will build all the images and stand up all the components of the data pipeline. Note that this can take a few minutes, in particular if you are building for the first time.

Once you are done with the data pipeline, you can bring it down safely with:
Check the `README.md` in the respective directory for more information on the arguments to run each example.
We will be using the Unix syntax, so keep in mind that you have to adjust the commands slightly on Windows machines by using `${PWD}` instead.

`(cd build/deploy; docker compose down -v)`
To compile an example (without running it), use this command:
```bash
docker run -it --rm -v $PWD:/build datasqrl/cmd:latest compile [ARGUMENTS GO HERE]
```
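The compile and run invocations above differ only in the subcommand and the published ports. As a small illustrative sketch (image name and flags copied from the commands above; this only prints the string, it does not invoke Docker):

```shell
# Assemble the shared docker invocation once; swap in "run" or "compile".
# On Windows PowerShell the mount would be written as ${PWD}:/build.
SUBCOMMAND="compile"
CMD="docker run -it --rm -v $PWD:/build datasqrl/cmd:latest $SUBCOMMAND"
echo "$CMD"
```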

## What is DataSQRL?

![Example DataSQRL Feature Store](util/img/feature-store.png)

DataSQRL is a flexible data development framework for building various types of data architectures, like data pipelines, event-driven microservices, and Kappa. It provides the basic structure, common patterns, and a set of tools for streamlining the development process.

DataSQRL integrates a broad array of technologies including Apache Flink, Apache Kafka, PostgreSQL, Apache Iceberg, Snowflake, Vert.x and others. By enabling developers to define data processing workflows in SQL and supporting custom functions in Java, Scala, and soon Python, DataSQRL generates the necessary glue code, schematics, and mappings to connect and configure these components seamlessly. DataSQRL simplifies the data engineering process by automating routine tasks such as testing, debugging, deployment, and maintenance of data architectures.
Check out the main [DataSQRL repository](https://github.com/DataSQRL/sqrl/) for more information on the compiler and runtime used in these examples.

[DataSQRL](https://github.com/DataSQRL/sqrl) is an open-source project hosted on GitHub.
[Click here](https://www.datasqrl.com) for more information and documentation on DataSQRL.
Take a look at the [DataSQRL documentation](https://datasqrl.github.io/sqrl) to learn how to build your own project with DataSQRL.
160 changes: 62 additions & 98 deletions clickstream-ai-recommendation/README.md
@@ -1,9 +1,8 @@
# Clickstream Recommendation

This example demonstrates DataSQRL's capabilities by creating personalized content recommendations
We are building personalized content recommendations
based on clickstream data and content vector embeddings. The pipeline ingests data, processes it,
and serves recommendations in real time, highlighting the power of DataSQRL in handling streaming
data efficiently.
and serves recommendations in real time.

## Architecture

@@ -20,122 +19,84 @@ components in AWS as shown in the diagram above.

There are two ways to run this example depending on how you want to ingest the clickstream data.

### Ingest from Stream

This method reads data from the stream directly and requires that you add the data to the stream
specifically.

In the project, we have two packages for the same data. The `content-kafka` package specifies reading
data through a Kafka connector. The `content-file` package uses a connector that reads from the
filesystem. Initially, we will use the `content-kafka` package, which is already
imported (`IMPORT content.Content`, which has a dependency aliased in the package.json).
## Ingest Data from API

Then, execute the following steps:
This version allows you to add content and clickstream data manually through a GraphQL API, giving you more control over how the data is ingested and what happens at each step.

1. Run the following command in the root directory to compile: `docker run -it --rm -v $PWD:/build datasqrl/cmd:v0.5.2 compile`
1. Add the current directory as an env variable: `export SQRL_DIR=${PWD}`
1. Start the pipeline: `(cd build/deploy; docker compose up --build)`. This sets up the entire data pipeline with
Redpanda, Flink, Postgres, and API server. It takes a few minutes for all the components to boot
up.
1. Once everything is started, open another terminal window to add data to Kafka using the
load_data.py script in the `yourdata-files` directory. This requires **kafka-python-ng** installed
via `pip3 install kafka-python-ng`.
1. Load the content data: `python3 load_data.py content.json.gz localhost:9094 content --msg 50`.
Wait until it finishes, which takes about two minutes. Check the Flink Dashboard running
at http://localhost:8081/ to see the progress. Wait until the task turns blue again.
1. Load the clickstream
data: `python3 load_data.py clickstream.json.gz localhost:9094 clickstream --msg 100`. This loads
100 clicks per second. Wait a few seconds for some data to load. Let this run in the background
until it finishes (about 4 minutes).

Open GraphiQL and query the data:
`http://localhost:8888/graphiql/`

Query for recommendations either by page:
Run the API version with this command:
```bash
docker run -it -p 8081:8081 -p 8888:8888 --rm -v $PWD:/build -e OPENAI_API_KEY=[YOUR_API_KEY_HERE] datasqrl/cmd:latest run -c package-api.json
```
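The run command above forwards your OpenAI key into the container with `-e`. As a convenience sketch (not part of the example itself, and assuming the pipeline's embedding step needs the key), you can verify the variable is set before launching:

```shell
# Convenience sketch: check the key before starting the container,
# instead of letting the pipeline fail later. The function wrapper is
# hypothetical; only the OPENAI_API_KEY variable name comes from the
# run command above.
check_key() {
  if [ -z "${1:-}" ]; then
    echo "OPENAI_API_KEY is not set"
  else
    echo "OPENAI_API_KEY is set"
  fi
}
check_key "${OPENAI_API_KEY:-}"
```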

Next, open [GraphQL Explorer](http://localhost:8888/graphiql/) and add some content:
```graphql
query {
Recommendation(url: "https://en.wikipedia.org/wiki/Generosity%3A%20An%20Enhancement") {
recommendation
frequency
}
mutation {
Content(event: {url: "https://en.wikipedia.org/wiki/Zoop", title: "Zoop", text: "Zoop is a puzzle video game originally developed by Hookstone Productions and published by Viacom New Media for many platforms in 1995. It has similarities to Taito's 1989 arcade game Plotting (known as Flipull in other territories and on other systems) but Zoop runs in real-time instead. Players are tasked with eliminating pieces that spawn from one of the sides of the screen, before they reach the center of the playfield, by pointing at a specific piece and shooting it to either swap it with t"}) {
title
}
}
```
You can find more content samples to add in the [sample_content.jsonl](content-api/sample_content.jsonl) file.
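If you prefer the command line over GraphiQL, the same mutation can in principle be posted with curl. This is a sketch only: the `/graphql` endpoint path and the JSON payload shape are assumptions (this README only shows the GraphiQL URL), so the command is printed rather than executed:

```shell
# Sketch: print (not execute) a curl version of the Content mutation.
# The /graphql endpoint path is an assumption, and the double quotes
# inside QUERY would need JSON-escaping before actually sending it.
QUERY='mutation { Content(event: {url: "https://en.wikipedia.org/wiki/Zoop", title: "Zoop", text: "..."}) { title } }'
echo "curl -s -X POST http://localhost:8888/graphql -H 'Content-Type: application/json' --data '{\"query\": ...}'"
echo "query body: $QUERY"
```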

or for a user:

Once you have added multiple pieces of content, we can register clicks with this mutation:
```graphql
query {
SimilarContent(userid: "f5e9c688-408d-b54f-94aa-493df43dac8c") {
url
similarity
}
mutation {
Clickstream(event:{url:"https://en.wikipedia.org/wiki/Zoop", userid: "1"}) {
userid
}
}
```

You can find all the page URLs in the file `datawiki/wikipedia_urls.txt` and user ids in the
file `yourdata-files/clickstream.json.gz` (read it with `gzcat`) if you want to experiment with
different queries.

Once you are done, hit `CTRL-C` and take down the pipeline containers with `docker compose down -v`.

### From Local Files and API

This method reads the content data from local files and ingests the clickstream data through the
API. Please look at the `package-mutation.json` and compare it with the `package.json` we used before.
Notice that we are using different graphql files and different dependencies.

Execute the following steps:

1. Run the following command in the root directory to compile: `docker run -it --rm -v $PWD:/build datasqrl/cmd:v0.5.2 compile -c package-mutation.json`
2. Navigate to the build/deploy directory: `(cd build/deploy; docker compose up --build)`

Open GraphiQL to add and query data.

First, add some clickstream data for the user with ID f5e9c688-408d-b54f-94aa-493df43dac8c by
running the following mutations one after the other. Each mutation simulates a user clicking on a
specific URL. We expect the system to record these clicks and use them to generate personalized
recommendations.

Open GraphiQL to execute the following graphql commands:
`http://localhost:8888/graphiql/`

This mutation records a click event for the user on the page "Generosity: An Enhancement":
Run this mutation a few times with different URLs for the same userid (to create Covisits) and with different userids.

To retrieve similar content to what a user has viewed before, run this query:
```graphql
mutation {
Clickstream(click: {userid: "f5e9c688-408d-b54f-94aa-493df43dac8c",
url: "https://en.wikipedia.org/wiki/Generosity%3A%20An%20Enhancement"}) {
event_time
}
{
SimilarContent(userid: "1") {
similarity
title
}
}
```

This mutation records another click event for the user on the page "Lock's Quest":

To retrieve recommendations by URL:
```graphql
mutation {
Clickstream(click: {userid: "f5e9c688-408d-b54f-94aa-493df43dac8c",
url: "https://en.wikipedia.org/wiki/Lock%27s%20Quest"}) {
event_time
}
{
Recommendation(url: "https://en.wikipedia.org/wiki/Zoop") {
recommendation
frequency
}
}
```

This mutation records a final click event for the user on the page "SystemC":

```graphql
mutation {
Clickstream(click: {userid: "f5e9c688-408d-b54f-94aa-493df43dac8c",
url: "https://en.wikipedia.org/wiki/SystemC"}) {
event_time
}
}
### Ingest from Stream

This method reads data from the stream directly and requires that you add the data to the stream yourself.

The `content-kafka` package specifies reading
data through a Kafka connector.

To run the pipeline:
```bash
docker run -it -p 8081:8081 -p 8888:8888 -p 9092:9092 --rm -v $PWD:/build -e OPENAI_API_KEY=[YOUR_API_KEY_HERE] datasqrl/cmd:latest run -c package-kafka.json
```

Now, query for recommendations either by page:
1. Once everything is started, open another terminal window to add data to Kafka using the
load_data.py script in the `yourdata-files` directory. This requires **kafka-python-ng** installed
via `pip3 install kafka-python-ng`.
1. Load the content data: `python3 load_data.py content.json.gz localhost:9092 content --msg 50`.
Wait until it finishes, which takes about two minutes. Check the Flink Dashboard running
at http://localhost:8081/ to see the progress. Wait until the task turns blue again.
1. Load the clickstream
data: `python3 load_data.py clickstream.json.gz localhost:9092 clickstream --msg 100`. This loads
100 clicks per second. Wait a few seconds for some data to load. Let this run in the background
until it finishes (about 4 minutes).
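The two load steps above differ only in dataset, topic, and rate. As a reference, this sketch prints both commands (file names, broker address, and flags copied verbatim from the steps; remember to let the content load finish before starting the clickstream load):

```shell
# Print the two load_data.py invocations from the steps above.
for args in \
  "content.json.gz localhost:9092 content --msg 50" \
  "clickstream.json.gz localhost:9092 clickstream --msg 100"; do
  echo "python3 load_data.py $args"
done
```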

Open GraphiQL and query the data:
`http://localhost:8888/graphiql/`

This query provides recommendations based on the specified page URL:
Query for recommendations either by page:

```graphql
query {
@@ -146,8 +107,7 @@ query {
}
```

This query provides recommendations based on the user's clickstream data, showing similar content to
what the user has interacted with:
or for a user:

```graphql
query {
@@ -158,4 +118,8 @@ query {
}
```

Once you are done, hit `CTRL-C` and take down the pipeline containers with `docker compose down -v`.
You can find all the page URLs in the file `datawiki/wikipedia_urls.txt` and user ids in the
file `yourdata-files/clickstream.json.gz` (read it with `gzcat`) if you want to experiment with
different queries.

Once you are done, hit `CTRL-C` and take down the pipeline.
@@ -0,0 +1,5 @@
CREATE TABLE AddClick (
url STRING NOT NULL,
userid STRING NOT NULL,
event_time TIMESTAMP_LTZ(3) NOT NULL METADATA FROM 'timestamp'
);

This file was deleted.
