Merged
24 commits
37 changes: 17 additions & 20 deletions .github/workflows/build.yml
@@ -14,44 +14,41 @@ jobs:
fail-fast: false
matrix:
include:
- example: recommendation
path: clickstream-ai-recommendation
test_commands: |
compile -c package-api.json
compile -c package-kafka.json
- example: finance
path: finance-credit-card-chatbot
test_commands: |
test -c package-analytics-local.json --snapshot snapshots-analytics
test -c package-rewards-local.json --snapshot snapshots-rewards
test -c package-analytics-local.json --snapshots snapshots-analytics
test -c package-rewards-local.json --snapshots snapshots-rewards

- example: healthcare
path: healthcare-study-monitoring
test_commands: |
test -c study_analytics_package.json study_analytics.sqrl --snapshot snapshots-study-analytics
test -c study_api_test_package.json --tests tests-api --snapshot snapshots-study-api
test -c study_stream_local_package.json study_stream.sqrl --snapshot snapshots-study-stream
compile study_create_api.sqrl
test -c study_analytics_package.json study_analytics.sqrl --snapshots snapshots-study-analytics
test -c study_api_test_package.json --tests tests-study-api --snapshots snapshots-study-api
test -c study_stream_local_package.json study_stream.sqrl --snapshots snapshots-study-stream
compile -c study_stream_kafka_package.json
compile -c study_analytics_snowflake_package.json

- example: logistics
path: logistics-shipping-geodata
test_commands: |
test logistics.sqrl --snapshot snapshots
test logistics.sqrl --snapshots snapshots

- example: iot-sensor
path: iot-sensor-metrics
test_commands: |
test sensors.sqrl --snapshot snapshots

- example: retail
path: retail-customer360-nutshop
test_commands: |
test customer360.sqrl --snapshot snapshots

- example: recommendation
path: clickstream-ai-recommendation
test_commands: |
test -c package.json --snapshot snapshots
test -c sensor-static.json --snapshots snapshots-static
test -c sensor-api.json --tests test-api --snapshots snapshots-api

Collaborator


we should split this file in 2, leave one for 0.5 and other for 0.6.

we can delete 0.5 as things progress

- example: law
path: law-enforcement
test_commands: |
test -c baseball-card-local.json --snapshot snapshots
test -c baseball-card-local.json --snapshots snapshots

- example: oil-gas
path: oil-gas-agent-automation
@@ -60,7 +57,7 @@ jobs:

env:
TZ: 'America/Los_Angeles'
SQRL_VERSION: 'v0.5.10'
SQRL_VERSION: '0.6.0'

steps:
- uses: actions/checkout@v4
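The workflow above pins the compiler via the `SQRL_VERSION` environment variable; the steps that consume it are truncated in this diff. As a hedged sketch, assuming the `datasqrl/cmd` image tag simply mirrors the variable, a step could resolve the image like this:

```shell
# Sketch only: derive the compiler image tag from the workflow's
# SQRL_VERSION variable. The tag format is an assumption; the actual
# workflow steps are not visible in this diff.
SQRL_VERSION='0.6.0'
IMAGE="datasqrl/cmd:${SQRL_VERSION}"
echo "$IMAGE"
```

Note that the PR also drops the `v` prefix (`v0.5.10` becomes `0.6.0`), so any step that hardcodes the prefix into the tag would need updating as well.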
1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@ target
h2.db.mv.db
cache
.project
*/myenv


# Created by https://www.toptal.com/developers/gitignore/api/python
41 changes: 16 additions & 25 deletions README.md
@@ -3,51 +3,42 @@
This is a repository for real world [DataSQRL](https://github.com/DataSQRL/sqrl) use cases and examples.

* **[Finance Credit Card Chatbot](finance-credit-card-chatbot/)**: Build a data pipeline that enriches and analyzes credit card transactions in real time and feeds the data into a GenAI chatbot to answer customers' questions about their transactions and spending. The extended example shows how to build a credit card rewards program and a GenAI agent that sells credit cards.
* **[Healthcare Study](healthcare-study-monitoring/)**: Build a data pipeline for enriching healthcare data and querying it in realtime through an API, for data analytics in Iceberg, and publishing it to Kafka.
* **[Law Enforcement](law-enforcement)**: Build a realtime data pipeline for capturing and tracking warrants and BOLOs.
* **[Oil & Gas IoT Automation Agent](oil-gas-agent-automation)**: Build a realtime data enrichment pipeline that triggers an agent to analyze abnormal events for automated troubleshooting.
* **[Clickstream AI Recommendation](clickstream-ai-recommendation/)**: Build a personalized recommendation engine based on clickstream data and vector content embeddings generated by an LLM.
* **[IoT Sensor Metrics](iot-sensor-metrics/)**: Build an event-driven microservice that ingests sensor metrics, processes them in realtime, and produces alerts and dashboards for users.
* **[Logistics Shipping](logistics-shipping-geodata/)**: Build a data pipeline that processes logistics data to provide real-time tracking and shipment information for customers.
* **[Retail Nutshop](retail-customer360-nutshop/)**: Build a realtime Customer 360 application for an online shop with personalized recommendations.
* **[User Defined Function](user-defined-function/)**: This small tutorial shows how to call a custom function in your SQRL script.

## Running the Examples

### Prerequisites

Running these examples requires the DataSQRL compiler. The easiest way to run the DataSQRL compiler is in Docker. This requires that you have a [recent version of Docker](https://docs.docker.com/get-docker/) installed on your machine.
To compile and run these examples with DataSQRL, you need to have a [recent version of Docker](https://docs.docker.com/get-docker/) installed on your machine.

### Compiling Examples
### Running & Compiling Examples

To run the DataSQRL compiler on Linux or MacOS, open a terminal and run the following command:
To run the examples on Linux or MacOS, open a terminal and run the following command:
```bash
docker run -it --rm -v $PWD:/build datasqrl/cmd:latest compile [ARGUMENTS GO HERE]
docker run -it --rm -p 8888:8888 -p 8081:8081 -p 9092:9092 -v $PWD:/build datasqrl/cmd:latest run [ARGUMENTS GO HERE]
```

If you are on Windows using PowerShell, you need to reference the local directory with a slightly different syntax:
```bash
docker run -it --rm -v ${PWD}:/build datasqrl/cmd:latest compile [ARGUMENTS GO HERE]
docker run -it --rm -p 8888:8888 -p 8081:8081 -p 9092:9092 -v ${PWD}:/build datasqrl/cmd:latest run [ARGUMENTS GO HERE]
```

Check the `README.md` in the respective directory for more information on how to run each example. We will be using the Unix syntax, so keep in mind that you have to adjust the commands slightly on Windows machines by using `${PWD}` instead.

### Running Examples

DataSQRL compiles all the assets for a completely integrated data pipeline. The assets are generated in the `build/deploy` folder. You can run that data pipeline with Docker:

`(cd build/deploy; docker compose up --build)`.

This will build all the images and stand up all the components of the data pipeline. Note that this can take a few minutes, in particular if you are building for the first time.

Once you are done with the data pipeline, you can bring it down safely with:
Check the `README.md` in the respective directory for more information on the arguments to run each example.
We will be using the Unix syntax, so keep in mind that you have to adjust the commands slightly on Windows machines by using `${PWD}` instead.

`(cd build/deploy; docker compose down -v)`
To compile an example (without running it), use this command:
```bash
docker run -it --rm -v $PWD:/build datasqrl/cmd:latest compile [ARGUMENTS GO HERE]
```
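The compile and run invocations above differ only in the subcommand and the published ports. As a small illustrative sketch (image name and flags copied from the commands above; this only prints the string, it does not invoke Docker):

```shell
# Assemble the shared docker invocation once; swap in "run" or "compile".
# On Windows PowerShell the mount would be written as ${PWD}:/build.
SUBCOMMAND="compile"
CMD="docker run -it --rm -v $PWD:/build datasqrl/cmd:latest $SUBCOMMAND"
echo "$CMD"
```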

## What is DataSQRL?

![Example DataSQRL Feature Store](util/img/feature-store.png)

DataSQRL is a flexible data development framework for building various types of data architectures, like data pipelines, event-driven microservices, and Kappa. It provides the basic structure, common patterns, and a set of tools for streamlining the development process.

DataSQRL integrates a broad array of technologies including Apache Flink, Apache Kafka, PostgreSQL, Apache Iceberg, Snowflake, Vert.x and others. By enabling developers to define data processing workflows in SQL and supporting custom functions in Java, Scala, and soon Python, DataSQRL generates the necessary glue code, schematics, and mappings to connect and configure these components seamlessly. DataSQRL simplifies the data engineering process by automating routine tasks such as testing, debugging, deployment, and maintenance of data architectures.
Check out the main [DataSQRL repository](https://github.com/DataSQRL/sqrl/) for more information on the compiler and runtime used in these examples.

[DataSQRL](https://github.com/DataSQRL/sqrl) is an open-source project hosted on GitHub.
[Click here](https://www.datasqrl.com) for more information and documentation on DataSQRL.
Take a look at the [DataSQRL documentation](https://datasqrl.github.io/sqrl) to learn how to build your own project with DataSQRL.
160 changes: 62 additions & 98 deletions clickstream-ai-recommendation/README.md
@@ -1,9 +1,8 @@
# Clickstream Recommendation

This example demonstrates DataSQRL's capabilities by creating personalized content recommendations
We are building personalized content recommendations
based on clickstream data and content vector embeddings. The pipeline ingests data, processes it,
and serves recommendations in real time, highlighting the power of DataSQRL in handling streaming
data efficiently.
and serves recommendations in real time.

## Architecture

@@ -20,122 +19,84 @@ components in AWS as shown in the diagram above.

There are two ways to run this example depending on how you want to ingest the clickstream data.

### Ingest from Stream

This method reads data from the stream directly and requires that you add the data to the stream
specifically.

In the project, we have two packages for the same data. The `content-kafka` package specifies reading
data through a Kafka connector. The `content-file` package uses a connector that reads from the
filesystem. Initially, we will use the `content-kafka` package, which is already
imported (`IMPORT content.Content`, which has a dependency aliased in the package.json).
## Ingest Data from API

Then, execute the following steps:
This version allows you to add content and clickstream data manually through a GraphQL API, giving you more control over how the data is ingested and what happens at each step.

1. Run the following command in the root directory to compile: `docker run -it --rm -v $PWD:/build datasqrl/cmd:v0.5.2 compile`
1. Add the current directory as an env variable: `export SQRL_DIR=${PWD}`
1. Start the pipeline: `(cd build/deploy; docker compose up --build)`. This sets up the entire data pipeline with
Redpanda, Flink, Postgres, and API server. It takes a few minutes for all the components to boot
up.
1. Once everything is started, open another terminal window to add data to Kafka using the
load_data.py script in the `yourdata-files` directory. This requires **kafka-python-ng** installed
via `pip3 install kafka-python-ng`.
1. Load the content data: `python3 load_data.py content.json.gz localhost:9094 content --msg 50`.
Wait until it finishes, which takes about two minutes. Check the Flink Dashboard running
at http://localhost:8081/ to see the progress. Wait until the task turns blue again.
1. Load the clickstream
data: `python3 load_data.py clickstream.json.gz localhost:9094 clickstream --msg 100`. This loads
100 clicks per second. Wait a few seconds for some data to load. Let this run in the background
until it finishes (about 4 minutes).

Open GraphiQL and query the data:
`http://localhost:8888/graphiql/`

Query for recommendations either by page:
Run the API version with this command:
```bash
docker run -it -p 8081:8081 -p 8888:8888 --rm -v $PWD:/build -e OPENAI_API_KEY=[YOUR_API_KEY_HERE] datasqrl/cmd:latest run -c package-api.json
```
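The run command above forwards your OpenAI key into the container with `-e`. As a convenience sketch (not part of the example itself, and assuming the pipeline's embedding step needs the key), you can verify the variable is set before launching:

```shell
# Convenience sketch: check the key before starting the container,
# instead of letting the pipeline fail later. The function wrapper is
# hypothetical; only the OPENAI_API_KEY variable name comes from the
# run command above.
check_key() {
  if [ -z "${1:-}" ]; then
    echo "OPENAI_API_KEY is not set"
  else
    echo "OPENAI_API_KEY is set"
  fi
}
check_key "${OPENAI_API_KEY:-}"
```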

Next, open [GraphQL Explorer](http://localhost:8888/graphiql/) and add some content:
```graphql
query {
Recommendation(url: "https://en.wikipedia.org/wiki/Generosity%3A%20An%20Enhancement") {
recommendation
frequency
}
mutation {
Content(event: {url: "https://en.wikipedia.org/wiki/Zoop", title: "Zoop", text: "Zoop is a puzzle video game originally developed by Hookstone Productions and published by Viacom New Media for many platforms in 1995. It has similarities to Taito's 1989 arcade game Plotting (known as Flipull in other territories and on other systems) but Zoop runs in real-time instead. Players are tasked with eliminating pieces that spawn from one of the sides of the screen, before they reach the center of the playfield, by pointing at a specific piece and shooting it to either swap it with t"}) {
title
}
}
```
You can find more content samples to add in the [sample_content.jsonl](content-api/sample_content.jsonl) file.
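If you prefer the command line over GraphiQL, the same mutation can in principle be posted with curl. This is a sketch only: the `/graphql` endpoint path and the JSON payload shape are assumptions (this README only shows the GraphiQL URL), so the command is printed rather than executed:

```shell
# Sketch: print (not execute) a curl version of the Content mutation.
# The /graphql endpoint path is an assumption, and the double quotes
# inside QUERY would need JSON-escaping before actually sending it.
QUERY='mutation { Content(event: {url: "https://en.wikipedia.org/wiki/Zoop", title: "Zoop", text: "..."}) { title } }'
echo "curl -s -X POST http://localhost:8888/graphql -H 'Content-Type: application/json' --data '{\"query\": ...}'"
echo "query body: $QUERY"
```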

or for a user:

Once you have added multiple pieces of content, we can register clicks with this mutation:
```graphql
query {
SimilarContent(userid: "f5e9c688-408d-b54f-94aa-493df43dac8c") {
url
similarity
}
mutation {
Clickstream(event:{url:"https://en.wikipedia.org/wiki/Zoop", userid: "1"}) {
userid
}
}
```

You can find all the page URLs in the file `datawiki/wikipedia_urls.txt` and user ids in the
file `yourdata-files/clickstream.json.gz` (read it with `gzcat`) if you want to experiment with
different queries.

Once you are done, hit `CTRL-C` and take down the pipeline containers with `docker compose down -v`.

### From Local Files and API

This method reads the content data from local files and ingests the clickstream data through the
API. Please look at the `package-mutation.json` and compare it with the `package.json` we used before.
Notice that we are using different graphql files and different dependencies.

Execute the following steps:

1. Run the following command in the root directory to compile: `docker run -it --rm -v $PWD:/build datasqrl/cmd:v0.5.2 compile -c package-mutation.json`
2. Navigate to the build/deploy directory: `(cd build/deploy; docker compose up --build)`

Open GraphiQL to add and query data.

First, add some clickstream data for the user with ID f5e9c688-408d-b54f-94aa-493df43dac8c by
running the following mutations one after the other. Each mutation simulates a user clicking on a
specific URL. We expect the system to record these clicks and use them to generate personalized
recommendations.

Open GraphiQL to execute the following graphql commands:
`http://localhost:8888/graphiql/`

This mutation records a click event for the user on the page "Generosity: An Enhancement":
Run this mutation a few times with different URLs for the same userid (to create Covisits) and with different userids.

To retrieve similar content to what a user has viewed before, run this query:
```graphql
mutation {
Clickstream(click: {userid: "f5e9c688-408d-b54f-94aa-493df43dac8c",
url: "https://en.wikipedia.org/wiki/Generosity%3A%20An%20Enhancement"}) {
event_time
}
{
SimilarContent(userid: "1") {
similarity
title
}
}
```

This mutation records another click event for the user on the page "Lock's Quest":

To retrieve recommendations by URL:
```graphql
mutation {
Clickstream(click: {userid: "f5e9c688-408d-b54f-94aa-493df43dac8c",
url: "https://en.wikipedia.org/wiki/Lock%27s%20Quest"}) {
event_time
}
{
Recommendation(url: "https://en.wikipedia.org/wiki/Zoop") {
recommendation
frequency
}
}
```

This mutation records a final click event for the user on the page "SystemC":

```graphql
mutation {
Clickstream(click: {userid: "f5e9c688-408d-b54f-94aa-493df43dac8c",
url: "https://en.wikipedia.org/wiki/SystemC"}) {
event_time
}
}
### Ingest from Stream

This method reads data from the stream directly and requires that you add the data to the stream yourself.

The `content-kafka` package specifies reading
data through a Kafka connector.

To run the pipeline:
```bash
docker run -it -p 8081:8081 -p 8888:8888 -p 9092:9092 --rm -v $PWD:/build -e OPENAI_API_KEY=[YOUR_API_KEY_HERE] datasqrl/cmd:latest run -c package-kafka.json
```

Now, query for recommendations either by page:
1. Once everything is started, open another terminal window to add data to Kafka using the
load_data.py script in the `yourdata-files` directory. This requires **kafka-python-ng** installed
via `pip3 install kafka-python-ng`.
1. Load the content data: `python3 load_data.py content.json.gz localhost:9092 content --msg 50`.
Wait until it finishes, which takes about two minutes. Check the Flink Dashboard running
at http://localhost:8081/ to see the progress. Wait until the task turns blue again.
1. Load the clickstream
data: `python3 load_data.py clickstream.json.gz localhost:9092 clickstream --msg 100`. This loads
100 clicks per second. Wait a few seconds for some data to load. Let this run in the background
until it finishes (about 4 minutes).
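The two load steps above differ only in dataset, topic, and rate. As a reference, this sketch prints both commands (file names, broker address, and flags copied verbatim from the steps; remember to let the content load finish before starting the clickstream load):

```shell
# Print the two load_data.py invocations from the steps above.
for args in \
  "content.json.gz localhost:9092 content --msg 50" \
  "clickstream.json.gz localhost:9092 clickstream --msg 100"; do
  echo "python3 load_data.py $args"
done
```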

Open GraphiQL and query the data:
`http://localhost:8888/graphiql/`

This query provides recommendations based on the specified page URL:
Query for recommendations either by page:

```graphql
query {
@@ -146,8 +107,7 @@ query {
}
```

This query provides recommendations based on the user's clickstream data, showing similar content to
what the user has interacted with:
or for a user:

```graphql
query {
@@ -158,4 +118,8 @@ query {
}
```

Once you are done, hit `CTRL-C` and take down the pipeline containers with `docker compose down -v`.
You can find all the page URLs in the file `datawiki/wikipedia_urls.txt` and user ids in the
file `yourdata-files/clickstream.json.gz` (read it with `gzcat`) if you want to experiment with
different queries.

Once you are done, hit `CTRL-C` and take down the pipeline.
@@ -0,0 +1,5 @@
CREATE TABLE AddClick (
url STRING NOT NULL,
userid STRING NOT NULL,
event_time TIMESTAMP_LTZ(3) NOT NULL METADATA FROM 'timestamp'
);

This file was deleted.
