diff --git a/finance-credit-card-chatbot/credit-card-analytics/creditcard-kafka/rewardsink.table.sql b/finance-credit-card-chatbot/credit-card-analytics/creditcard-kafka/rewardsink.table.sql deleted file mode 100644 index 5bf9dba7..00000000 --- a/finance-credit-card-chatbot/credit-card-analytics/creditcard-kafka/rewardsink.table.sql +++ /dev/null @@ -1,13 +0,0 @@ -CREATE TABLE CustomerReward ( - PRIMARY KEY (`customerId`) NOT ENFORCED -) WITH ( - 'connector' = 'kafka', - 'properties.bootstrap.servers' = '${PROPERTIES_BOOTSTRAP_SERVERS}', - 'properties.group.id' = 'mygroupid', - 'scan.startup.mode' = 'group-offsets', - 'properties.auto.offset.reset' = 'earliest', - 'key.format' = 'raw', - 'key.fields' = 'customerId', - 'value.format' = 'flexible-json', - 'topic' = 'customerreward' - ); \ No newline at end of file diff --git a/getting-started-examples/00_getting_started/README.md b/getting-started-examples/00_getting_started/README.md deleted file mode 100644 index 8bdbf3e6..00000000 --- a/getting-started-examples/00_getting_started/README.md +++ /dev/null @@ -1,83 +0,0 @@ -# DataSQRL Personal Examples - -This repository contains a curated set of practical, beginner-friendly examples for learning and experimenting with [DataSQRL](https://github.com/DataSQRL/sqrl). Each example demonstrates a real-world data pipeline pattern using technologies like Kafka, Iceberg, Glue, and Schema Registry. - -## 📁 Example Index - -| Folder | Description | -| ------------------------------------------------ |--------------------------------------------------------------------------------------------------------| -| **01\_kafka\_to\_console** | Reads from Kafka and outputs to console. Simple setup for basic Kafka ingestion testing. | -| **02\_kafka\_to\_kafka** | Reads from one Kafka topic and writes to another. Useful for learning basic Kafka transformations. | -| **03\_two\_streams\_kafka\_to\_kafka** | Combines two Kafka topics, performs stream joins/enrichment, and writes to Kafka. 
| -| **04\_two\_streams\_external\_kafka\_to\_kafka** | Simulates a more decoupled version of multi-stream joins. Good for external integration scenarios. | -| **05\_file\_iceberg\_test** | Writes data to local file-based Iceberg tables. Great for learning Iceberg without cloud dependencies. | -| **06\_external\_kafka\_iceberg\_test** | Kafka to Iceberg using a local warehouse directory. Minimal config, good for staging/testing. | -| **07\_external\_kafka\_iceberg\_glue\_test** | Kafka to Iceberg using AWS Glue as catalog and S3 as storage. | -| **08\_schema\_registry\_kafka\_to\_kafka** | Kafka-to-Kafka pipeline using Confluent Schema Registry for Avro schema management. | - -Each folder includes: - -* `package.json`: Configuration for engines, connectors, and environment -* `.sqrl` script(s): The logic for the pipeline -* `data_generator/`: Input data generation scripts and sample files - ---- - -## 🚀 Getting Started - -### Prerequisites - -* [Docker](https://docs.docker.com/get-docker/) installed -* Optional: Kafka and Schema Registry (locally or in cloud) -* For Glue/S3 integration: AWS CLI credentials (`~/.aws`) mounted in Docker - -### Run Any Example - -```bash -docker run -it --rm \ - -p 8888:8888 \ - -p 8081:8081 \ - -v $PWD:/build \ - datasqrl/cmd:dev run -c package.json -``` - -If using AWS or external services, extend with environment mounts: - -```bash -docker run -it --rm \ - -p 8888:8888 \ - -p 8081:8081 \ - -v $PWD:/build \ - -v ~/.aws:/root/.aws \ - -e AWS_REGION=us-east-1 \ - -e S3_WAREHOUSE_PATH=s3://your-bucket/path/ \ - datasqrl/cmd:dev run -c package.json -``` - -### Compile Without Running - -```bash -docker run -it --rm \ - -v $PWD:/build \ - datasqrl/cmd:dev compile -c package.json -``` - ---- - -## 🤔 Why These Examples? 
- -These examples are designed to: - -* Be self-contained and runnable out-of-the-box -* Good to get started with datasqrl -* Serve as a foundation for building your own DataSQRL pipelines - ---- - -## 📚 Learn More - -* 📘 [DataSQRL Docs](https://datasqrl.github.io/sqrl) -* 💻 [GitHub Repository](https://github.com/DataSQRL/sqrl) -* 💬 [Community Discord](https://docs.datasqrl.com/community/) - -Feel free to fork and build on top of these examples! diff --git a/getting-started-examples/01_kafka_to_console/README.md b/getting-started-examples/01_kafka_to_console/README.md index 67ddf4a2..aebdc757 100644 --- a/getting-started-examples/01_kafka_to_console/README.md +++ b/getting-started-examples/01_kafka_to_console/README.md @@ -3,21 +3,19 @@ This project demonstrates how to use [DataSQRL](https://datasqrl.com) to build a streaming pipeline that: - Reads data from a kafka topic and prints output to console -- Kafka is part of datasqrl package. - - +- Kafka is part of the DataSQRL package. ## 🐳 Running DataSQRL Run the following command from the project root where your `package.json` and SQRL scripts reside: ```bash -docker run -it --rm -p 8888:8888 -p 9092:9092 -v $PWD:/build datasqrl/cmd:dev run -c package.json +docker run -it --rm -p 8888:8888 -p 9092:9092 -v $PWD:/build datasqrl/cmd:0.7.0 run -c package.json ``` ## Generate Data * Go to `data-generator` folder ```bash - python3 send_kafka_avro_records.py ../kafka-source/contact.avsc data.jsonl contact localhost:9092 -``` \ No newline at end of file +python3 send_kafka_avro_records.py ../kafka-source/contact.avsc data.jsonl contact localhost:9092 +``` diff --git a/getting-started-examples/01_kafka_to_console/kafka-console.sqrl b/getting-started-examples/01_kafka_to_console/kafka-console.sqrl index 2659493c..2d1b05f6 100644 --- a/getting-started-examples/01_kafka_to_console/kafka-console.sqrl +++ b/getting-started-examples/01_kafka_to_console/kafka-console.sqrl @@ -1,4 +1,4 @@ IMPORT kafka-source.Contact; -- Export the 
results to console (print) -EXPORT Contact TO print.Contact; \ No newline at end of file +EXPORT Contact TO print.Contact; diff --git a/getting-started-examples/01_kafka_to_console/package.json b/getting-started-examples/01_kafka_to_console/package.json index aba2b763..7cd4067c 100644 --- a/getting-started-examples/01_kafka_to_console/package.json +++ b/getting-started-examples/01_kafka_to_console/package.json @@ -13,10 +13,5 @@ }, "test-runner": { "create-topics": ["contact", "contact"] - }, - "dependencies": { - "metrics": { - "folder": "kafka-source" - } } } diff --git a/getting-started-examples/02_kafka_to_kafka/README.md b/getting-started-examples/02_kafka_to_kafka/README.md index 3f2c16bd..02b68dac 100644 --- a/getting-started-examples/02_kafka_to_kafka/README.md +++ b/getting-started-examples/02_kafka_to_kafka/README.md @@ -3,16 +3,14 @@ This project demonstrates how to use [DataSQRL](https://datasqrl.com) to build a streaming pipeline that: - Reads data from a kafka topic and writes to another kafka topic -- Kafka is part of datasqrl package. - - +- Kafka is part of the DataSQRL package. ## 🐳 Running DataSQRL Run the following command from the project root where your `package.json` and SQRL scripts reside: ```bash -docker run -it --rm -p 8888:8888 -p 9092:9092 -v $PWD:/build datasqrl/cmd:dev run -c package.json +docker run -it --rm -p 8888:8888 -p 9092:9092 -v $PWD:/build datasqrl/cmd:0.7.0 run -c package.json ``` ## Generate Data @@ -25,4 +23,4 @@ docker run -it --rm -p 8888:8888 -p 9092:9092 -v $PWD:/build datasqrl/cmd:d ## Output -* Updated records should be generated in contactupdated topic. \ No newline at end of file +* Updated records should be generated in `contactupdated` topic. 
diff --git a/getting-started-examples/02_kafka_to_kafka/kafka-test.sqrl b/getting-started-examples/02_kafka_to_kafka/kafka-test.sqrl index 99cf7588..11c30413 100644 --- a/getting-started-examples/02_kafka_to_kafka/kafka-test.sqrl +++ b/getting-started-examples/02_kafka_to_kafka/kafka-test.sqrl @@ -2,4 +2,4 @@ IMPORT kafka-source.Contact as Contacts; ContactsUpdated := SELECT firstname, lastname, last_updated FROM Contacts; -EXPORT ContactsUpdated TO kafkasink.ContactUpdated; \ No newline at end of file +EXPORT ContactsUpdated TO kafka-sink.ContactUpdated; diff --git a/getting-started-examples/02_kafka_to_kafka/package.json b/getting-started-examples/02_kafka_to_kafka/package.json index cdbd1243..db4378ff 100644 --- a/getting-started-examples/02_kafka_to_kafka/package.json +++ b/getting-started-examples/02_kafka_to_kafka/package.json @@ -13,13 +13,5 @@ }, "test-runner": { "create-topics": ["contact", "contactupdated"] - }, - "dependencies": { - "kafkasource": { - "folder": "kafka-source" - }, - "kafkasink": { - "folder": "kafka-sink" - } } } diff --git a/getting-started-examples/03_two_streams_kafka_to_kafka/README.md b/getting-started-examples/03_two_streams_kafka_to_kafka/README.md index 73cec4b4..fcb8ebaa 100644 --- a/getting-started-examples/03_two_streams_kafka_to_kafka/README.md +++ b/getting-started-examples/03_two_streams_kafka_to_kafka/README.md @@ -13,7 +13,7 @@ This project demonstrates how to use [DataSQRL](https://datasqrl.com) to build a Run the following command from the project root where your `package.json` and SQRL scripts reside: ```bash -docker run -it --rm -p 8888:8888 -p 9092:9092 -v $PWD:/build datasqrl/cmd:dev run -c package.json +docker run -it --rm -p 8888:8888 -p 9092:9092 -v $PWD:/build datasqrl/cmd:0.7.0 run -c package.json ``` ## Generate Data @@ -32,4 +32,4 @@ docker run -it --rm -p 8888:8888 -p 9092:9092 -v $PWD:/build datasqrl/cmd:d ## Output -* Updated records should be generated in enrichedcontact topic. 
\ No newline at end of file +* Updated records should be generated in enrichedcontact topic. diff --git a/getting-started-examples/03_two_streams_kafka_to_kafka/kafka-test.sqrl b/getting-started-examples/03_two_streams_kafka_to_kafka/kafka-test.sqrl index b8c0cfda..a705e1c6 100644 --- a/getting-started-examples/03_two_streams_kafka_to_kafka/kafka-test.sqrl +++ b/getting-started-examples/03_two_streams_kafka_to_kafka/kafka-test.sqrl @@ -7,4 +7,4 @@ EnrichedContacts := SELECT c.id, c.firstname, c.lastname, o.orgname, c.last_upda ON c.id = o.userid AND c.last_updated BETWEEN o.last_updated - INTERVAL '30' SECOND AND o.last_updated + INTERVAL '30' SECOND; -EXPORT EnrichedContacts TO kafkasink.EnrichedContact; +EXPORT EnrichedContacts TO kafka-sink.EnrichedContact; diff --git a/getting-started-examples/03_two_streams_kafka_to_kafka/package.json b/getting-started-examples/03_two_streams_kafka_to_kafka/package.json index fab85ff7..f57a1360 100644 --- a/getting-started-examples/03_two_streams_kafka_to_kafka/package.json +++ b/getting-started-examples/03_two_streams_kafka_to_kafka/package.json @@ -13,13 +13,5 @@ }, "test-runner": { "create-topics": ["contact", "organization", "enrichedcontact"] - }, - "dependencies": { - "kafkasource": { - "folder": "kafka-source" - }, - "kafkasink": { - "folder": "kafka-sink" - } } } diff --git a/getting-started-examples/04_two_streams_external_kafka_to_kafka/README.md b/getting-started-examples/04_two_streams_external_kafka_to_kafka/README.md index 5fc74d88..260a2b4a 100644 --- a/getting-started-examples/04_two_streams_external_kafka_to_kafka/README.md +++ b/getting-started-examples/04_two_streams_external_kafka_to_kafka/README.md @@ -5,23 +5,23 @@ This project demonstrates how to use [DataSQRL](https://datasqrl.com) to build a - This example uses kafka that is running outside of docker on host machine - Reads data from two kafka topics and combines the data from two streams using temporal join - Writes output to another kafka topic 
running on host machine -- We are not using kafka running inside of datasqrl +- We are not using kafka running inside DataSQRL ## Few things to note -1. `'properties.bootstrap.servers' = 'host.docker.internal:9092'`, -> this tells docker to connect to your host machine's kafka -2. You don't need to create output topic `enrichedcontact` -3. `create-topics` array from package.json was removed since we are using external kafka where we expect source topics to be present -4. We removed kafka engine from `enabled-engines` array in package.json - +* `'properties.bootstrap.servers' = 'host.docker.internal:9092'` -> this tells Docker to connect to your host machine's Kafka +* You don't need to create output topic `enrichedcontact` +* `create-topics` array from `package.json` was removed since we are using external Kafka, where we expect source topics to be present +* We removed the kafka engine from the `enabled-engines` array in `package.json` ## 🐳 Running DataSQRL Run the following command from the project root where your `package.json` and SQRL scripts reside: -Note: We removed `-p 9092:9092` as we are using our own kafka running locally on host machine now ```bash -docker run -it --rm -p 8888:8888 -p 8081:8081 -v $PWD:/build datasqrl/cmd:dev run -c package.json +docker run -it --rm -p 8888:8888 -p 8081:8081 -v $PWD:/build -v $PWD/data:/data datasqrl/cmd:0.7.0 run -c package.json ``` +> [!NOTE] +> We removed `-p 9092:9092` as we are using our own Kafka running locally on the host machine now ## Generate Data @@ -36,9 +36,6 @@ docker run -it --rm -p 8888:8888 -p 8081:8081 -v $PWD:/build datasqrl/cmd:dev ru python3 load_data.py organization.jsonl localhost:9092 organization ``` - ## Output -* Updated records should be generated in enrichedcontact topic. - - +* Updated records should be generated in the `enrichedcontact` topic.
diff --git a/getting-started-examples/04_two_streams_external_kafka_to_kafka/kafka-docker/docker-compose.yml b/getting-started-examples/04_two_streams_external_kafka_to_kafka/docker-compose.yml similarity index 100% rename from getting-started-examples/04_two_streams_external_kafka_to_kafka/kafka-docker/docker-compose.yml rename to getting-started-examples/04_two_streams_external_kafka_to_kafka/docker-compose.yml diff --git a/getting-started-examples/04_two_streams_external_kafka_to_kafka/kafka-test.sqrl b/getting-started-examples/04_two_streams_external_kafka_to_kafka/kafka-test.sqrl index a7768884..94496fe1 100644 --- a/getting-started-examples/04_two_streams_external_kafka_to_kafka/kafka-test.sqrl +++ b/getting-started-examples/04_two_streams_external_kafka_to_kafka/kafka-test.sqrl @@ -1,16 +1,15 @@ -IMPORT kafkasource.Contact AS Contacts; -IMPORT kafkasource.Organization AS Organizations; +IMPORT kafka-source.Contact AS Contacts; +IMPORT kafka-source.Organization AS Organizations; /* Joins Contacts and Organizations Matches rows where the contact's id equals the organization's userid But only if their last_updated timestamps are within ±5 seconds of each other */ - EnrichedContacts := SELECT c.id, c.firstname, c.lastname, o.orgname, c.last_updated FROM Contacts c JOIN Organizations o ON c.id = o.userid AND c.last_updated BETWEEN o.last_updated - INTERVAL '30' SECOND AND o.last_updated + INTERVAL '30' SECOND; -EXPORT EnrichedContacts TO kafkasink.EnrichedContact; +EXPORT EnrichedContacts TO kafka-sink.EnrichedContact; diff --git a/getting-started-examples/04_two_streams_external_kafka_to_kafka/package.json b/getting-started-examples/04_two_streams_external_kafka_to_kafka/package.json index 58bcc373..38e97cb4 100644 --- a/getting-started-examples/04_two_streams_external_kafka_to_kafka/package.json +++ b/getting-started-examples/04_two_streams_external_kafka_to_kafka/package.json @@ -10,13 +10,5 @@ "table.exec.source.idle-timeout": "60 s" } } - }, - 
"dependencies": { - "kafkasource": { - "folder": "kafka-source" - }, - "kafkasink": { - "folder": "kafka-sink" - } } } diff --git a/getting-started-examples/05_file_iceberg_test/README.md b/getting-started-examples/05_file_iceberg_test/README.md index 547d287c..b9864636 100644 --- a/getting-started-examples/05_file_iceberg_test/README.md +++ b/getting-started-examples/05_file_iceberg_test/README.md @@ -9,20 +9,13 @@ This project demonstrates how to use [DataSQRL](https://datasqrl.com) to build a Run the following command from the project root where your `package.json` and SQRL scripts reside: ```bash -docker run -it --rm \ - -p 8888:8888 -p 8081:8081 \ - -v $PWD:/build \ - -e LOCAL_WAREHOUSE_DIR=warehouse \ - datasqrl/cmd:dev run -c package.json +docker run -it --rm -p 8888:8888 -p 8081:8081 -v $PWD:/build -v $PWD/data:/data datasqrl/cmd:0.7.0 run -c package.json ``` -Note: It will store iceberg files in `warehouse` directory - - +> [!NOTE] +> Iceberg files will be stored in the `warehouse` directory set by `package.json` ## Output -* There should be iceberg files and folders generated in warehouse directory -* Data for the output table will reside in ProcessedData (as defined in the sqrl script) - - +* There should be iceberg files and folders generated in `$PWD/data/iceberg` directory +* Data for the output table will reside in `ProcessedData` (as defined in the sqrl script) diff --git a/getting-started-examples/05_file_iceberg_test/file-to-iceberg.sqrl b/getting-started-examples/05_file_iceberg_test/file-to-iceberg.sqrl index a6ee9d15..7dec237a 100644 --- a/getting-started-examples/05_file_iceberg_test/file-to-iceberg.sqrl +++ b/getting-started-examples/05_file_iceberg_test/file-to-iceberg.sqrl @@ -1,4 +1,3 @@ -IMPORT file-data.MyFile AS InputFile; +IMPORT testdata.MyFile AS InputFile; ProcessedData := SELECT id, name, amount, event_time FROM InputFile; - diff --git a/getting-started-examples/05_file_iceberg_test/package.json 
b/getting-started-examples/05_file_iceberg_test/package.json index d8372dd8..b95113b0 100644 --- a/getting-started-examples/05_file_iceberg_test/package.json +++ b/getting-started-examples/05_file_iceberg_test/package.json @@ -18,10 +18,5 @@ "catalog-type": "hadoop", "catalog-name": "mycatalog" } - }, - "dependencies": { - "file-data": { - "folder": "file-data" - } } } diff --git a/getting-started-examples/05_file_iceberg_test/file-data/MyFile.table.sql b/getting-started-examples/05_file_iceberg_test/testdata/MyFile.table.sql similarity index 100% rename from getting-started-examples/05_file_iceberg_test/file-data/MyFile.table.sql rename to getting-started-examples/05_file_iceberg_test/testdata/MyFile.table.sql diff --git a/getting-started-examples/05_file_iceberg_test/file-data/myfile.jsonl b/getting-started-examples/05_file_iceberg_test/testdata/myfile.jsonl similarity index 100% rename from getting-started-examples/05_file_iceberg_test/file-data/myfile.jsonl rename to getting-started-examples/05_file_iceberg_test/testdata/myfile.jsonl diff --git a/getting-started-examples/06_external_kafka_iceberg_test/README.md b/getting-started-examples/06_external_kafka_iceberg_test/README.md index 1d1239c1..bb75d02c 100644 --- a/getting-started-examples/06_external_kafka_iceberg_test/README.md +++ b/getting-started-examples/06_external_kafka_iceberg_test/README.md @@ -2,21 +2,17 @@ This project demonstrates how to use [DataSQRL](https://datasqrl.com) to build a streaming pipeline that: -- This example uses kafka that is running outside of datasqrl docker on host machine +- This example uses Kafka that is running outside DataSQRL docker on host machine - Reads data from kafka topic and writes to an iceberg table locally - ## 🐳 Running DataSQRL Run the following command from the project root where your `package.json` and SQRL scripts reside: -Note: We removed `-p 9092:9092` as we are using our own kafka running locally on host machine now ```bash -docker run -it --rm \ - -p 
8888:8888 -p 8081:8081 \ - -v $PWD:/build \ - -e LOCAL_WAREHOUSE_DIR=warehouse \ - datasqrl/cmd:dev run -c package.json +docker run -it --rm -p 8888:8888 -p 8081:8081 -v $PWD:/build -v $PWD/data:/data datasqrl/cmd:0.7.0 run -c package.json ``` +> [!NOTE] +> We removed `-p 9092:9092` as we are using our own kafka running locally on host machine now ## Generate Data @@ -27,10 +23,6 @@ docker run -it --rm \ python3 load_data.py contacts.jsonl contact ``` - - ## Output -* Updated records should be generated in enrichedcontact topic. - - +* Updated records should be generated in `enrichedcontact` topic. diff --git a/getting-started-examples/06_external_kafka_iceberg_test/iceberg-sink/EnrichedContact.table.sql b/getting-started-examples/06_external_kafka_iceberg_test/iceberg-sink/EnrichedContact.table.sql index 3e7afd98..e41ee852 100644 --- a/getting-started-examples/06_external_kafka_iceberg_test/iceberg-sink/EnrichedContact.table.sql +++ b/getting-started-examples/06_external_kafka_iceberg_test/iceberg-sink/EnrichedContact.table.sql @@ -4,6 +4,6 @@ CREATE TABLE EnrichedContact ( 'connector' = 'iceberg', 'catalog-name' = 'mycatalog', 'catalog-type' = 'hadoop', - 'warehouse' = '${LOCAL_WAREHOUSE_DIR}', + 'warehouse' = '/data/iceberg', 'format-version' = '2' ); diff --git a/getting-started-examples/06_external_kafka_iceberg_test/kafka-to-iceberg.sqrl b/getting-started-examples/06_external_kafka_iceberg_test/kafka-to-iceberg.sqrl index ada55d7c..c479d146 100644 --- a/getting-started-examples/06_external_kafka_iceberg_test/kafka-to-iceberg.sqrl +++ b/getting-started-examples/06_external_kafka_iceberg_test/kafka-to-iceberg.sqrl @@ -6,4 +6,3 @@ EnrichedContacts := SELECT lastname, last_updated FROM Contacts; - diff --git a/getting-started-examples/06_external_kafka_iceberg_test/load_data.py b/getting-started-examples/06_external_kafka_iceberg_test/load_data.py deleted file mode 100644 index 1eef958e..00000000 --- 
a/getting-started-examples/06_external_kafka_iceberg_test/load_data.py +++ /dev/null @@ -1,20 +0,0 @@ -import json -import sys -from kafka import KafkaProducer - -def load_data(file_path, topic): - producer = KafkaProducer( - bootstrap_servers='localhost:9092', - value_serializer=lambda v: json.dumps(v).encode('utf-8') - ) - with open(file_path, 'r') as f: - for line in f: - record = json.loads(line) - producer.send(topic, record) - producer.flush() - -if __name__ == "__main__": - if len(sys.argv) != 3: - print("Usage: python load_data.py ") - sys.exit(1) - load_data(sys.argv[1], sys.argv[2]) diff --git a/getting-started-examples/06_external_kafka_iceberg_test/package.json b/getting-started-examples/06_external_kafka_iceberg_test/package.json index c47114e9..40dd2d1c 100644 --- a/getting-started-examples/06_external_kafka_iceberg_test/package.json +++ b/getting-started-examples/06_external_kafka_iceberg_test/package.json @@ -18,13 +18,5 @@ "catalog-type": "hadoop", "catalog-name": "mycatalog" } - }, - "dependencies": { - "kafka-source": { - "folder": "kafka-source" - }, - "iceberg-sink": { - "folder": "iceberg-sink" - } } } diff --git a/getting-started-examples/07_external_kafka_iceberg_glue_test/README.md b/getting-started-examples/07_external_kafka_iceberg_glue_test/README.md index 10b62ede..33057453 100644 --- a/getting-started-examples/07_external_kafka_iceberg_glue_test/README.md +++ b/getting-started-examples/07_external_kafka_iceberg_glue_test/README.md @@ -5,18 +5,13 @@ This project demonstrates how to use [DataSQRL](https://datasqrl.com) to build a - This example demonstrates how to use kafka and glue - We read from kafka topic and write data to s3 in iceberg glue format -## Note: -- To run this program you would need access to aws s3 -- you should have a credentials file in your ~/.aws directory - +> [!IMPORTANT] +> To run this program you would need access to AWS S3 +> You should have a credentials file in your ~/.aws directory ## 🐳 Running DataSQRL Run 
the following command from the project root where your `package.json` and SQRL scripts reside: - -Note: -- We removed `-p 9092:9092` as we are using our own kafka running locally on host machine now -- you also need to set `s3://[BUCKET]/path/to/folder/` ```bash docker run -it --rm \ -p 8888:8888 \ @@ -24,24 +19,22 @@ docker run -it --rm \ -v $PWD:/build \ -v ~/.aws:/root/.aws \ -e AWS_REGION=us-east-1 \ - -e S3_WAREHOUSE_PATH=s3://[BUCKET]/path/to/folder/ \ - datasqrl/cmd:dev run -c package.json - + -e S3_WAREHOUSE_PATH=s3:///path/to/warehouse/ \ + datasqrl/cmd:0.7.0 run -c package.json ``` +> [!NOTE] +> We removed `-p 9092:9092` as we are using our own Kafka running locally on host machine now +> You also need to set `s3:///path/to/warehouse/` ## Generate Data * Go to `data-generator` folder - * `python3 load_data.py ` + * `python3 load_data.py ` * To send Contact data ```bash python3 load_data.py contacts.jsonl contact ``` - - ## Output -* Updated records should be generated in enrichedcontact topic. - - +* Updated records should be generated in `enrichedcontact` table. 
diff --git a/getting-started-examples/07_external_kafka_iceberg_glue_test/kafka-to-iceberg.sqrl b/getting-started-examples/07_external_kafka_iceberg_glue_test/kafka-to-iceberg.sqrl index a1ee781e..6a833d9a 100644 --- a/getting-started-examples/07_external_kafka_iceberg_glue_test/kafka-to-iceberg.sqrl +++ b/getting-started-examples/07_external_kafka_iceberg_glue_test/kafka-to-iceberg.sqrl @@ -6,4 +6,3 @@ enrichedcontacts := SELECT lastname, last_updated FROM contacts; - diff --git a/getting-started-examples/07_external_kafka_iceberg_glue_test/package.json b/getting-started-examples/07_external_kafka_iceberg_glue_test/package.json index d80e9917..fef37ff0 100644 --- a/getting-started-examples/07_external_kafka_iceberg_glue_test/package.json +++ b/getting-started-examples/07_external_kafka_iceberg_glue_test/package.json @@ -22,10 +22,5 @@ "catalog-database": "nsdatabase", "write.upsert.enabled": "true" } - }, - "dependencies": { - "kafka-source": { - "folder": "kafka-source" - } } } diff --git a/getting-started-examples/08_schema_registry_kafka_to_kafka/README.md b/getting-started-examples/08_schema_registry_kafka_to_kafka/README.md index 7c833989..cfdbdcd0 100644 --- a/getting-started-examples/08_schema_registry_kafka_to_kafka/README.md +++ b/getting-started-examples/08_schema_registry_kafka_to_kafka/README.md @@ -30,25 +30,21 @@ Run the following command from the project root where your `package.json` and SQ docker run -it --rm \ -p 8888:8888 \ -v $PWD:/build \ - datasqrl/cmd:dev run -c package.json + datasqrl/cmd:0.7.0 run -c package.json ``` ## Generate Data * Go to `data-generator` folder - -Usage: - - python send_kafka_avro_records.py - -Example: - - python send_kafka_avro_records.py contact.avsc data.jsonl contact localhost:9092 http://localhost:8081 + * `python3 send_kafka_avro_records.py ` +* Send Contact data +```bash +python3 send_kafka_avro_records.py contact.avsc data.jsonl contact localhost:9092 http://localhost:8081 +``` What it does: - -- Reads 
newline-delimited JSON (.jsonl) file. +- Reads newline-delimited JSON (`.jsonl`) file. - Publishes each record as a Kafka message. - For example, to load static test files to Kafka topics. ## Output -* You should see output topic created called enrichedcontact_avro with data in it \ No newline at end of file +* You should see output topic created called `enrichedcontact_avro` with data in it diff --git a/getting-started-examples/08_schema_registry_kafka_to_kafka/kafka-to-kafka.sqrl b/getting-started-examples/08_schema_registry_kafka_to_kafka/kafka-to-kafka.sqrl index 51bb4bde..f5d1704f 100644 --- a/getting-started-examples/08_schema_registry_kafka_to_kafka/kafka-to-kafka.sqrl +++ b/getting-started-examples/08_schema_registry_kafka_to_kafka/kafka-to-kafka.sqrl @@ -1,4 +1,4 @@ -IMPORT kafkasource.Contact AS Contact; +IMPORT kafka-source.Contact AS Contact; EnrichedContactAvro := SELECT firstname, @@ -6,5 +6,4 @@ EnrichedContactAvro := SELECT CAST(last_updated AS STRING) AS last_updated_str FROM Contact; -EXPORT EnrichedContactAvro TO kafkasink.EnrichedContactAvro; - +EXPORT EnrichedContactAvro TO kafka-sink.EnrichedContactAvro; diff --git a/getting-started-examples/08_schema_registry_kafka_to_kafka/load_data.py b/getting-started-examples/08_schema_registry_kafka_to_kafka/load_data.py deleted file mode 100644 index 1eef958e..00000000 --- a/getting-started-examples/08_schema_registry_kafka_to_kafka/load_data.py +++ /dev/null @@ -1,20 +0,0 @@ -import json -import sys -from kafka import KafkaProducer - -def load_data(file_path, topic): - producer = KafkaProducer( - bootstrap_servers='localhost:9092', - value_serializer=lambda v: json.dumps(v).encode('utf-8') - ) - with open(file_path, 'r') as f: - for line in f: - record = json.loads(line) - producer.send(topic, record) - producer.flush() - -if __name__ == "__main__": - if len(sys.argv) != 3: - print("Usage: python load_data.py ") - sys.exit(1) - load_data(sys.argv[1], sys.argv[2]) diff --git 
a/getting-started-examples/08_schema_registry_kafka_to_kafka/package.json b/getting-started-examples/08_schema_registry_kafka_to_kafka/package.json index dc59ba8a..12e6a1dd 100644 --- a/getting-started-examples/08_schema_registry_kafka_to_kafka/package.json +++ b/getting-started-examples/08_schema_registry_kafka_to_kafka/package.json @@ -11,13 +11,5 @@ "table.exec.source.idle-timeout": "60 s" } } - }, - "dependencies": { - "kafkasource": { - "folder": "kafka-source" - }, - "kafkasink": { - "folder": "kafka-sink" - } } } diff --git a/getting-started-examples/README.md b/getting-started-examples/README.md new file mode 100644 index 00000000..327b0560 --- /dev/null +++ b/getting-started-examples/README.md @@ -0,0 +1,98 @@ +# DataSQRL Personal Examples + +This repository contains a curated set of practical, beginner-friendly examples for learning and experimenting with [DataSQRL](https://github.com/DataSQRL/sqrl). +Each example demonstrates a real-world data pipeline pattern using technologies like Kafka, Iceberg, Glue, and Schema Registry. + +## 📁 Example Index + +| Folder | Description | +|----------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------| +| [**01\_kafka\_to\_console**](./01_kafka_to_console) | Reads from Kafka and outputs to console. Simple setup for basic Kafka ingestion testing. | +| [**02\_kafka\_to\_kafka**](./02_kafka_to_kafka) | Reads from one Kafka topic and writes to another. Useful for learning basic Kafka transformations. | +| [**03\_two\_streams\_kafka\_to\_kafka**](./03_two_streams_kafka_to_kafka) | Combines two Kafka topics, performs stream joins/enrichment, and writes to Kafka. | +| [**04\_two\_streams\_external\_kafka\_to\_kafka**](./04_two_streams_external_kafka_to_kafka) | Simulates a more decoupled version of multi-stream joins. Good for external integration scenarios. 
| [**05\_file\_iceberg\_test**](./05_file_iceberg_test) | Writes data to local file-based Iceberg tables. Great for learning Iceberg without cloud dependencies. | +| [**06\_external\_kafka\_iceberg\_test**](./06_external_kafka_iceberg_test) | Kafka to Iceberg using a local warehouse directory. Minimal config, good for staging/testing. | +| [**07\_external\_kafka\_iceberg\_glue\_test**](./07_external_kafka_iceberg_glue_test) | Kafka to Iceberg using AWS Glue as catalog and S3 as storage. | +| [**08\_schema\_registry\_kafka\_to\_kafka**](./08_schema_registry_kafka_to_kafka) | Kafka-to-Kafka pipeline using Confluent Schema Registry for Avro schema management. | + +Each folder includes: + +* `package.json`: Configuration for engines, connectors, and environment +* `.sqrl` script(s): The logic for the pipeline +* `data_generator/`: Input data generation scripts and sample files + +--- + +## 🚀 Getting Started + +### Prerequisites + +* [Docker](https://docs.docker.com/get-docker/) installed +* Optional: Kafka and Schema Registry (locally or in cloud) +* For Glue/S3 integration: AWS CLI credentials (`~/.aws`) mounted in Docker + +### Run Any Example + +```bash +docker run -it --rm \ + -p 8888:8888 \ + -p 8081:8081 \ + -v $PWD:/build \ + datasqrl/cmd:0.7.0 run -c package.json +``` + +#### Persistent Data + +To preserve the internally persisted data after the Docker container stops running, extend the command with a `/data` mount: + +```bash +docker run -it --rm \ + -p 8888:8888 \ + -p 8081:8081 \ + -v $PWD:/build \ + -v $PWD/data:/data \ + datasqrl/cmd:0.7.0 run -c package.json +``` + +#### Mount External Services + +If using AWS or external services, extend with environment mounts: + +```bash +docker run -it --rm \ + -p 8888:8888 \ + -p 8081:8081 \ + -v $PWD:/build \ + -v ~/.aws:/root/.aws \ + -e AWS_REGION=us-east-1 \ + -e S3_WAREHOUSE_PATH=s3://your-bucket/path/ \ + datasqrl/cmd:0.7.0 run -c package.json +``` + +### Compile Without Running + +```bash +docker run -it --rm \ + -v
$PWD:/build \ + datasqrl/cmd:0.7.0 compile -c package.json +``` + +--- + +## 🤔 Why These Examples? + +These examples are designed to: + +* Be self-contained and runnable out-of-the-box +* Help you get started with DataSQRL +* Serve as a foundation for building your own DataSQRL pipelines + +--- + +## 📚 Learn More + +* 📘 [DataSQRL Docs](https://datasqrl.github.io/sqrl) +* 💻 [GitHub Repository](https://github.com/DataSQRL/sqrl) +* 💬 [Community Discord](https://docs.datasqrl.com/community/) + +Feel free to fork and build on top of these examples!
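Several READMEs in this diff invoke a small `load_data.py` helper whose original body appears in the deletions above. Its core step — parsing newline-delimited JSON and serializing each record to UTF-8 bytes the way the script's `value_serializer` (`json.dumps(v).encode('utf-8')`) does — can be sketched in isolation; the function name `serialize_records` and the broker/topic names in the comment are illustrative, not part of the repo:

```python
import json

def serialize_records(jsonl_lines):
    """Turn newline-delimited JSON lines into UTF-8 payloads, mirroring
    the value_serializer used by the deleted load_data.py helper."""
    payloads = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the .jsonl file
        record = json.loads(line)  # validate the JSON before sending
        payloads.append(json.dumps(record).encode("utf-8"))
    return payloads

# With a broker running, each payload would then be sent via kafka-python,
# roughly as the deleted script did:
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   for p in serialize_records(open("contacts.jsonl")):
#       producer.send("contact", p)
#   producer.flush()
```

Pre-serializing like this lets you sanity-check the `.jsonl` input (malformed lines raise `json.JSONDecodeError`) before any connection to Kafka is attempted.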