
# Redpanda Examples in Scala

Blog: https://redpanda.com/blog/buildling-streaming-application-docker-spark-redpanda/

## Build

Prerequisites: `sbt` (the project builds against Scala 2.12).

```shell
$ git clone https://github.com/vectorizedio/redpanda-examples.git
$ cd redpanda-examples/spark/scala
$ sbt clean assembly
$ ls target/scala-2.12/redpanda-examples-assembly-1.0.0.jar
```

## Apache Spark Streaming Example

This example demonstrates how to read from and write to Redpanda with Spark Streaming.

### Create Local Environment

Use Docker Compose to create a local environment with three containers: a Spark master, a Spark worker, and a Redpanda broker.

```shell
$ cd redpanda-examples/spark
$ docker-compose -f redpanda-spark.yml up -d
[+] Running 4/4
 ⠿ Network docker-compose_redpanda_network  Created
 ⠿ Container spark-master                   Started
 ⠿ Container redpanda                       Started
 ⠿ Container spark-worker                   Started
```

When you are finished with the example, tear down the containers and their volumes with:

```shell
$ docker-compose -f redpanda-spark.yml down -v
```

### Write Messages to Redpanda

The ProducerExample writes stock market activity data (SPX Historical Data downloaded from nasdaq.com) to Redpanda as JSON-formatted messages:

```json
{"Volume":"--","High":"4713.57","Close/Last":"4712.02","Open":"4687.64","Date":"12/10/2021","Low":"4670.24"}
```
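As an illustration of what such a producer does, here is a minimal sketch using the plain Kafka client API, which Redpanda speaks natively. This is not the actual ProducerExample source; the topic name and the `localhost:19092` external listener are taken from the commands elsewhere in this README.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerSketch {
  // Build a record for a topic; Redpanda speaks the Kafka protocol,
  // so the standard Kafka client types work unchanged.
  def record(topic: String, json: String): ProducerRecord[String, String] =
    new ProducerRecord[String, String](topic, json)

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:19092") // assumed external listener
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    val json =
      """{"Volume":"--","High":"4713.57","Close/Last":"4712.02","Open":"4687.64","Date":"12/10/2021","Low":"4670.24"}"""
    producer.send(record("spx_history", json))
    producer.flush()
    producer.close()
  }
}
```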

Run the producer with sbt or scala, passing the producer configuration file (which must include bootstrap.servers) and the topic name as arguments:

```shell
$ cd redpanda-examples/spark/scala
$ sbt "run redpanda.config spx_history"
Multiple main classes detected. Select one to run:
 [1] com.redpanda.examples.clients.ConsumerExample
 [2] com.redpanda.examples.clients.ProducerExample *
 [3] com.redpanda.examples.spark.RedpandaSparkStream
```

Or run the assembly jar directly with scala:

```shell
$ scala -classpath target/scala-2.12/redpanda-examples-assembly-1.0.0.jar com.redpanda.examples.clients.ProducerExample redpanda.config spx_history
```
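The redpanda.config file passed above is a standard Kafka client properties file. Its exact contents are not shown in this README, but a minimal version, assuming the broker's external listener on localhost:19092, could be as small as:

```properties
bootstrap.servers=localhost:19092
```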

### Run Spark Streaming Application

The streaming application RedpandaSparkStream reads the JSON-formatted messages from Redpanda into a Spark SQL DataFrame in 2-second micro-batches. A simple function is run on the DataFrame to derive a new column before the data is written back to another Redpanda topic; the stream is also printed to the console.
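The actual RedpandaSparkStream source lives in this repository; the pipeline it describes can be sketched with Spark Structured Streaming roughly as follows. The derived `Range` column and the checkpoint path are hypothetical stand-ins, and the broker/topic arguments mirror the spark-submit invocation in this README.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json, struct, to_json}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object RedpandaStreamSketch {
  // Schema of the producer's JSON messages (all fields arrive as quoted strings)
  val messageSchema: StructType =
    StructType(Seq("Date", "Open", "High", "Low", "Close/Last", "Volume")
      .map(StructField(_, StringType)))

  def main(args: Array[String]): Unit = {
    val Array(brokers, inTopic, outTopic) = args

    val spark = SparkSession.builder.appName("redpanda-stream-sketch").getOrCreate()

    // Read raw Kafka records from Redpanda and parse the JSON value
    val parsed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", brokers)
      .option("subscribe", inTopic)
      .load()
      .select(from_json(col("value").cast("string"), messageSchema).alias("r"))
      .select("r.*")

    // Derive a new column (hypothetical example: the intraday High-Low range)
    val withRange = parsed.withColumn("Range",
      col("High").cast("double") - col("Low").cast("double"))

    // Print each micro-batch to the console
    withRange.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("2 seconds"))
      .start()

    // Write the enriched rows back to another Redpanda topic as JSON
    withRange
      .select(to_json(struct(col("*"))).alias("value"))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", brokers)
      .option("topic", outTopic)
      .option("checkpointLocation", "/tmp/redpanda-stream-sketch") // hypothetical path
      .trigger(Trigger.ProcessingTime("2 seconds"))
      .start()

    spark.streams.awaitAnyTermination()
  }
}
```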

To avoid installing Spark on your local machine, the easiest way to run the application is to spin up another Docker container with the necessary libraries already installed. To make it even easier, bind-mount the local directory into the container for easy access to the target .jar file:

```shell
$ docker run --rm -it --user=root -e SPARK_MASTER="spark://spark-master:7077" -v `pwd`:/project --network spark_redpanda_network -p 4040:4040 docker.io/bitnami/spark:3 /bin/bash
```

Then, from the shell inside the container, submit the application:

```shell
# spark-submit --master $SPARK_MASTER --class com.redpanda.examples.spark.RedpandaSparkStream /project/target/scala-2.12/redpanda-examples-assembly-1.0.0.jar 172.24.1.4:9092 spx_history spx_history_diff
```

Note that the Redpanda broker must be addressed by its internal address and port (172.24.1.4:9092) from within the container, and by its external address and port (localhost:19092) from the local machine.

### Read Messages from Redpanda

Consume the modified messages from Redpanda using ConsumerExample:

```shell
$ sbt "run redpanda.config spx_history_diff"
Multiple main classes detected. Select one to run:
 [1] com.redpanda.examples.clients.ConsumerExample *
 [2] com.redpanda.examples.clients.ProducerExample
 [3] com.redpanda.examples.spark.RedpandaSparkStream
```

Or run the assembly jar directly with scala:

```shell
$ scala -classpath target/scala-2.12/redpanda-examples-assembly-1.0.0.jar com.redpanda.examples.clients.ConsumerExample redpanda.config spx_history_diff
```

Or simply use rpk:

```shell
$ rpk topic consume spx_history_diff --brokers localhost:19092 --offset "start"
```
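For reference, the consume loop in a client like ConsumerExample can be sketched with the standard Kafka consumer API. This is not the actual ConsumerExample source; the group id is hypothetical, and localhost:19092 is the external listener noted above.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object ConsumerSketch {
  // Standard Kafka consumer settings; Redpanda accepts these unchanged
  def consumerProps(bootstrap: String, groupId: String): Properties = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrap)
    props.put("group.id", groupId) // hypothetical group id
    props.put("auto.offset.reset", "earliest")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props
  }

  def main(args: Array[String]): Unit = {
    val consumer =
      new KafkaConsumer[String, String](consumerProps("localhost:19092", "spx-sketch"))
    consumer.subscribe(Collections.singletonList("spx_history_diff"))
    try {
      while (true) {
        // Poll in half-second batches and print each modified message
        consumer.poll(Duration.ofMillis(500)).forEach(r => println(r.value()))
      }
    } finally consumer.close()
  }
}
```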