# Examples Catalog

<!-- TOC -->

* [Examples Catalog](#examples-catalog)
  * [Wordcount](#wordcount)
  * [Transforms](#transforms)
    * [Aggregation](#aggregation)
    * [Blueprints](#blueprints)
    * [Element-wise](#element-wise)
    * [IO](#io)
    * [ML](#ml)

<!-- TOC -->

## Prerequisites

Build this jar, which is needed by the run command in the next section:

```sh
cd <path_to_beam_repo>/beam; ./gradlew sdks:java:io:google-cloud-platform:expansion-service:shadowJar
```

## Example Run

This module contains a series of Beam YAML code samples that can be run using
the command:

```sh
python -m apache_beam.yaml.main --yaml_pipeline_file=/path/to/example.yaml
```

Depending on the YAML pipeline, the output may be emitted to standard output or
to a file located in the execution folder.
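
For reference, the file passed via `--yaml_pipeline_file` is a YAML document
with a top-level `pipeline` key. A minimal illustrative spec (not one of the
shipped examples) might look like:

```yaml
pipeline:
  type: chain            # run the transforms as a simple linear chain
  transforms:
    - type: Create
      config:
        elements: [1, 2, 3]
    - type: LogForTesting  # print each element to the worker logs
```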

## Wordcount

A good starting place is the [Wordcount](wordcount_minimal.yaml) example under
the root example directory.
This example reads in a text file, splits the text on each word, groups by each
word, and counts the occurrence of each word. This is a classic example used in
the other SDKs and shows off many of the features of Beam YAML.
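
The steps above translate roughly into the following transforms (a hedged
sketch of the approach, not the verbatim contents of `wordcount_minimal.yaml`;
the input path is a placeholder):

```yaml
pipeline:
  type: chain
  transforms:
    - type: ReadFromText
      config:
        path: /path/to/input.txt
    # Split each line into a list of words...
    - type: MapToFields
      config:
        language: python
        fields:
          word: line.split()
    # ...then flatten the list so each word becomes its own element.
    - type: Explode
      config:
        fields: word
    # Group identical words and count the occurrences of each.
    - type: Combine
      config:
        group_by: word
        combine:
          count:
            value: word
            fn: count
    - type: LogForTesting
```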

## Testing

A test file is located in the testing folder that will execute all the example
YAML files and confirm the expected results.

```sh
pytest -v testing/
```

## Transforms

Examples in this directory show off the various built-in transforms of the Beam
YAML framework.

### Aggregation

These examples leverage the built-in `Combine` transform for performing simple
aggregations including sum, mean, count, etc.
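
For instance, a `Combine` step that groups by one field and aggregates another
could look roughly like this (field names are illustrative, following the
pattern in the Beam YAML aggregation docs):

```yaml
- type: Combine
  config:
    group_by: produce
    combine:
      total:
        value: amount
        fn: sum
      average:
        value: amount
        fn: mean
```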
7685
7786### Blueprints
87+
7888These examples leverage DF or other existing templates and convert them to yaml
7989blueprints.
8090
8191### Element-wise
92+
8293These examples leverage the built-in mapping transforms including ` MapToFields ` ,
8394` Filter ` and ` Explode ` . More information can be found about mapping transforms
8495[ here] ( https://beam.apache.org/documentation/sdks/yaml-udf/ ) .
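
As a quick illustration, a filter followed by a field mapping might be written
like this (the field names are made up for the sketch):

```yaml
- type: Filter
  config:
    language: python
    keep: amount > 0          # drop records with non-positive amount
- type: MapToFields
  config:
    language: python
    fields:
      produce: produce        # pass a field through unchanged
      doubled: amount * 2     # derive a new field with a Python expression
```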

### IO

#### Spanner

Examples [Spanner Read](transforms/io/spanner_read.yaml) and
[Spanner Write](transforms/io/spanner_write.yaml) leverage the built-in
`Spanner_Read` and `Spanner_Write` transforms for performing simple reads and
writes against a Google Cloud Spanner database.
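
As a rough orientation, a read step along these lines might be configured as
follows (a sketch only; the config keys and resource names are placeholders,
so check the linked example for the exact shape):

```yaml
- type: Spanner_Read
  config:
    project_id: MY-PROJECT        # placeholder GCP project
    instance_id: MY-INSTANCE      # placeholder Spanner instance
    database_id: MY-DATABASE      # placeholder database
    query: 'SELECT * FROM my_table'
```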

#### Kafka

Examples involving Kafka such as [Kafka Read Write](transforms/io/kafka.yaml)
require users to set up a Kafka cluster that the Dataflow runner executing the
Beam pipeline has access to.
Please note that the `ReadFromKafka` transform has
a [known issue](https://github.com/apache/beam/issues/22809) when
using non-Dataflow portable runners where reading may get stuck in streaming
pipelines. Hence, using the Dataflow runner is recommended for examples that
involve reading from Kafka in a streaming pipeline.

See [here](https://kafka.apache.org/quickstart) for general instructions on
setting up a Kafka cluster. One option is to use
[Click to Deploy](https://console.cloud.google.com/marketplace/details/click-to-deploy-images/kafka)
to quickly launch a Kafka cluster on
[GCE](https://cloud.google.com/products/compute?hl=en). The
[SASL/PLAIN](https://kafka.apache.org/documentation/#security_sasl_plain)
authentication mechanism is configured for the brokers as part of the
deployment. See also
[here](https://github.com/GoogleCloudPlatform/java-docs-samples/tree/main/dataflow/flex-templates/kafka_to_bigquery)
for an alternative step-by-step guide on setting up Kafka on GCE without the
authentication mechanism.

Let's assume one of the bootstrap servers is on VM instance `kafka-vm-0`
with the internal IP address `123.45.67.89`, and that the bootstrap server is
listening on port `9092`. The SASL/PLAIN `USERNAME` and `PASSWORD` can be
viewed in the VM instance's metadata on the GCE console, or with the gcloud
CLI:

```sh
gcloud compute instances describe kafka-vm-0 \
    --format='value[](metadata.items.kafka-user)'
gcloud compute instances describe kafka-vm-0 \
    --format='value[](metadata.items.kafka-password)'
```

The Beam pipeline [Kafka Read Write](transforms/io/kafka.yaml) first writes data
to the Kafka topic using the `WriteToKafka` transform and then reads that data
back using the `ReadFromKafka` transform. Run the pipeline:

```sh
export PROJECT="$(gcloud config get-value project)"
export TEMP_LOCATION="gs://MY-BUCKET/tmp"
export REGION="us-central1"
export JOB_NAME="demo-kafka-`date +%Y%m%d-%H%M%S`"
export NUM_WORKERS="1"

python -m apache_beam.yaml.main \
  --yaml_pipeline_file transforms/io/kafka.yaml \
  --runner DataflowRunner \
  --temp_location $TEMP_LOCATION \
  --project $PROJECT \
  --region $REGION \
  --num_workers $NUM_WORKERS \
  --job_name $JOB_NAME \
  --jinja_variables '{ "BOOTSTRAP_SERVERS": "123.45.67.89:9092",
    "TOPIC": "MY-TOPIC", "USERNAME": "USERNAME", "PASSWORD": "PASSWORD" }'
```

**_Optional_**: If the Kafka cluster is set up without SASL/PLAIN
authentication configured for the brokers, no SASL/PLAIN `USERNAME` and
`PASSWORD` are needed. In the pipelines, omit the configurations
`producer_config_updates` and `consumer_config` from the `WriteToKafka` and
`ReadFromKafka` transforms, and run the commands above without specifying the
username and password in the `--jinja_variables` flag.
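
For orientation, the read half of such a pipeline is configured roughly as
below (a sketch, not the verbatim contents of `kafka.yaml`; the config keys
and SASL property strings should be checked against the example before use).
The Jinja placeholders match the `--jinja_variables` flag shown above:

```yaml
- type: ReadFromKafka
  config:
    format: RAW
    topic: "{{ TOPIC }}"
    bootstrap_servers: "{{ BOOTSTRAP_SERVERS }}"
    # SASL/PLAIN credentials; omit consumer_config entirely when the
    # brokers have no authentication configured.
    consumer_config:
      security.protocol: SASL_PLAINTEXT
      sasl.mechanism: PLAIN
      sasl.jaas.config: >-
        org.apache.kafka.common.security.plain.PlainLoginModule required
        username={{ USERNAME }} password={{ PASSWORD }};
```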

#### Iceberg

The Beam pipelines [Iceberg Write](transforms/io/iceberg_write.yaml) and
[Iceberg Read](transforms/io/iceberg_read.yaml) are examples of how to interact
with Iceberg tables stored on GCS, with a Hadoop catalog configured.

To create a GCS bucket to serve as the warehouse storage,
see [here](https://cloud.google.com/storage/docs/creating-buckets#command-line).
To run the pipelines locally, one option is to create a service account key in
order to access GCS (see
[here](https://cloud.google.com/iam/docs/keys-create-delete#creating)).
Within the pipelines, specify the GCS bucket name and the path to the saved
service account key .json file.

**_Note_**: With the Hadoop catalog, Iceberg uses the Hadoop connector for GCS.
See [here](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md)
for the full list of configuration options for the Hadoop catalog when used
with GCS.
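
Put together, the write pipeline's IO step is configured roughly as follows (a
sketch under the assumptions above; the table name and bucket are placeholders,
and the `fs.gs.auth.*` keys come from the Hadoop GCS connector configuration
linked in the note):

```yaml
- type: WriteToIceberg
  config:
    table: db.my_table                  # placeholder table name
    catalog_name: hadoop_catalog
    catalog_properties:
      type: hadoop
      warehouse: gs://MY-BUCKET/warehouse
    # Local runs only: point the GCS connector at a service account key.
    # On Dataflow these authentication settings can be omitted.
    config_properties:
      fs.gs.auth.type: SERVICE_ACCOUNT_JSON_KEYFILE
      fs.gs.auth.service.account.json.keyfile: /path/to/key.json
```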

To create and write to Iceberg tables on GCS, run:

```sh
python -m apache_beam.yaml.main \
  --yaml_pipeline_file transforms/io/iceberg_write.yaml
```

The pipeline uses
[Dynamic destinations](https://cloud.google.com/dataflow/docs/guides/managed-io#dynamic-destinations)
to dynamically create and select a table destination based on field values in
the incoming records.
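
Concretely, a dynamic destination is expressed by embedding a field name in the
table spec; a minimal illustrative fragment (the field and table names are made
up):

```yaml
- type: WriteToIceberg
  config:
    # Each record is routed to a table derived from its `airline` field,
    # e.g. db.flights_AA; missing tables are created on demand.
    table: "db.flights_{airline}"
```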

To read from a created Iceberg table on GCS, run:

```sh
python -m apache_beam.yaml.main \
  --yaml_pipeline_file transforms/io/iceberg_read.yaml
```

**_Optional_**: To run the pipelines on Dataflow, a service account key is
[not needed](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md).
Omit the authentication settings in the Hadoop catalog configuration
`config_properties`, and run:

```sh
export REGION="us-central1"
export JOB_NAME="demo-iceberg_write-`date +%Y%m%d-%H%M%S`"

gcloud dataflow yaml run $JOB_NAME \
  --yaml-pipeline-file transforms/io/iceberg_write.yaml \
  --region $REGION
```

```sh
export REGION="us-central1"
export JOB_NAME="demo-iceberg_read-`date +%Y%m%d-%H%M%S`"

gcloud dataflow yaml run $JOB_NAME \
  --yaml-pipeline-file transforms/io/iceberg_read.yaml \
  --region $REGION
```

### ML

These examples leverage the built-in `Enrichment` transform for performing
ML enrichments.
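
For orientation, an `Enrichment` step joins each element against an external
store through a handler; a hedged sketch with illustrative Bigtable settings
(the handler name and config keys follow the Beam enrichment docs, but check
the examples for the exact shape):

```yaml
- type: Enrichment
  config:
    enrichment_handler: 'BigTable'
    handler_config:
      project_id: 'MY-PROJECT'        # placeholder GCP project
      instance_id: 'MY-INSTANCE'      # placeholder Bigtable instance
      table_id: 'MY-TABLE'            # placeholder table
      row_key: 'product_id'           # field used to look up the row
```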

More information can be found about aggregation transforms
[here](https://beam.apache.org/documentation/sdks/yaml-combine/).