---
title: "Google Summer of Code 2025 - Beam YAML, Kafka and Iceberg User Accessibility"
date: 2025-09-23 00:00:00 -0400
categories:
  - blog
  - gsoc
aliases:
  - /blog/2025/09/23/gsoc-25-yaml-user-accessibility.html
authors:
  - charlespnh
---

<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

The relatively new Beam YAML SDK was introduced in the spirit of making data processing easy,
but it has seen little adoption for complex ML tasks and hasn't been widely used with
[Managed I/O](https://beam.apache.org/documentation/io/managed-io/) such as Kafka and Iceberg.
As part of Google Summer of Code 2025, new illustrative, production-ready pipeline examples
of ML use cases with Kafka and Iceberg data sources were developed with the YAML SDK
to address this adoption gap.

## Context

The YAML SDK was introduced in Spring 2024 as Beam's first no-code SDK. It follows a declarative approach:
a data processing pipeline is defined in a YAML DSL rather than in one of the programming-language-specific SDKs.
At the time, it shipped with few meaningful examples and little accompanying documentation. The key missing examples
were ML workflows and integration with the Kafka and Iceberg Managed I/O. Foundational work had already been done
to support ML capabilities as well as the Kafka and Iceberg I/O connectors in the YAML SDK, but there were no
end-to-end examples demonstrating their usage.

Beam, Kafka and Iceberg are mainstream big data technologies, but each comes with a learning curve.
The overall theme of the project is to help democratize data processing for scientists and analysts who traditionally
don't have a strong background in software engineering. They can now use these meaningful examples as a starting point,
helping them onboard faster and be more productive when authoring ML/data pipelines for their use cases with Beam and its YAML DSL.
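
To give a flavor of the DSL, here is a minimal, hypothetical batch pipeline. The transform names mirror the YAML SDK's built-in transforms, but the file paths and the filter condition are placeholders, not part of any delivered example:

```yaml
# A minimal Beam YAML pipeline sketch: read a CSV, keep large amounts, write JSON.
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: gs://my-bucket/input/*.csv        # placeholder path
    - type: Filter
      config:
        language: python
        keep: "amount > 100"                    # hypothetical column and condition
    - type: WriteToJson
      config:
        path: gs://my-bucket/output/results     # placeholder path
```

The declarative style is the point: the same pipeline in the Python or Java SDK would require writing and packaging code, while here the entire definition is configuration.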

## Contributions

The data pipelines/workflows developed are production-ready: the Kafka and Iceberg data sources are set up on GCP,
and the data comes from raw public datasets. The pipelines are tested end-to-end on Google Cloud Dataflow and
are also unit tested to ensure correct transformation logic.

The delivered pipelines/workflows, each documented in a README.md, address the four main ML use cases below:

1. **Streaming Classification Inference**: A streaming ML pipeline that demonstrates Beam YAML's ability to perform
classification inference on a stream of incoming data from Kafka. The overall workflow also includes deploying and
serving a DistilBERT model on Google Cloud Vertex AI, which the pipeline calls for remote inference.
The pipeline is applied to a sentiment analysis task on a stream of YouTube comments, preprocessing the data and classifying
whether each comment is positive or negative. See [pipeline](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/transforms/ml/sentiment_analysis/streaming_sentiment_analysis.yaml) and [documentation](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml/examples/transforms/ml/sentiment_analysis).
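
   A condensed sketch of the read-and-infer portion of such a pipeline might look like the following. This is an illustration only: the topic, broker, schema, endpoint and table values are placeholders, and the model handler configuration keys are abbreviated from what the YAML SDK's `RunInference` transform actually expects:

   ```yaml
   # Sketch: consume from Kafka, classify remotely on Vertex AI, persist results.
   pipeline:
     type: chain
     transforms:
       - type: ReadFromKafka
         config:
           topic: youtube-comments             # placeholder topic
           bootstrap_servers: broker:9092      # placeholder broker address
           format: JSON
       - type: RunInference
         config:
           model_handler:
             type: VertexAIModelHandlerJSON    # remote inference against a Vertex AI endpoint
             config:
               endpoint_id: "1234567890"       # placeholder endpoint
               project: my-gcp-project         # placeholder project
               location: us-central1
       - type: WriteToBigQuery
         config:
           table: my-project.dataset.sentiments  # placeholder table
   ```

   See the linked pipeline above for the full, working configuration.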

2. **Streaming Regression Inference**: A streaming ML pipeline that demonstrates Beam YAML's ability to perform
regression inference on a stream of incoming data from Kafka. The overall workflow also includes training, deploying and
serving a custom model on Google Cloud Vertex AI, which the pipeline calls for remote inference.
The pipeline is applied to a regression task on a stream of taxi rides, preprocessing the data and predicting the fare amount
for every ride. See [pipeline](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/transforms/ml/taxi_fare/streaming_taxifare_prediction.yaml) and [documentation](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml/examples/transforms/ml/taxi_fare).

3. **Batch Anomaly Detection**: An ML workflow that demonstrates ML-specific transformations
and reading from/writing to Iceberg I/O. The workflow combines unsupervised model training with several pipelines that leverage
Iceberg for storing results, BigQuery for storing vector embeddings and MLTransform for computing embeddings, demonstrating
an end-to-end anomaly detection workflow on a dataset of system logs. See [workflow](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/transforms/ml/log_analysis/batch_log_analysis.sh) and [documentation](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml/examples/transforms/ml/log_analysis).
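
   The Iceberg side of such a workflow can be sketched as below. The table names, catalog configuration and score column are placeholders; the transform names follow the YAML SDK's Iceberg transforms, which are backed by Managed I/O:

   ```yaml
   # Sketch: read system logs from Iceberg, keep likely anomalies, write them back.
   pipeline:
     type: chain
     transforms:
       - type: ReadFromIceberg
         config:
           table: logs_db.system_logs            # placeholder table
           catalog_name: my_catalog              # placeholder catalog
           catalog_properties:
             type: hadoop
             warehouse: gs://my-bucket/warehouse # placeholder warehouse
       - type: Filter
         config:
           language: python
           keep: "anomaly_score > 0.9"           # hypothetical score column
       - type: WriteToIceberg
         config:
           table: logs_db.anomalies              # placeholder table
           catalog_name: my_catalog
           catalog_properties:
             type: hadoop
             warehouse: gs://my-bucket/warehouse
   ```

   Because Iceberg tables carry their own schema, no separate schema declaration is needed on the read side.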

4. **Feature Engineering & Model Evaluation**: An ML workflow that demonstrates Beam YAML's feature engineering capability,
with the resulting features subsequently used for model evaluation, and its integration with Iceberg I/O. The workflow contains model training
and several pipelines, showcasing an end-to-end fraud detection MLOps solution that generates features and evaluates models
to detect fraudulent credit card transactions. See [workflow](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/transforms/ml/fraud_detection/fraud_detection_mlops_beam_yaml_sdk.ipynb) and [documentation](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml/examples/transforms/ml/fraud_detection).

## Challenges

The main challenge of the project was the lack of prior YAML pipeline examples and good documentation to rely on.
Unlike the Python or Java SDKs, which already have many notebooks and end-to-end examples demonstrating various use cases,
the YAML SDK examples only covered simple transformations such as filtering and grouping. More complex transforms like
`MLTransform` and `ReadFromIceberg` had no examples and required configurations with no clear API reference at the time.
As a result, a lot of deep dives into the actual implementation of the PTransforms across the YAML, Python and Java SDKs were needed to
understand the error messages and how to use the transforms correctly.

Another challenge was writing unit tests to ensure that each pipeline's logic is correct.
Understanding how the existing test suite is set up, and how it can be used to write unit tests for
the data pipelines, was a learning curve. A lot of time was spent on properly writing mocks for the pipelines' sources and sinks, as well as for
transforms that require external services such as Vertex AI.

## Conclusion & Personal Thoughts

These production-ready pipelines demonstrate the potential of the Beam YAML SDK for authoring complex ML workflows
that interact with Iceberg and Kafka. The examples are a nice addition to Beam, especially with the upcoming Beam 3.0.0 milestones,
which focus on low-code/no-code authoring, ML capabilities and Managed I/O.

I had an amazing time working with the big data technologies Beam, Iceberg and Kafka, as well as many Google Cloud services
(Dataflow, Vertex AI and Google Kubernetes Engine, to name a few). I've always wanted to work more in the ML space, and this
experience has been a great growth opportunity for me. Google Summer of Code was selective this year, and the project's success
would not have been possible without the support of my mentor, Chamikara Jayalath. It's been a pleasure working closely
with him and the broader Beam community to contribute to an open-source project that has a meaningful impact on the
data engineering community.

My advice for future Google Summer of Code participants is, first and foremost, to research and choose a project that aligns closely
with your interests. Most importantly, spend a lot of time making yourself visible and writing a good proposal once the program
opens for applications. Being visible (e.g. by sharing your proposal, or any ideas and questions, on the project's
communication channel early on) makes it more likely that you will be selected; and a good proposal not only makes your acceptance
even more likely, but also gives you a lot of confidence when contributing to and completing the project.

## References

- [Google Summer of Code Project Listing](https://summerofcode.withgoogle.com/programs/2025/projects/f4kiDdus)
- [Google Summer of Code Final Report](https://docs.google.com/document/d/1MSAVF6X9ggtVZbqz8YJGmMgkolR_dve0Lr930cByyac/edit?usp=sharing)
