---
title: "Google Summer of Code 2025 - Beam YAML, Kafka and Iceberg User Accessibility"
date: 2025-09-23 00:00:00 -0400
categories:
  - blog
  - gsoc
aliases:
  - /blog/2025/09/23/gsoc-25-yaml-user-accessibility.html
authors:
  - charlespnh

---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

The relatively new Beam YAML SDK was introduced in the spirit of making data processing easy,
but it has seen little adoption for complex ML tasks and hasn’t been widely used with
[Managed I/O](https://beam.apache.org/documentation/io/managed-io/) such as Kafka and Iceberg.
As part of Google Summer of Code 2025, new illustrative, production-ready example pipelines
covering ML use cases with Kafka and Iceberg data sources were developed with the YAML SDK
to address this adoption gap.

## Context
The YAML SDK was introduced in Spring 2024 as Beam’s first no-code SDK. Unlike the programming-language-specific SDKs,
it follows a declarative approach, defining a data processing pipeline in a YAML DSL.
At the time, it shipped with few meaningful examples and little accompanying documentation. The key missing examples
were ML workflows and integration with the Kafka and Iceberg Managed I/O. Foundational work had already been done
to support ML capabilities as well as the Kafka and Iceberg IO connectors in the YAML SDK, but there were no
end-to-end examples demonstrating their usage.
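To give a flavor of the DSL, here is a minimal, hypothetical batch pipeline. The transform names (`ReadFromCsv`, `Filter`, `WriteToJson`) are real YAML SDK transforms, but the paths and the `score` column are placeholders:

```yaml
# A minimal batch pipeline: read CSV, keep rows matching a predicate, write JSON.
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: gs://my-bucket/input*.csv    # placeholder path
    - type: Filter
      config:
        language: python
        keep: score > 0.5                  # assumes a 'score' column in the data
    - type: WriteToJson
      config:
        path: gs://my-bucket/output.json   # placeholder path
```

A file like this can be run locally with `python -m apache_beam.yaml.main --yaml_pipeline_file=pipeline.yaml`, or submitted to a runner such as Dataflow.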

Beam, Kafka, and Iceberg are mainstream big data technologies, but each comes with a learning curve.
The overall theme of the project is to help democratize data processing for scientists and analysts who traditionally
don’t have a strong background in software engineering. They can now use these examples as a starting point,
helping them onboard faster and be more productive when authoring ML/data pipelines for their use cases with Beam and its YAML DSL.

## Contributions
The data pipelines/workflows developed are production-ready: the Kafka and Iceberg data sources are set up on GCP,
and the data used are raw public datasets. The pipelines are tested end-to-end on Google Cloud Dataflow and
are also unit tested to ensure correct transformation logic.

The delivered pipelines/workflows, each documented in a README.md, address the four main ML use cases below:

1. **Streaming Classification Inference**: A streaming ML pipeline that demonstrates Beam YAML’s capability to perform
classification inference on a stream of incoming data from Kafka. The overall workflow also includes
deploying and serving a DistilBERT model on Google Cloud Vertex AI, which the pipeline accesses for remote inference.
The pipeline is applied to a sentiment analysis task on a stream of YouTube comments, preprocessing the data and classifying
whether each comment is positive or negative. See the [pipeline](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/transforms/ml/sentiment_analysis/streaming_sentiment_analysis.yaml) and [documentation](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml/examples/transforms/ml/sentiment_analysis).


2. **Streaming Regression Inference**: A streaming ML pipeline that demonstrates Beam YAML’s capability to perform
regression inference on a stream of incoming data from Kafka. The overall workflow also includes
training, deploying and serving a custom model on Google Cloud Vertex AI, which the pipeline accesses for remote inference.
The pipeline is applied to a regression task on a stream of taxi rides, preprocessing the data and predicting the fare amount
for every ride. See the [pipeline](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/transforms/ml/taxi_fare/streaming_taxifare_prediction.yaml) and [documentation](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml/examples/transforms/ml/taxi_fare).


3. **Batch Anomaly Detection**: An ML workflow that demonstrates ML-specific transformations
and reading from/writing to Iceberg IO. The workflow contains unsupervised model training and several pipelines that leverage
Iceberg for storing results, BigQuery for storing vector embeddings, and MLTransform for computing embeddings, demonstrating
an end-to-end anomaly detection workflow on a dataset of system logs. See the [workflow](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/transforms/ml/log_analysis/batch_log_analysis.sh) and [documentation](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml/examples/transforms/ml/log_analysis).


4. **Feature Engineering & Model Evaluation**: An ML workflow that demonstrates Beam YAML’s feature engineering capability,
whose output is subsequently used for model evaluation, and its integration with Iceberg IO. The workflow contains model training
and several pipelines, showcasing an end-to-end fraud detection MLOps solution that generates features and evaluates models
to detect fraudulent credit card transactions. See the [workflow](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/transforms/ml/fraud_detection/fraud_detection_mlops_beam_yaml_sdk.ipynb) and [documentation](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml/examples/transforms/ml/fraud_detection).
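The streaming use cases above share a common shape: read from Kafka, preprocess with `MapToFields`, call a remotely hosted model, and write the results out. Below is a condensed, hypothetical sketch of that shape. The broker, topic, endpoint and table values are placeholders, and the `RunInference` model-handler configuration is an assumption modeled on the Python SDK’s Vertex AI handler; refer to the linked pipelines for the exact, tested configurations.

```yaml
# Sketch of the streaming-inference shape used by use cases 1 and 2.
pipeline:
  type: chain
  transforms:
    - type: ReadFromKafka
      config:
        topic: my-topic                    # placeholder
        format: RAW
        bootstrap_servers: broker:9092     # placeholder
    - type: MapToFields                    # preprocessing step
      config:
        language: python
        fields:
          text:
            callable: |
              def preprocess(row):
                return row.payload.decode("utf-8").strip().lower()
    - type: RunInference                   # remote inference against Vertex AI
      config:
        model_handler:
          type: VertexAIModelHandlerJSON   # assumption; verify against the examples
          config:
            endpoint_id: "1234567890"      # placeholder
            project: my-gcp-project        # placeholder
            location: us-central1
    - type: WriteToBigQuery
      config:
        table: my-gcp-project.results.predictions   # placeholder
```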

## Challenges
The main challenge of the project was the lack of prior YAML pipeline examples and good documentation to rely on.
Unlike the Python or Java SDKs, where many notebooks and end-to-end examples demonstrate various use cases,
the existing YAML SDK examples only involved simple transformations such as filters and group-bys. More complex transforms like
`MLTransform` and `ReadFromIceberg` had no examples and required configurations that lacked a clear API reference at the time.
As a result, many deep dives into the actual implementation of the PTransforms across the YAML, Python and Java SDKs were
needed to understand the error messages and how to use the transforms correctly.
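The Iceberg configuration is a good example of what had to be worked out from the implementation. A hypothetical sketch, assuming a Hadoop catalog backed by GCS — the table names, bucket, catalog name and `anomaly_score` column are placeholders, and the exact property keys should be verified against the Managed I/O documentation:

```yaml
# Sketch: read a logs table from Iceberg, keep flagged rows, write them back out.
pipeline:
  type: chain
  transforms:
    - type: ReadFromIceberg
      config:
        table: db.system_logs                   # placeholder
        catalog_name: my_catalog                # placeholder
        catalog_properties:
          type: hadoop                          # assumption: a Hadoop catalog
          warehouse: gs://my-bucket/warehouse   # placeholder
    - type: Filter
      config:
        language: python
        keep: anomaly_score > 0.9               # assumes a precomputed score column
    - type: WriteToIceberg
      config:
        table: db.anomalies                     # placeholder
        catalog_name: my_catalog
        catalog_properties:
          type: hadoop
          warehouse: gs://my-bucket/warehouse
```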

Another challenge was writing unit tests to ensure that each pipeline’s logic is correct.
There was a learning curve in understanding how the existing test suite is set up and how it can be used to write unit tests for
the data pipelines. A lot of time was spent properly writing mocks for the pipelines’ sources and sinks, as well as for the
transforms that require external services such as Vertex AI.

## Conclusion & Personal Thoughts
These production-ready pipelines demonstrate the potential of the Beam YAML SDK for authoring complex ML workflows
that interact with Kafka and Iceberg. The examples are a valuable addition to Beam, especially with the upcoming
Beam 3.0.0 milestones, which focus on low-code/no-code authoring, ML capabilities and Managed I/O.

I had an amazing time working with the big data technologies Beam, Iceberg, and Kafka, as well as many Google Cloud services
(Dataflow, Vertex AI and Google Kubernetes Engine, to name a few). I’ve always wanted to work more in the ML space, and this
experience has been a great growth opportunity for me. Google Summer of Code was selective this year, and the project’s success
would not have been possible without the support of my mentor, Chamikara Jayalath. It’s been a pleasure working closely
with him and the broader Beam community to contribute to this open-source project, which has a meaningful impact on the
data engineering community.

My advice for future Google Summer of Code participants is, first and foremost, to research and choose a project that aligns closely
with your interests. Most importantly, spend a lot of time making yourself visible and writing a good proposal when the program
opens for applications. Being visible (e.g. by sharing your proposal, or any ideas and questions about the project, on its
communication channel early on) makes it more likely that you will be selected; and a good proposal will not only make you even
more likely to get into the program, but also give you a lot of confidence when contributing to and completing the project.

## References
- [Google Summer of Code Project Listing](https://summerofcode.withgoogle.com/programs/2025/projects/f4kiDdus)
- [Google Summer of Code Final Report](https://docs.google.com/document/d/1MSAVF6X9ggtVZbqz8YJGmMgkolR_dve0Lr930cByyac/edit?usp=sharing)