[YAML] A Streaming Inference Pipeline - YouTube Comments Sentiment Analysis by charlespnh · Pull Request #35375 · apache/beam

charlespnh · 2025-06-19T14:51:39Z

Please add a meaningful description for your change here

Part of a larger effort #35069 and #35068 to add more examples involving Kafka and ML use cases.

The pipeline first reads the YouTube comments .csv dataset from
GCS bucket and performs some clean-up before writing it to a Kafka
topic. The pipeline then reads from that Kafka topic and applies
various transformation logic before RunInference transform performs
remote inference with the Vertex AI model handler and DistilBERT
deployed to a Vertex AI endpoint. The inference result is then
parsed and written to a BigQuery table.

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
Update CHANGES.md with noteworthy changes.
If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

charlespnh · 2025-06-19T15:06:59Z

In order for the example to be self-contained as much as possible, I decided to have the pipeline to both pushes the data from GCS to a Kafka topic and also reads from that same topic. It won't be a proper streaming pipeline, but the user doesn't have to separately do additional work to push the data to the Kafka topic before being able to finally run the example.

Also, this example doesn't work with Beam 2.65.0. During the job submission the RunInference transform fails with unexpected keyword argument env_vars, but this has recently been fixed https://github.com/apache/beam/pull/35022/files#diff-00ce93be0981df61026196248c944f073d8ba1bdae9d8cb8bb2710e4fd3494b0L145. I've been testing the pipeline on master branch and overriding with custom sdk harness containers.

charlespnh · 2025-06-19T15:11:03Z

cc @chamikaramj and @damccorm for feedbacks on whether this example makes sense

github-actions · 2025-06-19T16:41:08Z

Assigning reviewers:

R: @jrmccluskey for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

github-actions · 2025-07-01T12:16:00Z

Reminder, please take a look at this pr: @jrmccluskey

codecov · 2025-07-05T18:44:33Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 56.53%. Comparing base (d5eea11) to head (2666307).
Report is 3 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #35375      +/-   ##
============================================
- Coverage     56.53%   56.53%   -0.01%     
  Complexity     3319     3319              
============================================
  Files          1199     1199              
  Lines        183097   183099       +2     
  Branches       3426     3426              
============================================
- Hits         103519   103515       -4     
- Misses        76279    76285       +6     
  Partials       3299     3299

Flag	Coverage Δ
python	`80.77% <100.00%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

charlespnh · 2025-07-05T20:54:39Z

Retest this please

chamikaramj

Thanks!

chamikaramj · 2025-07-11T02:50:21Z