[YAML] A Streaming Inference Pipeline - YouTube Comments Sentiment Analysis #35375
chamikaramj merged 13 commits into apache:master
Conversation
In order for the example to be as self-contained as possible, I decided to have the pipeline both push the data from GCS to a Kafka topic and read from that same topic. It won't be a proper streaming pipeline, but the user doesn't have to separately do additional work to push the data to the Kafka topic before being able to finally run the example. Also, this example doesn't work with Beam 2.65.0. During the job submission the

cc @chamikaramj and @damccorm for feedback on whether this example makes sense
Assigning reviewers: R: @jrmccluskey for label python. Note: If you would like to opt out of this review, comment. Available commands:
The PR bot will only process comments in the main thread (not review comments).

Reminder, please take a look at this pr: @jrmccluskey
Force-pushed 3863b50 to 47e4a8d
Codecov Report: All modified and coverable lines are covered by tests ✅

@@            Coverage Diff            @@
##            master   #35375    +/-   ##
============================================
- Coverage    56.53%   56.53%   -0.01%
  Complexity    3319     3319
============================================
  Files         1199     1199
  Lines       183097   183099       +2
  Branches      3426     3426
============================================
- Hits        103519   103515       -4
- Misses       76279    76285       +6
  Partials      3299     3299

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Retest this please
    names=['video_id', 'comment_text', 'likes', 'replies'],
    on_bad_lines='skip',
    converters={'likes': _to_int, 'replies': _to_int})
| beam.Filter(lambda row:
So we are filtering out records that have None for any field? Can we include such records with default values for missing fields, or some representation of None?
There are csv rows that look like this:
video_id,comment_text,likes,replies
XpVt6Z1Gjjo,I am always,happy0,0
I had the helper function _to_int to try to convert the likes and replies columns when reading the csv file; in this case it converts to None. The following beam.Filter filters out this case. Even if we replace None with some default values, I still don't think it makes sense to keep broken rows like this...
# The pipeline first reads the YouTube comments .csv dataset from GCS bucket
Can we make this part pluggable, so that someone who already has a true Kafka topic with the valid values we read here can just use that to execute a true streaming pipeline?
You can introduce a conditional branch with jinja templatization [1] by adding, for example,
{% if true_streaming == "true" %} ... {% endif %}
... but testing is no longer straightforward with how the current test suite is set up (you have to start passing jinja variables in addition to this yaml pipeline)...
[1] https://beamsummit.org/slides/2024/BeamYAML_Advancedtopics.pdf
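For illustration, such a guard could look like this in the pipeline YAML (the `true_streaming` variable and both branch configs are hypothetical, not part of this PR):

```yaml
pipeline:
  transforms:
    # Hypothetical jinja guard; true_streaming must be passed as a
    # jinja variable at submission time.
    {% if true_streaming == "true" %}
    - type: ReadFromKafka
      config:
        topic: "{{ TOPIC }}"
        bootstrap_servers: "{{ BOOTSTRAP_SERVERS }}"
    {% else %}
    - type: ReadFromCsv
      config:
        path: "gs://YOUR_BUCKET/UScomments.csv"
    {% endif %}
```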
bootstrap_servers: "{{ BOOTSTRAP_SERVERS }}"
producer_config_updates:
  sasl.jaas.config: "org.apache.kafka.common.security.plain.PlainLoginModule required \
    username={{ USERNAME }} \
We should update this to use secret managers [1] when it's available. @damccorm is this at a state so that we can try (or is there an ETA) ?
auto_offset_reset_config: earliest
consumer_config:
  sasl.jaas.config: "org.apache.kafka.common.security.plain.PlainLoginModule required \
    username={{ USERNAME }} \
Ditto regarding secret managers
# comment string, e.g. emojis, ascii characters outside
# the common day-to-day English.
- type: MapToFields
  name: RemoveWeirdCharacters
Might users of the pipeline expect to see such characters? Can we preserve them using a different character encoding?
The model distilbert-base-uncased-finetuned-sst-2-english is trained on a text corpus that doesn't include emojis or non-printable ascii characters [1]. Otherwise we would get false positives, e.g. 😩 or §ÁĐ always give a POSITIVE label. I don't think it's meaningful to keep these kinds of characters.
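As a rough sketch of that kind of clean-up (the regex and function below are illustrative, not the exact expression the pipeline's MapToFields transform uses):

```python
import re

def remove_weird_characters(text: str) -> str:
    """Keep only printable ASCII so the input matches what
    distilbert-base-uncased-finetuned-sst-2-english was trained on;
    emojis and characters like '§' or 'Á' are dropped."""
    return re.sub(r'[^\x20-\x7E]', '', text)
```

For example, `remove_weird_characters('I am happy 😩')` strips the emoji and leaves only the ASCII portion of the comment.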
| beam.Map(lambda element: beam.Row(**element)))
@beam.ptransform.ptransform_fn
Is it possible to test with a mock model handler instead of completely mocking the transform here?
cc: @damccorm
return test_spec
@YamlExamplesTestSuite.register_test_preprocessor(
Is it possible to add these as tests embedded in the YAML file under a `tests:` section, instead of being implemented in Python?
@derrickaw might know more.
See previous comment
These examples leverage the built-in `Enrichment` transform for performing
ML enrichments.
These examples include the built-in `Enrichment` transform for performing
Probably better to list the examples here since we'll add more.
gcloud storage cp /path/to/UScomments.csv gs://YOUR_BUCKET/UScomments.csv
```
For setting up Kafka, an option is to use [Click to Deploy](
Might be good to introduce a script that sets up resources for executing the pipeline, if possible. Beam already has tests that start up a Docker-based Kafka in GCP and create BQ datasets.
Setting up Kafka with Click to Deploy only takes a few button clicks and doesn't require any manual setup/installation. I've also added a link to our existing Kafka pipeline example.
As for creating the BQ dataset, I've added a command line.
from apache_beam.ml.inference.base import PredictionResult
from apache_beam.typehints.row_type import RowTypeConstraint
Let's put the imports at the top of the file.
if pipeline := test_spec.get('pipeline', None):
  for transform in pipeline.get('transforms', []):
    if transform.get('type', '') == 'PyTransform' and transform.get(
        'name', '') == 'ReadFromGCS':
      transform['windowing'] = {'type': 'fixed', 'size': '30s'}

file_name = 'youtube-comments.csv'
local_path = env.input_file(file_name, INPUT_FILES[file_name])
transform['config']['kwargs']['file_pattern'] = local_path
if pipeline := test_spec.get('pipeline', None):
  for transform in pipeline.get('transforms', []):
    if transform.get('type', '') == 'ReadFromKafka':
      config = transform['config']
      transform['type'] = 'ReadFromCsv'
      transform['config'] = {
          k: v
          for k, v in config.items() if k.startswith('__')
      }
      transform['config']['path'] = ""
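The key-filtering step in this hunk can be illustrated in isolation; the transform dict and the `__`-prefixed key below are made up for the example (in the real suite such keys carry internal metadata that should survive the swap):

```python
def swap_kafka_for_csv(transform, local_path):
    """Replace a ReadFromKafka transform spec with ReadFromCsv for local
    testing, keeping only internal ('__'-prefixed) config entries and
    pointing the new transform at a local file."""
    config = transform['config']
    transform['type'] = 'ReadFromCsv'
    transform['config'] = {
        k: v for k, v in config.items() if k.startswith('__')
    }
    transform['config']['path'] = local_path
    return transform

# Hypothetical transform spec: the Kafka-specific 'topic' key is
# dropped, the internal '__uuid' key is preserved.
t = {'type': 'ReadFromKafka', 'config': {'topic': 'comments', '__uuid': 'abc'}}
swap_kafka_for_csv(t, '/tmp/youtube-comments.csv')
```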
file_name = 'youtube-comments.csv'
test_spec = replace_recursive(
    test_spec,
    transform['type'],
    'path',
    env.input_file(file_name, INPUT_FILES[file_name]))
if pipeline := test_spec.get('pipeline', None):
  for transform in pipeline.get('transforms', []):
    if transform.get('type', '') == 'RunInference':
      transform['type'] = 'TestRunInference'
Do we need three if statements, or can we combine them into one for loop that checks for each type of transform?
Yeah, because we're using replace_recursive it returns a new test_spec reference, so you would then need to get the new pipeline reference.
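The stale-reference point can be seen with any helper that returns a modified copy instead of mutating in place. The `replace_recursive` below is a simplified stand-in for the suite's helper, not the real implementation:

```python
import copy

def replace_recursive(spec, transform_type, key, value):
    """Simplified stand-in: return a deep-copied spec with `key` set on
    the config of every transform of the given type (the real helper
    walks arbitrarily nested structures)."""
    spec = copy.deepcopy(spec)
    for transform in spec['pipeline']['transforms']:
        if transform['type'] == transform_type:
            transform['config'][key] = value
    return spec

spec = {'pipeline': {'transforms': [{'type': 'ReadFromCsv', 'config': {}}]}}
new_spec = replace_recursive(spec, 'ReadFromCsv', 'path', '/tmp/f.csv')
```

Since `new_spec` is a fresh object, any `pipeline` reference obtained from the old `spec` is stale, which is why each preprocessor re-reads `pipeline` from the spec it receives rather than sharing one loop.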
chamikaramj
left a comment
Thanks. LGTM.
We can merge when Derrick's comments are addressed.
Retest this please
Please fix the lint failure.
Also, the ML test suite seems to be green at HEAD? https://github.com/apache/beam/actions/workflows/beam_PreCommit_Python_ML.yml
Force-pushed 65f26c0 to 1475054
Force-pushed 715c84a to 386641a
Apologies for the super late update on this one... Tests have finally passed now. Thanks for the review!
Part of a larger effort #35069 and #35068 to add more examples involving Kafka and ML use cases.
The pipeline first reads the YouTube comments .csv dataset from a GCS bucket and performs some clean-up before writing it to a Kafka topic. The pipeline then reads from that Kafka topic and applies various transformation logic before the RunInference transform performs remote inference with the Vertex AI model handler and a DistilBERT model deployed to a Vertex AI endpoint. The inference result is then parsed and written to a BigQuery table.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

- Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
- Update `CHANGES.md` with noteworthy changes.

See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.