Add PyTorch DistilBERT Sentiment Analysis Batch/Streaming pipelines for ML Benchmarks #34577

Merged
damccorm merged 6 commits into apache:master from akvelon:pytorch-sentiment-streaming
May 9, 2025

@Amar3tto (Collaborator) commented Apr 8, 2025

Changes:

  1. Added a new PyTorch DistilBERT Sentiment Analysis streaming pipeline to the beam_Inference_Python_Benchmarks_Dataflow workflow.
  2. Added a new PyTorch DistilBERT Sentiment Analysis batch pipeline to the beam_Inference_Python_Benchmarks_Dataflow workflow.
  3. Improved the ML pipeline metrics pages on the website.

Successful run: https://github.com/Amar3tto/beam/actions/runs/14516082090/job/40725350012


@Amar3tto marked this pull request as ready for review on April 8, 2025, 13:11
@Amar3tto requested a review from damccorm on April 8, 2025, 13:11
github-actions bot (Contributor) commented Apr 8, 2025

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @liferoad for label python.
R: @Abacn for label build.
R: @liferoad for label website.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@Amar3tto changed the title from "Add PyTorch DistilBERT Sentiment Analysis streaming pipeline for Benchmarks" to "Add PyTorch DistilBERT Sentiment Analysis streaming pipeline for ML Benchmarks" on Apr 10, 2025

pipeline = test_pipeline or beam.Pipeline(options=pipeline_options)

# 1. Load data pipeline: read lines from GCS file and send to Pub/Sub

Contributor

I think we need to add a few things to this:

  1. These should run as separate pipelines and we should only monitor the results of the streaming portion
  2. We need a way of applying a consistent rate to the input elements in Pub/Sub. A better approach might be to either (a) do this independently via a script, or (b) within Dataflow, use periodic impulse or state/timers to control the rate at which we emit. This will likely work better as a helper that we can use across pipelines
  3. For streaming pipelines, we should enable autoscaling
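
Option (a) in point 2 could look roughly like the sketch below. It is an illustration only, not code from this PR: `schedule`, `publish_at_rate`, and the `publish` callable are hypothetical names, and a real script would presumably pass a call into a Pub/Sub publisher client as `publish`. Each element is given an absolute due time, so small sleep inaccuracies do not accumulate into rate drift.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def schedule(lines: Iterable[str], rate_per_sec: float,
             start: float = 0.0) -> Iterator[Tuple[float, str]]:
    """Yield (due_time, line) pairs spaced 1 / rate_per_sec seconds apart."""
    if rate_per_sec <= 0:
        raise ValueError("rate_per_sec must be positive")
    interval = 1.0 / rate_per_sec
    for i, line in enumerate(lines):
        yield start + i * interval, line


def publish_at_rate(lines: Iterable[str],
                    rate_per_sec: float,
                    publish: Callable[[bytes], object],
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Send each line through `publish` at a steady rate; return the count.

    `publish` is a hypothetical hook standing in for the real sink.
    `clock` and `sleep` are injectable so the pacing can be tested
    without waiting in real time.
    """
    start = clock()
    sent = 0
    for due, line in schedule(lines, rate_per_sec, start):
        delay = due - clock()
        if delay > 0:
            sleep(delay)  # wait until this element's due time
        publish(line.encode("utf-8"))
        sent += 1
    return sent
```

Because the pacing lives in a standalone helper, the same function could feed any streaming benchmark, which matches the "helper we can use across pipelines" idea above.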

@Amar3tto (Collaborator, Author), Apr 17, 2025

Added autoscaling.
Can we merge this PR with the current approach? It is the easiest way to run two pipelines simultaneously.
Then we can implement a reusable, independent script that controls the input rate.
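
For context, enabling autoscaling on a Dataflow streaming job is usually a matter of pipeline flags. A minimal sketch, assuming the standard Beam/Dataflow options `--streaming`, `--autoscaling_algorithm`, and `--max_num_workers`; the helper name and the worker cap are hypothetical, and the flags this PR actually sets may differ.

```python
from typing import List


def streaming_autoscaling_flags(max_num_workers: int) -> List[str]:
    """Return flags that let a streaming job scale with throughput.

    The helper itself is illustrative; only the flag names are taken
    from the standard Beam/Dataflow pipeline options.
    """
    if max_num_workers < 1:
        raise ValueError("max_num_workers must be at least 1")
    return [
        "--streaming",
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        f"--max_num_workers={max_num_workers}",
    ]
```

The returned list would be appended to the arguments used to build the pipeline options for the streaming job.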

Contributor

I think we should at least handle (1) and run these as two pipelines. Otherwise, I do not think we are measuring a meaningful dataset

Collaborator Author

Done

@github-actions (Contributor)

Reminder, please take a look at this pr: @liferoad @Abacn @liferoad

@liferoad (Contributor)

@jrmccluskey please review this PR. Thanks.

@jrmccluskey (Contributor)

As best I can tell @damccorm's comments haven't been addressed yet so I'll defer here

github-actions bot (Contributor) commented May 3, 2025

Reminder, please take a look at this pr: @liferoad @Abacn @liferoad

damccorm (Contributor) commented May 6, 2025

waiting on author

@Amar3tto changed the title from "Add PyTorch DistilBERT Sentiment Analysis streaming pipeline for ML Benchmarks" to "Add PyTorch DistilBERT Sentiment Analysis Batch/Streaming pipelines for ML Benchmarks" on May 8, 2025

@damccorm (Contributor) left a comment

Thanks!

@damccorm merged commit dc874bd into apache:master on May 9, 2025
92 of 95 checks passed