
complete implementation of open ai text embedding with test #new #34700

Merged
jrmccluskey merged 28 commits into apache:master from aditya0yadav:master
May 15, 2025

Conversation

@aditya0yadav (Contributor) commented Apr 21, 2025

complete implementation of open ai text embedding with test


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@github-actions (Contributor)

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@aditya0yadav (Contributor, Author)

assign set of reviewers

@github-actions (Contributor)

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @jrmccluskey for label python.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@jrmccluskey (Contributor)

Hey there! I'll try to get a thorough review pass on your PR this afternoon; however, at a quick glance, this seems like a good candidate for inheriting from the new RemoteModelHandler base class I introduced a few weeks ago. Would you be interested in tweaking your implementation to use this class? It'll streamline your code since it handles the client-side throttling work in the parent class.
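For context, the client-side throttling work that a remote-handler base class centralizes boils down to retrying transient failures with backoff. The sketch below is a simplified, generic illustration of that idea in plain Python; it is not Beam's actual RemoteModelHandler API, and the function and parameter names are hypothetical.

```python
import random
import time

# Minimal sketch of client-side throttling: retry retriable errors with
# jittered exponential backoff. Illustrative only; Beam's RemoteModelHandler
# exposes a different (class-based) interface.
def call_with_backoff(request_fn, should_retry, max_attempts=5, base_delay=0.01):
    """Call request_fn(), retrying retriable errors with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception as e:
            if attempt == max_attempts - 1 or not should_retry(e):
                raise
            # Exponential backoff with jitter before the next attempt.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

Inheriting from a base class that already implements this loop means each remote handler only has to supply the request function and the retry predicate.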

@aditya0yadav (Contributor, Author)

Let me check first

@jrmccluskey (Contributor) left a review comment

Very good starting point for this, just needs some polish and a few additions.

def load_model(self):
  # Create the client just before it's needed during pipeline execution
  if self.api_key:
    client = open_ai.OpenAI(
@jrmccluskey (Contributor):

This import is missing. Also, the package is named openai, per the official library docs: https://platform.openai.com/docs/libraries?language=python

@aditya0yadav (Contributor, Author):

Same as below, it is fixed. It will be shown in an upcoming commit.
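The corrected pattern being discussed might look like the sketch below. The ImportError fallback exists only so the snippet runs without the openai package installed, and the class name is a simplified stand-in, not the PR's actual handler.

```python
# The official package is `openai` (not `open_ai`); the client is created
# lazily in load_model so it exists only during pipeline execution.
try:
    from openai import OpenAI
except ImportError:  # stub so this sketch runs without the openai package
    class OpenAI:
        def __init__(self, api_key=None, organization=None):
            self.api_key = api_key
            self.organization = organization

class OpenAITextEmbeddingsStub:  # hypothetical, simplified stand-in
    def __init__(self, api_key=None, organization=None):
        self.api_key = api_key
        self.organization = organization

    def load_model(self):
        # Create the client just before it's needed during pipeline execution.
        return OpenAI(api_key=self.api_key, organization=self.organization)
```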

boolean indication whether or not the exception is a Server Error (5xx) or
a RateLimitError (429) error.
"""
return isinstance(exception, (RateLimitError, APIError))
@jrmccluskey (Contributor):

Need an import for these exceptions

@aditya0yadav (Contributor, Author):

For some unknown reason, some of the imports went missing; I have just added them back. They will appear in an upcoming commit.
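The import fix under discussion amounts to something like the following. The stub exception classes stand in for openai.RateLimitError and openai.APIError so the sketch is self-contained; in the real handler the line would be `from openai import RateLimitError, APIError`.

```python
# Stubs standing in for the real openai exception types, so this sketch runs
# without the openai package installed.
class RateLimitError(Exception):
    """Stand-in for openai.RateLimitError (HTTP 429)."""

class APIError(Exception):
    """Stand-in for openai.APIError (server-side 5xx errors)."""

def retry_on_transient_error(exception):
    """Return True if the exception is a server error (5xx) or a
    rate-limit (429) error, i.e. the request is worth retrying."""
    return isinstance(exception, (RateLimitError, APIError))
```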

MLTransformOutputT = TypeVar('MLTransformOutputT')

# Default batch size for OpenAI calls
_BATCH_SIZE = 20 # OpenAI can handle larger batches than Vertex
@jrmccluskey (Contributor):

This could likely be handled in the model handler instead of being hard-coded here

@aditya0yadav (Contributor, Author)

@jrmccluskey I don't know why some of the import statements are missing. The place where I store the code for backup has the import statements, but in the pull request they are missing.

Is it because of the formatting?

@aditya0yadav (Contributor, Author) commented Apr 23, 2025

@jrmccluskey
Nearly all the changes are completed according to your review, but I am facing some problems with the lint and formatting. I need your help on this one, please.

@jrmccluskey (Contributor)

For the linting you should be able to pip install yapf==0.29.0 in the virtual environment you're working in and just run it from the command line:

yapf --in-place --recursive . in the inference directory would do it.

The formatting check can catch other things but the failure in this case is just yapf again.

@aditya0yadav (Contributor, Author)

@jrmccluskey I'm trying to fix the linting error, but I'm not having any luck.
I checked run_lint.sh and tried a few other things, but the error still persists.
What should I do next?

@jrmccluskey (Contributor)

Here's the list of specific failures:
[screenshot: list of specific lint failures]
For the unused imports/vars you should just remove them; running yapf as I described above should fix the line-too-long errors. You can also add # pylint: disable=line-too-long to the end of the too-long lines if yapf cannot format them down for whatever reason.

@aditya0yadav (Contributor, Author) commented Apr 25, 2025

@jrmccluskey The lint error is fixed, but now there's an error in the Prism runner that's unrelated to OpenAI. What should I do next?

That error was not occurring before this change.

@aditya0yadav (Contributor, Author)

Hi @jrmccluskey,
I was going through the remote handler and was wondering — would it also support the embedding handler?
From what I saw, your implementation seems focused on the model handler. Just wanted to clarify.

@aditya0yadav (Contributor, Author)

Hi @jrmccluskey,
I have one concern about the embedding handler: as you know, OpenAI does not provide any model for generating image embeddings. What should we do about this?

@jrmccluskey (Contributor)

You should be able to ignore prism and yaml failures, those are generally flaky and not impacted by anything here.

The EmbeddingsManager class is effectively a composite PTransform that produces a RunInference transform with a more traditional model handler, I cannot see a reason why that wouldn't work with the remote handler implementation.

Not having a model for image embeddings is fine since you've clearly labeled the class as a text embedding model, we can always add images / multimodal implementations later as APIs become available.

@aditya0yadav (Contributor, Author)

@jrmccluskey All important tests are completed and the model handler has been changed into a remote model handler. Please check this out.

thanks

@aditya0yadav (Contributor, Author) commented Apr 30, 2025

@jrmccluskey

> You should be able to ignore prism and yaml failures, those are generally flaky and not impacted by anything here.

Thanks for letting me know.

> The EmbeddingsManager class is effectively a composite PTransform that produces a RunInference transform with a more traditional model handler, I cannot see a reason why that wouldn't work with the remote handler implementation.

Sorry, now it is working.

> Not having a model for image embeddings is fine since you've clearly labeled the class as a text embedding model, we can always add images / multimodal implementations later as APIs become available.

Okay.

@aditya0yadav (Contributor, Author)

@jrmccluskey
Hope you can clarify this: what should we do with Feast? There is only an enrichment handler for Feast. Should we create an IO handler for Feast, or something else?

thanks

organization: Optional[str] = None,
dimensions: Optional[int] = None,
user: Optional[str] = None,
batch_size: Optional[int] = None,
@jrmccluskey (Contributor):

This is misleading since you're setting a single batch size value and then taking it as the max. I'd recommend exposing the min and max batch sizes separately (or just taking them as kwargs).
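One way to act on this suggestion is sketched below; the class and parameter names are illustrative, not the PR's final API, and the kwarg names follow the min_batch_size/max_batch_size convention commonly used for batching configuration.

```python
# Expose min/max batch sizes separately instead of a single batch_size
# parameter that silently doubles as the maximum.
class BatchConfig:
    def __init__(self, min_batch_size=1, max_batch_size=20):
        if min_batch_size > max_batch_size:
            raise ValueError("min_batch_size must not exceed max_batch_size")
        self.min_batch_size = min_batch_size
        self.max_batch_size = max_batch_size

    def batch_kwargs(self):
        # Kwargs in the shape batching utilities typically accept.
        return {
            "min_batch_size": self.min_batch_size,
            "max_batch_size": self.max_batch_size,
        }
```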


@jrmccluskey (Contributor)

I'm not sure what you're asking about for enrichment. Can you clarify?

@aditya0yadav (Contributor, Author)

Actually, I want to know whether there is any need for a Feast IO connector, or an improvement in the enrichment handler, like inserting features.
@jrmccluskey

@aditya0yadav (Contributor, Author)

@jrmccluskey
If you are free, please check this out.

@jrmccluskey (Contributor) left a review comment

Sorry about the delay getting back to this, I think you've gotten it to a good place to merge! Thank you

@jrmccluskey jrmccluskey merged commit d5e3766 into apache:master May 15, 2025
90 checks passed
@aditya0yadav (Contributor, Author)

My pleasure.

@jrmccluskey

Can I ask what else is needed in the case of OpenAI? Or can I change the model handler of Vertex AI to the remote model handler?
