RAG: add OpenAI text embedding model #36081

mohamedawnallah · 2025-09-07T16:24:05Z

Description

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
Update CHANGES.md with noteworthy changes.
If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

gemini-code-assist

Summary of Changes

Hello @mohamedawnallah, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands Apache Beam's machine learning capabilities by integrating OpenAI's text embedding models into its RAG (Retrieval Augmented Generation) module. This allows users to seamlessly incorporate powerful, state-of-the-art text embeddings from OpenAI into their Beam pipelines for tasks like semantic search and information retrieval, enhancing the overall intelligence and performance of RAG applications built with Beam.

Highlights

OpenAI Text Embedding Integration: Introduces OpenAITextEmbeddings to the Beam RAG module, enabling the use of OpenAI models for generating text embeddings within Apache Beam pipelines.
New Module and Tests: Adds a new Python module (open_ai.py) for the OpenAI embedding implementation and a corresponding test file (open_ai_test.py) to ensure functionality.
Refactored Test Utility: Moves a common test utility function (chunk_approximately_equals) to a shared test_utils module, cleaning up existing test files like huggingface_test.py and vertex_ai_test.py.
Dependency Management: Updates setup.py to include openai as a dependency, making it available for the ml and a new openai extra, ensuring proper installation.
OpenAI API Parameter Fix: Corrects the way dimensions are passed to the OpenAI API in the existing _OpenAITextEmbeddingHandler, ensuring the parameter is sent as an integer.
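
The dimensions fix in the last highlight can be illustrated with a minimal sketch. The helper name `build_embedding_request` is hypothetical and not taken from this PR. It only shows the idea of coercing `dimensions` to an integer, and omitting it when unset, before the keyword arguments reach an embeddings call.

```python
from typing import Any, Dict, List, Optional


def build_embedding_request(
    model_name: str,
    texts: List[str],
    dimensions: Optional[int] = None,
) -> Dict[str, Any]:
  """Builds keyword arguments for an embeddings request (hypothetical helper)."""
  kwargs: Dict[str, Any] = {"model": model_name, "input": texts}
  if dimensions is not None:
    # Send the value as a plain int and omit the key entirely when unset.
    kwargs["dimensions"] = int(dimensions)
  return kwargs


request = build_embedding_request(
    "text-embedding-3-small", ["hello", "world"], dimensions=256)
print(request)
```

With the `openai` package installed and a client configured, a dict like this could presumably be passed on as `client.embeddings.create(**request)`. How the handler in this PR builds its request may differ.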

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

github-actions · 2025-09-07T18:38:25Z

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers`

mohamedawnallah · 2025-09-09T11:35:59Z

@damccorm, is the OPENAI_API_KEY GitHub secret currently available? The OpenAITextEmbeddings tests currently aren't running because the openai Python package is not included in the test extra. This PR resolves that (along with some small bugs) and uses the OpenAITextEmbeddings model in the RAG package:

beam/sdks/python/apache_beam/ml/transforms/embeddings/open_ai_it_test.py

Line 48 in a2f9fb8

self.api_key = os.environ.get('OPENAI_API_KEY')

beam/sdks/python/apache_beam/ml/transforms/embeddings/open_ai_it_test.py

Lines 27 to 30 in a2f9fb8

    
    try:
      from sdks.python.apache_beam.ml.transforms.embeddings.open_ai import OpenAITextEmbeddings
    except ImportError:
      OpenAITextEmbeddings = None

The only blocker for this PR, as far as I can see, is confirming that an OPENAI_API_KEY GitHub secret with available quota exists. I already did a keyword search for that environment variable in the Beam codebase through my IDE and found it only in the open_ai_it_test.py test file. I can then submit a follow-up PR to add that secret to the relevant workflows, and this PR can move forward from there.
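
For reference, the skip behavior described above can be sketched as follows. This is a minimal illustration, not the test class from this PR. The class name `OpenAIEmbeddingsSmokeTest` is hypothetical, and the import path is assumed from the file location quoted earlier. The test is skipped when the dependency cannot be imported or when `OPENAI_API_KEY` is unset, which is why CI can stay green without ever exercising the API.

```python
import os
import unittest

# Guarded import: the module path is assumed from the quoted file location.
try:
  from apache_beam.ml.transforms.embeddings.open_ai import OpenAITextEmbeddings
except ImportError:
  OpenAITextEmbeddings = None


@unittest.skipIf(OpenAITextEmbeddings is None, "openai extra is not installed")
class OpenAIEmbeddingsSmokeTest(unittest.TestCase):
  """Hypothetical test case showing the two skip conditions."""
  def setUp(self):
    self.api_key = os.environ.get("OPENAI_API_KEY")
    if not self.api_key:
      self.skipTest("OPENAI_API_KEY is not set")

  def test_api_key_is_available(self):
    # Reached only when both the dependency and the secret are present.
    self.assertTrue(self.api_key)
```

Running this with `python -m unittest -v` reports the test as skipped, together with the reason, whenever either condition is unmet. That makes it easy to check whether a workflow actually has the secret wired in.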

damccorm · 2025-09-09T15:26:00Z

I don't think this is enabled - @jrmccluskey have you looked into ways to enable this at all as part of the OpenAI work/reviews you've done?

jrmccluskey · 2025-09-09T15:31:30Z

No, at the moment we do not have an OpenAI API key for Beam. I was looking into what the options were for that, and I think it comes down to needing an Apache billing account for OpenAI.

github-actions · 2025-11-12T12:47:02Z

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

mohamedawnallah · 2025-11-12T12:57:36Z

waiting on the author

mohamedawnallah added 6 commits September 7, 2025 15:39

sdks/python: properly access dimensions in OpenAITextEmbedding

ff8c05b

sdks/python: properly scale embeddings on 0 to 1 range in OpenAI

71c5841

sdks/python: properly import OpenAITextEmbeddings in the test file

1315494

sdks/python: add openai as extra dependency

9031331

sdks/python; add OpenAITextEmbeddings to RAG module

466d533

sdks/python: reuse chunk_approximately_equals function across RAG

ab0aea8

github-actions bot added the python label Sep 7, 2025

gemini-code-assist bot reviewed Sep 7, 2025

CHANGES.md: update release notes

8a15186

mohamedawnallah force-pushed the addOpenAITextEmbeddingModelToRAG branch from caeb740 to 8a15186 September 7, 2025 16:25

sdks/python: dont skip OpenAIEmbeddingsTest openAI supposed to exist

ae9b076

mohamedawnallah changed the title ~~RAG: Add OpenAI text embedding model~~ RAG: add OpenAI text embedding model Sep 7, 2025

mohamedawnallah added 2 commits September 7, 2025 16:46

sdks/python: fix formatting issues

0bba943

CHANGES.md: make PreCommit Whitespace workflow pass

f00f5c0

mohamedawnallah added 4 commits September 7, 2025 19:34

sdks/python: add missing test_utils.py to RAG embeddings

973a633

sdks/python: fix linting issues

60edb28

sdks/python: add open API dependency to tests

0fe9e6f

CHANGES.md: ref issue instead of PR

e9c45c2

mohamedawnallah force-pushed the addOpenAITextEmbeddingModelToRAG branch from 9ecb52f to e9c45c2 September 8, 2025 01:17

sdks/python: fix linting issues

b260dd5

damccorm mentioned this pull request Sep 23, 2025

[Tracking]: Beam 3.0.0 - Milestone 1 Key Features #36173

Open

github-actions bot added the stale label Nov 12, 2025

github-actions bot removed the stale label Nov 13, 2025
