@mohamedawnallah mohamedawnallah commented Sep 7, 2025

Description

Closes #36083.



@github-actions github-actions bot added the python label Sep 7, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @mohamedawnallah, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds OpenAI text embedding models to the RAG (Retrieval Augmented Generation) module of Apache Beam. Users can then generate OpenAI embeddings inside their Beam pipelines for tasks such as semantic search and information retrieval.

Highlights

  • OpenAI Text Embedding Integration: Introduces OpenAITextEmbeddings to the Beam RAG module, enabling the use of OpenAI models for generating text embeddings within Apache Beam pipelines.
  • New Module and Tests: Adds a new Python module (open_ai.py) for the OpenAI embedding implementation and a corresponding test file (open_ai_test.py) to ensure functionality.
  • Refactored Test Utility: Moves a common test utility function (chunk_approximately_equals) to a shared test_utils module, cleaning up existing test files like huggingface_test.py and vertex_ai_test.py.
  • Dependency Management: Updates setup.py to include openai as a dependency, making it available for the ml and a new openai extra, ensuring proper installation.
  • OpenAI API Parameter Fix: Corrects the way dimensions are passed to the OpenAI API in the existing _OpenAITextEmbeddingHandler, ensuring the parameter is sent as an integer.
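
As a rough illustration of the last highlight, the sketch below assembles the keyword arguments for an embeddings request and coerces dimensions to an integer before it is sent. The helper name build_embedding_kwargs is hypothetical and not taken from the PR; the model, input, and dimensions keys are assumed to match the parameters of the OpenAI embeddings endpoint, and the handler's actual code may differ.

```python
from typing import Any, Dict, Optional, Sequence


def build_embedding_kwargs(
    model_name: str,
    texts: Sequence[str],
    dimensions: Optional[Any] = None) -> Dict[str, Any]:
  """Hypothetical helper: assemble kwargs for an embeddings request."""
  kwargs: Dict[str, Any] = {"model": model_name, "input": list(texts)}
  if dimensions is not None:
    # Coerce to int so values such as "256" or 256.0 are sent as 256.
    kwargs["dimensions"] = int(dimensions)
  return kwargs
```

With the openai Python client, a dictionary like this would presumably be passed along the lines of client.embeddings.create(**kwargs). Leaving dimensions out entirely when it is None lets the API fall back to the model's default output size.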

@mohamedawnallah mohamedawnallah force-pushed the addOpenAITextEmbeddingModelToRAG branch from caeb740 to 8a15186 Compare September 7, 2025 16:25
@mohamedawnallah mohamedawnallah changed the title RAG: Add OpenAI text embedding model RAG: add OpenAI text embedding model Sep 7, 2025

github-actions bot commented Sep 7, 2025

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment "assign set of reviewers".

@mohamedawnallah mohamedawnallah force-pushed the addOpenAITextEmbeddingModelToRAG branch from 9ecb52f to e9c45c2 Compare September 8, 2025 01:17

mohamedawnallah commented Sep 9, 2025

@damccorm, is the OPENAI_API_KEY GitHub secret currently available? The OpenAITextEmbeddings tests aren't actually running at the moment because the openai Python package is not included in the test extra. This PR fixes that (along with a few small bugs) and uses the OpenAI text embedding model in the RAG package:

self.api_key = os.environ.get('OPENAI_API_KEY')

try:
  from sdks.python.apache_beam.ml.transforms.embeddings.open_ai import OpenAITextEmbeddings
except ImportError:
  OpenAITextEmbeddings = None

As far as I can see, the only blocker for this PR is confirming that an OPENAI_API_KEY GitHub secret exists and has available quota. I searched the Beam codebase for that environment variable in my IDE and found it only in the open_ai_it_test.py test file. Once that is confirmed, I can submit a follow-up PR to add the secret to the relevant workflows, and this PR can move forward from there.
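
For readers unfamiliar with why such tests can pass CI without ever running, here is a minimal sketch of the guard pattern described above: the test class is skipped when either the openai package or the API key is missing. The names missing_requirement and OpenAIEmbeddingsSmokeTest are hypothetical and are not taken from open_ai_it_test.py; only the OPENAI_API_KEY variable comes from the comment above.

```python
import importlib.util
import os
import unittest

# Assumption: the tests need both the `openai` package and an API key.
OPENAI_INSTALLED = importlib.util.find_spec("openai") is not None
API_KEY = os.environ.get("OPENAI_API_KEY")


def missing_requirement(installed: bool, api_key) -> str:
  """Returns a skip reason, or an empty string when the tests can run."""
  if not installed:
    return "openai package is not installed"
  if not api_key:
    return "OPENAI_API_KEY is not set"
  return ""


_SKIP_REASON = missing_requirement(OPENAI_INSTALLED, API_KEY)


@unittest.skipIf(bool(_SKIP_REASON), _SKIP_REASON)
class OpenAIEmbeddingsSmokeTest(unittest.TestCase):
  def test_api_key_present(self):
    # Only reached when the package is installed and the key is set.
    self.assertTrue(API_KEY)
```

Reporting an explicit skip reason makes it easier to notice in CI logs that a suite was skipped because of a missing dependency or secret, rather than because it passed.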


damccorm commented Sep 9, 2025

I don't think this is enabled - @jrmccluskey have you looked into ways to enable this at all as part of the OpenAI work/reviews you've done?

@jrmccluskey

No, at the moment we do not have an OpenAI API key for Beam. I was looking into what the options are for that, and I think it comes down to needing an Apache billing account for OpenAI.

@github-actions

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the Beam dev mailing list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 12, 2025
@mohamedawnallah

waiting on the author

@github-actions github-actions bot removed the stale label Nov 13, 2025