Python: feat(oracle): add new Oracle connector for Semantic Kernel #13229
base: main
Conversation
@microsoft-github-policy-service agree [company="Oracle"]
@microsoft-github-policy-service agree company="Oracle"
@@ -0,0 +1,1349 @@
# Copyright (c) 2025, Oracle Corporation. All rights reserved.

from __future__ import annotations
Reviewer: this shouldn't be necessary.
Author: Removed it.
    VectorSearchExecutionException,
    VectorStoreOperationException,
)
from semantic_kernel.exceptions.memory_connector_exceptions import MemoryConnectorConnectionException
Reviewer suggested change:
- from semantic_kernel.exceptions.memory_connector_exceptions import MemoryConnectorConnectionException
+ from semantic_kernel.exceptions import MemoryConnectorConnectionException
Author: Updated code.
    async connection pools for Oracle.
    """

    user: str | None = Field(default=None, validation_alias=ORACLE_USER_ENV_VAR)
Reviewer: This is not the intended use of a KernelBaseSettings object. In this class, set env_prefix: ClassVar[str] = "ORACLE_", and then each of the parameters will be prefixed by that plus the name of the parameter capitalized, so you can remove all the validation_aliases. Only min and max should then become pool_min and pool_max. It is also important to document in the docstring which parameters there are and what their respective env variable names are.
Author: Updated code!
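The env_prefix behavior the reviewer describes can be sketched without the Semantic Kernel dependency. OracleSettingsSketch below is a hypothetical stand-in for the real KernelBaseSettings subclass, which delegates this mapping to pydantic-settings:

```python
import os
from typing import ClassVar

class OracleSettingsSketch:
    """Hypothetical stand-in for a KernelBaseSettings subclass.

    With env_prefix = "ORACLE_", each field loads from the environment
    variable named prefix + FIELD_NAME.upper(): user -> ORACLE_USER,
    pool_min -> ORACLE_POOL_MIN, and so on -- no validation_alias needed.
    """

    env_prefix: ClassVar[str] = "ORACLE_"

    def __init__(self) -> None:
        self.user = os.environ.get(self.env_prefix + "USER")
        self.pool_min = os.environ.get(self.env_prefix + "POOL_MIN")
        self.pool_max = os.environ.get(self.env_prefix + "POOL_MAX")

os.environ["ORACLE_USER"] = "scott"
os.environ["ORACLE_POOL_MIN"] = "2"
settings = OracleSettingsSketch()
print(settings.user)      # scott
print(settings.pool_min)  # 2
```

Because the prefix fixes the variable names mechanically, the docstring can simply list each field next to its derived environment variable name.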
    connection_pool: oracledb.AsyncConnectionPool | None = None

    model_config = SettingsConfigDict(
Reviewer: this is also likely not needed.
Author: Removed it!
    wallet_location: str | None = Field(default=None, validation_alias=ORACLE_WALLET_LOCATION_ENV_VAR)
    wallet_password: SecretStr | None = Field(default=None, validation_alias=ORACLE_WALLET_PASSWORD_ENV_VAR)

    connection_pool: oracledb.AsyncConnectionPool | None = None
Reviewer: this should be a PrivateAttr or changed to _connection_pool.
Author: Updated code!
    def _unwrap_secret(self, value):
        if value is None:
            return None
        return value.get_secret_value() if hasattr(value, "get_secret_value") else str(value)
Reviewer: is this really needed, since you only use this for parameters that you know are secrets?
Author: Updated code!
    # Create pool with extra user-supplied kwargs
    self.connection_pool = oracledb.create_pool_async(
        user=self.user,
        password=self._unwrap_secret(self.password),
Reviewer suggested change:
- password=self._unwrap_secret(self.password),
+ password=self.password.get_secret_value() if self.password else None,
Author: Updated code!
    connection_pool: oracledb.AsyncConnectionPool | None = None
    db_schema: str | None = None
    pool_args: dict[str, Any] | None = None
    supported_key_types: ClassVar[set[str] | None] = {"str", "int", "UUID"}
Reviewer: is UUID a separate type (in Python)?
Author: Yes, UUID is a separate type in Python, not just a string.
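Indeed, uuid.UUID is its own class in the standard library, distinct from str:

```python
import uuid

key = uuid.uuid4()
print(type(key) is uuid.UUID)  # True
print(isinstance(key, str))    # False
# A canonical string form is available for storage in a text key column:
print(len(str(key)))           # 36
```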
    query, bind, columns = await self._inner_search_vector(options, values, vector, **kwargs)

    # If total count is requested, fetch all rows to count.
    if options.include_total_count:
Reviewer: if this is set but the database doesn't support a parameter, then we shouldn't pull everything in; just ignore the setting.
Author: Having considered options.include_total_count for Oracle, here are some possible approaches:
1. Raise a warning or log message (e.g., RuntimeWarning or logger.warning) indicating that include_total_count is not supported for Oracle and will be ignored.
2. Raise a NotSupportedError if we want stricter enforcement and to prevent misuse in performance-sensitive scenarios.
3. Fetch all rows and log a warning, a balanced approach:
   - Users still get the total count.
   - A clear log message or warning highlights that fetching all rows may be inefficient for large datasets.
   - The behavior will be properly documented to ensure developers are aware of the performance implications.
I recommend option 3, as it maintains correctness while providing transparency and guidance, but I'd like to hear your thoughts or whether you prefer a stricter approach.
Author: @eavanvalkenburg Need your input here.
Reviewer: I would go with 1; I think that's what we do in most other cases. I just looked at some, and in many we actually don't even log anything when the option isn't supported. I will update that and log something in those cases. Thanks for looking at this.
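Option 1 (ignore the flag and log a warning) can be sketched as follows; run_search and the options object here are hypothetical stand-ins for the connector's search path, not the actual Semantic Kernel API:

```python
import logging

logger = logging.getLogger("oracle_connector_sketch")

def run_search(options, rows):
    """Hypothetical search wrapper: ignores include_total_count with a warning."""
    if getattr(options, "include_total_count", False):
        logger.warning(
            "include_total_count is not supported by the Oracle connector "
            "and will be ignored."
        )
    # Results are returned without fetching every row just to count them.
    return rows

class _Options:
    include_total_count = True

results = run_search(_Options(), ["row1", "row2"])
print(results)
```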
    # Third-party Libraries
    import numpy as np
    import oracledb
Reviewer: are these present in the pyproject under an appropriate extra?
Author: Good catch, this dependency isn't included under an appropriate extra in pyproject.toml yet. I'll add it so the extras section is complete and consistent.
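A sketch of what such an extra could look like; the group name and version bound are assumptions, not the actual contents of semantic_kernel's pyproject.toml:

```toml
# Hypothetical extras group; actual names and pins may differ.
[project.optional-dependencies]
oracle = [
    "oracledb >= 2.0",
]
```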
    """

    env_prefix: ClassVar[str] = "ORACLE_"
    user: str | None = Field(default=None, validation_alias=USER_ENV_VAR)
Reviewer: do we need all these aliases? As far as I can see, the environment variable used for this line would be ORACLE_USER, and that is also the value of USER_ENV_VAR, so this doesn't add anything.
Author: You're right that USER_ENV_VAR currently has the value "ORACLE_USER", and with validation_alias=USER_ENV_VAR the model just maps ORACLE_USER → user. Technically this doesn't add functionality if we don't need a difference between the field name and the environment variable name. The only reason to keep the alias would be if we intentionally want Python-friendly field names (user) while still reading uppercase environment variables (ORACLE_USER). If we don't need that separation, we can remove the alias and either rename the field to match the env var, or rely on the field name directly without aliasing. So yes, if the field name and the env var name always match, the alias doesn't provide additional value.
Reviewer: you're misunderstanding: because of the env_prefix, the field user will already map to the env variable ORACLE_USER, so that is already the name it loads from and the validation_alias doesn't change anything. We can always reintroduce it later if we want to change the mapping without changing the code that uses the settings. TL;DR: if we remove it, the same variables are loaded.
Author: Thanks for the clarification, that makes sense now. Since env_prefix already ensures user maps to ORACLE_USER, the validation alias wasn't actually doing anything there. I had assumed it was required for the mapping, but if the prefix already guarantees the correct variable name, then removing it is fine and keeps things clean. I did keep the validation alias for a few fields like the pool min/max/increment values, because those use different environment variable names, so in those cases the alias is still needed. Good point as well about being able to reintroduce aliases later if we ever need backward-compatible names.
    )

    # Build settings from env if we need to manage our own pool
    self._settings = settings or OracleSettings(env_file_path=env_file_path, env_file_encoding=env_file_encoding)
Reviewer: I would prefer that this is created first and then used to create the connection_pool, which is then always present. Or are there benefits to late initialization of the connection pool?
Author: Just to confirm my understanding of your suggestion:
- "This is created first" → you want the OracleSettings object to be created in the store, not inside the collection.
- "Then used to create the connection_pool" → the store would also manage the pool creation (either eagerly or via centralized async logic), so that collections never handle settings or pool creation themselves.
- Collections would then simply consume the store's connection_pool, without reading env vars or creating _settings.
Is this the approach you were envisioning?
Reviewer: No, collections should be usable without a store. So what we do in most connectors is:
- check if there is something that we need before calling super().__init__()
- create those (often involving creating settings and, with those settings, creating a client or connection pool)
- pass the newly created connection_pool and the managed_client flag, appropriately set, to super
Remember that super().__init__ in this case is all about pydantic, so it's about filling out the defined and validated fields. The advantage of this approach is that you can change connection_pool: oracledb.AsyncConnectionPool | None = None to connection_pool: oracledb.AsyncConnectionPool, simplifying the logic later on. I do recognize that connection pools are always somewhat special, so if it makes more sense for Oracle connection pools, then by all means create them on first run instead.
Author: Thanks, with your guidance I'm going to switch to eager initialization. That means, before calling super().__init__, I will:
- Build the connector settings.
- Create the Oracle AsyncConnectionPool eagerly.
- Determine whether the collection should manage the client (managed_client=True if the pool was created internally).
Then I will call super().__init__(...) with the final validated fields, including connection_pool= and managed_client=<True/False>. Because initialization is now eager, there's no need for lazy initialization inside __aenter__; __aenter__ will simply return self. In __aexit__, I'll close the pool only if managed_client=True.
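The flow agreed on above can be sketched as follows; the pool factory, the default DSN, and the collection class are stand-ins, not the real oracledb or Semantic Kernel APIs:

```python
import asyncio

def fake_create_pool(dsn: str) -> str:
    # Stand-in for oracledb.create_pool_async(...).
    return f"pool({dsn})"

class OracleCollectionSketch:
    def __init__(self, connection_pool=None, settings=None):
        managed_client = False
        if connection_pool is None:
            # Eager path: build settings and the pool up front, before the
            # real connector would call the pydantic super().__init__().
            settings = settings or {"dsn": "localhost/FREEPDB1"}  # assumed default
            connection_pool = fake_create_pool(settings["dsn"])
            managed_client = True  # we created it, so we must close it
        self.connection_pool = connection_pool
        self.managed_client = managed_client

    async def __aenter__(self):
        return self  # pool already exists, nothing lazy to do

    async def __aexit__(self, *exc):
        if self.managed_client:
            self.connection_pool = None  # stand-in for closing the pool

async def demo():
    async with OracleCollectionSketch() as col:
        return col.managed_client

print(asyncio.run(demo()))  # True
```

A collection built around an externally supplied pool would get managed_client=False and leave the pool open on exit.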
    env_file_path: str | None = None,
    env_file_encoding: str | None = None,
    settings: OracleSettings | None = None,
    pool_args: dict[str, Any] | None = None,
Reviewer: these aren't documented.
Author: Will add to the docstring.
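A hedged sketch of the docstring entries these parameters could get; the wording is an assumption, not the final docstring:

```python
def init_sketch(env_file_path=None, env_file_encoding=None, settings=None, pool_args=None):
    """Show candidate docstring entries for the undocumented parameters.

    Args:
        env_file_path: Path to a .env file holding the Oracle settings.
        env_file_encoding: Encoding of that .env file (commonly "utf-8").
        settings: A pre-built OracleSettings instance; when provided, the
            env-file arguments are ignored.
        pool_args: Extra keyword arguments forwarded to the connection
            pool factory.
    """
    return sorted(locals())

print(init_sketch())  # ['env_file_encoding', 'env_file_path', 'pool_args', 'settings']
```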
    VectorSearchExecutionException,
    VectorStoreOperationException,
)
from semantic_kernel.exceptions import MemoryConnectorConnectionException
Reviewer: when you have the pre-commit setup, these imports should be merged; this is also checked here as a GH action, so have a look at the DEV_SETUP.md in the python root!
Motivation and Context
This change enables Semantic Kernel users to store and retrieve embeddings using Oracle databases. Semantic Kernel already supports vector storage for several backends, but Oracle was missing; this connector fills that gap by providing full async support, native VECTOR type handling, and vector index management.
Description
This PR introduces a new Oracle connector for Semantic Kernel with the following features:
- Asynchronous upsert, get, delete, and search operations for memory records.
- Native Oracle VECTOR type support for storing embeddings efficiently.
- Support for HNSW and IVFFLAT vector indexes for similarity search.
- Integration with Semantic Kernel collections, enabling semantic search and memory operations.
- Comprehensive unit tests to ensure correctness and stability.
The connector is designed to work seamlessly with existing Semantic Kernel memory abstractions and follows the same async patterns as other vector stores.
Integration tests have also been implemented and verified locally; however, they are not included in this PR because the CI environment setup for Oracle Database support is currently unknown. Once guidance is provided on Oracle DB availability in the CI pipeline, integration tests can be enabled in a follow-up PR.
Contribution Checklist