Skip to content

Conversation

@monita1208
Copy link

@monita1208 monita1208 commented Oct 7, 2025

…ync support

Motivation and Context

This change is required to enable Semantic Kernel users to store and retrieve embeddings using Oracle databases. Currently, Semantic Kernel supports vector storage for several backends, but Oracle was missing. This connector solves that gap by providing full async support, native VECTOR type handling, and vector index management.

Description

This PR introduces a new Oracle connector for Semantic Kernel with the following features:

  • Asynchronous upsert, get, delete and search operations for memory records.

  • Native Oracle VECTOR type support for storing embeddings efficiently.

  • Support for HNSW and IVFFLAT vector indexes for similarity search.

  • Integration with Semantic Kernel collections, enabling semantic search and memory operations.

  • Comprehensive unit tests to ensure correctness and stability.

The connector is designed to work seamlessly with existing Semantic Kernel memory abstractions and follows the same async patterns as other vector stores.

Integration tests have also been implemented and verified locally; however, they are not included in this PR because the current CI environment setup for Oracle Database support is unknown.
Once guidance is provided on Oracle DB availability in the CI pipeline, integration tests can be enabled and added in a follow-up PR.

Contribution Checklist

@monita1208 monita1208 requested a review from a team as a code owner October 7, 2025 19:17
@monita1208
Copy link
Author

@monita1208 please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="oracle"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@monita1208 please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="oracle"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@monita1208 monita1208 closed this Oct 7, 2025
@monita1208 monita1208 reopened this Oct 7, 2025
@alexkeh
Copy link

alexkeh commented Oct 7, 2025

@microsoft-github-policy-service agree [company="Oracle"]

@monita1208
Copy link
Author

monita1208 commented Oct 7, 2025

@microsoft-github-policy-service agree company="Oracle"

@markwallace-microsoft markwallace-microsoft added the msft.ext.vectordata Related to Microsoft.Extensions.VectorData label Oct 7, 2025
@@ -0,0 +1,1349 @@
# Copyright (c) 2025, Oracle Corporation. All rights reserved.

from __future__ import annotations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shouldn't be necessary

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed it.

VectorSearchExecutionException,
VectorStoreOperationException
)
from semantic_kernel.exceptions.memory_connector_exceptions import MemoryConnectorConnectionException
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from semantic_kernel.exceptions.memory_connector_exceptions import MemoryConnectorConnectionException
from semantic_kernel.exceptions import MemoryConnectorConnectionException

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated code.

async connection pools for Oracle.
"""

user: str | None = Field(default=None, validation_alias=ORACLE_USER_ENV_VAR)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the intended use of a KernelBaseSettings object, make sure in this class to set:

env_prefix: ClassVar[str] = "ORACLE_"

and then each of the parameters will be prefixed by that + the name of the param capatilized. And then you can remove all the validation_alias's

So only, min and max, should then become pool_min and pool_max.

And it is important to document in the docstring which parameters there are and what their respective env variable name is.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated code!


connection_pool: oracledb.AsyncConnectionPool | None = None

model_config = SettingsConfigDict(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also likely not needed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed it!

wallet_location: str | None = Field(default=None, validation_alias=ORACLE_WALLET_LOCATION_ENV_VAR)
wallet_password: SecretStr | None = Field(default=None, validation_alias=ORACLE_WALLET_PASSWORD_ENV_VAR)

connection_pool: oracledb.AsyncConnectionPool | None = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a PrivateAttr or changed to _connection_pool

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated code!

def _unwrap_secret(self, value):
if value is None:
return None
return value.get_secret_value() if hasattr(value, "get_secret_value") else str(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this really needed, since you only use this for parameters that you know are secrets...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated code!

# Create pool with extra user-supplied kwargs
self.connection_pool = oracledb.create_pool_async(
user=self.user,
password=self._unwrap_secret(self.password),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
password=self._unwrap_secret(self.password),
password=self.password.get_secret_value() if self.password else None

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated code!

connection_pool: oracledb.AsyncConnectionPool | None = None
db_schema: str | None = None
pool_args: dict[str, Any] | None = None
supported_key_types: ClassVar[set[str] | None] = {"str", "int", "UUID"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is UUID a separate type (in python)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, UUID is a separate type in Python, not just a string.

query, bind, columns = await self._inner_search_vector(options, values, vector, **kwargs)

# If total count is requested, fetch all rows to count.
if options.include_total_count:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is set, but the database doesn't support a parameter, then we shouldnt pull everything in, just ignore the setting

Copy link
Author

@monita1208 monita1208 Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having considered options.include_total_count for Oracle, here are some possible approaches,

  1. Raise a warning or log message (e.g., RuntimeWarning or logger.warning) indicating that include_total_count is not supported for Oracle and will be ignored.
  2. Raise a NotSupportedError if we want stricter enforcement and to prevent misuse in performance-sensitive scenarios.
  3. Fetch all rows and log a warning a balanced approach,
  • Users still get the total count.
  • A clear log message or warning highlights that fetching all rows may be inefficient for large datasets.
  • The behavior will be properly documented to ensure developers are aware of the performance implications.

I recommend option 3 as it maintains correctness while providing transparency and guidance but like to hear your thoughts or if you prefer a stricter approach.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eavanvalkenburg Need your input here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would go with 1, I think that's what we do in most other cases, just looked at some, and in many we actually don't even log anything when it's not supported, I will update that and log something in those cases. Thanks for looking at this

@markwallace-microsoft markwallace-microsoft added the python Pull requests for the Python Semantic Kernel label Nov 4, 2025
@github-actions github-actions bot changed the title feat(oracle): add new Oracle connector for Semantic Kernel Python: feat(oracle): add new Oracle connector for Semantic Kernel Nov 4, 2025

# Third-party Libraries
import numpy as np
import oracledb
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these present in the pyproject under a appropriate extra?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, this dependency isn't included under an appropriate extra in the pyproject.toml yet. It is missing at the moment and I’ll add it so the extras section is complete and consistent.

"""

env_prefix: ClassVar[str] = "ORACLE_"
user: str | None = Field(default=None, validation_alias=USER_ENV_VAR)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need all these aliases? as far as I can see, the used variable name for this line would be: ORACLE_USER and that is also the value of the USER_ENV_VAR, so this doesn't add anything

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that USER_ENV_VAR currently has the value "ORACLE_USER" and with validation_alias=USER_ENV_VAR the model just maps ORACLE_USER → user. Technically this doesn't add functionality if we don't need a difference between the field name and the environment variable name.

The only reason to keep the alias would be if we intentionally want Python friendly field names (user) while still reading uppercase environment variables (ORACLE_USER). If we don’t need that separation, we can remove the alias and either:

  • rename the field to match the env var, or
  • rely on the field name directly without aliasing.

So yes, if the field name and the env var name always match, the alias doesn't provide additional value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're misunderstanding, because of the env_prefix the field user will already map to env variable ORACLE_USER so that is already the name of the field when it loads, so the validation_alias doesn't change anything, we can always reintroduce it later if we want to change it without changing the code that uses the settings. TLDR if we remove it, the same variables are loaded

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification, that makes sense now. Since env_prefix already ensures user maps to ORACLE_USER, the validation alias wasn’t actually doing anything there. I had assumed it was required for the mapping, but if the prefix already guarantees the correct variable name, then removing it is fine and keeps things clean.
I did keep the validation alias for a few fields like the pool min/max/increment values because those use different environment variable names, so in those cases the alias is still needed. Good point as well about being able to reintroduce aliases later if we ever need backward compatible names.

)

# Build settings from env if we need to manage our own pool
self._settings = settings or OracleSettings(env_file_path=env_file_path, env_file_encoding=env_file_encoding)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer that this is created first and then used to create the connection_pool which is then always present, or are there benefits to late initialization of the connection pool?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm my understanding of your suggestion:

“This is created first” → you want the OracleSettings object to be created in the store, not inside the collection.

“Then used to create the connection_pool” → the store would also manage the pool creation (either eagerly or via centralized async logic), so that collections never handle settings or pool creation themselves.

Collections would then simply consume the store’s connection_pool, without reading env vars or creating _settings.

Is this the approach you were envisioning?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, Collection's should be usable without a Store. So what we do in most connectors is:

  • check if there is something that we need before calling super.__init__()
  • create those (often involving creating settings and with those settings, create a client or connection pool)
  • pass the newly created connection_pool and the managed_client appropriately set to super

remember that super.init in this case is all about pydantic, so it's about filling out the defined and validated fields.
The advantage of this approach is that you can actually set connection_pool: oracledb.AsyncConnectionPool | None = None to connection_pool: oracledb.AsyncConnectionPool simplying the logic later on.

I do recognize that connection pools are always somewhat special so if this makes more sense for oracle connection pools then by al means create them on first run instead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, with your guidance, I’m going to switch to eager initialization.
That means:
Before calling super().init, I will:

  • Build the connector settings.
  • Create the Oracle AsyncConnectionPool eagerly.
  • Determine whether the collection should manage the client (managed_client=True if the pool was created internally).

Then I will call super().init(...) with the final validated fields, including:

  • connection_pool=
  • managed_client=<True/False>

Because initialization is now eager, there’s no need for lazy initialization inside aenter.
aenter will simply return self.

In aexit, I’ll close the pool only if managed_client=True.

env_file_path: str | None = None,
env_file_encoding: str | None = None,
settings: OracleSettings | None = None,
pool_args: dict[str, Any] | None = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these aren't documented

Copy link
Author

@monita1208 monita1208 Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add to docstring

VectorSearchExecutionException,
VectorStoreOperationException
)
from semantic_kernel.exceptions import MemoryConnectorConnectionException
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when you have the pre-commit setup, these should be merged, this is also checked here as a GH action, so have a look at the DEV_SETUP.md in the python root!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

msft.ext.vectordata Related to Microsoft.Extensions.VectorData python Pull requests for the Python Semantic Kernel

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants