Adding Location Cache #2801


Open: wants to merge 21 commits into main

Conversation

@gsa9989 (Contributor) commented Jul 17, 2025

PR Summary:

  • The location cache organizes the database account data gathered by the global endpoint manager. It holds the client's preferred locations as well as lists of write and read regions, grouped by location and availability. Its purpose is to provide an available endpoint based on the client's preferred locations.
  • Added location cache functionality and tests.

Implementation details:

  • APIs include (see the sketch after this list):
    • update
    • refresh_endpoints
    • refresh_stale_endpoints
    • get_endpoints_by_location
    • get_preferred_available_endpoints
    • mark_endpoint_unavailable
    • is_endpoint_unavailable
    • resolve_service_endpoint
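
A minimal, self-contained sketch of the behavior these APIs describe is below. It is illustrative only: the struct layout, field names, and signatures here are assumptions made for the example, not the PR's actual implementation in location_cache.rs.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

#[derive(Clone, Copy, Debug, PartialEq)]
enum RequestOperation {
    Read,
    #[allow(dead_code)]
    Write,
}

struct LocationCache {
    // Read endpoints ordered by the client's preferred locations.
    read_endpoints: Vec<String>,
    // endpoint -> (operation it is unavailable for, time it was last marked).
    unavailable: HashMap<String, (RequestOperation, SystemTime)>,
    // How long an "unavailable" entry stays in effect before it is considered stale.
    stale_after: Duration,
}

impl LocationCache {
    fn mark_endpoint_unavailable(&mut self, endpoint: &str, operation: RequestOperation) {
        self.unavailable
            .insert(endpoint.to_string(), (operation, SystemTime::now()));
    }

    fn is_endpoint_unavailable(&self, endpoint: &str, operation: RequestOperation) -> bool {
        match self.unavailable.get(endpoint) {
            Some((op, marked_at)) if *op == operation => {
                // A stale entry no longer counts as unavailable.
                marked_at
                    .elapsed()
                    .map(|age| age < self.stale_after)
                    .unwrap_or(false)
            }
            _ => false,
        }
    }

    fn resolve_service_endpoint(&self, index: usize, _operation: RequestOperation) -> &str {
        // Wrap the index so any value maps onto a known endpoint.
        &self.read_endpoints[index % self.read_endpoints.len()]
    }
}

fn main() {
    let mut cache = LocationCache {
        read_endpoints: vec![
            "https://location1.documents.example.com".to_string(),
            "https://location2.documents.example.com".to_string(),
        ],
        unavailable: HashMap::new(),
        stale_after: Duration::from_secs(300),
    };

    cache.mark_endpoint_unavailable("https://location1.documents.example.com", RequestOperation::Read);
    assert!(cache.is_endpoint_unavailable("https://location1.documents.example.com", RequestOperation::Read));

    // An out-of-range index still resolves to a valid endpoint.
    println!("{}", cache.resolve_service_endpoint(5, RequestOperation::Read));
}
```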

Current limitations:

  • Multi-write accounts are not yet supported; support is planned in a follow-up PR together with the global endpoint manager.

gsa9989 and others added 8 commits June 3, 2025 11:42
  • if_match_etag changes with tests
  • read me change
  • merge upstream/main
@github-actions github-actions bot added the Cosmos (The azure_cosmos crate) label Jul 17, 2025
@gsa9989 gsa9989 changed the title from "Adding Global Endpoint Manager and Location Cache" to "Adding Location Cache" Aug 7, 2025
@gsa9989 gsa9989 marked this pull request as ready for review August 7, 2025 18:03
@Copilot Copilot AI review requested due to automatic review settings August 7, 2025 18:03
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds location cache functionality to the Azure Cosmos SDK, providing a system to organize database account data and manage endpoint availability. The location cache tracks client preferred locations, available read/write regions, and endpoint availability to provide optimal endpoint selection for database operations.

Key Changes:

  • Added comprehensive location cache implementation with endpoint management and availability tracking
  • Implemented API methods for updating, refreshing, and resolving service endpoints
  • Added extensive test coverage for all location cache functionality

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

Reviewed files:
  • sdk/cosmos/azure_data_cosmos/src/location_cache.rs: Complete implementation of LocationCache with availability tracking, endpoint management, and a comprehensive test suite
  • sdk/cosmos/azure_data_cosmos/src/lib.rs: Added module declarations for location_cache and global_endpoint_manager
  • sdk/cosmos/azure_data_cosmos/src/global_endpoint_manager.rs: Added empty GlobalEndpointManager struct placeholder

@analogrelay (Member) left a comment

Some initial stuff, but nothing significant! Looking good!

let now = std::time::SystemTime::now();

{
let mut location_unavailability_info_map =
Member

Would it be possible to log a message when an endpoint is marked unavailable?

Contributor Author

As in return a message to the user that the function worked?

@tvaron3 (Member) commented Aug 11, 2025

Something like the examples here https://docs.rs/log/latest/log/ and the message can be "<endpoint> was marked unavailable for <write/read>". It could help debugging if a customer asks why their traffic switched from one region to another.

Contributor Author

I added info!("Endpoint {} was marked as unavailable for {:?}", endpoint, operation); without any issues other than importing tracing::info. However, if I wanted to test it I'd have to add the log crate to Cargo.toml. Should I do that, or is adding the message sufficient?

Member

tracing is the correct API to use (we don't actually use the log crate in the Rust SDK, we use tracing instead). I don't think we need a test for this case. We do test some of our diagnostics/telemetry, but I don't think individual log messages need to be tested.

Contributor Author

Got it, thank you!

Member

Small nit on the phrasing you had above: Our log messages generally start with lowercase letters. For example:

2025-08-12T19:40:21.225780Z DEBUG azure_data_cosmos::pipeline::signature_target: generating Cosmos auth signature signature_payload="get\ndocs\ndbs/SampleDB/colls/SampleContainer/docs/14174164-F6C0-47FC-83FB-604C6A63408D\ntue, 12 aug 2025 19:40:21 gmt\n\n"
2025-08-12T19:40:21.225858Z DEBUG typespec_client_core::http::policies::transport: sending request 'https://localhost:8081/dbs/SampleDB/colls/SampleContainer/docs/14174164-F6C0-47FC-83FB-604C6A63408D' request=Request { url: "https://localhost:8081/dbs/SampleDB/colls/SampleContainer/docs/14174164-F6C0-47FC-83FB-604C6A63408D", method: Get, headers: {"user-agent": "*****", "x-ms-documentdb-partitionkey": "*****", "x-ms-client-request-id": "*****", "x-ms-date": "*****", "x-ms-version": "*****", "authorization": "*****"}, body: Bytes {} }
2025-08-12T19:40:21.226064Z DEBUG typespec_client_core::http::clients::reqwest: performing request GET 'https://localhost:8081/dbs/SampleDB/colls/SampleContainer/docs/14174164-F6C0-47FC-83FB-604C6A63408D' with `reqwest`
2025-08-12T19:40:21.226185Z DEBUG reqwest::connect: starting new connection: https://localhost:8081/
2025-08-12T19:40:21.226863Z DEBUG hyper_util::client::legacy::connect::http: connecting to [::1]:8081
2025-08-12T19:40:21.227041Z DEBUG hyper_util::client::legacy::connect::http: connected to [::1]:8081

Also, "was marked" is passive voice, which we try to avoid where possible (we're not perfect ;)). We try to use active statements as much as possible, so "endpoint marked unavailable" (describing what the application is doing right now) rather than "endpoint was marked unavailable" (describing what the application did sometime in the past).

So it should be endpoint {} marked as unavailable for {:?}
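
Taken together with the earlier thread, the resulting call inside mark_endpoint_unavailable would look roughly like the fragment below (a sketch only; `endpoint` and `operation` are the variables from the PR's function, whose exact names may differ):

```rust
use tracing::info;

// Inside mark_endpoint_unavailable, after recording the unavailability entry.
// Lowercase start and active voice, per the review feedback.
info!("endpoint {} marked as unavailable for {:?}", endpoint, operation);
```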

@analogrelay (Member) left a comment

Looking good! I think we're close! The current build failure isn't your responsibility; it's an issue in a different crate that's also failing in main. We'll have to wait for it to be resolved before merging, but it shouldn't be an issue.

fn resolve_service_endpoint_default() {
let cache = create_test_location_cache();

let endpoint = cache.resolve_service_endpoint(5, RequestOperation::Read);
Member

nit: We should be able to access the read_endpoints list, so rather than picking a magic number we know is currently more than the number of locations in the test data, we can use the length itself:

Suggested change:
- let endpoint = cache.resolve_service_endpoint(5, RequestOperation::Read);
+ let endpoint = cache.resolve_service_endpoint(cache.read_endpoints.len() + 1, RequestOperation::Read);


let endpoint1 = "https://location1.documents.example.com";

let before_marked_unavailable_time = SystemTime::now();
Member

I like asserting this, but there's a potential problem. There's no guarantee that SystemTime::now() will have advanced between when you call it, and when mark_endpoint_unavailable runs. The SystemTime type in Rust is not guaranteed to be monotonic (i.e. it is not guaranteed to consistently increase). On most modern OSes, it will be, but that's not a guarantee.

Also, it's possible that the CPU instructions between here and when you capture SystemTime::now again inside mark_endpoint_unavailable all execute before the clock moves forward. In most cases this will work fine: there are enough instructions and CPU cycles between each call to get the time that the clock will likely have moved forward, but there's no guarantee of that, and slower/faster processors may disrupt the test.

All that is to say that asserting the passage of time in tests is tricky ;). Instead of capturing now here, pick a time several seconds before now to use as before_marked_unavailable_time (SystemTime::now() - Duration::from_secs(10)) and then set the last_check_time of the endpoint to that value (similar to what you're doing with the staleness check). Then, when you assert below that last_check_time is greater than before_marked_unavailable_time, you have a much higher chance of it being true.
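
A standalone illustration of the suggested pattern, using only std and a stand-in timestamp rather than the PR's actual test or cache fields:

```rust
use std::time::{Duration, SystemTime};

fn main() {
    // Anchor the baseline well in the past instead of calling now() twice in a row,
    // so the assertion does not depend on the clock advancing between two calls.
    let before_marked_unavailable_time = SystemTime::now() - Duration::from_secs(10);

    // Stand-in for the timestamp that mark_endpoint_unavailable would record.
    let last_check_time = SystemTime::now();

    // With a 10-second margin, this comparison is robust even if the clock
    // barely moves between the two now() calls.
    assert!(last_check_time > before_marked_unavailable_time);
}
```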

Labels: Cosmos (The azure_cosmos crate)
Projects: None yet
3 participants