Add Concurrency Sample (#33637)

bambriz · simorenoh · web-flow · commit c13ec181cd6c · 2024-01-19T14:28:13.000-08:00
* create concurrency sample

This adds a new sample file to showcase using concurrency to improve performance.

* update changelog

* update Changelog and Sample Comments

Removed the changelog entry, as this is just a sample file. Also added a note to avoid confusion between the general technical use of the word "batch" and the batch operations feature in Cosmos DB.

* update readme

Added an entry for the sample in the README. Also included a warning that the sample will consume a lot of RUs.

* Update sdk/cosmos/azure-cosmos/README.md

Co-authored-by: Simon Moreno <30335873+simorenoh@users.noreply.github.com>

---------

Co-authored-by: Simon Moreno <30335873+simorenoh@users.noreply.github.com>
diff --git a/sdk/cosmos/azure-cosmos/README.md b/sdk/cosmos/azure-cosmos/README.md
@@ -178,6 +178,13 @@ Streamable queries like `SELECT * FROM WHERE` *do* support continuation tokens.
 
 Typically, you can use [Azure Portal](https://portal.azure.com/), [Azure Cosmos DB Resource Provider REST API](https://docs.microsoft.com/rest/api/cosmos-db-resource-provider), [Azure CLI](https://docs.microsoft.com/cli/azure/azure-cli-reference-for-cosmos-db) or [PowerShell](https://docs.microsoft.com/azure/cosmos-db/manage-with-powershell) for the control plane unsupported limitations.
 
+### Using The Async Client as a Workaround to Bulk
+While the SDK supports transactional batch, bulk requests are not yet implemented in the Python SDK. As a possible workaround, you can use the async client, with this [concurrency sample][cosmos_concurrency_sample] as a reference.
+>[WARNING]
+> Using the asynchronous client for concurrent operations, as shown in this sample, consumes a lot of RUs very quickly. We **strongly recommend** testing against the Cosmos DB emulator first to verify that your code works well and to avoid incurring charges.
+
+
+
 ## Boolean Data Type
 
 While the Python language [uses](https://docs.python.org/3/library/stdtypes.html?highlight=boolean#truth-value-testing) "True" and "False" for boolean types, Cosmos DB [accepts](https://docs.microsoft.com/azure/cosmos-db/sql-query-is-bool) "true" and "false" only. In other words, the Python language uses Boolean values with the first uppercase letter and all other lowercase letters, while Cosmos DB and its SQL language use only lowercase letters for those same Boolean values. How to deal with this challenge?
@@ -757,6 +764,7 @@ For more extensive documentation on the Cosmos DB service, see the [Azure Cosmos
 [telemetry_sample]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/cosmos/azure-cosmos/samples/tracing_open_telemetry.py
 [timeouts_document]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/cosmos/azure-cosmos/docs/TimeoutAndRetriesConfig.md
 [cosmos_transactional_batch]: https://learn.microsoft.com/azure/cosmos-db/nosql/transactional-batch
+[cosmos_concurrency_sample]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/cosmos/azure-cosmos/samples/concurrency_sample.py
 
 ## Contributing
 
diff --git a/sdk/cosmos/azure-cosmos/samples/concurrency_sample.py b/sdk/cosmos/azure-cosmos/samples/concurrency_sample.py
@@ -0,0 +1,108 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License. See LICENSE.txt in the project root for
+# license information.
+# -------------------------------------------------------------------------
+# These examples are ingested by the documentation system, and are
+# displayed in the SDK reference documentation. When editing these
+# example snippets, take into consideration how this might affect
+# the readability and usability of the reference documentation.
+
+import os
+from azure.cosmos import PartitionKey, ThroughputProperties
+from azure.cosmos.aio import CosmosClient
+import asyncio
+import time
+
+# Specify information to connect to the client.
+CLEAR_DATABASE = True
+CONN_STR = os.environ['CONN_STR']
+# Specify information for Database and container.
+DB_ID = "Cosmos_Concurrency_DB"
+CONT_ID = "Cosmos_Concurrency_Cont"
+# specify partition key for the container
+pk = PartitionKey(path="/id")
+
+# Create items in batches to improve performance.
+# Note: Handle errors inside the coroutine being batched. Each failed Cosmos DB operation
+# raises its own error, and asyncio.wait does not re-raise exceptions from the tasks it waits on.
+# Note: The word `batch` here describes the subsets of items being created. It does not refer
+# to batch operations such as `Transactional Batch`, which is a feature of Cosmos DB.
+async def create_all_the_items(prefix, c, i):
+    await asyncio.wait(
+        [asyncio.create_task(c.create_item({"id": prefix + str(j)})) for j in range(100)]
+    )
+    print(f"Batch {i} done!")
+
+# The following demonstrates the performance difference between sequential item creation,
+# sequential item creation in batches, and concurrent item creation in batches. It is meant to show
+# best practices for getting good performance out of Cosmos DB.
+# Note that running many operations concurrently consumes throughput (RUs) quickly.
+# To avoid incurring charges, we recommend testing against the Cosmos DB emulator first.
+# The relative improvement seen on the emulator should be indicative of what you will see on a live account.
+async def main():
+    try:
+        async with CosmosClient.from_connection_string(CONN_STR) as client:
+            # For emulator: default Throughput needs to be increased
+            # throughput_properties = ThroughputProperties(auto_scale_max_throughput=5000)
+            # db = await client.create_database_if_not_exists(id=DB_ID, offer_throughput=throughput_properties)
+            db = await client.create_database_if_not_exists(id=DB_ID)
+            container = await db.create_container_if_not_exists(CONT_ID, partition_key=pk)
+
+            # A: Sequential without batching
+            timer = time.time()
+            print("Starting Sequential Item Creation.")
+            for i in range(20):
+                for j in range(100):
+                    await container.create_item({"id": f"{i}-sequential-{j}"})
+                print(f"{(i + 1) * 100} items created!")
+            sequential_item_time = time.time() - timer
+            print("Time taken: " + str(sequential_item_time))
+
+
+            # B: Sequential batches
+            # Batching operations can improve performance by dealing with multiple operations at a time.
+            timer = time.time()
+            print("Starting Sequential Batched Item Creation.")
+            for i in range(20):
+                await create_all_the_items(f"{i}-sequential-Batch-", container, i)
+            sequential_batch_time = time.time() - timer
+            print("Time taken: " + str(sequential_batch_time))
+
+            # C: Concurrent batches
+            # By using asyncio with batching, we can create multiple batches of items concurrently, which means that
+            # while one connection is waiting for IO (like waiting for data to arrive),
+            # Python can switch context to another connection and make progress there.
+            # This can lead to better utilization of system resources and can give the appearance of parallelism,
+            # as multiple connections are making progress seemingly at the same time
+            timer = time.time()
+            print("Starting Concurrent Batched Item Creation.")
+            await asyncio.wait(
+                [asyncio.create_task(create_all_the_items(f"{i}-concurrent-Batch-", container, i)) for i in range(20)]
+            )
+            concurrent_batch_time = time.time() - timer
+            print("Time taken: " + str(concurrent_batch_time))
+
+            # Calculate performance improvement on time metrics.
+            # Percentage of time saved relative to sequential item creation: (baseline - new) / baseline * 100.
+            sequential_per = round((sequential_item_time - sequential_batch_time) / sequential_item_time * 100, 2)
+            print(f"Sequential Batching is {sequential_per}% faster than Sequential Item Creation")
+            concurrent_per = round((sequential_item_time - concurrent_batch_time) / sequential_item_time * 100, 2)
+            print(f"Concurrent Batching is {concurrent_per}% faster than Sequential Item Creation")
+
+            item_list = [i async for i in container.read_all_items()]
+            print(f"End of the test. Read {len(item_list)} items.")
+
+    finally:
+        if CLEAR_DATABASE:
+            await clear_database()
+
+
+async def clear_database():
+    async with CosmosClient.from_connection_string(CONN_STR) as client:
+        await client.delete_database(DB_ID)
+    print(f"Deleted {DB_ID} database.")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
+
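
The pattern in `concurrency_sample.py` can be tried without a Cosmos DB account or emulator. Below is a minimal sketch, assuming a hypothetical `FakeContainer` whose `create_item` only sleeps to simulate network latency; it is not part of `azure-cosmos`, and `create_batch`, `percent_faster`, and `compare` are illustrative names as well. The sketch uses `asyncio.gather(..., return_exceptions=True)` in place of the sample's `asyncio.wait`, so that failed operations are returned to the caller rather than left on the tasks. The timings it produces say nothing about real RU consumption.

```python
import asyncio
import time


class FakeContainer:
    """Hypothetical stand-in for a Cosmos DB container (not an azure-cosmos class)."""

    def __init__(self, latency=0.01):
        self.latency = latency
        self.items = {}

    async def create_item(self, body):
        # Simulate one network round trip per write.
        await asyncio.sleep(self.latency)
        if body["id"] in self.items:
            raise ValueError(f"conflict on id {body['id']}")
        self.items[body["id"]] = body
        return body


async def create_batch(container, prefix, size):
    # Start all writes of one batch concurrently and collect any failures.
    results = await asyncio.gather(
        *(container.create_item({"id": f"{prefix}{j}"}) for j in range(size)),
        return_exceptions=True,
    )
    return [r for r in results if isinstance(r, Exception)]


def percent_faster(baseline, candidate):
    # Share of the baseline time that was saved, as a percentage.
    return round((baseline - candidate) / baseline * 100, 2)


async def compare(batches=3, size=10, latency=0.01):
    container = FakeContainer(latency)

    # A: one write at a time.
    start = time.perf_counter()
    for i in range(batches):
        for j in range(size):
            await container.create_item({"id": f"{i}-sequential-{j}"})
    sequential = time.perf_counter() - start

    # C: all batches concurrently, each batch writing its items concurrently.
    start = time.perf_counter()
    failures = await asyncio.gather(
        *(create_batch(container, f"{i}-concurrent-", size) for i in range(batches))
    )
    concurrent = time.perf_counter() - start

    errors = [e for batch in failures for e in batch]
    return sequential, concurrent, errors, len(container.items)


if __name__ == "__main__":
    seq, conc, errs, count = asyncio.run(compare())
    print(f"Created {count} items with {len(errs)} errors.")
    print(f"Concurrent batches were {percent_faster(seq, conc)}% faster.")
```

Against a real container, the same structure would apply with `container.create_item` from `azure.cosmos.aio`, where the list returned by `create_batch` gives one place to inspect or retry failed writes.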