# Use a service principal with the Spark 3 connector for Azure Cosmos DB for NoSQL

In this article, you learn how to create a Microsoft Entra application and service principal that can be used with role-based access control. You can then use this service principal to connect to an Azure Cosmos DB for NoSQL account from Spark 3.

## Prerequisites

1. Use the workspace interface to search for **Maven** packages from **Maven Central** with a **Group ID** of `com.azure.cosmos.spark`. Install the package for Spark 3.4 with an **Artifact ID** prefixed with `azure-cosmos-spark_3-4` to the cluster.

1. Finally, create a new **notebook**.

> [!TIP]
> By default, the notebook is attached to the recently created cluster.

1. Within the notebook, set Azure Cosmos DB Spark connector configuration settings for the NoSQL account endpoint, database name, and container name. Use the **Subscription ID**, **Resource Group**, **Application (client) ID**, **Directory (tenant) ID**, and **Client Secret** values recorded earlier in this article.
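
   A minimal PySpark sketch of these settings follows. It assumes the connector's service principal configuration keys (`spark.cosmos.auth.type`, `spark.cosmos.auth.aad.clientId`, and so on) and uses placeholder values in angle brackets; substitute the values you recorded earlier and confirm the key names against the configuration reference for your connector version.

   ```python
   # Sketch only: placeholder values shown in angle brackets.
   config = {
       "spark.cosmos.accountEndpoint": "https://<nosql-account-name>.documents.azure.com:443/",
       "spark.cosmos.database": "<database-name>",
       "spark.cosmos.container": "<container-name>",
       # Service principal (Microsoft Entra) authentication
       "spark.cosmos.auth.type": "ServicePrincipal",
       "spark.cosmos.account.subscriptionId": "<subscription-id>",
       "spark.cosmos.account.tenantId": "<directory-tenant-id>",
       "spark.cosmos.account.resourceGroupName": "<resource-group-name>",
       "spark.cosmos.auth.aad.clientId": "<application-client-id>",
       "spark.cosmos.auth.aad.clientSecret": "<client-secret>",
   }
   ```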
> [!TIP]
> In this quickstart example, credentials are assigned to variables in clear text. For security, we recommend that you use secrets. For more information on how to configure secrets, see [Add secrets to your Spark configuration](/azure/databricks/security/secrets/secrets#read-a-secret).
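
For example, a Databricks notebook can read the client secret from a secret scope instead of hard-coding it. The scope and key names in this sketch are hypothetical; create your own scope and secret first.

```python
# Hypothetical scope and key names; dbutils is available inside Databricks notebooks.
client_secret = dbutils.secrets.get(scope="<scope-name>", key="<secret-key-name>")
```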
---
title: 'Azure Cosmos DB Spark connector: Throughput control'
description: Learn how you can control throughput for bulk data movements in the Azure Cosmos DB Spark connector.
author: TheovanKraay
ms.service: cosmos-db
ms.subservice: nosql
ms.author: thvankra
---

The [Spark connector](quickstart-spark.md) allows you to communicate with Azure Cosmos DB by using [Apache Spark](https://spark.apache.org/). This article describes how the throughput control feature works. Check out our [Spark samples in GitHub](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples) to get started using throughput control.

This article documents the use of global throughput control groups in the Azure Cosmos DB Spark connector, but the functionality is also available in the [Java SDK](./sdk-java-v4.md). In the SDK, you can use global and local throughput control groups to limit the request unit (RU) consumption in the context of a single client connection instance. For example, you can apply this approach to different operations within a single microservice, or to a single data loading program. For more information, see how to [use throughput control](quickstart-java.md) in the Java SDK.

> [!WARNING]
> Throughput control isn't supported for gateway mode. Currently, for [serverless Azure Cosmos DB accounts](../serverless.md), attempting to use `targetThroughputThreshold` to define a percentage results in failure. You can only provide an absolute value for target throughput/RU by using `spark.cosmos.throughputControl.targetThroughput`.

## Why is throughput control important?

Throughput control helps to isolate the performance needs of applications that run against a container. Throughput control limits the number of [RUs](../request-units.md) that a specific Spark client can consume.

Several advanced scenarios benefit from client-side throughput control:

- **Different operations and tasks have different priorities:** There can be a need to prevent normal transactions from being throttled because of data ingestion or copy activities. Some operations or tasks aren't sensitive to latency and are more tolerant of being throttled than others.

- **Provide fairness/isolation to different users or tenants:** An application usually has many users. Some users might send too many requests, which consume all available throughput and cause others to get throttled.

- **Load balancing of throughput between different Azure Cosmos DB clients:** In some use cases, it's important to make sure all the clients get a fair (equal) share of the throughput.

Throughput control enables more granular RU rate limiting, as needed.

## How does throughput control work?

To configure throughput control for the Spark connector, you first create a container that defines throughput control metadata. The partition key is `groupId`, and `ttl` is enabled. Here, you create this container by using Spark SQL and call it `ThroughputControl`:

```sql
%sql
-- The database name and throughput value are placeholders; substitute your own.
-- The partition key path must be /groupId, and time to live (ttl) must be enabled.
CREATE TABLE IF NOT EXISTS cosmosCatalog.`database-v4`.ThroughputControl
USING cosmos.oltp
TBLPROPERTIES(partitionKeyPath = '/groupId', autoScaleMaxThroughput = '4000', defaultTtlInSeconds = '-1')
```

The preceding example creates a container with [autoscale](../provision-throughput-autoscale.md). If you prefer standard provisioning, you can replace `autoScaleMaxThroughput` with `manualThroughput`.
> [!IMPORTANT]
> The partition key must be defined as `/groupId`, and `ttl` must be enabled for the throughput control feature to work.

Within the Spark configuration of a specific application, you can then specify parameters for the workload. The following example enables throughput control and defines a throughput control group `name` parameter and a `targetThroughputThreshold` parameter. It also defines the `database` and `container` parameters in which the throughput control group is maintained:

```scala
// The group name and metadata database are placeholders; substitute your own values.
"spark.cosmos.throughputControl.enabled" -> "true",
"spark.cosmos.throughputControl.name" -> "SourceContainerThroughputControl",
"spark.cosmos.throughputControl.targetThroughputThreshold" -> "0.95",
"spark.cosmos.throughputControl.globalControl.database" -> "database-v4",
"spark.cosmos.throughputControl.globalControl.container" -> "ThroughputControl",
```
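
If you configure the connector from PySpark instead, the same settings can be passed as an ordinary Python dictionary. This is a sketch that reuses the placeholder group name and metadata database from the previous block:

```python
# Same placeholder values as the Scala example; substitute your own.
throughput_control_config = {
    "spark.cosmos.throughputControl.enabled": "true",
    "spark.cosmos.throughputControl.name": "SourceContainerThroughputControl",
    "spark.cosmos.throughputControl.targetThroughputThreshold": "0.95",
    "spark.cosmos.throughputControl.globalControl.database": "database-v4",
    "spark.cosmos.throughputControl.globalControl.container": "ThroughputControl",
}
```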

In the preceding examples, the `targetThroughputThreshold` parameter is defined as **0.95**. Rate limiting occurs (and requests are retried) when clients consume more than 95 percent (+/- 5-10 percent) of the throughput allocated to the container. This configuration is stored as a document in the throughput container.

Throughput control doesn't do RU precalculation of each operation. Instead, it tracks the RU usage *after* the operation, based on the response header. As such, throughput control is based on an approximation and *doesn't guarantee* that the target amount of throughput is available for the group at any given time.

For this reason, if the configured RU is so low that a single operation can use it all, throughput control can't prevent RU consumption from exceeding the configured limit. Throughput control works best when the configured limit is higher than any single operation that a client in the specific control group can execute.

When you read via query or change feed, you should configure the page size in `spark.cosmos.read.maxItemCount` (default 1000) to be a modest amount. In this way, the client throughput control can be recalculated with higher frequency and reflected more accurately at any specific time. When you use throughput control for a bulk write job, the number of documents executed in a single request is automatically tuned based on the throttling rate to allow throughput control to begin as early as possible.
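
As an illustration, a read that combines the account settings, the throughput control settings, and a smaller page size might look like the following sketch. It assumes `config` holds your account settings (endpoint, database, container) and that `throughput_control_config` is the dictionary sketched previously; neither name is required by the connector.

```python
# Sketch: combine account, throughput control, and page-size options on a read.
df = (
    spark.read.format("cosmos.oltp")
    .options(**config)                      # account endpoint, database, container
    .options(**throughput_control_config)   # throughput control group settings
    .option("spark.cosmos.read.maxItemCount", "100")  # smaller page than the default 1000
    .load()
)
```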
> [!WARNING]
> The `targetThroughputThreshold` parameter is *immutable*. If you change the target throughput threshold value, a new throughput control group is created. (If you use version 4.10.0 or later, it can have the same name.) You need to restart all Spark jobs that are using the group if you want to ensure that they all consume the new threshold immediately. Otherwise, they pick up the new threshold after the next restart.

For each Spark client that uses the throughput control group, a record is created in the `ThroughputControl` container, with a `ttl` of a few seconds. As a result, the documents vanish quickly if a Spark client isn't actively running anymore.

In each client record, the `loadFactor` attribute represents the load on the specific client.

## Related content

* See [Spark samples in GitHub](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples).
* Learn how to [manage data with Azure Cosmos DB Spark 3 OLTP connector for API for NoSQL](quickstart-spark.md).
* Learn more about [Apache Spark](https://spark.apache.org/).

In this tutorial, you use the Azure Cosmos DB Spark connector to read or write data from an Azure Cosmos DB for NoSQL account. This tutorial uses Azure Databricks and a Jupyter notebook to illustrate how to integrate with the API for NoSQL from Spark. This tutorial focuses on Python and Scala, although you can use any language or interface supported by Spark.

In this tutorial, you learn how to:

1. Use the workspace interface to search for **Maven** packages from **Maven Central** with a **Group ID** of `com.azure.cosmos.spark`. Install the package for Spark 3.4 with an **Artifact ID** prefixed with `azure-cosmos-spark_3-4` to the cluster.

1. Finally, create a new **notebook**.

> [!TIP]
> By default, the notebook is attached to the recently created cluster.

1. Within the notebook, set online transaction processing (OLTP) configuration settings for the NoSQL account endpoint, database name, and container name.
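
   A minimal sketch of such a configuration follows. The endpoint and key are placeholders, and the `cosmicworks` database name is illustrative; use your own account's URI, key, and the database and container names you create later in this tutorial.

   ```python
   # Placeholder account values; substitute your own.
   config = {
       "spark.cosmos.accountEndpoint": "https://<nosql-account-name>.documents.azure.com:443/",
       "spark.cosmos.accountKey": "<nosql-account-key>",
       "spark.cosmos.database": "cosmicworks",
       "spark.cosmos.container": "products",
   }
   ```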

Use the Catalog API to manage account resources such as databases and containers.

1. Create a new container named `products` by using `CREATE TABLE IF NOT EXISTS`. Ensure that you set the partition key path to `/category` and enable autoscale throughput with a maximum throughput of `1000` request units (RUs) per second.
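
   In Python, one way to run this statement through the Catalog API is with `spark.sql`. This sketch assumes a Cosmos DB catalog named `cosmosCatalog` was registered earlier in the notebook and uses the illustrative `cosmicworks` database name:

   ```python
   # Assumes a catalog named cosmosCatalog is already configured for the account.
   spark.sql(
       "CREATE TABLE IF NOT EXISTS cosmosCatalog.cosmicworks.products "
       "USING cosmos.oltp "
       "TBLPROPERTIES(partitionKeyPath = '/category', autoScaleMaxThroughput = '1000')"
   )
   ```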
1. Create another container named `employees` by using a hierarchical partition key configuration. Use `/organization`, `/department`, and `/team` as the set of partition key paths. Follow that specific order. Also, set the throughput to a manual amount of `400` RUs.
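
   A sketch of the same pattern for the hierarchical partition key follows, again assuming the `cosmosCatalog` catalog and the illustrative `cosmicworks` database; the paths are supplied in order as a comma-separated list.

   ```python
   # Hierarchical partition key paths, in order, as a comma-separated list.
   spark.sql(
       "CREATE TABLE IF NOT EXISTS cosmosCatalog.cosmicworks.employees "
       "USING cosmos.oltp "
       "TBLPROPERTIES(partitionKeyPath = '/organization,/department,/team', manualThroughput = '400')"
   )
   ```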

Create a sample dataset. Then use OLTP to ingest that data to the API for NoSQL container.

Load OLTP data into a data frame to perform common queries on the data. You can use various syntaxes to filter or query data.

1. Use `spark.read` to load the OLTP data into a data frame object. Use the same configuration you used earlier in this tutorial. Also, set `spark.cosmos.read.inferSchema.enabled` to `true` to allow the Spark connector to infer the schema by sampling existing items.
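
   For example, assuming the `config` dictionary sketched earlier in this tutorial:

   ```python
   # Load items from the container into a data frame, inferring the schema from sampled items.
   df = (
       spark.read.format("cosmos.oltp")
       .options(**config)
       .option("spark.cosmos.read.inferSchema.enabled", "true")
       .load()
   )
   ```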
1. Render the schema of the data loaded in the data frame by using `printSchema`.
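
   For example, assuming the data frame from the previous step is named `df`:

   ```python
   # Print the schema that the connector inferred from the sampled items.
   df.printSchema()
   ```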

When you work with API for NoSQL data in Spark, you can perform partial updates.

380
380
381
381
1. To perform a partial update of an item:

    1. Copy the existing `config` configuration variable and modify the properties in the new copy. Specifically, configure the write strategy to `ItemPatch`. Then disable bulk support. Set the columns and mapped operations. Finally, set the default operation type to `Set`.
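
       A sketch of such a patch configuration in Python follows. It assumes `config` is the dictionary used for earlier writes, the key names follow the connector's write and patch options, and the column name in the patch column config is illustrative.

       ```python
       # Copy the write configuration and switch it to partial (patch) updates.
       config_patch = dict(config)
       config_patch["spark.cosmos.write.strategy"] = "ItemPatch"             # patch instead of full overwrite
       config_patch["spark.cosmos.write.bulk.enabled"] = "false"             # disable bulk support
       config_patch["spark.cosmos.write.patch.defaultOperationType"] = "Set" # default operation type
       config_patch["spark.cosmos.write.patch.columnConfigs"] = "[col(name).op(set)]"  # illustrative column mapping
       ```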
0 commit comments