
Commit 3c0e51e

edit pass: three-spark-connector-articles
1 parent 313376d commit 3c0e51e

File tree: 3 files changed, +30 -30 lines changed

articles/cosmos-db/nosql/how-to-spark-service-principal.md

Lines changed: 5 additions & 5 deletions
@@ -14,7 +14,7 @@ zone_pivot_groups: programming-languages-spark-all-minus-sql-r-csharp
 
 # Use a service principal with the Spark 3 connector for Azure Cosmos DB for NoSQL
 
-In this article, you learn how to create a Microsoft Entra application and service principal that can be used with the role-based access control. You can then use this service principal to connect to an Azure Cosmos DB for NoSQL account from Spark 3.
+In this article, you learn how to create a Microsoft Entra application and service principal that can be used with role-based access control. You can then use this service principal to connect to an Azure Cosmos DB for NoSQL account from Spark 3.
 
 ## Prerequisites

@@ -117,14 +117,14 @@ Now that you've created a Microsoft Entra application and service principal, cre
 | --- | --- |
 | Runtime version | `13.3 LTS (Scala 2.12, Spark 3.4.1)` |
 
-1. Use the workspace interface to search for **Maven** packages from **Maven Central** with a **Group ID** of `com.azure.cosmos.spark`. Install the package specific for Spark 3.4 with an **Artifact ID** prefixed with `azure-cosmos-spark_3-4` to the cluster.
+1. Use the workspace interface to search for **Maven** packages from **Maven Central** with a **Group ID** of `com.azure.cosmos.spark`. Install the package specifically for Spark 3.4 with an **Artifact ID** prefixed with `azure-cosmos-spark_3-4` to the cluster.
 
 1. Finally, create a new **notebook**.
 
 > [!TIP]
-> By default, the notebook will be attached to the recently created cluster.
+> By default, the notebook is attached to the recently created cluster.
 
-1. Within the notebook, set Azure Cosmos DB Spark Connector configuration settings for the NoSQL account endpoint, database name, and container name. Use the **Subscription ID**, **Resource Group**, **Application (client) ID**, **Directory (tenant) ID**, and **Client Secret** values recorded earlier in this article.
+1. Within the notebook, set Azure Cosmos DB Spark connector configuration settings for the NoSQL account endpoint, database name, and container name. Use the **Subscription ID**, **Resource Group**, **Application (client) ID**, **Directory (tenant) ID**, and **Client Secret** values recorded earlier in this article.
 
 ::: zone pivot="programming-language-python"
 
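For reference, a minimal sketch of what this configuration step might look like in a Python notebook cell. The values are placeholders, and the exact `spark.cosmos.*` key names are assumptions to verify against the connector's configuration reference; the article's own code cells (elided from this diff) are authoritative.

```python
# Sketch only: placeholder values, and key names to verify against the
# azure-cosmos-spark configuration reference for the installed version.
config = {
    "spark.cosmos.accountEndpoint": "https://<nosql-account-name>.documents.azure.com:443/",
    "spark.cosmos.database": "<database-name>",
    "spark.cosmos.container": "<container-name>",
    "spark.cosmos.auth.type": "ServicePrincipal",
    "spark.cosmos.account.subscriptionId": "<subscription-id>",
    "spark.cosmos.account.resourceGroupName": "<resource-group-name>",
    "spark.cosmos.account.tenantId": "<directory-tenant-id>",
    "spark.cosmos.auth.aad.clientId": "<application-client-id>",
    "spark.cosmos.auth.aad.clientSecret": "<client-secret>",
}
```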
@@ -297,7 +297,7 @@ Now that you've created a Microsoft Entra application and service principal, cre
 ::: zone-end
 
 > [!TIP]
-> In this quickstart example, credentials are assigned to variables in clear text. For security, we recommend that you use secrets. For more information on configuring secrets, see [Add secrets to your Spark configuration](/azure/databricks/security/secrets/secrets#read-a-secret).
+> In this quickstart example, credentials are assigned to variables in clear text. For security, we recommend that you use secrets. For more information on how to configure secrets, see [Add secrets to your Spark configuration](/azure/databricks/security/secrets/secrets#read-a-secret).
 
 ## Related content
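As an illustration of the tip above, a notebook could read the client secret from a Databricks secret scope rather than a clear-text variable. The scope and key names here are hypothetical:

```python
# Hypothetical secret scope and key; create them beforehand with the
# Databricks CLI or workspace UI, then reference them instead of a literal.
client_secret = dbutils.secrets.get(scope="cosmos-spark", key="client-secret")
config["spark.cosmos.auth.aad.clientSecret"] = client_secret
```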

articles/cosmos-db/nosql/throughput-control-spark.md

Lines changed: 17 additions & 17 deletions
@@ -1,6 +1,6 @@
 ---
 title: 'Azure Cosmos DB Spark connector: Throughput control'
-description: In this article, you learn how to control throughput for bulk data movements in the Azure Cosmos DB Spark connector.
+description: Learn how you can control throughput for bulk data movements in the Azure Cosmos DB Spark connector.
 author: TheovanKraay
 ms.service: cosmos-db
 ms.subservice: nosql
@@ -15,26 +15,26 @@ ms.author: thvankra
 
 The [Spark connector](quickstart-spark.md) allows you to communicate with Azure Cosmos DB by using [Apache Spark](https://spark.apache.org/). This article describes how the throughput control feature works. Check out our [Spark samples in GitHub](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples) to get started using throughput control.
 
-This article documents the use of global throughput control groups in the Azure Cosmos DB Spark connector, but the functionality is also available in the [Java SDK](./sdk-java-v4.md). In the SDK, you can use both global and local throughput control groups to limit the RU consumption in the context of a single client connection instance. For example, you can apply this to different operations within a single microservice, or maybe to a single data loading program. For more information, see how to [use throughput control](quickstart-java.md) in the Java SDK.
+This article documents the use of global throughput control groups in the Azure Cosmos DB Spark connector, but the functionality is also available in the [Java SDK](./sdk-java-v4.md). In the SDK, you can use global and local throughput control groups to limit the request unit (RU) consumption in the context of a single client connection instance. For example, you can apply this approach to different operations within a single microservice, or maybe to a single data loading program. For more information, see how to [use throughput control](quickstart-java.md) in the Java SDK.
 
 > [!WARNING]
-> Throughput control isn't supported for gateway mode. Currently, for [serverless Azure Cosmos DB accounts](../serverless.md), attempting to use `targetThroughputThreshold` to define a percentage results in failure. You can only provide an absolute value for target throughput/RU by using `spark.cosmos.throughputControl.targetThroughput`.
+> Throughput control isn't supported for gateway mode. Currently, for [serverless Azure Cosmos DB accounts](../serverless.md), attempting to use `targetThroughputThreshold` to define a percentage results in failure. You can only provide an absolute value for target throughput/RU by using `spark.cosmos.throughputControl.targetThroughput`.
 
 ## Why is throughput control important?
 
-Throughput control helps to isolate the performance needs of applications that run against a container by limiting the amount of [request units](../request-units.md) (RUs) that can be consumed by a specific Spark client.
+Throughput control helps to isolate the performance needs of applications that run against a container. Throughput control limits the amount of [RUs](../request-units.md) that a specific Spark client can consume.
 
 Several advanced scenarios benefit from client-side throughput control:
 
 - **Different operations and tasks have different priorities:** There can be a need to prevent normal transactions from being throttled because of data ingestion or copy activities. Some operations or tasks aren't sensitive to latency and are more tolerant to being throttled than others.
-- **Provide fairness/isolation to different users or tenants:** An application will usually have many users. Some users might send too many requests, which consume all available throughput and cause others to get throttled.
+- **Provide fairness/isolation to different users or tenants:** An application usually has many users. Some users might send too many requests, which consume all available throughput and cause others to get throttled.
 - **Load balancing of throughput between different Azure Cosmos DB clients:** In some use cases, it's important to make sure all the clients get a fair (equal) share of the throughput.
 
 Throughput control enables the capability for more granular-level RU rate limiting, as needed.
 
 ## How does throughput control work?
 
-Throughput control for the Spark connector is configured by first creating a container that defines throughput control metadata, with a partition key of `groupId`, and `ttl` enabled. Here, we create this container by using Spark SQL and call it `ThroughputControl`:
+To configure throughput control for the Spark connector, you first create a container that defines throughput control metadata. The partition key is `groupId` and `ttl` is enabled. Here, you create this container by using Spark SQL and call it `ThroughputControl`:
 
 ```sql
 %sql
@@ -47,9 +47,9 @@ Throughput control for the Spark connector is configured by first creating a con
 The preceding example creates a container with [autoscale](../provision-throughput-autoscale.md). If you prefer standard provisioning, you can replace `autoScaleMaxThroughput` with `manualThroughput`.
 
 > [!IMPORTANT]
-> The partition key must be defined as `/groupId`, and `ttl` must be enabled for the throughput control feature to work.
+> The partition key must be defined as `/groupId` and `ttl` must be enabled for the throughput control feature to work.
 
-Within the Spark configuration of a specific application, we can then specify parameters for our workload. The following example sets throughput control as `enabled` and defines a throughput control group `name` parameter and a `targetThroughputThreshold` parameter. We also define the `database` and `container` parameters in which the throughput control group is maintained:
+Within the Spark configuration of a specific application, you can then specify parameters for the workload. The following example sets throughput control as `enabled`. The example defines a throughput control group `name` parameter and a `targetThroughputThreshold` parameter. You also define the `database` and `container` parameters in which the throughput control group is maintained:
 
 ```scala
 "spark.cosmos.throughputControl.enabled" -> "true",
@@ -59,7 +59,7 @@ Within the Spark configuration of a specific application, we can then specify pa
 "spark.cosmos.throughputControl.globalControl.container" -> "ThroughputControl"
 ```
 
-In the preceding example, the `targetThroughputThreshold` parameter is defined as **0.95**. Rate limiting occurs (and requests are retried) when clients consume more than 95 percent (+/- 5-10 percent) of the throughput that's allocated to the container. This configuration is stored as a document in the throughput container, which looks like this example:
+In the preceding example, the `targetThroughputThreshold` parameter is defined as **0.95**. Rate limiting occurs (and requests are retried) when clients consume more than 95 percent (+/- 5-10 percent) of the throughput allocated to the container. This configuration is stored as a document in the throughput container, which looks like this example:
 
 ```json
 {
@@ -76,16 +76,16 @@ In the preceding example, the `targetThroughputThreshold` parameter is defined a
 }
 ```
 
-Throughput control doesn't do RU pre-calculation of each operation. Instead, it tracks the RU usages *after* the operation based on the response header. As such, throughput control is based on an approximation and *doesn't guarantee* that amount of throughput is available for the group at any certain time.
+Throughput control doesn't do RU precalculation of each operation. Instead, it tracks the RU usages *after* the operation based on the response header. As such, throughput control is based on an approximation and *doesn't guarantee* that amount of throughput is available for the group at any certain time.
 
-For this reason, if the configured RU is so low that a single operation can use it all, throughput control can't avoid the RU exceeding the configured limit. Therefore, throughput control works best when the configured limit is higher than any single operation that can be executed by a client in the specific control group.
+For this reason, if the configured RU is so low that a single operation can use it all, throughput control can't avoid the RU exceeding the configured limit. Throughput control works best when the configured limit is higher than any single operation that a client in the specific control group can execute.
 
-When you read via query or change feed, you should configure the page size in `spark.cosmos.read.maxItemCount` (default 1000) to be a modest amount so that client throughput control can be recalculated with higher frequency, and therefore reflected more accurately at any specific time. However, when you use throughput control for a write job using bulk, the number of documents executed in a single request are automatically tuned based on the throttling rate to allow the throughput control to kick in as early as possible.
+When you read via query or change feed, you should configure the page size in `spark.cosmos.read.maxItemCount` (default 1000) to be a modest amount. In this way, the client throughput control can be recalculated with higher frequency and reflected more accurately at any specific time. When you use throughput control for a write job using bulk, the number of documents executed in a single request is automatically tuned based on the throttling rate to allow the throughput control to begin as early as possible.
 
 > [!WARNING]
-> The `targetThroughputThreshold` parameter is *immutable*. If you change the target throughput threshold value, a new throughput control group is created. (As long as you use Version 4.10.0 or later, it can have the same name.) You need to restart all Spark jobs that are using the group if you want to ensure that they all consume the new threshold immediately. Otherwise, they pick up the new threshold after the next restart.
+> The `targetThroughputThreshold` parameter is *immutable*. If you change the target throughput threshold value, a new throughput control group is created. (If you use version 4.10.0 or later, it can have the same name.) You need to restart all Spark jobs that are using the group if you want to ensure that they all consume the new threshold immediately. Otherwise, they pick up the new threshold after the next restart.
 
-For each Spark client that uses the throughput control group, a record is created in the `ThroughputControl` container, with a ttl of a few seconds. As a result, the documents vanish pretty quickly if a Spark client isn't actively running anymore. Here's an example:
+For each Spark client that uses the throughput control group, a record is created in the `ThroughputControl` container, with a `ttl` of a few seconds. As a result, the documents vanish quickly if a Spark client isn't actively running anymore. Here's an example:
 
 ```json
 {
@@ -108,6 +108,6 @@ In each client record, the `loadFactor` attribute represents the load on the spe
 
 ## Related content
 
-* [Spark samples in GitHub](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples)
-* [Manage data with Azure Cosmos DB Spark 3 OLTP connector for API for NoSQL](quickstart-spark.md).
-* Learn more about [Apache Spark](https://spark.apache.org/)
+* See [Spark samples in GitHub](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples).
+* Learn how to [manage data with Azure Cosmos DB Spark 3 OLTP connector for API for NoSQL](quickstart-spark.md).
+* Learn more about [Apache Spark](https://spark.apache.org/).
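For reference, the throughput control settings described in this article might sit alongside an ordinary connector configuration roughly as follows. This is a sketch with placeholder values; confirm each `spark.cosmos.*` key against the connector documentation and the article's own Scala snippet.

```python
# Sketch only: throughput control options merged into a connector configuration.
# The ThroughputControl container must already exist with partition key /groupId
# and ttl enabled, as described earlier in the article.
config = {
    "spark.cosmos.accountEndpoint": "https://<nosql-account-name>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<account-key>",
    "spark.cosmos.database": "<database-name>",
    "spark.cosmos.container": "<container-name>",
    "spark.cosmos.throughputControl.enabled": "true",
    "spark.cosmos.throughputControl.name": "<throughput-control-group-name>",
    "spark.cosmos.throughputControl.targetThroughputThreshold": "0.95",
    "spark.cosmos.throughputControl.globalControl.database": "<database-name>",
    "spark.cosmos.throughputControl.globalControl.container": "ThroughputControl",
    # Keep read page sizes modest so client throughput control is recalculated often.
    "spark.cosmos.read.maxItemCount": "1000",
}
# A DataFrame write would then pass these options, for example:
# df.write.format("cosmos.oltp").options(**config).mode("APPEND").save()
```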

articles/cosmos-db/nosql/tutorial-spark-connector.md

Lines changed: 8 additions & 8 deletions
@@ -17,7 +17,7 @@ zone_pivot_groups: programming-languages-spark-all-minus-sql-r-csharp
 
 [!INCLUDE[NoSQL](../includes/appliesto-nosql.md)]
 
-In this tutorial, you use the Azure Cosmos DB Spark connector to read or write data from an Azure Cosmos DB for NoSQL account. This tutorial uses Azure Databricks and a Jupyter notebook to illustrate how to integrate with the API for NoSQL from Spark. This tutorial focuses on Python and Scala even though you can use any language or interface supported by Spark.
+In this tutorial, you use the Azure Cosmos DB Spark connector to read or write data from an Azure Cosmos DB for NoSQL account. This tutorial uses Azure Databricks and a Jupyter notebook to illustrate how to integrate with the API for NoSQL from Spark. This tutorial focuses on Python and Scala, although you can use any language or interface supported by Spark.
 
 In this tutorial, you learn how to:

@@ -47,12 +47,12 @@ Use your existing Azure Databricks workspace to create a compute cluster ready t
 | --- | --- |
 | Runtime version | 13.3 LTS (Scala 2.12, Spark 3.4.1) |
 
-1. Use the workspace interface to search for **Maven** packages from **Maven Central** with a **Group ID** of `com.azure.cosmos.spark`. Install the package specific for Spark 3.4 with an **Artifact ID** prefixed with `azure-cosmos-spark_3-4` to the cluster.
+1. Use the workspace interface to search for **Maven** packages from **Maven Central** with a **Group ID** of `com.azure.cosmos.spark`. Install the package specifically for Spark 3.4 with an **Artifact ID** prefixed with `azure-cosmos-spark_3-4` to the cluster.
 
 1. Finally, create a new **notebook**.
 
 > [!TIP]
-> By default, the notebook will be attached to the recently created cluster.
+> By default, the notebook is attached to the recently created cluster.
 
 1. Within the notebook, set online transaction processing (OLTP) configuration settings for the NoSQL account endpoint, database name, and container name.
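As a rough sketch, the OLTP configuration that this step describes might look like the following in Python. The values are placeholders; the tutorial's own cells (elided from this diff) are authoritative.

```python
# Sketch only: placeholder endpoint, key, database, and container values.
config = {
    "spark.cosmos.accountEndpoint": "https://<nosql-account-name>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<nosql-account-key>",
    "spark.cosmos.database": "<database-name>",
    "spark.cosmos.container": "<container-name>",
}
```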

@@ -132,7 +132,7 @@ Use the Catalog API to manage account resources such as databases and containers
 
 ::: zone-end
 
-1. Create a new container named `products` by using `CREATE TABLE IF NOT EXISTS`. Ensure that you set the partition key path to `/category` and enable autoscale throughput with a maximum throughput of `1000` request units per second (RUs).
+1. Create a new container named `products` by using `CREATE TABLE IF NOT EXISTS`. Ensure that you set the partition key path to `/category` and enable autoscale throughput with a maximum throughput of `1000` request units (RUs) per second.
 
 ::: zone pivot="programming-language-python"
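A hedged sketch of this step from Python, assuming a Spark catalog named `cosmosCatalog` was registered for the account earlier in the tutorial; the `TBLPROPERTIES` keys shown are assumptions to verify against the article's own cells.

```python
# Assumes the Cosmos DB catalog was registered earlier, for example:
#   spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
#   spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", "<endpoint>")
#   spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey", "<key>")
spark.sql(
    "CREATE TABLE IF NOT EXISTS cosmosCatalog.`<database-name>`.products "
    "USING cosmos.oltp "
    "TBLPROPERTIES(partitionKeyPath = '/category', autoScaleMaxThroughput = '1000')"
)
```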

@@ -152,7 +152,7 @@ Use the Catalog API to manage account resources such as databases and containers
 
 ::: zone-end
 
-1. Create another container named `employees` by using a hierarchical partition key configuration with `/organization`, `/department`, and `/team` as the set of partition key paths. Follow that specific order. Also, set the throughput to a manual amount of `400` RUs.
+1. Create another container named `employees` by using a hierarchical partition key configuration. Use `/organization`, `/department`, and `/team` as the set of partition key paths. Follow that specific order. Also, set the throughput to a manual amount of `400` RUs.
 
 ::: zone pivot="programming-language-python"
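A sketch of the hierarchical partition key step under the same assumptions. The comma-separated `partitionKeyPath` form and the `manualThroughput` key are assumptions to confirm against the tutorial's own cells:

```python
# Assumption: hierarchical partition keys are expressed as an ordered,
# comma-separated partitionKeyPath; verify against the connector documentation.
spark.sql(
    "CREATE TABLE IF NOT EXISTS cosmosCatalog.`<database-name>`.employees "
    "USING cosmos.oltp "
    "TBLPROPERTIES(partitionKeyPath = '/organization,/department,/team', manualThroughput = '400')"
)
```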

@@ -240,7 +240,7 @@ Create a sample dataset. Then use OLTP to ingest that data to the API for NoSQL
 
 Load OLTP data into a data frame to perform common queries on the data. You can use various syntaxes to filter or query data.
 
-1. Use `spark.read` to load the OLTP data into a dataframe object. Use the same configuration you used earlier in this tutorial. Also, set `spark.cosmos.read.inferSchema.enabled` to `true` to allow the Spark connector to infer the schema by sampling existing items.
+1. Use `spark.read` to load the OLTP data into a data-frame object. Use the same configuration you used earlier in this tutorial. Also, set `spark.cosmos.read.inferSchema.enabled` to `true` to allow the Spark connector to infer the schema by sampling existing items.
 
 ::: zone pivot="programming-language-python"
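For illustration, the load described in this step might look like this in Python, reusing the `config` dictionary assumed earlier; the tutorial's own cells are authoritative.

```python
# Load container items into a DataFrame and let the connector infer a schema
# by sampling existing items.
df = (
    spark.read.format("cosmos.oltp")
    .options(**config)
    .option("spark.cosmos.read.inferSchema.enabled", "true")
    .load()
)
df.printSchema()  # renders the inferred schema, as the next step describes
```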

@@ -266,7 +266,7 @@ Load OLTP data into a data frame to perform common queries on the data. You can
 
 ::: zone-end
 
-1. Render the schema of the data loaded in the dataframe by using `printSchema`.
+1. Render the schema of the data loaded in the data frame by using `printSchema`.
 
 ::: zone pivot="programming-language-python"

@@ -380,7 +380,7 @@ When you work with API for NoSQL data in Spark, you can perform partial updates
 
 1. To perform a partial update of an item:
 
-    1. Copy the existing `config` configuration variable and modify the properties in the new copy. Specifically, configure the write strategy to `ItemPatch`, disable bulk support, set the columns and mapped operations, and finally set the default operation type to `Set`.
+    1. Copy the existing `config` configuration variable and modify the properties in the new copy. Specifically, configure the write strategy to `ItemPatch`. Then disable bulk support. Set the columns and mapped operations. Finally, set the default operation type to `Set`.
 
 ::: zone pivot="programming-language-python"
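A hedged sketch of the patch configuration this step describes. The key names and the column-config string are assumptions to verify against the connector's partial document update documentation:

```python
# Start from a copy of the earlier configuration (assumed to exist as `config`),
# then switch the write strategy to partial updates (patch).
patch_config = dict(config)
patch_config["spark.cosmos.write.strategy"] = "ItemPatch"             # patch instead of full upsert
patch_config["spark.cosmos.write.bulk.enabled"] = "false"             # disable bulk support
patch_config["spark.cosmos.write.patch.defaultOperationType"] = "Set" # default operation type
# Assumed column-config syntax; confirm the exact format in the connector docs.
patch_config["spark.cosmos.write.patch.columnConfigs"] = "[col(name).op(set)]"
```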
