This tutorial is a quick start guide to show how to use the Cosmos DB Spark Connector to read from or write to Cosmos DB. The Cosmos DB Spark Connector supports Spark 3.1.x and 3.2.x. Without a credit card or an Azure subscription, you can set up a free [Try Azure Cosmos DB account](https://aka.ms/trycosmosdb).
Throughout this quick tutorial, we rely on [Azure Databricks Runtime 10.4 with Spark 3.2.1](/azure/databricks/release-notes/runtime/10.4) and a Jupyter Notebook to show how to use the Cosmos DB Spark Connector.
You can use any other Spark offering as well (for example, Spark 3.1.1). You should also be able to use any language supported by Spark (PySpark, Scala, Java, and so on), or any Spark interface you're familiar with (Jupyter Notebook, Livy, and so on).
## Prerequisites
* An active Azure account. If you don't have one, you can sign up for a [free account](https://azure.microsoft.com/try/cosmosdb/). Alternatively, you can use the [Azure Cosmos DB Emulator](../local-emulator.md) for development and testing.
* [Azure Databricks](/azure/databricks/release-notes/runtime/10.4) runtime 10.4 with Spark 3.2.1.
* (Optional) [SLF4J binding](https://www.slf4j.org/manual.html) is used to associate a specific logging framework with SLF4J.
SLF4J is only needed if you plan to use logging. In that case, also download an SLF4J binding, which links the SLF4J API with the logging implementation of your choice. For more information, see the [SLF4J user manual](https://www.slf4j.org/manual.html).
Install the Cosmos DB Spark Connector in your Spark cluster [using the latest version for Spark 3.2.x](https://aka.ms/azure-cosmos-spark-3-2-download).
The getting started guide is based on PySpark and Scala; you can run the following code snippets in an Azure Databricks PySpark or Scala notebook.
## Create databases and containers
First, set the Cosmos DB account credentials, and the Cosmos DB database name and container name.
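
As a minimal sketch of that setup (the endpoint, key, and names below are placeholders to replace with your own values; the `cfg` dictionary is reused by the snippets later in this guide):

```python
cosmosEndpoint = "https://REPLACEME.documents.azure.com:443/"  # placeholder account endpoint
cosmosMasterKey = "REPLACEME"                                  # placeholder account key
cosmosDatabaseName = "sampleDB"
cosmosContainerName = "sampleContainer"

# Options dictionary used by the read and write examples below
cfg = {
    "spark.cosmos.accountEndpoint": cosmosEndpoint,
    "spark.cosmos.accountKey": cosmosMasterKey,
    "spark.cosmos.database": cosmosDatabaseName,
    "spark.cosmos.container": cosmosContainerName,
}

# Register the Catalog API so Spark SQL can create Cosmos DB databases and containers
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", cosmosEndpoint)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey", cosmosMasterKey)
```
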
Then, create the database and container through the Catalog API:

```python
# create a cosmos database using catalog api
spark.sql("CREATE DATABASE IF NOT EXISTS cosmosCatalog.{};".format(cosmosDatabaseName))

# create a cosmos container using catalog api
spark.sql("CREATE TABLE IF NOT EXISTS cosmosCatalog.{}.{} using cosmos.oltp TBLPROPERTIES(partitionKeyPath = '/id', manualThroughput = '1100')".format(cosmosDatabaseName, cosmosContainerName))
```

The equivalent Scala version:

```scala
// create a cosmos database using catalog api
spark.sql(s"CREATE DATABASE IF NOT EXISTS cosmosCatalog.${cosmosDatabaseName};")

// create a cosmos container using catalog api
spark.sql(s"CREATE TABLE IF NOT EXISTS cosmosCatalog.${cosmosDatabaseName}.${cosmosContainerName} using cosmos.oltp TBLPROPERTIES(partitionKeyPath = '/id', manualThroughput = '1100')")
```

When creating containers with the Catalog API, you can set the throughput and [partition key path](../partitioning-overview.md#choose-partitionkey) for the container to be created.
For more information, see the full [Catalog API](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/docs/catalog-api.md) documentation.

## Ingest data

The name of the data source is `cosmos.oltp`. The following example shows how you can write an in-memory dataframe consisting of two items to Cosmos DB.
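
As a minimal PySpark sketch (assuming the `cfg` dictionary from the configuration step above; the columns `id`, `name`, `age`, and `isAlive` are illustrative):

```python
# Build a small in-memory dataframe of two items and append them to the container
spark.createDataFrame((("cat-alive", "Schrodinger cat", 2, True),
                       ("cat-dead", "Schrodinger cat", 2, False))) \
    .toDF("id", "name", "age", "isAlive") \
    .write \
    .format("cosmos.oltp") \
    .options(**cfg) \
    .mode("APPEND") \
    .save()
```
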
Note that `id` is a mandatory field for Cosmos DB.
For more information related to ingesting data, see the full [write configuration](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/docs/configuration-reference.md#write-config) documentation.

## Query data

Using the same `cosmos.oltp` data source, we can query data and use `filter` to push down filters.
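
For example, a sketch reusing the `cfg` dictionary and the illustrative `isAlive` column from the write example:

```python
from pyspark.sql.functions import col

# Read from the container; the equality filter on isAlive is pushed down to Cosmos DB
df = spark.read.format("cosmos.oltp").options(**cfg) \
    .option("spark.cosmos.read.inferSchema.enabled", "true") \
    .load()

df.filter(col("isAlive") == True).show()
```
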
For more information related to querying data, see the full [query configuration](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/docs/configuration-reference.md#query-config) documentation.
## Partial document update using Patch
Using the same `cosmos.oltp` data source, we can perform partial document updates in Cosmos DB by using the Patch API.
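
The sketch below shows one way to do this in PySpark, assuming the container holds the illustrative `cat-alive` document from the write example; the patch-specific settings (`spark.cosmos.write.strategy` set to `ItemPatch`, `spark.cosmos.write.patch.defaultOperationType`, and `spark.cosmos.write.patch.columnConfigs`) are described in the write configuration reference:

```python
from pyspark.sql.functions import col, lit

# Patch-specific write options: ItemPatch strategy with a "set" operation on the name column
cfgPatch = dict(cfg)
cfgPatch.update({
    "spark.cosmos.write.strategy": "ItemPatch",
    "spark.cosmos.write.bulk.enabled": "false",
    "spark.cosmos.write.patch.defaultOperationType": "Set",
    "spark.cosmos.write.patch.columnConfigs": "[col(name).op(set)]",
})

# Read one document, change a single column, and write it back as a partial update
df = spark.read.format("cosmos.oltp").options(**cfg).load() \
    .filter(col("id") == "cat-alive") \
    .withColumn("name", lit("Schrodinger cat (patched)"))

df.write.format("cosmos.oltp").mode("APPEND").options(**cfgPatch).save()
```
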
For more samples related to partial document update, see the GitHub code samples [Patch Sample (Python)](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples/Python/patch-sample.py) and [Patch Sample (Scala)](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples/Scala/PatchSample.scala).
## Schema inference
When querying data, the Spark Connector can infer the schema based on sampling existing items by setting `spark.cosmos.read.inferSchema.enabled` to `true`.
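
For instance, a sketch reusing the `cfg` dictionary from earlier:

```python
# Sample existing items to infer the container's schema, then print it
df = spark.read.format("cosmos.oltp").options(**cfg) \
    .option("spark.cosmos.read.inferSchema.enabled", "true") \
    .load()

df.printSchema()
```
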
For more information related to schema inference, see the full [schema inference configuration](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/docs/configuration-reference.md#schema-inference-config) documentation.