This tutorial is a quick start guide to show how to use the Cosmos DB Spark Connector to read from or write to Cosmos DB. The Cosmos DB Spark Connector supports Spark 3.1.x and 3.2.x. Without a credit card or an Azure subscription, you can set up a free [Try Azure Cosmos DB account](https://aka.ms/trycosmosdb).
Throughout this quick tutorial, we rely on [Azure Databricks Runtime 10.4 with Spark 3.2.1](/azure/databricks/release-notes/runtime/10.4) and a Jupyter Notebook to show how to use the Cosmos DB Spark Connector.
You can use any other Spark offering as well (for example, Spark 3.1.1). You should also be able to use any language supported by Spark (PySpark, Scala, Java, and so on), or any Spark interface you're familiar with (Jupyter Notebook, Livy, and so on).
## Prerequisites
* An active Azure account. If you don't have one, you can sign up for a [free account](https://azure.microsoft.com/try/cosmosdb/). Alternatively, you can use the [Azure Cosmos DB Emulator](../local-emulator.md) for development and testing.
* [Azure Databricks](/azure/databricks/release-notes/runtime/10.4) runtime 10.4 with Spark 3.2.1.
* (Optional) [SLF4J binding](https://www.slf4j.org/manual.html) is used to associate a specific logging framework with SLF4J.
SLF4J is only needed if you plan to use logging. In that case, also download an SLF4J binding, which links the SLF4J API with the logging implementation of your choice. For more information, see the [SLF4J user manual](https://www.slf4j.org/manual.html).
Install the Cosmos DB Spark Connector in your Spark cluster [using the latest version for Spark 3.2.x](https://aka.ms/azure-cosmos-spark-3-2-download).
The getting started guide is based on PySpark and Scala; you can run the following code snippets in an Azure Databricks PySpark or Scala notebook.
## Create databases and containers
First, set the Cosmos DB account credentials, and the Cosmos DB database name and container name.
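
As a minimal sketch of that setup (the endpoint, key, and names below are placeholders to replace with your own values; the `cfg` dictionary is reused by the snippets later in this guide):

```python
cosmosEndpoint = "https://REPLACEME.documents.azure.com:443/"  # placeholder account endpoint
cosmosMasterKey = "REPLACEME"                                  # placeholder account key
cosmosDatabaseName = "sampleDB"
cosmosContainerName = "sampleContainer"

# Options dictionary used by the read and write examples below
cfg = {
    "spark.cosmos.accountEndpoint": cosmosEndpoint,
    "spark.cosmos.accountKey": cosmosMasterKey,
    "spark.cosmos.database": cosmosDatabaseName,
    "spark.cosmos.container": cosmosContainerName,
}

# Register the Catalog API so Spark SQL can create Cosmos DB databases and containers
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", cosmosEndpoint)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey", cosmosMasterKey)
```
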
Then, create the database and container through the Catalog API:

```python
# create a cosmos database using catalog api
spark.sql("CREATE DATABASE IF NOT EXISTS cosmosCatalog.{};".format(cosmosDatabaseName))

# create a cosmos container using catalog api
spark.sql("CREATE TABLE IF NOT EXISTS cosmosCatalog.{}.{} using cosmos.oltp TBLPROPERTIES(partitionKeyPath = '/id', manualThroughput = '1100')".format(cosmosDatabaseName, cosmosContainerName))
```

The equivalent Scala version:

```scala
// create a cosmos database using catalog api
spark.sql(s"CREATE DATABASE IF NOT EXISTS cosmosCatalog.${cosmosDatabaseName};")

// create a cosmos container using catalog api
spark.sql(s"CREATE TABLE IF NOT EXISTS cosmosCatalog.${cosmosDatabaseName}.${cosmosContainerName} using cosmos.oltp TBLPROPERTIES(partitionKeyPath = '/id', manualThroughput = '1100')")
```

When creating containers with the Catalog API, you can set the throughput and [partition key path](../partitioning-overview.md#choose-partitionkey) for the container to be created.
For more information, see the full [Catalog API](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/docs/catalog-api.md) documentation.

## Ingest data

The name of the data source is `cosmos.oltp`. The following example shows how you can write an in-memory dataframe consisting of two items to Cosmos DB.
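
As a minimal PySpark sketch (assuming the `cfg` dictionary from the configuration step above; the columns `id`, `name`, `age`, and `isAlive` are illustrative):

```python
# Build a small in-memory dataframe of two items and append them to the container
spark.createDataFrame((("cat-alive", "Schrodinger cat", 2, True),
                       ("cat-dead", "Schrodinger cat", 2, False))) \
    .toDF("id", "name", "age", "isAlive") \
    .write \
    .format("cosmos.oltp") \
    .options(**cfg) \
    .mode("APPEND") \
    .save()
```
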
Note that `id` is a mandatory field for Cosmos DB.
For more information related to ingesting data, see the full [write configuration](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/docs/configuration-reference.md#write-config) documentation.

## Query data

Using the same `cosmos.oltp` data source, we can query data and use `filter` to push down filters.
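
For example, a sketch reusing the `cfg` dictionary and the illustrative `isAlive` column from the write example:

```python
from pyspark.sql.functions import col

# Read from the container; the equality filter on isAlive is pushed down to Cosmos DB
df = spark.read.format("cosmos.oltp").options(**cfg) \
    .option("spark.cosmos.read.inferSchema.enabled", "true") \
    .load()

df.filter(col("isAlive") == True).show()
```
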
For more information related to querying data, see the full [query configuration](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/docs/configuration-reference.md#query-config) documentation.
## Partial document update using Patch
Using the same `cosmos.oltp` data source, we can perform partial document updates in Cosmos DB by using the Patch API.
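
The sketch below shows one way to do this in PySpark, assuming the container holds the illustrative `cat-alive` document from the write example; the patch-specific settings (`spark.cosmos.write.strategy` set to `ItemPatch`, `spark.cosmos.write.patch.defaultOperationType`, and `spark.cosmos.write.patch.columnConfigs`) are described in the write configuration reference:

```python
from pyspark.sql.functions import col, lit

# Patch-specific write options: ItemPatch strategy with a "set" operation on the name column
cfgPatch = dict(cfg)
cfgPatch.update({
    "spark.cosmos.write.strategy": "ItemPatch",
    "spark.cosmos.write.bulk.enabled": "false",
    "spark.cosmos.write.patch.defaultOperationType": "Set",
    "spark.cosmos.write.patch.columnConfigs": "[col(name).op(set)]",
})

# Read one document, change a single column, and write it back as a partial update
df = spark.read.format("cosmos.oltp").options(**cfg).load() \
    .filter(col("id") == "cat-alive") \
    .withColumn("name", lit("Schrodinger cat (patched)"))

df.write.format("cosmos.oltp").mode("APPEND").options(**cfgPatch).save()
```
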
For more samples related to partial document update, see the GitHub code samples [Patch Sample (Python)](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples/Python/patch-sample.py) and [Patch Sample (Scala)](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples/Scala/PatchSample.scala).
## Schema inference
When querying data, the Spark Connector can infer the schema based on sampling existing items by setting `spark.cosmos.read.inferSchema.enabled` to `true`.
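
For instance, a sketch reusing the `cfg` dictionary from earlier:

```python
# Sample existing items to infer the container's schema, then print it
df = spark.read.format("cosmos.oltp").options(**cfg) \
    .option("spark.cosmos.read.inferSchema.enabled", "true") \
    .load()

df.printSchema()
```
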
For more information related to schema inference, see the full [schema inference configuration](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/docs/configuration-reference.md#schema-inference-config) documentation.