
Commit 4d98cda

Modified connect-from-databricks.md for Acrolinx score
1 parent 62e85eb commit 4d98cda

File tree

1 file changed: +24 -24 lines changed


articles/cosmos-db/mongodb/vcore/connect-from-databricks.md

Lines changed: 24 additions & 24 deletions
@@ -13,7 +13,7 @@ ms.date: 03/08/2024
1313
# Connect to Azure Cosmos DB for MongoDB vCore from Azure Databricks
1414
[!INCLUDE[MongoDB vCore](./introduction.md)]
1515

16-
This article walks you through connecting Azure Cosmos DB for MongoDB vCore using Spark connector for Databricks. It walks through basic basic Data Manipulation Language(DML) operations like Read, Write, Create Views or Temporary Tables, Filtering and Running Aggregations using python code.
16+
This article explains how to connect to Azure Cosmos DB for MongoDB vCore from Azure Databricks by using the Spark connector. It walks through basic Data Manipulation Language (DML) operations like reading, writing, creating views or temporary tables, filtering, and running aggregations by using Python code.
1717

1818
## Prerequisites
1919
* [Provision an Azure Cosmos DB for MongoDB vCore cluster.](quickstart-portal.md)
@@ -22,7 +22,7 @@ This article walks you through connecting Azure Cosmos DB for MongoDB vCore usin
2222

2323
## Dependencies for connectivity
2424
* **Spark connector for MongoDB vCore:**
25-
Spark connector is used to connect to Azure Cosmos DB for MongoDB Atlas. Identify and use the version of the connector located in [Maven central](hhttps://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector) that is compatible with the Spark and Scala versions of your Spark environment. We recommend an environment that supports Spark 3.2.1 or higher, and the spark connector available at maven coordinates `org.mongodb.spark:mongo-spark-connector_2.12:3.0.1`.
25+
The Spark connector is used to connect to Azure Cosmos DB for MongoDB vCore. Identify and use the version of the connector located in [Maven Central](https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector) that is compatible with the Spark and Scala versions of your Spark environment. We recommend an environment that supports Spark 3.2.1 or later, and the Spark connector available at the Maven coordinates `org.mongodb.spark:mongo-spark-connector_2.12:3.0.1`. A minimal install sketch follows this list.
2626

2727
* **Azure Cosmos DB for MongoDB connection strings:** Your Azure Cosmos DB for MongoDB vCore connection string, username, and password.
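For the connector dependency above, here's a minimal sketch of pinning that Maven coordinate for a local, non-Databricks test session through the standard `spark.jars.packages` setting; on Databricks itself, you would normally attach the same coordinate through the cluster's **Libraries** tab instead (the app name below is hypothetical):

```python
# Minimal sketch for a local, non-Databricks test session.
# On Databricks, install the same Maven coordinate from the cluster's Libraries tab instead.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cosmos-mongo-vcore-connector-check")  # hypothetical app name
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1")
    .getOrCreate()
)
```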
2828

@@ -48,13 +48,13 @@ After that, you may create a Scala or Python notebook for migration.
4848

4949
## Create a Python notebook to connect to Azure Cosmos DB for MongoDB vCore
5050

51-
Create a Python Notebook in Databricks. Make sure to enter the right values for the variables before running the following codes
51+
Create a Python notebook in Databricks. Make sure to enter the right values for the variables before running the following code.
5252

5353
### Update Spark configuration with the Azure Cosmos DB for MongoDB connection string
5454

5555
1. Note the connection string under **Settings** -> **Connection strings** in the Azure Cosmos DB for MongoDB vCore resource in the Azure portal. It has the form "mongodb+srv://\<user>\:\<password>\@\<database_name>.mongocluster.cosmos.azure.com".
56-
2. Back in Databricks in your cluster configuration, under **Advanced Options** (bottom of page), paste the connection string for both the `spark.mongodb.output.uri` and `spark.mongodb.input.uri` variables. Plase populate the username and password field appropriatly. This way all the workbooks you are running on the cluster will use this configuration.
57-
3. Alternativley you can explictly set the `option` when calling APIs like: `spark.read.format("mongo").option("spark.mongodb.input.uri", connectionString).load()`. If congigured the variables in the cluster, you don't have to set the option.
56+
2. Back in Databricks, in your cluster configuration under **Advanced Options** (at the bottom of the page), paste the connection string for both the `spark.mongodb.output.uri` and `spark.mongodb.input.uri` variables, and populate the username and password fields appropriately. This way, all the notebooks that run on the cluster use this configuration (a sketch of the resulting entries follows this list).
57+
3. Alternatively, you can explicitly set the `option` when calling APIs, like: `spark.read.format("mongo").option("spark.mongodb.input.uri", connectionString).load()`. If you configure the variables in the cluster, you don't have to set the option.
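As a rough sketch of step 2 (with placeholders for your own username, password, and cluster name, and the same URI options used later in this article), the two entries in the cluster's Spark configuration would look something like this:

```
spark.mongodb.output.uri mongodb+srv://<user>:<password>@<database_name>.mongocluster.cosmos.azure.com/?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000
spark.mongodb.input.uri mongodb+srv://<user>:<password>@<database_name>.mongocluster.cosmos.azure.com/?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000
```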
5858

5959
```python
6060
connectionString_vcore="mongodb+srv://<user>:<password>@<database_name>.mongocluster.cosmos.azure.com/?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000"
@@ -64,36 +64,36 @@ collection="<collection_name>"
6464

6565
### Data Sample Set
6666

67-
For the purpose of this lab we will be using the Mongo Citibike2019 data set. You can import it from here
68-
[CitiBike Trip History 2019](https://citibikenyc.com/system-data)
69-
We have loaded it into a database called "CitiBikeDB" and the collection "CitiBike2019"
70-
We are setting the variables database and collection to point to the data loaded and we shall be using these variable in the examples below
67+
For the purposes of this lab, we're using the Mongo 'CitiBike2019' data set. You can import it from here:
68+
[CitiBike Trip History 2019](https://citibikenyc.com/system-data).
69+
We loaded it into a database called "CitiBikeDB" and the collection "CitiBike2019".
70+
We're setting the `database` and `collection` variables to point to the loaded data, and we use these variables in the following examples.
7171
```python
7272
database="CitiBikeDB"
7373
collection="CitiBike2019"
7474
```
7575

7676
### Read data from Azure Cosmos DB for MongoDB vCore
7777

78-
The general syntax looks like this :
78+
The general syntax looks like this:
7979
```python
8080
df_vcore = spark.read.format("mongo").option("database", database).option("spark.mongodb.input.uri", connectionString_vcore).option("collection",collection).load()
8181
```
8282

83-
You can validate the data frame loaded as follows :
83+
You can validate that the data frame loaded as follows:
8484
```python
8585
df_vcore.printSchema()
8686
display(df_vcore)
8787
```
8888

89-
Let's see this with an example :
89+
Let's see an example:
9090
```python
9191
df_vcore = spark.read.format("mongo").option("database", database).option("spark.mongodb.input.uri", connectionString_vcore).option("collection",collection).load()
9292
df_vcore.printSchema()
9393
display(df_vcore)
9494
```
9595

96-
Output :
96+
Output:
9797
**Schema**
9898
:::image type="content" source="./media/connect-from-databricks/print-schema.png" alt-text="Screenshot of the Print Schema.":::
9999

@@ -102,13 +102,13 @@ Output :
102102

103103
### Filter data from Azure Cosmos DB for MongoDB vCore
104104

105-
The general syntax looks like this :
105+
The general syntax looks like this:
106106
```python
107107
df_v = df_vcore.filter(df_vcore[column number/column name] == [filter condition])
108108
display(df_v)
109109
```
110110

111-
Let's see this with an example :
111+
Let's see an example:
112112
```python
113113
df_v = df_vcore.filter(df_vcore[2] == 1970)
114114
display(df_v)
@@ -119,13 +119,13 @@ Output:
119119

120120
### Create a view or temporary table and run SQL queries against it
121121

122-
The general syntax looks like this :
122+
The general syntax looks like this:
123123
```python
124124
df_[dataframename].createOrReplaceTempView("[View Name]")
125125
spark.sql("SELECT * FROM [View Name]")
126126
```
127127

128-
Let's see this with an example :
128+
Let's see an example:
129129
```python
130130
df_vcore.createOrReplaceTempView("T_VCORE")
131131
df_v = spark.sql(" SELECT * FROM T_VCORE WHERE birth_year == 1970 and gender == 2 ")
@@ -137,25 +137,25 @@ Output:
137137

138138
### Write data to Azure Cosmos DB for MongoDB vCore
139139

140-
The general syntax looks like this :
140+
The general syntax looks like this:
141141
```python
142142
df.write.format("mongo").option("spark.mongodb.output.uri", connectionString).option("database",database).option("collection","<collection_name>").mode("append").save()
143143
```
144144

145-
Let's see this with an example :
145+
Let's see an example:
146146
```python
147147
df_vcore.write.format("mongo").option("spark.mongodb.output.uri", connectionString_vcore).option("database",database).option("collection","CitiBike2019").mode("append").save()
148148
```
149149

150-
This command does not have an output as it will write directly to the collection. You can cross check if the record is updated using a read command.
150+
This command doesn't have an output because it writes directly to the collection. You can cross-check whether the record was updated by using a read command, as in the sketch below.
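For example, a quick sanity check might read the collection back and count the documents; this sketch reuses the `connectionString_vcore`, `database`, and `collection` variables defined earlier:

```python
# Sketch: re-read the collection after the write and count the documents.
# Comparing this count before and after the write shows whether the append landed.
df_check = spark.read.format("mongo") \
    .option("database", database) \
    .option("spark.mongodb.input.uri", connectionString_vcore) \
    .option("collection", collection) \
    .load()

print(df_check.count())
```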
151151

152152
### Read data from an Azure Cosmos DB for MongoDB vCore collection by running an Aggregation Pipeline
153153

154-
[Aggregation Pipeline](../tutorial-aggregation.md) is a powerful capability that allows to pre-process and transform data within Azure CosmosDB for MongoDB. It's a great match for real-time analytics, dashboards, report generation with roll-ups, sums & averages with 'server-side' data post-processing. (Note: there is a [whole book written about it](https://www.practical-mongodb-aggregations.com/front-cover.html)). <br/>
155-
Azure Cosmos DB for MongoDB even supports [rich secondary/compound indexes](../indexing.md) to extract, filter, and process only the data it needs – for example, analyzing all customers located in a specific geography right within the database without first having to load the full data-set, minimizing data-movement and reducing latency. <br/>
156-
You can find the syntax in the hyperlinks above.
154+
[Aggregation Pipeline](../tutorial-aggregation.md) is a powerful capability that allows you to preprocess and transform data within Azure Cosmos DB for MongoDB. It's a great match for real-time analytics, dashboards, and report generation with roll-ups, sums, and averages with 'server-side' data post-processing. (Note: there's a [whole book written about it](https://www.practical-mongodb-aggregations.com/front-cover.html).) <br/>
155+
Azure Cosmos DB for MongoDB even supports [rich secondary/compound indexes](../indexing.md) to extract, filter, and process only the data it needs.
156+
For example, you can analyze all customers located in a specific geography right within the database, without first having to load the full data set, minimizing data movement and reducing latency. <br/>
157157

158-
Below is an example of using aggregate function :
158+
Here's an example of using the aggregate function:
159159

160160
```python
161161
pipeline = "[{ $group : { _id : '$birth_year', totaldocs : { $count : 1 }, totalduration: {$sum: '$tripduration'}} }]"
