
Commit ed24493

Merge pull request #105230 from yossi-karp/user/v-yokarp/spark

Updated pics and text

2 parents ee28615 + 8202402

File tree: 5 files changed, +82 −60 lines (the article below and four image files under media/spark-connector/)

articles/data-explorer/spark-connector.md

Lines changed: 82 additions & 60 deletions
@@ -9,16 +9,15 @@ ms.topic: conceptual
ms.date: 1/14/2020
---

# Azure Data Explorer Connector for Apache Spark

[Apache Spark](https://spark.apache.org/) is a unified analytics engine for large-scale data processing. Azure Data Explorer is a fast, fully managed data analytics service for real-time analysis on large volumes of data.

The Azure Data Explorer connector for Spark is an [open source project](https://github.com/Azure/azure-kusto-spark) that can run on any Spark cluster. It implements a data source and a data sink for moving data across Azure Data Explorer and Spark clusters. Using Azure Data Explorer and Apache Spark, you can build fast and scalable applications targeting data-driven scenarios such as machine learning (ML), Extract-Transform-Load (ETL), and Log Analytics. With the connector, Azure Data Explorer becomes a valid data store for standard Spark source and sink operations, such as write, read, and writeStream.

You can write to Azure Data Explorer in either batch or streaming mode. Reading from Azure Data Explorer supports column pruning and predicate pushdown, which filter the data in Azure Data Explorer and reduce the volume of transferred data.

This topic describes how to install and configure the Azure Data Explorer Spark connector and move data between Azure Data Explorer and Apache Spark clusters.

> [!NOTE]
> Although some of the examples below refer to an [Azure Databricks](https://docs.azuredatabricks.net/) Spark cluster, the Azure Data Explorer Spark connector doesn't take direct dependencies on Databricks or any other Spark distribution.
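For orientation, here's a minimal batch-write sketch. This is a hedged example: `df`, `appId`, and `appKey` are assumed to be defined, the cluster coordinates are hypothetical, and the sink option names follow the connector documentation; full streaming and read examples appear later in this article.

```scala
import com.microsoft.kusto.spark.datasink.KustoSinkOptions

// Persist an existing DataFrame `df` to a Kusto table (batch mode).
df.write
  .format("com.microsoft.kusto.spark.datasource")
  .option(KustoSinkOptions.KUSTO_CLUSTER, "MyCluster")   // hypothetical cluster
  .option(KustoSinkOptions.KUSTO_DATABASE, "MyDatabase") // hypothetical database
  .option(KustoSinkOptions.KUSTO_TABLE, "MyTable")       // hypothetical table
  .option(KustoSinkOptions.KUSTO_AAD_CLIENT_ID, appId)
  .option(KustoSinkOptions.KUSTO_AAD_CLIENT_PASSWORD, appKey)
  .save()
```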
@@ -27,36 +26,36 @@ and sink operations such as write, read and writeStream.
* [Create an Azure Data Explorer cluster and database](/azure/data-explorer/create-cluster-database-portal)
* Create a Spark cluster
* Install the Azure Data Explorer connector library:
  * Pre-built libraries for [Spark 2.4, Scala 2.11](https://github.com/Azure/azure-kusto-spark/releases)
  * [Maven repo](https://mvnrepository.com/artifact/com.microsoft.azure.kusto/spark-kusto-connector)
* [Maven 3.x](https://maven.apache.org/download.cgi) installed

> [!TIP]
> 2.3.x versions are also supported, but may require some changes in pom.xml dependencies.
## How to build the Spark connector

> [!NOTE]
> This step is optional. If you're using pre-built libraries, go to [Spark cluster setup](#spark-cluster-setup).

### Build prerequisites
1. Install the libraries listed in [dependencies](https://github.com/Azure/azure-kusto-spark#dependencies), including the following [Kusto Java SDK](/azure/kusto/api/java/kusto-java-client-library) libraries:
   * [Kusto Data Client](https://mvnrepository.com/artifact/com.microsoft.azure.kusto/kusto-data)
   * [Kusto Ingest Client](https://mvnrepository.com/artifact/com.microsoft.azure.kusto/kusto-ingest)

1. Refer to [this source](https://github.com/Azure/azure-kusto-spark) for building the Spark connector.

1. For Scala/Java applications using Maven project definitions, link your application with the following artifact (the latest version may differ):

    ```Maven
    <dependency>
        <groupId>com.microsoft.azure</groupId>
        <artifactId>spark-kusto-connector</artifactId>
        <version>1.1.0</version>
    </dependency>
    ```

### Build commands
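The build commands themselves are documented in the repository; typically, standard Maven goals apply. A sketch, assuming the repository's standard layout and to be verified against the repo's README:

```bash
# Build the jars and run all tests
mvn clean package

# Build, test, and install the jars to your local Maven repository
mvn clean install
```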
@@ -77,27 +76,37 @@ For more information, see [connector usage](https://github.com/Azure/azure-kusto
## Spark cluster setup

> [!NOTE]
> It's recommended to use the latest Azure Data Explorer Spark connector release when performing the following steps.

1. Configure the following Spark cluster settings, based on an Azure Databricks cluster using Spark 2.4.4 and Scala 2.11:

    ![Databricks cluster settings](media/spark-connector/databricks-cluster.png)

1. Install the latest spark-kusto-connector library from Maven:

    ![Import libraries](media/spark-connector/db-libraries-view.png)
    ![Select Spark-Kusto-Connector](media/spark-connector/db-dependencies.png)

1. Verify that all required libraries are installed:

    ![Verify libraries installed](media/spark-connector/db-libraries-view.png)

1. For installation using a JAR file, verify that additional dependencies were installed:

    ![Add dependencies](media/spark-connector/db-not-maven.png)
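If you manage Databricks clusters from the command line, the library can also be attached with the Databricks CLI. This is a hedged sketch: it assumes the legacy databricks-cli is installed and configured, the cluster ID is hypothetical, and the Maven coordinates follow the Maven repo link above.

```bash
databricks libraries install \
  --cluster-id 0123-456789-abcde123 \
  --maven-coordinates com.microsoft.azure.kusto:spark-kusto-connector:1.1.0
```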
## Authentication

The Azure Data Explorer Spark connector enables you to authenticate with Azure Active Directory (Azure AD) using one of the following methods:

* An [Azure AD application](#azure-ad-application-authentication)
* An [Azure AD access token](https://github.com/Azure/azure-kusto-spark/blob/dev/docs/Authentication.md#direct-authentication-with-access-token)
* [Device authentication](https://github.com/Azure/azure-kusto-spark/blob/dev/docs/Authentication.md#device-authentication) (for non-production scenarios)
* [Azure Key Vault](https://github.com/Azure/azure-kusto-spark/blob/dev/docs/Authentication.md#key-vault). To access the Key Vault resource, install the azure-keyvault package and provide application credentials.
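For Azure AD application authentication, a minimal sketch of the option map (the client ID and password option names appear in the read examples later in this article; the authority/tenant option name is an assumption from the connector documentation):

```scala
import com.microsoft.kusto.spark.datasource.KustoSourceOptions

val conf = Map(
  KustoSourceOptions.KUSTO_AAD_CLIENT_ID -> appId,        // Azure AD application (client) ID
  KustoSourceOptions.KUSTO_AAD_CLIENT_PASSWORD -> appKey, // Azure AD application key
  KustoSourceOptions.KUSTO_AAD_AUTHORITY_ID -> tenantId)  // assumption: Azure AD tenant (authority) ID
```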
### Azure AD application authentication

Azure AD application authentication is the simplest and most common authentication method, and is recommended for the Azure Data Explorer Spark connector.

|Property |Description |
|---------|---------|
@@ -107,10 +116,10 @@ Most simple and common authentication method. This method is recommended for Azu
### Azure Data Explorer privileges

Grant the following privileges on an Azure Data Explorer cluster:

* For reading (data source), the Azure AD identity must have *viewer* privileges on the target database, or *admin* privileges on the target table.
* For writing (data sink), the Azure AD identity must have *ingestor* privileges on the target database. It must also have *user* privileges on the target database to create new tables. If the target table already exists, you must configure *admin* privileges on the target table.
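For example, these roles can be granted with Kusto control commands. A hedged sketch, where the database name and the Azure AD application/tenant IDs are hypothetical placeholders:

```kusto
// Reading: viewer on the database (or admin on the table).
.add database MyDatabase viewers ('aadapp=<application-id>;<tenant-id>')

// Writing: ingestor on the database, plus user to create new tables.
.add database MyDatabase ingestors ('aadapp=<application-id>;<tenant-id>')
.add database MyDatabase users ('aadapp=<application-id>;<tenant-id>')
```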
For more information on Azure Data Explorer principal roles, see [role-based authorization](/azure/kusto/management/access-control/role-based-authorization). For managing security roles, see [security roles management](/azure/kusto/management/security-roles).
@@ -167,10 +176,9 @@ For more information on Azure Data Explorer principal roles, see [role-based aut
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.streaming.Trigger

// Set up a checkpoint.
spark.conf.set("spark.sql.streaming.checkpointLocation", "/FileStore/temp/checkpoint")

// Write to a Kusto table from a streaming source
val kustoQ = df
  .writeStream
@@ -183,7 +191,7 @@ For more information on Azure Data Explorer principal roles, see [role-based aut
## Spark source: reading from Azure Data Explorer

1. When reading [small amounts of data](/azure/kusto/concepts/querylimits), define the data query:

    ```scala
    import com.microsoft.kusto.spark.datasource.KustoSourceOptions
@@ -212,7 +220,8 @@ For more information on Azure Data Explorer principal roles, see [role-based aut
    display(df2)
    ```

1. Optional: If **you** (and not Azure Data Explorer) provide the transient blob storage, the created blobs are the caller's responsibility. This includes provisioning the storage, rotating access keys, and deleting transient artifacts. The KustoBlobStorageUtils module contains helper functions for deleting blobs based on either account and container coordinates and account credentials, or a full SAS URL with write, read, and list permissions, once the corresponding RDD is no longer needed. Each transaction stores transient blob artifacts in a separate directory. This directory is captured as part of read-transaction information logs reported on the Spark Driver node.

    ```scala
    // Use either container/account-key/account name, or container SaS
@@ -222,28 +231,41 @@ For more information on Azure Data Explorer principal roles, see [role-based aut
    // val storageSas = dbutils.secrets.get(scope = "KustoDemos", key = "blobStorageSasUrl")
    ```
    In the example above, the Key Vault isn't accessed through the connector interface; instead, the simpler mechanism of Databricks secrets is used.
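    A hedged sketch of the cleanup mentioned above (the `KustoBlobStorageUtils` entry points are an assumption based on the connector sources; verify the signatures against your release):

    ```scala
    import com.microsoft.kusto.spark.utils.KustoBlobStorageUtils

    // Delete transient blobs by account/container coordinates and account key...
    KustoBlobStorageUtils.deleteFromBlob(storageAccount, directory, container, storageKey)

    // ...or by a full SAS URL with write, read, and list permissions.
    KustoBlobStorageUtils.deleteFromBlob(directory, storageSas)
    ```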
1. Read from Azure Data Explorer.

    * If **you** provide the transient blob storage, read from Azure Data Explorer as follows:

        ```scala
        val conf3 = Map(
          KustoSourceOptions.KUSTO_AAD_CLIENT_ID -> appId,
          KustoSourceOptions.KUSTO_AAD_CLIENT_PASSWORD -> appKey,
          KustoSourceOptions.KUSTO_BLOB_STORAGE_SAS_URL -> storageSas)
        val df2 = spark.read.kusto(cluster, database, "ReallyBigTable", conf3)

        val dfFiltered = df2
          .where(df2.col("ColA").startsWith("row-2"))
          .filter("ColB > 12")
          .filter("ColB <= 21")
          .select("ColA")

        display(dfFiltered)
        ```
    * If **Azure Data Explorer** provides the transient blob storage, read from Azure Data Explorer as follows:

        ```scala
        val dfFiltered = df2
          .where(df2.col("ColA").startsWith("row-2"))
          .filter("ColB > 12")
          .filter("ColB <= 21")
          .select("ColA")

        display(dfFiltered)
        ```
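        Here, `df2` is assumed to be defined as in the previous bullet but with a configuration that omits the blob storage coordinates, so that Azure Data Explorer supplies the transient storage. A hedged sketch:

        ```scala
        val conf4 = Map(
          KustoSourceOptions.KUSTO_AAD_CLIENT_ID -> appId,
          KustoSourceOptions.KUSTO_AAD_CLIENT_PASSWORD -> appKey)
        val df2 = spark.read.kusto(cluster, database, "ReallyBigTable", conf4)
        ```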
## Next steps

* Learn more about the [Azure Data Explorer Spark Connector](https://github.com/Azure/azure-kusto-spark/tree/master/docs)
* [Sample code for Java and Python](https://github.com/Azure/azure-kusto-spark/tree/master/samples/src/main)
