Commit a89ad11: Acrolinx (1 parent: 2d67b0f)

1 file changed: articles/purview/how-to-lineage-spark-atlas-connector.md (+5, −5 lines)
@@ -24,7 +24,7 @@ Since Microsoft Purview supports Atlas API and Atlas native hook, the connector
 
 ## Configuration requirement
 
-The connectors require a version of Spark 2.4.0+. But Spark version 3 is not supported. The Spark supports three types of listener required to be set:
+The connectors require Spark 2.4.0 or later; Spark 3 isn't supported. Spark supports three types of listener that need to be set:
 
 | Listener | Since Spark Version |
 | ------------------- | ------------------- |
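As a sketch of how the three listeners in the table above are typically wired up on a cluster, the Spark configuration looks roughly like this. The class names are taken from the spark-atlas-connector README, not from this article, so verify them against the jar you built:

```
spark.extraListeners                        com.hortonworks.spark.atlas.SparkAtlasEventTracker
spark.sql.queryExecutionListeners           com.hortonworks.spark.atlas.SparkAtlasEventTracker
spark.sql.streaming.streamingQueryListeners com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker
```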
@@ -42,7 +42,7 @@ The following steps are documented based on DataBricks as an example:
 
 1. Generate package
 1. Pull code from GitHub: https://github.com/hortonworks-spark/spark-atlas-connector
-2. [For Windows] Comment out the **maven-enforcer-plugin** in spark-atlas-connector\pom.xml to remove the dependency on Unix.
+2. [For Windows], comment out the **maven-enforcer-plugin** in spark-atlas-connector\pom.xml to remove the dependency on Unix.
 
 ```web
 <requireOS>
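For the Windows step above, the rule being commented out sits inside the **maven-enforcer-plugin** configuration in pom.xml. A rough sketch of the edit; the `<family>unix</family>` body is an assumption about the repo's pom.xml, shown only to illustrate what commenting out looks like:

```xml
<!-- Commented out for Windows builds: removes the Unix-only enforcer rule -->
<!--
<requireOS>
  <family>unix</family>
</requireOS>
-->
```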
@@ -161,14 +161,14 @@ Kick off The Spark job and check the lineage info in your Microsoft Purview acco
 :::image type="content" source="./media/how-to-lineage-spark-atlas-connector/purview-with-spark-lineage.png" alt-text="Screenshot showing purview with spark lineage" lightbox="./media/how-to-lineage-spark-atlas-connector/purview-with-spark-lineage.png":::
 
 ## Known limitations with the connector for Spark lineage
-1. Supports SQL/DataFrame API (in other words, it does not support RDD). This connector relies on query listener to retrieve query and examine the impacts.
+1. Supports only the SQL/DataFrame API (in other words, it doesn't support RDDs). The connector relies on a query listener to retrieve the query and examine its impacts.
 
 2. All "inputs" and "outputs" from multiple queries are combined into a single "spark_process" entity.
 
 "spark_process" maps to an "applicationId" in Spark. This lets an admin track all changes that occurred as part of an application, but it also makes the lineage/relationship graph in "spark_process" complicated and less meaningful.
 3. Only part of the inputs is tracked in a streaming query.
 
-* Kafka source supports subscribing with "pattern" and this connector does not enumerate all existing matching topics, or even all possible topics
+* The Kafka source supports subscribing with a "pattern", and this connector doesn't enumerate all existing matching topics, or even all possible topics.
 
 * The "executed plan" provides the actual topics with (micro) batch reads and processes. As a result, only inputs that participate in the (micro) batch are included as "inputs" of the "spark_process" entity.
 

@@ -178,7 +178,7 @@ Kick off The Spark job and check the lineage info in your Microsoft Purview acco
 
 The "drop table" event from Spark only provides the db and table name, which is NOT sufficient to create the unique key needed to recognize the table.
 
-The connector depends on reading the Spark Catalog to get table information. Spark have already dropped the table when this connector notices the table is dropped, so drop table will not work.
+The connector depends on reading the Spark Catalog to get table information. Spark has already dropped the table by the time this connector notices, so drop table won't work.
 
 
 ## Next steps
## Next steps

0 commit comments

Comments
 (0)