articles/purview/how-to-lineage-spark-atlas-connector.md (+5 -5: 5 additions & 5 deletions)
@@ -24,7 +24,7 @@ Since Microsoft Purview supports Atlas API and Atlas native hook, the connector
## Configuration requirement
-The connectors require a version of Spark 2.4.0+. But Spark version 3 is not supported. The Spark supports three types of listener required to be set:
+The connectors require Spark 2.4.0 or later, but Spark version 3 isn't supported. Spark supports three types of listener that need to be set (see the sketch after the table below):
| Listener | Since Spark Version|
| ------------------- | ------------------- |
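For context, here's a minimal sketch (not part of this diff) of registering the three listener types in a Spark session. The tracker class names follow the hortonworks-spark/spark-atlas-connector README rather than this article, and the app name is a placeholder, so verify both against the jar you build:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: wire up the three listener types the connector relies on.
// Class names are assumptions taken from the spark-atlas-connector README.
val spark = SparkSession.builder()
  .appName("purview-lineage-sample") // hypothetical app name
  // SparkListener: job- and stage-level events
  .config("spark.extraListeners",
    "com.hortonworks.spark.atlas.SparkAtlasEventTracker")
  // QueryExecutionListener: fires when a SQL/DataFrame query completes
  .config("spark.sql.queryExecutionListeners",
    "com.hortonworks.spark.atlas.SparkAtlasEventTracker")
  // StreamingQueryListener: Structured Streaming progress events
  .config("spark.sql.streaming.streamingQueryListeners",
    "com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker")
  .getOrCreate()
```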
@@ -42,7 +42,7 @@ The following steps are documented based on DataBricks as an example:
1. Generate package
1. Pull code from GitHub: https://github.com/hortonworks-spark/spark-atlas-connector
-2.[For Windows] Comment out the **maven-enforcer-plugin** in spark-atlas-connector\pom.xml to remove the dependency on Unix.
+2. [For Windows] Comment out the **maven-enforcer-plugin** in spark-atlas-connector\pom.xml to remove the dependency on Unix.
```web
<requireOS>
  <family>unix</family>
</requireOS>
```
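If editing pom.xml is inconvenient, the maven-enforcer-plugin also honors a standard skip property, so a build such as `mvn clean package -DskipTests -Denforcer.skip=true` may work on Windows. This flag isn't mentioned in the article, so treat it as an alternative to verify against your plugin version.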
@@ -161,14 +161,14 @@ Kick off The Spark job and check the lineage info in your Microsoft Purview acco
:::image type="content" source="./media/how-to-lineage-spark-atlas-connector/purview-with-spark-lineage.png" alt-text="Screenshot showing purview with spark lineage" lightbox="./media/how-to-lineage-spark-atlas-connector/purview-with-spark-lineage.png":::
## Known limitations with the connector for Spark lineage
-1. Supports SQL/DataFrame API (in other words, it does not support RDD). This connector relies on query listener to retrieve query and examine the impacts.
+1. Supports the SQL/DataFrame API only (in other words, it doesn't support RDDs). This connector relies on a query listener to retrieve the query and examine its impacts.
2. All "inputs" and "outputs" from multiple queries are combined into a single "spark_process" entity.
"spark_process" maps to an "applicationId" in Spark. It allows admin to track all changes that occurred as part of an application. But also causes lineage/relationship graph in "spark_process" to be complicated and less meaningful.
169
169
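To make this limitation concrete, a small hypothetical sketch (paths and table names are invented, and `spark` is a session configured as in the earlier sketch): both queries below run under one applicationId, so their inputs and outputs merge into one "spark_process" entity.

```scala
// The key the "spark_process" entity maps to.
println(spark.sparkContext.applicationId)

// Two unrelated queries in the same application: their lineage is
// combined into a single "spark_process" entity.
spark.read.parquet("/data/orders").write.saveAsTable("orders_copy")
spark.read.parquet("/data/customers").write.saveAsTable("customers_copy")
```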
3. Only part of the inputs is tracked in a streaming query.

-* Kafka source supports subscribing with "pattern" and this connector does not enumerate all existing matching topics, or even all possible topics
+* The Kafka source supports subscribing with a "pattern", and this connector doesn't enumerate all existing matching topics, or even all possible topics.
* The "executed plan" provides actual topics with (micro) batch reads and processes. As a result, only inputs that participate in (micro) batch are included as "inputs" of "spark_process" entity.
174
174
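A hypothetical illustration of the pattern-subscription case (broker address and topic regex are invented): the connector only sees topics that an executed (micro-)batch actually read; topics that match the pattern but are never consumed aren't enumerated.

```scala
// Pattern subscription: Spark resolves matching topics at read time.
// Only topics read in a (micro-)batch appear as "inputs" of the
// spark_process entity.
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // illustrative address
  .option("subscribePattern", "events-.*")          // illustrative regex
  .load()
```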
@@ -178,7 +178,7 @@ Kick off The Spark job and check the lineage info in your Microsoft Purview acco
The "drop table" event from Spark only provides db and table name, which is NOT sufficient to create the unique key to recognize the table.
180
180
181
-
The connector depends on reading the Spark Catalog to get table information. Spark have already dropped the table when this connector notices the table is dropped, so drop table will not work.
181
+
The connector depends on reading the Spark Catalog to get table information. Spark have already dropped the table when this connector notices the table is dropped, so drop table won't work.
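For illustration (database and table names are invented): the event emitted by the statement below carries only "sales_db" and "daily_sales", and the catalog entry is already gone when the connector's listener runs, so no drop lineage is recorded.

```scala
// By the time the connector's listener observes this event, the table
// is gone from the Spark Catalog, so the connector can't resolve the
// metadata it needs to build a unique key for the entity.
spark.sql("DROP TABLE IF EXISTS sales_db.daily_sales")
```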