README.rst: 21 additions & 9 deletions (+21 -9)
@@ -38,20 +38,32 @@ Data.Rentgen is a Data Motion Lineage service, compatible with `OpenLineage <htt
 Goals
 -----
 
-* Collect lineage events produced by OpenLineage clients & integrations (Spark, Airflow).
-* Support consuming large amounts of lineage events, by using Kafka as event buffer and storing data in tables partitioned by event timestamp.
-* Store operation-grained events (instead of job grained `Marquez <https://marquezproject.ai/>`_), for better detalization.
-* Provide API for fetching run ↔ dataset lineage.
-* Allow building lineage graph with specific time boundaries (unlike Marquez there lineage is build only for last job run).
-* Allow building lineage graph with different granularity. e.g. merge all individual Spark operations into Spark applicationId or Spark applicationName.
-* Include column-level lineage into lineage graph.
+* Collect lineage events produced by OpenLineage clients & integrations.
+* Store operation-grained events for better detalization (instead of job grained `Marquez <https://marquezproject.ai/>`_).
+* Provide API for fetching job/run ↔ dataset lineage, not dataset ↔ dataset lineage (like `Datahub <https://datahubproject.io/>`_ and `OpenMetadata <https://open-metadata.org/>`_).
+
+Features
+--------
+
+* Support consuming large amounts of lineage events, use Apache Kafka as event buffer.
+* Store data in tables partitioned by event timestamp, to speed up lineage graph resolution.
+* Lineage graph is build with user-specified time boundaries (unlike Marquez where lineage is build only for last job run).
+* Lineage graph can be build with different granularity. e.g. merge all individual Spark operations into Spark applicationId or Spark applicationName.
+* Column-level lineage support.
+* Authentication support.
 
 Non-goals
 ---------
 
-* This is **not** a Data Catalog. Use `Datahub <https://datahubproject.io/>`_ or `OpenMetadata<https://open-metadata.org/>`_ instead.
+* This is **not** a Data Catalog, DataRentgen doesn't track dataset schema change, owner and so on. Use Datahub or OpenMetadata instead.
 * Static Data Lineage like view → table is not supported.
-* Job/run/operation are always a part of lineage graph. Hiding them to produce dataset → dataset lineage is not supported for now.
+
+Limitations
+-----------
+
+* For now, only Apache Spark and Apache Airflow are supported as lineage event sources.
+  OpenLineage also supports Apache Flink, DBT, Trino and others. DataRentgen support may be added later.
+* Unlike Marquez, DataRentgen parses only limited set of facets send by OpenLineage, and doesn't store custom facets. This can be changed in future.