Commit 6366950

Add comparison with other OSS lineage tools
1 parent 9586491 commit 6366950

2 files changed

+108
-0
lines changed

docs/comparison.rst

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
1+
.. _comparison:
2+
3+
Comparison with other tools
4+
===========================
5+
6+
Why not `DataHub <https://datahubproject.io/>`_?
7+
------------------------------------------------
8+
9+
DataHub cons
10+
~~~~~~~~~~~~
11+
12+
* As a data catalog, DataHub relies on a database ingestion mechanism.
13+
To extract and draw lineage between tables, you must *both* connect an ingestor to every database *and* enable integration with the ETL tools (Spark, Airflow, etc.).
14+
15+
There is an option ``spark.datahub.metadata.dataset.materialize=true``, but with it DataHub creates datasets without a schema,
16+
so ingestors are still required.
17+
18+
* The DataHub Spark agent doesn't work properly if *Platform Instances* are enabled in DataHub.
19+
A Platform Instance is an additional hierarchy level for databases,
20+
and there is no way to map it to the database address used by Spark, Airflow and other ETL tools.
21+
22+
* The OpenLineage → DataHub integration records each Spark command as a dedicated *Pipeline Task*, producing a huge lineage graph.
23+
24+
Data.Rentgen has a configurable ``granularity`` option for rendering the lineage graph.
25+
26+
* High CPU and memory consumption.
27+
28+
DataHub pros
29+
~~~~~~~~~~~~
30+
31+
* DataHub has information about real dataset column names, types and descriptions.
32+
Data.Rentgen only has the information provided by the ETL engine, e.g. selected columns and ETL engine-specific column types.
33+
34+
* DataHub has table → view lineage; Data.Rentgen doesn't.
35+
36+
Why not `OpenMetadata <https://open-metadata.org/>`_?
37+
-----------------------------------------------------
38+
39+
OpenMetadata cons
40+
~~~~~~~~~~~~~~~~~
41+
42+
* Database ingestors are required to build a lineage graph, just as with DataHub.
43+
* The OpenLineage → OpenMetadata integration produces no lineage, for some unknown reason.
44+
* High CPU and memory consumption.
45+
46+
OpenMetadata pros
47+
~~~~~~~~~~~~~~~~~
48+
49+
* OpenMetadata has information about real dataset column names, types and descriptions.
50+
51+
Data.Rentgen only has the information available in the ETL engine, e.g. selected columns and ETL engine-specific column types.
52+
53+
* OpenMetadata has table → view lineage; Data.Rentgen doesn't.
54+
55+
Why not `Marquez <https://marquezproject.ai/>`_?
56+
------------------------------------------------
57+
58+
Marquez cons
59+
~~~~~~~~~~~~
60+
61+
* The OpenLineage → Marquez integration records each Spark command as a dedicated Job, producing an overly detailed lineage graph.
62+
63+
Data.Rentgen has a configurable ``granularity`` option for rendering the lineage graph.
64+
65+
* Severe performance issues while consuming lineage events.
66+
* No support for dataset symlinks, e.g. HDFS location → Hive table.
67+
* No support for parent runs, e.g. Airflow task → Spark application.
68+
* No releases since 2024.
69+
70+
Marquez pros
71+
~~~~~~~~~~~~
72+
73+
* Marquez stores and shows lineage for any OpenLineage integration.
74+
Data.Rentgen may require some adjustments to support a new integration.
75+
76+
* Marquez stores and shows any facet produced by an OpenLineage integration, including custom ones.
77+
Data.Rentgen stores only selected facets.
78+
79+
Why not `Apache Atlas <https://atlas.apache.org>`_?
80+
---------------------------------------------------
81+
82+
* No Apache Spark 3.x integration in open source.
83+
* Only an Apache Airflow 1.x integration; no 2.x or 3.x support.
84+
* High CPU and memory consumption in a production environment, as it uses HBase as its storage layer.
85+
86+
Why not `Open Data Discovery <https://opendatadiscovery.org/>`_?
87+
-----------------------------------------------------------------
88+
89+
* No Apache Spark integration.
90+
* Only an Apache Airflow 1.x integration; no 2.x or 3.x support.
91+
92+
Why not `Amundsen <https://www.amundsen.io>`_?
93+
----------------------------------------------
94+
95+
* No Apache Spark integration.
96+
* No releases since 2024.
97+
98+
Why not `Spline <https://absaoss.github.io/spline/>`_?
99+
------------------------------------------------------
100+
101+
* No Apache Airflow integration.
102+
* Spline uses ArangoDB as its storage, and ArangoDB changed its license from Apache-2.0 to BSL `since 2024.02.19 <https://arangodb.com/2024/02/update-evolving-arangodbs-licensing-model-for-a-sustainable-future/>`_.
103+
104+
Why not `Egeria <https://egeria-project.org/>`_?
105+
------------------------------------------------
106+
107+
Insanely complicated.

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
44
:hidden:
55

66
self
7+
comparison
78

89
.. toctree::
910
:maxdepth: 2
