| 1 | +.. _comparison: |
| 2 | + |
| 3 | +Comparison with other tools |
| 4 | +=========================== |
| 5 | + |
| 6 | +Why not `DataHub <https://datahubproject.io/>`_? |
| 7 | +------------------------------------------------ |
| 8 | + |
| 9 | +DataHub cons |
| 10 | +~~~~~~~~~~~~ |
| 11 | + |
| 12 | +* As a data catalog, DataHub relies on a database ingestion mechanism. |
| 13 | + To extract and draw lineage between tables, you must *both* connect an ingestor to every database *and* enable integration with ETL tools (Spark, Airflow, etc.). |
| 14 | + |
| 15 | + The ``spark.datahub.metadata.dataset.materialize=true`` option exists, but with it DataHub creates datasets without a schema, |
| 16 | + so ingestors are still required. |
| 17 | + |
| 18 | +* The DataHub Spark agent doesn't work properly if *Platform Instances* are enabled in DataHub. |
| 19 | + A Platform Instance is an additional hierarchy level for databases, |
| 20 | + and there is no way to map it to the database address used by Spark, Airflow, and other ETL tools. |
| 21 | + |
| 22 | +* The OpenLineage → DataHub integration records each Spark command as a dedicated *Pipeline Task*, producing a huge lineage graph. |
| 23 | + |
| 24 | + Data.Rentgen has a configurable ``granularity`` option for rendering the lineage graph. |
| 25 | + |
| 26 | +* High CPU and memory consumption. |
| 27 | + |
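For reference, the ``spark.datahub.metadata.dataset.materialize`` option mentioned above is an ordinary Spark property. The sketch below shows one way to pass it to ``spark-submit``. Only the ``materialize`` option comes from this page; the listener class, the ``spark.datahub.rest.server`` property, the server URL, and the ``to_spark_submit_args`` helper are assumptions made for illustration and should be checked against the DataHub documentation.

```python
# Hypothetical helper: renders Spark properties as spark-submit arguments.
def to_spark_submit_args(conf: dict[str, str]) -> list[str]:
    args: list[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


datahub_conf = {
    # Assumed listener class and server property; verify against DataHub docs.
    "spark.extraListeners": "datahub.spark.DatahubSparkListener",
    "spark.datahub.rest.server": "http://localhost:8080",
    # Option discussed above: datasets are created, but without a schema.
    "spark.datahub.metadata.dataset.materialize": "true",
}

print(" ".join(to_spark_submit_args(datahub_conf)))
```

Even with this flag set, the resulting datasets have no schema, so database ingestors are still needed for column-level information.
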
| 28 | +DataHub pros |
| 29 | +~~~~~~~~~~~~ |
| 30 | + |
| 31 | +* DataHub has information about real dataset column names, types, and descriptions. |
| 32 | + Data.Rentgen only has the information provided by the ETL engine, e.g. selected columns and engine-specific column types. |
| 33 | + |
| 34 | +* DataHub has table → view lineage; Data.Rentgen doesn't. |
| 35 | + |
| 36 | +Why not `OpenMetadata <https://open-metadata.org/>`_? |
| 37 | +----------------------------------------------------- |
| 38 | + |
| 39 | +OpenMetadata cons |
| 40 | +~~~~~~~~~~~~~~~~~ |
| 41 | + |
| 42 | +* Database ingestors are required to build a lineage graph, just as with DataHub. |
| 43 | +* The OpenLineage → OpenMetadata integration produces no lineage, for reasons that remain unclear. |
| 44 | +* High CPU and memory consumption. |
| 45 | + |
| 46 | +OpenMetadata pros |
| 47 | +~~~~~~~~~~~~~~~~~ |
| 48 | + |
| 49 | +* OpenMetadata has information about real dataset column names, types, and descriptions. |
| 50 | + |
| 51 | + Data.Rentgen only has the information available in the ETL engine, e.g. selected columns and engine-specific column types. |
| 52 | + |
| 53 | +* OpenMetadata has table → view lineage; Data.Rentgen doesn't. |
| 54 | + |
| 55 | +Why not `Marquez <https://marquezproject.ai/>`_? |
| 56 | +------------------------------------------------ |
| 57 | + |
| 58 | +Marquez cons |
| 59 | +~~~~~~~~~~~~ |
| 60 | + |
| 61 | +* The OpenLineage → Marquez integration records each Spark command as a dedicated Job, producing an overly detailed lineage graph. |
| 62 | + |
| 63 | + Data.Rentgen has a configurable ``granularity`` option for rendering the lineage graph. |
| 64 | + |
| 65 | +* Severe performance issues while consuming lineage events. |
| 66 | +* No support for dataset symlinks, e.g. HDFS location → Hive table. |
| 67 | +* No support for parent runs, e.g. Airflow task → Spark application. |
| 68 | +* No releases since 2024. |
| 69 | + |
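To illustrate why per-command lineage becomes too detailed and what a coarser granularity gives you, here is a minimal Python sketch. It is not how Data.Rentgen or Marquez implements lineage; the ``Edge`` type, the ``collapse_to_job`` helper, and the sample dataset and job names are all hypothetical.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    source: str   # input dataset
    job: str      # Spark application that ran the command
    command: str  # individual Spark command within the job
    target: str   # output dataset


def collapse_to_job(edges: list[Edge]) -> set[tuple[str, str, str]]:
    """Drop the per-command level, keeping one edge per (source, job, target)."""
    return {(e.source, e.job, e.target) for e in edges}


# Hypothetical sample data: three commands inside a single Spark application.
edges = [
    Edge("hive://raw.orders", "daily_etl", "cmd-1", "hdfs:///tmp/stage"),
    Edge("hive://raw.orders", "daily_etl", "cmd-2", "hdfs:///tmp/stage"),
    Edge("hdfs:///tmp/stage", "daily_etl", "cmd-3", "hive://mart.orders"),
]

job_level = collapse_to_job(edges)
print(len(edges), "command-level edges ->", len(job_level), "job-level edges")
```

Repeated commands between the same pair of datasets collapse into a single edge, so the graph grows with the number of distinct dataset pairs rather than with the number of commands executed.
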
| 70 | +Marquez pros |
| 71 | +~~~~~~~~~~~~ |
| 72 | + |
| 73 | +* Marquez stores and shows lineage for any OpenLineage integration. |
| 74 | + Data.Rentgen may require some adjustments for that. |
| 75 | + |
| 76 | +* Marquez stores and shows any facet produced by an OpenLineage integration, including custom ones. |
| 77 | + Data.Rentgen stores only selected facets. |
| 78 | + |
| 79 | +Why not `Apache Atlas <https://atlas.apache.org>`_? |
| 80 | +--------------------------------------------------- |
| 81 | + |
| 82 | +* No Apache Spark 3.x integration in open source. |
| 83 | +* Integrates only with Apache Airflow 1.x; 2.x and 3.x are not supported. |
| 84 | +* High CPU and memory consumption in production environments, as it uses HBase as its storage layer. |
| 85 | + |
| 86 | +Why not `Open Data Discovery <https://opendatadiscovery.org/>`_? |
| 87 | +----------------------------------------------------------------- |
| 88 | + |
| 89 | +* No Apache Spark integration. |
| 90 | +* Integrates only with Apache Airflow 1.x; 2.x and 3.x are not supported. |
| 91 | + |
| 92 | +Why not `Amundsen <https://www.amundsen.io>`_? |
| 93 | +---------------------------------------------- |
| 94 | + |
| 95 | +* No Apache Spark integration. |
| 96 | +* No releases since 2024. |
| 97 | + |
| 98 | +Why not `Spline <https://absaoss.github.io/spline/>`_? |
| 99 | +------------------------------------------------------ |
| 100 | + |
| 101 | +* No Apache Airflow integration. |
| 102 | +* Spline stores its data in ArangoDB, which changed its license from Apache-2.0 to BSL `on 2024.02.19 <https://arangodb.com/2024/02/update-evolving-arangodbs-licensing-model-for-a-sustainable-future/>`_. |
| 103 | + |
| 104 | +Why not `Egeria <https://egeria-project.org/>`_? |
| 105 | +------------------------------------------------ |
| 106 | + |
| 107 | +Insanely complicated. |