
[DOP-22364] Add column lineage tables #153

Merged
dolfinus merged 2 commits into develop from feature/DOP-22364 on Feb 5, 2025


@dolfinus dolfinus commented Feb 4, 2025

Change Summary

Added 2 tables to store column lineage:

  • dataset_column_relation - tuples of (source_column, target_column, type) that share a common fingerprint value (a hash of the specific set of relations).
  • column_lineage - tuples of (operation, source_dataset, target_dataset, fingerprint).

This combination requires the least amount of space, as relations between columns remain the same between multiple operations/runs/jobs, and sometimes even between datasets.
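The deduplication idea can be illustrated with a minimal sketch. It is not the implementation from this PR: it uses an in-memory SQLite database instead of PostgreSQL, and the `relation_fingerprint` and `save_lineage` helpers, the SHA-256 hash, the column types, and the relation type names (`IDENTITY`, `TRANSFORMATION`) are all assumptions made for illustration.

```python
import hashlib
import sqlite3


def relation_fingerprint(relations):
    """Hash a set of (source_column, target_column, type) tuples.

    Hypothetical helper: relations are sorted first, so the same set
    always yields the same fingerprint regardless of input order.
    """
    digest = hashlib.sha256()
    for source, target, rel_type in sorted(
        relations, key=lambda r: (r[0], r[1] or "", r[2])
    ):
        digest.update(f"{source}\x00{target or ''}\x00{rel_type}\n".encode())
    return digest.hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset_column_relation (
        fingerprint TEXT NOT NULL,
        source_column TEXT NOT NULL,
        target_column TEXT,
        type TEXT NOT NULL
    );
    CREATE TABLE column_lineage (
        operation_id INTEGER NOT NULL,
        source_dataset_id INTEGER NOT NULL,
        target_dataset_id INTEGER NOT NULL,
        fingerprint TEXT NOT NULL
    );
    """
)


def save_lineage(operation_id, source_dataset_id, target_dataset_id, relations):
    """Store the relation set once, then reference it by fingerprint."""
    fingerprint = relation_fingerprint(relations)
    known = conn.execute(
        "SELECT 1 FROM dataset_column_relation WHERE fingerprint = ? LIMIT 1",
        (fingerprint,),
    ).fetchone()
    if known is None:
        conn.executemany(
            "INSERT INTO dataset_column_relation VALUES (?, ?, ?, ?)",
            [(fingerprint, s, t, ty) for s, t, ty in relations],
        )
    conn.execute(
        "INSERT INTO column_lineage VALUES (?, ?, ?, ?)",
        (operation_id, source_dataset_id, target_dataset_id, fingerprint),
    )
    return fingerprint


# Two operations report the same column relations (in a different order).
relations = [("id", "user_id", "IDENTITY"), ("name", "full_name", "TRANSFORMATION")]
fp1 = save_lineage(1, 10, 20, relations)
fp2 = save_lineage(2, 10, 20, list(reversed(relations)))

relation_rows = conn.execute("SELECT count(*) FROM dataset_column_relation").fetchone()[0]
lineage_rows = conn.execute("SELECT count(*) FROM column_lineage").fetchone()[0]
print(fp1 == fp2, relation_rows, lineage_rows)
```

In this sketch the second operation adds only one row to column_lineage, while the column relations themselves are stored once.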

For example, for 2.6M rows/4.9GB of raw events in Kafka (=1.2M operations/230MB), column lineage requires only:

  • dataset_column_relation - 14.6k rows, 3.2MB
  • column_lineage - 52.9k rows, 13MB

Storing column lineage as a flat table (operation, source_dataset, target_dataset, source_column, target_column, type) requires much more space: 805k rows, 123MB.
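As a rough check of the difference, the figures reported above can be compared directly. This is only arithmetic over the rounded numbers from this description, not a new measurement; the variable names are illustrative.

```python
# Rounded figures as reported in this description.
flat_rows, flat_mb = 805_000, 123.0
relation_rows, relation_mb = 14_600, 3.2   # dataset_column_relation
lineage_rows, lineage_mb = 52_900, 13.0    # column_lineage

# Totals for the two-table layout.
split_rows = relation_rows + lineage_rows
split_mb = relation_mb + lineage_mb

# How many times larger the flat table is.
row_ratio = flat_rows / split_rows
size_ratio = flat_mb / split_mb

print(f"two tables: {split_rows} rows, {split_mb:.1f}MB")
print(f"flat table is ~{row_ratio:.1f}x the rows and ~{size_ratio:.1f}x the size")
```

On these numbers the two-table layout totals 67.5k rows and 16.2MB, so the flat table is roughly 12 times larger by row count and 7.6 times larger by size.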

Note: target_column is nullable, but we need a unique index over the tuple (fingerprint, source_column, target_column), and PostgreSQL by default considers NULLs to be distinct values. This is temporarily fixed by using coalesce(target_column, ''), but I'm not really sure about that. We should probably make the column NOT NULL and store an empty string there instead.
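The NULL behaviour described in the note can be reproduced with a small sketch. It runs against an in-memory SQLite database rather than PostgreSQL, on the assumption that this is acceptable for illustration: SQLite also treats NULLs as distinct in unique indexes and supports expression indexes. The index names, column types, and sample row are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dataset_column_relation (
        fingerprint TEXT NOT NULL,
        source_column TEXT NOT NULL,
        target_column TEXT,
        type TEXT NOT NULL
    )
    """
)
insert = "INSERT INTO dataset_column_relation VALUES (?, ?, ?, ?)"
row = ("abc", "id", None, "IDENTITY")  # target_column is NULL

# Plain unique index: the two NULLs count as distinct, so the duplicate is accepted.
conn.execute(
    "CREATE UNIQUE INDEX plain_idx "
    "ON dataset_column_relation (fingerprint, source_column, target_column)"
)
conn.execute(insert, row)
conn.execute(insert, row)
plain_count = conn.execute("SELECT count(*) FROM dataset_column_relation").fetchone()[0]

# Expression index with coalesce: NULL is indexed as '', so the duplicate is rejected.
conn.execute("DELETE FROM dataset_column_relation")
conn.execute("DROP INDEX plain_idx")
conn.execute(
    "CREATE UNIQUE INDEX coalesce_idx "
    "ON dataset_column_relation (fingerprint, source_column, coalesce(target_column, ''))"
)
conn.execute(insert, row)
try:
    conn.execute(insert, row)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(plain_count, duplicate_rejected)
```

Making the column NOT NULL with an empty-string default, as suggested in the note, would let a plain unique index enforce the same constraint without the expression.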

Related issue number

Checklist

  • Commit message and PR title are comprehensive
  • Keep the change as small as possible
  • Unit and integration tests for the changes exist
  • Tests pass on CI and coverage does not decrease
  • Documentation reflects the changes where applicable
  • docs/changelog/next_release/<pull request or issue id>.<change type>.rst file added describing change
    (see CONTRIBUTING.rst for details.)
  • My PR is ready to review.

@dolfinus dolfinus self-assigned this Feb 4, 2025

codecov bot commented Feb 4, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.82%. Comparing base (868e822) to head (ac1e9ca).
Report is 137 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #153      +/-   ##
===========================================
+ Coverage    92.68%   92.82%   +0.14%     
===========================================
  Files          177      180       +3     
  Lines         3880     3956      +76     
  Branches       268      269       +1     
===========================================
+ Hits          3596     3672      +76     
  Misses         222      222              
  Partials        62       62              


@dolfinus dolfinus marked this pull request as ready for review February 4, 2025 09:17
Co-authored-by: Kirill Yakimenkov <kayakimenkov@gmail.com>
@dolfinus dolfinus added the ci:skip-changelog label (Add this label to skip changelog file check) Feb 5, 2025
@dolfinus dolfinus merged commit da9a6d7 into develop Feb 5, 2025
13 of 14 checks passed
@dolfinus dolfinus deleted the feature/DOP-22364 branch February 5, 2025 08:25


2 participants