
Add batch deletion DAO support for stale metadata cleanup#604

Merged
NatalliaUlashchick merged 10 commits into master from naulashc/batch-delete-api
Mar 27, 2026
Conversation

@NatalliaUlashchick
Contributor

@NatalliaUlashchick NatalliaUlashchick commented Mar 18, 2026

Summary

  • Add readDeletionInfoBatch() and batchSoftDeleteAssets() to IEbeanLocalAccess to support bulk soft-deletion with exactly 2 DB round-trips per batch
  • New EntityDeletionInfo data class in dao-api holding deletion eligibility fields and aspect column data for Kafka archival
  • batchSoftDeleteAssets UPDATE embeds all guard clauses (deleted_ts IS NULL, Status.removed = true, lastmodifiedon < cutoff) as defense-in-depth against race conditions
  • 11 integration tests against embedded MariaDB + 2 mock-based instrumentation tests

Part of META-23501 — Metadata Graph Stale Metadata Cleanup Phase 2.

The Batch Delete API (PR-803) is in draft, pending this PR.
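The eligibility rules the summary describes (the three guard clauses embedded in the UPDATE) can be sketched in plain Java. This is an illustrative shape only — the real `EntityDeletionInfo` in dao-api is a Lombok `@Value`/`@Builder` class with additional aspect-column data, and the field names here are taken from the PR text:

```java
import java.sql.Timestamp;

// Sketch of the deletion-eligibility fields described in the summary.
// Hypothetical class shape; the real EntityDeletionInfo also carries
// raw aspect column values for Kafka archival.
class EntityDeletionInfoSketch {
    final Timestamp deletedTs;            // non-null => already soft-deleted
    final boolean statusRemoved;          // Status.removed parsed from the status aspect JSON
    final Timestamp statusLastModifiedOn; // when the Status aspect was last modified

    EntityDeletionInfoSketch(Timestamp deletedTs, boolean statusRemoved,
                             Timestamp statusLastModifiedOn) {
        this.deletedTs = deletedTs;
        this.statusRemoved = statusRemoved;
        this.statusLastModifiedOn = statusLastModifiedOn;
    }

    // Mirrors the three guard clauses in the batch UPDATE:
    // deleted_ts IS NULL, removed = true, lastmodifiedon < cutoff.
    boolean isEligibleForSoftDelete(Timestamp cutoff) {
        return deletedTs == null
            && statusRemoved
            && statusLastModifiedOn.before(cutoff);
    }
}
```

An entity is skipped if any one of the three conditions fails, which is why the UPDATE is safe even when a caller skips the SELECT validation step.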

Test plan

  • EbeanLocalAccessTest: happy path, empty input, non-existent URNs, mixed found/not-found, already soft-deleted, guard clauses (status not removed, retention not met, already deleted), mixed eligibility batch
  • InstrumentedEbeanLocalAccessTest: delegation and latency recording for both new methods
  • All existing tests pass (no modifications to existing methods)
  • metadata-models mint build succeeds with datahub-gma 0.6.171

Natallia Ulashchick and others added 2 commits March 17, 2026 14:39
Add readDeletionInfoBatch() and batchSoftDeleteAssets() to IEbeanLocalAccess
to support bulk soft-deletion with exactly 2 DB round-trips per batch.

- EntityDeletionInfo: new @Value @Builder data class in dao-api holding
  deletion eligibility fields and aspect columns for Kafka archival
- readDeletionInfoBatch: single SELECT * to read all URNs and parse
  a_status for statusRemoved/statusLastModifiedOn
- batchSoftDeleteAssets: single guarded UPDATE with defense-in-depth
  WHERE clauses (deleted_ts IS NULL, removed=true, lastmodifiedon < cutoff)
- SQL templates in SQLStatementUtils, SqlRow parsing in EBeanDAOUtils
  (convertSqlRowsToEntityDeletionInfoMap / toEntityDeletionInfo)
- InstrumentedEbeanLocalAccess: delegation with instrument() wrapper
- Integration tests in EbeanLocalAccessTest (11 tests against embedded
  MariaDB) and mock-based tests in InstrumentedEbeanLocalAccessTest (2)

Part of META-23501 -- Metadata Graph Stale Metadata Cleanup Phase 2.
DAO layer is intentionally Kafka-free; archival lives in metadata-graph-assets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e and spotless

- Revert EmbeddedMariaInstance Apple Silicon config back to commented-out
  (was activated for local testing only, must not be in PR)
- Rename test methods to camelCase to satisfy checkstyle method naming rule
- Apply spotless formatting to CLAUDE.md and spec/batch_delete_dao_changes.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 74.41860% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.42%. Comparing base (2fac19d) to head (69e341f).

Files with missing lines Patch % Lines
...va/com/linkedin/metadata/dao/EbeanLocalAccess.java 70.83% 4 Missing and 3 partials ⚠️
.../com/linkedin/metadata/dao/EntityDeletionInfo.java 0.00% 6 Missing ⚠️
...com/linkedin/metadata/dao/utils/EBeanDAOUtils.java 83.87% 2 Missing and 3 partials ⚠️
.../java/com/linkedin/metadata/dao/EbeanLocalDAO.java 0.00% 2 Missing ⚠️
...linkedin/metadata/dao/utils/SQLStatementUtils.java 86.66% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master     #604      +/-   ##
============================================
+ Coverage     65.31%   65.42%   +0.11%     
- Complexity     1717     1744      +27     
============================================
  Files           143      144       +1     
  Lines          6711     6797      +86     
  Branches        809      820      +11     
============================================
+ Hits           4383     4447      +64     
- Misses         1977     1991      +14     
- Partials        351      359       +8     

Natallia Ulashchick and others added 2 commits March 18, 2026 10:22
Add format validation in createBatchSoftDeleteAssetSql() that rejects
any cutoffTimestamp not matching yyyy-MM-dd HH:mm:ss.SSS pattern with
IllegalArgumentException. Defense-in-depth for shared library API surface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
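The format validation this commit describes can be sketched as a standalone helper. The pattern string matches the `TIMESTAMP_FORMAT_PATTERN` shown later in the diff; the method name and standalone class are illustrative, since the real check lives inside `createBatchSoftDeleteAssetSql()`:

```java
import java.util.regex.Pattern;

// Sketch of the cutoffTimestamp format check described above.
// Rejects anything not matching yyyy-MM-dd HH:mm:ss.SSS before the
// value is string-interpolated into SQL.
class CutoffValidator {
    private static final Pattern TIMESTAMP_FORMAT =
        Pattern.compile("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}");

    static void validateCutoff(String cutoffTimestamp) {
        if (cutoffTimestamp == null || !TIMESTAMP_FORMAT.matcher(cutoffTimestamp).matches()) {
            throw new IllegalArgumentException(
                "cutoffTimestamp must match yyyy-MM-dd HH:mm:ss.SSS, got: " + cutoffTimestamp);
        }
    }
}
```

Because `matches()` requires the whole input to match, an injection payload appended to an otherwise valid timestamp is also rejected.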
Add readDeletionInfoBatch() and batchSoftDeleteAssets() delegation
methods so BaseAssetServiceImpl can access the new DAO operations
through EbeanLocalDAO (which it already uses for all other operations).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
each entity is eligible for deletion:

- `deletedTs` — whether the entity is already soft-deleted
- `statusRemoved` — whether the entity's Status aspect has `removed = true`
Contributor

Not all assets will have a Status aspect, how does your solution address / return for the case where a customer tries to run this on an asset without this aspect?

Contributor Author

Great catch. I will add a check at the API level to fail if the asset to be deleted doesn't have a corresponding Status aspect field. I will also make the status field a param passed to the DAO layer, to handle different names for the status field.

regardless of batch size.

The DAO layer is intentionally kept simple: pure SQL, no business logic, no Kafka. The consuming service
(`metadata-graph-assets`) handles validation, Kafka archival, and per-URN result reporting.
Contributor

IIUC, the idea here is that based on the values retrieved from this, some Kafka event will be emitted to "cold-store" the old data?

Contributor Author

Yes, correct. Basically, we want to archive every row we delete as a backup.

- **Exactly 2 DB calls per batch.** No per-URN queries.
- **Guard clauses in the UPDATE.** Even if a caller skips the SELECT validation, the UPDATE will not soft-delete
entities that don't meet all safety conditions.
- **`a_status` column is the only schema assumption.** The two methods rely on `a_status` existing in the entity table.
Contributor

There's also the case where an asset can have a Status aspect but have it be named differently in column form.

ie. in the Asset.proto model:

...
proto.com.linkedin.common.Status foo_bar = 1;
...

this means that the expectation for the DB column name would be a_foo_bar.

I think there is a method that allows you to "find the column name" based on an entity-Aspect type pairing, can you refactor the logic to go through this so that this assumption isn't made, effectively?

Contributor Author

Thanks for catching this! I will refactor the logic.

// Delete prefix of the sql statement for deleting from metadata_aspect table
public static final String SQL_SOFT_DELETE_ASSET_WITH_URN = "UPDATE %s SET deleted_ts = NOW() WHERE urn = '%s';";

private static final String SQL_READ_ALL_COLUMNS_BY_URNS_TEMPLATE = "SELECT * FROM %s WHERE urn IN (%s)";
Contributor

Curious if you considered reusing the existing BatchGet functionality so as to not rewrite the DB layer operations in order to obtain this metadata?

What are the pros / cons of this operation approach?

Contributor Author (NatalliaUlashchick, Mar 24, 2026)

I considered it. Pros/cons:

Reusing BatchGet:

  • (+) No new SQL, reuses proven code path
  • (-) BatchGet returns fully deserialized RecordTemplate objects via the Pegasus stack — heavy deserialization for data we only need as raw strings for Kafka archival
  • (-) BatchGet doesn't return deleted_ts or raw column values — it's designed for the read path, not deletion eligibility checks
  • (-) Would need to call BatchGet + additional queries for deleted_ts and lastmodifiedon, adding one more DB call.

+ " WHERE urn IN (%s)"
+ " AND deleted_ts IS NULL"
+ " AND JSON_EXTRACT(a_status, '$.aspect.removed') = true"
+ " AND JSON_EXTRACT(a_status, '$.lastmodifiedon') < '%s'";
Contributor

what does lastmodifiedon < %s hope to achieve why is this check needed?

Contributor Author

Deletion eligibility requires spending N days (30) in removed status, to cover "flip-flop" cases. This clause is the very last safety check, ensuring we only soft-delete entities whose Status aspect was set to removed=true at least N days ago (the cutoffTimestamp). It prevents accidental deletion of entities that were recently marked as removed, giving owners a grace period to undo the removal.
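The retention arithmetic described here (now minus N days, rendered in the format the comparison expects) can be sketched as follows. The class and method names are illustrative, not from the PR:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch: compute the cutoffTimestamp for an N-day retention window.
// Entities whose Status lastmodifiedon is strictly before this cutoff
// have been in removed=true state for at least N days.
class CutoffCalculator {
    private static final DateTimeFormatter CUTOFF_FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    static String cutoffFor(LocalDateTime now, int retentionDays) {
        return now.minusDays(retentionDays).format(CUTOFF_FORMAT);
    }
}
```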

Contributor

Hmm ok so rewording in my words so I ensure I'm properly understanding 😂 (double check me pls 🙏 )

"This check checks to make sure that the lastmodified on field is strictly 'less than' the input value to ensure that it is 'at least' 30 days (or whatever cutoff is specified) so as to not accidentally delete recently removed datasets" 😁

Contributor Author

Yes, correct. Since getting the datasets' info and deleting them are two separate API calls, there is a tiny chance a dataset's status was updated in between.

.addEscape('\'', "''")
.addEscape('\\', "\\\\").build();

private static final String TIMESTAMP_FORMAT_PATTERN = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}";
Contributor

isn't deleted_ts in Long format?

Any advantage to using this kind of formatting instead?

Contributor Author

deleted_ts is indeed a BIGINT, but this regex validates the cutoffTimestamp parameter passed to batchSoftDeleteAssets, which is compared against JSON_EXTRACT(a_status, '$.lastmodifiedon'). The lastmodifiedon field inside the Status aspect JSON is stored as a string in yyyy-MM-dd HH:mm:ss.SSS format. The regex validates the input to prevent SQL injection, since the cutoff is string-interpolated into the query.

}

@Override
public Map<URN, EntityDeletionInfo> readDeletionInfoBatch(@Nonnull List<URN> urns, boolean isTestMode) {
Contributor

Any thought to include a batch size limit here at the kernel level so as to not accidentally overwhelm the DB?

Contributor Author

I have this check at API level already. Agree, let's add it here to be safe.
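The kernel-level guard agreed on here can be sketched in a few lines. The 2000 limit comes from a later commit in this PR; the class and method shape are illustrative:

```java
import java.util.List;

// Sketch of the DAO-level batch size guard discussed above, applied at
// the start of readDeletionInfoBatch and batchSoftDeleteAssets as
// defense-in-depth against overwhelming the DB.
class BatchSizeGuard {
    static final int MAX_BATCH_SIZE = 2000; // limit adopted later in this PR

    static void checkBatchSize(List<?> urns) {
        if (urns.size() > MAX_BATCH_SIZE) {
            throw new IllegalArgumentException(
                "Batch size " + urns.size() + " exceeds limit of " + MAX_BATCH_SIZE);
        }
    }
}
```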

Contributor

thanks for this!

// Delete prefix of the sql statement for deleting from metadata_aspect table
public static final String SQL_SOFT_DELETE_ASSET_WITH_URN = "UPDATE %s SET deleted_ts = NOW() WHERE urn = '%s';";

private static final String SQL_READ_ALL_COLUMNS_BY_URNS_TEMPLATE = "SELECT * FROM %s WHERE urn IN (%s)";
Contributor

  1. SELECT * is broader than needed

createReadAllColumnsByUrnsSql uses SELECT *. While the aspect columns are needed for Kafka archival, this pulls every column including indexes (i_*). A SELECT urn, deleted_ts, a_* pattern would be more precise, though SELECT * is simpler and more future-proof if columns are added.

I kind of agree with this — I think the derived columns (indexes, basically) can be omitted, which lowers the impact of large SELECT *'s a bit

Contributor Author

Yes, agree. Let me refactor the logic

Natallia Ulashchick and others added 5 commits March 24, 2026 13:40
Replace hardcoded `a_status` column name with a `statusColumnName`
parameter throughout the batch deletion DAO layer. Different entity
types may map the Status aspect to different column names (e.g.
`a_foo_bar` instead of `a_status`), so the caller must resolve and
pass the correct column name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…letion read

Replace SELECT * with an explicit column list (urn, deleted_ts, a_*)
in readDeletionInfoBatch to avoid fetching index columns (i_*) and
other derived columns unnecessarily. Uses SchemaValidatorUtil's cached
column metadata to resolve the column list at runtime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
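The explicit-column-list construction this commit describes can be sketched as below. Column discovery via SchemaValidatorUtil's cached metadata is replaced here by a plain list for illustration, and the class name is hypothetical:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of building the explicit column list described above: always
// select urn and deleted_ts, plus only the aspect (a_*) columns,
// skipping index (i_*) and other derived columns.
class DeletionReadSqlSketch {
    static String buildSelect(String tableName, List<String> allColumns, String urnList) {
        String columns = Stream.concat(
                Stream.of("urn", "deleted_ts"),
                allColumns.stream().filter(c -> c.startsWith("a_")))
            .collect(Collectors.joining(", "));
        return String.format("SELECT %s FROM %s WHERE urn IN (%s)", columns, tableName, urnList);
    }
}
```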
Reject batches exceeding 2000 URNs in readDeletionInfoBatch and
batchSoftDeleteAssets as defense-in-depth against overwhelming the DB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…it SELECT

Update spec and Javadocs to reflect that readDeletionInfoBatch uses an
explicit column list (not SELECT *), statusColumnName is caller-provided
(not hardcoded a_status), and both methods enforce a 2000 URN batch
size limit. Fix missing trailing newline in test file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jphui
Contributor

jphui commented Mar 26, 2026

Thanks for addressing my comments, from my understanding the solution to the "Aspect Status" issues:

  1. ensure that Asset has Status
  2. find the column name

is delegated to the application layer.

I think there is sufficient utility in the codebase to just do it all here. BUT that being said if there is a strong reason to delegate totally open to hearing it out.

======

Aspect Class → Column Name Mapping

The primary utility is SQLSchemaUtils in
/home/coder/datahub-gma/dao-impl/ebean-dao/src/main/java/com/linkedin/metadata/dao/utils/SQLSchemaUtils.java.

Key Methods

  • getAspectColumnName(String entityType, String aspectCanonicalName) → String — e.g. ("dataset",
    "com.linkedin.common.Status") → "a_status"
  • getAspectColumnName(String entityType, Class aspectClass) → String — Generic overload taking a Class object
  • getAspectColumnNameWithUrnCheck(String assetType, String aspectCanonicalName) → String — Returns "a_urn" for Urn
    types, otherwise delegates

...

ModelUtils — Entity → associated aspects:

  • getValidAspectTypes(Class<ASPECT_UNION> aspectUnionClass) → Set<Class<? extends RecordTemplate>> — All aspect
    classes from an aspect union
  • getAspectTypesFromAssetType(Class assetClass) → List<Class<? extends RecordTemplate>> — All aspect classes
    from an asset class
  • getAspectClassNames(Class<ASPECT_UNION> unionClass) → List — Aspect FQCNs as strings

@NatalliaUlashchick
Contributor Author

On second thought, I like your idea. We could still have a check at the API level that the Status aspect exists for the given asset and fail fast if it doesn't, but we will leave all the internal mapping of DB column names to the DAO layer, where it belongs. I refactored both PRs.

Remove statusColumnName parameter from public DAO API. The Status
aspect column name is now resolved internally via
SQLSchemaUtils.getAspectColumnName(), consistent with how all other
DAO methods resolve column names.

Add test stub com.linkedin.common.Status PDL and FooAsset PDL so that
the test entity type can resolve the Status column name through the
standard GlobalAssetRegistry path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@NatalliaUlashchick NatalliaUlashchick force-pushed the naulashc/batch-delete-api branch from f759a2a to 69e341f Compare March 27, 2026 16:21
@NatalliaUlashchick NatalliaUlashchick merged commit b311a97 into master Mar 27, 2026
2 checks passed