
Add batch deletion DAO support for stale metadata cleanup#604

Merged
NatalliaUlashchick merged 10 commits into master from naulashc/batch-delete-api
Mar 27, 2026
Conversation

@NatalliaUlashchick
Contributor

@NatalliaUlashchick NatalliaUlashchick commented Mar 18, 2026

Summary

  • Add readDeletionInfoBatch() and batchSoftDeleteAssets() to IEbeanLocalAccess to support bulk soft-deletion with exactly 2 DB round-trips per batch
  • New EntityDeletionInfo data class in dao-api holding deletion eligibility fields and aspect column data for Kafka archival
  • batchSoftDeleteAssets UPDATE embeds all guard clauses (deleted_ts IS NULL, Status.removed = true, lastmodifiedon < cutoff) as defense-in-depth against race conditions
  • 11 integration tests against embedded MariaDB + 2 mock-based instrumentation tests

Part of META-23501 — Metadata Graph Stale Metadata Cleanup Phase 2.

The Batch Delete API (PR-803) is in draft, pending this PR.
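The eligibility rules the summary describes (the three guard clauses embedded in the UPDATE) can be sketched in plain Java. This is an illustrative shape only — the real `EntityDeletionInfo` in dao-api is a Lombok `@Value`/`@Builder` class with additional aspect-column data, and the field names here are taken from the PR text:

```java
import java.sql.Timestamp;

// Sketch of the deletion-eligibility fields described in the summary.
// Hypothetical class shape; the real EntityDeletionInfo also carries
// raw aspect column values for Kafka archival.
class EntityDeletionInfoSketch {
    final Timestamp deletedTs;            // non-null => already soft-deleted
    final boolean statusRemoved;          // Status.removed parsed from the status aspect JSON
    final Timestamp statusLastModifiedOn; // when the Status aspect was last modified

    EntityDeletionInfoSketch(Timestamp deletedTs, boolean statusRemoved,
                             Timestamp statusLastModifiedOn) {
        this.deletedTs = deletedTs;
        this.statusRemoved = statusRemoved;
        this.statusLastModifiedOn = statusLastModifiedOn;
    }

    // Mirrors the three guard clauses in the batch UPDATE:
    // deleted_ts IS NULL, removed = true, lastmodifiedon < cutoff.
    boolean isEligibleForSoftDelete(Timestamp cutoff) {
        return deletedTs == null
            && statusRemoved
            && statusLastModifiedOn.before(cutoff);
    }
}
```

An entity is skipped if any one of the three conditions fails, which is why the UPDATE is safe even when a caller skips the SELECT validation step.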

Test plan

  • EbeanLocalAccessTest: happy path, empty input, non-existent URNs, mixed found/not-found, already soft-deleted, guard clauses (status not removed, retention not met, already deleted), mixed eligibility batch
  • InstrumentedEbeanLocalAccessTest: delegation and latency recording for both new methods
  • All existing tests pass (no modifications to existing methods)
  • metadata-models mint build succeeds with datahub-gma 0.6.171

Natallia Ulashchick and others added 2 commits March 17, 2026 14:39
Add readDeletionInfoBatch() and batchSoftDeleteAssets() to IEbeanLocalAccess
to support bulk soft-deletion with exactly 2 DB round-trips per batch.

- EntityDeletionInfo: new @Value @Builder data class in dao-api holding
  deletion eligibility fields and aspect columns for Kafka archival
- readDeletionInfoBatch: single SELECT * to read all URNs and parse
  a_status for statusRemoved/statusLastModifiedOn
- batchSoftDeleteAssets: single guarded UPDATE with defense-in-depth
  WHERE clauses (deleted_ts IS NULL, removed=true, lastmodifiedon < cutoff)
- SQL templates in SQLStatementUtils, SqlRow parsing in EBeanDAOUtils
  (convertSqlRowsToEntityDeletionInfoMap / toEntityDeletionInfo)
- InstrumentedEbeanLocalAccess: delegation with instrument() wrapper
- Integration tests in EbeanLocalAccessTest (11 tests against embedded
  MariaDB) and mock-based tests in InstrumentedEbeanLocalAccessTest (2)

Part of META-23501 -- Metadata Graph Stale Metadata Cleanup Phase 2.
DAO layer is intentionally Kafka-free; archival lives in metadata-graph-assets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e and spotless

- Revert EmbeddedMariaInstance Apple Silicon config back to commented-out
  (was activated for local testing only, must not be in PR)
- Rename test methods to camelCase to satisfy checkstyle method naming rule
- Apply spotless formatting to CLAUDE.md and spec/batch_delete_dao_changes.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 74.41860% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.42%. Comparing base (2fac19d) to head (69e341f).

Files with missing lines Patch % Lines
...va/com/linkedin/metadata/dao/EbeanLocalAccess.java 70.83% 4 Missing and 3 partials ⚠️
.../com/linkedin/metadata/dao/EntityDeletionInfo.java 0.00% 6 Missing ⚠️
...com/linkedin/metadata/dao/utils/EBeanDAOUtils.java 83.87% 2 Missing and 3 partials ⚠️
.../java/com/linkedin/metadata/dao/EbeanLocalDAO.java 0.00% 2 Missing ⚠️
...linkedin/metadata/dao/utils/SQLStatementUtils.java 86.66% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master     #604      +/-   ##
============================================
+ Coverage     65.31%   65.42%   +0.11%     
- Complexity     1717     1744      +27     
============================================
  Files           143      144       +1     
  Lines          6711     6797      +86     
  Branches        809      820      +11     
============================================
+ Hits           4383     4447      +64     
- Misses         1977     1991      +14     
- Partials        351      359       +8     

Natallia Ulashchick and others added 2 commits March 18, 2026 10:22
Add format validation in createBatchSoftDeleteAssetSql() that rejects
any cutoffTimestamp not matching yyyy-MM-dd HH:mm:ss.SSS pattern with
IllegalArgumentException. Defense-in-depth for shared library API surface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
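The format validation this commit describes can be sketched as a standalone helper. The pattern string matches the `TIMESTAMP_FORMAT_PATTERN` shown later in the diff; the method name and standalone class are illustrative, since the real check lives inside `createBatchSoftDeleteAssetSql()`:

```java
import java.util.regex.Pattern;

// Sketch of the cutoffTimestamp format check described above.
// Rejects anything not matching yyyy-MM-dd HH:mm:ss.SSS before the
// value is string-interpolated into SQL.
class CutoffValidator {
    private static final Pattern TIMESTAMP_FORMAT =
        Pattern.compile("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}");

    static void validateCutoff(String cutoffTimestamp) {
        if (cutoffTimestamp == null || !TIMESTAMP_FORMAT.matcher(cutoffTimestamp).matches()) {
            throw new IllegalArgumentException(
                "cutoffTimestamp must match yyyy-MM-dd HH:mm:ss.SSS, got: " + cutoffTimestamp);
        }
    }
}
```

Because `matches()` requires the whole input to match, an injection payload appended to an otherwise valid timestamp is also rejected.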
Add readDeletionInfoBatch() and batchSoftDeleteAssets() delegation
methods so BaseAssetServiceImpl can access the new DAO operations
through EbeanLocalDAO (which it already uses for all other operations).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
each entity is eligible for deletion:

- `deletedTs` — whether the entity is already soft-deleted
- `statusRemoved` — whether the entity's Status aspect has `removed = true`
Contributor

Not all assets will have a Status aspect, how does your solution address / return for the case where a customer tries to run this on an asset without this aspect?

Contributor Author

Great catch. I will add a check at the API level to fail if the asset to be deleted doesn't have a corresponding Status aspect field. I will also make the status field a param passed to the DAO layer, to handle different names for the status field.

regardless of batch size.

The DAO layer is intentionally kept simple: pure SQL, no business logic, no Kafka. The consuming service
(`metadata-graph-assets`) handles validation, Kafka archival, and per-URN result reporting.
Contributor

IIUC, the idea here is that based on the values retrieved from this, some Kafka event will be emitted to "cold-store" the old data?

Contributor Author

Yes, correct. Basically, we want to archive every row we delete as a backup.

- **Exactly 2 DB calls per batch.** No per-URN queries.
- **Guard clauses in the UPDATE.** Even if a caller skips the SELECT validation, the UPDATE will not soft-delete
entities that don't meet all safety conditions.
- **`a_status` column is the only schema assumption.** The two methods rely on `a_status` existing in the entity table.
Contributor

There's also the case where an asset can have a Status aspect but have it be named differently in column form.

ie. in the Asset.proto model:

...
proto.com.linkedin.common.Status foo_bar = 1;
...

this means that the expectation for the DB column name would be a_foo_bar.

I think there is a method that allows you to "find the column name" based on an entity-Aspect type pairing, can you refactor the logic to go through this so that this assumption isn't made, effectively?

Contributor Author

Thanks for catching this! I will refactor the logic.

// Delete prefix of the sql statement for deleting from metadata_aspect table
public static final String SQL_SOFT_DELETE_ASSET_WITH_URN = "UPDATE %s SET deleted_ts = NOW() WHERE urn = '%s';";

private static final String SQL_READ_ALL_COLUMNS_BY_URNS_TEMPLATE = "SELECT * FROM %s WHERE urn IN (%s)";
Contributor

Curious if you considered reusing the existing BatchGet functionality so as to not rewrite the DB layer operations in order to obtain this metadata?

What are the pros / cons of this operation approach?

Contributor Author (NatalliaUlashchick, Mar 24, 2026)

I considered it. Pros/cons:

Reusing BatchGet:

  • (+) No new SQL, reuses proven code path
  • (-) BatchGet returns fully deserialized RecordTemplate objects via the Pegasus stack — heavy deserialization for data we only need as raw strings for Kafka archival
  • (-) BatchGet doesn't return deleted_ts or raw column values — it's designed for the read path, not deletion eligibility checks
  • (-) Would need to call BatchGet + additional queries for deleted_ts and lastmodifiedon, adding one more DB call.

+ " WHERE urn IN (%s)"
+ " AND deleted_ts IS NULL"
+ " AND JSON_EXTRACT(a_status, '$.aspect.removed') = true"
+ " AND JSON_EXTRACT(a_status, '$.lastmodifiedon') < '%s'";
Contributor

what does lastmodifiedon < %s hope to achieve why is this check needed?

Contributor Author

Deletion eligibility requires spending N days (30) in removed status, to cover "flip-flop" cases. This clause is the very last safety check, ensuring we only soft-delete entities whose Status aspect was set to removed=true at least N days ago (the cutoffTimestamp). It prevents accidental deletion of entities that were recently marked as removed, giving owners a grace period to undo the removal.
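The retention arithmetic described here (now minus N days, rendered in the format the comparison expects) can be sketched as follows. The class and method names are illustrative, not from the PR:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch: compute the cutoffTimestamp for an N-day retention window.
// Entities whose Status lastmodifiedon is strictly before this cutoff
// have been in removed=true state for at least N days.
class CutoffCalculator {
    private static final DateTimeFormatter CUTOFF_FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    static String cutoffFor(LocalDateTime now, int retentionDays) {
        return now.minusDays(retentionDays).format(CUTOFF_FORMAT);
    }
}
```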

Contributor

Hmm ok so rewording in my words so I ensure I'm properly understanding 😂 (double check me pls 🙏 )

"This check checks to make sure that the lastmodified on field is strictly 'less than' the input value to ensure that it is 'at least' 30 days (or whatever cutoff is specified) so as to not accidentally delete recently removed datasets" 😁

Contributor Author

Yes, correct. Since getting the datasets' info and deleting them are two separate API calls, there is a tiny chance a dataset's status was updated in between.

.addEscape('\'', "''")
.addEscape('\\', "\\\\").build();

private static final String TIMESTAMP_FORMAT_PATTERN = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}";
Contributor

isn't deleted_ts in Long format?

Any advantage to using this kind of formatting instead?

Contributor Author

deleted_ts is indeed a BIGINT, but this regex validates the cutoffTimestamp parameter passed to batchSoftDeleteAssets, which is compared against JSON_EXTRACT(a_status, '$.lastmodifiedon'). The lastmodifiedon field inside the Status aspect JSON is stored as a string in yyyy-MM-dd HH:mm:ss.SSS format. The regex validates the input to prevent SQL injection, since the cutoff is string-interpolated into the query.

}

@Override
public Map<URN, EntityDeletionInfo> readDeletionInfoBatch(@Nonnull List<URN> urns, boolean isTestMode) {
Contributor

Any thought to include a batch size limit here at the kernel level so as to not accidentally overwhelm the DB?

Contributor Author

I have this check at API level already. Agree, let's add it here to be safe.
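The kernel-level guard agreed on here can be sketched in a few lines. The 2000 limit comes from a later commit in this PR; the class and method shape are illustrative:

```java
import java.util.List;

// Sketch of the DAO-level batch size guard discussed above, applied at
// the start of readDeletionInfoBatch and batchSoftDeleteAssets as
// defense-in-depth against overwhelming the DB.
class BatchSizeGuard {
    static final int MAX_BATCH_SIZE = 2000; // limit adopted later in this PR

    static void checkBatchSize(List<?> urns) {
        if (urns.size() > MAX_BATCH_SIZE) {
            throw new IllegalArgumentException(
                "Batch size " + urns.size() + " exceeds limit of " + MAX_BATCH_SIZE);
        }
    }
}
```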

Contributor

thanks for this!

// Delete prefix of the sql statement for deleting from metadata_aspect table
public static final String SQL_SOFT_DELETE_ASSET_WITH_URN = "UPDATE %s SET deleted_ts = NOW() WHERE urn = '%s';";

private static final String SQL_READ_ALL_COLUMNS_BY_URNS_TEMPLATE = "SELECT * FROM %s WHERE urn IN (%s)";
Contributor

  1. SELECT * is broader than needed

createReadAllColumnsByUrnsSql uses SELECT *. While the aspect columns are needed for Kafka archival, this pulls every column including indexes (i_*). A SELECT urn, deleted_ts, a_* pattern would be more precise, though SELECT * is simpler and more future-proof if columns are added.

I kind of agree with this — I think the derived columns (indexes, basically) can be omitted, which lowers the impact of large SELECT *'s a bit

Contributor Author

Yes, agree. Let me refactor the logic

Natallia Ulashchick and others added 5 commits March 24, 2026 13:40
Replace hardcoded `a_status` column name with a `statusColumnName`
parameter throughout the batch deletion DAO layer. Different entity
types may map the Status aspect to different column names (e.g.
`a_foo_bar` instead of `a_status`), so the caller must resolve and
pass the correct column name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…letion read

Replace SELECT * with an explicit column list (urn, deleted_ts, a_*)
in readDeletionInfoBatch to avoid fetching index columns (i_*) and
other derived columns unnecessarily. Uses SchemaValidatorUtil's cached
column metadata to resolve the column list at runtime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
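The explicit-column-list construction this commit describes can be sketched as below. Column discovery via SchemaValidatorUtil's cached metadata is replaced here by a plain list for illustration, and the class name is hypothetical:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of building the explicit column list described above: always
// select urn and deleted_ts, plus only the aspect (a_*) columns,
// skipping index (i_*) and other derived columns.
class DeletionReadSqlSketch {
    static String buildSelect(String tableName, List<String> allColumns, String urnList) {
        String columns = Stream.concat(
                Stream.of("urn", "deleted_ts"),
                allColumns.stream().filter(c -> c.startsWith("a_")))
            .collect(Collectors.joining(", "));
        return String.format("SELECT %s FROM %s WHERE urn IN (%s)", columns, tableName, urnList);
    }
}
```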
Reject batches exceeding 2000 URNs in readDeletionInfoBatch and
batchSoftDeleteAssets as defense-in-depth against overwhelming the DB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…it SELECT

Update spec and Javadocs to reflect that readDeletionInfoBatch uses an
explicit column list (not SELECT *), statusColumnName is caller-provided
(not hardcoded a_status), and both methods enforce a 2000 URN batch
size limit. Fix missing trailing newline in test file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jphui
Contributor

jphui commented Mar 26, 2026

Thanks for addressing my comments, from my understanding the solution to the "Aspect Status" issues:

  1. ensure that Asset has Status
  2. find the column name

is delegated to the application layer.

I think there is sufficient utility in the codebase to just do it all here. BUT that being said if there is a strong reason to delegate totally open to hearing it out.

======

Aspect Class → Column Name Mapping

The primary utility is SQLSchemaUtils in
/home/coder/datahub-gma/dao-impl/ebean-dao/src/main/java/com/linkedin/metadata/dao/utils/SQLSchemaUtils.java.

Key Methods

  • getAspectColumnName(String entityType, String aspectCanonicalName) → String — e.g. ("dataset",
    "com.linkedin.common.Status") → "a_status"
  • getAspectColumnName(String entityType, Class aspectClass) → String — Generic overload taking a Class object
  • getAspectColumnNameWithUrnCheck(String assetType, String aspectCanonicalName) → String — Returns "a_urn" for Urn
    types, otherwise delegates

...

ModelUtils — Entity → associated aspects:

  • getValidAspectTypes(Class<ASPECT_UNION> aspectUnionClass) → Set<Class<? extends RecordTemplate>> — All aspect
    classes from an aspect union
  • getAspectTypesFromAssetType(Class assetClass) → List<Class<? extends RecordTemplate>> — All aspect classes
    from an asset class
  • getAspectClassNames(Class<ASPECT_UNION> unionClass) → List — Aspect FQCNs as strings

@NatalliaUlashchick
Contributor Author

On second thought, I like your idea. We could still have a check at the API level that the Status aspect exists for the given asset and fail fast if it doesn't, but we will leave all the internal mapping of DB column names to the DAO layer, where it belongs. I refactored both PRs.

Remove statusColumnName parameter from public DAO API. The Status
aspect column name is now resolved internally via
SQLSchemaUtils.getAspectColumnName(), consistent with how all other
DAO methods resolve column names.

Add test stub com.linkedin.common.Status PDL and FooAsset PDL so that
the test entity type can resolve the Status column name through the
standard GlobalAssetRegistry path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@NatalliaUlashchick NatalliaUlashchick force-pushed the naulashc/batch-delete-api branch from f759a2a to 69e341f Compare March 27, 2026 16:21
@NatalliaUlashchick NatalliaUlashchick merged commit b311a97 into master Mar 27, 2026
2 checks passed