Add batch deletion DAO support for stale metadata cleanup #604
NatalliaUlashchick merged 10 commits into master from
Conversation
Add readDeletionInfoBatch() and batchSoftDeleteAssets() to IEbeanLocalAccess to support bulk soft-deletion with exactly 2 DB round-trips per batch.

- EntityDeletionInfo: new @Value @Builder data class in dao-api holding deletion eligibility fields and aspect columns for Kafka archival
- readDeletionInfoBatch: single SELECT * to read all URNs and parse a_status for statusRemoved/statusLastModifiedOn
- batchSoftDeleteAssets: single guarded UPDATE with defense-in-depth WHERE clauses (deleted_ts IS NULL, removed=true, lastmodifiedon < cutoff)
- SQL templates in SQLStatementUtils, SqlRow parsing in EBeanDAOUtils (convertSqlRowsToEntityDeletionInfoMap / toEntityDeletionInfo)
- InstrumentedEbeanLocalAccess: delegation with instrument() wrapper
- Integration tests in EbeanLocalAccessTest (11 tests against embedded MariaDB) and mock-based tests in InstrumentedEbeanLocalAccessTest (2)

Part of META-23501 -- Metadata Graph Stale Metadata Cleanup Phase 2. The DAO layer is intentionally Kafka-free; archival lives in metadata-graph-assets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
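The commit above hinges on a per-entity eligibility decision fed by the SELECT round-trip. A minimal sketch of that decision, using illustrative names rather than the PR's exact API:

```java
import java.sql.Timestamp;

// Sketch of the per-entity eligibility check that the first round-trip feeds.
// Class and method names are hypothetical, not the PR's actual API.
final class DeletionEligibility {

  /**
   * @param deletedTs            non-null if the entity is already soft-deleted
   * @param statusRemoved        the Status aspect's removed flag (null if no Status aspect)
   * @param statusLastModifiedOn when the Status aspect was last modified (null if unknown)
   * @param cutoff               entities modified at or after this instant are kept
   */
  static boolean isEligible(Timestamp deletedTs, Boolean statusRemoved,
      Timestamp statusLastModifiedOn, Timestamp cutoff) {
    return deletedTs == null                      // not already soft-deleted
        && Boolean.TRUE.equals(statusRemoved)     // Status aspect says removed = true
        && statusLastModifiedOn != null
        && statusLastModifiedOn.before(cutoff);   // removed at least N days ago
  }
}
```

The same three conditions reappear as WHERE-clause guards in the UPDATE, so an entity that slips past this check still cannot be deleted.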
…e and spotless

- Revert EmbeddedMariaInstance Apple Silicon config back to commented-out (was activated for local testing only, must not be in PR)
- Rename test methods to camelCase to satisfy checkstyle method naming rule
- Apply spotless formatting to CLAUDE.md and spec/batch_delete_dao_changes.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report

❌ Patch coverage — additional details and impacted files:

@@ Coverage Diff @@
## master #604 +/- ##
============================================
+ Coverage 65.31% 65.42% +0.11%
- Complexity 1717 1744 +27
============================================
Files 143 144 +1
Lines 6711 6797 +86
Branches 809 820 +11
============================================
+ Hits 4383 4447 +64
- Misses 1977 1991 +14
- Partials 351 359 +8

☔ View full report in Codecov by Sentry.
Add format validation in createBatchSoftDeleteAssetSql() that rejects any cutoffTimestamp not matching yyyy-MM-dd HH:mm:ss.SSS pattern with IllegalArgumentException. Defense-in-depth for shared library API surface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add readDeletionInfoBatch() and batchSoftDeleteAssets() delegation methods so BaseAssetServiceImpl can access the new DAO operations through EbeanLocalDAO (which it already uses for all other operations). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
each entity is eligible for deletion:

- `deletedTs` — whether the entity is already soft-deleted
- `statusRemoved` — whether the entity's Status aspect has `removed = true`
Not all assets will have a Status aspect. How does your solution handle, and what does it return for, the case where a customer tries to run this on an asset without this aspect?
Great catch. I will add a check at the API level to fail if the asset to be deleted doesn't have a corresponding Status aspect field. I will also make the status field a parameter passed to the DAO layer, to handle different names for the status field.
regardless of batch size.

The DAO layer is intentionally kept simple: pure SQL, no business logic, no Kafka. The consuming service (`metadata-graph-assets`) handles validation, Kafka archival, and per-URN result reporting.
IIUC, the idea here is that, based on the values retrieved from this, some Kafka event will be emitted to "cold-store" the old data?
Yes, correct. Basically, we want to archive every row we delete as a backup.
spec/batch_delete_dao_changes.md
- **Exactly 2 DB calls per batch.** No per-URN queries.
- **Guard clauses in the UPDATE.** Even if a caller skips the SELECT validation, the UPDATE will not soft-delete entities that don't meet all safety conditions.
- **`a_status` column is the only schema assumption.** The two methods rely on `a_status` existing in the entity table.
There's also the case where an asset can have a Status aspect but have it be named differently in column form.
ie. in the Asset.proto model:
...
proto.com.linkedin.common.Status foo_bar = 1;
...
this means that the expected DB column name would be a_foo_bar.
I think there is a method that allows you to "find the column name" based on an entity-aspect type pairing; can you refactor the logic to go through this, so that this assumption isn't made, effectively?
Thanks for catching this! I will refactor the logic.
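To illustrate the reviewer's point, the aspect-to-column convention can be sketched as below. This is a hypothetical helper: the real resolution goes through a schema utility, and the `toColumnName` name and snake_casing rule here are illustrative assumptions, not the library's API.

```java
// Hypothetical sketch of resolving an aspect column name from its model field name,
// illustrating why "a_status" cannot be hardcoded: a Status aspect declared as
// foo_bar (or fooBar) maps to column a_foo_bar, not a_status.
final class AspectColumnResolver {
  static String toColumnName(String fieldName) {
    // Assumed convention: "a_" prefix, lower-camelCase converted to snake_case.
    StringBuilder sb = new StringBuilder("a_");
    for (char c : fieldName.toCharArray()) {
      if (Character.isUpperCase(c)) {
        sb.append('_').append(Character.toLowerCase(c));
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }
}
```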
// Delete prefix of the sql statement for deleting from metadata_aspect table
public static final String SQL_SOFT_DELETE_ASSET_WITH_URN = "UPDATE %s SET deleted_ts = NOW() WHERE urn = '%s';";

private static final String SQL_READ_ALL_COLUMNS_BY_URNS_TEMPLATE = "SELECT * FROM %s WHERE urn IN (%s)";
Curious if you considered reusing the existing BatchGet functionality so as to not rewrite the DB layer operations in order to obtain this metadata?
What are the pros / cons of this operation approach?
I considered it. Pros/cons:
Reusing BatchGet:
- (+) No new SQL, reuses proven code path
- (-) BatchGet returns fully deserialized RecordTemplate objects via the Pegasus stack — heavy deserialization for data we only need as raw strings for Kafka archival
- (-) BatchGet doesn't return deleted_ts or raw column values — it's designed for the read path, not deletion eligibility checks
- (-) Would need to call BatchGet + additional queries for deleted_ts and lastmodifiedon, adding one more DB call.
+ " WHERE urn IN (%s)"
+ " AND deleted_ts IS NULL"
+ " AND JSON_EXTRACT(a_status, '$.aspect.removed') = true"
+ " AND JSON_EXTRACT(a_status, '$.lastmodifiedon') < '%s'";
What does `lastmodifiedon < %s` hope to achieve? Why is this check needed?
Eligibility for deletion requires N days (30) in removed status, to cover "flip-flop" cases. This is the very last safety check: we only soft-delete entities whose Status aspect was set to removed=true at least N days ago (the cutoffTimestamp). It prevents accidental deletion of entities that were recently marked as removed, giving owners a grace period to undo the removal.
Hmm ok so rewording in my words so I ensure I'm properly understanding 😂 (double check me pls 🙏 )
"This check checks to make sure that the lastmodified on field is strictly 'less than' the input value to ensure that it is 'at least' 30 days (or whatever cutoff is specified) so as to not accidentally delete recently removed datasets" 😁
Yes, correct. Since getting the dataset info and deleting it are two separate API calls, there is a tiny chance the dataset's status was updated in between.
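For illustration, the guarded UPDATE can be composed from the template roughly as below. The table name, URN list, and cutoff are placeholder values; this sketch also parameterizes the status column (rather than hardcoding `a_status`), anticipating the column-name discussion elsewhere in this review, and the helper names are hypothetical.

```java
// Sketch of assembling the guarded batch soft-delete statement.
// All names and values here are illustrative, not the PR's exact code.
final class BatchSoftDeleteSql {
  private static final String TEMPLATE =
      "UPDATE %s SET deleted_ts = NOW()"
          + " WHERE urn IN (%s)"
          + " AND deleted_ts IS NULL"                          // not already soft-deleted
          + " AND JSON_EXTRACT(%s, '$.aspect.removed') = true" // Status says removed
          + " AND JSON_EXTRACT(%s, '$.lastmodifiedon') < '%s'"; // removed before cutoff

  static String build(String table, String urnInClause, String statusColumn, String cutoff) {
    return String.format(TEMPLATE, table, urnInClause, statusColumn, statusColumn, cutoff);
  }
}
```

Because all three guards live in the WHERE clause, a row whose status flipped back between the SELECT and the UPDATE simply fails to match and is left untouched.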
.addEscape('\'', "''")
.addEscape('\\', "\\\\").build();

private static final String TIMESTAMP_FORMAT_PATTERN = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}";
isn't deleted_ts in Long format?
Any advantage to using this kind of formatting instead?
deleted_ts is indeed a BIGINT, but this regex validates the cutoffTimestamp parameter passed to batchSoftDeleteAssets, which is compared against JSON_EXTRACT(a_status, '$.lastmodifiedon'). The lastmodifiedon field inside the Status aspect JSON is stored as a string in yyyy-MM-dd HH:mm:ss.SSS format.
The regex validates the input to prevent SQL injection since the cutoff is string-interpolated into the query.
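A minimal sketch of that validation, reusing the pattern shown above (the class and method names are hypothetical):

```java
import java.util.regex.Pattern;

// Sketch of the cutoff-format guard: the cutoff is string-interpolated into the
// SQL, so anything not matching yyyy-MM-dd HH:mm:ss.SSS is rejected up front.
final class CutoffValidator {
  private static final Pattern TIMESTAMP_FORMAT =
      Pattern.compile("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}");

  static void validate(String cutoffTimestamp) {
    if (cutoffTimestamp == null || !TIMESTAMP_FORMAT.matcher(cutoffTimestamp).matches()) {
      throw new IllegalArgumentException(
          "cutoffTimestamp must match yyyy-MM-dd HH:mm:ss.SSS, got: " + cutoffTimestamp);
    }
  }
}
```

A full `matches()` (not `find()`) matters here: it leaves no room for an injection payload appended after an otherwise valid timestamp.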
}

@Override
public Map<URN, EntityDeletionInfo> readDeletionInfoBatch(@Nonnull List<URN> urns, boolean isTestMode) {
Any thought to include a batch size limit here at the kernel level so as to not accidentally overwhelm the DB?
I have this check at API level already. Agree, let's add it here to be safe.
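The agreed-on DAO-level guard can be sketched as below. The 2000 cap mirrors the limit added in a later commit of this PR; the class and method names are hypothetical.

```java
import java.util.List;

// Sketch of a DAO-level batch cap so oversized batches fail fast instead of
// reaching the DB. Names are illustrative; 2000 matches this PR's later commit.
final class BatchSizeGuard {
  static final int MAX_BATCH_SIZE = 2000;

  static void checkBatchSize(List<?> urns) {
    if (urns.size() > MAX_BATCH_SIZE) {
      throw new IllegalArgumentException(
          "Batch size " + urns.size() + " exceeds limit of " + MAX_BATCH_SIZE);
    }
  }
}
```

Keeping the check in the DAO as well as the API layer means any future caller of the shared library gets the same protection.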
// Delete prefix of the sql statement for deleting from metadata_aspect table
public static final String SQL_SOFT_DELETE_ASSET_WITH_URN = "UPDATE %s SET deleted_ts = NOW() WHERE urn = '%s';";

private static final String SQL_READ_ALL_COLUMNS_BY_URNS_TEMPLATE = "SELECT * FROM %s WHERE urn IN (%s)";
- SELECT * is broader than needed: createReadAllColumnsByUrnsSql uses SELECT *. While the aspect columns are needed for Kafka archival, this pulls every column, including index columns (i_*). A SELECT urn, deleted_ts, a_* pattern would be more precise, though SELECT * is simpler and more future-proof if columns are added.

I kind of agree with this. I think the derived columns (the indexes, basically) can probably be omitted, which can lower the impact of large SELECT *'s a bit.
Yes, agree. Let me refactor the logic
Replace hardcoded `a_status` column name with a `statusColumnName` parameter throughout the batch deletion DAO layer. Different entity types may map the Status aspect to different column names (e.g. `a_foo_bar` instead of `a_status`), so the caller must resolve and pass the correct column name. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…letion read Replace SELECT * with an explicit column list (urn, deleted_ts, a_*) in readDeletionInfoBatch to avoid fetching index columns (i_*) and other derived columns unnecessarily. Uses SchemaValidatorUtil's cached column metadata to resolve the column list at runtime. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
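The explicit-column-list change can be sketched as a simple filter over the table's column names, as below. The real code resolves columns from cached schema metadata; this helper, its name, and the concrete column names in the example are illustrative assumptions.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of narrowing SELECT * to an explicit column list: keep urn, deleted_ts,
// and aspect columns (a_*); drop index columns (i_*) and other derived columns.
final class ColumnListBuilder {
  static String buildSelectList(List<String> allColumns) {
    return allColumns.stream()
        .filter(c -> c.equals("urn") || c.equals("deleted_ts") || c.startsWith("a_"))
        .collect(Collectors.joining(", "));
  }
}
```

The result drops in where `SELECT *` used to be, so widening the schema with new `a_*` aspect columns still flows through without code changes.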
Reject batches exceeding 2000 URNs in readDeletionInfoBatch and batchSoftDeleteAssets as defense-in-depth against overwhelming the DB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…it SELECT Update spec and Javadocs to reflect that readDeletionInfoBatch uses an explicit column list (not SELECT *), statusColumnName is caller-provided (not hardcoded a_status), and both methods enforce a 2000 URN batch size limit. Fix missing trailing newline in test file. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thanks for addressing my comments. From my understanding, the solution to the "Aspect Status" issue is delegated to the application layer. I think there is sufficient utility in the codebase to just do it all here. But that being said, if there is a strong reason to delegate, I'm totally open to hearing it out.

Aspect Class → Column Name Mapping: the primary utility is SQLSchemaUtils. Key methods:
...
ModelUtils — Entity → associated aspects:
On second thought, I like your idea. We could still have a check at the API level that the Status aspect exists for the given asset and fail fast if it doesn't, but we will leave all the internal mapping of DB column names to the DAO layer, where it belongs. I refactored both PRs.
Remove statusColumnName parameter from public DAO API. The Status aspect column name is now resolved internally via SQLSchemaUtils.getAspectColumnName(), consistent with how all other DAO methods resolve column names. Add test stub com.linkedin.common.Status PDL and FooAsset PDL so that the test entity type can resolve the Status column name through the standard GlobalAssetRegistry path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
f759a2a to 69e341f
Summary
- Add `readDeletionInfoBatch()` and `batchSoftDeleteAssets()` to `IEbeanLocalAccess` to support bulk soft-deletion with exactly 2 DB round-trips per batch
- New `EntityDeletionInfo` data class in `dao-api` holding deletion eligibility fields and aspect column data for Kafka archival
- The `batchSoftDeleteAssets` UPDATE embeds all guard clauses (`deleted_ts IS NULL`, `Status.removed = true`, `lastmodifiedon < cutoff`) as defense-in-depth against race conditions

Part of META-23501 — Metadata Graph Stale Metadata Cleanup Phase 2.
The Batch Delete API PR-803 is in draft status depending on this PR.
Test plan
- `EbeanLocalAccessTest`: happy path, empty input, non-existent URNs, mixed found/not-found, already soft-deleted, guard clauses (status not removed, retention not met, already deleted), mixed eligibility batch
- `InstrumentedEbeanLocalAccessTest`: delegation and latency recording for both new methods
- `metadata-models` mint build succeeds with `datahub-gma 0.6.171`