
Conversation

@harshach commented Jan 27, 2026

Describe your changes:

Fixes

I worked on ... because ...


Summary by Gitar

This PR implements batched CSV import/export operations to significantly improve performance for large datasets:

  • Batch Processing: Queues database operations and flushes every 100 records using insertMany()/updateMany() instead of individual INSERT/UPDATE statements (see the sketch after this list)
  • Bulk Search Indexing: Replaces individual Elasticsearch updates with bulk API calls via updateEntitiesBulk()
  • Optimized Table Updates: Batches multiple column updates into a single PATCH operation per table instead of one per column
  • Progress Reporting: Added WebSocket-based real-time progress callbacks with UI progress bars for both import and export operations
  • Graceful Degradation: Falls back to individual operations if batch operations fail, ensuring reliability
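
As a rough sketch of the queue-and-flush pattern described above (the insertMany()/updateMany() calls, the 100-record batch size, and the fallback behavior come from this summary; the surrounding class, field names, and dao are hypothetical):

  private static final int BATCH_SIZE = 100;
  private final List<EntityInterface> pendingCreates = new ArrayList<>();
  private final List<EntityInterface> pendingUpdates = new ArrayList<>();

  // Queue a row instead of issuing an INSERT immediately; flush once the batch fills.
  void queueCreate(EntityInterface entity) {
    pendingCreates.add(entity);
    if (pendingCreates.size() + pendingUpdates.size() >= BATCH_SIZE) {
      flushPendingEntityOperations();
    }
  }

  void flushPendingEntityOperations() {
    try {
      if (!pendingCreates.isEmpty()) dao.insertMany(pendingCreates); // one bulk INSERT
      if (!pendingUpdates.isEmpty()) dao.updateMany(pendingUpdates); // one bulk UPDATE
    } catch (Exception e) {
      fallbackToIndividualOperations(); // graceful degradation: retry row by row (hypothetical helper)
    } finally {
      pendingCreates.clear();
      pendingUpdates.clear();
    }
  }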

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

… circular dependency, generated changeEvents (#25582)

* Fix tag clearing and circular dependency detection in batch CSV imports

  - **Tag clearing fix**: Add deleteTagsByTarget before applying new tags in batch imports to match single-entity import behavior, ensuring empty CSV fields properly clear existing tags (sketched below)
  - **Circular dependency detection fix**: Pre-track entities in dryRunCreatedEntities before parent resolution to enable proper circular-reference validation during CSV team imports
  - Resolves test failures in TeamResourceIT.test_importCsv_circularDependency_trueRun and tag-related import issues
  - Maintains batch import performance while restoring pre-batch-import validation contracts
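
A minimal sketch of the tag-clearing order (deleteTagsByTarget is named in the commit message; the DAO accessor and the applyTags call are assumptions):

  // Delete existing tag usages first so an empty CSV tags field clears them,
  // matching the single-entity import path.
  daoCollection.tagUsageDAO().deleteTagsByTarget(entity.getFullyQualifiedName());
  if (!nullOrEmpty(csvTags)) {
    applyTags(csvTags, entity.getFullyQualifiedName()); // then apply the tags from the CSV row
  }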

* Improve storeRelationshipsInternal and related internal methods - make them truly batched operations

* - Add storeEntities override to all repositories (57 repos)
  - Add batch lock check to HierarchicalLockManager
  - Add batch cache write to EntityRepository
  - Fix createManyEntitiesForImport with batched operations
  - Fix updateManyEntitiesForImport with batched operations
  - Add change event creation in flushPendingEntityOperations

---------

Co-authored-by: sonika-shah <[email protected]>
gitar-bot commented Jan 30, 2026

🔍 CI failure analysis for 8486d24: Maven SonarCloud CI (MySQL) shows 4 failing tests (99.9% pass rate) - the same infrastructure issues as the PostgreSQL CI. The Test Report job failed as a downstream consequence. The critical TeamResourceTest bug remains fixed on both database backends.

Issue

Maven SonarCloud CI (job 62003021906, MySQL backend) shows 1 failure and 3 errors out of 7836 tests (99.9% pass rate: 7831 passed, 1 failure, 3 errors, 701 skipped).

Root Cause

Maven Test Failures (Same as PostgreSQL CI):

  1. AppsResourceTest.post_trigger_app_200 - Polling timeout (1 failure)
  2. AwsCredentialsUtilTest - AWS credentials not configured (3 errors)
    • testBuildCredentialsProviderWithOnlyAccessKey
    • testBuildCredentialsProviderWithNoCredentials
    • testBuildCredentialsProviderWithEmptyCredentials

All failures are infrastructure/configuration issues unrelated to CSV batching changes.

Details

Consistency Across Databases:

  • PostgreSQL CI (job 62003022199): 1 failure, 5 errors (includes 2 WorkflowDefinitionResourceTest errors)
  • MySQL CI (job 62003021906): 1 failure, 3 errors
  • Common failures: AppsResourceTest (1) + AwsCredentialsUtilTest (3)
  • Both backends: 99.9% pass rate

Critical Success Confirmed: The TeamResourceTest.testTeamImportExport bug is not in the failure list for either database backend, confirming commit 8486d24 successfully fixed the hierarchical entity resolution issue across both MySQL and PostgreSQL.

Test Report (job 62025034391): Failed as downstream consequence of Maven test failures. This is a reporting/aggregation job, not a separate test suite.

Complete CI Status:

  1. ✅ Maven (both backends): 99.9% pass rate - infrastructure failures only
  2. ✅ Python: 99.8% pass rate - S3 infrastructure issue
  3. ⚠️ UI Coverage: 99.9% - test maintenance needed
  4. ⚠️ E2E: 98.9% - 2 timeouts, 4 unrelated

Impact: All Maven failures are infrastructure issues, not blocking for CSV batching functionality.

Code Review 👍 Approved with suggestions (4 resolved / 5 findings)

Solid batched CSV import/export implementation. The previously identified unused batchNumber parameter in CsvImportProgressCallback remains unresolved - the parameter is captured in the lambda but not passed to sendCsvImportProgressNotification.

💡 Edge Case: Unused batchNumber parameter in CsvImportProgressCallback

📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/EntityResource.java:887

The CsvImportProgressCallback interface defines:

void onProgress(int rowsProcessed, int totalRows, int batchNumber, String message);

However, in EntityResource.java, when the callback is created (line ~892), the batchNumber parameter is ignored:

CsvImportProgressCallback progressCallback =
    (rowsProcessed, totalRows, batchNumber, message) ->
        WebsocketNotificationHandler.sendCsvImportProgressNotification(
            jobId, securityContext, rowsProcessed, totalRows, message);  // batchNumber not passed

This means the batch number information is lost when sending WebSocket notifications, even though it's available.

Suggested fix: Either pass batchNumber to the WebSocket notification handler if useful for the UI, or simplify the callback interface to remove the unused parameter.
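
The first option would be a one-line change, assuming sendCsvImportProgressNotification is given an overload that accepts the batch number (hypothetical signature):

CsvImportProgressCallback progressCallback =
    (rowsProcessed, totalRows, batchNumber, message) ->
        WebsocketNotificationHandler.sendCsvImportProgressNotification(
            jobId, securityContext, rowsProcessed, totalRows, batchNumber, message);  // batchNumber now forwarded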

✅ 4 resolved

Bug: PendingEntityOperation records CSV failures incorrectly

📄 openmetadata-csv/src/main/java/org/openmetadata/csv/EntityCsv.java:533
In flushPendingTableUpdates(), when a batch patch fails, the code updates import statistics:

for (CSVRecord record : context.csvRecords) {
  importResult.withNumberOfRowsPassed(importResult.getNumberOfRowsPassed() - 1);
  importResult.withNumberOfRowsFailed(importResult.getNumberOfRowsFailed() + 1);
}

However, importSuccess() was already called for these records earlier (line ~479), meaning each record was already counted as "passed". When the batch fails, this code decrements numberOfRowsPassed, but the original success message was already written to the CSV results.

This creates inconsistency:

  1. The CSV output shows success for rows that actually failed
  2. The summary numbers will be inconsistent with the actual CSV results output

Suggested fix: Either defer calling importSuccess() until after flushPendingTableUpdates() succeeds, or update the result CSV when failures occur in flush.
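
A sketch of the deferred option (the context fields and the patchTable call are assumptions; importSuccess/importFailure mirror the existing per-row helpers in EntityCsv):

// In flushPendingTableUpdates(): mark rows only after the batch PATCH settles.
try {
  patchTable(context.table, context.pendingColumnUpdates); // single PATCH per table
  for (CSVRecord record : context.csvRecords) {
    importSuccess(printer, record, ENTITY_UPDATED); // counted as passed only on success
  }
} catch (Exception e) {
  for (CSVRecord record : context.csvRecords) {
    importFailure(printer, e.getMessage(), record); // CSV output and counters stay consistent
  }
}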

Bug: Missing change event generation in batch entity operations

📄 openmetadata-csv/src/main/java/org/openmetadata/csv/EntityCsv.java:202
In flushPendingEntityOperations() (EntityCsv.java), entities are inserted/updated in batches via insertMany() and updateMany(). However, the code only queues entities for search index updates and skips change event generation entirely.

The original code in createEntity() generated change events via createChangeEventAndUpdateInES() for each entity, but the batch path bypasses this. This means:

  1. No ChangeEvent records are persisted for batch-created/updated entities
  2. Event-driven systems relying on change events won't be notified
  3. Audit trails may be incomplete

Suggested fix: Generate and persist change events for batch operations. Either:

  • Generate change events in flushPendingEntityOperations() before queuing for ES
  • Or modify the fallback path to ensure change events are created when createOrUpdate() is called
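
A sketch of the first option, emitting events inside the flush (getChangeEvent and the changeEventDAO call stand in for whatever createChangeEventAndUpdateInES already does per entity; only flushPendingEntityOperations, insertMany, and updateMany are named in the finding):

// In flushPendingEntityOperations(), after insertMany()/updateMany() succeed:
for (EntityInterface entity : flushedCreates) {
  ChangeEvent event = getChangeEvent(entity, EventType.ENTITY_CREATED, entity.getVersion(), null);
  daoCollection.changeEventDAO().insert(JsonUtils.pojoToJson(event)); // persist for event consumers and audit
}
// ... analogous loop over flushedUpdates with EventType.ENTITY_UPDATED,
// then queue everything for bulk search indexing as before.
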
Bug: Entity version not incremented on batch updates

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:1291
In flushPendingEntityOperations(), batch updates call dao.updateMany() directly, but this bypasses the entity versioning logic that normally happens in EntityRepository.createOrUpdate().

Looking at updateManyEntitiesForImport() (EntityRepository.java:1285-1305), the version is simply copied from the original without incrementing:

updated.setVersion(original.getVersion());

This means entities updated via batch import won't have their version incremented, which could cause:

  1. Optimistic locking issues if the entity is later updated normally
  2. Inconsistent version history
  3. Potential data overwrite conflicts

Suggested fix: Increment the version using EntityUtil.nextVersion() or similar logic as done in normal update paths.
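
A sketch of the fix in updateManyEntitiesForImport() (EntityUtil.nextVersion is the helper suggested above; treating every batch update as a minor version bump is an assumption, since a no-op change would normally leave the version untouched):

// Before: version copied unchanged, so batch updates never bump it
// updated.setVersion(original.getVersion());
// After: increment the way the normal update path does
updated.setVersion(EntityUtil.nextVersion(original.getVersion())); // e.g. 0.3 -> 0.4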

Bug: Division by zero possible in progress calculation

📄 openmetadata-ui/src/main/resources/ui/src/pages/EntityImport/BulkEntityImportPage/BulkEntityImportPage.tsx:2396
📄 openmetadata-ui/src/main/resources/ui/src/components/Entity/EntityExportModalProvider/EntityExportModalProvider.component.tsx:319
In the frontend BulkEntityImportPage.tsx, the progress percentage is calculated as:

percent={Math.round(((activeAsyncImportJob.progress ?? 0) / activeAsyncImportJob.total) * 100)}

There is an activeAsyncImportJob.total > 0 check, but it guards only the rendering condition for the Progress component; the percentage expression is still evaluated. If total is 0 or undefined, for example in a race before the job metadata arrives, the division could produce NaN or Infinity.

Similarly, in EntityExportModalProvider.component.tsx:

percent={Math.round((csvExportJob.progress / csvExportJob.total) * 100)}

Suggested fix: Add a safeguard to the calculation itself:

percent={Math.round(((activeAsyncImportJob.progress ?? 0) / Math.max(activeAsyncImportJob.total || 1, 1)) * 100)}

Rules ✅ All requirements met

Gitar Rules

Summary Enhancement: PR description includes comprehensive technical summary with batching details

2 rules not applicable. Show all rules by commenting gitar display:verbose.



Labels

backend, safe to test (label to run secure GitHub workflows on PRs)
