fix(assets): reduce task success asset registration lock contention #66854

Open
hkc-8010 wants to merge 8 commits into apache:main from hkc-8010:fix/asset-event-lock-contention

Conversation

@hkc-8010
Contributor

@hkc-8010 hkc-8010 commented May 13, 2026

Closes #66853

Summary

  • Commit the task-instance state update and log entry before asset registration in ti_update_state(), releasing the task_instance row lock before the asset scheduling work runs.
  • Replace the alias-event ORM relationship append with a direct association-table insert so long-running aliases do not lazy-load large asset_events collections on the task completion path.
  • Keep HTTP 204 after task success is durable if post-commit asset registration fails, but now log the failure and increment asset.registration_failures so dropped registration work is observable. Durable retry/reconciliation is intentionally out of scope for this lock-contention fix.
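
The ordering described in the bullets above can be sketched as follows. This is a minimal illustration using the standard-library `sqlite3` module rather than the SQLAlchemy session the PR actually touches; the `update_state` and `register_assets` functions, the table layout, and the in-memory `metrics` dict are all hypothetical stand-ins, and only the metric name `asset.registration_failures` comes from the PR.

```python
import logging
import sqlite3

log = logging.getLogger(__name__)

# Stand-in for a real stats client; only the metric name is taken from the PR.
metrics = {"asset.registration_failures": 0}


def register_assets(conn, ti_id):
    """Hypothetical placeholder for the slow asset scheduling work."""
    raise RuntimeError("simulated asset registration failure")


def update_state(conn, ti_id, state, register=register_assets):
    # 1. Persist the state change and commit immediately, so that on a
    #    database with row locks the lock is released before the slow work.
    conn.execute("UPDATE task_instance SET state = ? WHERE id = ?", (state, ti_id))
    conn.commit()

    # 2. Post-commit work runs in its own transaction. A failure here is
    #    logged and counted, not re-raised: the task state is already durable.
    try:
        register(conn, ti_id)
        conn.commit()
    except Exception:
        conn.rollback()
        log.exception("asset registration failed for task instance %s", ti_id)
        metrics["asset.registration_failures"] += 1
    return 204


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO task_instance VALUES (1, 'running')")
conn.commit()

status = update_state(conn, 1, "success")
state = conn.execute("SELECT state FROM task_instance WHERE id = 1").fetchone()[0]
```

In this sketch the caller still receives 204 and the row still reads `success` even though registration raised. The only trace of the dropped work is the log line and the counter, which is the trade-off the third bullet describes.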

Evidence

Sanitized measurements from the affected deployment, taken with 80+ concurrent successful task completions in flight, showed asset-registration queries left idle in transaction for minutes, SELECT ... FOR UPDATE calls on task_instance blocked behind them, and apiserver OOMKills as those requests piled up.

Changes

  • airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py: early task-state commit, post-commit asset registration, and asset.registration_failures metric on registration failure.
  • airflow-core/src/airflow/assets/manager.py: direct alias association insert, with an explicit note that asset_alias_model.asset_events is intentionally left unsynced in the current session because this path does not read it again before commit.
  • airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py: regression coverage for 204 task-state durability and asset-registration failure metric emission using asset-registration-aware failure injection.
  • airflow-core/tests/unit/assets/test_manager.py: regression coverage for avoiding alias asset_events lazy-loads.
  • shared/observability/src/airflow_shared/observability/metrics/metrics_template.yaml: metric registry entry for asset.registration_failures.
  • scripts/ci/prek/check_connection_doc_labels.py: skip volatile generated dependency directories while scanning source/docs for connection labels, so all-files static checks do not race with UI node_modules churn.
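
For the last item, one common way to skip volatile directories during a scan is to prune them in place while walking the tree. The sketch below shows that technique only; it is not the code in check_connection_doc_labels.py. The `iter_source_files` helper, the file suffixes, and the contents of `SKIP_DIRS` are assumptions (the PR names only `node_modules`).

```python
import os
import tempfile

# Assumed set of volatile directories; only node_modules is named in the PR.
SKIP_DIRS = {"node_modules", ".git", "__pycache__"}


def iter_source_files(root, suffixes=(".py", ".rst")):
    """Yield matching files under root without descending into SKIP_DIRS."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Mutating dirnames in place tells os.walk not to enter those
        # directories, so files changing inside them are never listed.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in sorted(filenames):
            if name.endswith(suffixes):
                yield os.path.join(dirpath, name)


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "docs"))
    os.makedirs(os.path.join(root, "ui", "node_modules", "pkg"))
    for rel in ("docs/conn.rst", "ui/node_modules/pkg/readme.rst"):
        with open(os.path.join(root, *rel.split("/")), "w") as fh:
            fh.write("label\n")
    found = [os.path.relpath(p, root) for p in iter_source_files(root)]
```

Pruning during the walk, rather than filtering paths afterwards, means the scanner never lists a directory that another process may be rewriting at the same time.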

Validation

  • prek run --files airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py airflow-core/src/airflow/assets/manager.py airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py airflow-core/tests/unit/assets/test_manager.py shared/observability/src/airflow_shared/observability/metrics/metrics_template.yaml
  • .venv/bin/python scripts/ci/prek/run_mypy_full_dist_local_venv_or_breeze_in_ci.py airflow-core
  • uv run --frozen --no-sync pytest airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py airflow-core/tests/unit/assets/test_manager.py -q
  • breeze testing core-tests --backend sqlite --python 3.10 --db-reset -- airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py airflow-core/tests/unit/assets/test_manager.py -q
  • breeze testing core-tests --backend postgres --python 3.10 --db-reset -- airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py airflow-core/tests/unit/assets/test_manager.py -q
  • breeze testing core-tests --backend postgres --python 3.10 --downgrade-pendulum --db-reset -- airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py airflow-core/tests/unit/assets/test_manager.py -q
  • breeze testing core-tests --backend sqlite --python 3.10 --force-lowest-dependencies --db-reset -- airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py airflow-core/tests/unit/assets/test_manager.py -q
  • uv run --frozen --no-sync pytest scripts/tests/ci/prek/test_check_connection_doc_labels.py -q
  • uv run --no-sync scripts/ci/prek/check_connection_doc_labels.py
  • prek run --files scripts/ci/prek/check_connection_doc_labels.py scripts/tests/ci/prek/test_check_connection_doc_labels.py
  • prek run check-connection-doc-labels --all-files

GitHub Actions: CI image checks / Static checks passed on commit ab90c262091d64bdf84d07abe7e604c85f829ebe. One remaining GitHub status is reported red for MySQL tests: core / DB-core:MySQL:8.0:3.10:API...CLI because the job conclusion is cancelled, but the job's internal steps show the migration tests, MySQL DB tests, and post-success step all completed successfully. Rerunning that status from this account is permission-gated. The PR is still draft, so the WIP status remains pending by design.

PR Checklist

  • My PR is targeted at the main branch
  • Tests added/updated
  • Targeted prek, mypy, pytest, and Breeze validation passed locally

@boring-cyborg boring-cyborg Bot added area:API Airflow's REST/HTTP API area:task-sdk labels May 13, 2026
hkc-8010 added a commit to hkc-8010/my-airflow-repository that referenced this pull request May 13, 2026
@hkc-8010 hkc-8010 marked this pull request as ready for review May 13, 2026 11:41
@hkc-8010
Contributor Author

The failing CI job (provider distributions tests / Compat 3.0.6:P3.10) is pre-existing on main and unrelated to this PR. It fails with:

ImportError: cannot import name 'Options' from 'jwt.types'

This is a flask_jwt_extended/PyJWT version incompatibility in providers/fab tests running against Airflow 3.0.6. Confirmed failing on main in run https://github.com/apache/airflow/actions/runs/25789005777 before this PR was opened.

All other CI checks pass.

Lee-W pushed a commit to hkc-8010/my-airflow-repository that referenced this pull request May 14, 2026
@Lee-W Lee-W force-pushed the fix/asset-event-lock-contention branch from 98668e8 to e8bca3b Compare May 14, 2026 09:07
@hkc-8010 hkc-8010 force-pushed the fix/asset-event-lock-contention branch from e8bca3b to c8086d2 Compare May 15, 2026 04:57
hkc-8010 added a commit to hkc-8010/my-airflow-repository that referenced this pull request May 15, 2026
@hkc-8010
Contributor Author

hkc-8010 commented May 15, 2026

The CI failure in Integration and System Tests / Integration core otel (test_export_legacy_metric_names) is unrelated to this PR — it tests scheduler-side metric emission timing and has a history of flakiness (see #61070, #65867). This PR touches only task_instances.py and assets/manager.py; no OTEL or scheduler metric code is modified.

@choo121600 choo121600 added the ready for maintainer review Set after triaging when all criteria pass. label May 15, 2026
Comment thread airflow-core/tests/unit/assets/test_manager.py Outdated
@hkc-8010 hkc-8010 requested a review from Lee-W May 19, 2026 15:11
@potiuk potiuk removed the ready for maintainer review Set after triaging when all criteria pass. label May 24, 2026
@eladkal eladkal added this to the Airflow 3.2.3 milestone May 25, 2026
@eladkal eladkal added type:bug-fix Changelog: Bug Fixes backport-to-v3-2-test Mark PR with this label to backport to v3-2-test branch labels May 25, 2026
@potiuk potiuk marked this pull request as draft May 26, 2026 00:46
@potiuk
Member

potiuk commented May 26, 2026

@hkc-8010 Converting to draft — this PR doesn't yet meet our Pull Request quality criteria.

  • Pre-commit / static checks. See docs.
  • Unresolved review comments: 2 thread(s). See docs.

See the linked criteria for how to fix each item, then mark the PR "Ready for review". This is not a rejection — just an invitation to bring the PR up to standard. No rush.


Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.


Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting

@hkc-8010 hkc-8010 force-pushed the fix/asset-event-lock-contention branch from f6486ea to 3cca00a Compare May 27, 2026 12:25
hkc-8010 added a commit to hkc-8010/my-airflow-repository that referenced this pull request May 27, 2026
@hkc-8010 hkc-8010 force-pushed the fix/asset-event-lock-contention branch from 3cca00a to 280a86a Compare May 27, 2026 18:11
hkc-8010 added a commit to hkc-8010/my-airflow-repository that referenced this pull request May 27, 2026
Comment thread airflow-core/src/airflow/assets/manager.py
Under high concurrency (80+ simultaneous task completions emitting asset
events), the API server was OOMKilled due to idle-in-transaction DB
lock pile-up. Root cause: ti_update_state held a SELECT...FOR UPDATE row
lock on task_instance while AssetManager.register_asset_change() ran
multiple slow queries, including an ORM .append() that lazy-loaded the
entire asset_events collection (potentially thousands of rows).

Two fixes:

1. In AssetManager.register_asset_change(), replace
   asset_alias_model.asset_events.append(asset_event) with a direct
   INSERT into asset_alias_asset_event. This avoids loading the full
   relationship collection while the row lock is held.

2. In ti_update_state(), add session.commit() after the TI state UPDATE
   and Log writes to release the task_instance row lock before running
   asset registration. Asset registration then runs outside the lock in
   a fresh implicit transaction. Registration failures are logged and
   swallowed -- the task state is already durable at that point.
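
The first fix can be illustrated with a minimal sketch. It uses the standard-library `sqlite3` module in place of SQLAlchemy, so it shows the shape of the SQL rather than the ORM call. The table name `asset_alias_asset_event` comes from this commit message; the column names, the `link_event` helper, and the row counts are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE asset_alias (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE asset_event (id INTEGER PRIMARY KEY);
    CREATE TABLE asset_alias_asset_event (
        alias_id INTEGER REFERENCES asset_alias(id),
        event_id INTEGER REFERENCES asset_event(id),
        PRIMARY KEY (alias_id, event_id)
    );
    """
)
# A long-running alias that already has 999 linked events.
conn.execute("INSERT INTO asset_alias VALUES (1, 'long-running-alias')")
conn.executemany("INSERT INTO asset_event VALUES (?)", [(i,) for i in range(1, 1001)])
conn.executemany(
    "INSERT INTO asset_alias_asset_event VALUES (1, ?)", [(i,) for i in range(1, 1000)]
)
conn.commit()

# Record every statement the connection runs from here on.
statements = []
conn.set_trace_callback(statements.append)


def link_event(conn, alias_id, event_id):
    # One INSERT into the association table; the existing links for the
    # alias are never read, however many there are.
    conn.execute(
        "INSERT INTO asset_alias_asset_event (alias_id, event_id) VALUES (?, ?)",
        (alias_id, event_id),
    )


link_event(conn, 1, 1000)
conn.commit()
conn.set_trace_callback(None)

selects = [s for s in statements if s.lstrip().upper().startswith("SELECT")]
linked = conn.execute(
    "SELECT COUNT(*) FROM asset_alias_asset_event WHERE alias_id = 1"
).fetchone()[0]
```

The cost of the direct insert does not depend on how many events the alias already has, whereas appending to a lazily loaded relationship collection first has to fetch every existing row.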

Note: session.commit() inside a session-parameter function is an
intentional deviation from the CLAUDE.md convention. No code after the
commit relies on rollback; the subsequent session.get() re-loads fresh
state. Alternative approaches (second session, background task) were
considered but have higher operational complexity for equivalent
correctness.

Production evidence: connections idle-in-transaction for 3+ minutes on
asset_alias queries, blocking SELECT task_instance FOR UPDATE across 8
concurrent workers. Disabling the trigger DAGs dropped apiserver memory
from 5Gi+ to MBs instantly.
hkc-8010 and others added 6 commits June 3, 2026 19:41
Move `insert` from inline method import to the top-level sqlalchemy
import block and drop the unnecessary `sa_insert` alias. Improve the
session.commit() comment to explain why the early commit is still
needed after the direct-INSERT alias-side fix, and clarify that the
post-commit exception swallow is intentional.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@hkc-8010 hkc-8010 force-pushed the fix/asset-event-lock-contention branch from 280a86a to 4673ff3 Compare June 4, 2026 03:34
@hkc-8010 hkc-8010 changed the title from "fix(assets): release task_instance row lock before asset event emission" to "fix(assets): reduce task success asset registration lock contention" Jun 4, 2026
@hkc-8010 hkc-8010 requested a review from kaxil June 4, 2026 05:26
@hkc-8010 hkc-8010 marked this pull request as ready for review June 4, 2026 08:54

Labels

  • area:API (Airflow's REST/HTTP API)
  • area:task-sdk
  • backport-to-v3-2-test (Mark PR with this label to backport to v3-2-test branch)
  • type:bug-fix (Changelog: Bug Fixes)

Development

Successfully merging this pull request may close these issues.

API server OOMKill: task_instance row lock held during asset event emission under high concurrency

6 participants