fix(assets): reduce task success asset registration lock contention #66854
hkc-8010 wants to merge 8 commits into
The failing CI job ( This is a All other CI checks pass.
98668e8 to e8bca3b
e8bca3b to c8086d2
The CI failure in
@hkc-8010 Converting to draft: this PR doesn't yet meet our Pull Request quality criteria.

See the linked criteria for how to fix each item, then mark the PR "Ready for review". This is not a rejection, just an invitation to bring the PR up to standard. No rush.

Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer (a real person) will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting
f6486ea to 3cca00a
3cca00a to 280a86a
Under high concurrency (80+ simultaneous task completions emitting asset events), the API server was OOMKilled due to idle-in-transaction DB lock pile-up.

Root cause: ti_update_state held a SELECT ... FOR UPDATE row lock on task_instance while AssetManager.register_asset_change() ran multiple slow queries, including an ORM .append() that lazy-loaded the entire asset_events collection (potentially thousands of rows).

Two fixes:

1. In AssetManager.register_asset_change(), replace asset_alias_model.asset_events.append(asset_event) with a direct INSERT into asset_alias_asset_event. This avoids loading the full relationship collection while the row lock is held.

2. In ti_update_state(), add session.commit() after the TI state UPDATE and Log writes to release the task_instance row lock before running asset registration. Asset registration then runs outside the lock in a fresh implicit transaction. Registration failures are logged and swallowed -- the task state is already durable at that point.

Note: session.commit() inside a session-parameter function is an intentional deviation from the CLAUDE.md convention. No code after the commit relies on rollback; the subsequent session.get() re-loads fresh state. Alternative approaches (second session, background task) were considered but have higher operational complexity for equivalent correctness.

Production evidence: connections idle-in-transaction for 3+ minutes on asset_alias queries, blocking SELECT task_instance FOR UPDATE across 8 concurrent workers. Disabling the trigger DAGs dropped apiserver memory from 5Gi+ to MBs instantly.
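The first fix in the commit message above can be illustrated with a minimal sketch. It uses the standard-library `sqlite3` module instead of the SQLAlchemy ORM that the PR touches, so it only models the query pattern: the table layout, `make_db`, `link_by_loading_collection`, and `link_by_direct_insert` are hypothetical names for illustration, not Airflow code. The first helper imitates what a lazy-loaded collection `.append()` costs (read every existing link, then insert); the second writes the association row directly.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical alias/event link table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE asset_alias_asset_event ("
        " alias_id INTEGER NOT NULL,"
        " event_id INTEGER NOT NULL,"
        " PRIMARY KEY (alias_id, event_id))"
    )
    return conn


def link_by_loading_collection(conn: sqlite3.Connection, alias_id: int, event_id: int) -> int:
    """Imitate a lazy-loaded collection append: fetch all links, then insert.

    Returns the number of rows that had to be read before the insert.
    """
    existing = conn.execute(
        "SELECT event_id FROM asset_alias_asset_event WHERE alias_id = ?",
        (alias_id,),
    ).fetchall()
    conn.execute(
        "INSERT INTO asset_alias_asset_event (alias_id, event_id) VALUES (?, ?)",
        (alias_id, event_id),
    )
    return len(existing)


def link_by_direct_insert(conn: sqlite3.Connection, alias_id: int, event_id: int) -> int:
    """Insert the association row directly; no existing rows are read."""
    conn.execute(
        "INSERT INTO asset_alias_asset_event (alias_id, event_id) VALUES (?, ?)",
        (alias_id, event_id),
    )
    return 0


if __name__ == "__main__":
    conn = make_db()
    # Seed an alias that already has many linked events.
    conn.executemany(
        "INSERT INTO asset_alias_asset_event VALUES (1, ?)",
        [(i,) for i in range(1000)],
    )
    print("rows read by collection-style append:", link_by_loading_collection(conn, 1, 1000))
    print("rows read by direct insert:", link_by_direct_insert(conn, 1, 1001))
```

The work done by the collection-style path grows with the number of existing links, while the direct insert stays constant, which is why it matters when that work happens while a row lock is held.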
Move `insert` from inline method import to the top-level sqlalchemy import block and drop the unnecessary `sa_insert` alias.

Improve the session.commit() comment to explain why the early commit is still needed after the direct-INSERT alias-side fix, and clarify that the post-commit exception swallow is intentional.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
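The early commit and the intentional post-commit exception swallow described in these commit messages can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: it uses standard-library `sqlite3` (which has no `SELECT ... FOR UPDATE` row locks, so only the transaction boundaries are modelled), a `Counter` as a stand-in for a metrics client, and hypothetical helpers `make_db`, `finish_task`, and `failing_registration`.

```python
import logging
import sqlite3
from collections import Counter

log = logging.getLogger(__name__)
metrics = Counter()  # stand-in for a real metrics client (assumption)


def make_db() -> sqlite3.Connection:
    """Create a tiny hypothetical schema with one running task instance."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE task_instance (id INTEGER PRIMARY KEY, state TEXT)")
    conn.execute("CREATE TABLE asset_event (id INTEGER PRIMARY KEY)")
    conn.execute("INSERT INTO task_instance VALUES (1, 'running')")
    conn.commit()
    return conn


def finish_task(conn: sqlite3.Connection, ti_id: int, register_assets) -> None:
    """Persist the task state first, then run asset registration separately."""
    # Step 1: write the state and commit, ending the transaction that would
    # hold the task_instance row lock on a database with row-level locking.
    conn.execute("UPDATE task_instance SET state = 'success' WHERE id = ?", (ti_id,))
    conn.commit()

    # Step 2: registration runs in a new transaction. A failure here is
    # counted and logged but not re-raised: the state is already durable.
    try:
        register_assets(conn)
        conn.commit()
    except Exception:
        conn.rollback()
        metrics["asset.registration_failures"] += 1
        log.exception("Asset registration failed after state commit (ti_id=%s)", ti_id)


def failing_registration(conn: sqlite3.Connection) -> None:
    """Hypothetical registration step that writes a row and then fails."""
    conn.execute("INSERT INTO asset_event VALUES (1)")
    raise RuntimeError("simulated registration failure")


if __name__ == "__main__":
    conn = make_db()
    finish_task(conn, 1, failing_registration)
    print(conn.execute("SELECT state FROM task_instance WHERE id = 1").fetchone()[0])
    print(metrics["asset.registration_failures"])
```

The trade-off is the one the PR states: a failed registration is dropped rather than retried, so the counter is what makes the dropped work observable.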
280a86a to 4673ff3
Closes #66853
Summary
- Commit the task state early in `ti_update_state()`, releasing the `task_instance` row lock before the asset scheduling work runs.
- Avoid lazy-loading alias `asset_events` collections on the task completion path.
- Emit `asset.registration_failures` so dropped registration work is observable. Durable retry/reconciliation is intentionally out of scope for this lock-contention fix.

Evidence
Sanitized measurements from the affected deployment showed the task completion path under load with 80+ concurrent successful task completions, asset-registration queries left idle in transaction for minutes, blocked `SELECT ... FOR UPDATE` calls on `task_instance`, and apiserver OOMKills while those requests piled up.

Changes
- `airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py`: early task-state commit, post-commit asset registration, and `asset.registration_failures` metric on registration failure.
- `airflow-core/src/airflow/assets/manager.py`: direct alias association insert, with an explicit note that `asset_alias_model.asset_events` is intentionally left unsynced in the current session because this path does not read it again before commit.
- `airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py`: regression coverage for 204 task-state durability and asset-registration failure metric emission using asset-registration-aware failure injection.
- `airflow-core/tests/unit/assets/test_manager.py`: regression coverage for avoiding alias `asset_events` lazy-loads.
- `shared/observability/src/airflow_shared/observability/metrics/metrics_template.yaml`: metric registry entry for `asset.registration_failures`.
- `scripts/ci/prek/check_connection_doc_labels.py`: skip volatile generated dependency directories while scanning source/docs for connection labels, so all-files static checks do not race with UI `node_modules` churn.

Validation
- `prek run --files airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py airflow-core/src/airflow/assets/manager.py airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py airflow-core/tests/unit/assets/test_manager.py shared/observability/src/airflow_shared/observability/metrics/metrics_template.yaml`
- `.venv/bin/python scripts/ci/prek/run_mypy_full_dist_local_venv_or_breeze_in_ci.py airflow-core`
- `uv run --frozen --no-sync pytest airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py airflow-core/tests/unit/assets/test_manager.py -q`
- `breeze testing core-tests --backend sqlite --python 3.10 --db-reset -- airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py airflow-core/tests/unit/assets/test_manager.py -q`
- `breeze testing core-tests --backend postgres --python 3.10 --db-reset -- airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py airflow-core/tests/unit/assets/test_manager.py -q`
- `breeze testing core-tests --backend postgres --python 3.10 --downgrade-pendulum --db-reset -- airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py airflow-core/tests/unit/assets/test_manager.py -q`
- `breeze testing core-tests --backend sqlite --python 3.10 --force-lowest-dependencies --db-reset -- airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py airflow-core/tests/unit/assets/test_manager.py -q`
- `uv run --frozen --no-sync pytest scripts/tests/ci/prek/test_check_connection_doc_labels.py -q`
- `uv run --no-sync scripts/ci/prek/check_connection_doc_labels.py`
- `prek run --files scripts/ci/prek/check_connection_doc_labels.py scripts/tests/ci/prek/test_check_connection_doc_labels.py`
- `prek run check-connection-doc-labels --all-files`

GitHub Actions:
- `CI image checks / Static checks` passed on commit `ab90c262091d64bdf84d07abe7e604c85f829ebe`.
- One remaining GitHub status is reported red for `MySQL tests: core / DB-core:MySQL:8.0:3.10:API...CLI` because the job conclusion is `cancelled`, but the job's internal steps show the migration tests, MySQL DB tests, and post-success step all completed successfully. Rerunning that status from this account is permission-gated. The PR is still draft, so the WIP status remains pending by design.

PR Checklist
- `main` branch
- `prek`, mypy, pytest, and Breeze validation passed locally