Dag processor: reduce file-queue dedup from O(N²) to O(N) with OrderedDict #67750

Open
shahar1 wants to merge 2 commits into apache:main from shahar1:perf/dag-processing-queue

shahar1 commented on May 29, 2026

Replace collections.deque with OrderedDict[DagFileInfo, None] for
DagFileProcessorManager._file_queue.

Problem

x in deque and deque.remove(x) both scan the queue, so each costs O(Q)
for a queue of Q entries. The frontprio priority path
(_queue_requested_files_for_parsing), per-loop callback adds
(_add_callback_to_queue), and any re-add against a populated queue
therefore cost O(N × Q) for N files, which is quadratic in the
steady-state file count because Q grows with N.

The three affected paths:

  • mode="frontprio": deque.remove(file) per file — O(Q) each
  • mode="front": f not in self._file_queue per file — O(Q) each
  • incremental drip: one file at a time via callback/priority adds

Fix

Use OrderedDict as an ordered set ({file: None}). Membership is O(1),
push-front is move_to_end(file, last=False), pop-front is
popitem(last=False). All three modes collapse to O(1) per file.
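
The ordered-set idiom described above can be sketched as follows. This is a minimal illustration, not the code from the PR: plain strings stand in for DagFileInfo, and the helper names (push_front, push_back_if_absent, pop_front) are hypothetical.

```python
from collections import OrderedDict

# Assumption: strings stand in for DagFileInfo keys; values are always None.
queue = OrderedDict.fromkeys(["a.py", "b.py", "c.py"])


def push_front(q, item):
    """Put item at the front, whether or not it is already queued."""
    q[item] = None                   # inserts if absent, leaves order alone if present
    q.move_to_end(item, last=False)  # O(1) move to the front


def push_back_if_absent(q, item):
    """Append item only if it is not queued yet (O(1) membership test)."""
    if item not in q:
        q[item] = None


def pop_front(q):
    """Remove and return the item at the front of the queue."""
    item, _ = q.popitem(last=False)
    return item


push_front(queue, "c.py")           # already queued: moved to the front
push_back_if_absent(queue, "a.py")  # already queued: no change
push_back_if_absent(queue, "d.py")  # new: appended at the back
order = list(queue)
first = pop_front(queue)
```

Each helper touches only the entry it names, so none of them scans the queue the way deque.remove() or a deque membership test would.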

Behavior is verified identical over 300 random operations against the old
deque semantics.
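
A randomized differential check of this kind could look roughly like the sketch below. It is not the script used for the PR: the four operations and their deque semantics are assumptions inferred from the description above, and the file names are synthetic.

```python
import random
from collections import OrderedDict, deque


def deque_apply(q, op, f):
    """Assumed old semantics, using a deque."""
    if op == "back":
        if f not in q:
            q.append(f)
    elif op == "front":
        if f not in q:
            q.appendleft(f)
    elif op == "frontprio":
        if f in q:
            q.remove(f)
        q.appendleft(f)
    elif op == "pop" and q:
        q.popleft()


def odict_apply(q, op, f):
    """Same operations on an OrderedDict used as an ordered set."""
    if op == "back":
        if f not in q:
            q[f] = None
    elif op == "front":
        if f not in q:
            q[f] = None
            q.move_to_end(f, last=False)
    elif op == "frontprio":
        q.pop(f, None)
        q[f] = None
        q.move_to_end(f, last=False)
    elif op == "pop" and q:
        q.popitem(last=False)


def run_differential(n_ops=300, seed=0):
    """Apply the same random operations to both models; compare after each step."""
    rng = random.Random(seed)
    files = [f"dag_{i}.py" for i in range(20)]
    dq, od = deque(), OrderedDict()
    for _ in range(n_ops):
        op = rng.choice(["back", "front", "frontprio", "pop"])
        f = rng.choice(files)
        deque_apply(dq, op, f)
        odict_apply(od, op, f)
        if list(dq) != list(od):
            return False
    return True
```

Keeping the file pool small (20 names here) makes collisions frequent, so the dedup and move-to-front branches are exercised often within 300 operations.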

Benchmark results

Measured on WSL2 / Python 3.12 with synthetic DagFileInfo objects
(no DB, no subprocess). Full benchmark scripts:
gist: dag-processing benchmarks

File-queue ops (best-of-5, ms):

Before:

files   fill empty (linear)   front re-add   frontprio re-add   incremental drip
  500                  0.17           35.4               35.6               36.5
 4000                  0.31         2636.7             2537.9             2640.5

After:

files   fill empty (linear)   front re-add   frontprio re-add   incremental drip
  500                  0.18           0.22               0.23               0.24
 4000                  0.31           2.94               3.82               3.05

The ms/N² column flips from ~142 (flat = quadratic) to ~0.1 (declining = linear), confirming the complexity class change.
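
As a rough cross-check of the complexity claim, the scaling exponent can be estimated from the two file counts in the "front re-add" column. This is a back-of-the-envelope sketch, not part of the benchmark scripts; with only two data points and sub-millisecond timings in the "after" case, the result is indicative at best.

```python
import math


def scaling_exponent(n_small, t_small, n_large, t_large):
    """Estimate k in t ~ n**k from two (n, t) measurements."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)


# "front re-add" timings in ms, taken from the tables above.
before = scaling_exponent(500, 35.4, 4000, 2636.7)
after = scaling_exponent(500, 0.22, 4000, 2.94)

print(f"before: n^{before:.2f}, after: n^{after:.2f}")
```

An exponent near 2 before the change and much closer to 1 after it is consistent with the move from quadratic to roughly linear behavior.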

Tests

All 116 test_manager.py tests pass. 19 tests required migrating
hardcoded deque(...) in setup/assertions to OrderedDict.fromkeys(...).
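
One point to keep in mind for such a migration, sketched here with strings standing in for DagFileInfo: comparing two OrderedDict instances is order-sensitive, so assertions on queue order keep their meaning, whereas comparing against a plain dict is not.

```python
from collections import OrderedDict

# Assumption: strings stand in for the DagFileInfo objects used in the real tests.
expected = OrderedDict.fromkeys(["a.py", "b.py"])
same_order = OrderedDict.fromkeys(["a.py", "b.py"])
reordered = OrderedDict.fromkeys(["b.py", "a.py"])

# OrderedDict == OrderedDict compares keys, values, and order.
order_sensitive = expected == same_order and expected != reordered

# OrderedDict == dict ignores order, so a plain dict would weaken the assertion.
plain_ignores_order = dict(reordered) == expected
```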


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (claude-sonnet-4-6)

Generated-by: Claude Code (claude-sonnet-4-6) following the guidelines

…dDict

Replace collections.deque with OrderedDict[DagFileInfo, None] for
_file_queue. Membership testing and remove operations are O(1) instead of
O(N), eliminating the quadratic cost in frontprio and re-add paths.

Verified behavior-identical over 300 random ops against the old deque
semantics. All 116 manager tests pass.

Benchmark results (best-of-N, ms):
  files   before    after    speedup
   4000   2320.7     3.82    ~610×  (frontprio re-add)
   4000   2299.7     2.94    ~780×  (front re-add)
shahar1 requested a review from kaxil on May 29, 2026 20:40
shahar1 added the backport-to-v3-2-test label on May 29, 2026
Comment thread airflow-core/src/airflow/dag_processing/manager.py
In frontprio mode, pop the existing key before re-inserting so the new
DagFileInfo object (which may carry a fresher bundle_path) replaces the
old one in the dict, matching the old deque.remove()+appendleft() semantics.

Remove all inline comments added to _add_files_to_queue and the _file_queue
field declaration.
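
The reason for popping before re-inserting in frontprio mode can be shown with a small sketch. FileInfo below is a hypothetical stand-in for DagFileInfo, and it assumes that bundle_path is excluded from equality and hashing; under that assumption, assigning an equal key to a dict keeps the original key object.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical stand-in for DagFileInfo; bundle_path is ignored by == and hash()."""
    rel_path: str
    bundle_path: str = field(compare=False, default="")


old = FileInfo("a.py", "/bundles/v1")
new = FileInfo("a.py", "/bundles/v2")  # equal to `old`, but with a fresher path
base = OrderedDict.fromkeys([FileInfo("b.py"), old])

# Assign + move: the dict finds the existing equal key and keeps the OLD object.
naive = OrderedDict(base)
naive[new] = None
naive.move_to_end(new, last=False)
naive_front = next(iter(naive))

# Pop first, then insert + move: the NEW object becomes the stored key.
fixed = OrderedDict(base)
fixed.pop(new, None)
fixed[new] = None
fixed.move_to_end(new, last=False)
fixed_front = next(iter(fixed))
```

Both variants end with "a.py" at the front; only the popped variant stores the object carrying the newer bundle_path.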
shahar1 requested a review from kaxil on May 30, 2026 05:42
shahar1 changed the title from "DAG processor: reduce file-queue dedup from O(N²) to O(N) with OrderedDict" to "Dag processor: reduce file-queue dedup from O(N²) to O(N) with OrderedDict" on May 30, 2026
Labels

area:DAG-processing, backport-to-v3-2-test
