Fuse document iterate and extract stages #1458

sarahyurick · 2026-02-03T19:50:36Z

Closes #780.

This change brings the following improvements:

- When tested with limited RAM (--memory=200g), the Common Crawl download and extract pipeline without fusion ran out of memory (OOM) even when scaled all the way down to 16 CPUs. With fusion, the pipeline succeeded with 32 CPUs.
- With the RAM limit removed, the pipeline was faster with fusion than without it. See the following benchmark (using https://github.com/NVIDIA-NeMo/Curator/blob/main/benchmarking/scripts/common_crawl_benchmark.py).

| Configuration | Runtime |
| --- | --- |
| Fusion + Ray Data executor | 27m 49.1s |
| No Fusion + Ray Data executor | 29m 33.1s |
| Fusion + Xenna executor | 54m 4.2s |
| No Fusion + Xenna executor | 66m 6.8s |

Common Crawl benchmarks with url_limit=16 and num_cpus=8
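
To put the benchmark numbers in relative terms, the runtimes above can be converted to seconds and compared. This is only a small sketch: the helper names `to_seconds` and `reduction_pct` are illustrative and are not part of the benchmark script.

```python
import re


def to_seconds(runtime: str) -> float:
    """Convert a runtime such as '27m 49.1s' into seconds."""
    match = re.fullmatch(r"(\d+)m\s+([\d.]+)s", runtime.strip())
    if match is None:
        raise ValueError(f"Unrecognized runtime format: {runtime!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def reduction_pct(fused: str, unfused: str) -> float:
    """Percentage of the unfused runtime saved by fusion, rounded to 0.1."""
    saved = to_seconds(unfused) - to_seconds(fused)
    return round(100 * saved / to_seconds(unfused), 1)


# Runtimes copied from the benchmark table above.
ray_data = reduction_pct("27m 49.1s", "29m 33.1s")
xenna = reduction_pct("54m 4.2s", "66m 6.8s")

print(f"Ray Data executor: {ray_data}% less runtime with fusion")
print(f"Xenna executor: {xenna}% less runtime with fusion")
```

On this single run, fusion cuts runtime by about 5.9% with the Ray Data executor and about 18.2% with the Xenna executor.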

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

greptile-apps

14 files reviewed, 3 comments


greptile-apps · 2026-02-05T21:59:20Z

nemo_curator/stages/text/download/base/iterator.py

+                for record_dict in iterator_result:
+                    if self.record_limit and record_count >= self.record_limit:
+                        break


Record counts wrong
record_limit is enforced based on record_count, but record_count is only incremented after a record is kept (i.e., after extract() and after extracted is None filtering). With an extractor that filters heavily, this will iterate/extract far more than record_limit input records per file (and potentially do a lot more work/memory) before record_count reaches the limit. This is triggered when extractor is set and can return None (e.g., content filters).

A record is only added to the result after extraction. This means that the record count is correct.
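
The counting behavior discussed in this thread can be sketched as follows. This is a simplified illustration, not the code in iterator.py: the function `iterate_and_extract`, the toy extractor `keep_long_text`, and the `scanned` counter are hypothetical, and only the `record_limit` check mirrors the diff above.

```python
from __future__ import annotations

from collections.abc import Callable, Iterable
from typing import Any

Record = dict[str, Any]


def iterate_and_extract(
    records: Iterable[Record],
    extract: Callable[[Record], Record | None],
    record_limit: int | None = None,
) -> tuple[list[Record], int]:
    """Return the kept records and the number of input records scanned."""
    kept: list[Record] = []
    scanned = 0
    for record in records:
        # The limit is checked against records that survived extraction.
        if record_limit and len(kept) >= record_limit:
            break
        scanned += 1
        extracted = extract(record)
        if extracted is None:
            # Filtered out by the extractor; does not count toward the limit.
            continue
        kept.append(extracted)
    return kept, scanned


def keep_long_text(record: Record) -> Record | None:
    """Toy extractor (assumption): drop records with fewer than 5 characters."""
    text = record["text"]
    return {"text": text} if len(text) >= 5 else None


raw = [
    {"text": t}
    for t in ["hi", "hello world", "no", "another document", "yo", "third long one"]
]
kept, scanned = iterate_and_extract(raw, keep_long_text, record_limit=2)
print(len(kept), scanned)
```

Here the output holds exactly `record_limit` records, which matches the reply above, while four input records are scanned to produce them, which is the extra work the review comment points out.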

greptile-apps · 2026-02-05T21:59:28Z

Additional Comments (2)

nemo_curator/stages/text/download/base/iterator.py
Empty batch lacks schema
When records is empty (empty task.data, iterate() returns None, all files error, or all records filtered), pd.DataFrame(records) creates a DataFrame with no columns. Downstream stages/tests that rely on declared outputs (e.g., expecting the filename column or extractor output columns) will hit KeyError when accessing missing columns. Consider constructing an empty DataFrame with the expected columns from self.outputs()[1] when records is empty.
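
One way to follow this suggestion is to pass the expected column names to the DataFrame constructor. In this minimal sketch, `records_to_frame` and the column names are assumptions made for illustration; the real stage would take the names from its declared outputs.

```python
import pandas as pd


def records_to_frame(records: list[dict], columns: list[str]) -> pd.DataFrame:
    """Build a DataFrame that keeps its columns even when records is empty."""
    return pd.DataFrame(records, columns=columns)


# Hypothetical column names, standing in for the stage's declared outputs.
expected_columns = ["url", "text", "file_name"]

empty_without_schema = pd.DataFrame([])
empty_with_schema = records_to_frame([], expected_columns)

print(list(empty_without_schema.columns))
print(list(empty_with_schema.columns))
```

With the columns supplied, an empty batch still exposes every expected column, so downstream code that selects a column by name gets an empty Series instead of a KeyError.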

docs/about/concepts/text/data-acquisition-concepts.md
Incorrect API description
This section says the iterator’s yielded list[dict] is converted to a DataFrame and “passed to Extractor”, and later states “DocumentExtractor works on a Pandas DataFrame”. In the current implementation, DocumentIterateExtractStage calls extractor.extract(record_dict) per-record (a dict), not on a DataFrame. As written, this doc is describing a different API/behavior than users will actually get.

* Fuse document iterate and extract stages
* ruff
* fix bug
* update docs and tutorial
* save progress
* update more tests
* ruff
* fix tests
* ruff
* update benchmark
* move class
* add missing import
* update comment

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>

Fuse document iterate and extract stages

c956a85

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>


greptile-apps bot reviewed Feb 5, 2026


sarahyurick merged commit b7febb3 into NVIDIA-NeMo:main Feb 6, 2026
50 checks passed

sarahyurick deleted the fuse_iterate_extract branch February 9, 2026 18:13
