@sarahyurick sarahyurick commented Feb 3, 2026

Closes #780.

This change allows the following improvements:

| Configuration | Runtime |
| --- | --- |
| Fusion + Ray Data executor | 27m 49.1s |
| No Fusion + Ray Data executor | 29m 33.1s |
| Fusion + Xenna executor | 54m 4.2s |
| No Fusion + Xenna executor | 66m 6.8s |

Common Crawl benchmarks with url_limit=16 and num_cpus=8
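
As a rough check on the benchmark numbers, the relative runtime saved by fusion can be computed directly from the reported values. The helper names below are illustrative only and are not part of Curator.

```python
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert a runtime given as minutes and seconds into seconds."""
    return minutes * 60 + seconds


def reduction_pct(fused: float, unfused: float) -> float:
    """Percentage of the unfused runtime that fusion saves."""
    return (unfused - fused) / unfused * 100


# Runtimes copied from the benchmark results in this PR description.
ray_fused, ray_unfused = to_seconds(27, 49.1), to_seconds(29, 33.1)
xenna_fused, xenna_unfused = to_seconds(54, 4.2), to_seconds(66, 6.8)

ray_saving = reduction_pct(ray_fused, ray_unfused)
xenna_saving = reduction_pct(xenna_fused, xenna_unfused)

print(f"Ray Data: {ray_saving:.1f}% less runtime with fusion")
print(f"Xenna:    {xenna_saving:.1f}% less runtime with fusion")
```

This works out to roughly 5.9% for the Ray Data executor and 18.2% for the Xenna executor.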

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

@greptile-apps greptile-apps bot left a comment


14 files reviewed, 3 comments


Comment on lines +101 to +103
    for record_dict in iterator_result:
        if self.record_limit and record_count >= self.record_limit:
            break

Record counts wrong
record_limit is enforced against record_count, but record_count is only incremented after a record is kept (that is, after extract() runs and records for which extracted is None are dropped). With an extractor that filters heavily, the loop can iterate over and extract far more than record_limit input records per file, doing more work and potentially using more memory, before record_count reaches the limit. This happens whenever an extractor is set and can return None (e.g., content filters).

@sarahyurick (author) replied:

A record is only added to the result after extraction. This means that the record count is correct.
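
The semantics discussed in this thread can be sketched as follows. This is a simplified loop written for illustration, not the actual DocumentIterateExtractStage code; `iterate_and_extract` and `drop_short` are hypothetical names. The limit is compared against kept records, so the output holds at most record_limit records, while the number of inputs examined can be larger when the extractor filters.

```python
from collections.abc import Callable, Iterable
from typing import Any, Optional

Record = dict[str, Any]


def iterate_and_extract(
    records: Iterable[Record],
    extract: Callable[[Record], Optional[Record]],
    record_limit: Optional[int] = None,
) -> tuple[list[Record], int]:
    """Return (kept records, number of input records examined)."""
    kept: list[Record] = []
    examined = 0
    for record in records:
        # The limit is compared against kept records, not inputs seen.
        if record_limit and len(kept) >= record_limit:
            break
        examined += 1
        extracted = extract(record)
        if extracted is None:  # filtered out by the extractor
            continue
        kept.append(extracted)
    return kept, examined


def drop_short(record: Record) -> Optional[Record]:
    """Hypothetical extractor: drop texts shorter than 5 characters."""
    return record if len(record["text"]) >= 5 else None


texts = ["hi", "hello world", "no", "another doc", "more text"]
inputs = [{"text": t} for t in texts]
kept, examined = iterate_and_extract(inputs, drop_short, record_limit=2)
print(len(kept), examined)
```

Here two records are kept while four inputs are examined, which is consistent with both comments: the output count respects the limit, and the extra work scales with how aggressively the extractor filters.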


greptile-apps bot commented Feb 5, 2026

Additional Comments (2)

nemo_curator/stages/text/download/base/iterator.py
Empty batch lacks schema
When records is empty (empty task.data, iterate() returns None, all files error, or all records filtered), pd.DataFrame(records) creates a DataFrame with no columns. Downstream stages/tests that rely on declared outputs (e.g., expecting the filename column or extractor output columns) will hit KeyError when accessing missing columns. Consider constructing an empty DataFrame with the expected columns from self.outputs()[1] when records is empty.
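
A minimal sketch of the suggested fix, assuming the expected column names are available as a plain list. The column names and the `records_to_frame` helper are hypothetical; in the stage itself the list would come from self.outputs()[1], as the comment proposes.

```python
import pandas as pd

# Hypothetical column list standing in for self.outputs()[1].
expected_columns = ["url", "text", "file_name"]


def records_to_frame(records: list[dict], columns: list[str]) -> pd.DataFrame:
    """Build a DataFrame that keeps the declared columns even when empty."""
    if not records:
        return pd.DataFrame(columns=columns)
    return pd.DataFrame(records)


bare = pd.DataFrame([])  # no columns at all
safe = records_to_frame([], expected_columns)

print(list(bare.columns))
print(list(safe.columns))
```

With the schema preserved, downstream code that selects a declared column from an empty batch gets an empty Series instead of a KeyError.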


docs/about/concepts/text/data-acquisition-concepts.md
Incorrect API description
This section says the iterator’s yielded list[dict] is converted to a DataFrame and “passed to Extractor”, and later states “DocumentExtractor works on a Pandas DataFrame”. In the current implementation, DocumentIterateExtractStage calls extractor.extract(record_dict) per-record (a dict), not on a DataFrame. As written, this doc is describing a different API/behavior than users will actually get.
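
The per-record behaviour described here might look like the sketch below. `UppercaseExtractor` and the record fields are hypothetical stand-ins for illustration; the only detail taken from the comment is that extract() receives a single dict rather than a DataFrame.

```python
from typing import Any, Optional

import pandas as pd


class UppercaseExtractor:
    """Hypothetical extractor: takes one record dict, returns a dict or None."""

    def extract(self, record: dict[str, Any]) -> Optional[dict[str, Any]]:
        text = record.get("content", "")
        if not text:
            return None  # record is dropped
        return {"url": record["url"], "text": text.upper()}


extractor = UppercaseExtractor()
raw_records = [
    {"url": "https://example.com/a", "content": "first page"},
    {"url": "https://example.com/b", "content": ""},
]

# extract() is called once per dict; the DataFrame is built only afterwards.
extracted = [out for rec in raw_records if (out := extractor.extract(rec)) is not None]
df = pd.DataFrame(extracted)
print(df)
```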

@sarahyurick sarahyurick merged commit b7febb3 into NVIDIA-NeMo:main Feb 6, 2026
50 checks passed
@sarahyurick sarahyurick deleted the fuse_iterate_extract branch February 9, 2026 18:13
SwekeR-463 pushed a commit to SwekeR-463/Curator that referenced this pull request Feb 10, 2026
* Fuse document iterate and extract stages
* ruff
* fix bug
* update docs and tutorial
* save progress
* update more tests
* ruff
* fix tests
* ruff
* update benchmark
* move class
* add missing import
* update comment

---------

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
Successfully merging this pull request may close these issues.

Download and Extract - Consider merging iterate and extract to a single stage
