Fuse document iterate and extract stages #1458
Conversation
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
14 files reviewed, 3 comments
```python
for record_dict in iterator_result:
    if self.record_limit and record_count >= self.record_limit:
        break
```
Record counts wrong
record_limit is enforced against record_count, but record_count is only incremented after a record is kept (i.e., after extract() returns and after records whose extracted result is None have been filtered out). With an extractor that filters heavily, the loop can iterate and extract far more than record_limit input records per file (and potentially do much more work and use much more memory) before record_count reaches the limit. This is triggered whenever an extractor is set and can return None (e.g., content filters).
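For concreteness, here is a minimal, self-contained sketch of the counting behavior under discussion; the function name, signature, and all helper names are illustrative, not the repository's actual code. It shows that record_count is incremented only once a record is kept, so records filtered out by the extractor do not count toward record_limit:

```python
from typing import Any, Callable, Iterable, Optional

Record = dict[str, Any]

def fused_iterate_extract(
    iterator_result: Iterable[Record],
    extractor: Optional[Callable[[Record], Optional[Record]]],
    record_limit: Optional[int],
) -> list[Record]:
    records: list[Record] = []
    record_count = 0
    for record_dict in iterator_result:  # iterate stage
        if record_limit and record_count >= record_limit:
            break
        if extractor is not None:
            extracted = extractor(record_dict)  # extract stage
            if extracted is None:  # filtered out by the extractor
                continue  # note: record_count is NOT incremented here
            record_dict = extracted
        records.append(record_dict)
        record_count += 1  # incremented only once a record is kept
    return records
```

With an extractor that returns None for most records, this loop keeps iterating until record_limit records survive extraction, which is exactly the trade-off debated in this thread.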
A record is only added to the result after extraction. This means that the record count is correct.
Squashed commit history:

* Fuse document iterate and extract stages
* ruff
* fix bug
* update docs and tutorial
* save progress
* update more tests
* ruff
* fix tests
* ruff
* update benchmark
* move class
* add missing import
* update comment

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
Closes #780.
This change enables the following improvements:

* With a constrained memory limit (--memory=200g), the Common Crawl download and extract pipeline without fusion OOM'ed even when scaling all the way down to 16 CPUs. With fusion, the pipeline was able to succeed with 32 CPUs (see the sketch below for why fusion lowers peak memory).
* Common Crawl benchmarks with url_limit=16 and num_cpus=8
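To illustrate why fusing the two stages reduces peak memory, here is a sketch under assumed names (iterate, extract, Record), not the actual pipeline code: an unfused pipeline materializes every iterated record for a file before extraction begins, while the fused stage extracts (and possibly drops) each record as soon as it is iterated.

```python
from typing import Any, Callable, Iterator, Optional

Record = dict[str, Any]
Iterate = Callable[[str], Iterator[Record]]
Extract = Callable[[Record], Optional[Record]]

def unfused(iterate: Iterate, extract: Extract, path: str) -> list[Record]:
    # Two stages: the iterate stage materializes every raw record for the
    # file before the extract stage runs, so peak memory scales with the
    # number of raw records per file.
    iterated = list(iterate(path))
    return [e for r in iterated if (e := extract(r)) is not None]

def fused(iterate: Iterate, extract: Extract, path: str) -> list[Record]:
    # One stage: each record is extracted as soon as it is iterated, so
    # raw records are never all resident at once; peak memory scales with
    # the number of kept records.
    return [e for r in iterate(path) if (e := extract(r)) is not None]
```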