[benchmark] Add FastText filter benchmarking script (#1411) #1452
Conversation
Greptile Overview

Greptile Summary

This PR extends the benchmarking framework with a new Hydra-driven FastText filter benchmark script. Main things to double-check are around the new benchmark script's assumptions: whether FastText-based stages need an explicit setup() call before the pipeline runs.

Confidence Score: 3/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant Driver as benchmarking/tools/run.sh
participant Script as fasttext_filter_benchmark.py
participant Hydra as Hydra compose()
participant Pipe as nemo_curator.pipeline.Pipeline
participant Exec as Executor (ray_data/xenna)
participant Stages as Instantiated stages
participant Sinks as write_benchmark_results
Driver->>Script: python ... --yaml-config ... --input-path ... --fasttext-*-model-path ...
Script->>Hydra: initialize_config_dir(config_dir)
Script->>Hydra: compose(config_name, overrides)
Hydra-->>Script: DictConfig(cfg)
Script->>Pipe: create_pipeline_from_yaml(cfg)
loop for each cfg.stages
Script->>Stages: hydra.utils.instantiate(stage_cfg)
Script->>Pipe: add_stage(stage)
end
Script->>Pipe: run(executor)
Pipe->>Exec: execute stages over tasks
Exec-->>Pipe: output_tasks (+ stage perf)
Pipe-->>Script: output_tasks
Script->>Script: aggregate metrics from task._stage_perf
Script->>Sinks: write_benchmark_results(results, benchmark_results_path)
Sinks-->>Driver: metrics.json (+ optional sink reporting)
2 files reviewed, no comments
- Add fasttext_filter_benchmark.py script following the pattern from score_filter_benchmark.py
- Add fasttext_filter_raydata and fasttext_filter_xenna entries to nightly-benchmark.yaml
- Supports FastText language ID and quality filters with model setup requirements

Fixes NVIDIA-NeMo#1411

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
2b52542 to c2ba0da
3 files reviewed, no comments
…onfig (NVIDIA-NeMo#1411)
- Add separate dataset entries for FastText langid and quality models
- Pass FastText model paths as explicit CLI arguments to benchmarks
- Remove hardcoded model paths from Hydra overrides
- Update FastText filter benchmarks to use model_weights_path
- Align arxiv E2E benchmark arg naming with FastText langid usage

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
c2ba0da to de0cec9
3 files reviewed, 4 comments
def create_pipeline_from_yaml(cfg: DictConfig) -> Pipeline:
    pipeline = Pipeline(name="fasttext_filter_pipeline")
    for stage_cfg in cfg.stages:
        stage = hydra.utils.instantiate(stage_cfg)
        pipeline.add_stage(stage)
    return pipeline
[P0] Pipeline stages may never get setup() called, leaving FastText models unloaded.
The PR description mentions FastText filters require setup() for model loading, but create_pipeline_from_yaml only instantiates and adds stages; it doesn't call stage.setup() (or pipeline.setup()). If the pipeline framework doesn't automatically call setup(), this will fail at runtime when the filter runs. Please confirm how Pipeline.run() handles stage setup for Hydra-instantiated stages; if it doesn’t, this script needs to explicitly invoke setup before running.
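If an explicit call does turn out to be necessary, one option is to invoke setup on each stage right after instantiation. A minimal sketch, assuming each FastText-backed stage exposes a no-argument setup() hook that loads its model weights (the real hook name, its signature, and whether the executor already calls it are assumptions to confirm against nemo_curator):

import hydra
from omegaconf import DictConfig

from nemo_curator.pipeline import Pipeline


def create_pipeline_from_yaml(cfg: DictConfig) -> Pipeline:
    """Build the pipeline and eagerly set up stages that need model loading."""
    pipeline = Pipeline(name="fasttext_filter_pipeline")
    for stage_cfg in cfg.stages:
        stage = hydra.utils.instantiate(stage_cfg)
        # Assumption: stages that load FastText models expose a setup() hook.
        # If the executor also calls setup(), this is only safe when setup()
        # is idempotent, so verify before adopting this pattern.
        if hasattr(stage, "setup"):
            stage.setup()
        pipeline.add_stage(stage)
    return pipeline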
# Stage assumptions:
# 0 = partitioning (if any)
# 1 = reader
# -1 = writer (num_items_processed equals documents kept after all filters)
num_documents_processed = sum(task._stage_perf[1].num_items_processed for task in output_tasks)
num_kept_documents = sum(task._stage_perf[-1].num_items_processed for task in output_tasks)
[P1] Hard-coded _stage_perf index assumptions can break with different YAML pipelines.
The script assumes _stage_perf[1] is the reader and [-1] is the writer, but fasttext_filter_pipeline.yaml can have an arbitrary number/order of stages (and some pipelines may not have a partitioning stage). That can lead to wrong metrics or an IndexError. The other benchmark scripts have the same pattern, but since this script is meant to be reusable across FastText pipelines, it would be safer to derive counts from stage names or use TaskPerfUtils instead of positional indexing.
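If positional indexing is dropped, one option is to aggregate counts keyed by stage name and select the reader/writer entries explicitly. A rough sketch: only num_items_processed is confirmed by the snippet above; the stage_name attribute and the example stage names are assumptions about the perf records, not the actual TaskPerfUtils API.

from collections import defaultdict


def counts_by_stage_name(output_tasks) -> dict[str, int]:
    """Sum num_items_processed per stage name across all output tasks."""
    totals: dict[str, int] = defaultdict(int)
    for task in output_tasks:
        for perf in task._stage_perf:
            # Assumption: each perf record carries the name of its stage.
            totals[perf.stage_name] += perf.num_items_processed
    return dict(totals)


# Hypothetical usage with illustrative stage names:
# totals = counts_by_stage_name(output_tasks)
# num_documents_processed = totals["reader"]
# num_kept_documents = totals["writer"]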
overrides_list = [
    f"input_path={input_path}",
    f"output_path={output_path}",
    f"fasttext_langid_model_path={fasttext_langid_model_path}",
    f"fasttext_quality_model_path={fasttext_quality_model_path}",
]
if overrides:
    overrides_list.extend(overrides.split(","))
[P2] --overrides parsing is fragile with Hydra values that contain commas.
Splitting the overrides string on commas will mis-handle overrides whose values include commas (common with list-like values), leading to invalid Hydra override tokens. Consider taking overrides as a repeatable flag (one per override) or otherwise avoiding naive comma-splitting.
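For reference, a repeatable flag avoids splitting on commas altogether; each occurrence maps to exactly one Hydra override token. A minimal argparse sketch (the flag name --override and its wiring are illustrative, not the script's current interface):

import argparse

parser = argparse.ArgumentParser()
# One Hydra override per occurrence, e.g.:
#   --override stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader \
#   --override "some_list_param=[a,b,c]"
parser.add_argument(
    "--override",
    action="append",
    default=[],
    dest="extra_overrides",
    help="Hydra override; repeat the flag for multiple overrides",
)
args = parser.parse_args()

overrides_list: list[str] = []  # base overrides built elsewhere in the script
# Comma-containing values pass through untouched because nothing is split.
overrides_list.extend(args.extra_overrides)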
- name: fasttext_filter_raydata
  enabled: true
  script: fasttext_filter_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --output-path={session_entry_dir}/scratch/output
    --executor=ray_data
    --input-path={dataset:tinystories,parquet}
    --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
    --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
    --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
    --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"
  timeout_s: 400
  sink_data:
    - name: slack
      additional_metrics:
        - num_kept_documents
        - throughput_docs_per_sec
  ray:
    num_cpus: 64
    num_gpus: 0
    enable_object_spilling: false

- name: fasttext_filter_xenna
  enabled: true
  script: fasttext_filter_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --output-path={session_entry_dir}/scratch/output
    --executor=xenna
    --input-path={dataset:tinystories,parquet}
    --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
    --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
    --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
    --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"
[P1] New FastText benchmark entries lack requirements, so regressions won’t be caught by nightly.
Most existing entries define a requirements: section to enforce throughput and/or data-integrity expectations. fasttext_filter_raydata and fasttext_filter_xenna currently only report metrics to Slack, so they’ll run but won’t fail the nightly job on major performance or correctness changes. If baseline metrics are known (or can be captured), adding minimal requirements (e.g., exact num_documents_processed and a conservative min throughput) would make these benchmarks actionable.
Description
This PR adds a benchmarking script for FastText-based document filters (language ID and quality) to the NeMo Curator benchmarking framework. The implementation follows the same pattern as the existing score_filter_benchmark.py script.

Changes:
- fasttext_filter_benchmark.py script that benchmarks FastText filters using a Hydra-configured pipeline
- fasttext_filter_raydata and fasttext_filter_xenna entries added to nightly-benchmark.yaml for both executors

The script handles FastText filters that require setup() for model loading, which differentiates them from heuristic filters.

Related Issue: #1411
Questions for Discussion
I have a couple of questions posted in the issue comments (#1411) regarding:
- Whether to add requirements sections for the FastText benchmarks now, or add them in a follow-up PR after establishing baseline metrics.
- Whether the assumed model paths ({datasets_path}/models/fasttext/lid.176.bin and {datasets_path}/models/fasttext/quality.bin) are acceptable.

Usage
The benchmark can be run via the benchmarking framework:
Or directly:
python benchmarking/scripts/fasttext_filter_benchmark.py \
  --benchmark-results-path /path/to/results \
  --input-path /path/to/input \
  --yaml-config nemo_curator/config/text/fasttext_filter_pipeline.yaml \
  --executor ray_data \
  --overrides "fasttext_langid_model_path=/path/to/lid.176.bin, fasttext_quality_model_path=/path/to/quality.bin"

Checklist