Conversation

@KunalSachdev2005
Contributor

Description

This PR adds a benchmarking script for FastText-based document filters (language ID and quality) to the NeMo Curator benchmarking framework. The implementation follows the same pattern as the existing score_filter_benchmark.py script.

Changes:

  • Added fasttext_filter_benchmark.py script that benchmarks FastText filters using a Hydra-configured pipeline
  • Added fasttext_filter_raydata and fasttext_filter_xenna entries to nightly-benchmark.yaml for both executors
  • Supports both the FastText language ID and quality filters, including the setup() step they require for model loading

The script handles FastText filters that require setup() for model loading, which differentiates them from heuristic filters.
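
For context, a minimal sketch of what such a setup-dependent filter looks like (the class name, filter interface, and English-only check below are illustrative only, not the actual NeMo Curator classes):

import fasttext

class FastTextLangIdFilterSketch:
    """Illustrative only: defers model loading to setup()."""

    def __init__(self, model_path: str):
        self.model_path = model_path
        self.model = None  # nothing is loaded at construction time

    def setup(self) -> None:
        # Heavyweight model loading happens once, after the stage reaches a worker.
        self.model = fasttext.load_model(self.model_path)

    def keep_document(self, text: str) -> bool:
        labels, _scores = self.model.predict(text.replace("\n", " "))
        return labels[0] == "__label__en"

Heuristic filters, by contrast, are pure functions of the document text and need no such loading step.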

Related Issue: #1411

Questions for Discussion

I have a couple of questions posted in the issue comments (#1411) regarding:

  1. Metric requirements: Should we add requirements sections for the FastText benchmarks now, or add them in a follow-up PR after establishing baseline metrics?
  2. Model paths: Confirmation that the model paths ({datasets_path}/models/fasttext/lid.176.bin and {datasets_path}/models/fasttext/quality.bin) are acceptable.

Usage

The benchmark can be run via the benchmarking framework:

./benchmarking/tools/run.sh --config ./benchmarking/nightly-benchmark.yaml

Or directly:

python benchmarking/scripts/fasttext_filter_benchmark.py \
  --benchmark-results-path /path/to/results \
  --input-path /path/to/input \
  --yaml-config nemo_curator/config/text/fasttext_filter_pipeline.yaml \
  --executor ray_data \
  --overrides "fasttext_langid_model_path=/path/to/lid.176.bin, fasttext_quality_model_path=/path/to/quality.bin"

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
    • Benchmark scripts serve as integration tests run by the benchmarking framework.
  • The documentation is up to date with these changes.
    • Script includes docstrings and follows the same pattern as other benchmark scripts.

@copy-pr-bot

copy-pr-bot bot commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps bot commented Feb 3, 2026

Greptile Overview

Greptile Summary

This PR extends the benchmarking framework with a new Hydra-driven fasttext_filter_benchmark.py script for FastText language-ID and quality filters, and wires two new nightly entries (ray_data and xenna) with explicit model-weight dataset paths. It also renames the ArXiv E2E benchmark CLI flag from --fasttext-model-path to the clearer --fasttext-langid-model-path and updates the nightly config accordingly.

The main things to double-check are the new benchmark script’s assumptions: whether FastText-based stages need an explicit setup() call before Pipeline.run(), and whether the positional _stage_perf indexing matches the configured YAML pipeline stages.

Confidence Score: 3/5

  • This PR is likely safe to merge, but the new FastText benchmark script may fail at runtime depending on pipeline setup behavior and stage metric assumptions.
  • The YAML and ArXiv flag rename changes are straightforward; the main uncertainty is whether Hydra-instantiated FastText stages get setup() invoked automatically by Pipeline.run(), plus some fragility in metrics collection and overrides parsing that could break certain configs.
  • benchmarking/scripts/fasttext_filter_benchmark.py

Important Files Changed

benchmarking/nightly-benchmark.yaml: Adds distinct FastText model dataset entries and two new FastText filter benchmark runs (ray_data/xenna); suggestion: add requirements to make regressions fail nightly.
benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py: Renames CLI/config plumbing from --fasttext-model-path to --fasttext-langid-model-path for clarity; change is consistent through pipeline creation and params.
benchmarking/scripts/fasttext_filter_benchmark.py: Introduces new Hydra-driven benchmark runner for FastText filters; potential runtime issue if FastText stages require explicit setup, plus brittle _stage_perf indexing and override parsing.

Sequence Diagram

sequenceDiagram
  participant Driver as benchmarking/tools/run.sh
  participant Script as fasttext_filter_benchmark.py
  participant Hydra as Hydra compose()
  participant Pipe as nemo_curator.pipeline.Pipeline
  participant Exec as Executor (ray_data/xenna)
  participant Stages as Instantiated stages
  participant Sinks as write_benchmark_results

  Driver->>Script: python ... --yaml-config ... --input-path ... --fasttext-*-model-path ...
  Script->>Hydra: initialize_config_dir(config_dir)
  Script->>Hydra: compose(config_name, overrides)
  Hydra-->>Script: DictConfig(cfg)
  Script->>Pipe: create_pipeline_from_yaml(cfg)
  loop for each cfg.stages
    Script->>Stages: hydra.utils.instantiate(stage_cfg)
    Script->>Pipe: add_stage(stage)
  end
  Script->>Pipe: run(executor)
  Pipe->>Exec: execute stages over tasks
  Exec-->>Pipe: output_tasks (+ stage perf)
  Pipe-->>Script: output_tasks
  Script->>Script: aggregate metrics from task._stage_perf
  Script->>Sinks: write_benchmark_results(results, benchmark_results_path)
  Sinks-->>Driver: metrics.json (+ optional sink reporting)


@greptile-apps greptile-apps bot left a comment


2 files reviewed, no comments


- Add fasttext_filter_benchmark.py script following the pattern from
  score_filter_benchmark.py
- Add fasttext_filter_raydata and fasttext_filter_xenna entries to
  nightly-benchmark.yaml
- Supports FastText language ID and quality filters with model setup
  requirements

Fixes NVIDIA-NeMo#1411

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
KunalSachdev2005 force-pushed the fixes-1411-fasttext-filters-benchmarking-script branch from 2b52542 to c2ba0da on February 4, 2026 at 04:05

@greptile-apps greptile-apps bot left a comment


3 files reviewed, no comments


…onfig (NVIDIA-NeMo#1411)

- Add separate dataset entries for FastText langid and quality models
- Pass FastText model paths as explicit CLI arguments to benchmarks
- Remove hardcoded model paths from Hydra overrides
- Update FastText filter benchmarks to use model_weights_path
- Align arxiv E2E benchmark arg naming with FastText langid usage

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
KunalSachdev2005 force-pushed the fixes-1411-fasttext-filters-benchmarking-script branch from c2ba0da to de0cec9 on February 4, 2026 at 16:11

@greptile-apps greptile-apps bot left a comment


3 files reviewed, 4 comments


Comment on lines +46 to +51
def create_pipeline_from_yaml(cfg: DictConfig) -> Pipeline:
    pipeline = Pipeline(name="fasttext_filter_pipeline")
    for stage_cfg in cfg.stages:
        stage = hydra.utils.instantiate(stage_cfg)
        pipeline.add_stage(stage)
    return pipeline

[P0] Pipeline stages may never get setup() called, leaving FastText models unloaded.

The PR description mentions FastText filters require setup() for model loading, but create_pipeline_from_yaml only instantiates and adds stages; it doesn't call stage.setup() (or pipeline.setup()). If the pipeline framework doesn't automatically call setup(), this will fail at runtime when the filter runs. Please confirm how Pipeline.run() handles stage setup for Hydra-instantiated stages; if it doesn’t, this script needs to explicitly invoke setup before running.

Comment on lines +96 to +101
# Stage assumptions:
# 0 = partitioning (if any)
# 1 = reader
# -1 = writer (num_items_processed equals documents kept after all filters)
num_documents_processed = sum(task._stage_perf[1].num_items_processed for task in output_tasks)
num_kept_documents = sum(task._stage_perf[-1].num_items_processed for task in output_tasks)

[P1] Hard-coded _stage_perf index assumptions can break with different YAML pipelines.

The script assumes _stage_perf[1] is the reader and [-1] is the writer, but fasttext_filter_pipeline.yaml can have an arbitrary number/order of stages (and some pipelines may not have a partitioning stage). That can lead to wrong metrics or an IndexError. The other benchmark scripts have the same pattern, but since this script is meant to be reusable across FastText pipelines, it would be safer to derive counts from stage names or use TaskPerfUtils instead of positional indexing.
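
As a sketch of the name-based alternative (the stage_name attribute and the stage names used below are assumptions; check the actual perf-record fields or the TaskPerfUtils API before relying on them):

def sum_items_for_stage(output_tasks, stage_name: str) -> int:
    # Assumes each per-stage perf record carries the stage's name and its item count.
    return sum(
        perf.num_items_processed
        for task in output_tasks
        for perf in task._stage_perf
        if perf.stage_name == stage_name
    )

num_documents_processed = sum_items_for_stage(output_tasks, "parquet_reader")  # hypothetical stage name
num_kept_documents = sum_items_for_stage(output_tasks, "jsonl_writer")         # hypothetical stage name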

Comment on lines +77 to +85
overrides_list = [
    f"input_path={input_path}",
    f"output_path={output_path}",
    f"fasttext_langid_model_path={fasttext_langid_model_path}",
    f"fasttext_quality_model_path={fasttext_quality_model_path}",
]
if overrides:
    overrides_list.extend(overrides.split(","))


[P2] --overrides parsing is fragile with Hydra values that contain commas.

Splitting the overrides string on commas will mis-handle overrides whose values include commas (common with list-like values), leading to invalid Hydra override tokens. Consider taking overrides as a repeatable flag (one per override) or otherwise avoiding naive comma-splitting.
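
A minimal sketch of the repeatable-flag approach (the flag name and wiring are illustrative, not the script's current interface):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--override",
    action="append",
    default=[],
    dest="extra_overrides",
    help="One Hydra override per flag, e.g. --override stages.0._target_=...; repeat as needed.",
)
args = parser.parse_args()

overrides_list: list[str] = []
overrides_list.extend(args.extra_overrides)  # values containing commas survive intact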

Comment on lines +642 to +676
- name: fasttext_filter_raydata
  enabled: true
  script: fasttext_filter_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --output-path={session_entry_dir}/scratch/output
    --executor=ray_data
    --input-path={dataset:tinystories,parquet}
    --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
    --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
    --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
    --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"
  timeout_s: 400
  sink_data:
    - name: slack
      additional_metrics:
        - num_kept_documents
        - throughput_docs_per_sec
  ray:
    num_cpus: 64
    num_gpus: 0
    enable_object_spilling: false

- name: fasttext_filter_xenna
  enabled: true
  script: fasttext_filter_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --output-path={session_entry_dir}/scratch/output
    --executor=xenna
    --input-path={dataset:tinystories,parquet}
    --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
    --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
    --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
    --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"

[P1] New FastText benchmark entries lack requirements, so regressions won’t be caught by nightly.

Most existing entries define a requirements: section to enforce throughput and/or data-integrity expectations. fasttext_filter_raydata and fasttext_filter_xenna currently only report metrics to Slack, so they’ll run but won’t fail the nightly job on major performance or correctness changes. If baseline metrics are known (or can be captured), adding minimal requirements (e.g., exact num_documents_processed and a conservative min throughput) would make these benchmarks actionable.

