Conversation

@KunalSachdev2005
Contributor

Description

This PR adds a benchmarking script for FastText-based document filters (language ID and quality) to the NeMo Curator benchmarking framework. The implementation follows the same pattern as the existing score_filter_benchmark.py script.

Changes:

  • Added fasttext_filter_benchmark.py script that benchmarks FastText filters using a Hydra-configured pipeline
  • Added fasttext_filter_raydata and fasttext_filter_xenna entries to nightly-benchmark.yaml for both executors
  • Supports both the FastText language ID and quality filters, including the setup() step they require for model loading

The script handles FastText filters that require setup() for model loading, which differentiates them from heuristic filters.
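
For context, a minimal sketch of what such a setup-dependent filter looks like (the class name, filter interface, and English-only check below are illustrative only, not the actual NeMo Curator classes):

import fasttext

class FastTextLangIdFilterSketch:
    """Illustrative only: defers model loading to setup()."""

    def __init__(self, model_path: str):
        self.model_path = model_path
        self.model = None  # nothing is loaded at construction time

    def setup(self) -> None:
        # Heavyweight model loading happens once, after the stage reaches a worker.
        self.model = fasttext.load_model(self.model_path)

    def keep_document(self, text: str) -> bool:
        labels, _scores = self.model.predict(text.replace("\n", " "))
        return labels[0] == "__label__en"

Heuristic filters, by contrast, are pure functions of the document text and need no such loading step.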

Related Issue: #1411

Questions for Discussion

I have a couple of questions posted in the issue comments (#1411) regarding:

  1. Metric requirements: Should we add requirements sections for the FastText benchmarks now, or add them in a follow-up PR after establishing baseline metrics?
  2. Model paths: Confirmation that the model paths ({datasets_path}/models/fasttext/lid.176.bin and {datasets_path}/models/fasttext/quality.bin) are acceptable.

Usage

The benchmark can be run via the benchmarking framework:

./benchmarking/tools/run.sh --config ./benchmarking/nightly-benchmark.yaml

Or directly:

python benchmarking/scripts/fasttext_filter_benchmark.py \
  --benchmark-results-path /path/to/results \
  --input-path /path/to/input \
  --yaml-config nemo_curator/config/text/fasttext_filter_pipeline.yaml \
  --executor ray_data \
  --overrides "fasttext_langid_model_path=/path/to/lid.176.bin, fasttext_quality_model_path=/path/to/quality.bin"

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
    • Benchmark scripts serve as integration tests run by the benchmarking framework.
  • The documentation is up to date with these changes.
    • Script includes docstrings and follows the same pattern as other benchmark scripts.

@copy-pr-bot

copy-pr-bot bot commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps bot commented Feb 3, 2026

Greptile Overview

Greptile Summary

This PR extends the benchmarking framework with a new Hydra-driven fasttext_filter_benchmark.py script for FastText language-ID and quality filters, and wires two new nightly entries (ray_data and xenna) with explicit model-weight dataset paths. It also renames the ArXiv E2E benchmark CLI flag from --fasttext-model-path to the clearer --fasttext-langid-model-path and updates the nightly config accordingly.

The main things to double-check are the new benchmark script’s assumptions: whether FastText-based stages need an explicit setup() call before Pipeline.run(), and whether the positional _stage_perf indexing matches the configured YAML pipeline stages.

Confidence Score: 3/5

  • This PR is likely safe to merge, but the new FastText benchmark script may fail at runtime depending on pipeline setup behavior and stage metric assumptions.
  • The YAML and ArXiv flag rename changes are straightforward; the main uncertainty is whether Hydra-instantiated FastText stages get setup() invoked automatically by Pipeline.run(), plus some fragility in metrics collection and overrides parsing that could break certain configs.
  • benchmarking/scripts/fasttext_filter_benchmark.py

Important Files Changed

benchmarking/nightly-benchmark.yaml: Adds distinct FastText model dataset entries and two new FastText filter benchmark runs (ray_data/xenna); suggestion: add requirements to make regressions fail nightly.
benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py: Renames CLI/config plumbing from --fasttext-model-path to --fasttext-langid-model-path for clarity; change is consistent through pipeline creation and params.
benchmarking/scripts/fasttext_filter_benchmark.py: Introduces new Hydra-driven benchmark runner for FastText filters; potential runtime issue if FastText stages require explicit setup, plus brittle _stage_perf indexing and override parsing.

Sequence Diagram

sequenceDiagram
  participant Driver as benchmarking/tools/run.sh
  participant Script as fasttext_filter_benchmark.py
  participant Hydra as Hydra compose()
  participant Pipe as nemo_curator.pipeline.Pipeline
  participant Exec as Executor (ray_data/xenna)
  participant Stages as Instantiated stages
  participant Sinks as write_benchmark_results

  Driver->>Script: python ... --yaml-config ... --input-path ... --fasttext-*-model-path ...
  Script->>Hydra: initialize_config_dir(config_dir)
  Script->>Hydra: compose(config_name, overrides)
  Hydra-->>Script: DictConfig(cfg)
  Script->>Pipe: create_pipeline_from_yaml(cfg)
  loop for each cfg.stages
    Script->>Stages: hydra.utils.instantiate(stage_cfg)
    Script->>Pipe: add_stage(stage)
  end
  Script->>Pipe: run(executor)
  Pipe->>Exec: execute stages over tasks
  Exec-->>Pipe: output_tasks (+ stage perf)
  Pipe-->>Script: output_tasks
  Script->>Script: aggregate metrics from task._stage_perf
  Script->>Sinks: write_benchmark_results(results, benchmark_results_path)
  Sinks-->>Driver: metrics.json (+ optional sink reporting)


@greptile-apps greptile-apps bot left a comment


2 files reviewed, no comments


- Add fasttext_filter_benchmark.py script following the pattern from
  score_filter_benchmark.py
- Add fasttext_filter_raydata and fasttext_filter_xenna entries to
  nightly-benchmark.yaml
- Supports FastText language ID and quality filters with model setup
  requirements

Fixes NVIDIA-NeMo#1411

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
KunalSachdev2005 force-pushed the fixes-1411-fasttext-filters-benchmarking-script branch from 2b52542 to c2ba0da on February 4, 2026 at 04:05

@greptile-apps greptile-apps bot left a comment


3 files reviewed, no comments


…onfig (NVIDIA-NeMo#1411)

- Add separate dataset entries for FastText langid and quality models
- Pass FastText model paths as explicit CLI arguments to benchmarks
- Remove hardcoded model paths from Hydra overrides
- Update FastText filter benchmarks to use model_weights_path
- Align arxiv E2E benchmark arg naming with FastText langid usage

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
KunalSachdev2005 force-pushed the fixes-1411-fasttext-filters-benchmarking-script branch from c2ba0da to de0cec9 on February 4, 2026 at 16:11

@greptile-apps greptile-apps bot left a comment


3 files reviewed, 4 comments


Comment on lines +46 to +51
def create_pipeline_from_yaml(cfg: DictConfig) -> Pipeline:
    pipeline = Pipeline(name="fasttext_filter_pipeline")
    for stage_cfg in cfg.stages:
        stage = hydra.utils.instantiate(stage_cfg)
        pipeline.add_stage(stage)
    return pipeline

[P0] Pipeline stages may never get setup() called, leaving FastText models unloaded.

The PR description mentions FastText filters require setup() for model loading, but create_pipeline_from_yaml only instantiates and adds stages; it doesn't call stage.setup() (or pipeline.setup()). If the pipeline framework doesn't automatically call setup(), this will fail at runtime when the filter runs. Please confirm how Pipeline.run() handles stage setup for Hydra-instantiated stages; if it doesn’t, this script needs to explicitly invoke setup before running.

Comment on lines +96 to +101
# Stage assumptions:
# 0 = partitioning (if any)
# 1 = reader
# -1 = writer (num_items_processed equals documents kept after all filters)
num_documents_processed = sum(task._stage_perf[1].num_items_processed for task in output_tasks)
num_kept_documents = sum(task._stage_perf[-1].num_items_processed for task in output_tasks)

[P1] Hard-coded _stage_perf index assumptions can break with different YAML pipelines.

The script assumes _stage_perf[1] is the reader and [-1] is the writer, but fasttext_filter_pipeline.yaml can have an arbitrary number/order of stages (and some pipelines may not have a partitioning stage). That can lead to wrong metrics or an IndexError. The other benchmark scripts have the same pattern, but since this script is meant to be reusable across FastText pipelines, it would be safer to derive counts from stage names or use TaskPerfUtils instead of positional indexing.
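
As a sketch of the name-based alternative (the stage_name attribute and the stage names used below are assumptions; check the actual perf-record fields or the TaskPerfUtils API before relying on them):

def sum_items_for_stage(output_tasks, stage_name: str) -> int:
    # Assumes each per-stage perf record carries the stage's name and its item count.
    return sum(
        perf.num_items_processed
        for task in output_tasks
        for perf in task._stage_perf
        if perf.stage_name == stage_name
    )

num_documents_processed = sum_items_for_stage(output_tasks, "parquet_reader")  # hypothetical stage name
num_kept_documents = sum_items_for_stage(output_tasks, "jsonl_writer")         # hypothetical stage name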

Comment on lines +77 to +85
overrides_list = [
    f"input_path={input_path}",
    f"output_path={output_path}",
    f"fasttext_langid_model_path={fasttext_langid_model_path}",
    f"fasttext_quality_model_path={fasttext_quality_model_path}",
]
if overrides:
    overrides_list.extend(overrides.split(","))


[P2] --overrides parsing is fragile with Hydra values that contain commas.

Splitting the overrides string on commas will mis-handle overrides whose values include commas (common with list-like values), leading to invalid Hydra override tokens. Consider taking overrides as a repeatable flag (one per override) or otherwise avoiding naive comma-splitting.
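
A minimal sketch of the repeatable-flag approach (the flag name and wiring are illustrative, not the script's current interface):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--override",
    action="append",
    default=[],
    dest="extra_overrides",
    help="One Hydra override per flag, e.g. --override stages.0._target_=...; repeat as needed.",
)
args = parser.parse_args()

overrides_list: list[str] = []
overrides_list.extend(args.extra_overrides)  # values containing commas survive intact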

Comment on lines +642 to +676
- name: fasttext_filter_raydata
  enabled: true
  script: fasttext_filter_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --output-path={session_entry_dir}/scratch/output
    --executor=ray_data
    --input-path={dataset:tinystories,parquet}
    --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
    --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
    --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
    --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"
  timeout_s: 400
  sink_data:
    - name: slack
      additional_metrics:
        - num_kept_documents
        - throughput_docs_per_sec
  ray:
    num_cpus: 64
    num_gpus: 0
    enable_object_spilling: false

- name: fasttext_filter_xenna
  enabled: true
  script: fasttext_filter_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --output-path={session_entry_dir}/scratch/output
    --executor=xenna
    --input-path={dataset:tinystories,parquet}
    --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
    --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
    --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
    --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"

[P1] New FastText benchmark entries lack requirements, so regressions won’t be caught by nightly.

Most existing entries define a requirements: section to enforce throughput and/or data-integrity expectations. fasttext_filter_raydata and fasttext_filter_xenna currently only report metrics to Slack, so they’ll run but won’t fail the nightly job on major performance or correctness changes. If baseline metrics are known (or can be captured), adding minimal requirements (e.g., exact num_documents_processed and a conservative min throughput) would make these benchmarks actionable.

