Add ALM Data Pipeline tutorial and stages#1419

Open
mohammadaaftabv wants to merge 24 commits into NVIDIA-NeMo:main from mohammadaaftabv:alm_data_build

Conversation


@mohammadaaftabv mohammadaaftabv commented Jan 23, 2026

Add new NeMo Curator stages for ALM (Audio Language Model) data curation:

  • ALMDataBuilderStage: Creates training windows from audio segments with quality filtering (sample rate, bandwidth, speaker count, duration)
  • ALMDataOverlapStage: Filters overlapping windows based on threshold, keeping windows closest to target duration
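
The overlap-filtering idea described for ALMDataOverlapStage can be sketched as a simplified, self-contained illustration (the function name and defaults here are hypothetical, not the stage's actual code):

```python
def filter_overlapping_windows(windows, max_overlap_ratio=0.5, target_duration=120.0):
    """Greedily keep windows closest to the target duration, dropping any
    candidate whose overlap with an already-kept window exceeds the threshold."""
    # Prefer windows whose duration is closest to the target.
    ranked = sorted(windows, key=lambda w: abs((w[1] - w[0]) - target_duration))
    kept = []
    for start, end in ranked:
        duration = end - start
        overlaps_too_much = any(
            max(0.0, min(end, k_end) - max(start, k_start)) > max_overlap_ratio * duration
            for k_start, k_end in kept
        )
        if not overlaps_too_much:
            kept.append((start, end))
    return sorted(kept)
```

For example, given three candidate windows where two overlap by more than the allowed ratio, the one closer to the target duration survives and the other is dropped.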

Add complete tutorial with:

  • Python CLI (pipeline.py) and Hydra runner (run.py)
  • Sample input data for testing
  • Comprehensive documentation

Tested with sample data:

  • Stage 1 produces 181 windows from 5 input entries
  • Stage 2 filters to 25 non-overlapping windows (3035.5s total)

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot

copy-pr-bot bot commented Jan 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Jan 25, 2026
@mohammadaaftabv mohammadaaftabv force-pushed the alm_data_build branch 4 times, most recently from 66abf28 to 0125f32 Compare January 29, 2026 09:36
@mohammadaaftabv mohammadaaftabv marked this pull request as ready for review January 29, 2026 09:38
@greptile-apps
Contributor

greptile-apps bot commented Jan 29, 2026

Greptile Summary

This PR adds comprehensive ALM (Audio Language Model) data curation stages with builder and overlap filtering capabilities. The implementation follows NeMo Curator patterns correctly and includes extensive tests, tutorials, and benchmarking.

Key Changes:

  • ALMDataBuilderStage: Creates training windows from audio segments with quality filtering (sample rate, bandwidth, speaker count, duration)
  • ALMDataOverlapStage: Filters overlapping windows based on configurable threshold
  • ALMManifestReader/ALMManifestReaderStage: Composite stage for efficient JSONL manifest reading
  • ALMManifestWriterStage: JSONL output writer
  • Complete tutorial with Hydra configuration and sample data
  • Comprehensive test suite and benchmark infrastructure

Issues Resolved Since Previous Reviews:
Many critical issues flagged in earlier review threads have been addressed:

  • alm_manifest_reader.py file now exists (was restored after deletion)
  • XennaExecutor import is correct (properly re-exported from xenna.__init__)
  • Output schema is consistently populated even when windows are empty
  • Bandwidth threshold now uses configurable parameter instead of hardcoded value

Remaining Considerations:
The main remaining issues are maintainability concerns rather than functional bugs:

  • Test assertions use hard-coded exact counts (assert total_windows == 181) that will break on any logic change, fixture modification, or floating-point rounding differences
  • The overlap filtering uses non-standard (end, start) tuple ordering throughout, which is internally consistent but may confuse maintainers
  • Logging in alm_manifest_reader.py:53 shows cumulative entry count across all manifests rather than per-manifest count

Confidence Score: 4/5

  • This PR is safe to merge with low risk. Core functionality is sound and critical issues from previous reviews have been resolved.
  • Score of 4 reflects that the implementation is functionally correct with proper error handling and boundary checks, but contains test brittleness that could cause future CI failures. The main concerns are maintainability issues (hard-coded test assertions, non-standard tuple conventions) rather than runtime bugs. The overlap filtering logic, file I/O, and stage integration are all correct despite some previous threads incorrectly flagging the overlap threshold as inverted.
  • Test files (test_alm_data_builder.py, test_alm_data_overlap.py) need attention for brittle assertions that will break on minor changes. Consider refactoring to assert invariants rather than exact counts.

Important Files Changed

File overviews:

  • nemo_curator/stages/audio/alm/alm_data_builder.py: Implements the ALM window builder stage with quality filtering. Core logic appears sound with proper boundary checks. Previous concerns about the hardcoded bandwidth threshold and an IndexError have been addressed.
  • nemo_curator/stages/audio/alm/alm_data_overlap.py: Filters overlapping windows based on a threshold. Uses a non-standard (end, start) tuple convention but is internally consistent. The output schema is now properly initialized for the empty-windows case.
  • nemo_curator/stages/audio/alm/alm_manifest_reader.py: Composite stage for reading JSONL manifests using fsspec. The file exists (it was restored after an earlier deletion). Minor logging clarity issue: it shows the cumulative entry count across manifests.
  • tests/stages/audio/alm/test_alm_data_builder.py: Unit tests for the builder stage. Contains a brittle assertion (line 163) with a hard-coded count that will break on logic changes or fixture modifications.
  • tests/stages/audio/alm/test_alm_data_overlap.py: Integration tests for the overlap stage. Contains multiple brittle assertions (lines 142-144) with exact hard-coded counts that reduce test maintainability.

Last reviewed commit: c0f5cc0

@greptile-apps greptile-apps bot left a comment

30 files reviewed, 17 comments

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@karpnv karpnv self-requested a review January 31, 2026 00:35
@karpnv karpnv left a comment

LGTM

@ayushdg ayushdg left a comment

Few comments:

  1. Can you add a benchmarking script to benchmarks and share a representative dataset that can be used to run an ALM pipeline?
  2. You are already logging many statistics in these stages. Is it possible to also use _log_metrics, as done in some of the text stages, to log some of these timing metrics so they can be tracked to catch regressions?

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@mohammadaaftabv
Author

Few comments:

  1. Can you add a benchmarking script to benchmarks and share a representative dataset that can be used to run an ALM pipeline?

https://github.com/mohammadaaftabv/Curator/tree/alm_data_build/tests/fixtures/audio/alm is the representative dataset. I am assuming that by benchmarks you mean the result of running both processors on the representative data; in that case, ALM data build should produce 181 windows based on the config in the test file, and ALM data overlap applied to the resulting 181 windows, allowing at most 50% overlap, gives 3035.5 seconds of total output.

All of this is covered by the test cases here.

@mohammadaaftabv
Author

  2. You are already logging many statistics in these stages. Is it possible to also use _log_metrics, as done in some of the text stages, to log some of these timing metrics so they can be tracked to catch regressions?

Added _log_metrics calls to both stages, following the pattern in text stages. Now tracking:

  • ALMDataBuilderStage: process_entry_time, segments_processed, windows_created
  • ALMDataOverlapStage: filter_time, input_windows, output_windows
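
For illustration, metric accumulation of this kind could be sketched as follows (a hypothetical stand-in, not NeMo Curator's actual _log_metrics API):

```python
import time
from collections import defaultdict


class MetricsRecorder:
    """Hypothetical sketch of per-stage metric accumulation; not the actual
    _log_metrics implementation in NeMo Curator."""

    def __init__(self):
        self.metrics = defaultdict(float)

    def log(self, name, value):
        self.metrics[name] += value

    def timed(self, name):
        recorder = self

        class _Timer:
            # Context manager recording elapsed wall time under `name`.
            def __enter__(self):
                self.t0 = time.perf_counter()

            def __exit__(self, *exc):
                recorder.log(name, time.perf_counter() - self.t0)

        return _Timer()


rec = MetricsRecorder()
with rec.timed("process_entry_time"):
    rec.log("windows_created", 181)
```

Accumulating into a dict this way lets per-stage counters and timings be dumped in one place for regression tracking.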

@greptile-apps greptile-apps bot left a comment

13 files reviewed, 4 comments

@greptile-apps greptile-apps bot left a comment

13 files reviewed, 5 comments

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 2 comments

@greptile-apps greptile-apps bot left a comment

13 files reviewed, 1 comment


# Calculate statistics
# Stage 1 output: total_dur_list_window contains the original window count
stage1_windows = sum(len(e.get("total_dur_list_window", e.get("windows", []))) for e in output_entries)
Contributor

I guess these make sense, but also take a look at Task._metadata and Task._stage_perf_stats in case anything there is relevant.

self._drop_fields_set = {f.strip() for f in self.drop_fields.split(",") if f.strip()}
self._drop_fields_top_level_set = {f.strip() for f in self.drop_fields_top_level.split(",") if f.strip()}

def process_dataset_entry(self, data_entry: dict[str, Any]) -> list[AudioBatch]:
Contributor

Is it intentional that we operate on a single manifest entry at a time? Can any of this be vectorized? Same for other stages

Author

Yes, this is intentional — it follows the LegacySpeechStage pattern used by all other audio stages (GetAudioDurationStage, PreserveByValueStage, etc.), where process() iterates over task.data and calls process_dataset_entry() per entry.

Parallelism is handled at the executor level instead. In benchmark testing (10,000 entries, XennaExecutor on 8-core i9-9900KF), the autoscaler allocated 4 workers to the Builder stage, achieving ~1,460 entries/sec aggregate throughput (365 entries/sec/worker) with 86% CPU utilization. The Overlap stage ran 3 workers at ~5,650 entries/sec. Full pipeline completed in 90s.

If we want batch-level optimization in the future, it would need to happen at the LegacySpeechStage base class level, which would affect all audio stages.
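
The per-entry pattern described above can be sketched with a minimal mock (class and method names here are simplified stand-ins for the LegacySpeechStage pattern, not the real base class):

```python
class PerEntryStage:
    """Minimal mock of the per-entry pattern: process() iterates the task's
    entries and delegates each one to process_dataset_entry()."""

    def process(self, entries):
        out = []
        for entry in entries:
            out.extend(self.process_dataset_entry(entry))
        return out

    def process_dataset_entry(self, entry):
        raise NotImplementedError


class MinDurationStage(PerEntryStage):
    """Toy example stage: keep entries at least min_dur seconds long."""

    def __init__(self, min_dur):
        self.min_dur = min_dur

    def process_dataset_entry(self, entry):
        # Returning a list lets one entry fan out to zero or more outputs.
        return [entry] if entry["duration"] >= self.min_dur else []


stage = MinDurationStage(min_dur=1.0)
kept = stage.process([{"duration": 0.5}, {"duration": 2.0}])
```

Because each entry maps independently to zero or more outputs, parallelism can indeed live entirely at the executor level, as the reply notes.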

}
)

return [AudioBatch(data=[result])]
Contributor

Each time we return a Task, you must pass its parent task's metadata and stage_perf_stats.

In such a fan-out implementation this becomes hard to reason about.

Author

Done

Contributor

Yeah the _stage_perfs are supposedly propagated via the base LegacySpeechStage. I would be curious to look at the benchmark results for this PR though to get an even better understanding of how existing audio curation code can be refactored.

@greptile-apps
Contributor

greptile-apps bot commented Feb 26, 2026

Additional Comments (3)

nemo_curator/stages/audio/alm/alm_data_builder.py, line 168
Missing validation for stage parameters. Add checks to ensure:

  • min_speakers <= max_speakers
  • tolerance >= 0 (reasonable range)
  • target_window_duration > 0
  • min_sample_rate > 0

Without validation, invalid configs can cause confusing runtime behavior.
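
A sketch of such validation, assuming a dataclass-style config (the fields below are hypothetical, mirroring the checks listed above rather than the stage's actual signature):

```python
from dataclasses import dataclass


@dataclass
class BuilderConfig:
    """Hypothetical config sketch showing the suggested validation checks."""

    min_speakers: int = 1
    max_speakers: int = 4
    tolerance: float = 0.5
    target_window_duration: float = 120.0
    min_sample_rate: int = 16000

    def __post_init__(self):
        # Fail fast on invalid configs instead of surfacing confusing
        # behavior deep inside the pipeline.
        if self.min_speakers > self.max_speakers:
            raise ValueError("min_speakers must be <= max_speakers")
        if self.tolerance < 0:
            raise ValueError("tolerance must be >= 0")
        if self.target_window_duration <= 0:
            raise ValueError("target_window_duration must be > 0")
        if self.min_sample_rate <= 0:
            raise ValueError("min_sample_rate must be > 0")
```

Putting the checks in __post_init__ means any construction path, including Hydra instantiation, hits them.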


nemo_curator/stages/audio/alm/alm_data_builder.py, line 203
Potential KeyError if segments are missing the required fields start and end. Consider adding validation or try/except to handle malformed input data gracefully with a clear error message.
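
One possible shape for that check (a hypothetical helper, not code from the PR):

```python
def get_segment_bounds(segment):
    """Return (start, end) from a segment dict, raising a clear error
    instead of a bare KeyError when required fields are missing."""
    missing = [k for k in ("start", "end") if k not in segment]
    if missing:
        raise ValueError(f"Segment missing required field(s) {missing}: {segment}")
    return segment["start"], segment["end"]
```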


nemo_curator/stages/audio/alm/alm_data_overlap.py, line 62
Non-standard tuple order (end, start) requires compensating index swaps throughout the file (lines 67-68, 87-88, 94-95). This makes the code harder to understand and maintain. Consider using standard (start, end) convention or add a clear comment explaining why this order is intentional.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
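
For comparison, the suggested standard (start, end) convention makes overlap computation read naturally with no compensating index swaps (illustrative only):

```python
def overlap_seconds(a, b):
    """Overlap length of two intervals in the standard (start, end) order."""
    a_start, a_end = a
    b_start, b_end = b
    # Non-overlapping intervals clamp to zero.
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))
```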

@greptile-apps
Contributor

greptile-apps bot commented Feb 27, 2026

Additional Comments (5)

nemo_curator/stages/audio/alm/alm_data_builder.py, line 112
The hardcoded 8000 should use self.min_bandwidth so that diagnostics are correct when users configure different bandwidth thresholds.


nemo_curator/stages/audio/alm/alm_data_overlap.py, line 62
This returns tuples in (end, start) order, requiring index swaps at lines 67-68, 87-88, and 94-95 and unpacking as for end, start at lines 121 and 126. Standard (start, end) order would improve readability.



tests/stages/audio/alm/test_alm_data_builder.py, line 163
The hardcoded count 181 makes the test brittle; it will break on fixture changes or logic tweaks. Consider asserting total_windows > 0 instead.



tests/stages/audio/alm/test_alm_data_overlap.py, line 144
The hardcoded counts 181, 25, and 3035.50 make these tests brittle. Consider asserting invariants like total_filtered_windows <= total_builder_windows and filtered_dur > 0.

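
Invariant-style assertions of the kind suggested could look like this (hypothetical test sketch with made-up data):

```python
def check_overlap_invariants(builder_windows, filtered_windows, filtered_dur):
    """Assert properties that hold regardless of fixture changes, instead of
    exact hard-coded counts like 181 or 3035.50."""
    # Filtering can only remove windows, never add them.
    assert len(filtered_windows) <= len(builder_windows)
    assert set(filtered_windows) <= set(builder_windows)
    # A non-empty result should carry positive total duration.
    assert filtered_dur > 0


check_overlap_invariants(
    builder_windows=[(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)],
    filtered_windows=[(0.0, 10.0), (20.0, 30.0)],
    filtered_dur=20.0,
)
```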


nemo_curator/stages/audio/alm/alm_manifest_reader.py, line 53
This logs the cumulative total len(entries) across all manifests, not the count from the current manifest. Track the per-manifest count separately for accurate logging.

