Add ALM Data Pipeline tutorial and stages#1419

Open
mohammadaaftabv wants to merge 24 commits into NVIDIA-NeMo:main from mohammadaaftabv:alm_data_build

Conversation


@mohammadaaftabv mohammadaaftabv commented Jan 23, 2026

Add new NeMo Curator stages for ALM (Audio Language Model) data curation:

  • ALMDataBuilderStage: Creates training windows from audio segments with quality filtering (sample rate, bandwidth, speaker count, duration)
  • ALMDataOverlapStage: Filters overlapping windows based on threshold, keeping windows closest to target duration
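
The overlap-filtering idea described for ALMDataOverlapStage can be sketched as a simplified, self-contained illustration (the function name and defaults here are hypothetical, not the stage's actual code):

```python
def filter_overlapping_windows(windows, max_overlap_ratio=0.5, target_duration=120.0):
    """Greedily keep windows closest to the target duration, dropping any
    candidate whose overlap with an already-kept window exceeds the threshold."""
    # Prefer windows whose duration is closest to the target.
    ranked = sorted(windows, key=lambda w: abs((w[1] - w[0]) - target_duration))
    kept = []
    for start, end in ranked:
        duration = end - start
        overlaps_too_much = any(
            max(0.0, min(end, k_end) - max(start, k_start)) > max_overlap_ratio * duration
            for k_start, k_end in kept
        )
        if not overlaps_too_much:
            kept.append((start, end))
    return sorted(kept)
```

For example, given three candidate windows where two overlap by more than the allowed ratio, the one closer to the target duration survives and the other is dropped.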

Add complete tutorial with:

  • Python CLI (pipeline.py) and Hydra runner (run.py)
  • Sample input data for testing
  • Comprehensive documentation

Tested with sample data:

  • Stage 1 produces 181 windows from 5 input entries
  • Stage 2 filters to 25 non-overlapping windows (3035.5s total)

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot

copy-pr-bot bot commented Jan 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Jan 25, 2026
@mohammadaaftabv mohammadaaftabv force-pushed the alm_data_build branch 4 times, most recently from 66abf28 to 0125f32 Compare January 29, 2026 09:36
@mohammadaaftabv mohammadaaftabv marked this pull request as ready for review January 29, 2026 09:38
@greptile-apps
Contributor

greptile-apps bot commented Jan 29, 2026

Greptile Summary

This PR adds comprehensive ALM (Audio Language Model) data curation stages with builder and overlap filtering capabilities. The implementation follows NeMo Curator patterns correctly and includes extensive tests, tutorials, and benchmarking.

Key Changes:

  • ALMDataBuilderStage: Creates training windows from audio segments with quality filtering (sample rate, bandwidth, speaker count, duration)
  • ALMDataOverlapStage: Filters overlapping windows based on configurable threshold
  • ALMManifestReader/ALMManifestReaderStage: Composite stage for efficient JSONL manifest reading
  • ALMManifestWriterStage: JSONL output writer
  • Complete tutorial with Hydra configuration and sample data
  • Comprehensive test suite and benchmark infrastructure

Issues Resolved Since Previous Reviews:
Many critical issues flagged in earlier review threads have been addressed:

  • alm_manifest_reader.py file now exists (was restored after deletion)
  • XennaExecutor import is correct (properly re-exported from xenna.__init__)
  • Output schema is consistently populated even when windows are empty
  • Bandwidth threshold now uses configurable parameter instead of hardcoded value

Remaining Considerations:
The main remaining issues are maintainability concerns rather than functional bugs:

  • Test assertions use hard-coded exact counts (assert total_windows == 181) that will break on any logic change, fixture modification, or floating-point rounding differences
  • The overlap filtering uses non-standard (end, start) tuple ordering throughout, which is internally consistent but may confuse maintainers
  • Logging in alm_manifest_reader.py:53 shows cumulative entry count across all manifests rather than per-manifest count

Confidence Score: 4/5

  • This PR is safe to merge with low risk. Core functionality is sound and critical issues from previous reviews have been resolved.
  • Score of 4 reflects that the implementation is functionally correct with proper error handling and boundary checks, but contains test brittleness that could cause future CI failures. The main concerns are maintainability issues (hard-coded test assertions, non-standard tuple conventions) rather than runtime bugs. The overlap filtering logic, file I/O, and stage integration are all correct despite some previous threads incorrectly flagging the overlap threshold as inverted.
  • Test files (test_alm_data_builder.py, test_alm_data_overlap.py) need attention for brittle assertions that will break on minor changes. Consider refactoring to assert invariants rather than exact counts.

Important Files Changed

File overviews:

  • nemo_curator/stages/audio/alm/alm_data_builder.py: Implements the ALM window builder stage with quality filtering. Core logic appears sound with proper boundary checks. Previous concerns about the hardcoded bandwidth threshold and an IndexError have been addressed.
  • nemo_curator/stages/audio/alm/alm_data_overlap.py: Filters overlapping windows based on a threshold. Uses a non-standard (end, start) tuple convention but is internally consistent. The output schema is now properly initialized for the empty-windows case.
  • nemo_curator/stages/audio/alm/alm_manifest_reader.py: Composite stage for reading JSONL manifests using fsspec. The file exists (it was restored after an earlier deletion). Minor logging clarity issue: it shows the cumulative entry count across manifests.
  • tests/stages/audio/alm/test_alm_data_builder.py: Unit tests for the builder stage. Contains a brittle assertion (line 163) with a hard-coded count that will break on logic changes or fixture modifications.
  • tests/stages/audio/alm/test_alm_data_overlap.py: Integration tests for the overlap stage. Contains multiple brittle assertions (lines 142-144) with exact hard-coded counts that reduce test maintainability.

Last reviewed commit: c0f5cc0

@greptile-apps greptile-apps bot left a comment

30 files reviewed, 17 comments

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@karpnv karpnv self-requested a review January 31, 2026 00:35
@karpnv karpnv left a comment

LGTM

@ayushdg ayushdg left a comment

Few comments:

  1. Can you add a benchmarking script to benchmarks and share a representative dataset that can be used to run an ALM pipeline?
  2. You are already logging many statistics in these stages. Is it possible to also use _log_metrics, as done in some of the text stages, to log some of these timing metrics so they can be tracked to catch regressions?

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@mohammadaaftabv
Author

Few comments:

  1. Can you add a benchmarking script to benchmarks and share a representative dataset that can be used to run an ALM pipeline?

https://github.com/mohammadaaftabv/Curator/tree/alm_data_build/tests/fixtures/audio/alm is the representative dataset. I am assuming that by benchmarks you mean the result of running both processors on the representative data; in that case, ALM data build should produce 181 windows based on the config in the test file, and ALM data overlap applied to the resulting 181 windows, allowing at most 50% overlap, gives 3035.5 seconds of total output.

All of this is covered by the test cases here.

@mohammadaaftabv
Author

  2. You are already logging many statistics in these stages. Is it possible to also use _log_metrics, as done in some of the text stages, to log some of these timing metrics so they can be tracked to catch regressions?

Added _log_metrics calls to both stages, following the pattern in text stages. Now tracking:

  • ALMDataBuilderStage: process_entry_time, segments_processed, windows_created
  • ALMDataOverlapStage: filter_time, input_windows, output_windows
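
For illustration, metric accumulation of this kind could be sketched as follows (a hypothetical stand-in, not NeMo Curator's actual _log_metrics API):

```python
import time
from collections import defaultdict


class MetricsRecorder:
    """Hypothetical sketch of per-stage metric accumulation; not the actual
    _log_metrics implementation in NeMo Curator."""

    def __init__(self):
        self.metrics = defaultdict(float)

    def log(self, name, value):
        self.metrics[name] += value

    def timed(self, name):
        recorder = self

        class _Timer:
            # Context manager recording elapsed wall time under `name`.
            def __enter__(self):
                self.t0 = time.perf_counter()

            def __exit__(self, *exc):
                recorder.log(name, time.perf_counter() - self.t0)

        return _Timer()


rec = MetricsRecorder()
with rec.timed("process_entry_time"):
    rec.log("windows_created", 181)
```

Accumulating into a dict this way lets per-stage counters and timings be dumped in one place for regression tracking.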

@greptile-apps greptile-apps bot left a comment

13 files reviewed, 4 comments

@greptile-apps greptile-apps bot left a comment

13 files reviewed, 5 comments

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 2 comments

@greptile-apps greptile-apps bot left a comment

13 files reviewed, 1 comment


# Calculate statistics
# Stage 1 output: total_dur_list_window contains the original window count
stage1_windows = sum(len(e.get("total_dur_list_window", e.get("windows", []))) for e in output_entries)
Contributor

I guess these make sense, but also take a look at Task._metadata and Task._stage_perf_stats in case anything there is relevant.

self._drop_fields_set = {f.strip() for f in self.drop_fields.split(",") if f.strip()}
self._drop_fields_top_level_set = {f.strip() for f in self.drop_fields_top_level.split(",") if f.strip()}

def process_dataset_entry(self, data_entry: dict[str, Any]) -> list[AudioBatch]:
Contributor

Is it intentional that we operate on a single manifest entry at a time? Can any of this be vectorized? Same for other stages

Author

Yes, this is intentional — it follows the LegacySpeechStage pattern used by all other audio stages (GetAudioDurationStage, PreserveByValueStage, etc.), where process() iterates over task.data and calls process_dataset_entry() per entry.

Parallelism is handled at the executor level instead. In benchmark testing (10,000 entries, XennaExecutor on 8-core i9-9900KF), the autoscaler allocated 4 workers to the Builder stage, achieving ~1,460 entries/sec aggregate throughput (365 entries/sec/worker) with 86% CPU utilization. The Overlap stage ran 3 workers at ~5,650 entries/sec. Full pipeline completed in 90s.

If we want batch-level optimization in the future, it would need to happen at the LegacySpeechStage base class level, which would affect all audio stages.
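
The per-entry pattern described above can be sketched with a minimal mock (class and method names here are simplified stand-ins for the LegacySpeechStage pattern, not the real base class):

```python
class PerEntryStage:
    """Minimal mock of the per-entry pattern: process() iterates the task's
    entries and delegates each one to process_dataset_entry()."""

    def process(self, entries):
        out = []
        for entry in entries:
            out.extend(self.process_dataset_entry(entry))
        return out

    def process_dataset_entry(self, entry):
        raise NotImplementedError


class MinDurationStage(PerEntryStage):
    """Toy example stage: keep entries at least min_dur seconds long."""

    def __init__(self, min_dur):
        self.min_dur = min_dur

    def process_dataset_entry(self, entry):
        # Returning a list lets one entry fan out to zero or more outputs.
        return [entry] if entry["duration"] >= self.min_dur else []


stage = MinDurationStage(min_dur=1.0)
kept = stage.process([{"duration": 0.5}, {"duration": 2.0}])
```

Because each entry maps independently to zero or more outputs, parallelism can indeed live entirely at the executor level, as the reply notes.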

}
)

return [AudioBatch(data=[result])]
Contributor

Each time we return a Task, you must pass its parent task's metadata and stage_perf_stats.

In such a fan-out implementation this becomes hard to reason about.

Author

Done

Contributor

Yeah the _stage_perfs are supposedly propagated via the base LegacySpeechStage. I would be curious to look at the benchmark results for this PR though to get an even better understanding of how existing audio curation code can be refactored.

@greptile-apps
Contributor

greptile-apps bot commented Feb 26, 2026

Additional Comments (3)

nemo_curator/stages/audio/alm/alm_data_builder.py, line 168
Missing validation for stage parameters. Add checks to ensure:

  • min_speakers <= max_speakers
  • tolerance >= 0 (reasonable range)
  • target_window_duration > 0
  • min_sample_rate > 0

Without validation, invalid configs can cause confusing runtime behavior.
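
A sketch of such validation, assuming a dataclass-style config (the fields below are hypothetical, mirroring the checks listed above rather than the stage's actual signature):

```python
from dataclasses import dataclass


@dataclass
class BuilderConfig:
    """Hypothetical config sketch showing the suggested validation checks."""

    min_speakers: int = 1
    max_speakers: int = 4
    tolerance: float = 0.5
    target_window_duration: float = 120.0
    min_sample_rate: int = 16000

    def __post_init__(self):
        # Fail fast on invalid configs instead of surfacing confusing
        # behavior deep inside the pipeline.
        if self.min_speakers > self.max_speakers:
            raise ValueError("min_speakers must be <= max_speakers")
        if self.tolerance < 0:
            raise ValueError("tolerance must be >= 0")
        if self.target_window_duration <= 0:
            raise ValueError("target_window_duration must be > 0")
        if self.min_sample_rate <= 0:
            raise ValueError("min_sample_rate must be > 0")
```

Putting the checks in __post_init__ means any construction path, including Hydra instantiation, hits them.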


nemo_curator/stages/audio/alm/alm_data_builder.py, line 203
Potential KeyError if segments are missing the required fields start and end. Consider adding validation or try/except to handle malformed input data gracefully with a clear error message.
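
One possible shape for that check (a hypothetical helper, not code from the PR):

```python
def get_segment_bounds(segment):
    """Return (start, end) from a segment dict, raising a clear error
    instead of a bare KeyError when required fields are missing."""
    missing = [k for k in ("start", "end") if k not in segment]
    if missing:
        raise ValueError(f"Segment missing required field(s) {missing}: {segment}")
    return segment["start"], segment["end"]
```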


nemo_curator/stages/audio/alm/alm_data_overlap.py, line 62
Non-standard tuple order (end, start) requires compensating index swaps throughout the file (lines 67-68, 87-88, 94-95). This makes the code harder to understand and maintain. Consider using standard (start, end) convention or add a clear comment explaining why this order is intentional.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
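
For comparison, the suggested standard (start, end) convention makes overlap computation read naturally with no compensating index swaps (illustrative only):

```python
def overlap_seconds(a, b):
    """Overlap length of two intervals in the standard (start, end) order."""
    a_start, a_end = a
    b_start, b_end = b
    # Non-overlapping intervals clamp to zero.
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))
```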

@greptile-apps
Contributor

greptile-apps bot commented Feb 27, 2026

Additional Comments (5)

nemo_curator/stages/audio/alm/alm_data_builder.py, line 112
The hardcoded 8000 should use self.min_bandwidth so that diagnostics are correct when users configure different bandwidth thresholds.


nemo_curator/stages/audio/alm/alm_data_overlap.py, line 62
This returns tuples in (end, start) order, requiring index swaps at lines 67-68, 87-88, and 94-95 and unpacking as for end, start at lines 121 and 126. Standard (start, end) order would improve readability.



tests/stages/audio/alm/test_alm_data_builder.py, line 163
The hardcoded count 181 makes the test brittle; it will break on fixture changes or logic tweaks. Consider asserting total_windows > 0 instead.



tests/stages/audio/alm/test_alm_data_overlap.py, line 144
The hardcoded counts 181, 25, and 3035.50 make these tests brittle. Consider asserting invariants like total_filtered_windows <= total_builder_windows and filtered_dur > 0.

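
Invariant-style assertions of the kind suggested could look like this (hypothetical test sketch with made-up data):

```python
def check_overlap_invariants(builder_windows, filtered_windows, filtered_dur):
    """Assert properties that hold regardless of fixture changes, instead of
    exact hard-coded counts like 181 or 3035.50."""
    # Filtering can only remove windows, never add them.
    assert len(filtered_windows) <= len(builder_windows)
    assert set(filtered_windows) <= set(builder_windows)
    # A non-empty result should carry positive total duration.
    assert filtered_dur > 0


check_overlap_invariants(
    builder_windows=[(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)],
    filtered_windows=[(0.0, 10.0), (20.0, 30.0)],
    filtered_dur=20.0,
)
```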


nemo_curator/stages/audio/alm/alm_manifest_reader.py, line 53
This logs the cumulative total len(entries) across all manifests, not the count from the current manifest. Track the per-manifest count separately for accurate logging.

