Reduce eagle example test memory usage from 28 to 1 GB #299
Conversation
Signed-off-by: Keval Morabia <[email protected]>
Walkthrough

Adds TensorBoard reporting to the speculative decoding launch script. Updates a speculative decoding test to use a tiny EAGLE config passed via file and adjusts training sequence length and output path. Removes a skip guard in a Megatron export test so eagle-path tests run regardless of optional dependency presence.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant T as PyTest
    participant M as Unified Export Test
    participant D as Optional Dep (megatron.core.post_training)
    participant C as Conversion Logic
    rect rgba(200,200,255,0.2)
        note right of T: Old flow (before)
        T->>M: Start eagle-path test
        M->>D: Attempt import
        alt ImportError
            M-->>T: skip test
        else Import OK
            M->>C: Run conversion
            C-->>M: Result
            M-->>T: Assert outcomes
        end
    end
    rect rgba(200,255,200,0.2)
        note right of T: New flow (now)
        T->>M: Start eagle-path test
        M->>C: Run conversion (no pre-check)
        alt Dependency missing
            C-->>M: Error raised
            M-->>T: Test fails (error)
        else OK
            C-->>M: Result
            M-->>T: Assert outcomes
        end
    end
```
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 1
🧹 Nitpick comments (3)
examples/speculative_decoding/launch.sh (1)

`92-95`: Guard division-by-zero when no GPUs are visible

If `torch.cuda.device_count()` returns 0, `192 / GPU_COUNT` errors. Add a safe fallback.

```diff
-GPU_COUNT=$(python -c "import torch; print(torch.cuda.device_count())")
-DEFAULT_SAVE_STEPS=$((192 / GPU_COUNT))
+GPU_COUNT=$(python -c "import torch; print(torch.cuda.device_count())")
+if [[ "$GPU_COUNT" -gt 0 ]]; then
+  DEFAULT_SAVE_STEPS=$((192 / GPU_COUNT))
+else
+  DEFAULT_SAVE_STEPS=192
+fi
```

tests/examples/speculative_decoding/test_eagle.py (2)
`33-36`: Make the config file easier to debug

Pretty-print JSON for quick inspection on failures.

```diff
-    with open(config_file, "w") as f:
-        json.dump(tiny_eagle_config, f)
+    with open(config_file, "w") as f:
+        json.dump(tiny_eagle_config, f, indent=2, sort_keys=True)
```
`45-50`: Stabilize test memory and launch mode across environments

- Force single-GPU to keep memory predictable (CI hosts with >1 GPU currently trigger `--multi_gpu` via launch.sh).
- Lower train batch size to 1 to further cap memory.

```diff
-        "--num_gpu", str(num_gpus),
+        "--num_gpu", "1",
         "--mode", "eagle3",
         "--eagle_config", str(config_file),
-        "--output_dir", tmp_path / "eagle-tinyllama",
-        "--training_seq_len", "128",  # Match max_position_embeddings
+        "--output_dir", tmp_path / "eagle-tinyllama",
+        "--training_seq_len", "128",  # Match max_position_embeddings
+        "--train_bs", "1",
```

If you want to keep multi-GPU testing, consider clamping to 1 only in CI: detect via env (e.g., CI=true) and pass 1 there.
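A CI-only clamp as floated above could look like the sketch below. The `effective_num_gpus` helper is hypothetical, and the `CI=true` convention is an assumption (GitHub Actions sets it; other runners may differ).

```python
import os

def effective_num_gpus(num_gpus: int) -> int:
    """Return the GPU count to pass to the launcher.

    Clamps to 1 on CI hosts so test memory stays predictable; elsewhere
    the detected count is used unchanged.
    """
    # Many CI providers export CI=true in the job environment (an assumption).
    if os.environ.get("CI", "").lower() == "true":
        return 1
    return num_gpus
```

The test would then pass `str(effective_num_gpus(num_gpus))` instead of `str(num_gpus)`, keeping local multi-GPU runs intact.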
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)

- examples/speculative_decoding/launch.sh (1 hunks)
- tests/examples/speculative_decoding/test_eagle.py (2 hunks)
- tests/gpu/torch/export/test_unified_export_megatron.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/gpu/torch/export/test_unified_export_megatron.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/examples/speculative_decoding/test_eagle.py (3)

- tests/_test_utils/examples/run_command.py (1): run_example_command (35-37)
- tests/examples/conftest.py (2): tiny_llama_path (33-41), num_gpus (23-24)
- tests/examples/speculative_decoding/conftest.py (1): tiny_daring_anteater_path (23-36)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (2)
tests/examples/speculative_decoding/test_eagle.py (2)
`16-16`: Import is correct and scoped to the new JSON config usage

`22-31`: Tiny EAGLE config aligns with goal to cut memory footprint

Good call on 128 seq len and single-layer/low-width knobs to reduce activation and KV cache sizes.

Please confirm these keys match what the `--eagle_config` loader expects; if additional required fields emerge upstream, the test may start failing.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main     #299   +/-   ##
=======================================
  Coverage   73.93%   73.93%
=======================================
  Files         172      172
  Lines       17408    17408
=======================================
  Hits        12870    12870
  Misses       4538     4538
```

☔ View full report in Codecov by Sentry.
What does this PR do?
Type of change: Test speedup
Overview: The Eagle example tests were using the default config, which has large dimensions, so they consumed ~28 GB per GPU and caused OOM on A5000 servers. Optimized to use just ~1 GB now!
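A back-of-envelope sketch of why shrinking the config helps so much: transformer block parameters scale roughly with the square of the hidden size, so a tiny width and a single layer collapse the footprint. The dimensions below are hypothetical illustrations, not the actual default or test config values.

```python
# Rough rule of thumb: one transformer block holds about 12 * h^2 parameters
# (4*h^2 for the q/k/v/o attention projections + 8*h^2 for a 4x-wide MLP).
def approx_block_params(hidden_size: int) -> int:
    return 12 * hidden_size * hidden_size

# Hypothetical "large default" vs. tiny test config (illustrative numbers only).
default_params = 32 * approx_block_params(4096)  # 32 layers, width 4096
tiny_params = 1 * approx_block_params(128)       # 1 layer, width 128

print(f"default ≈ {default_params / 1e9:.1f}B params, tiny ≈ {tiny_params / 1e3:.0f}K params")
```

Even before counting activations and optimizer state, the quadratic scaling alone explains a multi-order-of-magnitude memory drop.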
Testing
Summary by CodeRabbit