
Conversation

kevalmorabia97 (Collaborator) commented Sep 8, 2025

What does this PR do?

Type of change: Test speedup

Overview: The Eagle example tests were using the default config, which has large model dimensions, so each GPU needed ~28GB of memory and hit OOM on A5000 servers. Optimized to use just ~1GB now!
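The memory fix boils down to writing a tiny draft-model config to a temp file and pointing the test at it. A minimal sketch of that pattern — the field names here are illustrative, not necessarily the exact keys the example's `--eagle_config` loader expects:

```python
import json
import tempfile
from pathlib import Path

# Illustrative tiny draft-model config; the real key names depend on the
# EAGLE config loader in the example and may differ.
tiny_eagle_config = {
    "hidden_size": 128,
    "intermediate_size": 256,
    "num_hidden_layers": 1,
    "num_attention_heads": 2,
    "max_position_embeddings": 128,  # matches --training_seq_len 128
}

tmp_path = Path(tempfile.mkdtemp())
config_file = tmp_path / "eagle_config.json"
config_file.write_text(json.dumps(tiny_eagle_config, indent=2, sort_keys=True))

# Round-trips cleanly, ready to pass as --eagle_config <config_file>
print(json.loads(config_file.read_text()) == tiny_eagle_config)  # True
```

Shrinking layer count, hidden width, and sequence length together is what cuts both weight and activation memory.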

Testing

Summary by CodeRabbit

  • New Features
    • Speculative decoding example now integrates TensorBoard by default: runs launched via the provided script automatically report metrics, enabling real-time dashboards for training and evaluation without extra arguments or setup. This simplifies experiment tracking and makes it easier to compare runs, monitor progress, and diagnose performance changes during execution. No other user-facing behavior changes.


coderabbitai bot commented Sep 8, 2025

Walkthrough

Adds TensorBoard reporting to the speculative decoding launch script. Updates a speculative decoding test to use a tiny EAGLE config passed via file and adjusts training sequence length and output path. Removes a skip guard in a Megatron export test so eagle-path tests run regardless of optional dependency presence.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Speculative decoding launch script**<br>`examples/speculative_decoding/launch.sh` | Inserted `--report_to tensorboard` into the Accelerate launch command (via a new line in `CMD`). |
| **Speculative decoding EAGLE test**<br>`tests/examples/speculative_decoding/test_eagle.py` | Renamed the test function to `test_llama_eagle3`. Added a tiny EAGLE config JSON (written to a tmp path), passed `--eagle_config <file>`, set the output dir to the tmp path, specified `--training_seq_len 128`, and imported `json`. |
| **Megatron unified export test**<br>`tests/gpu/torch/export/test_unified_export_megatron.py` | Removed the conditional skip logic that depended on `megatron.core.post_training`; tests now proceed without import-based skipping. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant T as PyTest
  participant M as Unified Export Test
  participant D as Optional Dep (megatron.core.post_training)
  participant C as Conversion Logic

  rect rgba(200,200,255,0.2)
  note right of T: Old flow (before)
  T->>M: Start eagle-path test
  M->>D: Attempt import
  alt ImportError
    M-->>T: skip test
  else Import OK
    M->>C: Run conversion
    C-->>M: Result
    M-->>T: Assert outcomes
  end
  end

  rect rgba(200,255,200,0.2)
  note right of T: New flow (now)
  T->>M: Start eagle-path test
  M->>C: Run conversion (no pre-check)
  alt Dependency missing
    C-->>M: Error raised
    M-->>T: Test fails (error)
  else OK
    C-->>M: Result
    M-->>T: Assert outcomes
  end
  end
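The old flow in the diagram is the usual optional-dependency guard; a minimal sketch of what such a pre-check looks like (the exact code removed from the test may have differed):

```python
import importlib.util


def module_available(name: str) -> bool:
    """Check whether a module could be imported, without importing it fully.

    find_spec raises ModuleNotFoundError when a parent package of a dotted
    name is absent, so that case is caught and reported as unavailable.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        return False


# Old flow (removed by this PR): skip eagle-path tests when the optional
# dependency is missing. New flow: run unconditionally, so a missing
# dependency surfaces as a test failure rather than a silent skip.
if not module_available("megatron.core.post_training"):
    print("before: test skipped; now: test runs and may fail")
```

The trade-off is visibility: skips can mask a broken environment, while hard failures force the CI image to actually ship the dependency.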

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I toggled flags with tidy care,
A tiny EAGLE takes the air.
No skipping now—we run the flight,
If deps are shy, we learn by night.
Logs in TensorBoard aglow,
A rabbit nods: on we go! 🐇✨


kevalmorabia97 requested review from yeyu-nvidia and h-guo18 and removed the request for a team and cjluo-nv — September 8, 2025 10:45
coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
examples/speculative_decoding/launch.sh (1)

92-95: Guard against division by zero when no GPUs are visible

If `torch.cuda.device_count()` returns 0, the arithmetic `192 / GPU_COUNT` fails with a division-by-zero error. Add a safe fallback.

-GPU_COUNT=$(python -c "import torch; print(torch.cuda.device_count())")
-DEFAULT_SAVE_STEPS=$((192 / GPU_COUNT))
+GPU_COUNT=$(python -c "import torch; print(torch.cuda.device_count())")
+if [[ "$GPU_COUNT" -gt 0 ]]; then
+  DEFAULT_SAVE_STEPS=$((192 / GPU_COUNT))
+else
+  DEFAULT_SAVE_STEPS=192
+fi
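The same fallback can be expressed in Python for clarity — this mirrors the suggested shell logic above and is not code from the repo:

```python
def default_save_steps(gpu_count: int, base: int = 192) -> int:
    """Divide the base step budget across visible GPUs, falling back to
    the full budget when no GPUs are detected (avoids division by zero)."""
    return base // gpu_count if gpu_count > 0 else base


print(default_save_steps(4))  # 48
print(default_save_steps(0))  # 192
```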
tests/examples/speculative_decoding/test_eagle.py (2)

33-36: Make the config file easier to debug

Pretty-print JSON for quick inspection on failures.

-    with open(config_file, "w") as f:
-        json.dump(tiny_eagle_config, f)
+    with open(config_file, "w") as f:
+        json.dump(tiny_eagle_config, f, indent=2, sort_keys=True)

45-50: Stabilize test memory and launch mode across environments

  • Force single-GPU to keep memory predictable (CI hosts with >1 GPU currently trigger --multi_gpu via launch.sh).
  • Lower train batch size to 1 to further cap memory.
-            "--num_gpu", str(num_gpus),
+            "--num_gpu", "1",
             "--mode", "eagle3",
             "--eagle_config", str(config_file),
-            "--output_dir", tmp_path / "eagle-tinyllama",
-            "--training_seq_len", "128", # Match max_position_embeddings
+            "--output_dir", tmp_path / "eagle-tinyllama",
+            "--training_seq_len", "128",  # Match max_position_embeddings
+            "--train_bs", "1",

If you want to keep multi-GPU testing, consider forcing a single GPU only in CI: detect it via an environment variable (e.g. CI=true) and pass 1 there, while keeping num_gpus for local runs.
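One way to express that CI-only clamp — the helper name and the `CI=true` convention are illustrative assumptions, not the repo's actual mechanism:

```python
import os


def effective_gpu_count(detected: int) -> int:
    """Force at most one GPU on CI hosts for predictable memory; keep
    the detected count for local multi-GPU runs. CI detection via the
    conventional CI=true environment variable is an assumption here."""
    if os.environ.get("CI", "").lower() == "true":
        return min(detected, 1)
    return detected
```

The test would then pass `str(effective_gpu_count(num_gpus))` to `--num_gpu` instead of `str(num_gpus)`.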

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5c88e7 and 254d6b0.

📒 Files selected for processing (3)
  • examples/speculative_decoding/launch.sh (1 hunks)
  • tests/examples/speculative_decoding/test_eagle.py (2 hunks)
  • tests/gpu/torch/export/test_unified_export_megatron.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/gpu/torch/export/test_unified_export_megatron.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/examples/speculative_decoding/test_eagle.py (3)
tests/_test_utils/examples/run_command.py (1)
  • run_example_command (35-37)
tests/examples/conftest.py (2)
  • tiny_llama_path (33-41)
  • num_gpus (23-24)
tests/examples/speculative_decoding/conftest.py (1)
  • tiny_daring_anteater_path (23-36)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (2)
tests/examples/speculative_decoding/test_eagle.py (2)

16-16: Import is correct and scoped to the new JSON config usage


22-31: Tiny EAGLE config aligns with goal to cut memory footprint

Good call on 128 seq len and single-layer/low-width knobs to reduce activation and KV cache sizes.

Please confirm these keys match what --eagle_config loader expects; if additional required fields emerge upstream, the test may start failing.


codecov bot commented Sep 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.93%. Comparing base (b233ad1) to head (984e30f).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #299   +/-   ##
=======================================
  Coverage   73.93%   73.93%           
=======================================
  Files         172      172           
  Lines       17408    17408           
=======================================
  Hits        12870    12870           
  Misses       4538     4538           

☔ View full report in Codecov by Sentry.
kevalmorabia97 merged commit cf6f1d4 into main on Sep 8, 2025
22 checks passed
kevalmorabia97 deleted the kmorabia/eagle-test-mem branch on September 8, 2025 16:05