
Docker single-GPU verification, NVSHMEM/pip build fix, and small runtime fixes #140

Open
rich7420 wants to merge 2 commits into hao-ai-lab:main from rich7420:first-click

rich7420 commented Feb 6, 2026

Summary

  • Add Docker-based single-GPU smoke test and benchmark (no Slurm).
  • Fix csrc build when using pip-installed NVSHMEM (nvidia-nvshmem-cu12).
  • Fix rope_scaling handling when hf_config is a PretrainedConfig object.
  • Improve softmax_lse shape assert message in dispatch.py and flash-attn error message in fused_comm_attn.py.
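The rope_scaling item concerns configs that may arrive either as a plain dict or as a PretrainedConfig-style object. A minimal sketch of handling both cases follows. The helper name `get_rope_scaling` is hypothetical and is not the function changed in distca/utils/megatron_test_utils.py; `SimpleNamespace` stands in for a config object so the example runs without transformers installed.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_rope_scaling(hf_config: Any) -> Optional[dict]:
    """Return rope_scaling from a dict-style or attribute-style config.

    Hypothetical helper for illustration; not the code from this PR.
    """
    if isinstance(hf_config, dict):
        # Plain dict config: a missing key means "no scaling configured".
        return hf_config.get("rope_scaling")
    # Object config (for example a PretrainedConfig): read the attribute.
    return getattr(hf_config, "rope_scaling", None)


# SimpleNamespace stands in for a config object here (an assumption).
dict_cfg = {"rope_scaling": {"type": "linear", "factor": 2.0}}
obj_cfg = SimpleNamespace(rope_scaling={"type": "linear", "factor": 2.0})
plain_cfg = SimpleNamespace()  # no rope_scaling attribute at all
```

Calling `.get("rope_scaling")` directly on an object config would raise an AttributeError, which is the kind of failure this pattern avoids.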

Changes

  • Docker / scripts: Dockerfile, scripts/docker_install_and_build.sh, scripts/run_docker_benchmark.sh, scripts/run_docker_single_gpu_smoke.sh, scripts/run_docker_single_gpu_benchmark.sh, scripts/single_gpu_smoke.sh, scripts/single_gpu_benchmark.sh. These support either a one-shot smoke test or benchmark (the container exits when done) or an interactive shell (the container keeps running).
  • Docs: README.md (link to Docker verification), README.Docker.md (step-by-step).
  • Build: csrc/CMakeLists.txt, csrc/cmake/FindNVSHMEM.cmake. Support NVSHMEM installed from pip (no NVSHMEMConfig.cmake).
  • Runtime: distca/runtime/attn_kernels/dispatch.py (clearer assert), distca/runtime/megatron/ops/fused_comm_attn.py (error text), distca/utils/megatron_test_utils.py (rope_scaling for config object).
  • Other: .gitignore (e.g. models/, .build/), requirements.txt (transformers pin, comment cleanup).
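For the build change, the relevant point is that a pip-installed NVSHMEM lives inside site-packages and has no NVSHMEMConfig.cmake, so headers and libraries have to be found by path. A rough sketch of that lookup is below. It assumes the wheel unpacks to `nvidia/nvshmem/{include,lib}` under site-packages, which is an assumption about the wheel layout and not something stated in this PR. The helper names `candidate_site_dirs` and `find_pip_nvshmem` are hypothetical.

```python
import site
import sysconfig
from pathlib import Path
from typing import Iterable, List, Optional, Tuple


def candidate_site_dirs() -> List[str]:
    """Collect site-packages directories for the current interpreter."""
    paths = sysconfig.get_paths()
    dirs = [paths["purelib"], paths["platlib"]]
    # site.getsitepackages can be absent in some older virtualenv setups.
    dirs += getattr(site, "getsitepackages", lambda: [])()
    return list(dict.fromkeys(dirs))  # de-duplicate while keeping order


def find_pip_nvshmem(
    site_dirs: Optional[Iterable[str]] = None,
) -> Optional[Tuple[Path, Path]]:
    """Return (include_dir, lib_dir) for a pip-installed NVSHMEM, or None.

    Assumes the layout <site-packages>/nvidia/nvshmem/{include,lib}.
    """
    search = site_dirs if site_dirs is not None else candidate_site_dirs()
    for base in search:
        root = Path(base) / "nvidia" / "nvshmem"
        include_dir, lib_dir = root / "include", root / "lib"
        if include_dir.is_dir() and lib_dir.is_dir():
            return include_dir, lib_dir
    return None
```

The returned paths could then be handed to CMake as search hints. How the find module in this PR actually consumes them is not shown here.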

How to verify

From the repo root, on a machine with one GPU, run: ./scripts/run_docker_single_gpu_smoke.sh
