
OOM when converting DeepSeek-V3 checkpoint from Hugging Face to Megatron #1783

@mayeths

Description


Describe the bug

I am trying to convert the DeepSeek-V3 Hugging Face checkpoint into a Megatron-compatible format using Megatron-Bridge on a DGX-H100 cluster (each node has 8Γ—80 GB H100 GPUs).

According to the documentation, the conversion should work by directly running convert_checkpoints.py. However, during the conversion, the Hugging Face model is fully loaded into CPU memory, which causes an out-of-memory (OOM) error on the host.

The GPUs remain idle during the conversion, and GPU memory usage stays near zero.

Steps/Code to reproduce bug

Start the NeMo 25.11 container

enroot import --output /global/nemo-25.11.sqsh docker://nvcr.io#nvidia/nemo:25.11

srun --mpi=pmix \
    --nodes=1 \
    --ntasks-per-node=8 \
    --container-image=/global/nemo-25.11.sqsh \
    --container-mounts=/global:/mounted_ws \
    --container-workdir=/mounted_ws \
    --container-writable \
    --no-container-mount-home \
    --pty bash

Run convert_checkpoints.py

cd /opt/Megatron-Bridge

torchrun --nproc_per_node=8 \
    examples/conversion/convert_checkpoints.py import \
    --hf-model /mounted_ws/DeepSeek-V3 \
    --megatron-path /mounted_ws/DeepSeek-V3-Megatron \
    --trust-remote-code

Expected behavior

The converted Megatron checkpoint should be saved to /mounted_ws/DeepSeek-V3-Megatron.

Instead, during execution, the Hugging Face model weights are loaded entirely into main memory (CPU RAM). Even on my node, which has 2 TB of system memory, this results in an OOM error.
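For scale, here is a back-of-envelope estimate I did (my own assumptions: DeepSeek-V3 has roughly 671B total parameters, the HF checkpoint is FP8 at ~1 byte per parameter, and each of the 8 local ranks loads its own full copy, as the duplicated per-rank log below suggests):

n_params = 671e9          # DeepSeek-V3 total parameter count (assumption: ~671B)
bytes_per_param = 1       # HF checkpoint is FP8 on disk; 2 if upcast to BF16 on load
ranks_per_node = 8

one_copy_tb = n_params * bytes_per_param / 1e12
print(f"one full copy:      ~{one_copy_tb:.2f} TB")                  # ~0.67 TB (FP8)
print(f"8 copies per node:  ~{one_copy_tb * ranks_per_node:.2f} TB") # ~5.37 TB >> 2 TB

Even a single BF16 copy (~1.3 TB) would consume most of the node's 2 TB of RAM, so eight independent copies cannot fit under these assumptions.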

I also attempted to run the conversion across 4 nodes using torchrun, and observed the following log:

[Gloo] Rank 0 is connected to 31 peer ranks. Expected number of connected peer ranks is: 31

However, each node still loads the full model into CPU memory independently, and all nodes OOM at roughly the same time. Is there a recommended way to convert DeepSeek-V3 without loading all of the weights into CPU memory at once?
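As a point of reference, the safetensors library can read one tensor at a time from a shard without materializing the whole checkpoint. Below is only a sketch of the streaming pattern I was hoping for; the glob pattern and per-tensor handling are illustrative and not Megatron-Bridge's actual conversion code:

import glob
from safetensors import safe_open

# Iterate the HF shards and pull tensors out one by one, so peak host RAM is
# roughly one tensor (plus whatever the Megatron-side conversion holds).
for shard in sorted(glob.glob("/mounted_ws/DeepSeek-V3/*.safetensors")):
    with safe_open(shard, framework="pt", device="cpu") as f:
        for name in f.keys():
            tensor = f.get_tensor(name)  # only this tensor is resident in RAM
            # ... map `tensor` into the Megatron layout here, then release it
            del tensor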

Log

OOM

πŸ”„ Starting import: /mounted_ws/DeepSeek-V3 -> /mounted_ws/DeepSeek-V3-Megatron
   Trust remote code: True
πŸ“₯ Loading HuggingFace model: /mounted_ws/DeepSeek-V3
πŸ”„ Starting import: /mounted_ws/DeepSeek-V3 -> /mounted_ws/DeepSeek-V3-Megatron
   Trust remote code: True
πŸ“₯ Loading HuggingFace model: /mounted_ws/DeepSeek-V3
πŸ”„ Starting import: /mounted_ws/DeepSeek-V3 -> /mounted_ws/DeepSeek-V3-Megatron
   Trust remote code: True
πŸ“₯ Loading HuggingFace model: /mounted_ws/DeepSeek-V3
πŸ”„ Starting import: /mounted_ws/DeepSeek-V3 -> /mounted_ws/DeepSeek-V3-Megatron
   Trust remote code: True
πŸ“₯ Loading HuggingFace model: /mounted_ws/DeepSeek-V3
πŸ”„ Starting import: /mounted_ws/DeepSeek-V3 -> /mounted_ws/DeepSeek-V3-Megatron
   Trust remote code: True
πŸ“₯ Loading HuggingFace model: /mounted_ws/DeepSeek-V3
πŸ”„ Starting import: /mounted_ws/DeepSeek-V3 -> /mounted_ws/DeepSeek-V3-Megatron
   Trust remote code: True
πŸ“₯ Loading HuggingFace model: /mounted_ws/DeepSeek-V3
πŸ”„ Starting import: /mounted_ws/DeepSeek-V3 -> /mounted_ws/DeepSeek-V3-Megatron
   Trust remote code: True
πŸ“₯ Loading HuggingFace model: /mounted_ws/DeepSeek-V3
πŸ”„ Starting import: /mounted_ws/DeepSeek-V3 -> /mounted_ws/DeepSeek-V3-Megatron
   Trust remote code: True
πŸ“₯ Loading HuggingFace model: /mounted_ws/DeepSeek-V3
...
Model parallel not initialized, initializing...
...
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
...
Loading from /mounted_ws/DeepSeek-V3 ━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  18% 3:12:03 (5427/30486) DeepSeekV3Bridge
W1220 14:36:18.667000 675547 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 675648 closing signal SIGTERM
W1220 14:36:18.671000 675547 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 675649 closing signal SIGTERM
W1220 14:36:18.672000 675547 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 675651 closing signal SIGTERM
W1220 14:36:18.673000 675547 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 675652 closing signal SIGTERM
W1220 14:36:18.674000 675547 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 675653 closing signal SIGTERM
W1220 14:36:18.675000 675547 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 675654 closing signal SIGTERM
W1220 14:36:18.676000 675547 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 675655 closing signal SIGTERM
E1220 14:36:32.692000 675547 torch/distributed/elastic/multiprocessing/api.py:880] failed (exitcode: -9) local_rank: 2 (pid: 675650) of binary: /opt/venv/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 151, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 288, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
examples/conversion/convert_checkpoints.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-12-20_14:36:18
  host      : dgx-gaia-55.tlv01.nbulabs.nvidia.com
  rank      : 2 (local_rank: 2)
  exitcode  : -9 (pid: 675650)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 675650
=======================================================
