
OOM when converting DeepSeek-V3 checkpoint from Hugging Face to Megatron #1783

@mayeths

Description


Describe the bug

I am trying to convert the DeepSeek-V3 Hugging Face checkpoint into a Megatron-compatible format using Megatron-Bridge on a DGX-H100 cluster (each node has 8Γ—80 GB H100 GPUs).

According to the documentation, the conversion should work by directly running convert_checkpoints.py. However, during the conversion, the Hugging Face model is fully loaded into CPU memory, which causes an out-of-memory (OOM) error on the host.

The GPUs remain idle during the conversion, and GPU memory usage stays near zero.

Steps/Code to reproduce bug

Start the NeMo 25.11 container

enroot import --output /global/nemo-25.11.sqsh docker://nvcr.io#nvidia/nemo:25.11

srun --mpi=pmix \
    --nodes=1 \
    --ntasks-per-node=8 \
    --container-image=/global/nemo-25.11.sqsh \
    --container-mounts=/global:/mounted_ws \
    --container-workdir=/mounted_ws \
    --container-writable \
    --no-container-mount-home \
    --pty bash

Run convert_checkpoints.py

cd /opt/Megatron-Bridge

torchrun --nproc_per_node=8 \
    examples/conversion/convert_checkpoints.py import \
    --hf-model /mounted_ws/DeepSeek-V3 \
    --megatron-path /mounted_ws/DeepSeek-V3-Megatron \
    --trust-remote-code

Expected behavior

The converted Megatron checkpoint should be saved to /mounted_ws/DeepSeek-V3-Megatron.

Instead, during execution, the Hugging Face model weights are loaded entirely into main memory (CPU RAM). Even on my node, which has 2 TB of system memory, this results in an OOM error.
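For scale, here is a back-of-envelope estimate I did (my own assumptions: DeepSeek-V3 has roughly 671B total parameters, the HF checkpoint is FP8 at ~1 byte per parameter, and each of the 8 local ranks loads its own full copy, as the duplicated per-rank log below suggests):

n_params = 671e9          # DeepSeek-V3 total parameter count (assumption: ~671B)
bytes_per_param = 1       # HF checkpoint is FP8 on disk; 2 if upcast to BF16 on load
ranks_per_node = 8

one_copy_tb = n_params * bytes_per_param / 1e12
print(f"one full copy:      ~{one_copy_tb:.2f} TB")                  # ~0.67 TB (FP8)
print(f"8 copies per node:  ~{one_copy_tb * ranks_per_node:.2f} TB") # ~5.37 TB >> 2 TB

Even a single BF16 copy (~1.3 TB) would consume most of the node's 2 TB of RAM, so eight independent copies cannot fit under these assumptions.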

I also attempted to run the conversion across 4 nodes using torchrun, and observed the following log:

[Gloo] Rank 0 is connected to 31 peer ranks. Expected number of connected peer ranks is: 31

However, each node still loads the full model into CPU memory independently, and all nodes OOM at roughly the same time. Is there a recommended way to convert DeepSeek-V3 without loading all of the weights into CPU memory at once?
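As a point of reference, the safetensors library can read one tensor at a time from a shard without materializing the whole checkpoint. Below is only a sketch of the streaming pattern I was hoping for; the glob pattern and per-tensor handling are illustrative and not Megatron-Bridge's actual conversion code:

import glob
from safetensors import safe_open

# Iterate the HF shards and pull tensors out one by one, so peak host RAM is
# roughly one tensor (plus whatever the Megatron-side conversion holds).
for shard in sorted(glob.glob("/mounted_ws/DeepSeek-V3/*.safetensors")):
    with safe_open(shard, framework="pt", device="cpu") as f:
        for name in f.keys():
            tensor = f.get_tensor(name)  # only this tensor is resident in RAM
            # ... map `tensor` into the Megatron layout here, then release it
            del tensor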

Log

OOM

πŸ”„ Starting import: /mounted_ws/DeepSeek-V3 -> /mounted_ws/DeepSeek-V3-Megatron
   Trust remote code: True
πŸ“₯ Loading HuggingFace model: /mounted_ws/DeepSeek-V3
πŸ”„ Starting import: /mounted_ws/DeepSeek-V3 -> /mounted_ws/DeepSeek-V3-Megatron
   Trust remote code: True
πŸ“₯ Loading HuggingFace model: /mounted_ws/DeepSeek-V3
πŸ”„ Starting import: /mounted_ws/DeepSeek-V3 -> /mounted_ws/DeepSeek-V3-Megatron
   Trust remote code: True
πŸ“₯ Loading HuggingFace model: /mounted_ws/DeepSeek-V3
πŸ”„ Starting import: /mounted_ws/DeepSeek-V3 -> /mounted_ws/DeepSeek-V3-Megatron
   Trust remote code: True
πŸ“₯ Loading HuggingFace model: /mounted_ws/DeepSeek-V3
πŸ”„ Starting import: /mounted_ws/DeepSeek-V3 -> /mounted_ws/DeepSeek-V3-Megatron
   Trust remote code: True
πŸ“₯ Loading HuggingFace model: /mounted_ws/DeepSeek-V3
πŸ”„ Starting import: /mounted_ws/DeepSeek-V3 -> /mounted_ws/DeepSeek-V3-Megatron
   Trust remote code: True
πŸ“₯ Loading HuggingFace model: /mounted_ws/DeepSeek-V3
πŸ”„ Starting import: /mounted_ws/DeepSeek-V3 -> /mounted_ws/DeepSeek-V3-Megatron
   Trust remote code: True
πŸ“₯ Loading HuggingFace model: /mounted_ws/DeepSeek-V3
πŸ”„ Starting import: /mounted_ws/DeepSeek-V3 -> /mounted_ws/DeepSeek-V3-Megatron
   Trust remote code: True
πŸ“₯ Loading HuggingFace model: /mounted_ws/DeepSeek-V3
...
Model parallel not initialized, initializing...
...
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
...
Loading from /mounted_ws/DeepSeek-V3 ━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  18% 3:12:03 (5427/30486) DeepSeekV3Bridge
W1220 14:36:18.667000 675547 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 675648 closing signal SIGTERM
W1220 14:36:18.671000 675547 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 675649 closing signal SIGTERM
W1220 14:36:18.672000 675547 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 675651 closing signal SIGTERM
W1220 14:36:18.673000 675547 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 675652 closing signal SIGTERM
W1220 14:36:18.674000 675547 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 675653 closing signal SIGTERM
W1220 14:36:18.675000 675547 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 675654 closing signal SIGTERM
W1220 14:36:18.676000 675547 torch/distributed/elastic/multiprocessing/api.py:906] Sending process 675655 closing signal SIGTERM
E1220 14:36:32.692000 675547 torch/distributed/elastic/multiprocessing/api.py:880] failed (exitcode: -9) local_rank: 2 (pid: 675650) of binary: /opt/venv/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 151, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 288, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
examples/conversion/convert_checkpoints.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-12-20_14:36:18
  host      : dgx-gaia-55.tlv01.nbulabs.nvidia.com
  rank      : 2 (local_rank: 2)
  exitcode  : -9 (pid: 675650)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 675650
=======================================================
