Description
System Info
Ubuntu 24.04.3
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA B200 On | 00000000:03:00.0 Off | 0 |
| N/A 32C P0 144W / 1000W | 0MiB / 183359MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA B200 On | 00000000:04:00.0 Off | 0 |
| N/A 31C P0 141W / 1000W | 0MiB / 183359MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA B200 On | 00000000:05:00.0 Off | 0 |
| N/A 31C P0 144W / 1000W | 0MiB / 183359MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA B200 On | 00000000:06:00.0 Off | 0 |
| N/A 32C P0 143W / 1000W | 0MiB / 183359MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0
python --version
Python 3.12.3
# uv pip show tensorrt_llm tensorrt torch
Using Python 3.12.3 environment at: venv
Name: tensorrt
Version: 10.11.0.33
Location: /root/venv/lib/python3.12/site-packages
Requires: tensorrt-cu12
Required-by: tensorrt-llm
---
Name: tensorrt-llm
Version: 1.2.0rc0
Location: /root/venv/lib/python3.12/site-packages
Requires: accelerate, aenum, backoff, blake3, blobfile, build, click, click-option-group, colored, cuda-python, datasets, diffusers, einops, etcd3, evaluate, fastapi, flashinfer-python, h5py, jsonschema, lark, llguidance, matplotlib, meson, mpi4py, mpmath, ninja, numpy, nvidia-cuda-nvrtc-cu12, nvidia-cutlass-dsl, nvidia-ml-py, nvidia-modelopt, nvidia-nccl-cu12, nvtx, omegaconf, onnx, onnx-graphsurgeon, openai, openai-harmony, opencv-python-headless, optimum, ordered-set, pandas, patchelf, peft, pillow, polygraphy, prometheus-client, prometheus-fastapi-instrumentator, protobuf, psutil, pulp, pydantic, pydantic-settings, pyzmq, sentencepiece, setuptools, soundfile, strenum, tensorrt, tiktoken, torch, torchvision, transformers, triton, uvicorn, wheel, xgrammar
Required-by:
---
Name: torch
Version: 2.7.1+cu128
Location: /root/venv/lib/python3.12/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-cufile-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-cusparselt-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, setuptools, sympy, triton, typing-extensions
Required-by: accelerate, flashinfer-python, nvidia-modelopt, optimum, peft, tensorrt-llm, torchaudio, torchprofile, torchvision, xgrammar
Reproduction
During the startup I get this warning: [10/09/2025-15:10:56] [TRT-LLM] [RANK 0] [W] It is recommended to incl. 'garbage_collection_threshold:0.???' or 'backend:cudaMallocAsync' or 'expandable_segments:True' in PYTORCH_CUDA_ALLOC_CONF.
It works fine without it, but I figured I'd try adding PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True like this:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True trtllm-serve /root/data/public/nvidia___DeepSeek-R1-0528-FP4 --port 4000 --host 0.0.0.0 --backend pytorch --max_seq_len 16384 --max_batch_size 4096 --max_num_tokens 32768 --tp_size 4 --trust_remote_code --extra_llm_api_options /root/data/trtllm-config.yml
But it causes a crash during startup. I'm unsure if this is expected behavior.
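(For reference, the allocator option takes effect only if it is set before PyTorch's CUDA caching allocator is first used; a minimal Python sketch of setting it programmatically instead of on the command line:)

```python
import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator is
# first initialized, so it must be set before torch touches the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```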
Here's /root/data/trtllm-config.yml:
stream_interval: 2
kv_cache_config:
  enable_block_reuse: true
  dtype: fp8
  free_gpu_memory_fraction: 0.85
  host_cache_size: 153812677427
For some reason, adding enable_attention_dp: true makes the issue go away, so the crash seems to be triggered by the combination of enable_attention_dp: false and PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
I can also avoid the crash by adding cuda_graph_config: null (i.e. disabling CUDA graphs) to the config.
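For clarity, the full config with that workaround applied would look like this (assuming the same nesting as above; cuda_graph_config: null disables CUDA graphs):

```yaml
stream_interval: 2
cuda_graph_config: null  # workaround: disable CUDA graphs
kv_cache_config:
  enable_block_reuse: true
  dtype: fp8
  free_gpu_memory_fraction: 0.85
  host_cache_size: 153812677427
```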
Expected behavior
No crash when adding PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, or perhaps the startup warning telling users to add that env variable should be removed/adjusted.
Actual behavior
Command:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True trtllm-serve /root/data/public/nvidia___DeepSeek-R1-0528-FP4 --port 4000 --host 0.0.0.0 --backend pytorch --max_seq_len 16384 --max_batch_size 4096 --max_num_tokens 32768 --tp_size 4 --trust_remote_code --extra_llm_api_options /root/data/trtllm-config.yml
Note: I tried adding TLLM_LOG_LEVEL=DEBUG, but it spat out so much stuff that it froze my terminal, and kept going for more than 15 minutes, so I killed the process.
Startup crash logs:
(venv) root@4xb200:~# CUDA_LAUNCH_BLOCKING=1 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True trtllm-serve /root/data/public/nvidia___DeepSeek-R1-0528-FP4 --port 4000 --host 0.0.0.0 --backend pytorch --max_seq_len 16384 --max_batch_size 4096 --max_num_tokens 32768 --tp_size 4 --trust_remote_code --extra_llm_api_options /root/data/trtllm-config.yml
/root/venv/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/root/venv/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.56.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
[TensorRT-LLM] TensorRT LLM version: 1.2.0rc0
/root/venv/lib/python3.12/site-packages/tensorrt_llm/serve/openai_protocol.py:89: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
class ResponseFormat(OpenAIBaseModel):
[10/09/2025-15:33:10] [TRT-LLM] [I] Using LLM with PyTorch backend
[10/09/2025-15:33:10] [TRT-LLM] [I] Set nccl_plugin to None.
[10/09/2025-15:33:10] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
[10/09/2025-15:33:10] [TRT-LLM] [I] start MpiSession with 4 workers
[10/09/2025-15:33:10] [TRT-LLM] [I] Found /root/data/public/nvidia___DeepSeek-R1-0528-FP4/hf_quant_config.json, pre-quantized checkpoint is used.
[10/09/2025-15:33:10] [TRT-LLM] [I] Setting quant_algo=NVFP4 form HF quant config.
[10/09/2025-15:33:10] [TRT-LLM] [I] Setting group_size=16 from HF quant config.
[10/09/2025-15:33:10] [TRT-LLM] [I] Setting has_zero_point=False from HF quant config.
[10/09/2025-15:33:10] [TRT-LLM] [I] Setting pre_quant_scale=True from HF quant config.
[10/09/2025-15:33:10] [TRT-LLM] [I] Setting exclude_modules=['lm_head', 'model.layers.0.self_attn*', 'model.layers.1.self_attn*', 'model.layers.10.self_attn*', 'model.layers.11.self_attn*', 'model.layers.12.self_attn*', 'model.layers.13.self_attn*', 'model.layers.14.self_attn*', 'model.layers.15.self_attn*', 'model.layers.16.self_attn*', 'model.layers.17.self_attn*', 'model.layers.18.self_attn*', 'model.layers.19.self_attn*', 'model.layers.2.self_attn*', 'model.layers.20.self_attn*', 'model.layers.21.self_attn*', 'model.layers.22.self_attn*', 'model.layers.23.self_attn*', 'model.layers.24.self_attn*', 'model.layers.25.self_attn*', 'model.layers.26.self_attn*', 'model.layers.27.self_attn*', 'model.layers.28.self_attn*', 'model.layers.29.self_attn*', 'model.layers.3.self_attn*', 'model.layers.30.self_attn*', 'model.layers.31.self_attn*', 'model.layers.32.self_attn*', 'model.layers.33.self_attn*', 'model.layers.34.self_attn*', 'model.layers.35.self_attn*', 'model.layers.36.self_attn*', 'model.layers.37.self_attn*', 'model.layers.38.self_attn*', 'model.layers.39.self_attn*', 'model.layers.4.self_attn*', 'model.layers.40.self_attn*', 'model.layers.41.self_attn*', 'model.layers.42.self_attn*', 'model.layers.43.self_attn*', 'model.layers.44.self_attn*', 'model.layers.45.self_attn*', 'model.layers.46.self_attn*', 'model.layers.47.self_attn*', 'model.layers.48.self_attn*', 'model.layers.49.self_attn*', 'model.layers.5.self_attn*', 'model.layers.50.self_attn*', 'model.layers.51.self_attn*', 'model.layers.52.self_attn*', 'model.layers.53.self_attn*', 'model.layers.54.self_attn*', 'model.layers.55.self_attn*', 'model.layers.56.self_attn*', 'model.layers.57.self_attn*', 'model.layers.58.self_attn*', 'model.layers.59.self_attn*', 'model.layers.6.self_attn*', 'model.layers.60.self_attn*', 'model.layers.7.self_attn*', 'model.layers.8.self_attn*', 'model.layers.9.self_attn*', 'model.layers.61*'] from HF quant config.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type deepseek_v3 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
/root/data/public/nvidia___DeepSeek-R1-0528-FP4
rank 0 using MpiPoolSession to spawn MPI processes
[10/09/2025-15:33:10] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[10/09/2025-15:33:10] [TRT-LLM] [I] Generating a new HMAC key for server worker_init_status_queue
[10/09/2025-15:33:10] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[10/09/2025-15:33:10] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[10/09/2025-15:33:10] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package optimum. Picked distribution: optimum
/root/venv/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/root/venv/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.56.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
/root/venv/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/root/venv/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.56.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
/root/venv/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/root/venv/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.56.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
/root/venv/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/root/venv/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.56.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
[TensorRT-LLM] TensorRT LLM version: 1.2.0rc0
[TensorRT-LLM] TensorRT LLM version: 1.2.0rc0
[TensorRT-LLM] TensorRT LLM version: 1.2.0rc0
[TensorRT-LLM] TensorRT LLM version: 1.2.0rc0
/root/venv/lib/python3.12/site-packages/tensorrt_llm/serve/openai_protocol.py:89: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
class ResponseFormat(OpenAIBaseModel):
/root/venv/lib/python3.12/site-packages/tensorrt_llm/serve/openai_protocol.py:89: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
class ResponseFormat(OpenAIBaseModel):
/root/venv/lib/python3.12/site-packages/tensorrt_llm/serve/openai_protocol.py:89: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
class ResponseFormat(OpenAIBaseModel):
/root/venv/lib/python3.12/site-packages/tensorrt_llm/serve/openai_protocol.py:89: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
class ResponseFormat(OpenAIBaseModel):
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[10/09/2025-15:33:18] [TRT-LLM] [RANK 0] [I] ATTENTION RUNTIME FEATURES: AttentionRuntimeFeatures(chunked_prefill=False, cache_reuse=True, has_speculative_draft_tokens=False, chunk_size=32768, chunked_prefill_buffer_batch_size=4)
/root/data/public/nvidia___DeepSeek-R1-0528-FP4
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
/root/data/public/nvidia___DeepSeek-R1-0528-FP4
`torch_dtype` is deprecated! Use `dtype` instead!
/root/data/public/nvidia___DeepSeek-R1-0528-FP4
/root/data/public/nvidia___DeepSeek-R1-0528-FP4
[10/09/2025-15:33:18] [TRT-LLM] [RANK 0] [I] Validating KV Cache config against kv_cache_dtype="fp8"
`torch_dtype` is deprecated! Use `dtype` instead!
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Use 95.57 GB for model weights.
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching 394.53GB checkpoint files.
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00157-of-000163.safetensors to memory...
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00005-of-000163.safetensors to memory...
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00083-of-000163.safetensors to memory...
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00006-of-000163.safetensors to memory...
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00068-of-000163.safetensors to memory...
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00069-of-000163.safetensors to memory...
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00139-of-000163.safetensors to memory...
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00107-of-000163.safetensors to memory...
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00127-of-000163.safetensors to memory...
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00016-of-000163.safetensors to memory...
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00150-of-000163.safetensors to memory...
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00072-of-000163.safetensors to memory...
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00030-of-000163.safetensors to memory...
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00163-of-000163.safetensors to memory...
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00113-of-000163.safetensors to memory...
[10/09/2025-15:33:21] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00065-of-000163.safetensors to memory...
[10/09/2025-15:33:33] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00016-of-000163.safetensors.
[10/09/2025-15:33:33] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00006-of-000163.safetensors.
[10/09/2025-15:33:33] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00072-of-000163.safetensors.
[10/09/2025-15:33:33] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00083-of-000163.safetensors.
[10/09/2025-15:33:33] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00107-of-000163.safetensors.
[10/09/2025-15:33:33] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00005-of-000163.safetensors.
[10/09/2025-15:33:34] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00127-of-000163.safetensors.
[10/09/2025-15:33:34] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00033-of-000163.safetensors to memory...
[10/09/2025-15:33:34] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00113-of-000163.safetensors.
[10/09/2025-15:33:34] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00157-of-000163.safetensors.
[10/09/2025-15:33:34] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00065-of-000163.safetensors.
[10/09/2025-15:33:34] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00069-of-000163.safetensors.
[10/09/2025-15:33:34] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00150-of-000163.safetensors.
[10/09/2025-15:33:34] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00139-of-000163.safetensors.
[10/09/2025-15:33:35] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00030-of-000163.safetensors.
[10/09/2025-15:33:35] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00068-of-000163.safetensors.
[10/09/2025-15:33:35] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00004-of-000163.safetensors to memory...
[10/09/2025-15:33:35] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00104-of-000163.safetensors to memory...
[10/09/2025-15:33:35] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00087-of-000163.safetensors to memory...
[10/09/2025-15:33:35] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00074-of-000163.safetensors to memory...
[10/09/2025-15:33:35] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00070-of-000163.safetensors to memory...
[10/09/2025-15:33:35] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00046-of-000163.safetensors to memory...
[10/09/2025-15:33:35] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00007-of-000163.safetensors to memory...
[10/09/2025-15:33:35] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00135-of-000163.safetensors to memory...
[10/09/2025-15:33:35] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00081-of-000163.safetensors to memory...
[10/09/2025-15:33:35] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00015-of-000163.safetensors to memory...
[10/09/2025-15:33:35] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00077-of-000163.safetensors to memory...
[10/09/2025-15:33:35] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00095-of-000163.safetensors to memory...
[10/09/2025-15:33:35] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00143-of-000163.safetensors to memory...
[10/09/2025-15:33:35] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00126-of-000163.safetensors to memory...
[10/09/2025-15:33:45] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00104-of-000163.safetensors.
[10/09/2025-15:33:45] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00040-of-000163.safetensors to memory...
[10/09/2025-15:33:46] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00077-of-000163.safetensors.
[10/09/2025-15:33:46] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00100-of-000163.safetensors to memory...
[10/09/2025-15:33:48] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00126-of-000163.safetensors.
[10/09/2025-15:33:48] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00135-of-000163.safetensors.
[10/09/2025-15:33:48] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00007-of-000163.safetensors.
[10/09/2025-15:33:48] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00143-of-000163.safetensors.
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00015-of-000163.safetensors.
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00100-of-000163.safetensors.
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00070-of-000163.safetensors.
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00087-of-000163.safetensors.
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00033-of-000163.safetensors.
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00004-of-000163.safetensors.
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00074-of-000163.safetensors.
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00081-of-000163.safetensors.
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00115-of-000163.safetensors to memory...
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00046-of-000163.safetensors.
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00029-of-000163.safetensors to memory...
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00095-of-000163.safetensors.
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00141-of-000163.safetensors to memory...
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00044-of-000163.safetensors to memory...
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00140-of-000163.safetensors to memory...
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00151-of-000163.safetensors to memory...
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00045-of-000163.safetensors to memory...
[10/09/2025-15:33:49] [TRT-LLM] [RANK 0] [I] Prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00076-of-000163.safetensors to memory...
[10/09/2025-15:33:50] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00163-of-000163.safetensors.
[10/09/2025-15:33:50] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00040-of-000163.safetensors.
[10/09/2025-15:33:51] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00141-of-000163.safetensors.
[10/09/2025-15:33:52] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00151-of-000163.safetensors.
[10/09/2025-15:33:52] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00044-of-000163.safetensors.
[10/09/2025-15:33:52] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00045-of-000163.safetensors.
[10/09/2025-15:33:52] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00140-of-000163.safetensors.
[10/09/2025-15:33:52] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00076-of-000163.safetensors.
[10/09/2025-15:33:52] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00115-of-000163.safetensors.
[10/09/2025-15:33:52] [TRT-LLM] [RANK 0] [I] Finished prefetching /root/data/public/nvidia___DeepSeek-R1-0528-FP4/model-00029-of-000163.safetensors.
Loading safetensors weights in parallel: 100%|██████████| 163/163 [00:06<00:00, 26.22it/s]
Loading safetensors weights in parallel: 100%|██████████| 163/163 [00:06<00:00, 25.92it/s]
Loading safetensors weights in parallel: 100%|██████████| 163/163 [00:06<00:00, 25.72it/s]
Loading safetensors weights in parallel: 100%|██████████| 163/163 [00:06<00:00, 25.50it/s]
Loading weights: 100%|██████████| 1644/1644 [00:32<00:00, 50.75it/s]
Post loading weights: 100%|██████████| 1640/1640 [00:00<00:00, 1650745.99it/s]
Model init total -- 73.95s
Loading weights: 100%|██████████| 1644/1644 [00:32<00:00, 50.30it/s]
Post loading weights: 100%|██████████| 1640/1640 [00:00<00:00, 1490823.27it/s]
Model init total -- 74.07s
Loading weights: 100%|██████████| 1644/1644 [00:34<00:00, 47.52it/s]
Post loading weights: 100%|██████████| 1640/1640 [00:00<00:00, 1226798.39it/s]
Model init total -- 76.01s
Loading weights: 100%|██████████| 1644/1644 [00:34<00:00, 47.24it/s]
Post loading weights: 100%|██████████| 1640/1640 [00:00<00:00, 1552744.60it/s]
Model init total -- 76.23s
[10/09/2025-15:34:36] [TRT-LLM] [RANK 0] [I] Using Sampler: TorchSampler
[10/09/2025-15:34:36] [TRT-LLM] [RANK 0] [W] Both free_gpu_memory_fraction and max_tokens are set (to 0.8500000238418579 and 32864 with free memory 19.848037719726562 of total memory 44.587799072265625, respectively). The smaller value will be used.
[10/09/2025-15:34:36] [TRT-LLM] [RANK 0] [I] Adjusted attention window size to 16385 in blocks_per_window
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 513 [window size=16385], tokens per block=32, primary blocks=1027, secondary blocks=136801
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 513 [window size=16385], tokens per block=32, primary blocks=1027, secondary blocks=136801
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 513 [window size=16385], tokens per block=32, primary blocks=1027, secondary blocks=136801
[TensorRT-LLM][INFO] Max KV cache blocks per sequence: 513 [window size=16385], tokens per block=32, primary blocks=1027, secondary blocks=136801
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.08 GiB for max tokens in paged KV cache (32864).
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.08 GiB for max tokens in paged KV cache (32864).
[10/09/2025-15:35:11] [TRT-LLM] [RANK 0] [I] max_seq_len=16385, max_num_requests=4096, max_num_tokens=32768, max_batch_size=4096
[10/09/2025-15:35:11] [TRT-LLM] [RANK 0] [I] cache_transceiver is disabled
[10/09/2025-15:35:11] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process starts ...
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.08 GiB for max tokens in paged KV cache (32864).
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.08 GiB for max tokens in paged KV cache (32864).
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 1779221504 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 1779221504 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 1779221504 bytes
[TensorRT-LLM][WARNING] Attention workspace size is not enough, increase the size from 0 bytes to 1779221504 bytes
[TensorRT-LLM][INFO] Detecting local TP group for rank 2
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] Detecting local TP group for rank 3
[TensorRT-LLM][INFO] TP group is intra-node for rank 3
[TensorRT-LLM][INFO] TP group is intra-node for rank 2
[10/09/2025-15:35:51] [TRT-LLM] [RANK 0] [I] [Autotuner] Cache size after warmup is 92
[10/09/2025-15:35:51] [TRT-LLM] [RANK 0] [I] [Autotuner] Autotuning process ends
[10/09/2025-15:35:51] [TRT-LLM] [RANK 0] [I] Creating CUDA graph instances for 34 batch sizes.
[10/09/2025-15:35:51] [TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=128, draft_len=0
[10/09/2025-15:35:52] [TRT-LLM] [RANK 1] [E] Failed to initialize executor on rank 1: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[10/09/2025-15:35:52] [TRT-LLM] [RANK 1] [E] Traceback (most recent call last):
File "/root/venv/lib/python3.12/site-packages/tensorrt_llm/executor/worker.py", line 395, in worker_main
worker: GenerationExecutorWorker = worker_cls(
^^^^^^^^^^^
File "/root/venv/lib/python3.12/site-packages/tensorrt_llm/executor/worker.py", line 73, in __init__
self.setup_engine()
File "/root/venv/lib/python3.12/site-packages/tensorrt_llm/executor/base_worker.py", line 187, in setup_engine
self.engine = _create_py_executor(
^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.12/site-packages/tensorrt_llm/executor/base_worker.py", line 157, in _create_py_executor
_executor = create_executor(**args)
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 602, in create_py_executor
py_executor = create_py_executor_instance(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/_util.py", line 681, in create_py_executor_instance
return PyExecutor(
^^^^^^^^^^^
File "/root/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 245, in __init__
self.model_engine.warmup(self.resource_manager)
File "/root/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 434, in wrapper
return method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 741, in warmup
self.forward(batch,
File "/root/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/utils.py", line 74, in wrapper
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 2214, in forward
outputs = self.cuda_graph_runner.replay(key, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py", line 299, in replay
self.graphs[key].replay()
File "/root/venv/lib/python3.12/site-packages/torch/cuda/graphs.py", line 88, in replay
super().replay()
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[10/09/2025-15:35:52] [TRT-LLM] [RANK 2] [E] Failed to initialize executor on rank 2: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[10/09/2025-15:35:52] [TRT-LLM] [RANK 2] [E] [... traceback omitted: identical to the RANK 1 traceback above, ending in RuntimeError: CUDA error: an illegal memory access was encountered ...]
[10/09/2025-15:35:52] [TRT-LLM] [RANK 3] [E] Failed to initialize executor on rank 3: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[10/09/2025-15:35:52] [TRT-LLM] [RANK 3] [E] [... traceback omitted: identical to the RANK 1 traceback above, ending in RuntimeError: CUDA error: an illegal memory access was encountered ...]
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7e10669785e8 in /root/venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7e106690d4a2 in /root/venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7e10700a5422 in /root/venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xb45e21 (0x7e0dee545e21 in /root/venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xb41ffb (0x7e0dee541ffb in /root/venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xb49714 (0x7e0dee549714 in /root/venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x44d2a2 (0x7e0e4ba4d2a2 in /root/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7e1066952f39 in /root/venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x704f48 (0x7e0e4bd04f48 in /root/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x705370 (0x7e0e4bd05370 in /root/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #10: /root/venv/bin/python3() [0x575b1e]
frame #11: /root/venv/bin/python3() [0x57586c]
frame #12: /root/venv/bin/python3() [0x59ec45]
frame #13: /root/venv/bin/python3() [0x575b1e]
frame #14: /root/venv/bin/python3() [0x57586c]
frame #15: /root/venv/bin/python3() [0x59ec45]
frame #16: /root/venv/bin/python3() [0x579682]
frame #17: /root/venv/bin/python3() [0x59ea69]
frame #18: /root/venv/bin/python3() [0x558ef1]
frame #19: /root/venv/bin/python3() [0x60fe85]
frame #20: /root/venv/bin/python3() [0x60fe95]
frame #21: /root/venv/bin/python3() [0x60fe95]
frame #22: /root/venv/bin/python3() [0x60fe95]
frame #23: /root/venv/bin/python3() [0x60fe95]
frame #24: /root/venv/bin/python3() [0x5536db]
frame #25: _PyEval_EvalFrameDefault + 0x9227 (0x5df007 in /root/venv/bin/python3)
frame #26: PyEval_EvalCode + 0x15b (0x5d4dab in /root/venv/bin/python3)
frame #27: /root/venv/bin/python3() [0x5d2bac]
frame #28: /root/venv/bin/python3() [0x5818ed]
frame #29: PyObject_Vectorcall + 0x35 (0x549cf5 in /root/venv/bin/python3)
frame #30: _PyEval_EvalFrameDefault + 0xadf (0x5d68bf in /root/venv/bin/python3)
frame #31: /root/venv/bin/python3() [0x6bc192]
frame #32: Py_RunMain + 0x232 (0x6bbdc2 in /root/venv/bin/python3)
frame #33: Py_BytesMain + 0x2d (0x6bba2d in /root/venv/bin/python3)
frame #34: <unknown function> + 0x2a1ca (0x7e107be2a1ca in /lib/x86_64-linux-gnu/libc.so.6)
frame #35: __libc_start_main + 0x8b (0x7e107be2a28b in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x25 (0x656a35 in /root/venv/bin/python3)
[4xb200:112648] *** Process received signal ***
[4xb200:112648] Signal: Aborted (6)
[4xb200:112648] Signal code: (-6)
[4xb200:112648] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7e107be45330]
[4xb200:112648] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x7e107be9eb2c]
[4xb200:112648] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x7e107be4527e]
[4xb200:112648] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x7e107be288ff]
[4xb200:112648] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5)[0x7e10796a5ff5]
[4xb200:112648] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da)[0x7e10796bb0da]
[4xb200:112648] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_call_terminate+0x33)[0x7e10796a58e6]
[4xb200:112648] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x31a)[0x7e10796ba8ba]
[4xb200:112648] [ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x22b06)[0x7e107a98bb06]
[4xb200:112648] [ 9] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7e107a98c5cd]
[4xb200:112648] [10] /root/venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so(+0xb49bc8)[0x7e0dee549bc8]
[4xb200:112648] [11] /root/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x44d2a2)[0x7e0e4ba4d2a2]
[4xb200:112648] [12] /root/venv/lib/python3.12/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD0Ev+0x9)[0x7e1066952f39]
[4xb200:112648] [13] /root/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x704f48)[0x7e0e4bd04f48]
[4xb200:112648] [14] /root/venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so(+0x705370)[0x7e0e4bd05370]
[4xb200:112648] [15] /root/venv/bin/python3[0x575b1e]
[4xb200:112648] [16] /root/venv/bin/python3[0x57586c]
[4xb200:112648] [17] /root/venv/bin/python3[0x59ec45]
[4xb200:112648] [18] /root/venv/bin/python3[0x575b1e]
[4xb200:112648] [19] /root/venv/bin/python3[0x57586c]
[4xb200:112648] [20] /root/venv/bin/python3[0x59ec45]
[4xb200:112648] [21] /root/venv/bin/python3[0x579682]
[4xb200:112648] [22] /root/venv/bin/python3[0x59ea69]
[4xb200:112648] [23] /root/venv/bin/python3[0x558ef1]
[4xb200:112648] [24] /root/venv/bin/python3[0x60fe85]
[4xb200:112648] [25] /root/venv/bin/python3[0x60fe95]
[4xb200:112648] [26] /root/venv/bin/python3[0x60fe95]
[4xb200:112648] [27] /root/venv/bin/python3[0x60fe95]
[4xb200:112648] [28] /root/venv/bin/python3[0x60fe95]
[4xb200:112648] [29] /root/venv/bin/python3[0x5536db]
[4xb200:112648] *** End of error message ***
[... identical 'c10::Error' abort (what(): CUDA error: an illegal memory access was encountered), C++ backtrace, and SIGABRT signal dump repeated for processes 112647 and 112646; omitted, same as process 112648 above ...]
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------