examples/llm_ptq/example_utils.py (1 addition, 3 deletions)
@@ -206,9 +206,7 @@ def get_model(
 if auto_model_module != AutoModelForCausalLM:
     model_kwargs2.pop("trust_remote_code", None)
 model_kwargs2["torch_dtype"] = torch_dtype
-# DeciLMForCausalLM does not support max_memory argument
-if "architectures" in hf_config and "DeciLMForCausalLM" in hf_config.architectures:
-    model_kwargs2.pop("max_memory", None)
+model_kwargs2.pop("max_memory", None)
@cjluo-nv (Collaborator) commented on Sep 25, 2025:
Do you actually want to remove lines 166-171 as well?

PR author (Contributor) replied:
Yeah, sure, I thought of that as well. But I don't know why we added use_seq_device_map in the first place. Should I remove the entire --use_seq_device_map option?

A contributor replied:
I recall that --use_seq_device_map was added by @sugunav14 for Nemotron models that do not distribute evenly across GPUs when initializing with the auto device_map.
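For context, a minimal sketch of what a --use_seq_device_map-style load typically looks like with the standard transformers/accelerate kwargs; the checkpoint path and memory caps below are illustrative, not the repo's exact code:

```python
# Hedged sketch, not the actual example_utils.py implementation.
from transformers import AutoModelForCausalLM

ckpt_path = "/path/to/checkpoint"              # illustrative checkpoint path
max_memory = {0: "70GiB", 1: "70GiB"}          # illustrative per-GPU caps

model = AutoModelForCausalLM.from_pretrained(
    ckpt_path,
    device_map="sequential",  # fill GPU 0 first, then GPU 1, and so on
    max_memory=max_memory,    # per-device caps, e.g. from get_max_memory()
    torch_dtype="auto",
)
```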

PR author (Contributor) replied:
@cjluo-nv What should we do here? The current fix works. If you are okay with the current fix, can you please approve it?

Collaborator replied:
max_memory is set at line 170. If we don't use it here, we probably don't need line 170. @sugunav14 could you chime in?

Reply:

Yeah, it seems like with this change max_memory is never used.

  1. If we get rid of --use_seq_device_map, I think Llama-Nemotron Ultra will throw an OOM during export.
  2. If we only get rid of max_memory, I believe we will still hit OOM when using --use_seq_device_map, because the model will load consuming the full GPU memory (see the sketch below for how the cap leaves headroom).
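To make point 2 concrete, here is a rough illustration of a per-GPU memory cap; this is not the repo's actual get_max_memory helper, and the reserved fraction is an assumption:

```python
# Illustrative only: reserve a fraction of each GPU so export/quantization
# has headroom. The real get_max_memory() in example_utils.py may differ.
import torch

def get_max_memory_sketch(fraction: float = 0.8) -> dict[int, int]:
    return {
        i: int(torch.cuda.get_device_properties(i).total_memory * fraction)
        for i in range(torch.cuda.device_count())
    }

# Passed as max_memory=... together with device_map="sequential"; without it,
# sequential placement fills each GPU almost completely before moving on.
```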

PR author (Contributor) replied:
@cjluo-nv @sugunav14 Removing max_memory as in this PR will not cause an OOM. max_memory is only popped when loading with AutoModelForCausalLM.from_config under the init_empty_weights context. It looks like the fix in this PR is sufficient. Can you please approve it?

PR author (Contributor) replied:
@sugunav14 The model initialized here uses AutoModel.from_config(). AutoModel.from_config no longer accepts extra kwargs such as max_memory (I tested with Llama3 and Qwen3).
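For reference, a simplified sketch of the code path being discussed, wrapped in a hypothetical helper; variable names are taken from the diff, and the surrounding logic in get_model() is not reproduced verbatim:

```python
# Simplified sketch (not verbatim) of the meta-device init path in get_model().
from accelerate import init_empty_weights
from transformers import AutoModelForCausalLM

def build_empty_model(hf_config, model_kwargs2: dict):
    # from_config builds the model on the meta device; it does not accept
    # device-placement kwargs such as max_memory, hence the unconditional pop.
    model_kwargs2.pop("max_memory", None)
    with init_empty_weights():
        return AutoModelForCausalLM.from_config(hf_config, **model_kwargs2)
```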

Collaborator replied:
How about we inspect the model's signature and keep only the kwargs it supports, to avoid such issues in the future?
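Something along these lines, as a rough sketch; the helper name is made up, and because from_config itself accepts **kwargs, filtering against its signature alone would not be sufficient:

```python
# Hypothetical helper: drop kwargs the target callable does not declare.
import inspect

def filter_supported_kwargs(fn, kwargs: dict) -> dict:
    params = inspect.signature(fn).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)  # fn takes **kwargs; cannot filter reliably
    return {k: v for k, v in kwargs.items() if k in params}
```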

model = from_config(hf_config, **model_kwargs2)

max_memory = get_max_memory()