
Conversation

xin3he (Contributor) commented Sep 28, 2025

Batch size was not considered before; quantizing Llama-3.3 70B uses only one card and hits OOM.
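For rough context, a back-of-the-envelope estimate (the parameter count, dtype, batch size, and sequence length below are my own assumptions, not measurements from this PR) shows why a single card cannot hold the model and why the activation footprint grows with batch size:

```python
# Back-of-the-envelope memory math (illustrative only).
num_params = 70e9        # approximate parameter count of Llama-3.3-70B
bytes_per_elem = 2       # bf16
weight_gib = num_params * bytes_per_elem / 1024**3
print(f"weights alone: ~{weight_gib:.0f} GiB")  # ~130 GiB, far beyond one typical card

# Activation memory for a single hidden-state tensor scales with batch size:
batch_size, seq_len, hidden = 8, 2048, 8192     # assumed calibration settings
act_gib = batch_size * seq_len * hidden * bytes_per_elem / 1024**3
print(f"one bf16 hidden-state tensor: ~{act_gib:.2f} GiB")  # ~0.25 GiB per tensor
```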

wenhuach21 (Contributor)

The solution is incorrect.

xin3he (Contributor, Author) commented Sep 28, 2025

Okay, I can share the reproduction steps with you so that we can find a correct solution. @wenhuach21
auto-round --model /models/Llama-3.3-70B-Instruct/ --scheme "MXFP4" --device_map 4,5,6

xin3he (Contributor, Author) commented Sep 28, 2025

BTW, considering only the weight size is not correct; for example, the large in_features of down_proj requires more memory to hold its input activation. A rough estimate is sketched after the module printout below.

(mlp): LlamaMLP(
  (gate_proj): Linear(in_features=8192, out_features=28672, bias=False)
  (up_proj): Linear(in_features=8192, out_features=28672, bias=False)
  (down_proj): Linear(in_features=28672, out_features=8192, bias=False)
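To make this concrete, here is a minimal sketch of a per-Linear estimate that counts the cached input activation as well as the weight; the function name estimate_linear_mem_bytes and the batch_size=8, seq_len=2048, bf16 settings are my assumptions for illustration, not AutoRound's actual device-mapping logic:

```python
def estimate_linear_mem_bytes(in_features: int, out_features: int,
                              batch_size: int, seq_len: int,
                              bytes_per_elem: int = 2) -> int:
    """Rough memory needed to tune one Linear: its weight plus the cached
    input activation (illustrative only)."""
    weight = in_features * out_features * bytes_per_elem
    cached_input = batch_size * seq_len * in_features * bytes_per_elem
    return weight + cached_input

# Shapes taken from the LlamaMLP printout above.
shapes = {"gate_proj": (8192, 28672),
          "up_proj": (8192, 28672),
          "down_proj": (28672, 8192)}
for name, (fin, fout) in shapes.items():
    gib = estimate_linear_mem_bytes(fin, fout, batch_size=8, seq_len=2048) / 1024**3
    print(f"{name}: ~{gib:.2f} GiB")
# gate_proj/up_proj: ~0.69 GiB each; down_proj: ~1.31 GiB despite the same
# weight size, because its in_features is 3.5x larger.
```

With the activation term included, a device map based on weight size alone can underestimate what a card actually needs once the calibration batch size grows.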

xin3he added the WIP label Sep 29, 2025
xin3he marked this pull request as draft September 29, 2025 06:18