Re: Incorporate Marlin for GPTQ checkpoints into tgis_native #66

Merged
merged 5 commits into from
Mar 25, 2024

@cyang49 cyang49 commented Mar 22, 2024

Resubmitting Marlin PR due to accidental removal

Motivation

This PR enables the use of the Marlin kernel for GPTQ checkpoints. Marlin has been shown to outperform Exllamav2 on Nvidia GPUs, especially for larger batch sizes.

Modifications

The code changes are mostly similar to those for exllamav2, except that they use the Marlin kernel code and binding from the AutoGPTQ package instead of sourcing a separate marlin package. I adapted the QuantLinear implementation from AutoGPTQ, removing code that we don't need. Note that my changes also enable Marlin support for checkpoints that use activation reordering (desc_act=True).

Marlin can be turned on by setting the environment variable GPTQ_CUDA_TYPE=marlin.

Note that the Marlin kernel only works on Nvidia GPUs with compute capability >= 8.0.
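
As a rough illustration of how the two conditions above interact (the GPTQ_CUDA_TYPE switch and the compute capability requirement), a kernel-selection helper might look like the sketch below. This is not the code in this PR: the function name `select_gptq_kernel`, the `"exllama"` default string, the lowercase normalization, and the silent fallback on older GPUs are all assumptions made for illustration.

```python
import os

# From the note above: Marlin needs compute capability >= 8.0.
MARLIN_MIN_CAPABILITY = (8, 0)


def select_gptq_kernel(capability, env=None, default="exllama"):
    """Hypothetical helper: choose a GPTQ kernel name.

    capability: (major, minor) tuple describing the GPU.
    env: mapping to read GPTQ_CUDA_TYPE from (defaults to os.environ).
    default: kernel name used when nothing is requested (assumed value).
    """
    env = os.environ if env is None else env
    requested = env.get("GPTQ_CUDA_TYPE", default).lower()
    if requested == "marlin" and tuple(capability) < MARLIN_MIN_CAPABILITY:
        # Assumption: fall back to the default instead of raising an error.
        return default
    return requested


if __name__ == "__main__":
    # Ampere-class GPU (8.0) with Marlin requested.
    print(select_gptq_kernel((8, 0), {"GPTQ_CUDA_TYPE": "marlin"}))
    # Turing-class GPU (7.5) with Marlin requested: falls back.
    print(select_gptq_kernel((7, 5), {"GPTQ_CUDA_TYPE": "marlin"}))
```

On a real machine the capability tuple could come from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` pair for the current device.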

Result

[Llama-70B-4bit-128g]
Single A100 80GB, 1k context, 512 output tokens, batch size = 16

Marlin
Prefill: 12.2s, Inference time: 38.57s
Exllamav2
Prefill: 9.68s, Inference time: 79.7s
  • Further investigation is needed, as Marlin prefill appears slower.
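
For reference, the ratios implied by these timings can be worked out directly. The snippet below uses only the four numbers reported above; the variable names are illustrative, and it makes no assumption about whether the reported inference time includes prefill.

```python
# Timings reported above, in seconds
# (Llama-70B-4bit-128g, single A100 80GB, batch size 16).
marlin = {"prefill": 12.2, "inference": 38.57}
exllamav2 = {"prefill": 9.68, "inference": 79.7}

# How much faster Marlin is on the reported inference time.
inference_speedup = exllamav2["inference"] / marlin["inference"]

# How much slower Marlin is on the reported prefill time.
prefill_slowdown = marlin["prefill"] / exllamav2["prefill"]

print(f"Marlin inference speedup: {inference_speedup:.2f}x")  # 2.07x
print(f"Marlin prefill slowdown:  {prefill_slowdown:.2f}x")   # 1.26x
```

So on this single data point Marlin roughly halves the reported inference time while taking about a quarter longer in prefill.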

The code needs to be tested more thoroughly, for both performance and correctness, in the following scenarios:

  • Should not break fp16 logic
  • Should work correctly for desc_act=False GPTQ checkpoints, with optimal performance
  • Should work correctly for desc_act=True GPTQ checkpoints, with performance only slightly worse than in the previous scenario
  • Should not break TP use, although TP performance still needs further optimization
  • Memory management needs extensive review

Related Issues

#51

cyang49 and others added 4 commits March 25, 2024 13:38
Signed-off-by: Chih-Chieh-Yang <[email protected]>
Signed-off-by: Chih-Chieh Yang <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Signed-off-by: cyang49 <[email protected]>

cyang49 commented Mar 25, 2024

@njhill I really need these changes for #67 for a more thorough performance test. I decided to keep exllama as the default to get these merged quicker. Please let me know if you need anything else before merging.

@njhill njhill left a comment

Thanks @cyang49 for this great work! Running some internal tests on this rn and will merge as soon as those pass.

@njhill njhill merged commit 316ca8d into IBM:main Mar 25, 2024
@cyang49 cyang49 deleted the pr_marlin branch March 25, 2024 19:36