Improve performance for quantized models on Power10 CPU #408
mgiessing started this conversation in Support for Targets (OS / EPs / Hardware)
Replies: 2 comments 2 replies
-
@mgiessing, int4 doesn't have a special kernel for Power10 CPU yet, and I have no plans to support it yet. Do you want to contribute?
-
@yufenglee, is that something that would need to be addressed within onnxruntime-genai, or via onnxruntime MLAS? I just synced with our optimization & kernel team, and they confirmed we have only implemented Power10 MMA (Matrix Math Accelerator) kernels for fp32 & int8 in onnxruntime/mlas. That leads to another question: are you going to support int8 quantization?
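For context, plain onnxruntime already ships dynamic int8 quantization tooling; a minimal sketch with placeholder file paths (whether the genai runtime would accept such a model is part of the question):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize MatMul/Gemm weights to int8, the precision for which Power10 MMA
# kernels already exist in onnxruntime/mlas. Paths are placeholders.
quantize_dynamic(
    model_input="phi2-fp32/model.onnx",
    model_output="phi2-int8/model.onnx",
    weight_type=QuantType.QInt8,
)
```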
-
The quantized version of the sample model (phi-2) seems to have poor performance compared to fp32.
How to reproduce?
Preparation of the models:
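A minimal sketch using the onnxruntime-genai model builder; the output directories are placeholders:

```bash
# Export phi-2 twice with the onnxruntime-genai model builder:
# an fp32 baseline and an int4-quantized variant.
python3 -m onnxruntime_genai.models.builder -m microsoft/phi-2 -o ./phi2-fp32 -p fp32 -e cpu
python3 -m onnxruntime_genai.models.builder -m microsoft/phi-2 -o ./phi2-int4 -p int4 -e cpu
```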
Script to run the inference:
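A minimal benchmark sketch following the phi-2 example shipped with onnxruntime-genai 0.2.0; the model path and prompt are placeholders, and API details may differ in other releases:

```python
import time

import onnxruntime_genai as og

# Placeholder path: swap in ./phi2-fp32 for the baseline run.
model = og.Model("./phi2-int4")
tokenizer = og.Tokenizer(model)

prompt = "Explain quantization of neural networks in one paragraph."
input_ids = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = input_ids

# Time end-to-end generation and report tokens per second.
start = time.perf_counter()
output_tokens = model.generate(params)[0]  # batch size 1
elapsed = time.perf_counter() - start

print(tokenizer.decode(output_tokens))
generated = len(output_tokens) - len(input_ids)
print(f"{generated / elapsed:.2f} tokens/s over {elapsed:.1f} s")
```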
System information
OS: AlmaLinux 9.3
IBM Power10 CPU (ppc64le)
```
$ pip3 list installed | grep onnxruntime
onnxruntime        1.17.3
onnxruntime-genai  0.2.0rc4
```
Does anyone have an idea why the quantized performance is so poor? I could imagine some data type conversion is adding overhead.
Thank you!