Oh wow! I just copied torch_dtype from the Transformers docs and didn't really think about it.

Yes, I am seeing similar performance (and GPU usage) if I update my call to models.Transformers() as follows:

from datetime import datetime
from guidance import models, gen, user, assistant
from guidance.chat import Phi3MiniChatTemplate
from accelerate import Accelerator


if __name__ == "__main__":
    accelerator = Accelerator()
    model = models.Transformers(
        "microsoft/Phi-3.5-mini-instruct",
        chat_template=Phi3MiniChatTemplate,
        torch_dtype="auto",
        device_map=accelerator.device,
    )
    with user():
        model += "Hello. How are you?\n"

    # Time how long the assistant's response takes to generate.
    response_start = datetime.now()
    with assistant():
        model += gen("response")
    print(f"Generated in {datetime.now() - response_start}")
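As an aside, torch_dtype="auto" tells Transformers to read the dtype from the checkpoint's config instead of the default float32, which halves memory use and speeds up GPU inference. If you want to see what "auto" resolves to without loading the weights, a quick check with plain Transformers (the bfloat16 value here is what the Phi-3.5 config ships with, so verify against your checkpoint):

from transformers import AutoConfig

# "auto" resolves to whatever dtype the checkpoint's config declares;
# for microsoft/Phi-3.5-mini-instruct that is bfloat16.
config = AutoConfig.from_pretrained("microsoft/Phi-3.5-mini-instruct")
print(config.torch_dtype)  # torch.bfloat16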
