
Conversation

@vmpuri (Contributor) commented Oct 4, 2024

Multi-turn conversations were not working in the browser for LLaMA 3.2 Vision. This PR fixes that for:

  • Text-only conversations
  • Conversations with a single image prompt (the message shapes are sketched below)
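
For context, the browser path submits turns in the OpenAI-style chat message format that the handler changes further below consume. Here is a minimal sketch of the two fixed cases; the data-URL payload is a placeholder, not content from this PR:

```python
# Text-only turn: content is a plain string.
text_turn = {"role": "user", "content": "What's this?"}

# Single-image turn: content becomes a list holding at most one text part
# and at most one image part -- the handler asserts both limits.
image_turn = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What's this?"},
        {"type": "image_url", "image_url": "data:image/jpeg;base64,<payload>"},
    ],
}
```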

Tests

| Model | Browser Test (Text Only) | Browser Test (with Image) | Generate CLI Test |
| --- | --- | --- | --- |
| LLaMA 3.1 | ✅ | - | ✅ |
| LLaMA 3.2 11B Vision | ✅ | ✅ | ✅ |

CLI Tests

```
python3 torchchat.py generate llama3.2-11B --prompt "What's this?" --image-prompt assets/dog.jpg

/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/data/metrics/__init__.py:19: UserWarning: A NumPy version >=1.23.5 and <2.3.0 is required for this version of SciPy (detected version 1.23.2)
  from scipy.stats import pearsonr, spearmanr
2024-10-04:18:39:09,981 INFO     [sdpa_with_kv_cache.py:28] Loading custom ops library: /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/executorch/extension/llm/custom_ops/libcustom_ops_aot_lib.dylib
Using device=mps 
Loading model...
Time to load model: 38.54 seconds
-----------------------------------------------------------
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/torch/nn/functional.py:5096: UserWarning: MPS: The constant padding of more than 3 dimensions is not currently supported natively. It uses View Ops default implementation to run. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Pad.mm:465.)
  return torch._C._nn.pad(input, pad, mode, value)
What's this?The image depicts a dog riding a skateboard on a road, showcasing its unique and playful appearance.

* A dog:
        + Sitting on a skateboard
        + Wearing sunglasses
        + Has a blue collar around its neck
        + Ears perked up
        + Tongue out
* A skateboard:
        + Red in color
        + Has yellow wheels
        + Being ridden by the dog
* Sunglasses:
        + Pink in color
        + Worn by the dog

The image presents a lighthearted and humorous scene, with the dog's sunglasses and skateboard adding to its playful and carefree demeanor.2024-10-04:18:40:47,012 INFO     [generate.py:1138] 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 131 tokens                 
Time for inference 1: 56.0592 sec total                 
Time to first token: 14.2072 sec with parallel prefill.                

      Total throughput: 2.3547 tokens/sec, 0.4247 s/token                 
First token throughput: 0.0704 tokens/sec, 14.2072 s/token                 
 Next token throughput: 3.1301 tokens/sec, 0.3195 s/token                     
2024-10-04:18:40:47,012 INFO     [generate.py:1149] 
Bandwidth achieved: 50.14 GB/s
2024-10-04:18:40:47,012 INFO     [generate.py:1153] *** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================


      Average tokens/sec (total): 2.35                 
Average tokens/sec (first token): 0.07                 
Average tokens/sec (next tokens): 3.13 
(.venv) puri@puri-mac torchchat % python3 torchchat.py generate --checkpoint-path /Users/puri/.torchchat/model-cache/stories15M/stories15M.pt --pte-path stories15M.pte
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/data/metrics/__init__.py:19: UserWarning: A NumPy version >=1.23.5 and <2.3.0 is required for this version of SciPy (detected version 1.23.2)
  from scipy.stats import pearsonr, spearmanr
2024-10-04:18:41:19,239 INFO     [sdpa_with_kv_cache.py:28] Loading custom ops library: /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/executorch/extension/llm/custom_ops/libcustom_ops_aot_lib.dylib
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Using device=mps 
Loading model...
Cannot load specified PTE to mps. Attempting to load model to CPU instead
Time to load model: 0.05 seconds
[program.cpp:134] InternalConsistency verification requested but not available
-----------------------------------------------------------
Hello, my name is9 in the garden. The garden is very beautiful and all the flowers she blooms were so happy. One day, an old flower started to bloom. It was so beautiful that it fluttered around and around. Then, it was a small flower that bloomed in many different colors.
A few kids came to the garden and saw the beautiful flower. They thought it was so special, so they wanted to see it bloom. But the old flower was too high up on a tall tree, so they couldn't reach2024-10-04:18:41:19,824 INFO     [generate.py:1138] 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 109 tokens                 
Time for inference 1: 0.2679 sec total                 
Time to first token: 0.0256 sec with sequential prefill.                

      Total throughput: 410.6411 tokens/sec, 0.0024 s/token                 
First token throughput: 39.0747 tokens/sec, 0.0256 s/token                 
 Next token throughput: 449.8894 tokens/sec, 0.0022 s/token                     
2024-10-04:18:41:19,824 INFO     [generate.py:1149] 
Bandwidth achieved: 0.00 GB/s
2024-10-04:18:41:19,824 INFO     [generate.py:1153] *** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================


      Average tokens/sec (total): 410.64                 
Average tokens/sec (first token): 39.07                 
Average tokens/sec (next tokens): 449.89 
(.venv) puri@puri-mac torchchat % python3 torchchat.py generate llama3 --prompt "Give me a table of 4 cold war era jets and their max speeds in km/h."                 
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/data/metrics/__init__.py:19: UserWarning: A NumPy version >=1.23.5 and <2.3.0 is required for this version of SciPy (detected version 1.23.2)
  from scipy.stats import pearsonr, spearmanr
2024-10-04:18:41:28,664 INFO     [sdpa_with_kv_cache.py:28] Loading custom ops library: /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/executorch/extension/llm/custom_ops/libcustom_ops_aot_lib.dylib
Using device=mps 
Loading model...
Time to load model: 27.24 seconds
-----------------------------------------------------------
Give me a table of 4 cold war era jets and their max speeds in km/h.Here is a table of 4 Cold War era jets and their maximum speeds in km/h:

| Aircraft | Country | Maximum Speed (km/h) |
| --- | --- | --- |
| Mikoyan-Gurevich MiG-15 | Soviet Union | 1,050 |
| North American F-86 Sabre | United States | 1,075 |
| Supermarine Swift F.7 | United Kingdom | 1,110 |
| Lockheed F-104 Starfighter | United States | 2,184 |

Note: The maximum speeds listed are approximate and may vary depending on the specific variant and configuration of each aircraft.2024-10-04:18:42:26,483 INFO     [generate.py:1138] 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 130 tokens                 
Time for inference 1: 30.2699 sec total                 
Time to first token: 1.5681 sec with parallel prefill.                

      Total throughput: 4.3277 tokens/sec, 0.2311 s/token                 
First token throughput: 0.6377 tokens/sec, 1.5681 s/token                 
 Next token throughput: 4.5293 tokens/sec, 0.2208 s/token                     
2024-10-04:18:42:26,483 INFO     [generate.py:1149] 
Bandwidth achieved: 69.51 GB/s
2024-10-04:18:42:26,483 INFO     [generate.py:1153] *** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================


      Average tokens/sec (total): 4.33                 
Average tokens/sec (first token): 0.64                 
Average tokens/sec (next tokens): 4.53
```

Tested with server + browser:

| Server Terminal | Browser Terminal |
| --- | --- |
| `python3 torchchat.py server llama3.2-11b` | `streamlit run torchchat/usages/browser.py` |
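
To exercise the server endpoint without the browser, something like the following should work, assuming the default localhost port and the OpenAI-style chat completions route (both are assumptions; check the server's startup output):

```python
# Minimal smoke test against the torchchat server. Host, port, route, and
# model name are assumptions -- adjust them to match your local setup.
import requests

resp = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions",
    json={
        "model": "llama3.2-11b",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
)
print(resp.json())
```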

Single Image

*(screenshot)*

Text Only

*(screenshot)*

@pytorch-bot commented Oct 4, 2024
🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1270

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d494aa0 with merge base d8c0aaf:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Oct 4, 2024
@vmpuri marked this pull request as ready for review on October 4, 2024 at 21:23
```python
if len(prompt.shape) > 1:
    prompt = prompt.squeeze(0)
T = prompt.size(0)
max_new_tokens = min(max_new_tokens, max_seq_length - start_pos - T)
```
Contributor: Is it necessary to remove this line?

Contributor: This seems like it would be a problem with long prompt lengths.
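
To make that concern concrete, here is an illustrative run of the clamp with invented numbers:

```python
# Illustrative numbers only: with a long prompt, the clamp is what keeps
# prompt tokens plus generated tokens within the sequence/KV-cache budget.
max_seq_length = 8192
max_new_tokens = 500     # requested new tokens
start_pos = 0
T = 8000                 # prompt tokens
max_new_tokens = min(max_new_tokens, max_seq_length - start_pos - T)
assert max_new_tokens == 192  # only 192 new tokens still fit
```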

```python
]
image_found = False
messages = []
for message in prompt:
```
@Jack-Khuu (Contributor) commented Oct 4, 2024:

This would be `torchchat.py generate Llama3.2-11B`, right? Since it sends `prompt: str` and uses the `image_prompt` field, you might need to "create" a container prompt from those two.

Contributor: Or `chat` could call a curried version of this function that creates the expected format before calling it (see the sketch below).
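
A rough sketch of that idea; the helper name and signature here are hypothetical, not part of this PR:

```python
from typing import List, Optional

def _wrap_cli_prompt(prompt: str, image_prompt: Optional[str]) -> List[dict]:
    # Hypothetical helper: build the message-list "container" format that
    # _gen_model_input expects from the raw CLI arguments.
    content = [{"type": "text", "text": prompt}]
    if image_prompt is not None:
        content.append({"type": "image_url", "image_url": image_prompt})
    return [{"role": "user", "content": content}]
```

`chat`/`generate` could then call `self._gen_model_input(prompt=_wrap_cli_prompt(prompt, image_prompt), ...)` so the CLI and server paths feed the same function.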

Comment on lines 316 to +334
```diff
 if not isinstance(self.model, FlamingoModel):
     prompt = [
         {"role": message["role"], "content": message["content"]}
-        for message in completion_request.messages
+        for message in messages
     ]
     return self._gen_model_input(
         prompt=prompt, max_new_tokens=completion_request.max_tokens
     )

 # Llama 3.2 11B
 prompt = None
 images = None

 for message in messages:
     torchtune_contents = []
     if isinstance(message["content"], list):
         for content_dict in message["content"]:
             if content_dict["type"] == "text":
                 assert (
                     prompt is None
                 ), "At most one text prompt is supported for each request"
                 prompt = content_dict["text"]
             elif content_dict["type"] == "image_url":
                 assert (
                     images is None
                 ), "At most one image is supported at the moment"

                 base64_decoded = base64.b64decode(
                     content_dict["image_url"].split(";base64,")[1]
                 )
                 images = [Image.open(BytesIO(base64_decoded))]

 assert prompt is not None, "Text prompt must be specified in the request"

 return self._gen_model_input(prompt, images, completion_request.max_tokens)

-prompt = [
-    {"role": message["role"], "content": message["content"]}
-    for message in messages
-]
-
-return self._gen_model_input(
-    prompt=prompt, max_new_tokens=completion_request.max_tokens
-)
```
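
For reference, the client-side encoding that produces an `image_url` value compatible with the `split(";base64,")` above might look like this (a sketch; the data-URL prefix is an assumption consistent with that split, not code from this PR):

```python
import base64

def encode_image_as_data_url(path: str) -> str:
    # The server recovers the payload via
    # content_dict["image_url"].split(";base64,")[1]
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{payload}"
```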
Contributor: Unnecessary `if` check now.

@vmpuri force-pushed the multiturn-mm-single-image branch from f5b512e to be0632b on October 5, 2024 at 00:07
@vmpuri force-pushed the multiturn-mm-single-image branch from be0632b to d494aa0 on October 5, 2024 at 01:43
@vmpuri merged commit d0993b3 into main on Oct 5, 2024
52 checks passed