This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Commit f7aef8a

🦙🦙 Llama3.2 Release: Support for 1B, 3B, and 11B (Multimodal) (#1204)

1 parent 0b8ca05 commit f7aef8a

File tree: 14 files changed, +272 −21 lines

README.md
Lines changed: 14 additions & 0 deletions

```diff
@@ -2,6 +2,12 @@
 
 torchchat is a small codebase showcasing the ability to run large language models (LLMs) seamlessly. With torchchat, you can run LLMs using Python, within your own (C/C++) application (desktop or server) and on iOS and Android.
 
+> [!IMPORTANT]
+> Update September 25, 2024: torchchat has multimodal support for **Llama3.2 11B**!!
+>
+> To try it out, finish the [Installation](#Installation) section below, then hop
+> over to our [multimodal guide](docs/multimodal.md) to learn more.
+
 
 ## What can you do with torchchat?
 - [Run models via PyTorch / Python](#running-via-pytorch--python)
@@ -18,6 +24,7 @@ torchchat is a small codebase showcasing the ability to run large language model
 
 
 ## Highlights
+
 - Command line interaction with popular LLMs such as Llama 3, Llama 2, Stories, Mistral and more
 - PyTorch-native execution with performance
 - Supports popular hardware and OS
@@ -514,6 +521,13 @@ aliases.
 
 | Model | Mobile Friendly | Notes |
 |------------------|---|---------------------|
+|[meta-llama/Meta-Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)||Tuned for `chat`. Alias to `llama3.2-3b`.|
+|[meta-llama/Meta-Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B)||Best for `generate`. Alias to `llama3.2-3b-base`.|
+|[meta-llama/Llama-Guard-3-1B](https://huggingface.co/meta-llama/Llama-Guard-3-1B)||Tuned for classification. Alias to `llama3-1b-guard`.|
+|[meta-llama/Meta-Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)||Tuned for `chat`. Alias to `llama3.2-1b`.|
+|[meta-llama/Meta-Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B)||Best for `generate`. Alias to `llama3.2-1b-base`.|
+|[meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)||Multimodal (Image + Text). Tuned for `chat`. Alias to `llama3.2-11B`.|
+|[meta-llama/Llama-3.2-11B-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision)||Multimodal (Image + Text). Tuned for `generate`. Alias to `llama3.2-11B-base`.|
 |[meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)||Tuned for `chat`. Alias to `llama3.1`.|
 |[meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B)||Best for `generate`. Alias to `llama3.1-base`.|
 |[meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)||Tuned for `chat`. Alias to `llama3`.|
```
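The aliases in the rightmost column can stand in for the full model name in any torchchat command. As an illustrative example (assuming you have accepted the model license and authenticated with Hugging Face):

```bash
# Download the 1B instruct checkpoint by alias, then chat with it.
python3 torchchat.py download llama3.2-1b
python3 torchchat.py chat llama3.2-1b
```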

assets/dog.jpg

43.6 KB (new binary file)

docs/multimodal.md
Lines changed: 73 additions & 0 deletions (new file)

# Multimodal Models

Released on September 25th, 2024, **Llama3.2 11B Vision** is torchchat's first multimodal model.

This page goes over the different commands you can run with Llama 3.2 11B Vision.

## Model Access

> [!NOTE]
> While the commands refer to the model as some variant of "Llama 3.2 11B Vision",
> the underlying checkpoint used is based on the "Instruct" variant of the model.

**Llama3.2 11B Vision** is available via both [Hugging Face](https://huggingface.co/meta-llama) and [directly from Meta](https://www.llama.com/).

While we strongly encourage you to use the Hugging Face checkpoint (the default for torchchat commands that take the `llama3.2-11B` argument), we also support manually supplied checkpoints. To use one, replace the `llama3.2-11B` argument in the commands below with the following:

```
--checkpoint-path <file.pth> --tokenizer-path <tokenizer.model> --params-path torchchat/model_params/Llama-3.2-11B-Vision.json
```
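Putting that together, a full `generate` invocation with a manually supplied checkpoint might look like the sketch below; the `.pth` and tokenizer paths are placeholders for wherever your Meta-format files live:

```bash
python3 torchchat.py generate \
  --checkpoint-path model.pth \
  --tokenizer-path tokenizer.model \
  --params-path torchchat/model_params/Llama-3.2-11B-Vision.json \
  --prompt "What's in this image?" --image-prompt assets/dog.jpg
```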
## Generation

**We are currently debugging multimodal inference on MPS and will have updates soon. In the meantime, when testing on Mac, please set `--device cpu`.**

This generates text output based on a text prompt and an (optional) image prompt.

```
python torchchat.py generate llama3.2-11B --prompt "What's in this image?" --image-prompt assets/dog.jpg
```
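On a Mac, that means appending the device flag to the same command:

```bash
python torchchat.py generate llama3.2-11B --prompt "What's in this image?" --image-prompt assets/dog.jpg --device cpu
```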
## Server

This mode exposes a REST API for interacting with a model.
The server follows the [OpenAI API specification](https://platform.openai.com/docs/api-reference/chat) for chat completions.

To test out the REST API, **you'll need 2 terminals**: one to host the server, and one to send the request.

In one terminal, start the server:

[skip default]: begin

```bash
python3 torchchat.py server llama3.2-11B
```
[skip default]: end

In another terminal, query the server using `curl`. This query might take a few minutes to respond.

**We are currently debugging the server integration and will have updated examples shortly.**
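In the meantime, a request shaped like OpenAI's chat-completions examples conveys the idea. The sketch below assumes the server's default local port and an OpenAI-style route, and `<BASE64_IMAGE>` is a placeholder for a base64-encoded image; adjust all three to your setup:

```bash
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2-11B",
    "max_tokens": 300,
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<BASE64_IMAGE>"}}
        ]
      }
    ]
  }'
```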
## Browser

This command opens a basic browser interface for local chat by querying a local server.

First, follow the steps in the Server section above to start a local server. Then, in another terminal, launch the interface. Running the following will open a tab in your browser:

[skip default]: begin

```
streamlit run torchchat/usages/browser.py
```

**We are currently debugging the browser integration and will have updated examples shortly.**
---

# Future Work

One of the goals of torchchat is to support various execution modes for every model. The following execution modes will be supported for **Llama3.2 11B Vision** in the near future:

- **[torch.compile](https://pytorch.org/docs/stable/torch.compiler.html)**: Optimize inference via JIT compilation
- **[AOTI](https://pytorch.org/blog/pytorch2-2/)**: Enable pre-compiled and C++ inference
- **[ExecuTorch](https://github.com/pytorch/executorch)**: On-device (edge) inference

In addition, we are in the process of integrating with [lm_evaluation_harness](https://github.com/EleutherAI/lm-evaluation-harness) for multimodal model evaluation.

torchchat/cli/builder.py
Lines changed: 2 additions & 5 deletions

```diff
@@ -16,10 +16,7 @@
 import torch._inductor.config
 import torch.nn as nn
 
-try:
-    from _torchchat_test_script import flamingo_meta_to_tune
-except ImportError:
-    pass
+from torchtune.models.llama3_2_vision._convert_weights import llama3_vision_meta_to_tune
 
 from distributed import launch_distributed, ParallelDims, parallelize_llama
 
@@ -404,7 +401,7 @@ def _load_model_default(builder_args: BuilderArgs) -> Model:
         for submodule in model.modules():
             if isinstance(submodule, Llama3ScaledRoPE):
                 submodule.__init__(head_dim, max_seq_len, rope_base)
-        state_dict = flamingo_meta_to_tune(checkpoint)
+        state_dict = llama3_vision_meta_to_tune(checkpoint)
         model.model.load_state_dict(state_dict, assign=True, strict=False)
     else:
         checkpoint = {"model." + k: v for k, v in checkpoint.items()}
```

torchchat/cli/convert_hf_checkpoint.py
Lines changed: 22 additions & 0 deletions

```diff
@@ -34,6 +34,8 @@ def convert_hf_checkpoint(
     if model_name is None:
         model_name = model_dir.name
 
+    # TODO: This is an incongruent way of resolving config_args
+    # See https://github.com/pytorch/torchchat/issues/1179
     config_args = ModelArgs.from_name(model_name).transformer_args['text']
     config = TransformerArgs.from_params(config_args)
     print(f"Model config {config.__dict__}")
@@ -132,6 +134,26 @@ def permute(w, n_heads):
         os.remove(file)
 
 
+@torch.inference_mode()
+def convert_hf_checkpoint_to_tune(
+    *,
+    model_dir: Optional[Path] = None,
+    model_name: str,
+) -> None:
+    assert model_dir is not None
+
+    consolidated_pth = model_dir / "original" / "consolidated.pth"
+    tokenizer_pth = model_dir / "original" / "tokenizer.model"
+    if consolidated_pth.is_file() and tokenizer_pth.is_file():
+        print(f"Moving checkpoint to {model_dir / 'model.pth'}.")
+        os.rename(consolidated_pth, model_dir / "model.pth")
+        print(f"Moving tokenizer to {model_dir / 'tokenizer.model'}.")
+        os.rename(tokenizer_pth, model_dir / "tokenizer.model")
+        print("Done.")
+    else:
+        raise RuntimeError(f"Could not find {consolidated_pth}")
+
+
 if __name__ == "__main__":
     import argparse
 
```
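The new helper rewrites no tensors; it only renames the Meta-format artifacts into the layout the rest of torchchat expects. A hypothetical direct invocation (the cache directory shown is illustrative):

```python
from pathlib import Path

from torchchat.cli.convert_hf_checkpoint import convert_hf_checkpoint_to_tune

# Expects <model_dir>/original/consolidated.pth and
# <model_dir>/original/tokenizer.model to exist.
convert_hf_checkpoint_to_tune(
    model_dir=Path("~/.torchchat/model-cache/meta-llama/Llama-3.2-11B-Vision-Instruct").expanduser(),
    model_name="meta-llama/Llama-3.2-11B-Vision-Instruct",
)
```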

torchchat/cli/download.py
Lines changed: 12 additions & 6 deletions

```diff
@@ -10,7 +10,7 @@
 from pathlib import Path
 from typing import Optional
 
-from torchchat.cli.convert_hf_checkpoint import convert_hf_checkpoint
+from torchchat.cli.convert_hf_checkpoint import convert_hf_checkpoint, convert_hf_checkpoint_to_tune
 from torchchat.model_config.model_config import (
     load_model_configs,
     ModelConfig,
@@ -50,11 +50,17 @@ def _download_hf_snapshot(
         else:
             raise e
 
-    # Convert the model to the torchchat format.
-    print(f"Converting {model_config.name} to torchchat format...", file=sys.stderr)
-    convert_hf_checkpoint(
-        model_dir=artifact_dir, model_name=model_config.name, remove_bin_files=True
-    )
+    # Convert the Multimodal Llama model to the torchtune format.
+    if model_config.name in {"meta-llama/Llama-3.2-11B-Vision-Instruct", "meta-llama/Llama-3.2-11B-Vision"}:
+        print(f"Converting {model_config.name} to torchtune format...", file=sys.stderr)
+        convert_hf_checkpoint_to_tune(model_dir=artifact_dir, model_name=model_config.name)
+
+    else:
+        # Convert the model to the torchchat format.
+        print(f"Converting {model_config.name} to torchchat format...", file=sys.stderr)
+        convert_hf_checkpoint(
+            model_dir=artifact_dir, model_name=model_config.name, remove_bin_files=True
+        )
 
 
 def _download_direct(
```
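In practice this branch fires when either vision model is fetched, for example:

```bash
# Downloads the snapshot, then moves it into the torchtune layout.
python3 torchchat.py download llama3.2-11B
```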

torchchat/generate.py
Lines changed: 2 additions & 5 deletions

```diff
@@ -20,10 +20,7 @@
 import torch._dynamo.config
 import torch._inductor.config
 
-try:
-    from _torchchat_test_script import flamingo_transform
-except ImportError:
-    pass
+from torchtune.models.llama3_2_vision._model_builders import llama3_2_vision_transform
 
 from PIL import Image
 
@@ -753,7 +750,7 @@ def chat(
                 Message(role="assistant", content=""),
             ]
 
-            transform = flamingo_transform(str(self.tokenizer_args.tokenizer_path))
+            transform = llama3_2_vision_transform(str(self.tokenizer_args.tokenizer_path))
 
             with torch.device(device=self.builder_args.device), set_default_dtype(self.dtype):
                 data = transform({"messages": messages}, inference=True)
```
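For reference, torchtune's transform turns a list of `Message` objects into model-ready inputs. A minimal sketch, assuming torchtune is installed and a tokenizer file is at hand (the path is illustrative):

```python
from torchtune.data import Message
from torchtune.models.llama3_2_vision._model_builders import llama3_2_vision_transform

# Build the combined tokenizer/image transform from a tokenizer file.
transform = llama3_2_vision_transform("tokenizer.model")

messages = [
    Message(role="user", content="What's in this image?"),
    Message(role="assistant", content=""),
]
data = transform({"messages": messages}, inference=True)
```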

torchchat/model_config/models.json
Lines changed: 38 additions & 0 deletions

```diff
@@ -69,6 +69,44 @@
     "distribution_path": "meta-llama/Meta-Llama-3.1-70B-Instruct",
     "transformer_params_key": "Meta-Llama-3.1-70B-Tune"
   },
+  "meta-llama/Meta-Llama-3.2-1B": {
+    "aliases": ["llama3.2-1b-base"],
+    "distribution_channel": "HuggingFaceSnapshot",
+    "distribution_path": "meta-llama/Llama-3.2-1B"
+  },
+  "meta-llama/Meta-Llama-3.2-1B-Instruct": {
+    "aliases": ["llama3.2-1b", "llama3.2-1b-chat", "llama3.2-1b-instruct"],
+    "distribution_channel": "HuggingFaceSnapshot",
+    "distribution_path": "meta-llama/Llama-3.2-1B-Instruct",
+    "transformer_params_key": "Meta-Llama-3.2-1B"
+  },
+  "meta-llama/Llama-Guard-3-1B": {
+    "aliases": ["llama3-1b-guard", "llama3.2-1b-guard"],
+    "distribution_channel": "HuggingFaceSnapshot",
+    "distribution_path": "meta-llama/Llama-Guard-3-1B"
+  },
+  "meta-llama/Meta-Llama-3.2-3B": {
+    "aliases": ["llama3.2-3b-base"],
+    "distribution_channel": "HuggingFaceSnapshot",
+    "distribution_path": "meta-llama/Llama-3.2-3B"
+  },
+  "meta-llama/Meta-Llama-3.2-3B-Instruct": {
+    "aliases": ["llama3.2-3b", "llama3.2-3b-chat", "llama3.2-3b-instruct"],
+    "distribution_channel": "HuggingFaceSnapshot",
+    "distribution_path": "meta-llama/Llama-3.2-3B-Instruct",
+    "transformer_params_key": "Meta-Llama-3.2-3B"
+  },
+  "meta-llama/Llama-3.2-11B-Vision": {
+    "aliases": ["llama3.2-11B-base", "Llama-3.2-11B-Vision-base"],
+    "distribution_channel": "HuggingFaceSnapshot",
+    "distribution_path": "meta-llama/Llama-3.2-11B-Vision"
+  },
+  "meta-llama/Llama-3.2-11B-Vision-Instruct": {
+    "aliases": ["llama3.2-11B", "Llama-3.2-11B-Vision", "Llama-3.2-mm"],
+    "distribution_channel": "HuggingFaceSnapshot",
+    "distribution_path": "meta-llama/Llama-3.2-11B-Vision-Instruct",
+    "transformer_params_key": "Llama-3.2-11B-Vision"
+  },
   "meta-llama/CodeLlama-7b-Python-hf": {
     "aliases": ["codellama", "codellama-7b"],
     "distribution_channel": "HuggingFaceSnapshot",
```
torchchat/model_params/Llama-3.2-11B-Vision.json
Lines changed: 29 additions & 0 deletions (new file; this is the params path referenced in docs/multimodal.md)

```json
{
    "model_type": "flamingo",
    "use_tiktoken": true,
    "encoder": {
        "patch_size": 14,
        "num_heads": 16,
        "clip_embed_dim": 1280,
        "clip_num_layers": 32,
        "clip_hidden_states": [3, 7, 15, 23, 30],
        "decoder_embed_dim": 4096,
        "num_layers_projection": 8,
        "tile_size": 560,
        "max_num_tiles": 4,
        "in_channels": 3
    },
    "decoder": {
        "vocab_size": 128256,
        "num_layers": 32,
        "fusion_interval": 4,
        "num_special_tokens": 8,
        "num_heads": 32,
        "num_kv_heads": 8,
        "embed_dim": 4096,
        "max_seq_len": 131072,
        "encoder_max_seq_len": 128080,
        "rope_base": 500000.0,
        "intermediate_dim": 14336
    }
}
```
torchchat/model_params/Llama-Guard-3-1B.json
Lines changed: 20 additions & 0 deletions (new file; name inferred — the pruned 12-layer, dim-2048 configuration matches Llama Guard 3 1B, and the commit adds the `llama3-1b-guard` alias)

```json
{
    "block_size": 131072,
    "dim": 2048,
    "hidden_dim": 6400,
    "n_layers": 12,
    "n_heads": 32,
    "n_kv_heads": 8,
    "vocab_size": 128256,
    "ffn_dim_multiplier": 1.5,
    "multiple_of": 256,
    "norm_eps": 1e-05,
    "rope_theta": 500000.0,
    "rope_scaling": {
        "factor": 32.0,
        "low_freq_factor": 1.0,
        "high_freq_factor": 4.0,
        "original_max_position_embeddings": 8192
    },
    "use_tiktoken": true
}
```
