
Commit 0f28976

Merge branch 'main' into lessw2020/prefill
2 parents: 1ea7960 + 9c47edc

File tree: 7 files changed, +79 −45 lines

README.md

Lines changed: 15 additions & 16 deletions

````diff
@@ -10,7 +10,7 @@ torchchat is a small codebase showcasing the ability to run large language model
 - [Run chat in the Browser](#browser)
 - [Run models on desktop/server without python](#desktopserver-execution)
 - [Use AOT Inductor for faster execution](#aoti-aot-inductor)
-- [Running in c++ using the runner](#running-native-using-our-c-runner)
+- [Running in c++ using the runner](#run-using-our-c-runner)
 - [Run models on mobile](#mobile-execution)
 - [Deploy and run on iOS](#deploy-and-run-on-ios)
 - [Deploy and run on Android](#deploy-and-run-on-android)
@@ -33,7 +33,8 @@ torchchat is a small codebase showcasing the ability to run large language model
 ## Installation
 The following steps require that you have [Python 3.10](https://www.python.org/downloads/release/python-3100/) installed.

-*torchchat uses the latest changes from various PyTorch projects so it's highly recommended that you use a venv (by using the commands below) or CONDA.*
+> [!TIP]
+> torchchat uses the latest changes from various PyTorch projects so it's highly recommended that you use a venv (by using the commands below) or CONDA.

 [skip default]: begin
 ```bash
@@ -127,21 +128,21 @@ python3 torchchat.py download llama3.1
 <summary>Additional Model Inventory Management Commands</summary>

 ### List
-This subcommands shows the available models
+This subcommand shows the available models
 ```bash
 python3 torchchat.py list
 ```

 ### Where
-This subcommands shows location of a particular model.
+This subcommand shows location of a particular model.
 ```bash
 python3 torchchat.py where llama3.1
 ```
 This is useful in scripts when you do not want to hard-code paths


 ### Remove
-This subcommands removes the specified model
+This subcommand removes the specified model
 ```bash
 python3 torchchat.py remove llama3.1
 ```
@@ -181,18 +182,10 @@ python3 torchchat.py generate llama3.1 --prompt "write me a story about a boy an
 [skip default]: end

 ### Server
-**Note: This feature is still a work in progress and not all endpoints are working**
-
-
-<details>
-<summary>This mode gives a REST API that matches the OpenAI API spec for interacting with a model</summary>
-
+This mode exposes a REST API for interacting with a model.
 The server follows the [OpenAI API specification](https://platform.openai.com/docs/api-reference/chat) for chat completions.
-Since this feature is under active development, not every parameter is consumed. See api/api.py for details on
-which request parameters are implemented. If you encounter any issues, please comment on the [tracking Github issue](https://github.com/pytorch/torchchat/issues/973).

 To test out the REST API, **you'll need 2 terminals**: one to host the server, and one to send the request.
-
 In one terminal, start the server

 [skip default]: begin
@@ -204,8 +197,14 @@ python3 torchchat.py server llama3.1

 In another terminal, query the server using `curl`. Depending on the model configuration, this query might take a few minutes to respond.

-Setting `stream` to "true" in the request emits a response in chunks. If `stream` is unset or not "true", then the client will await the full response from the server.
+> [!NOTE]
+> Since this feature is under active development, not every parameter is consumed. See api/api.py for details on
+> which request parameters are implemented. If you encounter any issues, please comment on the [tracking Github issue](https://github.com/pytorch/torchchat/issues/973).

+<details>
+<summary>Example Query</summary>
+
+Setting `stream` to "true" in the request emits a response in chunks. If `stream` is unset or not "true", then the client will await the full response from the server.

 **Example Input + Output**

@@ -348,7 +347,7 @@ Specifically there are 2 ways of doing so: Pure Python and via a Runner

 ```
 # Execute
-python3 torchchat.py generate llama3.1 --device cpu --pte-path llama3.1.pte --prompt "Hello my name is"
+python3 torchchat.py generate llama3.1 --pte-path llama3.1.pte --prompt "Hello my name is"
 ```

 </details>
````
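The README's server walkthrough centers on `curl`, but the same request can be issued from any HTTP client. Below is a minimal sketch, not part of this commit: the port (5000 here) and the OpenAI-style `/v1/chat/completions` route are assumptions to verify against the README's own Example Input + Output section.

```python
# Minimal sketch, not from the commit: query the torchchat server with only
# the Python standard library. Port 5000 and the exact route are assumptions.
import json
import urllib.request

payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Hello my name is"}],
    # Setting "stream": "true" would emit the response in chunks, per the README.
}
req = urllib.request.Request(
    "http://127.0.0.1:5000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```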

install/install_requirements.sh

Lines changed: 43 additions & 11 deletions

```diff
@@ -41,13 +41,23 @@ fi
 )

 # Since torchchat often uses main-branch features of pytorch, only the nightly
-# pip versions will have the required features. The NIGHTLY_VERSION value should
+# pip versions will have the required features. The PYTORCH_NIGHTLY_VERSION value should
 # agree with the third-party/pytorch pinned submodule commit.
 #
 # NOTE: If a newly-fetched version of the executorch repo changes the value of
-# NIGHTLY_VERSION, you should re-run this script to install the necessary
+# PYTORCH_NIGHTLY_VERSION, you should re-run this script to install the necessary
 # package versions.
-NIGHTLY_VERSION=dev20240814
+PYTORCH_NIGHTLY_VERSION=dev20240814
+
+# Nightly version for torchvision
+VISION_NIGHTLY_VERSION=dev20240814
+
+# Nightly version for torchao
+AO_NIGHTLY_VERSION=dev20240905
+
+# Nightly version for torchtune
+TUNE_NIGHTLY_VERSION=dev20240910
+

 # Uninstall triton, as nightly will depend on pytorch-triton, which is one and the same
 (
@@ -67,23 +77,45 @@ fi

 # pip packages needed by exir.
 REQUIREMENTS_TO_INSTALL=(
-  torch=="2.5.0.${NIGHTLY_VERSION}"
+  torch=="2.5.0.${PYTORCH_NIGHTLY_VERSION}"
+  torchvision=="0.20.0.${VISION_NIGHTLY_VERSION}"
+)
+
+LINUX_REQUIREMENTS_TO_INSTALL=(
+  torchao=="0.5.0.${AO_NIGHTLY_VERSION}"
+  torchtune=="0.3.0.${TUNE_NIGHTLY_VERSION}"
 )

-# Install the requirements. `--extra-index-url` tells pip to look for package
+# Install the requirements. --extra-index-url tells pip to look for package
 # versions on the provided URL if they aren't available on the default URL.
 (
   set -x
   $PIP_EXECUTABLE install --extra-index-url "${TORCH_NIGHTLY_URL}" \
     "${REQUIREMENTS_TO_INSTALL[@]}"
 )

-# For torchao need to install from github since nightly build doesn't have macos build.
-# TODO: Remove this and install nightly build, once it supports macos
-(
-  set -x
-  $PIP_EXECUTABLE install git+https://github.com/pytorch/ao.git@e11201a62669f582d81cdb33e031a07fb8dfc4f3
-)
+PLATFORM=$(uname -s)
+
+# Install torchtune and torchao requirements for Linux systems using nightly.
+# For non-Linux systems (e.g., macOS), install torchao from GitHub since nightly
+# build doesn't have macOS build.
+# TODO: Remove this and install nightly build, once it supports macOS
+if [ "$PLATFORM" == "Linux" ];
+then
+  (
+    set -x
+    $PIP_EXECUTABLE install --pre --extra-index-url "${TORCH_NIGHTLY_URL}" --no-cache-dir \
+      "${LINUX_REQUIREMENTS_TO_INSTALL[@]}"
+  )
+else
+  # For torchao need to install from github since nightly build doesn't have macos build.
+  # TODO: Remove this and install nightly build, once it supports macos
+  (
+    set -x
+    $PIP_EXECUTABLE install git+https://github.com/pytorch/ao.git@e11201a62669f582d81cdb33e031a07fb8dfc4f3
+  )
+fi

 if [[ -x "$(command -v nvidia-smi)" ]]; then
   (
     set -x
```
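Since the script now pins four separate nightly versions, a quick way to confirm an environment actually picked them up is to compare installed wheel versions against the pins. The helper below is hypothetical, not part of the repo; the expected version strings are copied from the variables introduced in this diff (on macOS, torchao comes from a GitHub commit instead, so its version will not match a nightly pin).

```python
# Hypothetical helper, not part of the repo: check installed packages against
# the nightly pins introduced in install_requirements.sh.
import importlib.metadata
import platform

expected = {
    "torch": "2.5.0.dev20240814",
    "torchvision": "0.20.0.dev20240814",
}
if platform.system() == "Linux":
    # torchao/torchtune nightlies are only installed on Linux in this script.
    expected["torchao"] = "0.5.0.dev20240905"
    expected["torchtune"] = "0.3.0.dev20240910"

for pkg, want in expected.items():
    try:
        have = importlib.metadata.version(pkg)
    except importlib.metadata.PackageNotFoundError:
        print(f"{pkg}: not installed (expected {want})")
        continue
    # Nightly wheels may carry a local suffix such as +cu121, so prefix-match.
    ok = have.startswith(want)
    print(f"{pkg}: {have} {'OK' if ok else f'(expected {want})'}")
```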

torchchat/model.py

Lines changed: 17 additions & 14 deletions

```diff
@@ -10,7 +10,8 @@
 from dataclasses import dataclass
 from enum import Enum
 from pathlib import Path
-from typing import Callable, Dict, Optional, Union
+
+from typing import Any, Callable, Dict, Optional, Union
 from abc import ABC, abstractmethod

 import torch
@@ -132,7 +133,7 @@ class TransformerArgs:
     ffn_dim_multiplier: Optional[int] = None
     use_tiktoken: bool = False
     max_seq_length: int = 8192
-    use_scaled_rope: bool = False
+    rope_scaling: Optional[Dict[str, Any]] = None
     # For pipeline parallel
     n_stages: int = 1
     stage_idx: int = 0
@@ -418,8 +419,6 @@ def __init__(self, config: TransformerArgs) -> None:
         self.norm = None
         self.output = None

-        # self.freqs_cis: Optional[Tensor] = None
-        # self.mask_cache: Optional[Tensor] = None
         self.max_batch_size = -1
         self.max_seq_length = -1
         # For supporting sequence parallel (default is off, thus value of 1)
@@ -444,7 +443,7 @@ def setup_caches(self, max_batch_size, max_seq_length):
             self.config.dim // self.config.n_heads,
             self.config.block_size * 2,
             self.config.rope_base,
-            use_scaled=self.config.use_scaled_rope,
+            rope_scaling=self.config.rope_scaling,
         )
         self.register_buffer("freqs_cis", freqs_cis, persistent=True)
         causal_mask = torch.tril(
@@ -681,12 +680,16 @@ def forward(self, x: Tensor) -> Tensor:
         return output * self.weight


-def apply_scaling(freqs: torch.Tensor):
-    # Values obtained from grid search
-    scale_factor = 8
-    low_freq_factor = 1
-    high_freq_factor = 4
-    old_context_len = 8192  # original llama3 length
+def apply_scaling(freqs: torch.Tensor, rope_scaling: Dict[str, Any]):
+    # Check for the presence of the required keys
+    required_keys = {"factor", "low_freq_factor", "high_freq_factor", "original_max_position_embeddings"}
+    if not required_keys.issubset(rope_scaling.keys()):
+        raise ValueError(f"Missing required keys in apply_scaling. Expected: {required_keys}")
+
+    scale_factor = rope_scaling["factor"]
+    low_freq_factor = rope_scaling["low_freq_factor"]
+    high_freq_factor = rope_scaling["high_freq_factor"]
+    old_context_len = rope_scaling["original_max_position_embeddings"]

     low_freq_wavelen = old_context_len / low_freq_factor
     high_freq_wavelen = old_context_len / high_freq_factor
@@ -707,16 +710,16 @@ def apply_scaling(freqs: torch.Tensor):


 def precompute_freqs_cis(
-    n_elem: int, seq_len: int, base: int = 10000, dtype=None, use_scaled: bool = False
+    n_elem: int, seq_len: int, base: int = 10000, dtype=None, rope_scaling: Optional[Dict[str, Any]] = None
 ) -> Tensor:
     if not dtype:
         dtype = get_precision()
     freqs = 1.0 / (
         base ** (torch.arange(0, n_elem, 2)[: (n_elem // 2)].float() / n_elem)
     )
     t = torch.arange(seq_len, device=freqs.device)
-    if use_scaled:
-        freqs = apply_scaling(freqs)
+    if rope_scaling is not None:
+        freqs = apply_scaling(freqs, rope_scaling)
     freqs = torch.outer(t, freqs)
     freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
     cache = torch.stack([freqs_cis.real, freqs_cis.imag], dim=-1)
```
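The hunks above show `apply_scaling`'s new signature and parameter plumbing, but the diff context elides the interpolation body that consumes `low_freq_wavelen` and `high_freq_wavelen`. For readers who want the full picture, here is a self-contained sketch of the llama3.1-style scaling rule those values drive, under the assumption that the elided body follows the standard formulation; this is a reconstruction, not a copy of the file.

```python
# Sketch (assumed to match the elided body): llama3.1-style RoPE frequency
# scaling. Parameter names mirror the rope_scaling dict in the diff above.
import math
import torch

def apply_scaling_sketch(freqs: torch.Tensor, rope_scaling: dict) -> torch.Tensor:
    scale_factor = rope_scaling["factor"]                                # e.g. 8.0
    low_freq_factor = rope_scaling["low_freq_factor"]                    # e.g. 1.0
    high_freq_factor = rope_scaling["high_freq_factor"]                  # e.g. 4.0
    old_context_len = rope_scaling["original_max_position_embeddings"]   # e.g. 8192

    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor

    new_freqs = []
    for freq in freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            # High-frequency band: leave untouched.
            new_freqs.append(freq)
        elif wavelen > low_freq_wavelen:
            # Low-frequency band: scale down uniformly.
            new_freqs.append(freq / scale_factor)
        else:
            # In between: interpolate smoothly between the two regimes.
            smooth = (old_context_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            new_freqs.append((1 - smooth) * freq / scale_factor + smooth * freq)
    return torch.stack(new_freqs).to(freqs.dtype)
```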
Model configuration JSON files — 1 addition & 1 deletion each

```diff
@@ -1 +1 @@
-{"dim": 8192, "ffn_dim_multiplier": 1.3, "multiple_of": 4096, "n_heads": 64, "n_local_heads": 8, "n_layers": 80, "rope_base": 500000.0, "vocab_size": 128256, "use_tiktoken": true}
+{"block_size": 8192, "dim": 8192, "ffn_dim_multiplier": 1.3, "multiple_of": 4096, "n_heads": 64, "n_local_heads": 8, "n_layers": 80, "rope_base": 500000.0, "vocab_size": 128256, "use_tiktoken": true}
```

```diff
@@ -1 +1 @@
-{"dim": 4096, "ffn_dim_multiplier": 1.3, "multiple_of": 1024, "n_heads": 32, "n_local_heads": 8, "n_layers": 32, "rope_base": 500000.0, "vocab_size": 128256, "use_tiktoken": true}
+{"block_size": 8192, "dim": 4096, "ffn_dim_multiplier": 1.3, "multiple_of": 1024, "n_heads": 32, "n_local_heads": 8, "n_layers": 32, "rope_base": 500000.0, "vocab_size": 128256, "use_tiktoken": true}
```

```diff
@@ -1 +1 @@
-{"dim": 8192, "ffn_dim_multiplier": 1.3, "multiple_of": 4096, "n_heads": 64, "n_local_heads": 8, "n_layers": 80, "rope_base": 500000.0, "vocab_size": 128256, "use_tiktoken": true, "norm_eps": 1e-05, "use_scaled_rope": true}
+{"block_size": 131072, "dim": 8192, "ffn_dim_multiplier": 1.3, "multiple_of": 4096, "n_heads": 64, "n_local_heads": 8, "n_layers": 80, "rope_base": 500000.0, "vocab_size": 128256, "use_tiktoken": true, "norm_eps": 1e-05, "rope_scaling": {"factor": 8.0, "low_freq_factor": 1.0, "high_freq_factor": 4.0, "original_max_position_embeddings": 8192}}
```

```diff
@@ -1 +1 @@
-{"dim": 4096, "ffn_dim_multiplier": 1.3, "multiple_of": 1024, "n_heads": 32, "n_local_heads": 8, "n_layers": 32, "rope_base": 500000.0, "vocab_size": 128256, "use_tiktoken": true, "norm_eps": 1e-05, "use_scaled_rope": true}
+{"block_size": 131072, "dim": 4096, "ffn_dim_multiplier": 1.3, "multiple_of": 1024, "n_heads": 32, "n_local_heads": 8, "n_layers": 32, "rope_base": 500000.0, "vocab_size": 128256, "use_tiktoken": true, "norm_eps": 1e-05, "rope_scaling": {"factor": 8.0, "low_freq_factor": 1.0, "high_freq_factor": 4.0, "original_max_position_embeddings": 8192}}
```
