Commit eca60e7

763 files changed: +14098, -0 lines changed

LICENSE

Lines changed: 21 additions & 0 deletions

MIT License

Copyright (c) 2024 FasterDecoding

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 135 additions & 0 deletions

# Training-Free Activation Sparsity in Large Language Models

[[Paper](https://www.arxiv.org/abs/2408.14690)][[Blog](XXX)]

TEAL induces up to 40-50% model-wide activation sparsity in modern LLMs with minimal degradation, resulting in up to a 1.53-1.8x speedup in single-batch decoding.

<div align="center">
<img src="figures/clickbait.png" width="500" height="auto"/>
</div>

The current release supports:
- FP16 inference for Llama-2/3 models using uniform sparsities
- Accuracy evaluation for Llama-2/3 and Mistral models using uniform sparsities

Stay tuned for block-wise greedy sparsities!

## News

- [08/2024] 🔥 arXiv release!

## Abstract

Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL (**T**raining-Fre**e** **A**ctivation Sparsity in **L**LMs), a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40-50% model-wide sparsity with minimal performance degradation across the Llama-2, Llama-3, and Mistral families, at sizes ranging from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to 1.53× and 1.8× at 40% and 50% model-wide sparsity. TEAL is compatible with weight quantization, enabling further efficiency gains.
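As a rough illustration of the core idea (a NumPy sketch, not TEAL's actual kernel; the `sparsify` helper is hypothetical), magnitude-based activation sparsity zeroes out the lowest-magnitude fraction of entries in each hidden state:

```python
import numpy as np

def sparsify(x: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude `sparsity` fraction of entries of x."""
    # Threshold chosen so the target fraction of entries falls below it.
    tau = np.quantile(np.abs(x), sparsity)
    return np.where(np.abs(x) >= tau, x, 0.0)

x = np.random.randn(4096).astype(np.float32)  # a mock hidden state
x_sparse = sparsify(x, 0.5)
print((x_sparse == 0).mean())  # ~0.5 of the entries are zeroed
```

TEAL applies this per-tensor thresholding throughout the model, with thresholds calibrated offline from activation histograms rather than recomputed from a quantile at every step.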
## Contents

- [Install](#install)
- [Demo](#demo)
- [Inference Usage](#inference-usage)
- [Accuracy Usage](#accuracy-usage)
- [Citation](#citation)
## Install

1. Clone the repo and navigate to TEAL:

```bash
git clone https://github.com/FasterDecoding/TEAL
cd TEAL
```

2. Set up the environment:

```bash
conda create -yn teal python=3.11
conda activate teal

pip install -e .
```

3. (Optional) If you want to calibrate thresholds for your own models, or run accuracy evals, install the following dependency:

```bash
pip install -e ".[eval]"
```
## Inference Usage

For easy usage, we provide calibrated thresholds for Llama-2/3 and Mistral models in the `models/` folder.

1. Navigate to gpt-fast:

```bash
cd gpt-fast
```

2. Download the model weights and convert them to the gpt-fast format (`scripts/prepare.sh`):

```bash
python scripts/download.py --repo_id meta-llama/Llama-2-7b-hf --path $SAVE_PATH && python scripts/convert_hf_checkpoint.py --checkpoint_dir $SAVE_PATH/meta-llama/Llama-2-7b-hf
```

3. Run dense inference (`scripts/base_run.sh`):

```bash
CUDA_VISIBLE_DEVICES=0 python generate.py \
    --compile \
    --checkpoint_path $SAVE_PATH/meta-llama/Llama-2-7b-hf/model.pth \
    --interactive
```

4. Run sparse inference (`scripts/run.sh`):

```bash
CUDA_VISIBLE_DEVICES=0 python generate.py \
    --compile \
    --checkpoint_path $SAVE_PATH/meta-llama/Llama-2-7b-hf/model.pth \
    --hist_path ../models/Llama-2-7B/histograms \
    --sparsity 0.5 \
    --interactive
```

Please treat the current inference implementation as a proof of concept! There are a few limitations:
- Only FP16 is supported, as Triton does not currently support BF16 `atomic_add`.
- Block-wise greedy sparsities are not currently supported.
- Quantized sparse kernels are not currently supported (though we would love a PR!).
- Speculative decoding is untested.
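For intuition on where the sparse-inference speedup comes from, here is a minimal NumPy sketch (illustrative only; the release uses Triton kernels, per the limitations above): once low-magnitude activations are zeroed, the corresponding weight columns never need to be read from memory during the matvec:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 4096)).astype(np.float32)  # (out, in) weight matrix
x = rng.standard_normal(4096).astype(np.float32)          # hidden state

tau = np.quantile(np.abs(x), 0.5)  # threshold for 50% sparsity
mask = np.abs(x) >= tau            # entries to keep

# The dense matvec reads every column of W; the sparse one reads only
# the columns whose activation survived thresholding (~half of them).
y_dense = W @ (x * mask)
y_sparse = W[:, mask] @ x[mask]    # same result, ~half the weight reads
```

In memory-bound single-batch decoding, skipping those weight reads is what translates sparsity into wall-clock speedup.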
## Accuracy Usage

1. Navigate to TEAL:

```bash
cd TEAL
```

2. Construct histograms for threshold calibration (`scripts/grab_acts.bash`):

```bash
CUDA_VISIBLE_DEVICES=0 python teal/grab_acts.py \
    --model_name meta-llama/Llama-2-7b-hf \
    --output_path $OUTPUT_PATH
```

3. Run the perplexity test (`scripts/ppl_test.bash`):

```bash
CUDA_VISIBLE_DEVICES=0 python teal/ppl_test.py \
    --model_name meta-llama/Llama-2-7b-hf \
    --teal_path $OUTPUT_PATH \
    --sparsity 0.5
```
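For intuition on the calibration step, a hedged NumPy sketch (the `threshold_from_histogram` helper is hypothetical, not this repo's API): given a histogram of activation magnitudes like the ones `grab_acts.py` produces, a threshold for a target sparsity can be read off the empirical CDF:

```python
import numpy as np

def threshold_from_histogram(counts, bin_edges, sparsity):
    """Return the magnitude threshold whose CDF first reaches `sparsity`."""
    cdf = np.cumsum(counts) / counts.sum()
    idx = np.searchsorted(cdf, sparsity)  # first bin crossing the target
    return bin_edges[idx + 1]             # right edge of that bin

# Mock calibration data: magnitudes of 100k activations.
acts = np.abs(np.random.randn(100_000))
counts, edges = np.histogram(acts, bins=1000)

tau = threshold_from_histogram(counts, edges, 0.5)
# tau is close to the exact 50% quantile of the activation magnitudes.
```

Calibrating from stored histograms avoids recomputing quantiles over raw activations at inference time.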

__init__.py

Whitespace-only changes.

figures/clickbait.png

80.1 KB

gpt-fast/CODE_OF_CONDUCT.md

Lines changed: 76 additions & 0 deletions

# Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
  advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
  address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at <[email protected]>. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq

gpt-fast/CONTRIBUTING.md

Lines changed: 32 additions & 0 deletions

# Contributing to gpt-fast

We want to make contributing to this project as easy and transparent as
possible.

## Pull Requests

We actively welcome your pull requests.

1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").

## Contributor License Agreement ("CLA")

In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Meta's open source projects.

Complete your CLA here: <https://code.facebook.com/cla>

## Issues

We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.

Meta has a [bounty program](https://www.facebook.com/whitehat/) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.

## License

By contributing to `gpt-fast`, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.

0 commit comments