Commit eca60e7

763 files changed: +14098, -0 lines changed

LICENSE

Lines changed: 21 additions & 0 deletions

MIT License

Copyright (c) 2024 FasterDecoding

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 135 additions & 0 deletions

# Training-Free Activation Sparsity in Large Language Models

[[Paper](https://www.arxiv.org/abs/2408.14690)][[Blog](XXX)]

TEAL induces up to 40-50% model-wide activation sparsity in modern LLMs with minimal degradation, resulting in up to a 1.53-1.8x speedup in single-batch decoding.

<div align="center">
<img src="figures/clickbait.png" width="500" height="auto"/>
</div>

The current release supports:
- FP16 inference for Llama-2/3 models using uniform sparsities
- Accuracy evaluation for Llama-2/3 and Mistral models using uniform sparsities

Stay tuned for block-wise greedy sparsities!

## News

- [08/2024] 🔥 arXiv release!

## Abstract

Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL (**T**raining-Fre**e** **A**ctivation Sparsity in **L**LMs), a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40-50% model-wide sparsity with minimal performance degradation across the Llama-2, Llama-3, and Mistral families, at sizes ranging from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to 1.53× and 1.8× at 40% and 50% model-wide sparsity. TEAL is compatible with weight quantization, enabling further efficiency gains.
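As a rough illustration of the core idea (a NumPy sketch, not TEAL's actual kernel; the `sparsify` helper is hypothetical), magnitude-based activation sparsity zeroes out the lowest-magnitude fraction of entries in each hidden state:

```python
import numpy as np

def sparsify(x: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude `sparsity` fraction of entries of x."""
    # Threshold chosen so the target fraction of entries falls below it.
    tau = np.quantile(np.abs(x), sparsity)
    return np.where(np.abs(x) >= tau, x, 0.0)

x = np.random.randn(4096).astype(np.float32)  # a mock hidden state
x_sparse = sparsify(x, 0.5)
print((x_sparse == 0).mean())  # ~0.5 of the entries are zeroed
```

TEAL applies this per-tensor thresholding throughout the model, with thresholds calibrated offline from activation histograms rather than recomputed from a quantile at every step.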
## Contents

- [Install](#install)
- [Demo](#demo)
- [Inference Usage](#inference-usage)
- [Accuracy Usage](#accuracy-usage)
- [Citation](#citation)
## Install

1. Clone the repo and navigate to TEAL:

```bash
git clone https://github.com/FasterDecoding/TEAL
cd TEAL
```

2. Set up the environment:

```bash
conda create -yn teal python=3.11
conda activate teal

pip install -e .
```

3. (Optional) If you want to calibrate thresholds for your own models, or run accuracy evals, install the following dependency:

```bash
pip install -e ".[eval]"
```
## Inference Usage

For easy usage, we provide calibrated thresholds for Llama-2/3 and Mistral models in the `models/` folder.

1. Navigate to gpt-fast:

```bash
cd gpt-fast
```

2. Download the model weights and convert them to the gpt-fast format (`scripts/prepare.sh`):

```bash
python scripts/download.py --repo_id meta-llama/Llama-2-7b-hf --path $SAVE_PATH && python scripts/convert_hf_checkpoint.py --checkpoint_dir $SAVE_PATH/meta-llama/Llama-2-7b-hf
```

3. Run dense inference (`scripts/base_run.sh`):

```bash
CUDA_VISIBLE_DEVICES=0 python generate.py \
    --compile \
    --checkpoint_path $SAVE_PATH/meta-llama/Llama-2-7b-hf/model.pth \
    --interactive
```

4. Run sparse inference (`scripts/run.sh`):

```bash
CUDA_VISIBLE_DEVICES=0 python generate.py \
    --compile \
    --checkpoint_path $SAVE_PATH/meta-llama/Llama-2-7b-hf/model.pth \
    --hist_path ../models/Llama-2-7B/histograms \
    --sparsity 0.5 \
    --interactive
```

Please treat the current inference implementation as a proof of concept! There are a few limitations:
- Only FP16 is supported, as Triton does not currently support BF16 `atomic_add`.
- Block-wise greedy sparsities are not currently supported.
- Quantized sparse kernels are not currently supported (though we would love a PR!).
- Speculative decoding is untested.
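For intuition on where the sparse-inference speedup comes from, here is a minimal NumPy sketch (illustrative only; the release uses Triton kernels, per the limitations above): once low-magnitude activations are zeroed, the corresponding weight columns never need to be read from memory during the matvec:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 4096)).astype(np.float32)  # (out, in) weight matrix
x = rng.standard_normal(4096).astype(np.float32)          # hidden state

tau = np.quantile(np.abs(x), 0.5)  # threshold for 50% sparsity
mask = np.abs(x) >= tau            # entries to keep

# The dense matvec reads every column of W; the sparse one reads only
# the columns whose activation survived thresholding (~half of them).
y_dense = W @ (x * mask)
y_sparse = W[:, mask] @ x[mask]    # same result, ~half the weight reads
```

In memory-bound single-batch decoding, skipping those weight reads is what translates sparsity into wall-clock speedup.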
## Accuracy Usage

1. Navigate to TEAL:

```bash
cd TEAL
```

2. Construct histograms for threshold calibration (`scripts/grab_acts.bash`):

```bash
CUDA_VISIBLE_DEVICES=0 python teal/grab_acts.py \
    --model_name meta-llama/Llama-2-7b-hf \
    --output_path $OUTPUT_PATH
```

3. Run the perplexity test (`scripts/ppl_test.bash`):

```bash
CUDA_VISIBLE_DEVICES=0 python teal/ppl_test.py \
    --model_name meta-llama/Llama-2-7b-hf \
    --teal_path $OUTPUT_PATH \
    --sparsity 0.5
```
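For intuition on the calibration step, a hedged NumPy sketch (the `threshold_from_histogram` helper is hypothetical, not this repo's API): given a histogram of activation magnitudes like the ones `grab_acts.py` produces, a threshold for a target sparsity can be read off the empirical CDF:

```python
import numpy as np

def threshold_from_histogram(counts, bin_edges, sparsity):
    """Return the magnitude threshold whose CDF first reaches `sparsity`."""
    cdf = np.cumsum(counts) / counts.sum()
    idx = np.searchsorted(cdf, sparsity)  # first bin crossing the target
    return bin_edges[idx + 1]             # right edge of that bin

# Mock calibration data: magnitudes of 100k activations.
acts = np.abs(np.random.randn(100_000))
counts, edges = np.histogram(acts, bins=1000)

tau = threshold_from_histogram(counts, edges, 0.5)
# tau is close to the exact 50% quantile of the activation magnitudes.
```

Calibrating from stored histograms avoids recomputing quantiles over raw activations at inference time.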

__init__.py

Whitespace-only changes.

figures/clickbait.png

80.1 KB

gpt-fast/CODE_OF_CONDUCT.md

Lines changed: 76 additions & 0 deletions

# Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
  advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
  address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at <[email protected]>. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq

gpt-fast/CONTRIBUTING.md

Lines changed: 32 additions & 0 deletions

# Contributing to gpt-fast

We want to make contributing to this project as easy and transparent as
possible.

## Pull Requests

We actively welcome your pull requests.

1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").

## Contributor License Agreement ("CLA")

In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Meta's open source projects.

Complete your CLA here: <https://code.facebook.com/cla>

## Issues

We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.

Meta has a [bounty program](https://www.facebook.com/whitehat/) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.

## License

By contributing to `gpt-fast`, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.

0 commit comments