This repository was archived by the owner on Jun 3, 2025. It is now read-only.

Commit 7676b27

Authored by natuan, markurtz, and jeanniefinks

Tutorial on sparsifying BERT on SQuAD (#286)
* HuggingFace Transformers integ: folder structure, recipes
* Remove scripts already in transformers repo
* Rename recipes
* Tutorial sparsifying Bert on Squad
* Remove tutorial from this diff
* Tutorial on sparsifying Bert on Squad
* Revised recipes, tutorial
* Add W&B graph, fix table
* Fix reference to images
* Update integrations/huggingface-transformers/tutorials/sparsifying_bert_using_recipes.md

Co-authored-by: Jeannie Finks <[email protected]>
Co-authored-by: Mark Kurtz <[email protected]>
Co-authored-by: Jeannie Finks <[email protected]>
1 parent 985304e commit 7676b27

10 files changed: +228 -72 lines changed

integrations/huggingface-transformers/recipes/bert-base-12layers_prune80.md

Lines changed: 15 additions & 15 deletions

````diff
@@ -15,18 +15,18 @@ limitations under the License.
 -->
 
 ---
-# General variables
+# General Variables
 num_epochs: &num_epochs 30
 
-# pruning hyperparameters
+# Pruning Hyperparameters
 init_sparsity: &init_sparsity 0.00
 final_sparsity: &final_sparsity 0.80
 pruning_start_epoch: &pruning_start_epoch 2
 pruning_end_epoch: &pruning_end_epoch 20
 update_frequency: &pruning_update_frequency 0.01
 
 
-# modifiers:
+# Modifiers
 training_modifiers:
   - !EpochRangeModifier
     end_epoch: 30
@@ -35,12 +35,12 @@ training_modifiers:
 pruning_modifiers:
   - !GMPruningModifier
     params:
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.query.weight
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.key.weight
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.value.weight
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.output.dense.weight
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).intermediate.dense.weight
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).output.dense.weight
+      - re:bert.encoder.layer.*.attention.self.query.weight
+      - re:bert.encoder.layer.*.attention.self.key.weight
+      - re:bert.encoder.layer.*.attention.self.value.weight
+      - re:bert.encoder.layer.*.attention.output.dense.weight
+      - re:bert.encoder.layer.*.intermediate.dense.weight
+      - re:bert.encoder.layer.*.output.dense.weight
     start_epoch: *pruning_start_epoch
     end_epoch: *pruning_end_epoch
     init_sparsity: *init_sparsity
@@ -52,21 +52,21 @@ pruning_modifiers:
     log_types: __ALL__
 ---
 
-# Bert model with pruned encoder layers
+# BERT Model with Pruned Encoder Layers
 
-This recipe defines a pruning strategy to sparsify all encoder layers of a Bert model at 80% sparsity. It was used together with knowledge distillation to create sparse model that achives 100% recovery from its baseline accuracy on the Squad dataset.
-Training was done using 1 GPU at half precision using a training batch size of 16 with the
+This recipe defines a pruning strategy to sparsify all encoder layers of a BERT model at 80% sparsity. It was used together with knowledge distillation to create a sparse model that fully recovers the F1 metric (88.596) of the baseline model on the SQuAD dataset. (We use the checkpoint at the end of the first 2 epochs, right before pruning takes effect, as the baseline model for comparison.)
+Training was done using one V100 GPU at half precision with a training batch size of 16 using the
 [SparseML integration with huggingface/transformers](https://github.com/neuralmagic/sparseml/tree/main/integrations/huggingface-transformers).
 
 ## Weights and Biases
 
-- [Sparse Bert on Squad](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/18qdx7b3?workspace=user-neuralmagic)
+- [Sparse BERT on SQuAD](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/18qdx7b3?workspace=user-neuralmagic)
 
 ## Training
 
 To set up the training environment, follow the instructions on the [integration README](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/README.md).
 Using the `run_qa.py` script from the question-answering examples, the following command can be used to launch this recipe with distillation.
-Adjust the training command below with your setup for GPU device, checkpoint saving frequency and logging options.
+Adjust the training command below with your setup for GPU device, checkpoint saving frequency, and logging options.
 
 *training command*
 ```
@@ -91,7 +91,7 @@ python transformers/examples/pytorch/question-answering/run_qa.py \
   --distill_temperature 2.0 \
   --save_steps 1000 \
   --save_total_limit 2 \
-  --recipe ../recipes/uni_80sparse_freq0.01_18prune10fine.md \
+  --recipe ../recipes/bert-base-12layers_prune80.md \
   --onnx_export_path MODELS_DIR/sparse80/onnx \
   --report_to wandb
 ```
````
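Across these recipes, pruning ramps from `init_sparsity` to `final_sparsity` between epochs 2 and 20, with the mask updated every 0.01 epochs. The sparsity target over time can be sketched as below; the cubic interpolation is an assumption based on `GMPruningModifier`'s default behavior, so treat the exact curve as illustrative rather than authoritative.

```python
def gm_sparsity(epoch, init_sparsity=0.0, final_sparsity=0.8,
                start_epoch=2, end_epoch=20):
    """Sparsity target at a given epoch for gradual magnitude pruning.

    Assumes the cubic ramp that SparseML's GMPruningModifier applies by
    default; the precise interpolation may differ across versions.
    """
    if epoch <= start_epoch:
        return init_sparsity
    if epoch >= end_epoch:
        return final_sparsity
    # Fraction of the pruning window completed, in [0, 1].
    t = (epoch - start_epoch) / (end_epoch - start_epoch)
    # Cubic ramp: prune aggressively early, taper off near final_sparsity.
    return final_sparsity + (init_sparsity - final_sparsity) * (1 - t) ** 3

# The recipe re-evaluates the mask every 0.01 epochs between epochs 2 and 20.
print(round(gm_sparsity(2), 2))   # 0.0 (pruning starts)
print(round(gm_sparsity(11), 3))  # 0.7 (midpoint, already most of the way)
print(round(gm_sparsity(20), 2))  # 0.8 (final sparsity reached)
```

This kind of schedule is why most of the accuracy-sensitive fine-tuning happens after epoch 20, once the mask has stabilized.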

integrations/huggingface-transformers/recipes/bert-base-12layers_prune90.md

Lines changed: 15 additions & 15 deletions

````diff
@@ -15,17 +15,17 @@ limitations under the License.
 -->
 
 ---
-# General variables
+# General Variables
 num_epochs: &num_epochs 30
 
-# pruning hyperparameters
+# Pruning Hyperparameters
 init_sparsity: &init_sparsity 0.00
 final_sparsity: &final_sparsity 0.90
 pruning_start_epoch: &pruning_start_epoch 2
 pruning_end_epoch: &pruning_end_epoch 20
 update_frequency: &pruning_update_frequency 0.01
 
-# modifiers:
+# Modifiers
 training_modifiers:
   - !EpochRangeModifier
     end_epoch: 30
@@ -34,12 +34,12 @@ training_modifiers:
 pruning_modifiers:
   - !GMPruningModifier
     params:
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.query.weight
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.key.weight
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.value.weight
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.output.dense.weight
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).intermediate.dense.weight
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).output.dense.weight
+      - re:bert.encoder.layer.*.attention.self.query.weight
+      - re:bert.encoder.layer.*.attention.self.key.weight
+      - re:bert.encoder.layer.*.attention.self.value.weight
+      - re:bert.encoder.layer.*.attention.output.dense.weight
+      - re:bert.encoder.layer.*.intermediate.dense.weight
+      - re:bert.encoder.layer.*.output.dense.weight
     start_epoch: *pruning_start_epoch
     end_epoch: *pruning_end_epoch
     init_sparsity: *init_sparsity
@@ -50,21 +50,21 @@ pruning_modifiers:
     mask_type: unstructured
     log_types: __ALL__
 ---
-# Bert model with pruned encoder layers
+# BERT Model with Pruned Encoder Layers
 
-This recipe defines a pruning strategy to sparsify all encoder layers of a Bert model at 90% sparsity. It was used together with knowledge distillation to create sparse model that achives 98.4% recovery from its baseline accuracy on the Squad dataset.
-Training was done using 1 GPU at half precision using a training batch size of 16 with the
+This recipe defines a pruning strategy to sparsify all encoder layers of a BERT model at 90% sparsity. It was used together with knowledge distillation to create a sparse model that achieves 98.4% recovery from the F1 metric (88.596) of the baseline model on the SQuAD dataset. (We use the checkpoint at the end of the first 2 epochs, right before pruning takes effect, as the baseline model for comparison.)
+Training was done using one V100 GPU at half precision with a training batch size of 16 using the
 [SparseML integration with huggingface/transformers](https://github.com/neuralmagic/sparseml/tree/main/integrations/huggingface-transformers).
 
 ## Weights and Biases
 
-- [Sparse Bert on Squad](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/2ht2eqsn?workspace=user-neuralmagic)
+- [Sparse BERT on SQuAD](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/2ht2eqsn?workspace=user-neuralmagic)
 
 ## Training
 
 To set up the training environment, follow the instructions on the [integration README](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/README.md).
 Using the `run_qa.py` script from the question-answering examples, the following command can be used to launch this recipe with distillation.
-Adjust the training command below with your setup for GPU device, checkpoint saving frequency and logging options.
+Adjust the training command below with your setup for GPU device, checkpoint saving frequency, and logging options.
 
 *training command*
 ```
@@ -89,7 +89,7 @@ python transformers/examples/pytorch/question-answering/run_qa.py \
   --distill_temperature 2.0 \
   --save_steps 1000 \
   --save_total_limit 2 \
-  --recipe ../recipes/uni_90sparse_freq0.01_18prune10fine.md \
+  --recipe ../recipes/bert-base-12layers_prune90.md \
   --onnx_export_path MODELS_DIR/sparse90/onnx \
   --report_to wandb
 ```
````
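The `params` change in these diffs swaps per-layer patterns for a catch-all. The `re:` prefix tells SparseML to treat the rest of the string as a Python regular expression matched against parameter names, so the two variants can be compared directly; the layer names below follow the standard HF BERT-base naming and are only an illustration:

```python
import re

# Old recipes targeted a subset of encoder layers; the new ones use ".*"
# to match every layer. "re:" marks the pattern as a Python regex.
old = r"bert.encoder.layer.([0,2,4,6,8]|11).attention.self.query.weight"
new = r"bert.encoder.layer.*.attention.self.query.weight"

names = [f"bert.encoder.layer.{i}.attention.self.query.weight"
         for i in range(12)]

# Which layer indices each pattern selects (index 3 of the dotted name).
print([n.split(".")[3] for n in names if re.match(old, n)])
# ['0', '2', '4', '6', '8', '11']
print([n.split(".")[3] for n in names if re.match(new, n)])
# all twelve layers
```

Note that the unescaped dots match any character, so these patterns are looser than they look, but on real parameter names they select exactly the intended layers.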

integrations/huggingface-transformers/recipes/bert-base-12layers_prune95.md

Lines changed: 14 additions & 14 deletions

````diff
@@ -15,17 +15,17 @@ limitations under the License.
 -->
 
 ---
-# General variables
+# General Variables
 num_epochs: &num_epochs 30
 
-# pruning hyperparameters
+# Pruning Hyperparameters
 init_sparsity: &init_sparsity 0.00
 final_sparsity: &final_sparsity 0.95
 pruning_start_epoch: &pruning_start_epoch 2
 pruning_end_epoch: &pruning_end_epoch 20
 update_frequency: &pruning_update_frequency 0.01
 
-# modifiers:
+# Modifiers
 training_modifiers:
   - !EpochRangeModifier
     end_epoch: 30
@@ -34,12 +34,12 @@ training_modifiers:
 pruning_modifiers:
   - !GMPruningModifier
     params:
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.query.weight
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.key.weight
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.value.weight
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.output.dense.weight
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).intermediate.dense.weight
-      - re:bert.encoder.layer.([0,2,4,6,8]|11).output.dense.weight
+      - re:bert.encoder.layer.*.attention.self.query.weight
+      - re:bert.encoder.layer.*.attention.self.key.weight
+      - re:bert.encoder.layer.*.attention.self.value.weight
+      - re:bert.encoder.layer.*.attention.output.dense.weight
+      - re:bert.encoder.layer.*.intermediate.dense.weight
+      - re:bert.encoder.layer.*.output.dense.weight
     start_epoch: *pruning_start_epoch
     end_epoch: *pruning_end_epoch
     init_sparsity: *init_sparsity
@@ -51,21 +51,21 @@ pruning_modifiers:
     log_types: __ALL__
 ---
 
-# Bert model with pruned encoder layers
+# BERT Model with Pruned Encoder Layers
 
-This recipe defines a pruning strategy to sparsify all encoder layers of a Bert model at 95% sparsity. It was used together with knowledge distillation to create sparse model that achives 94.7% recovery from its baseline accuracy on the Squad dataset.
+This recipe defines a pruning strategy to sparsify all encoder layers of a BERT model at 95% sparsity. It was used together with knowledge distillation to create a sparse model that achieves 94.7% recovery from the F1 metric of the baseline model on the SQuAD dataset. (We use the checkpoint at the end of the first 2 epochs, right before pruning takes effect, as the baseline model for comparison.)
 Training was done using 1 GPU at half precision using a training batch size of 16 with the
 [SparseML integration with huggingface/transformers](https://github.com/neuralmagic/sparseml/tree/main/integrations/huggingface-transformers).
 
 ## Weights and Biases
 
-- [Sparse Bert on Squad](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/3gv0arxd?workspace=user-neuralmagic)
+- [Sparse BERT on SQuAD](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/3gv0arxd?workspace=user-neuralmagic)
 
 ## Training
 
 To set up the training environment, follow the instructions on the [integration README](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/README.md).
 Using the `run_qa.py` script from the question-answering examples, the following command can be used to launch this recipe with distillation.
-Adjust the training command below with your setup for GPU device, checkpoint saving frequency and logging options.
+Adjust the training command below with your setup for GPU device, checkpoint saving frequency, and logging options.
 
 *training command*
 ```
@@ -90,7 +90,7 @@ python transformers/examples/pytorch/question-answering/run_qa.py \
   --distill_temperature 2.0 \
   --save_steps 1000 \
   --save_total_limit 2 \
-  --recipe ../recipes/uni_95sparse_freq0.01_18prune10fine.md \
+  --recipe ../recipes/bert-base-12layers_prune95.md \
   --onnx_export_path MODELS_DIR/sparse95/onnx \
   --report_to wandb
 ```
````
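The recovery percentages quoted in these descriptions read most naturally as the sparse model's metric divided by the baseline's. That interpretation, and the example numbers in the comment, are assumptions for illustration rather than a formula taken from the repo:

```python
def recovery_pct(sparse_metric, baseline_metric):
    """Recovery of a sparse model relative to its dense baseline, in percent.

    Interpretation of the recipe text, not a formula from the repo: e.g. a
    90%-sparse 12-layer model scoring about 87.2 F1 against the 88.596
    baseline would sit near the quoted 98.4% recovery.
    """
    return 100.0 * sparse_metric / baseline_metric

print(round(recovery_pct(88.596, 88.596), 1))  # 100.0
```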

integrations/huggingface-transformers/recipes/bert-base-6layers_prune80.md

Lines changed: 9 additions & 9 deletions

````diff
@@ -15,17 +15,17 @@ limitations under the License.
 -->
 
 ---
-# General variables
+# General Variables
 num_epochs: &num_epochs 30
 
-# pruning hyperparameters
+# Pruning Hyperparameters
 init_sparsity: &init_sparsity 0.00
 final_sparsity: &final_sparsity 0.80
 pruning_start_epoch: &pruning_start_epoch 2
 pruning_end_epoch: &pruning_end_epoch 20
 update_frequency: &pruning_update_frequency 0.01
 
-# modifiers:
+# Modifiers
 training_modifiers:
   - !EpochRangeModifier
     end_epoch: 30
@@ -60,21 +60,21 @@ pruning_modifiers:
     - bert.encoder.layer.10
 ---
 
-# Bert model with dropped and pruned encoder layers
+# BERT Model with Dropped and Pruned Encoder Layers
 
-This recipe defines a dropping and pruning strategy to sparsify 6 encoder layers of a Bert model at 80% sparsity. It was used together with knowledge distillation to create sparse model that achives 97% recovery from its (teacher) baseline accuracy on the Squad dataset.
-Training was done using 1 GPU at half precision using a training batch size of 16 with the
+This recipe defines a dropping and pruning strategy to sparsify six encoder layers of a BERT model at 80% sparsity. It was used together with knowledge distillation to create a sparse model that exceeds the F1 metric (83.632) of the baseline model by 0.02% on the SQuAD dataset. (We use the checkpoint at the end of the first 2 epochs, right before pruning takes effect, as the baseline model for comparison.)
+Training was done using one V100 GPU at half precision with a training batch size of 16 using the
 [SparseML integration with huggingface/transformers](https://github.com/neuralmagic/sparseml/tree/main/integrations/huggingface-transformers).
 
 ## Weights and Biases
 
-- [Sparse Bert on Squad](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/ebab4np4?workspace=user-neuralmagic)
+- [Sparse BERT on SQuAD](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/ebab4np4?workspace=user-neuralmagic)
 
 ## Training
 
 To set up the training environment, follow the instructions on the [integration README](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/README.md).
 Using the `run_qa.py` script from the question-answering examples, the following command can be used to launch this recipe with distillation.
-Adjust the training command below with your setup for GPU device, checkpoint saving frequency and logging options.
+Adjust the training command below with your setup for GPU device, checkpoint saving frequency, and logging options.
 
 *training command*
 ```
@@ -99,7 +99,7 @@ python transformers/examples/pytorch/question-answering/run_qa.py \
   --distill_temperature 2.0 \
   --save_steps 1000 \
   --save_total_limit 2 \
-  --recipe ../recipes/uni_80sparse_freq0.01_18prune10fine_6layers.md \
+  --recipe ../recipes/bert-base-6layers_prune80.md \
   --onnx_export_path MODELS_DIR/sparse80_6layers/onnx \
   --report_to wandb
 ```
````
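Every training command here passes `--distill_temperature 2.0`: the student is trained against temperature-softened teacher outputs in addition to the hard SQuAD span labels. A minimal numpy sketch of that kind of objective follows; the `hardness` weighting and the exact loss composition are assumptions for illustration, not lifted from the integration's `run_qa.py`:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, label,
                 temperature=2.0, hardness=0.5):
    """Blend of soft (teacher) and hard (label) objectives.

    The hardness weighting here is assumed; the real integration has its
    own distillation flags and may combine the terms differently.
    """
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    # KL(teacher || student) on softened distributions, scaled by T^2 so
    # gradient magnitudes stay comparable across temperatures.
    soft = float(np.sum(t * (np.log(t) - np.log(s)))) * temperature ** 2
    # Ordinary cross-entropy against the hard span label.
    hard = -float(np.log(softmax(student_logits)[label]))
    return hardness * soft + (1.0 - hardness) * hard

logits = np.array([3.0, 1.0, -2.0])
# A student that matches its teacher exactly pays only the hard-label cost.
print(distill_loss(logits, logits, 0, hardness=1.0))  # 0.0
```

The temperature of 2.0 flattens both distributions, so the student also learns from the teacher's relative preferences among wrong answers, which matters most for the layer-dropped models that start far from the teacher.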

integrations/huggingface-transformers/recipes/bert-base-6layers_prune90.md

Lines changed: 9 additions & 9 deletions

````diff
@@ -15,17 +15,17 @@ limitations under the License.
 -->
 
 ---
-# General Epoch/LR variables
+# General Variables
 num_epochs: &num_epochs 30
 
-# pruning hyperparameters
+# Pruning Hyperparameters
 init_sparsity: &init_sparsity 0.00
 final_sparsity: &final_sparsity 0.90
 pruning_start_epoch: &pruning_start_epoch 2
 pruning_end_epoch: &pruning_end_epoch 20
 update_frequency: &pruning_update_frequency 0.01
 
-# modifiers:
+# Modifiers
 training_modifiers:
   - !EpochRangeModifier
     end_epoch: 30
@@ -60,21 +60,21 @@ pruning_modifiers:
     - bert.encoder.layer.10
 ---
 
-# Bert model with dropped and pruned encoder layers
+# BERT Model with Dropped and Pruned Encoder Layers
 
-This recipe defines a dropping and pruning strategy to sparsify 6 encoder layers of a Bert model at 90% sparsity. It was used together with knowledge distillation to create sparse model that achives 94.5% recovery from its (teacher) baseline accuracy on the Squad dataset.
-Training was done using 1 GPU at half precision using a training batch size of 16 with the
+This recipe defines a dropping and pruning strategy to sparsify six encoder layers of a BERT model at 90% sparsity. It was used together with knowledge distillation to create a sparse model that achieves 99.9% recovery from the F1 metric (83.632) of the baseline model on the SQuAD dataset. (We use the checkpoint at the end of the first 2 epochs, right before pruning takes effect, as the baseline model for comparison.)
+Training was done using one V100 GPU at half precision with a training batch size of 16 using the
 [SparseML integration with huggingface/transformers](https://github.com/neuralmagic/sparseml/tree/main/integrations/huggingface-transformers).
 
 ## Weights and Biases
 
-- [Sparse Bert on Squad](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/3qvxoroz?workspace=user-neuralmagic)
+- [Sparse BERT on SQuAD](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/3qvxoroz?workspace=user-neuralmagic)
 
 ## Training
 
 To set up the training environment, follow the instructions on the [integration README](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/README.md).
 Using the `run_qa.py` script from the question-answering examples, the following command can be used to launch this recipe with distillation.
-Adjust the training command below with your setup for GPU device, checkpoint saving frequency and logging options.
+Adjust the training command below with your setup for GPU device, checkpoint saving frequency, and logging options.
 
 *training command*
 ```
@@ -99,7 +99,7 @@ python transformers/examples/pytorch/question-answering/run_qa.py \
   --distill_temperature 2.0 \
   --save_steps 1000 \
   --save_total_limit 2 \
-  --recipe ../recipes/uni_90sparse_freq0.01_18prune10fine_6layers.md \
+  --recipe ../recipes/bert-base-6layers_prune90.md \
   --onnx_export_path MODELS_DIR/sparse90_6layers/onnx \
   --report_to wandb
 ```
````

integrations/huggingface-transformers/recipes/bert-base-6layers_prune95.md

Lines changed: 9 additions & 9 deletions

````diff
@@ -15,17 +15,17 @@ limitations under the License.
 -->
 
 ---
-# General Epoch/LR variables
+# General Variables
 num_epochs: &num_epochs 30
 
-# pruning hyperparameters
+# Pruning Hyperparameters
 init_sparsity: &init_sparsity 0.00
 final_sparsity: &final_sparsity 0.95
 pruning_start_epoch: &pruning_start_epoch 2
 pruning_end_epoch: &pruning_end_epoch 20
 update_frequency: &pruning_update_frequency 0.01
 
-# modifiers:
+# Modifiers
 training_modifiers:
   - !EpochRangeModifier
     end_epoch: 30
@@ -60,21 +60,21 @@ pruning_modifiers:
     - bert.encoder.layer.10
 ---
 
-# Bert model with dropped and pruned encoder layers
+# BERT Model with Dropped and Pruned Encoder Layers
 
-This recipe defines a dropping and pruning strategy to sparsify 6 encoder layers of a Bert model at 95% sparsity. It was used together with knowledge distillation to create sparse model that achives 90% recovery from its (teacher) baseline accuracy on the Squad dataset.
-Training was done using 1 GPU at half precision using a training batch size of 16 with the
+This recipe defines a dropping and pruning strategy to sparsify six encoder layers of a BERT model at 95% sparsity. It was used together with knowledge distillation to create a sparse model that achieves 96.2% recovery from the F1 metric of the baseline model on the SQuAD dataset. (We use the checkpoint at the end of the first 2 epochs, right before pruning takes effect, as the baseline model for comparison.)
+Training was done using one V100 GPU at half precision with a training batch size of 16 using the
 [SparseML integration with huggingface/transformers](https://github.com/neuralmagic/sparseml/tree/main/integrations/huggingface-transformers).
 
 ## Weights and Biases
 
-- [Sparse Bert on Squad](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/3plynclw?workspace=user-neuralmagic)
+- [Sparse BERT on SQuAD](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/3plynclw?workspace=user-neuralmagic)
 
 ## Training
 
 To set up the training environment, follow the instructions on the [integration README](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/README.md).
 Using the `run_qa.py` script from the question-answering examples, the following command can be used to launch this recipe with distillation.
-Adjust the training command below with your setup for GPU device, checkpoint saving frequency and logging options.
+Adjust the training command below with your setup for GPU device, checkpoint saving frequency, and logging options.
 
 *training command*
 ```
@@ -99,7 +99,7 @@ python transformers/examples/pytorch/question-answering/run_qa.py \
   --distill_temperature 2.0 \
   --save_steps 1000 \
   --save_total_limit 2 \
-  --recipe ../recipes/uni_95sparse_freq0.01_18prune10fine_6layers.md \
+  --recipe ../recipes/bert-base-6layers_prune95.md \
   --onnx_export_path MODELS_DIR/sparse95_6layers/onnx \
   --report_to wandb
 ```
````
