# BERT pre-training on The Stack
Exploration of BERT-like models trained on The Stack.

- Code used to train [StarEncoder](https://huggingface.co/bigcode/starencoder).
- StarEncoder was fine-tuned for PII detection to pre-process the data used to train [StarCoder](https://arxiv.org/abs/2305.06161).

- This repo also contains functionality to train encoders with contrastive objectives.

- [More details.](https://docs.google.com/document/d/1gjf7Y2Ek64xSyl8HE3GoK1kxDgsV8kjy-9pyIBkR-RQ/edit?usp=sharing)

## To launch pre-training:

After installing requirements, training can be launched via the example launcher script:

```
./launcher.sh
```
- ``` --train_data_name ``` can be used to set the training dataset (see the example below).
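For example, a hypothetical invocation (the dataset name is illustrative, and this assumes ``` launcher.sh ``` forwards its flags to the training script):

```
./launcher.sh --train_data_name bigcode/the-stack-smol
```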
1923
- Hyperparameters can be changed in ``` exp_configs.py ```.
- The tokenizer is treated as a hyperparameter and must also be set in ``` exp_configs.py ```.
- ``` alpha ``` weighs the BERT losses (NSP + MLM) against the contrastive objective.
- Setting ``` alpha ``` to 1 recovers the standard BERT objective.
- Token masking probabilities are set as separate hyperparameters, one for MLM and one for the contrastive loss (see the sketch below).
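As a minimal sketch only (the field names, values, and loss combination below are assumptions for illustration, not the repo's actual ``` exp_configs.py ``` schema):

```
# Hypothetical config sketch; names and values are illustrative,
# not the actual exp_configs.py schema.
config = {
    "tokenizer": "bigcode/starencoder",  # the tokenizer itself is a hyperparameter
    "alpha": 1.0,                        # 1.0 -> standard BERT objective (MLM + NSP)
    "mlm_masking_prob": 0.15,            # masking probability for the MLM loss
    "contrastive_masking_prob": 0.5,     # separate masking probability for the contrastive loss
}

def total_loss(mlm_loss, nsp_loss, contrastive_loss, alpha):
    # Assumed weighting: alpha scales the BERT losses and (1 - alpha) the
    # contrastive objective, so alpha = 1 recovers plain BERT (MLM + NSP).
    return alpha * (mlm_loss + nsp_loss) + (1 - alpha) * contrastive_loss
```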