Commit c5173be

Updated readme.
1 parent 4df8d49 commit c5173be

1 file changed (+13, -6 lines)


README.md

Lines changed: 13 additions & 6 deletions
@@ -1,13 +1,17 @@
-# research-BertBigCode
-Exploration of BERT-like models trained on The Stack
+# BERT pre-training on The Stack
+Exploration of BERT-like models trained on The Stack.

+- Code used to train [StarEncoder](https://huggingface.co/bigcode/starencoder).
+- StarEncoder was fine-tuned for PII detection to pre-process the data used to train [StarCoder](https://arxiv.org/abs/2305.06161).

-**Work in progress. Currently training on the subsample of The Stack.**
+- This repo also contains functionality to train encoders with contrastive objectives.

-[Project information.](https://docs.google.com/document/d/1gjf7Y2Ek64xSyl8HE3GoK1kxDgsV8kjy-9pyIBkR-RQ/edit?usp=sharing)
+- [More details.](https://docs.google.com/document/d/1gjf7Y2Ek64xSyl8HE3GoK1kxDgsV8kjy-9pyIBkR-RQ/edit?usp=sharing)


-## To run locally:
+## To launch pre-training:
+
+After installing requirements, training can be launched via the example launcher script:

 ```
 ./launcher.sh
@@ -18,4 +22,7 @@ Exploration of BERT-like models trained on The Stack
 - ```--train_data_name``` can be used to set the training dataset.

 - Hyperparameters can be changed in ```exp_configs.py```.
-- The tokenizer to be used is treated as a hyperparameter and must also be set in ```exp_configs.py```
+- The tokenizer to be used is treated as a hyperparameter and must also be set in ```exp_configs.py```.
+- alpha is used to weight the BERT losses (NSP + MLM) against the contrastive objective.
+- Setting alpha to 1 corresponds to the standard BERT objective.
+- Token masking probabilities are set as separate hyperparameters, one for MLM and another for the contrastive loss.
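The alpha bullets in the updated README describe a weighting between the BERT terms (NSP + MLM) and the contrastive term, with alpha = 1 recovering plain BERT. Below is a minimal sketch of one plausible reading of that weighting; the function name, signature, and the exact convex-combination formula are illustrative assumptions, not code taken from this repository.

```python
import torch


def combined_loss(mlm_loss: torch.Tensor,
                  nsp_loss: torch.Tensor,
                  contrastive_loss: torch.Tensor,
                  alpha: float) -> torch.Tensor:
    """Sketch of the alpha-weighted objective described in the README.

    Assumption: alpha blends the BERT losses with the contrastive loss as a
    convex combination, so alpha = 1.0 reduces to the standard BERT objective
    (NSP + MLM) and smaller alpha gives the contrastive term more weight.
    """
    bert_loss = mlm_loss + nsp_loss
    return alpha * bert_loss + (1.0 - alpha) * contrastive_loss
```

With alpha = 1.0 the contrastive term vanishes, which matches the note that alpha = 1 corresponds to the standard BERT objective.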

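For the remaining hyperparameters named in the README (the tokenizer and the two masking probabilities), a hypothetical ```exp_configs.py```-style entry is sketched below; every key name and value is an assumption chosen to mirror the bullets above, not the repository's actual configuration schema.

```python
# Hypothetical config entry; all keys and values are illustrative assumptions.
EXAMPLE_CONFIG = {
    "tokenizer": "<tokenizer-name-or-path>",  # the tokenizer is itself a hyperparameter
    "alpha": 1.0,                             # 1.0 = standard BERT objective (NSP + MLM only)
    "mlm_masking_prob": 0.15,                 # masking probability used by the MLM loss
    "contrastive_masking_prob": 0.5,          # separate masking probability for the contrastive loss
}
```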