Thanks for your work!
I'm now reproducing your paper, but I'm running into some difficulties. When training the language modeling task with the default parameters from the README, I encountered unstable training. The details are as follows:
0/200 [train] loss=5.945 [val] loss=5.917, pp=371.43, acc=0.185491 [time per itr] 1403.69ms [lr] 0.00003
0/400 [train] loss=5.655 [val] loss=5.477, pp=239.12, acc=0.196609 [time per itr] 1223.88ms [lr] 0.00005
0/600 [train] loss=5.285 [val] loss=5.259, pp=192.26, acc=0.200577 [time per itr] 1213.27ms [lr] 0.00010
0/800 [train] loss=5.326 [val] loss=5.250, pp=190.52, acc=0.197866 [time per itr] 1206.03ms [lr] 0.00015
0/1000 [train] loss=4.970 [val] loss=5.168, pp=175.63, acc=0.202474 [time per itr] 1197.92ms [lr] 0.00022
0/1200 [train] loss=5.088 [val] loss=5.093, pp=162.88, acc=0.206467 [time per itr] 1198.33ms [lr] 0.00031
0/1400 [train] loss=nan [val] loss=nan, pp=nan, acc=0.000956 [time per itr] 1183.48ms [lr] 0.00041
0/1600 [train] loss=nan [val] loss=nan, pp=nan, acc=0.001068 [time per itr] 1086.21ms [lr] 0.00052
0/1800 [train] loss=nan [val] loss=nan, pp=nan, acc=0.000971 [time per itr] 1090.91ms [lr] 0.00063
0/2000 [train] loss=nan [val] loss=nan, pp=nan, acc=0.001470 [time per itr] 1092.99ms [lr] 0.00075
0/2200 [train] loss=nan [val] loss=nan, pp=nan, acc=0.001216 [time per itr] 1090.09ms [lr] 0.00088
0/2400 [train] loss=nan [val] loss=nan, pp=nan, acc=0.001114 [time per itr] 1089.34ms [lr] 0.00101
0/2600 [train] loss=nan [val] loss=nan, pp=nan, acc=0.001083 [time per itr] 1090.32ms [lr] 0.00114
0/2800 [train] loss=nan [val] loss=nan, pp=nan, acc=0.001165 [time per itr] 1089.56ms [lr] 0.00127
0/3000 [train] loss=nan [val] loss=nan, pp=nan, acc=0.001149 [time per itr] 1085.54ms [lr] 0.00139
0/3200 [train] loss=nan [val] loss=nan, pp=nan, acc=0.001246 [time per itr] 1087.70ms [lr] 0.00151
You can see that the learning rate keeps climbing and eventually settles at 0.0200. Because of the company firewall, I couldn't access the Internet while running the code, so instead of using the tiktoken library as the tokenizer, I used the GPT2Tokenizer provided by the transformers library (I downloaded the vocab.json and merges.txt files to my local machine and uploaded them to the server). The modified code (in /root/xxx/landmark-attention/lm_benchmark/data/pg19/prepare.py) is as follows:
```python
import os

from transformers import GPT2Tokenizer

vocab_file_path = "/root/xxx/landmark-attention/lm_benchmark/data/pg19/vocab.json"
merges_file_path = "/root/xxx/landmark-attention/lm_benchmark/data/pg19/merges.txt"

# Build the GPT-2 tokenizer from the locally downloaded vocab/merges files
gpt2_tokenizer = GPT2Tokenizer(
    vocab_file=vocab_file_path,
    merges_file=merges_file_path,
)

def _read_directory(path):
    # ... (rest of the function kept the same)
    with open(os.path.join(path, filename), 'r') as f:
        texts.extend(gpt2_tokenizer.encode(f.read()))
        texts.append(gpt2_tokenizer.eos_token_id)
```
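For reference, here is a quick sanity check I plan to run on a machine with internet access, to verify that the locally loaded GPT2Tokenizer matches the tokenizer the repository normally uses. This is only a sketch: it assumes the original prepare.py uses tiktoken's "gpt2" encoding, and the file paths and sample sentence below are placeholders.

```python
# Sketch of a tokenizer equivalence check (assumption: the original prepare.py
# uses tiktoken's "gpt2" encoding; paths and sample text are placeholders).
import tiktoken
from transformers import GPT2Tokenizer

enc = tiktoken.get_encoding("gpt2")              # needs internet on first run
hf_tok = GPT2Tokenizer(vocab_file="vocab.json",  # placeholder local paths
                       merges_file="merges.txt")

sample = "The quick brown fox jumps over the lazy dog."

ids_tiktoken = enc.encode_ordinary(sample)  # plain BPE ids, no special tokens
ids_hf = hf_tok.encode(sample)              # GPT2Tokenizer adds no special tokens by default

print("vocab size:", enc.n_vocab, "vs", len(hf_tok))        # expect 50257 == 50257
print("eos id:", enc.eot_token, "vs", hf_tok.eos_token_id)  # expect 50256 == 50256
print("token ids identical:", ids_tiktoken == ids_hf)
```

If the vocabulary size or the EOS id differed, some of the token ids written by prepare.py could fall outside the model's embedding table, which might be one explanation for the NaN losses.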
I want to know whether this tokenizer change could be related to the training instability, because I have not made any other changes. Thank you very much for your reply!