total_train_steps too high #8

@snimu

Description

Hi,

total_train_steps is currently set to 200_000. This seems way too high: I already get a val_loss of ~3.8 and a perplexity of around 40 after roughly 1000 steps.

Edit: When using torch 2.0, setting total_train_steps to 1000 leads to an exception:

File "main.py", line 522, in main
    scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer=opt, max_lr=hyp['opt']['lr'], total_steps=hyp['opt']['total_train_steps'], pct_start=hyp['opt']['warmup_percent'], anneal_strategy='linear', cycle_momentum=False, div_factor=1e2, final_div_factor=.02)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 1676, in __init__
    super().__init__(optimizer, last_epoch, verbose)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 79, in __init__
    self._initial_step()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 85, in _initial_step
    self.step()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 150, in step
    values = self.get_lr()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 1714, in get_lr
    pct = (step_num - start_step) / (end_step - start_step)
ZeroDivisionError: float division by zero
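For what it's worth, the crash seems to come from OneCycleLR's warmup phase collapsing to zero length: in torch 2.0 the first phase ends at float(pct_start * total_steps) - 1, so whenever pct_start * total_steps == 1 the phase starts and ends at step 0 and get_lr() divides by zero during construction. A minimal sketch that reproduces it, assuming a warmup_percent of 0.001 (the repo's actual value may differ):

import torch

# Standalone repro (assumed warmup_percent, not necessarily the repo's value).
# With pct_start=0.001 and total_steps=1000, the warmup phase ends at
# float(0.001 * 1000) - 1 == 0, so get_lr() computes
# (step_num - start_step) / (end_step - start_step) == 0 / 0 in __init__.
model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer=opt,
    max_lr=1e-3,
    total_steps=1000,
    pct_start=0.001,  # hypothetical stand-in for hyp['opt']['warmup_percent']
    anneal_strategy='linear',
    cycle_momentum=False,
    div_factor=1e2,
    final_div_factor=.02,
)  # raises ZeroDivisionError: float division by zero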

(I use a slightly changed version of this package, but didn't touch main.py or any of the building blocks other than total_train_steps).

Setting total_train_steps = 2_000 works fine for me, so I would cautiously suggest that as the default :)
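Until the default changes, a cheap guard before the scheduler is constructed would at least make the failure mode explicit. A sketch, not the repo's code; the hyp keys are taken from the call in main.py quoted in the traceback above:

# Guard sketch: fail with a readable message instead of ZeroDivisionError.
total_steps = hyp['opt']['total_train_steps']
pct_start = hyp['opt']['warmup_percent']
if pct_start * total_steps <= 1:
    raise ValueError(
        f"OneCycleLR warmup phase would be empty "
        f"(warmup_percent * total_train_steps = {pct_start * total_steps}); "
        f"increase total_train_steps or warmup_percent"
    )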
