I tried running the examples from the README.md. The first example (using the full dataset) worked as expected. However, the second example (using train_info_batch) resulted in the following error:
```
$ nohup /usr/bin/time -v python examples/cifar_example.py --model r50 --optimizer lars --max-lr 5.2 --num_epoch 5 --delta 0.875 --ratio 0.5 --use_info_batch >> log_test.log 2>&1
nohup: ignoring input
==> Building model..
use normal data parallel
Use info batch.
<class 'infobatch.infobatch.IBSampler'>
Epoch: 0, iterations 391
Traceback (most recent call last):
  File "/home/vm03/Desktop/barbara/infobatch/InfoBatch/examples/cifar_example.py", line 269, in <module>
    train_info_batch(epoch) if args.use_info_batch else train_normal(epoch)
    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vm03/Desktop/barbara/infobatch/InfoBatch/examples/cifar_example.py", line 191, in train_info_batch
    lr_scheduler.step()
  File "/home/vm03/anaconda3/envs/cp/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 241, in step
    values = self.get_lr()
             ^^^^^^^^^^^^^
  File "/home/vm03/anaconda3/envs/cp/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 2153, in get_lr
    raise ValueError(
ValueError: Tried to step 1956 times. The specified number of total steps is 1955
```
To speed up testing I initially reduced the number of epochs, but I also ran the original example with the full 200 epochs and hit the same error, just with a higher total step count.
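For context, judging by the message and the traceback, this `ValueError` comes from `torch.optim.lr_scheduler.OneCycleLR`, which raises once `step()` has been called more times than its configured `total_steps`. A minimal sketch reproducing the same failure mode (the model and optimizer below are placeholders; only the step counts, which mirror the log above, matter):

```python
import torch

# Placeholder model/optimizer; only the scheduler bookkeeping matters here.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# 5 epochs x 391 iterations = 1955 total steps, matching the numbers in the log.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=5.2, total_steps=5 * 391
)

for _ in range(5 * 391 + 1):  # one extra call, as appears to happen in train_info_batch
    optimizer.step()
    scheduler.step()  # the 1956th call raises "Tried to step 1956 times. ..."
```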
It seems that the error originates in the batch loop inside the train_info_batch function, which appears to call lr_scheduler.step() one more time than the scheduler's total_steps allows:
```python
for batch_idx, blobs in enumerate(trainloader):
    inputs, targets = blobs
    inputs, targets = inputs.to(device), targets.to(device)
```
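A simple way to check this would be to count the scheduler steps per epoch and compare them with the 391 iterations reported in the log. A hypothetical diagnostic (not part of the example; it reuses the trainloader and lr_scheduler names from cifar_example.py):

```python
# Hypothetical diagnostic: count lr_scheduler.step() calls per epoch and overall,
# to see whether the per-epoch iteration count drifts from the expected 391.
total_steps_taken = 0
for epoch in range(num_epochs):  # num_epochs as set via --num_epoch
    steps_this_epoch = 0
    for batch_idx, blobs in enumerate(trainloader):
        # ... forward/backward/optimizer.step() as in train_info_batch ...
        lr_scheduler.step()
        steps_this_epoch += 1
        total_steps_taken += 1
    print(f"epoch {epoch}: {steps_this_epoch} steps, {total_steps_taken} total")
```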