Closed
Labels
bug (Something isn't working) · enhancement (New feature or request) · in-depth (Deep and valuable discussion)
Description
The current code raises a StopIteration error during multi-GPU training because one GPU ends up with less data than the others. The root cause is that in create_X_L_file() and create_X_U_file() in active_datasets.py, multiple GPUs write the same txt file concurrently, so the GPU that finishes writing first reads an incomplete txt file when it creates its dataloader.
Proposed fix:
- In both functions, sleep for a short random interval before writing the file so the ranks' writes are staggered:
time.sleep(random.uniform(0, 3))
if not osp.exists(save_path):
    mmcv.mkdir_or_exist(save_folder)
    np.savetxt(save_path, ann[X_L_single], fmt='%s')
- In tools/train.py, synchronize all ranks after each create_xx_file call by adding:
if dist.is_initialized():
    torch.distributed.barrier()
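A common alternative to staggering the writes with a random sleep is to let only rank 0 write the file and have every other rank wait at the barrier before reading it; this removes the race entirely rather than making it unlikely. The sketch below is a minimal, hedged illustration of that pattern: the `get_rank` helper, the `X_L.txt` filename, and the plain-text write standing in for `np.savetxt` are assumptions, not the repository's actual code, and the `torch.distributed.barrier()` call is shown only as a comment so the sketch runs without a distributed setup.

```python
import os
import tempfile

def get_rank():
    # Hypothetical helper: in the real code this would be
    # torch.distributed.get_rank() when dist.is_initialized(),
    # falling back to 0 for single-process runs.
    return int(os.environ.get("RANK", "0"))

def create_X_L_file(save_path, lines):
    """Sketch of a rank-0-only write: one process creates the txt file,
    the others skip the write and wait at a barrier instead of relying
    on a random sleep to avoid the race."""
    if get_rank() == 0:
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        # The original uses np.savetxt(save_path, ann[X_L_single], fmt='%s');
        # a plain write keeps this sketch dependency-free.
        with open(save_path, "w") as f:
            f.write("\n".join(lines) + "\n")
    # torch.distributed.barrier() would go here, so every rank sees the
    # complete file before building its dataloader.

folder = tempfile.mkdtemp()
path = os.path.join(folder, "X_L.txt")
create_X_L_file(path, ["img_001.jpg", "img_002.jpg"])
print(open(path).read().splitlines())  # → ['img_001.jpg', 'img_002.jpg']
```

Because every rank computes the same split, having a single writer plus a barrier is equivalent to the sleep-based workaround but deterministic: no rank can ever observe a partially written file.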