Error during training #33

@huang-chenhai

[2022-05-05 23:18:34,655][        main.py][line: 280][    INFO] Epoch [1]       Iter [53880/184378]     Time 0.238 (0.412)   Data 0.000 (0.192)      Loss 0.0788 (0.0725)
[2022-05-05 23:18:39,059][        main.py][line: 280][    INFO] Epoch [1]       Iter [53900/184378]     Time 0.296 (0.220)   Data 0.000 (0.000)      Loss 0.0805 (0.0908)
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [0,0,0], thread: [96,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [0,0,0], thread: [97,0,0] Assertion `input_val >= zero && input_val <= one` failed.
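
The assertion text in the log is a range check on each value passed to a loss kernel. As a rough illustration, assuming the check amounts to `0 <= x <= 1` per element, the plain-Python sketch below shows why a NaN input fails it: every ordered comparison involving NaN evaluates to false. The helper name `in_unit_interval` is hypothetical, and this is not the CUDA kernel itself.

```python
import math


def in_unit_interval(x: float) -> bool:
    """Mirror of the logged condition `input_val >= zero && input_val <= one`."""
    return x >= 0.0 and x <= 1.0


# A valid probability passes, an out-of-range value fails,
# and NaN fails because both comparisons are false for NaN.
for value in (0.5, 1.2, math.nan):
    print(value, in_unit_interval(value))
```

So the assertion does not necessarily mean the values drifted slightly outside [0, 1]; an input that is already NaN would trigger the same message.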

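Hello, I trained with the configuration you provided, SLOWFAST_R101_ACAR_HR2O_DEPTH1.yaml, using nproc_per_node=1 and leaving all other settings at their defaults. The dataset consists of frames extracted with the tool you provided, and the GPU is a 3080 Ti. The error output is shown in the log above. I debugged the training step and found the following: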

        ret = model(data)
        num_rois = ret['num_rois']
        outputs = ret['outputs']
        targets = ret['targets']

Here `outputs` contains only NaN values: [nan, nan, nan, ...]
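
One way to find the first bad iteration before the loss kernel asserts is to test the tensor for non-finite values right after the forward pass. The sketch below assumes `outputs` is a floating-point `torch.Tensor`; `nonfinite_report` is a hypothetical helper and not part of the repository, and the tensors are stand-ins for illustration.

```python
from typing import Optional

import torch


def nonfinite_report(name: str, tensor: torch.Tensor) -> Optional[str]:
    """Return a short description if `tensor` has NaN/Inf values, else None."""
    if bool(torch.isfinite(tensor).all()):
        return None
    num_nan = int(torch.isnan(tensor).sum())
    num_inf = int(torch.isinf(tensor).sum())
    return f"{name}: {num_nan} NaN, {num_inf} Inf out of {tensor.numel()} values"


# Stand-in tensors; in the training loop this would be ret['outputs'].
good = torch.tensor([0.1, 0.9])
bad = torch.tensor([float("nan"), 0.5, float("inf")])
print(nonfinite_report("outputs", good))
print(nonfinite_report("outputs", bad))
```

Calling such a check right after `ret = model(data)` and stopping when it returns a message would show which iteration first produces NaN. `torch.autograd.set_detect_anomaly(True)` can additionally point at the backward operation that produced it, at a noticeable cost in speed.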

The SLOWFAST_R50_ACAR_HR2O.yaml configuration seems to run normally. I don't know where the problem lies and look forward to your reply. Thank you!
