Single-machine multi-GPU execution failed. #309

@lovejike

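Describe the bug
Running on a single machine with multiple GPUs fails in DeepCTR-Torch 0.2.9 with a device-mismatch RuntimeError (see the log below).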

To Reproduce
Steps to reproduce the behavior:

2025-08-25 08:49:44,005 - root - ERROR - Task execution failed: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/freya/miniconda3/lib/python3.13/site-packages/torch/nn/parallel/parallel_apply.py", line 97, in _worker
    output = module(*input, **kwargs)
  File "/home/freya/miniconda3/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/freya/miniconda3/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/freya/miniconda3/lib/python3.13/site-packages/deepctr_torch/models/dcn.py", line 76, in forward
    logit = self.linear_model(X)
  File "/home/freya/miniconda3/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/freya/miniconda3/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/freya/miniconda3/lib/python3.13/site-packages/deepctr_torch/models/basemodel.py", line 86, in forward
    linear_logit += sparse_feat_logit
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

Operating environment:

Additional context
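Proposed fix in basemodel.py (https://github.com/shenweichen/DeepCTR-Torch/blob/master/deepctr_torch/models/basemodel.py#L79):

Current:
    linear_logit = torch.zeros([X.shape[0], 1]).to(self.device)
Proposed:
    linear_logit = torch.zeros([X.shape[0], 1]).to(sparse_embedding_list[0].device)

The traceback passes through torch/nn/parallel/parallel_apply.py, so the model is replicated across GPUs. In replica 1 the embeddings live on cuda:1, but self.device still points to cuda:0, so the in-place add at basemodel.py line 86 mixes two devices. Allocating the zeros tensor on the same device as the embeddings keeps both operands together.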
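The idea behind the fix can be shown with a small standalone sketch. `TinyLinear` is a hypothetical stand-in, not DeepCTR-Torch code. It allocates the accumulator on `X.device` rather than on `sparse_embedding_list[0].device`; this variant is an assumption on my side, chosen because it would also work if the embedding list were empty. The two-GPU branch only runs when at least two CUDA devices are visible.

```python
import torch
import torch.nn as nn


class TinyLinear(nn.Module):
    """Hypothetical stand-in for a linear logit layer (not DeepCTR-Torch code)."""

    def __init__(self, vocab_size=10, device="cpu"):
        super().__init__()
        # Plain attribute fixed at construction time; replicas created by
        # nn.DataParallel keep this same value on every GPU.
        self.device = device
        self.embedding = nn.Embedding(vocab_size, 1)

    def forward(self, X):
        sparse_embedding = self.embedding(X.long())      # (batch, fields, 1)
        sparse_feat_logit = sparse_embedding.sum(dim=1)  # (batch, 1)
        # Allocate on the device of the incoming batch, not on self.device,
        # so each replica adds tensors that live on its own GPU.
        linear_logit = torch.zeros([X.shape[0], 1], device=X.device)
        linear_logit = linear_logit + sparse_feat_logit
        return linear_logit


# CPU check: the output follows the input's device.
model = TinyLinear()
X = torch.randint(0, 10, (4, 3))
out = model(X)

# Optional multi-GPU check, skipped unless two or more GPUs are available.
if torch.cuda.device_count() >= 2:
    parallel_model = nn.DataParallel(TinyLinear(device="cuda:0").to("cuda:0"))
    out_gpu = parallel_model(X.to("cuda:0"))
    print(out_gpu.shape)
```

If the line inside `forward` is changed back to `torch.zeros(...).to(self.device)`, the two-GPU branch should hit the same kind of device-mismatch error as in the log above, since replica 1 would add a cuda:0 tensor to a cuda:1 tensor.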
