
Segmentation fault (core dumped) error for multiple GPUs #47

@theonegis

Description


Environment:

  • Python: 3.6
  • PyTorch: 0.4.0
  • OS: Ubuntu 18.04.1 LTS
  • CUDA: V9.1.85
  • GPU: Tesla K80

Problem:

I was running a model that does not need BatchNorm, so I modified the original DenseNet slightly. Here is the code snippet:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as cp


def _cat_function_factory(conv, relu):
    def cat_function(*inputs):
        concated_features = torch.cat(inputs, 1)
        bottleneck_output = relu(conv(concated_features))
        return bottleneck_output
    return cat_function


class _DenseLayer(nn.Module):
    def __init__(self, num_input_features, growth_rate, bn_size, drop_rate):
        super(_DenseLayer, self).__init__()
        self.add_module('conv1', nn.Conv2d(num_input_features, bn_size * growth_rate, 1))
        self.add_module('relu1', nn.ReLU(inplace=True))
        self.add_module('conv2', nn.Conv2d(bn_size * growth_rate, growth_rate, 3, padding=1))
        self.add_module('relu2', nn.ReLU(inplace=True))
        self.drop_rate = drop_rate

    def forward(self, *inputs):
        cat_function = _cat_function_factory(self.conv1, self.relu1)
        if any(feature.requires_grad for feature in inputs):
            # Recompute the concat + bottleneck during backward to save memory.
            output = cp.checkpoint(cat_function, *inputs)
        else:
            output = cat_function(*inputs)
        new_features = self.relu2(self.conv2(output))
        if self.drop_rate > 0:
            new_features = F.dropout(new_features, p=self.drop_rate, training=self.training)
        return new_features


class _DenseBlock(nn.Module):
    def __init__(self, num_layers, num_input_features, bn_size, growth_rate, drop_rate):
        super(_DenseBlock, self).__init__()
        for i in range(num_layers):
            layer = _DenseLayer(num_input_features + i * growth_rate,
                                growth_rate, bn_size, drop_rate)
            self.add_module(f'denselayer{i + 1}', layer)

    def forward(self, init_features):
        features = [init_features]
        for name, layer in self.named_children():
            new_features = layer(*features)
            features.append(new_features)
        return torch.cat(features, 1)
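For background on the `cp.checkpoint` call above: `torch.utils.checkpoint` discards the intermediates of the wrapped function during the forward pass and recomputes them during backward. A minimal standalone sketch of that mechanism (the layer sizes here are illustrative, not taken from the model above):

```python
import torch
import torch.nn as nn
import torch.utils.checkpoint as cp

# Illustrative bottleneck pieces; sizes are made up for the sketch.
conv = nn.Conv2d(4, 8, 1)
relu = nn.ReLU(inplace=True)

def bottleneck(*inputs):
    # Concatenate along the channel dim, then 1x1 conv + ReLU.
    # When wrapped in cp.checkpoint, this is re-run during backward.
    return relu(conv(torch.cat(inputs, 1)))

x = torch.randn(2, 4, 8, 8, requires_grad=True)
out = cp.checkpoint(bottleneck, x)
out.sum().backward()
# x.grad has the same shape as x: gradients flow through the
# recomputed segment exactly as if it had not been checkpointed.
```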

It runs fine on a single GPU, but it throws a Segmentation fault (core dumped) error when running on multiple GPUs. What could be causing this issue?
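For context, the multi-GPU run presumably wraps the model in `nn.DataParallel` along these lines (a sketch under that assumption, since the wrapping code is not shown above; the stand-in model and sizes are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the full DenseNet; sizes are made up.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(inplace=True))

# Presumed multi-GPU setup (an assumption; not shown in the issue).
# On a CPU-only machine, DataParallel simply calls the wrapped module.
model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()

x = torch.randn(2, 3, 16, 16)
if torch.cuda.is_available():
    x = x.cuda()
out = model(x)  # shape: (2, 8, 16, 16)
```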
