cannot import name 'c_lib' from partially initialized module 'byteps.torch' #448
Description
Describe the bug
Traceback (most recent call last):
  File "/media/sdb1/niejinquan/compression-code/deep-gradient-compression/test.py", line 5, in <module>
    import byteps.torch as bps
  File "/home/niejinquan/.conda/envs/GC/lib/python3.9/site-packages/byteps/torch/__init__.py", line 24, in <module>
    from byteps.torch.ops import push_pull_async_inplace as byteps_push_pull
  File "/home/niejinquan/.conda/envs/GC/lib/python3.9/site-packages/byteps/torch/ops.py", line 29, in <module>
    from byteps.torch import c_lib
ImportError: cannot import name 'c_lib' from partially initialized module 'byteps.torch' (most likely due to a circular import) (/home/niejinquan/.conda/envs/GC/lib/python3.9/site-packages/byteps/torch/__init__.py)
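A note on reading this message: Python prints "partially initialized module" whenever a `from package import name` fails while that package's `__init__.py` is still executing, so the circular-import hint does not by itself prove that a real import cycle exists. The sketch below reproduces the same kind of message with a throwaway package whose submodule asks its parent for a name that was never provided. The package name `demo_partial_pkg`, its `ops.py`, and `helper` are hypothetical stand-ins for illustration, not BytePS code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

PKG = "demo_partial_pkg"  # hypothetical package name, not part of BytePS


def reproduce_import_error():
    """Import a throwaway package whose submodule requests a missing name
    from its still-initializing parent, and return the error text."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / PKG
        pkg_dir.mkdir()
        # __init__.py pulls in the submodule, like byteps/torch/__init__.py does.
        (pkg_dir / "__init__.py").write_text(f"from {PKG}.ops import helper\n")
        # ops.py asks the parent package for `c_lib`, which does not exist here.
        (pkg_dir / "ops.py").write_text(
            f"from {PKG} import c_lib\n\ndef helper():\n    return c_lib\n"
        )
        sys.path.append(tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module(PKG)
        except ImportError as exc:
            return str(exc)
        finally:
            # Leave the interpreter state as we found it.
            sys.path.remove(tmp)
            for name in [m for m in sys.modules if m.split(".")[0] == PKG]:
                del sys.modules[name]
    return None


if __name__ == "__main__":
    print(reproduce_import_error())
```

If the message printed here resembles the one in the traceback, that suggests the failure can come from a missing `c_lib` rather than from a genuine cycle between modules.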
To Reproduce
Steps to reproduce the behavior:
- pip install byteps
- Run the following test.py:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
import byteps.torch as bps

# Initialize BytePS
bps.init()
torch.manual_seed(42)

# Define the model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Prepare the data
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_dataset = datasets.MNIST('data', train=True, download=True,
                               transform=transform)
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=64, shuffle=True)

# Model and optimizer
model = Net()
optimizer = optim.SGD(model.parameters(), lr=0.01 * bps.size())
optimizer = bps.DistributedOptimizer(optimizer)

# Broadcast the parameters from rank 0
bps.broadcast_parameters(model.state_dict(), root_rank=0)

# Training loop
def train(epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data.view(-1, 784))
        loss = nn.functional.cross_entropy(output, target)
        loss.backward()
        optimizer.step()

for epoch in range(1, 11):
    train(epoch)
- See error
Expected behavior
`import byteps.torch as bps` should succeed and test.py should run without an ImportError.
Environment (please complete the following information):
- OS: Ubuntu 18.04
- GCC version: 9.4
- CUDA and NCCL version: 11.4
- Framework (TF, PyTorch, MXNet): PyTorch
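One check that may help narrow this down is to look at which `c_lib` files actually exist in the installed package. The sketch below lists them without importing `byteps.torch`, so it does not trigger the error itself. It assumes that `c_lib` is meant to be a compiled extension file sitting in `byteps/torch` (an assumption; the report does not confirm this). The helper name `find_module_files` is hypothetical.

```python
import importlib.util
from pathlib import Path


def find_module_files(package, subpackage, stem):
    """List files named `<stem>.*` under <package>/<subpackage>.

    Only the top-level package is located; the subpackage is never
    imported, so a broken __init__.py inside it is not executed.
    """
    try:
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        return []
    if spec is None or not spec.submodule_search_locations:
        return []  # package not installed, or not a package
    matches = []
    for root in spec.submodule_search_locations:
        folder = Path(root) / subpackage if subpackage else Path(root)
        if folder.is_dir():
            matches.extend(sorted(p.name for p in folder.glob(stem + ".*")))
    return matches


if __name__ == "__main__":
    # Assumption: the extension, if it was built, lives in byteps/torch as c_lib.<suffix>
    print(find_module_files("byteps", "torch", "c_lib"))
```

An empty list would suggest that no `c_lib` file was installed for the PyTorch plugin in this environment, in which case the ImportError would come from the missing file rather than from the script in test.py.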