Decoupled Neural Interfaces Using Synthetic Gradients

Install
- From source
Architecture
Usage
DNI Networks
Custom DNI Networks
Tasks
Notable stuff

This is a PyTorch implementation of "Decoupled Neural Interfaces using Synthetic Gradients" (Jaderberg et al.).

Install

pip install pytorch-dni

From source

git clone https://github.com/ixaxaar/pytorch-dni
cd pytorch-dni
pip install -r ./requirements.txt
pip install -e .

Architecture

Usage

Constructor Parameters

Following are the constructor parameters of DNI:

Argument	Default	Description
network	NA	Network to be optimized
dni_network	None	DNI network class
dni_params	{}	Parameters to be passed to the dni_network constructor
optim	None	optimizer for the network
grad_optim	'adam'	DNI module optimizer
grad_lr	0.001	DNI learning rate
hidden_size	10	hidden size of the DNI network
λ	0.5	How much to mix backprop and synthetic gradients (0 = synthetic only, 1 = backprop only)
recursive	True	whether to optimize leaf modules or treat network as a leaf module
gpu_id	-1	GPU ID
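
The table gives only the two endpoints of λ. As a minimal sketch of what the mixing could look like, the snippet below assumes a convex combination of the two gradients. The function name `mix_gradients` and the plain-list representation are hypothetical and are not part of the package; the exact formula used internally may differ.

```python
def mix_gradients(backprop_grad, synthetic_grad, lam=0.5):
    """Blend two gradient vectors element-wise.

    Assumed formula: lam * backprop + (1 - lam) * synthetic,
    so lam=0 gives synthetic only and lam=1 gives backprop only.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return [lam * b + (1.0 - lam) * s
            for b, s in zip(backprop_grad, synthetic_grad)]


backprop = [1.0, -2.0, 0.5]    # gradient from ordinary backprop
synthetic = [0.0, -1.0, 1.5]   # gradient predicted by a DNI network
mixed = mix_gradients(backprop, synthetic, lam=0.5)
```

With the default λ of 0.5, each component is the average of the two estimates.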

TLDR: Use DNI to optimize every leaf module of `net` (including last layer)

import torch.optim as optim
from dni import DNI

# Parent network, can be anything extending nn.Module
net = WhateverNetwork(**kwargs)
opt = optim.Adam(net.parameters(), lr=0.001)

# use DNI to optimize this network
net = DNI(net, grad_optim='adam', grad_lr=0.0001)

# after that we go about our business as usual
for e in range(epoch):
  opt.zero_grad()
  output = net(input, *args)
  loss = criterion(output, target_output)
  loss.backward()

  # Optional: do this to __also__ update net's weight using backprop
  # opt.step()
...

Apply DNI to custom layer

DNI can be applied to any class extending nn.Module. In this example we supply which layers to use DNI for, as the parameter dni_layers:

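import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from dni import *

image_size = 28  # assumed here: 28x28 inputs such as MNIST; set to your data's side length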

class Net(nn.Module):
  def __init__(self, num_layers=3, hidden_size=256, dni_layers=[]):
    super(Net, self).__init__()
    self.num_layers = num_layers
    self.hidden_size = hidden_size

    self.net = [self.dni(self.layer(
        image_size*image_size if l == 0 else hidden_size,
        hidden_size
    )) if l in dni_layers else self.layer(
        image_size*image_size if l == 0 else hidden_size,
        hidden_size
    ) for l in range(self.num_layers)]
    self.final = self.layer(hidden_size, 10)

    # bind layers to this class (so that they're searchable by pytorch)
    for ctr, n in enumerate(self.net):
      setattr(self, 'layer'+str(ctr), n)

  def layer(self, input_size, hidden_size):
    return nn.Sequential(
      nn.Linear(input_size, hidden_size),
      nn.BatchNorm1d(hidden_size)
    )

  # create a DNI wrapper layer, recursive=False implies treat this layer as a leaf module
  def dni(self, layer):
    d = DNI(layer, hidden_size=256, grad_optim='adam', grad_lr=0.0001, recursive=False)
    return d

  def forward(self, x):
    output = x.view(-1, image_size*image_size)
    for layer in self.net:
      output = F.relu(layer(output))
    output = self.final(output)
    return F.log_softmax(output, dim=-1)

net = Net(num_layers=3, dni_layers=[1,2,3])

# use the gradient descent to optimize layers not optimized by DNI
opt = optim.Adam(net.final.parameters(), lr=0.001)

# after that we go about our business as usual
for e in range(epoch):
  opt.zero_grad()
  output = net(input)
  loss = criterion(output, target_output)
  loss.backward()
  opt.step()

Apply custom DNI net to all layers

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from dni import *

# Custom DNI network
class MyCustomDNI(DNINetwork):

  def __init__(self, input_size, hidden_size, output_size, num_layers=2, bias=True):

    super(MyCustomDNI, self).__init__(input_size, hidden_size, output_size)

    self.input_size = input_size
    self.hidden_size = hidden_size * 4
    self.output_size = output_size
    self.num_layers = num_layers
    self.bias = bias

    self.net = [self.layer(
        input_size if l == 0 else self.hidden_size,
        self.hidden_size
    ) for l in range(self.num_layers)]

    # bind layers to this class (so that they're searchable by pytorch)
    for ctr, n in enumerate(self.net):
      setattr(self, 'layer'+str(ctr), n)

    # final layer (yeah, no kidding)
    self.final = nn.Linear(self.hidden_size, output_size)

  def layer(self, input_size, hidden_size):
      return nn.Linear(input_size, hidden_size)

  def forward(self, input, hidden):
    output = input
    for layer in self.net:
      output = F.relu(layer(output))
    output = self.final(output)

    return output, None

# Custom network, can be anything extending nn.Module
net = WhateverNetwork(**kwargs)
opt = optim.Adam(net.parameters(), lr=0.001)

# use DNI to optimize this network with MyCustomDNI, pass custom params to the DNI nets
net = DNI(net, grad_optim='adam', grad_lr=0.0001, dni_network=MyCustomDNI,
      dni_params={'num_layers': 3, 'bias': True})

# after that we go about our business as usual
for e in range(epoch):
  opt.zero_grad()
  output = net(input, *args)
  loss = criterion(output, target_output)
  loss.backward()

Apply custom DNI net to custom layers

Oh come on.

DNI Networks

This package ships with 5 types of DNI networks:

LinearDNI: Linear -> ReLU * num_layers -> Linear
LinearSigmoidDNI: Linear -> ReLU * num_layers -> Linear -> Sigmoid
LinearBatchNormDNI: Linear -> BatchNorm1d -> ReLU * num_layers -> Linear
RNNDNI: stacked LSTMs, GRUs or RNNs
Conv2dDNI: Conv2d -> BatchNorm2d -> MaxPool2d / AvgPool2d -> ReLU * num_layers -> Conv2d -> AvgPool2d
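
To make the shorthand concrete, here is a rough sketch of the first pattern, reading "Linear -> ReLU * num_layers -> Linear" as a (Linear, ReLU) pair repeated `num_layers` times and followed by one Linear layer. That reading, the helper name `linear_dni_sketch`, and the sizes are assumptions for illustration; the shipped LinearDNI class may be built differently.

```python
import torch
import torch.nn as nn


def linear_dni_sketch(input_size, hidden_size, output_size, num_layers=2):
    """Illustrative stack: (Linear -> ReLU) * num_layers -> Linear."""
    layers = []
    for l in range(num_layers):
        in_features = input_size if l == 0 else hidden_size
        layers.append(nn.Linear(in_features, hidden_size))
        layers.append(nn.ReLU())
    layers.append(nn.Linear(hidden_size, output_size))
    return nn.Sequential(*layers)


# A synthetic gradient has the same shape as the activation it approximates,
# so the output size is set equal to the input size in this sketch.
sg = linear_dni_sketch(input_size=256, hidden_size=10, output_size=256)
activations = torch.randn(4, 256)   # batch of 4 hypothetical layer outputs
synthetic_grad = sg(activations)    # one predicted gradient per activation
```

The other linear variants would differ only in the layers placed inside the repeated block or after the final Linear.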

Custom DNI Networks

Custom DNI nets can be created using the DNINetwork interface:

from dni import *

class MyDNI(DNINetwork):
  def __init__(self, input_size, hidden_size, output_size, **kwargs):
    super(MyDNI, self).__init__(input_size, hidden_size, output_size)
    ...

  def forward(self, input, hidden):
    ...
    return output, hidden

Tasks

MNIST (FCN and CNN)

Refer to tasks/mnist/README.md

Language model

Refer to tasks/word_language_model/README.md

Copy task

The tasks included in this project are the same as those in pytorch-dnc, except that they're trained here using DNI.

Notable stuff

Using a linear synthetic gradient (SG) module makes the implicit assumption that the loss is a quadratic function of the activations.
For best performance, adapt the SG module architecture to the loss function used. For MSE, a linear SG is a reasonable choice; for log loss, use an architecture that applies a pointwise sigmoid to a linear SG.
Learning rates on the order of 1e-5 with a momentum of 0.9 work well for rmsprop; adam works well with a learning rate of 0.001.
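
The reasoning behind the first two notes can be sketched with a short derivation. The symbols are assumptions introduced here for illustration and do not come from the package: h is the activation vector, A is a symmetric matrix, and z = w^T h + b_0 is a hypothetical single logit with label y.

```latex
% Quadratic loss: the gradient is affine in h, so a linear SG can match it exactly.
L(h) = \tfrac{1}{2} h^\top A h + b^\top h + c
\quad\Longrightarrow\quad
\frac{\partial L}{\partial h} = A h + b

% Log loss on a sigmoid output with logit z = w^\top h + b_0:
L(h) = -\bigl[\, y \log \sigma(z) + (1 - y) \log\bigl(1 - \sigma(z)\bigr) \bigr]
\quad\Longrightarrow\quad
\frac{\partial L}{\partial h} = \bigl(\sigma(w^\top h + b_0) - y\bigr)\, w
```

In the second case the gradient depends on h only through a sigmoid of a linear function, which is why a sigmoid on top of a linear SG is a better match for log loss than a purely linear module.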
