Replies: 6 comments 54 replies
-
I notice that you misunderstand the usage of `lax.scan` here:

```python
import jax
import jax.numpy as jnp

num_batches = 100
batch_size = 512
assert jax.lax.scan(lambda n_batches, _: (n_batches + 1, None), 0,
                    jnp.split(jnp.zeros(num_batches * batch_size), num_batches))[0] == batch_size
```

Namely, your scan steps over the leading axis of each split (length `batch_size`), not over the list of `num_batches` batches.
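For contrast, a minimal sketch of the behavior the assert demonstrates versus scanning over stacked batches; the stacked variant is an assumption about the intended usage, not code from the thread:

```python
import jax
import jax.numpy as jnp

num_batches, batch_size = 100, 512
data = jnp.zeros(num_batches * batch_size)

def count(n, _):
    return n + 1, None

# A Python list of arrays is a pytree: scan steps over the leading axis of
# each leaf (length batch_size), so the counter ends at 512.
assert jax.lax.scan(count, 0, jnp.split(data, num_batches))[0] == batch_size

# Stacking into a single (num_batches, batch_size) array makes scan step
# over batches instead, so the counter ends at 100.
assert jax.lax.scan(count, 0, data.reshape(num_batches, batch_size))[0] == num_batches
```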
-
Can you take a profile using …?
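A minimal sketch of capturing a trace with JAX's built-in profiler, assuming that is the kind of profile being asked for; the `/tmp/jax-trace` output path is arbitrary:

```python
import jax
import jax.numpy as jnp

# Write a trace that can be inspected in TensorBoard's profile plugin
# (or Perfetto) to see where the time is actually going.
with jax.profiler.trace("/tmp/jax-trace"):
    x = jnp.ones((4096, 4096))
    y = (x @ x).block_until_ready()  # block so the work lands inside the trace
```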
-
I don't think that augmentation or overheads related to scan are problematic here; the profile suggested that virtually all your time is spent in the device step. But none of us on the JAX team are experts in cuDNN, and it really feels like the issue is something like cuDNN (via XLA) selecting slow kernels.
On Mon, Mar 7, 2022 at 8:18 PM Anselm Levskaya wrote:

@samuela

- Do you have any sense of how much time is spent in your on-device augmentation vs. the actual NN code? In all "production" code using JAX we tend to use tf.data to set up the dataset augmentation pipeline, since tf.data is great at utilizing CPUs. Shoving augmentation computations onto the device is only going to hurt performance, since you're leaving a lot of compute on the table.
- I think a few of us are worried about scan potentially introducing some inefficiencies here. In practice, we never scan a training loop. It may be harmless, but it's a bit odd. Given JAX's async dispatch you don't really gain much by doing this.
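As a rough illustration of the tf.data approach described above, here is a minimal sketch of a CPU-side augmentation pipeline handing NumPy batches to a JAX step; the specific augmentations, split, and batch size are assumptions for the example, not taken from the benchmark code:

```python
import tensorflow as tf
import tensorflow_datasets as tfds

def augment(image, label):
    # CPU-side augmentation: reflection-pad, random crop, horizontal flip.
    image = tf.pad(image, [[4, 4], [4, 4], [0, 0]], mode="REFLECT")
    image = tf.image.random_crop(image, [32, 32, 3])
    image = tf.image.random_flip_left_right(image)
    return image, label

ds = (
    tfds.load("cifar10", split="train", as_supervised=True)
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(50_000)
    .batch(512, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)

for images, labels in tfds.as_numpy(ds):
    ...  # hand the NumPy batch to a jitted train_step
```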
-
@samuela, I want to make absolutely sure that we're comparing apples-to-apples here. The first thing I'm wondering about is the padding in the convolutions. The PyTorch version uses … Regardless, the discrepancy between the two paddings is a valid difference, right?
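To illustrate the kind of padding discrepancy presumably at issue: in JAX, `padding="SAME"` lets XLA choose the padding, which can be asymmetric for strided convolutions, whereas an explicit `[(1, 1), (1, 1)]` matches PyTorch's `Conv2d(..., padding=1)`. A minimal sketch with made-up shapes:

```python
import jax.numpy as jnp
from jax import lax

x = jnp.ones((1, 3, 32, 32))   # NCHW input
w = jnp.ones((64, 3, 3, 3))    # OIHW 3x3 kernel

# "SAME" lets XLA pick the padding (possibly asymmetric for strided convs);
# explicit [(1, 1), (1, 1)] reproduces PyTorch's Conv2d(padding=1).
y_same = lax.conv_general_dilated(x, w, window_strides=(2, 2), padding="SAME")
y_expl = lax.conv_general_dilated(x, w, window_strides=(2, 2), padding=[(1, 1), (1, 1)])

print(y_same.shape, y_expl.shape)  # both (1, 64, 16, 16), but border values can differ
```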
-
The current #8 record holder on the DAWNBench CIFAR-10 benchmark is a PyTorch ResNet by @davidcpage running on an EC2 p3.2xlarge instance (single V100 GPU). Record holders #1-#7 either use multiple GPUs or run on more esoteric cloud providers, so I'm not worrying about them for now.
My goal is to at least match the performance of this PyTorch implementation in JAX in terms of `seconds/epoch`. However, I've found that even my somewhat carefully written JAX version is a shocking order of magnitude slower than the PyTorch version. On a p3.2xlarge instance, my JAX version (attached) clocks in at about 24.2 s/epoch. The PyTorch version reports completing 24 epochs in 72 s, which comes out to 3 s/epoch.

Some notes:

- The training loop is implemented with `lax.scan` (see the sketch at the end of this post).

Original PyTorch implementation: https://github.com/davidcpage/cifar10-fast
JAX implementation (including my shell.nix for reproducibility): https://gist.github.com/samuela/78a3f0bbac759833a0464048aa499c98
What am I doing wrong here? What do I have to do to get competitive performance out of JAX?
cc @sharadmv
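For reference, a minimal sketch of the scan-over-an-epoch pattern mentioned in the notes; the toy `train_step`, parameter shapes, and batch shapes are made up for illustration and are not the code in the gist:

```python
import jax
import jax.numpy as jnp

def train_step(params, batch):
    # Stand-in for a real SGD step: nudge params toward the batch mean.
    grads = jnp.mean(batch) * jnp.ones_like(params)
    return params - 1e-3 * grads, None

@jax.jit
def train_epoch(params, batches):
    # batches has shape (num_batches, batch_size, ...); scan steps over axis 0,
    # so the whole epoch runs as a single XLA computation.
    params, _ = jax.lax.scan(train_step, params, batches)
    return params

params = jnp.zeros((10,))
batches = jnp.ones((100, 512, 32 * 32 * 3))
params = train_epoch(params, batches)
```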