Replies: 1 comment
Apparently it's actually normal for the CPU to be pegged while waiting for the GPU: the wait is a busy loop polling for completion. See https://forums.developer.nvidia.com/t/cpu-usage-while-waiting-for-kernel/11272/2. Still no idea what's going on with the Colab TPU, but I suppose that's a separate question now.
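A minimal sketch of how to see that busy-wait in action (assumes a CUDA-enabled JAX install; the chain of matmuls is just arbitrary stand-in GPU work, not anything from the code discussed here):

```python
# While the host blocks on a GPU result, one CPU core sits near 100%
# (visible in `top`), because the wait is a spin loop polling the device
# rather than a sleep.
import time
import jax
import jax.numpy as jnp

x = jax.random.normal(jax.random.PRNGKey(0), (8192, 8192))

@jax.jit
def stand_in_gpu_work(a):
    # Chain of large matmuls, renormalized each step so values stay finite.
    for _ in range(30):
        a = a @ a
        a = a / jnp.max(jnp.abs(a))
    return a

start = time.perf_counter()
out = stand_in_gpu_work(x)        # returns almost immediately (async dispatch)
out.block_until_ready()           # host spins here until the GPU finishes
print(f"{time.perf_counter() - start:.2f}s of GPU work (plus one-time JIT compile); "
      "watch `top` during the wait to see a pegged core")
```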
I've written a little byte-level language model using JAX & Flax, and for some reason, when training it, it pegs a CPU core, even though nearly all the work should be happening on my GPU. And `nvidia-smi` reports GPU utilization is consistently at 100%, so I'm confused. Meanwhile, with a TPU core on Colab, the TPU idle time is around 30% and matrix unit utilization is around 5.5%. Not sure if that's the same issue - my desktop has a very powerful CPU and a pretty decent GPU, while on Colab the CPU is pretty underpowered and the TPU is very powerful, so it'd make sense for the TPU to be starved if there were some inefficiency in feeding it.

Here's my training loop:
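(The exact code is in the linked repo at revision 6802b02; the sketch below is only a reconstruction of its rough shape, with the model, loss, optimizer, and loader arguments filled in as placeholders.)

```python
# NOT the code from the linked repo (see revision 6802b02 for the real loop);
# only a sketch of its shape. `Enwik9Loader` and `fast_train_step` are the
# names used in this post; the tiny one-layer "model", loss, optimizer, and
# loader arguments below are made-up placeholders.
import jax
import jax.numpy as jnp
import optax

def loss_fn(params, batch):
    # Placeholder next-byte cross-entropy.
    inputs = jax.nn.one_hot(batch[:, :-1], 256)          # (B, T-1, 256)
    logits = inputs @ params["w"]                        # (B, T-1, 256)
    targets = jax.nn.one_hot(batch[:, 1:], 256)
    return optax.softmax_cross_entropy(logits, targets).mean()

optimizer = optax.adam(1e-3)

@jax.jit
def fast_train_step(params, opt_state, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    return optax.apply_updates(params, updates), opt_state, loss

params = {"w": jnp.zeros((256, 256))}                    # placeholder parameters
opt_state = optimizer.init(params)

# Enwik9Loader comes from the repo and yields NumPy batches of byte values;
# its constructor arguments here are guesses.
for step, batch in enumerate(Enwik9Loader(batch_size=64, seq_len=512)):
    params, opt_state, loss = fast_train_step(params, opt_state, batch)
    if step % 100 == 0:
        print(step, float(loss))                         # float() forces a device sync
```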
`Enwik9Loader` is an iterable of NumPy arrays, all of which are views into one master array. I've put the full code up here if you'd like to take a look; revision 6802b02 corresponds to the snippet above. `line_profiler` shows that 89.6% of the time is spent on the line that calls `fast_train_step` in the loop, and 7.4% is spent when the function is first called to JIT it. The remainder is setting up the iterable and various tiny things.

So what's going on? Is my code not dispatching fast enough to saturate the GPU, and if so, why does `nvidia-smi` say the GPU is saturated? Is there some spinlock or something that makes the host CPU utilization a meaningless metric?
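For what it's worth, one way to separate Python-side dispatch time from GPU execution time would be something like the sketch below (the helper name and the `fast_train_step` signature here are illustrative, not taken from the repo). JAX dispatches asynchronously, so the call itself returns once the work is enqueued, and `block_until_ready()` then spin-waits for the device to finish.

```python
import time
import jax

def time_one_step(params, opt_state, batch):
    # Assumed signature; adapt to whatever fast_train_step actually takes.
    # Note the first call also includes one-time JIT compilation.
    t0 = time.perf_counter()
    out = fast_train_step(params, opt_state, batch)
    t_dispatch = time.perf_counter() - t0     # time to enqueue the work only
    jax.block_until_ready(out)                # spin-wait for the GPU to finish
    t_total = time.perf_counter() - t0        # enqueue + GPU execution
    print(f"dispatch: {t_dispatch * 1e3:.2f} ms, total: {t_total * 1e3:.2f} ms")
    # If dispatch is a tiny fraction of the total, the host is keeping the GPU
    # fed, and the pegged core is just the busy-wait on the result.
```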