In your model code, the final layer has num_landmarks+1 output channels:
self.add_module('l' + str(hg_module), nn.Conv2d(256, num_landmarks+1, kernel_size=1, stride=1, padding=0))
but in evaler.py the last channel is dropped and never used:
pred_heatmap = outputs[-1][:, :-1, :, :][i].detach().cpu()
So I guess the extra channel is used somewhere in the loss?
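
To make the shapes concrete, here is a minimal sketch of what I mean. The value of `num_landmarks`, the batch size, and the 64x64 feature map are assumptions I picked for illustration, and `head` and `features` are hypothetical names standing in for the real model and its input:

```python
import torch
import torch.nn as nn

num_landmarks = 68  # assumed value, for illustration only

# Same layer shape as in the model: one more output channel than landmarks
head = nn.Conv2d(256, num_landmarks + 1, kernel_size=1, stride=1, padding=0)

# Dummy batch of 2 feature maps (sizes are assumptions, not from the repo)
features = torch.randn(2, 256, 64, 64)

# Stand-in for the list of per-stack outputs returned by the model
outputs = [head(features)]

i = 0  # index of the sample within the batch
# Same slicing as in evaler.py: drop the last channel, then pick sample i
pred_heatmap = outputs[-1][:, :-1, :, :][i].detach().cpu()

print(outputs[-1].shape)   # torch.Size([2, 69, 64, 64])
print(pred_heatmap.shape)  # torch.Size([68, 64, 64])
```

So the evaluation only ever sees `num_landmarks` heatmaps per sample, and the extra channel never reaches it.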