After reading this paper, I found that the biggest optimization is predicting the core neurons that affect the output most. So if the network is small to begin with (say, only 1000 parameters), does the optimization technique still give it a substantial improvement?