Is your feature request related to a problem? Please describe.
The standalone RHT+amax kernel prevents fusion.
Describe the solution you'd like
Estimate the post-RHT amax from the pre-RHT amax with a linear function, which eliminates the RHT+amax kernel. Make this feature optional, and make its hyperparameters (the amax estimation scale) tunable.
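To make the request concrete, here is a minimal pure-Python sketch of the idea, not a proposed implementation. It assumes RHT means a random Hadamard transform (random sign flips followed by an orthonormal Walsh-Hadamard transform) and that the linear function is a single multiplicative scale with no offset. The names `estimate_post_rht_amax`, `random_hadamard_transform`, and `scale` are hypothetical and do not refer to any existing API.

```python
import math
import random


def estimate_post_rht_amax(pre_rht_amax: float, scale: float = 1.0) -> float:
    """Hypothetical linear estimate: post-RHT amax ~= scale * pre-RHT amax.

    `scale` stands in for the tunable "amax estimation scale" hyperparameter.
    """
    return scale * pre_rht_amax


def random_hadamard_transform(x, signs):
    """Reference RHT (assumed form): sign flips, then an orthonormal
    fast Walsh-Hadamard transform. len(x) must be a power of two."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = [v * s for v, s in zip(x, signs)]
    h = 1
    while h < n:
        # Butterfly step: combine pairs of elements that are h apart.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * norm for v in y]


def amax(x):
    """Maximum absolute value of a sequence."""
    return max(abs(v) for v in x)


# Compare the exact post-RHT amax with the linear estimate on toy data.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(16)]
signs = [rng.choice((-1.0, 1.0)) for _ in range(16)]

exact = amax(random_hadamard_transform(x, signs))      # needs the transform
estimate = estimate_post_rht_amax(amax(x), scale=1.0)  # needs only pre-RHT amax
print(f"exact={exact:.4f} estimate={estimate:.4f}")
```

The estimate only needs the pre-RHT amax, so in this sketch the transform is no longer required just to obtain the amax. How well a fixed `scale` tracks the exact value depends on the data distribution, which is why the scale should stay tunable and the feature optional.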
Validation
Validate LM loss with dense and MoE models. Ensure that convergence is the same as or better than with the existing RHT+amax kernel.