[OFT] Linear scaling for constant learning #1231
Draft
An experimental, sketch-stage feature for OFT.
Example:
My OFT:
The Issue
If we leave this alone, the 160-block layers will dominate training and the 32-block layers will effectively be frozen, because their gradients are tiny in comparison.
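A toy illustration of the imbalance (my own sketch, not code from this PR), assuming the block size is held fixed and only the block count differs: with the same per-parameter scale, a 160-block update carries roughly sqrt(160/32) ≈ 2.24× more total "energy" than a 32-block one.

```python
import numpy as np

# Toy illustration (mine, not code from this PR): block-diagonal updates with
# identical per-parameter std but different block counts. With the block size
# held fixed, the 160-block update carries far more total "energy" than the
# 32-block one, so the smaller layer barely moves during training.
rng = np.random.default_rng(0)
block_size = 8  # hypothetical fixed block size

def total_update_norm(n_blocks):
    blocks = rng.normal(0.0, 0.02, size=(n_blocks, block_size, block_size))
    return float(np.sqrt((blocks ** 2).sum()))

print(total_update_norm(160) / total_update_norm(32))  # ~sqrt(160/32) ≈ 2.24
```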
How to Align (The Solution)
To make the 32-block layer behave like the 160-block layer, we need to boost the 32-block layer: the "energy" (variance) of the small-block layer's update should match that of the large-block layer.
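A minimal sketch of what that alignment could look like (the helper name `compute_oft_scales` and the layer names are mine, not the PEFT API), assuming each layer's boost is the square root of the block-count ratio relative to the layer with the most blocks:

```python
import math

# Hedged sketch (not the actual PEFT API): compute a per-layer multiplier so
# every layer matches the "energy" of the layer with the most blocks.
# Assumption: with a fixed block size, a layer's total update energy grows
# like sqrt(number_of_blocks), so smaller layers get boosted by
# sqrt(n_ref / n_layer).
def compute_oft_scales(block_counts):
    n_ref = max(block_counts.values())  # reference: the 160-block layer
    return {name: math.sqrt(n_ref / n) for name, n in block_counts.items()}

# Hypothetical layer names, just for illustration.
scales = compute_oft_scales({"attn.q_proj": 160, "mlp.down_proj": 32})
print(scales)  # {'attn.q_proj': 1.0, 'mlp.down_proj': 2.236...}
```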
The Calculation
By multiplying the weights of the 32-block layer by 2.23, we ensure it rotates the inputs with the same intensity as the 160-block layer.
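Reading the 2.23 as the square root of the block-count ratio (an assumption on my part; the derivation isn't spelled out above):

$$
\text{scale} = \sqrt{\frac{160}{32}} = \sqrt{5} \approx 2.236
$$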
TODO