Commit 0a13cf7: Update README.md (parent 48acd7a)


README.md (34 additions, 5 deletions)
MaxFactor is best described as a thoughtful integration of existing optimization techniques, with specific implementation choices tailored for encoder-decoder ASR transformer models. It combines proven techniques from several established algorithms, with implementation details tuned for the transformer architectures used in speech recognition.

The optimizer makes practical engineering tradeoffs that work well empirically for speech recognition models. Its particular combination of approaches addresses practical challenges in training large speech and multimodal LLMs.

#### MaxFactor Family Tree

```
Gradient Clipping
MaxFactor
└── Combines all of the above features, with a couple of unique twists (and FAM)
```

#### Memory Usage (relative to AdamW)

MaxFactor uses **25.1% less memory** than AdamW while maintaining comparable memory efficiency to SGD (difference <0.1%).
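The ~25% figure is consistent with simple per-parameter bookkeeping: AdamW keeps two extra fp32 state tensors per parameter (`exp_avg` and `exp_avg_sq`), while an SGD-with-momentum-style footprint keeps one. A back-of-the-envelope sketch (the parameter count and fp32 assumption are illustrative, and this is not MaxFactor's actual state layout):

```python
def training_state_bytes(n_params, n_state_slots, bytes_per=4):
    # Rough training footprint: weights + gradients + optimizer state slots,
    # all assumed to be fp32 tensors the same size as the parameters.
    return n_params * bytes_per * (2 + n_state_slots)

n = 100_000_000  # illustrative ASR-transformer-sized model
adamw = training_state_bytes(n, 2)     # exp_avg + exp_avg_sq
sgd_like = training_state_bytes(n, 1)  # single momentum buffer
print(f"{1 - sgd_like / adamw:.0%}")   # -> 25%
```

An optimizer that matches SGD's one-slot footprint would therefore save about a quarter of total training memory relative to AdamW, consistent with the figure above.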

### Key Advantages

- **Superior accuracy** on simple tasks (MNIST)
- **Competitive accuracy** on complex tasks, significantly outperforming Adam/AdamW
- **Faster convergence** than SGD on some datasets
- **Memory efficiency** matching SGD, using ~25% less memory than Adam/AdamW
- **Stable optimization** across different model architectures and datasets

### When to Use MaxFactor

MaxFactor is particularly valuable for:

- Memory-constrained environments
- Complex datasets where Adam/AdamW underperform
- Speech recognition and other audio processing tasks
- Scenarios requiring a balance of accuracy and efficiency

Coming soon: Frequency-Adaptive Momentum (FAM), an experimental approach specifically designed for speech recognition tasks that adapts momentum based on the frequency characteristics of gradient updates.

### Frequency-Adaptive Momentum (FAM)

#### Core Concept

- Speech signals have inherent frequency structure, with different parts of the model responding to different frequency bands. The frequency structure of speech doesn't just disappear when converted to log-mel spectrograms; it is transformed and preserved in ways that the model's parameters adapt to capture.
- The chain of frequency information: Original Audio → Log-Mel Spectrogram → Encoder Parameters → Gradient Updates.
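The "frequency" view of gradient updates can be made concrete with a small NumPy sketch. The helper name and the coarse banding scheme below are hypothetical illustrations, not part of MaxFactor:

```python
import numpy as np

def frequency_signature(grad, n_bands=4):
    # Hypothetical helper: split the magnitude spectrum of a flattened
    # gradient into coarse bands and return each band's share of energy.
    spectrum = np.abs(np.fft.rfft(np.asarray(grad).ravel()))
    bands = np.array_split(spectrum, n_bands)
    energy = np.array([(b ** 2).sum() for b in bands])
    return energy / energy.sum()

# A smooth, slowly varying gradient concentrates energy in the lowest band.
g = np.sin(np.linspace(0, 2 * np.pi, 256))
sig = frequency_signature(g)
print(sig.argmax())  # -> 0 (lowest band dominates)
```

The resulting per-band energy fractions are one plausible "signature" an optimizer could condition on.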
- The model inherently develops a hierarchical representation, from acoustic features to phonetic units to words.
- The idea is to integrate a momentum scheme that adapts based on the "frequency signature" of gradient updates.

#### Why This Optimizer Makes Sense

What's compelling about the Frequency-Adaptive Momentum approach is that it acknowledges this structure in the optimization process itself. Rather than treating all gradient dimensions equally, it recognizes that:

- **Gradient Frequencies Matter:** The Fourier transform of gradient updates reveals patterns related to what the model is currently learning.
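The text describes the observation but not a concrete rule. One hedged sketch of how a momentum coefficient could adapt to a gradient's spectrum — the function name, cutoff, and shift range are all illustrative assumptions, not MaxFactor's actual FAM rule:

```python
import numpy as np

def fam_momentum(beta_base, grad, hf_cut=0.5, max_shift=0.05):
    # Sketch: lower beta when gradient energy sits in high frequencies
    # (noisy updates), raise it when low frequencies dominate (a consistent
    # descent direction). Thresholds are illustrative assumptions.
    spectrum = np.abs(np.fft.rfft(np.asarray(grad).ravel())) ** 2
    cut = int(len(spectrum) * hf_cut)
    hf_frac = spectrum[cut:].sum() / spectrum.sum()
    # Shift beta within [beta_base - max_shift, beta_base + max_shift].
    return beta_base + max_shift * (1.0 - 2.0 * hf_frac)

g_smooth = np.sin(np.linspace(0, 2 * np.pi, 256))           # low-frequency
g_noisy = np.random.default_rng(0).standard_normal(256)      # broadband
print(fam_momentum(0.9, g_smooth) > fam_momentum(0.9, g_noisy))  # -> True
```

A smooth gradient gets momentum near the upper end of the range, while a noise-like gradient is pulled toward the lower end.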
