Problem Description
1. The Hadamard transform is performed at float64 in prior art, while ours runs at bfloat16.
2. The weight transform could be conducted in place; there is no need to re-run the transform in each iteration of AR tuning.
3. Handle shared layers, e.g. MoE experts and fused QKV.
4. Use real randomness.
5. Fuse into AR block-wise tuning; otherwise RAM usage is high.
Reproduction Steps
~
Environment Information
~
Error Logs
Additional Context
No response