Hi, thank you for your great work and for sharing it! I have a question about how LLaMA 3.1-8B was trained with MoBA.
Did you use data-level MoBA/full hybrid training, i.e. training on 90% of the data with MoBA and the remaining 10% with full attention?
I’d appreciate any clarification on this. Thanks again for your contributions!