Thank you for releasing this excellent model and for the accompanying work.
The paper appears to state that the 8B model was pre-trained natively with a 32K sequence length. Was any additional model parallelism used for this? If not, what other techniques, apart from FP8 training and InfLLMv2, were applied to reduce memory consumption? A sketch of the kind of technique I have in mind follows below.
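To make the question concrete, here is a minimal sketch of one such candidate technique, activation (gradient) checkpointing in plain PyTorch. This is only an illustration of what I mean by "additional techniques", not a claim about the actual training setup; the layer structure and hidden size below are scaled-down placeholders.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedBlock(nn.Module):
    """Wraps a block so its activations are recomputed in the backward pass
    instead of being stored, trading extra compute for lower memory."""

    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # use_reentrant=False is the recommended non-reentrant checkpoint path.
        return checkpoint(self.block, hidden_states, use_reentrant=False)


def make_block(d_model: int) -> nn.Module:
    # Toy stand-in for one transformer layer; a real attention + MLP block
    # would be wrapped the same way.
    return nn.Sequential(
        nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
    )


d_model = 1024  # placeholder, far smaller than the 8B model's hidden size
model = nn.Sequential(*[CheckpointedBlock(make_block(d_model)) for _ in range(4)])

x = torch.randn(1, 32768, d_model, requires_grad=True)  # a 32K-token sequence
loss = model(x).sum()
loss.backward()  # block activations are recomputed here rather than kept in memory
```

Was something along these lines (or ZeRO-style sharding, sequence parallelism, etc.) used, or was FP8 training plus InfLLMv2 sufficient on its own?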
Thank you.