During training, the LlamaBiModel class is used, which overrides the _update_causal_mask method of LlamaModel.
However, I noticed that the public model on HuggingFace is loaded for inference with the LlamaEncoderModel class from modeling_llama_encoder.py, which instead overrides the forward method of LlamaModel.
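For context, here is a minimal sketch of how I currently picture the two approaches. The class names and bodies below are my own simplification for illustration, not the actual code from the repo, and the 4D-mask passthrough in the second class is the part I'm least sure about:

```python
# My own simplified sketch, NOT the actual LlamaBiModel / LlamaEncoderModel code.
import torch
from transformers.models.llama.modeling_llama import LlamaModel


class SketchBiLlama(LlamaModel):
    """Roughly how I read the training-side approach: suppress the causal mask
    at the point where it is built."""

    def _update_causal_mask(self, attention_mask, input_tensor, *args, **kwargs):
        # Returning None (ignoring padding for simplicity) means the decoder
        # layers receive no causal mask, i.e. attention becomes bidirectional.
        return None


class SketchEncoderLlama(LlamaModel):
    """Roughly how I imagine the inference-side approach: intervene in forward
    and hand the model a ready-made non-causal 4D mask."""

    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        if attention_mask is not None and attention_mask.dim() == 2:
            batch, seq_len = attention_mask.shape
            dtype = self.embed_tokens.weight.dtype
            min_val = torch.finfo(dtype).min
            # Build an additive (batch, 1, seq, seq) mask that only masks padding
            # and has no causal component; recent transformers versions use an
            # already-4D mask as-is instead of adding causal masking on top.
            pad = attention_mask[:, None, None, :].to(dtype)   # (batch, 1, 1, seq)
            additive = (1.0 - pad) * min_val
            attention_mask = additive.expand(batch, 1, seq_len, seq_len)
        return super().forward(input_ids=input_ids, attention_mask=attention_mask, **kwargs)
```

If my reading is right, both variants end up removing the causal mask, just at different points in the call stack.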
Why is there a difference between the training and inference code paths? Are the two classes functionally equivalent? I'm a bit confused.