Has anyone seen incorrect backpropagation during training?

Hi all,
I have recently been working on training PaddleOCR-VL-0.9B, and I found something weird: with the transformers family, the gradients passed back from one transformer layer to the previous one lose part of their values (maybe due to the flash mask technique).
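
One way to confirm that part of a gradient is really missing is to compare the gradient a layer passes back against a finite-difference estimate. Below is a minimal sketch in plain Python, assuming a toy two-layer chain; the layer functions, weights, and the zeroed entry are all hypothetical and are not PaddleOCR-VL code.

```python
import math

def layer1(x, w1):
    # Toy "previous layer": elementwise tanh(w1 * x). Hypothetical stand-in.
    return [math.tanh(w1 * xi) for xi in x]

def layer2(h, w2):
    # Toy "next layer": weighted sum of the hidden values. Hypothetical stand-in.
    return sum(w2j * hj for w2j, hj in zip(w2, h))

def loss_from_hidden(h, w2):
    # Scalar loss as a function of the hidden activations h.
    y = layer2(h, w2)
    return 0.5 * y * y

def analytic_grad_hidden(h, w2):
    # dL/dh_j = y * w2_j: the gradient layer2 should pass back to layer1.
    y = layer2(h, w2)
    return [y * w2j for w2j in w2]

def numeric_grad_hidden(h, w2, eps=1e-6):
    # Central finite differences on each hidden value.
    grads = []
    for j in range(len(h)):
        h_plus = list(h)
        h_plus[j] += eps
        h_minus = list(h)
        h_minus[j] -= eps
        diff = loss_from_hidden(h_plus, w2) - loss_from_hidden(h_minus, w2)
        grads.append(diff / (2 * eps))
    return grads

def max_abs_diff(a, b):
    return max(abs(u - v) for u, v in zip(a, b))

x = [0.5, -1.0, 2.0]
w1 = 0.7
w2 = [0.3, -0.2, 0.9]

h = layer1(x, w1)
g_analytic = analytic_grad_hidden(h, w2)
g_numeric = numeric_grad_hidden(h, w2)

# Simulate a "lost" gradient by zeroing one position (assumption for illustration).
g_dropped = list(g_analytic)
g_dropped[1] = 0.0

print("correct backward, max diff:", max_abs_diff(g_analytic, g_numeric))
print("dropped entry, max diff:  ", max_abs_diff(g_dropped, g_numeric))
```

A near-zero difference means the backward pass agrees with the numerical estimate, while a large difference at specific positions points to where the gradient is being dropped. On a real model the same comparison would be run on the actual layer inputs, and positions that are intentionally masked out would be expected to show zero gradient.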

How can I resolve this? Does anyone have any ideas?