Hi all,
I've recently been working on training PaddleOCR-VL-0.9B, and I found a weird thing: when gradients are passed back from one transformer layer to the previous one, part of the gradients gets lost (maybe due to the flash mask technique) with the transformers family.
How can I resolve it? Do you have any ideas?
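
In case it helps to reproduce or narrow this down, here is a minimal sketch of how the gradient norm arriving at each layer's output could be logged, assuming the model runs on PyTorch. The helper name `track_hidden_grad_norms` and the toy layer stack are hypothetical stand-ins, not part of PaddleOCR-VL or transformers; with the real model you would pass its list of decoder layers instead.

```python
import torch
import torch.nn as nn


def track_hidden_grad_norms(layers):
    """Record the norm of the gradient that reaches each layer's output."""
    norms = {}
    handles = []

    def make_forward_hook(idx):
        def forward_hook(module, inputs, output):
            # Assumption: if a layer returns a tuple, the hidden state is first.
            hidden = output[0] if isinstance(output, tuple) else output
            if hidden.requires_grad:
                # The tensor hook fires during backward with d(loss)/d(hidden).
                hidden.register_hook(
                    lambda g: norms.__setitem__(idx, g.norm().item())
                )
        return forward_hook

    for idx, layer in enumerate(layers):
        handles.append(layer.register_forward_hook(make_forward_hook(idx)))
    return norms, handles


torch.manual_seed(0)
# Toy stand-in for the layer stack (hypothetical, not the real architecture).
toy_layers = nn.ModuleList(
    [nn.Sequential(nn.Linear(16, 16), nn.Tanh()) for _ in range(4)]
)
norms, handles = track_hidden_grad_norms(toy_layers)

h = torch.randn(2, 8, 16)
for layer in toy_layers:
    h = layer(h)
h.sum().backward()

for idx in sorted(norms):
    print(f"layer {idx}: grad norm at output = {norms[idx]:.6f}")

# Remove the hooks so they do not affect later training steps.
for handle in handles:
    handle.remove()
```

Running this once with the default attention path and once with a plain (non-flash) attention implementation on the same batch, then comparing the per-layer norms, should show at which layer the two start to differ, if they do.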