
Multimodal-Vision-Transformer

Coding a vision language model from scratch

Batch Normalization

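The problem: drastic changes in the input distribution from one batch to the next -> the inputs to each layer change -> the loss changes -> the gradients change -> the weight updates change -> big swings in the weights of the network -> the network learns very slowly.

Batch normalization addresses the first step of this chain: it normalizes each feature over the batch to zero mean and unit variance, then applies a learnable scale and shift, so every layer sees inputs on a consistent scale.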
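To make the idea concrete, here is a minimal sketch of the batch normalization forward pass in plain Python. It is not code from this repository: the function name `batch_norm`, the parameters `gamma`, `beta` and `eps`, and the two toy batches are assumptions chosen for illustration. It covers only the training-time computation and leaves out the running statistics used at inference.

```python
import math

def batch_norm(batch, gamma=None, beta=None, eps=1e-5):
    """Normalize each feature (column) across the batch (rows).

    Hypothetical helper for illustration; gamma and beta stand in for the
    learnable scale and shift and default to 1 and 0.
    """
    n = len(batch)
    d = len(batch[0])
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d

    # Per-feature mean and (biased) variance over the batch dimension.
    mean = [sum(row[j] for row in batch) / n for j in range(d)]
    var = [sum((row[j] - mean[j]) ** 2 for row in batch) / n for j in range(d)]

    # Normalize, then apply the scale and shift.
    return [
        [gamma[j] * (row[j] - mean[j]) / math.sqrt(var[j] + eps) + beta[j]
         for j in range(d)]
        for row in batch
    ]

# Two toy batches whose raw scales differ by a factor of 100.
small = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
large = [[100.0, 200.0], [300.0, 600.0], [500.0, 1000.0]]

out_small = batch_norm(small)
out_large = batch_norm(large)
```

Although `small` and `large` differ drastically in scale, `out_small` and `out_large` come out nearly identical (they differ only through `eps`), which is the sense in which the next layer no longer sees the batch-to-batch change.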