Including multiple modalities in the learned tasks improves the robustness and generalizability of the learned representations