Could the SAM encoder be better than the CLIP encoder?

Could we replace the CLIP encoder with the SAM encoder in CAEv2? Do you think this would lead to better results? Or is the SAM encoder only better in certain settings, such as high-resolution scenarios? I would like to hear your opinion.

Thanks in advance.