Could we replace the CLIP encoder with the SAM encoder in CAEv2? Do you think this could result in better outcomes? Or is the SAM encoder better in certain domains, such as high-resolution scenarios? I would like to know your opinion.
Thanks in advance.