How did you choose the pretraining datasets?

As mentioned in your appendix, your image-text pretraining datasets include MIMIC-CXR, PMC-OA, Quilt, LLaVA-Med, OpenI, and MM-Retinal. Some of these are domain-specific, while others, such as PMC-OA and LLaVA-Med, are general medical data. How do you know that PMC-OA does not overlap with the domain-specific datasets?