Great work! The demo at https://huggingface.co/spaces/P3ngLiu/FO1_VS_SAM3_DEMO is impressive. Does it use VLM-FO1+SAM3 or VLM-FO1+UPN? As far as I can tell, the concept detection ability is inherited from SAM3, yet the demo performs much better than the provided script inference_with_sam3.py does.