The checkpoints you provide on ModelScope and Hugging Face include only the LLM weights. The cross-attention and visual parts are missing, so we cannot reproduce your experiments from the released checkpoint. We hope you can release the complete model. Also, if the model turns out to be what the PNG image shows, it would be great work.