Hi,
I am currently working on applying the VCD mitigation method to the LLaVA model on the POPE benchmark. The questions in the POPE benchmark are all binary (Yes/No), for example:
{"question_id": 1, "image": "COCO_val2014_000000016631.jpg", "text": "Is there a person in the image?", "label": "yes"}
{"question_id": 2, "image": "COCO_val2014_000000016631.jpg", "text": "Is there a refrigerator in the image?", "label": "no"}

Is it possible to use VCD to further improve LLaVA's performance on these types of binary questions?
If so, could you provide guidance on how to implement it?
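For context on what I am trying to reproduce: as I understand VCD, it contrasts the next-token logits computed from the original image with logits computed from a distorted (e.g. noised) copy of it, roughly (1 + α)·logits_original − α·logits_distorted, subject to an adaptive plausibility cutoff on the original distribution. Below is a minimal, model-free sketch of just that logit combination; the names `vcd_adjust`, `alpha`, and `beta` are my own, not taken from the VCD repository:

```python
import numpy as np

def vcd_adjust(logits, logits_distorted, alpha=1.0, beta=0.1):
    """Sketch of a VCD-style contrastive logit adjustment.

    Combines next-token logits from the original image with logits
    from a distorted copy, then masks tokens that fall below an
    adaptive plausibility threshold on the original distribution.
    """
    # Contrastive combination: amplify what the clean image supports
    # and suppress what the distorted image (i.e. language priors)
    # also supports.
    contrast = (1.0 + alpha) * logits - alpha * logits_distorted

    # Adaptive plausibility constraint: only keep tokens whose
    # probability under the *original* logits is at least beta times
    # the probability of the most likely token.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    keep = probs >= beta * probs.max()
    contrast[~keep] = -np.inf
    return contrast
```

In the real model this would run inside the decoding loop (e.g. as a custom logits processor), with `logits_distorted` coming from a second forward pass on the noised image.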
Below is the current implementation I am using for LLaVA to generate answers from an image:
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)
processor = AutoProcessor.from_pretrained(model_id)

# question_text and raw_image come from the POPE entry being evaluated
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": question_text},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt
answer_text = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
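For scoring against POPE's yes/no labels, I currently normalize the decoded text with a small helper of my own (not part of the benchmark code; treating anything that does not start with "yes" as "no" is my own simplification):

```python
def to_pope_label(answer_text: str) -> str:
    """Map a free-form model answer to a POPE-style "yes"/"no" label.

    Looks only at the leading word of the answer; anything that does
    not clearly start with "yes" is treated as "no".
    """
    words = answer_text.strip().lower().split()
    if words and words[0].strip(".,!:;") == "yes":
        return "yes"
    return "no"
```

This keeps the accuracy/F1 computation simple, since every prediction becomes a binary label that can be compared directly to the "label" field in the benchmark JSON.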