Dear authors,
Thank you for your work and the release of the d-cube dataset.
I was trying to run a pre-trained OWL-ViT model (e.g. "google/owlvit-base-patch32") on the dataset and found that the following descriptions raise a RuntimeError:
ID: 140, TEXT: "a person who wears a hat and holds a tennis racket on the tennis court",
ID: 146, TEXT: "the player who is ready to bat with both feet leaving the ground in the room",
ID: 253, TEXT: "a person who plays music with musical instrument surrounded by spectators on the street",
ID: 342, TEXT: "a fisher who stands on the shore and whose lower body is not submerged by water",
ID: 348, TEXT: "a person who stands on the stage for speech but don't open their mouths",
ID: 355, TEXT: "a person with a pen in one hand but not looking at the paper",
ID: 356, TEXT: "a billiard ball with no numbers or patterns on its surface on the table",
ID: 364, TEXT: "a person standing at the table of table tennis who is not waving table tennis rackets",
ID: 404, TEXT: "a water polo player who is in the water but does not hold the ball",
ID: 405, TEXT: "a barbell held by a weightlifter that has not been lifted above the head",
ID: 412, TEXT: "a person who wears a helmet and sling equipment but is not on the sling",
ID: 419, TEXT: "person who kneels on one knee and proposes but has nothing in his hand"
A typical error message is shown at the bottom. It seems that the pre-trained model uses max_position_embeddings = 16 in OwlViTTextConfig, which is too short to accept the descriptions above as inputs. All the OWL-ViT checkpoints available on Hugging Face appear to use max_position_embeddings = 16 (a quick check is included below). Did you encounter the same issue when running the experiments for the paper? If so, how did you handle it in the evaluation process?
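For reference, here is a minimal sketch of how I checked this, assuming the standard transformers API (the three checkpoint names are the public OWL-ViT releases I looked at; the example sentence is ID 140 from the list above):

```python
from transformers import OwlViTConfig, OwlViTProcessor

# Every public OWL-ViT checkpoint I checked shares the 16-position limit
for name in [
    "google/owlvit-base-patch32",
    "google/owlvit-base-patch16",
    "google/owlvit-large-patch14",
]:
    cfg = OwlViTConfig.from_pretrained(name)
    print(name, cfg.text_config.max_position_embeddings)  # 16 for all three

# The failing description tokenizes to more than 16 ids (incl. BOS/EOS),
# which is what triggers the size mismatch in the position embeddings
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
text = "a person who wears a hat and holds a tennis racket on the tennis court"
ids = processor.tokenizer(text).input_ids
print(len(ids))  # > 16
```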
Thanks in advance.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[129], line 1
----> 1 results = get_prediction(processor, model, image, [text_list[0]])
Cell In[11], line 13, in get_prediction(processor, model, image, captions, cpu_only)
9 with torch.no_grad():
10 inputs = processor(text=[captions], images=image, return_tensors="pt").to(
11 device
12 )
---> 13 outputs = model(**inputs)
14 target_size = torch.Tensor([image.size[::-1]]).to(device)
15 results = processor.post_process_object_detection(
16 outputs=outputs, target_sizes=target_size, threshold=0.05
17 )
File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:1640, in OwlViTForObjectDetection.forward(self, input_ids, pixel_values, attention_mask, output_attentions, output_hidden_states, return_dict)
1637 return_dict = return_dict if return_dict is not None else self.config.return_dict
1639 # Embed images and text queries
-> 1640 query_embeds, feature_map, outputs = self.image_text_embedder(
1641 input_ids=input_ids,
1642 pixel_values=pixel_values,
1643 attention_mask=attention_mask,
1644 output_attentions=output_attentions,
1645 output_hidden_states=output_hidden_states,
1646 )
1648 # Text and vision model outputs
1649 text_outputs = outputs.text_model_output
File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:1385, in OwlViTForObjectDetection.image_text_embedder(self, input_ids, pixel_values, attention_mask, output_attentions, output_hidden_states)
1376 def image_text_embedder(
1377 self,
1378 input_ids: torch.Tensor,
(...)
1383 ) -> Tuple[torch.FloatTensor]:
1384 # Encode text and image
-> 1385 outputs = self.owlvit(
1386 pixel_values=pixel_values,
1387 input_ids=input_ids,
1388 attention_mask=attention_mask,
1389 output_attentions=output_attentions,
1390 output_hidden_states=output_hidden_states,
1391 return_dict=True,
1392 )
1394 # Get image embeddings
1395 last_hidden_state = outputs.vision_model_output[0]
File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:1163, in OwlViTModel.forward(self, input_ids, pixel_values, attention_mask, return_loss, output_attentions, output_hidden_states, return_base_image_embeds, return_dict)
1155 vision_outputs = self.vision_model(
1156 pixel_values=pixel_values,
1157 output_attentions=output_attentions,
1158 output_hidden_states=output_hidden_states,
1159 return_dict=return_dict,
1160 )
1162 # Get embeddings for all text queries in all batch samples
-> 1163 text_outputs = self.text_model(
1164 input_ids=input_ids,
1165 attention_mask=attention_mask,
1166 output_attentions=output_attentions,
1167 output_hidden_states=output_hidden_states,
1168 return_dict=return_dict,
1169 )
1171 text_embeds = text_outputs[1]
1172 text_embeds = self.text_projection(text_embeds)
File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:798, in OwlViTTextTransformer.forward(self, input_ids, attention_mask, position_ids, output_attentions, output_hidden_states, return_dict)
796 input_shape = input_ids.size()
797 input_ids = input_ids.view(-1, input_shape[-1])
--> 798 hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
800 # num_samples, seq_len = input_shape where num_samples = batch_size * num_max_text_queries
801 # OWLVIT's text model uses causal mask, prepare it here.
802 # https://github.com/openai/CLIP/blob/cfcffb90e69f37bf2ff1e988237a0fbe41f33c04/clip/model.py#L324
803 causal_attention_mask = _create_4d_causal_attention_mask(
804 input_shape, hidden_states.dtype, device=hidden_states.device
805 )
File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:332, in OwlViTTextEmbeddings.forward(self, input_ids, position_ids, inputs_embeds)
329 inputs_embeds = self.token_embedding(input_ids)
331 position_embeddings = self.position_embedding(position_ids)
--> 332 embeddings = inputs_embeds + position_embeddings
334 return embeddings
RuntimeError: The size of tensor a (18) must match the size of tensor b (16) at non-singleton dimension 1
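P.S. For what it's worth, the crash goes away on my side if I let the tokenizer truncate the queries. This is only a guess at a workaround, since it silently drops everything past the 16-token limit, so it may well not match your evaluation protocol. A minimal sketch (placeholder image; I believe the processor forwards extra kwargs to the underlying CLIP tokenizer):

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

model_name = "google/owlvit-base-patch32"
processor = OwlViTProcessor.from_pretrained(model_name)
model = OwlViTForObjectDetection.from_pretrained(model_name)

image = Image.new("RGB", (640, 480))  # placeholder; any PIL image works
captions = ["a person who wears a hat and holds a tennis racket on the tennis court"]

with torch.no_grad():
    # truncation=True is forwarded to the tokenizer and clips each
    # query to the model's 16-token maximum, avoiding the size mismatch
    inputs = processor(
        text=[captions], images=image, return_tensors="pt", truncation=True
    )
    outputs = model(**inputs)  # no longer raises
```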