
OWL-ViT pre-trained models cannot accept some of the longest descriptions #13

@HarukiNishimura-TRI

Description


Dear authors,

Thank you for your work and the release of the d-cube dataset.

I was trying to run a pre-trained OWL-ViT model (e.g. "google/owlvit-base-patch32") on the dataset, and found that the following sentences yield a RuntimeError.

 ID: 140, TEXT: "a person who wears a hat and holds a tennis racket on the tennis court"
 ID: 146, TEXT: "the player who is ready to bat with both feet leaving the ground in the room",
 ID: 253, TEXT: "a person who plays music with musical instrument surrounded by spectators on the street",
 ID: 342, TEXT: "a fisher who stands on the shore and whose lower body is not submerged by water",
 ID: 348, TEXT: "a person who stands on the stage for speech but don't open their mouths",
 ID: 355, TEXT: "a person with a pen in one hand but not looking at the paper",
 ID: 356, TEXT: "a billiard ball with no numbers or patterns on its surface on the table",
 ID: 364, TEXT: "a person standing at the table of table tennis who is not waving table tennis rackets",
 ID: 404, TEXT: "a water polo player who is in the water but does not hold the ball",
 ID: 405, TEXT: "a barbell held by a weightlifter that has not been lifted above the head",
 ID: 412, TEXT: "a person who wears a helmet and sling equipment but is not on the sling",
 ID: 419, TEXT: "person who kneels on one knee and proposes but has nothing in his hand"

A typical error message is shown at the bottom. It seems that the pre-trained model uses max_position_embeddings = 16 in OwlViTTextConfig, which is not long enough to accept the descriptions above as input. All the models available on Hugging Face appear to use max_position_embeddings = 16. Did you encounter the same issue when running the experiments for your paper? If so, how did you handle it in the evaluation process?
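If truncating the over-long queries is acceptable for evaluation, one possible workaround is to clip each tokenized query so that the end-of-text token stays in the final position. A minimal sketch (assumptions: 49407 is CLIP's `<|endoftext|>` token id, and the processor forwards tokenizer kwargs such as `truncation=True`):

```python
def truncate_to_context(token_ids, max_len=16, eot_id=49407):
    """Clip a tokenized query to the text encoder's context length,
    keeping the end-of-text token in the last position (mirrors what
    truncation does in CLIP-style tokenizers)."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    return list(token_ids[: max_len - 1]) + [eot_id]


# If OwlViTProcessor forwards kwargs to its tokenizer, the same effect
# may be achievable in one line (untested assumption):
# inputs = processor(text=[captions], images=image,
#                    return_tensors="pt", truncation=True)
```

Truncation obviously loses the tail of the description, which may matter for these negation-heavy queries ("but does not hold the ball"), so it is a stopgap rather than a fix.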

Thanks in advance.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[129], line 1
----> 1 results = get_prediction(processor, model, image, [text_list[0]])

Cell In[11], line 13, in get_prediction(processor, model, image, captions, cpu_only)
      9 with torch.no_grad():
     10     inputs = processor(text=[captions], images=image, return_tensors="pt").to(
     11         device
     12     )
---> 13     outputs = model(**inputs)
     14 target_size = torch.Tensor([image.size[::-1]]).to(device)
     15 results = processor.post_process_object_detection(
     16     outputs=outputs, target_sizes=target_size, threshold=0.05
     17 )

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:1640, in OwlViTForObjectDetection.forward(self, input_ids, pixel_values, attention_mask, output_attentions, output_hidden_states, return_dict)
   1637 return_dict = return_dict if return_dict is not None else self.config.return_dict
   1639 # Embed images and text queries
-> 1640 query_embeds, feature_map, outputs = self.image_text_embedder(
   1641     input_ids=input_ids,
   1642     pixel_values=pixel_values,
   1643     attention_mask=attention_mask,
   1644     output_attentions=output_attentions,
   1645     output_hidden_states=output_hidden_states,
   1646 )
   1648 # Text and vision model outputs
   1649 text_outputs = outputs.text_model_output

File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:1385, in OwlViTForObjectDetection.image_text_embedder(self, input_ids, pixel_values, attention_mask, output_attentions, output_hidden_states)
   1376 def image_text_embedder(
   1377     self,
   1378     input_ids: torch.Tensor,
   (...)
   1383 ) -> Tuple[torch.FloatTensor]:
   1384     # Encode text and image
-> 1385     outputs = self.owlvit(
   1386         pixel_values=pixel_values,
   1387         input_ids=input_ids,
   1388         attention_mask=attention_mask,
   1389         output_attentions=output_attentions,
   1390         output_hidden_states=output_hidden_states,
   1391         return_dict=True,
   1392     )
   1394     # Get image embeddings
   1395     last_hidden_state = outputs.vision_model_output[0]

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:1163, in OwlViTModel.forward(self, input_ids, pixel_values, attention_mask, return_loss, output_attentions, output_hidden_states, return_base_image_embeds, return_dict)
   1155 vision_outputs = self.vision_model(
   1156     pixel_values=pixel_values,
   1157     output_attentions=output_attentions,
   1158     output_hidden_states=output_hidden_states,
   1159     return_dict=return_dict,
   1160 )
   1162 # Get embeddings for all text queries in all batch samples
-> 1163 text_outputs = self.text_model(
   1164     input_ids=input_ids,
   1165     attention_mask=attention_mask,
   1166     output_attentions=output_attentions,
   1167     output_hidden_states=output_hidden_states,
   1168     return_dict=return_dict,
   1169 )
   1171 text_embeds = text_outputs[1]
   1172 text_embeds = self.text_projection(text_embeds)

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:798, in OwlViTTextTransformer.forward(self, input_ids, attention_mask, position_ids, output_attentions, output_hidden_states, return_dict)
    796 input_shape = input_ids.size()
    797 input_ids = input_ids.view(-1, input_shape[-1])
--> 798 hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
    800 # num_samples, seq_len = input_shape  where num_samples = batch_size * num_max_text_queries
    801 # OWLVIT's text model uses causal mask, prepare it here.
    802 # https://github.com/openai/CLIP/blob/cfcffb90e69f37bf2ff1e988237a0fbe41f33c04/clip/model.py#L324
    803 causal_attention_mask = _create_4d_causal_attention_mask(
    804     input_shape, hidden_states.dtype, device=hidden_states.device
    805 )

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:332, in OwlViTTextEmbeddings.forward(self, input_ids, position_ids, inputs_embeds)
    329     inputs_embeds = self.token_embedding(input_ids)
    331 position_embeddings = self.position_embedding(position_ids)
--> 332 embeddings = inputs_embeds + position_embeddings
    334 return embeddings

RuntimeError: The size of tensor a (18) must match the size of tensor b (16) at non-singleton dimension 1
