In the attention.py demo, the get_attention_by_gradcam method takes both image_input and text_input as inputs. I would like to know why text_input is the one chosen for processing (see the mask line). The demo is shown below.
    def get_attention_by_gradcam(self, model, tokenizer, image_path, image_input, text_input, attr_name, target_layer):
        encoder_name = getattr(model, attr_name, None)
        encoder_name.encoder.layer[target_layer].crossattention.self.save_attention = True
        output = model(image_input, text_input)
        loss = output[:, 1].sum()
        model.zero_grad()
        loss.backward()
        image_size = 256
        temp = int(np.sqrt(image_size))
        # the effect of mask is let those padding tokens multiply with 0 so that they won't be calculated in cams and
        # grads , because of the text preprocess of ALBEF and TCL, mask is unuseful here
        mask = text_input.attention_mask.view(text_input.attention_mask.size(0), 1, -1, 1, 1)
        grads = encoder_name.encoder.layer[target_layer].crossattention.self.get_attn_gradients()
        cams = encoder_name.encoder.layer[target_layer].crossattention.self.get_attention_map()
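To make the mask line concrete, here is a minimal NumPy sketch of how a mask of shape (batch, 1, tokens, 1, 1) would broadcast against a cross-attention map. It is not the demo's code. All sizes are hypothetical, and it assumes the map has shape (batch, heads, text tokens, 1 + image patches) with a leading image [CLS] column that is dropped, which this issue does not confirm.

```python
import numpy as np

# Hypothetical sizes for illustration only (not taken from the demo).
batch, heads, text_len, num_patches = 1, 12, 5, 256
side = int(np.sqrt(num_patches))  # 16, same role as `temp` in the demo

rng = np.random.default_rng(0)
# Assumed layout: (batch, heads, text tokens, 1 + image patches);
# the extra column stands in for an image [CLS] token.
cams = rng.random((batch, heads, text_len, num_patches + 1))
grads = rng.standard_normal((batch, heads, text_len, num_patches + 1))

# 1 marks a real text token, 0 marks padding (here the last token is padding).
attention_mask = np.array([[1, 1, 1, 1, 0]])

# Same reshape as the `mask = ...view(...)` line: (batch, 1, text_len, 1, 1).
mask = attention_mask.reshape(attention_mask.shape[0], 1, -1, 1, 1)

# Drop the [CLS] column, fold the patches into a side x side grid, apply the mask.
cams = cams[:, :, :, 1:].reshape(batch, heads, text_len, side, side) * mask
grads = np.clip(grads[:, :, :, 1:], 0, None).reshape(batch, heads, text_len, side, side) * mask

# One Grad-CAM map per text token, averaged over attention heads.
gradcam = (cams * grads).mean(axis=1)
print(gradcam.shape)
```

In this sketch the mask comes from the text side because the rows of the map index text tokens, so padded tokens end up with an all-zero map. Whether the demo's maps follow this layout is an assumption here.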
A related question concerns the ALBEF attention demo, where attr_name is set to 'text_encoder'. The demo is shown below.
    def getAttMap(self, image_path, text):
        if self.model_name.lower() == 'albef':
            engine = ALBEF('ALBEF_4M.pth')
            model, tokenizer = engine.load_model(engine.model_id)
            image_input = engine.load_data(src_type='local', data=[image_path])[0]
            text_input = tokenizer(engine.pre_caption(text), return_tensors="pt")
            self.get_attention_by_gradcam(model, tokenizer, image_path, image_input, text_input,
                                          attr_name='text_encoder', target_layer=8)