Caption Generation question

Hi, thanks for sharing your work. I am trying to run captioning with the quick_inference script, but I keep running into the following problem:
I used two different audio files, but the resulting outputs were highly similar. What could be the reason for this?
This is my inference command:
python quick_inference.py \
    --base_model /FusionAudio/models/Llama-2-7b-chat-hf-qformer \
    --model_path /FusionAudio/checkpoint/FusionAudio-high-25K/checkpoint/pytorch_model.bin \
    --audio /data/workspace/code/SECap/dataset/wav/tx_emotion_00201000107.wav \
    --question "Please describe this audio in detail."


Question: Please describe this audio in detail.
Audio: /data/workspace/code/SECap/dataset/wav/tx_emotion_00201000015.wav
Response: The audio is a recording of a person speaking with a strong, clear voice. The speaker has a neutral accent and delivers the message in a confident and professional tone. The language used is formal and technical, indicating that it may be a lecture or presentation on a specific topic. The speaker discusses various aspects of a particular subject, using concrete examples to illustrate their points. They use technical terms and jargon related to the field being discussed, which suggests that the audience is likely composed of experts or those familiar with the terminology. Throughout the audio, the speaker maintains a consistent pace and

Question: Please describe this audio in detail.
Audio: /data/workspace/code/SECap/dataset/wav/tx_emotion_00201000107.wav
Response: The audio is a recording of a person speaking with a strong, clear voice. The speaker has a neutral accent and delivers the message in a confident and professional tone. The language used is formal and technical, indicating that it may be a lecture or presentation on a specific topic. The speaker discusses various aspects of a particular subject, using concrete examples to illustrate their points. They use technical terms and jargon related to the field being discussed, which suggests that the audience is likely composed of experts or those familiar with the terminology. Throughout the audio, the speaker maintains a consistent pace and
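
One way to rule out an input mix-up would be to confirm that the two wav files really differ on disk, and to measure how close the two responses are. The sketch below is only an illustration: the helper names file_sha256 and response_similarity are hypothetical and are not part of the FusionAudio repository, and it uses only the Python standard library.

```python
import difflib
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def response_similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()
```

If the two digests differ while the similarity of the two responses is 1.0, the inputs are distinct files and the identical text comes from somewhere in the inference path rather than from passing the same file twice.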
