Caption Generation question #7
Description
Hi, thanks for sharing your work. I was trying to run captioning with the quick_inference script, but I keep running into the following problem:
I used two different audio files, but the resulting outputs were nearly identical. What could be the reason for this?
python quick_inference.py --base_model /data/workspace/code/FusionAudio/models/Llama-2-7b-chat-hf-qformer --model_path /data/workspace/code/FusionAudio/checkpoint/FusionAudio-high-25K/checkpoint/pytorch_model.bin --audio /data/workspace/code/SECap/dataset/wav/tx_emotion_00201000107.wav --question "Please describe this audio in detail."
This is my inference command:
python quick_inference.py --base_model /FusionAudio/models/Llama-2-7b-chat-hf-qformer --model_path /FusionAudio/checkpoint/FusionAudio-high-25K/checkpoint/pytorch_model.bin --audio /data/workspace/code/SECap/dataset/wav/tx_emotion_00201000107.wav --question "Please describe this audio in detail."
Question: Please describe this audio in detail.
Audio: /data/workspace/code/SECap/dataset/wav/tx_emotion_00201000015.wav
Response: The audio is a recording of a person speaking with a strong, clear voice. The speaker has a neutral accent and delivers the message in a confident and professional tone. The language used is formal and technical, indicating that it may be a lecture or presentation on a specific topic. The speaker discusses various aspects of a particular subject, using concrete examples to illustrate their points. They use technical terms and jargon related to the field being discussed, which suggests that the audience is likely composed of experts or those familiar with the terminology. Throughout the audio, the speaker maintains a consistent pace and
Question: Please describe this audio in detail.
Audio: /data/workspace/code/SECap/dataset/wav/tx_emotion_00201000107.wav
Response: The audio is a recording of a person speaking with a strong, clear voice. The speaker has a neutral accent and delivers the message in a confident and professional tone. The language used is formal and technical, indicating that it may be a lecture or presentation on a specific topic. The speaker discusses various aspects of a particular subject, using concrete examples to illustrate their points. They use technical terms and jargon related to the field being discussed, which suggests that the audience is likely composed of experts or those familiar with the terminology. Throughout the audio, the speaker maintains a consistent pace and
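For anyone trying to reproduce this, a quick sanity check may help rule out the simplest explanations first. The sketch below is not part of the FusionAudio repository; the helper names (`file_digest`, `same_audio_bytes`, `response_similarity`) are hypothetical and it uses only the Python standard library. It checks whether the two input files really differ on disk and measures how close the two response strings are.

```python
import hashlib
from difflib import SequenceMatcher
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def same_audio_bytes(path_a, path_b):
    """True when the two files are byte-for-byte identical."""
    return file_digest(path_a) == file_digest(path_b)


def response_similarity(text_a, text_b):
    """Character-level similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, text_a, text_b).ratio()
```

Calling `same_audio_bytes` on tx_emotion_00201000015.wav and tx_emotion_00201000107.wav and `response_similarity` on the two responses would show whether distinct inputs are producing the same text. If the files differ but the similarity is 1.0, one possibility (an assumption, not something confirmed here) is that the audio features are not reaching the language model, so the prompt alone determines the output.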