Caption Generation question #7

@lyx888-lyx

Hi, thanks for sharing your work. I am trying to generate captions with the quick_inference script, but I ran into the following problem:
I used two different audio files, but the resulting outputs were nearly identical, as shown below. What could be the reason for this?
This is my inference command:
python quick_inference.py --base_model /data/workspace/code/FusionAudio/models/Llama-2-7b-chat-hf-qformer --model_path /data/workspace/code/FusionAudio/checkpoint/FusionAudio-high-25K/checkpoint/pytorch_model.bin --audio /data/workspace/code/SECap/dataset/wav/tx_emotion_00201000107.wav --question "Please describe this audio in detail."

Question: Please describe this audio in detail.
Audio: /data/workspace/code/SECap/dataset/wav/tx_emotion_00201000015.wav
Response: The audio is a recording of a person speaking with a strong, clear voice. The speaker has a neutral accent and delivers the message in a confident and professional tone. The language used is formal and technical, indicating that it may be a lecture or presentation on a specific topic. The speaker discusses various aspects of a particular subject, using concrete examples to illustrate their points. They use technical terms and jargon related to the field being discussed, which suggests that the audience is likely composed of experts or those familiar with the terminology. Throughout the audio, the speaker maintains a consistent pace and


Question: Please describe this audio in detail.
Audio: /data/workspace/code/SECap/dataset/wav/tx_emotion_00201000107.wav
Response: The audio is a recording of a person speaking with a strong, clear voice. The speaker has a neutral accent and delivers the message in a confident and professional tone. The language used is formal and technical, indicating that it may be a lecture or presentation on a specific topic. The speaker discusses various aspects of a particular subject, using concrete examples to illustrate their points. They use technical terms and jargon related to the field being discussed, which suggests that the audience is likely composed of experts or those familiar with the terminology. Throughout the audio, the speaker maintains a consistent pace and
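
To rule out a mix-up on the input side and to put a number on "highly similar", a small check like the one below may help. It is only a sketch using the Python standard library; `file_digest` and `response_similarity` are hypothetical helper names and are not part of FusionAudio or quick_inference.py.

```python
import difflib
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of the file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def response_similarity(a, b):
    """Return a similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def compare_runs(audio_a, audio_b, response_a, response_b):
    """Summarize whether the inputs differ and how close the outputs are."""
    return {
        # True means the two audio files really contain different bytes.
        "inputs_differ": file_digest(audio_a) != file_digest(audio_b),
        "response_similarity": response_similarity(response_a, response_b),
    }
```

If `inputs_differ` is True while `response_similarity` stays at or near 1.0 across several audio files, the responses do not appear to depend on the audio input.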
