Hello,
I've been trying to recognize, in real time from an HLS playlist, TV shows and ads that were ingested with DejaVu. The shows last from a few minutes to several hours, and the ads generally last a few dozen seconds.
The main problem is that when I run recognition on a TS segment that should match an audio file ingested by DejaVu, the input_confidence attribute is, depending on the segment length, either very low or not close enough to 1.
With 60-second TS segments, the input_confidence value tends towards 0. It is often <= 0.1 with the default settings and only rises to about 0.2 at most using these settings.
With 6-second segments, the value is closer to 1, around 0.5 to 0.9 most of the time. However, the second result returned by the program often has a value closer to 1 than the first, even though it is the wrong audio.
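For reference, here is a minimal sketch of how I understand input_confidence to be derived. It assumes the value is the number of hashes that matched a stored track divided by the total number of hashes extracted from the query segment, rounded to two decimals. The function name and the counts below are hypothetical and only meant to illustrate the ratio, not to reproduce DejaVu's actual code.

```python
def input_confidence(hashes_matched: int, input_total_hashes: int) -> float:
    """Fraction of the query's hashes that aligned with one stored track.

    Assumption: this mirrors the ratio reported as input_confidence.
    """
    if input_total_hashes <= 0:
        # No hashes were extracted from the query, so nothing can match.
        return 0.0
    return round(hashes_matched / input_total_hashes, 2)


# Hypothetical counts, for illustration only.
print(input_confidence(120, 1500))  # many query hashes, few matched
print(input_confidence(110, 150))   # few query hashes, most matched
```

Under that assumption, a low value means that only a small share of the hashes computed from the segment were found for the matched track.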
The files ingested are WMV files, and the audio specs are the following:
- 3 audio tracks
- Codec WMA 9.2
- Constant bit rate mode at 96kbps
- 2 channels
- 48 kHz sample rate
I converted these WMV files to TS files using ffmpeg so that they match the characteristics of the TS segments, which are the following:
- Single audio track
- Codec AAC LC Version 4
- Muxing Mode: ADTS
- 2 channels
- 48 kHz sample rate
- Lossy compression mode
I also noticed something odd. When I take an excerpt from a TS file that I converted from a WMV file ingested by DejaVu, the input_confidence is 1 or close to 1 most of the time. But when I take the same portion of audio from a TS segment of my HLS playlist, the result is poor: close to 0 for 60-second segments, or close to 1 but still not high enough for 6-second segments. How can this be explained?
How can I get more relevant results?