
Conversation


@marko1616 marko1616 commented Jan 1, 2026

This commit resolves audio lag that occurs when using audio monitor for extended periods on sources without video.

Description

Fix #12973

Motivation and Context

As described in the linked issue.

How Has This Been Tested?

Left OBS idle (AFK) for an extended period with audio monitoring active and checked whether lag developed.

Types of changes

Bug fix (non-breaking change which fixes an issue)

Checklist:

  • My code has been run through clang-format.
  • I have read the contributing document.
  • My code is not on the master branch.
  • The code has been tested.
  • All commit messages are properly formatted and commits squashed where appropriate.
  • I have included updates to all appropriate documentation.

Member

@PatTheMav PatTheMav left a comment


At first glance I don't see enough meaningful differences in the effective code to justify the existence of two separate functions for this, vs. enhancing the current function to check for an associated video object directly (and then use a different set of calculations for the timestamp difference).

Also what is the root cause for the delay without an associated video source?

@marko1616
Author

> At first glance I don't see enough meaningful differences in the effective code to justify the existence of two separate functions for this, vs. enhancing the current function to check for an associated video object directly (and then use a different set of calculations for the timestamp difference).
>
> Also what is the root cause for the delay without an associated video source?

Many factors could be causing the delay, but I suspect system lag under heavy workloads and audio driver latency are the primary reasons. Since this issue is difficult to trigger and track, I’ll need more time to investigate it further.

@marko1616 marko1616 marked this pull request as draft January 14, 2026 20:32
@PatTheMav
Member

> At first glance I don't see enough meaningful differences in the effective code to justify the existence of two separate functions for this, vs. enhancing the current function to check for an associated video object directly (and then use a different set of calculations for the timestamp difference).
> Also what is the root cause for the delay without an associated video source?
>
> Many factors could be causing the delay, but I suspect system lag under heavy workloads and audio driver latency are the primary reasons. Since this issue is difficult to trigger and track, I'll need more time to investigate it further.

Much appreciated. Apart from the code duplication I mentioned, I can see how the added code would do the right thing, but I'm also trying to understand why the original authors didn't think it was worth adding re-synchronisation to audio sources untethered to a video stream.

So maybe they simply didn't consider the situation to be probable (or possible) so it'd be a simple oversight, or the situation is indeed "not supposed to happen" and the fact that it does is indicative of a larger issue.

@marko1616
Author

marko1616 commented Jan 18, 2026

> At first glance I don't see enough meaningful differences in the effective code to justify the existence of two separate functions for this, vs. enhancing the current function to check for an associated video object directly (and then use a different set of calculations for the timestamp difference).
> Also what is the root cause for the delay without an associated video source?
>
> Many factors could be causing the delay, but I suspect system lag under heavy workloads and audio driver latency are the primary reasons. Since this issue is difficult to trigger and track, I'll need more time to investigate it further.
>
> Much appreciated. Apart from the code duplication I mentioned, I can see how the added code would do the right thing, but I'm also trying to understand why the original authors didn't think it was worth adding re-synchronisation to audio sources untethered to a video stream.
>
> So maybe they simply didn't consider the situation to be probable (or possible) so it'd be a simple oversight, or the situation is indeed "not supposed to happen" and the fact that it does is indicative of a larger issue.

The current synchronization logic is fundamentally limited by the condition monitor->delay_buffer.size > 0. This means we only attempt to "drop" audio if there is a backlog in our local buffer.

In scenarios like MSFS high workload, the audio is pushed into the WASAPI endpoint buffer immediately. Because our delay_buffer remains empty (size: 0), the dropping branch is never entered, even when the WASAPI pad is full and audio is clearly behind.


Manual Sync

When adjusting sync manually, "waiting" logs appear first, which allows the delay_buffer to build up. Once there is a backlog, the dropping mechanism can actually trigger:

12:25:49.031: [AudioMon] delay_buf: 0 bytes (max: 0), WASAPI pad: 0 (max: 0), rate: 48000
12:30:12.742: audio ahead of realtime, waiting, diff: 89962100, delay buffer size: 3852, cur_time: 3835392924000, front_ts: 3835482886100
...
12:32:28.443: audio behind realtime, dropping, diff: -1007200, delay buffer size: 34668, cur_time: 3971093906300, front_ts: 3971092899100
12:35:49.032: [AudioMon] delay_buf: 34668 bytes (max: 34668), WASAPI pad: 0 (max: 0), rate: 48000

High Workload

In MSFS, the frames are sent to WASAPI as soon as they arrive. The delay_buffer stays at 0, so the sync code sees "nothing to drop," while the hardware buffer accumulates latency:

20:22:12.689: [AudioMon] delay_buf: 0 bytes (max: 0), WASAPI pad: 0 (max: 0), rate: 48000
// ...
03:12:12.720: [AudioMon] delay_buf: 0 bytes (max: 0), WASAPI pad: 2400 (max: 2400), rate: 48000

Solution?

The current dropping mechanism cannot reach frames once they are in the WASAPI endpoint buffer. If the delay_buffer is empty, we have no way to "skip" ahead to regain sync.

To fix this, we need a mechanism that handles synchronization when the bottleneck is the endpoint buffer itself:

  1. Dynamic Resampling: Adjust the playback rate via IAudioClock::SetRate (if supported) or a software resampler to "catch up."
  2. Endpoint Reset: Brutally call Stop/Reset/Start on the IAudioClient to flush the hardware buffer when the WASAPI pad exceeds a threshold without a local backlog.

Debugging Code

I am using the following to track the relationship between our local buffer and the WASAPI padding:

static void debug_buffer_stats(struct audio_monitor *monitor, UINT32 wasapi_pad)
{
    static uint64_t last_log = 0;
    static uint32_t max_pad = 0;
    static size_t max_delay_buf = 0;

    uint64_t now = os_gettime_ns();

    if (wasapi_pad > max_pad) max_pad = wasapi_pad;
    if (monitor->delay_buffer.size > max_delay_buf)
        max_delay_buf = monitor->delay_buffer.size;

    if (now - last_log >= 600000000000ULL) { /* log every 10 minutes */
        blog(LOG_INFO,
            "[AudioMon] delay_buf: %zu bytes (max: %zu), "
            "WASAPI pad: %u (max: %u), rate: %u",
            monitor->delay_buffer.size, max_delay_buf,
            wasapi_pad, max_pad,
            monitor->sample_rate);

        last_log = now;
        max_pad = 0;
        max_delay_buf = 0;
    }
}

Since these solutions involve resetting the audio stream or changing timing, they might be considered breaking changes. How should we proceed with implementing a "pad-aware" sync? I'm looking forward to your opinion, and thanks. I think only by syncing against both buffers can we keep the lag at an acceptable level. For now I have set the maximum delay to 10 ms (delay_buffer) + 10 ms (WASAPI) and am testing more extreme cases.

@marko1616 marko1616 force-pushed the audio-monitor-fix branch 3 times, most recently from 41ea4ab to 7447a1b Compare January 18, 2026 08:41
@PatTheMav
Member

@norihiro Would like to hear your opinion on this - you've looked into applying concepts from signal theory to fix similar stuff in the past; is there something we can do to address this?

Given that there is currently no way to handle this scenario, we have a "blank canvas" to come up with a solution, so to speak; so if there is a smarter way to solve this (rather than the somewhat brutish way OBS handles this in other places), it'd be a great opportunity to do so.

@PatTheMav
Member

> In scenarios like MSFS high workload, the audio is pushed into the WASAPI endpoint buffer immediately. Because our delay_buffer remains empty (size: 0), the dropping branch is never entered, even when the WASAPI pad is full and audio is clearly behind.

So the root cause of the problem seems to be that potentially the output device lags behind and is not able to play back audio at the rate that OBS creates new audio samples and thus the "padding" value in the endpoint buffer increases over time?

And if the monitored source has no video component, OBS simply ignores this padding value entirely and pushes more and more data into the endpoint buffer. By your description it seems that we got lucky so far that the endpoint buffer always had enough capacity to accept additional samples despite the growing padding (otherwise OBS would have effectively "reset" the output client).

When a video source is monitored (and audio is not detached), OBS would indeed calculate a theoretical absolute nanosecond timestamp for the "oldest" remaining audio in the endpoint buffer (effectively the number of buffered frames multiplied by the duration of a sample at the current sampling rate) and "skip" rendering either because the oldest remaining sample is more than 75 ms ahead of the last video frame timestamp, or because that same sample is more than 75 ms behind it (and buffered audio data is waiting for playback).

It's only when the remaining data in the endpoint buffer is within 75ms of the last video frame timestamp that OBS would simply push the audio data into it, but audio data is always buffered internally.


At least from my naive POV I agree that we should always consider the padding in the endpoint buffer (not only when we have a video frame timestamp associated with the audio data) and thus detect if we produce audio quicker than the audio device seems to be able to consume it.

With an associated video source, its frame times are obvious points with which audio data needs to be synchronised; without those, the audio data itself becomes the sync point, and the padding seems to be the best indication of how much "out of sync" the output becomes.

@marko1616
Author

marko1616 commented Jan 19, 2026

> In scenarios like MSFS high workload, the audio is pushed into the WASAPI endpoint buffer immediately. Because our delay_buffer remains empty (size: 0), the dropping branch is never entered, even when the WASAPI pad is full and audio is clearly behind.
>
> So the root cause of the problem seems to be that potentially the output device lags behind and is not able to play back audio at the rate that OBS creates new audio samples and thus the "padding" value in the endpoint buffer increases over time?
>
> And if the monitored source has no video component, OBS simply ignores this padding value entirely and pushes more and more data into the endpoint buffer. By your description it seems that we got lucky so far that the endpoint buffer always had enough capacity to accept additional samples despite the growing padding (otherwise OBS would have effectively "reset" the output client).
>
> When a video source is monitored (and audio is not detached), OBS would indeed calculate a theoretical absolute nanosecond timestamp for the "oldest" remaining audio in the endpoint buffer (effectively the number of buffered frames multiplied by the duration of a sample at the current sampling rate) and "skip" rendering either because the oldest remaining sample is more than 75 ms ahead of the last video frame timestamp, or because that same sample is more than 75 ms behind it (and buffered audio data is waiting for playback).
>
> It's only when the remaining data in the endpoint buffer is within 75ms of the last video frame timestamp that OBS would simply push the audio data into it, but audio data is always buffered internally.
>
> At least from my naive POV I agree that we should always consider the padding in the endpoint buffer (not only when we have a video frame timestamp associated with the audio data) and thus detect if we produce audio quicker than the audio device seems to be able to consume it.
>
> With an associated video source, its frame times are obvious points with which audio data needs to be synchronised; without those, the audio data itself becomes the sync point, and the padding seems to be the best indication of how much "out of sync" the output becomes.

Looking at the current code, the maximum endpoint buffer length is set to 10,000,000 × 100ns = 1000ms = 1s (for reference on why 100ns is used as the base unit, see the Microsoft documentation). This is an exceptionally long delay—so much so that I suspect it might be a mistake. In scenarios where I need real-time audio monitoring (such as in-ear monitoring), this accumulated latency becomes quite uncomfortable to work with. Moreover, since we cannot directly control the endpoint buffer, this also means the video sync-related delay could take up to 1 second before a forced reset is triggered.

Based on my perceptual testing, I believe a maximum internal delay of 30 ms (20 ms for the endpoint buffer + 10 ms for OBS's internal buffer) would be a much better choice. Additionally, my tests suggest that simply resetting the buffer when the WASAPI endpoint buffer is relatively small doesn't cause any noticeable perceptual impact, and I've made the syncing work for audio-only sources as well, in case anyone needs it (I think that is the more appropriate behaviour).

By the way, I greatly appreciate everyone who has reviewed this PR and given their opinion, including yourself. I think this PR is good to go now.

@marko1616 marko1616 marked this pull request as ready for review January 19, 2026 14:58
@PatTheMav PatTheMav requested a review from pkviet January 19, 2026 15:08
This commit resolves audio lag that occurs when using audio monitor for extended periods on sources without video.
@marko1616
Author

marko1616 commented Jan 19, 2026

Considering there may be cases where a video source can't feed an audio frame within 10 ms, I've raised the WASAPI buffer to 50 ms as a trade-off. I'll continue testing related behaviour: investigating video source feed latency, examining whether it's necessary to decouple video and audio-only sources (they have really different latencies), and perhaps making the WASAPI buffer size dynamic.
For now I just want to keep it simple and focus on that huge 1 s lag.

@norihiro
Contributor

I assume WASAPI and OBS have different clocks; hence, in my opinion, we need to resample the audio.
I cannot find documentation for IAudioClock::SetRate, but FFmpeg provides a software resampler, as I implemented in #6351, which addresses the clock drift between the audio source and OBS.

For DeckLink output, we had an experiment that dynamically adjusts DeckLink's clock based on the offset between OBS and the hardware.
(I'm sorry that I haven't had enough time in recent months to address the DeckLink issue.)
Like this, I think we can dynamically adjust the resampler to keep WASAPI's buffer constant: get the buffer size, average it over recent seconds, take the difference between the averaged buffer size and a target buffer size, and adjust the resampler accordingly.

Regarding the difficulty of triggering and tracking this, I recommend plotting the buffer size and some other metrics on a graph.

@marko1616
Author

marko1616 commented Jan 20, 2026

> I assume WASAPI and OBS have different clocks; hence, in my opinion, we need to resample the audio. I cannot find documentation for IAudioClock::SetRate, but FFmpeg provides a software resampler, as I implemented in #6351, which addresses the clock drift between the audio source and OBS.
>
> For DeckLink output, we had an experiment that dynamically adjusts DeckLink's clock based on the offset between OBS and the hardware. (I'm sorry that I haven't had enough time in recent months to address the DeckLink issue.) Like this, I think we can dynamically adjust the resampler to keep WASAPI's buffer constant: get the buffer size, average it over recent seconds, take the difference between the averaged buffer size and a target buffer size, and adjust the resampler accordingly.
>
> Regarding the difficulty of triggering and tracking this, I recommend plotting the buffer size and some other metrics on a graph.

Yep, I just added the debug plots and outputs as we discussed earlier. In my opinion, your ideas are quite good (I hadn't noticed anyone else attempting these fixes before). I actually considered implementing a similar solution earlier but decided against it due to the complexity; for now, I'm using a simpler approach. I will try to test the performance of your PR's approach on this particular scene, and I'm glad to assist if you need to adapt it for audio monitoring.

@PatTheMav
Member

> I assume WASAPI and OBS have different clocks; hence, in my opinion, we need to resample the audio. I cannot find documentation for IAudioClock::SetRate, but FFmpeg provides a software resampler, as I implemented in #6351, which addresses the clock drift between the audio source and OBS.

That PR came to my mind as well, which is why I pinged you. Without having looked at it in too much detail, is there a possibility for OBS to do this work at a higher level than the platform-specific monitoring implementations?

It would be great if we could implement this just once in OBS rather than 3 times (for Pulse, WASAPI and CoreAudio).

Member

@pkviet pkviet left a comment


Thanks for your PR. My first reaction to this PR was the same as @norihiro that the better fix would be to resample instead of just dropping samples in order to preserve the audio integrity. Also it feels kind of risky to allow just a 10 ms lag so about 512 samples at 48 kHz. I would really be hesitant to merge the PR without extensive tests across a variety of monitoring devices to ensure audio quality is still fine. The monitoring sub-system is sometimes abused to provide an extra output instead of just monitoring so I'd be extra careful. IMO, I'd really prefer a dynamic resampling as done by FFmpeg aresample filter for instance, which can do a dynamic compensation: https://ffmpeg.org/ffmpeg-filters.html#Examples-16
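For reference, the dynamic compensation pkviet mentions can be exercised from the FFmpeg command line; this is only an illustration of the aresample filter's behaviour (using the async parameter from the linked filter examples), not OBS code:

```shell
# Stretch or squeeze the audio by at most 1000 samples per second to chase
# the reference timestamps, instead of dropping whole chunks of samples.
ffmpeg -i input.mkv -af "aresample=async=1000" output.mkv
```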


#define ACTUALLY_DEFINE_GUID(name, l, w1, w2, b1, b2, b3, b4, b5, b6, b7, b8) \
EXTERN_C const GUID DECLSPEC_SELECTANY name = {l, w1, w2, {b1, b2, b3, b4, b5, b6, b7, b8}}
Member


clang-format issue ? this should not belong to this PR


#define do_log(level, format, ...) \
blog(level, "[audio monitoring: '%s'] " format, obs_source_get_name(monitor->source), ##__VA_ARGS__)
Member


clang-format issue

blog(LOG_INFO,
     "diff: %lld, delay buffer size: %lu, "
     "v: %llu: a: %llu",
     diff, (int)monitor->delay_buffer.size, last_frame_ts, front_ts);
Member


clang-format issue

blog(LOG_INFO,
     "diff: %lld, delay buffer size: %lu, "
     "v: %llu: a: %llu",
     diff, (int)monitor->delay_buffer.size, last_frame_ts, front_ts);
Member


clang-format issue


/* ------------------------------------------ *
* Init device */
Member


clang-format issue


/* ------------------------------------------ *
* Init client */
Member


clang-format issue


/* ------------------------------------------ *
* Init resampler */
Member


clang-format issue


/* ------------------------------------------ *
* Init client */
Member


clang-format issue

}
} else {
diff = (int64_t)front_ts - (int64_t)cur_time;
if (diff > 10000000) {
Member


How did you come to that value of 10 ms? (Our audio tick is 1024 samples, so at 48 kHz it is 21.3 ms; I am assuming you picked roughly half?) Did you try other values, larger or smaller?

continue;
}
} else {
if (diff < -10000000 && monitor->delay_buffer.size > 0) {
Member


This is where samples are dropped; how did you reach that value of 10 ms? It might make drops more frequent and really affect audio quality for the benefit of better sync. I am not convinced that is a good trade-off (more on that later).

@RytoEX
Member

RytoEX commented Jan 20, 2026

> This is an exceptionally long delay—so much so that I suspect it might be a mistake.

I am pretty sure this was intentional at the time the audio sub-system was implemented and not a mistake. @Lain-B



Development

Successfully merging this pull request may close these issues.

Audio lag occurs when using audio monitor for long time on sources without video.

5 participants