
Conversation


@marko1616 marko1616 commented Jan 1, 2026

This commit resolves audio lag that occurs when using audio monitor for extended periods on sources without video.

Description

Fix #12973

Motivation and Context

As described in the linked issue.

How Has This Been Tested?

Left OBS idle (AFK) for an extended period with audio monitoring active and checked whether lag developed.

Types of changes

Bug fix (non-breaking change which fixes an issue)

Checklist:

  • My code has been run through clang-format.
  • I have read the contributing document.
  • My code is not on the master branch.
  • The code has been tested.
  • All commit messages are properly formatted and commits squashed where appropriate.
  • I have included updates to all appropriate documentation.

Member

@PatTheMav PatTheMav left a comment


At first glance I don't see enough meaningful differences in the effective code to justify the existence of two separate functions for this, vs. enhancing the current function to check for an associated video object directly (and then use a different set of calculations for the timestamp difference).

Also what is the root cause for the delay without an associated video source?

@marko1616
Author

> At first glance I don't see enough meaningful differences in the effective code to justify the existence of two separate functions for this, vs. enhancing the current function to check for an associated video object directly (and then use a different set of calculations for the timestamp difference).
>
> Also what is the root cause for the delay without an associated video source?

Many factors could be causing the delay, but I suspect system lag under heavy workloads and audio driver latency are the primary reasons. Since this issue is difficult to trigger and track, I’ll need more time to investigate it further.

@marko1616 marko1616 marked this pull request as draft January 14, 2026 20:32
@PatTheMav
Member

> At first glance I don't see enough meaningful differences in the effective code to justify the existence of two separate functions for this, vs. enhancing the current function to check for an associated video object directly (and then use a different set of calculations for the timestamp difference).
> Also what is the root cause for the delay without an associated video source?
>
> Many factors could be causing the delay, but I suspect system lag under heavy workloads and audio driver latency are the primary reasons. Since this issue is difficult to trigger and track, I'll need more time to investigate it further.

Much appreciated. Apart from the code duplication I mentioned, I can see how the added code would do the right thing, but I'm also trying to understand why the original authors didn't think it was worth adding re-synchronisation to audio sources untethered to a video stream.

So maybe they simply didn't consider the situation to be probable (or possible) so it'd be a simple oversight, or the situation is indeed "not supposed to happen" and the fact that it does is indicative of a larger issue.

@marko1616
Author

marko1616 commented Jan 18, 2026

> At first glance I don't see enough meaningful differences in the effective code to justify the existence of two separate functions for this, vs. enhancing the current function to check for an associated video object directly (and then use a different set of calculations for the timestamp difference).
> Also what is the root cause for the delay without an associated video source?
>
> Many factors could be causing the delay, but I suspect system lag under heavy workloads and audio driver latency are the primary reasons. Since this issue is difficult to trigger and track, I'll need more time to investigate it further.
>
> Much appreciated. Apart from the code duplication I mentioned, I can see how the added code would do the right thing, but I'm also trying to understand why the original authors didn't think it was worth adding re-synchronisation to audio sources untethered to a video stream.
>
> So maybe they simply didn't consider the situation to be probable (or possible) so it'd be a simple oversight, or the situation is indeed "not supposed to happen" and the fact that it does is indicative of a larger issue.

The current synchronization logic is fundamentally limited by the condition monitor->delay_buffer.size > 0. This means we only attempt to "drop" audio if there is a backlog in our local buffer.

In scenarios like MSFS high workload, the audio is pushed into the WASAPI endpoint buffer immediately. Because our delay_buffer remains empty (size: 0), the dropping branch is never entered, even when the WASAPI pad is full and audio is clearly behind.


Manual Sync

When adjusting sync manually, "waiting" logs appear first, which allows the delay_buffer to build up. Once there is a backlog, the dropping mechanism can actually trigger:

12:25:49.031: [AudioMon] delay_buf: 0 bytes (max: 0), WASAPI pad: 0 (max: 0), rate: 48000
12:30:12.742: audio ahead of realtime, waiting, diff: 89962100, delay buffer size: 3852, cur_time: 3835392924000, front_ts: 3835482886100
...
12:32:28.443: audio behind realtime, dropping, diff: -1007200, delay buffer size: 34668, cur_time: 3971093906300, front_ts: 3971092899100
12:35:49.032: [AudioMon] delay_buf: 34668 bytes (max: 34668), WASAPI pad: 0 (max: 0), rate: 48000

High Workload

In MSFS, the frames are sent to WASAPI as soon as they arrive. The delay_buffer stays at 0, so the sync code sees "nothing to drop," while the hardware buffer accumulates latency:

20:22:12.689: [AudioMon] delay_buf: 0 bytes (max: 0), WASAPI pad: 0 (max: 0), rate: 48000
// ...
03:12:12.720: [AudioMon] delay_buf: 0 bytes (max: 0), WASAPI pad: 2400 (max: 2400), rate: 48000

Solution?

The current dropping mechanism cannot reach frames once they are in the WASAPI endpoint buffer. If the delay_buffer is empty, we have no way to "skip" ahead to regain sync.

To fix this, we need a mechanism that handles synchronization when the bottleneck is the endpoint buffer itself:

  1. Dynamic Resampling: Adjust the playback rate via IAudioClock::SetRate (if supported) or a software resampler to "catch up."
  2. Endpoint Reset: Brutally call Stop/Reset/Start on the IAudioClient to flush the hardware buffer when the WASAPI pad exceeds a threshold without a local backlog.

Debugging Code

I am using the following to track the relationship between our local buffer and the WASAPI padding:

static void debug_buffer_stats(struct audio_monitor *monitor, UINT32 wasapi_pad)
{
    static uint64_t last_log = 0;
    static uint32_t max_pad = 0;
    static size_t max_delay_buf = 0;

    uint64_t now = os_gettime_ns();

    if (wasapi_pad > max_pad) max_pad = wasapi_pad;
    if (monitor->delay_buffer.size > max_delay_buf)
        max_delay_buf = monitor->delay_buffer.size;

    if (now - last_log >= 600000000000ULL) { /* log every 10 minutes */
        blog(LOG_INFO,
            "[AudioMon] delay_buf: %zu bytes (max: %zu), "
            "WASAPI pad: %u (max: %u), rate: %u",
            monitor->delay_buffer.size, max_delay_buf,
            wasapi_pad, max_pad,
            monitor->sample_rate);

        last_log = now;
        max_pad = 0;
        max_delay_buf = 0;
    }
}

Since these solutions involve resetting the audio stream or changing timing, they might be considered breaking changes. How should we proceed with implementing a "pad-aware" sync? I'm looking forward to your opinion, and thanks. I think only by syncing against both buffers can we keep the lag at an acceptable level. For now I have set the maximum delay to 10 ms (delay_buffer) + 10 ms (WASAPI) and am testing more extreme cases.

@marko1616 marko1616 force-pushed the audio-monitor-fix branch 3 times, most recently from 41ea4ab to 7447a1b Compare January 18, 2026 08:41
@PatTheMav
Member

@norihiro Would like to hear your opinion on this - you've looked into applying concepts from signal theory to fix similar stuff in the past; is there something we can do to address this?

Given that there is currently no way to handle this scenario, we have a "blank canvas" to come up with a solution, so to speak; so if there is a smarter way to solve this (rather than the somewhat brutish way OBS handles this in other places), it'd be a great opportunity to do so.

@PatTheMav
Member

> In scenarios like MSFS high workload, the audio is pushed into the WASAPI endpoint buffer immediately. Because our delay_buffer remains empty (size: 0), the dropping branch is never entered, even when the WASAPI pad is full and audio is clearly behind.

So the root cause of the problem seems to be that potentially the output device lags behind and is not able to play back audio at the rate that OBS creates new audio samples and thus the "padding" value in the endpoint buffer increases over time?

And if the monitored source has no video component, OBS simply ignores this padding value entirely and pushes more and more data into the endpoint buffer. By your description it seems that we got lucky so far that the endpoint buffer always had enough capacity to accept additional samples despite the growing padding (otherwise OBS would have effectively "reset" the output client).

When a video source is monitored (and audio is not detached), OBS would indeed calculate a theoretical absolute nanosecond timestamp for the "oldest" remaining audio in the endpoint buffer (effectively the number of buffered frames multiplied by the duration of a sample at the current sampling rate) and "skip" rendering either because the oldest remaining sample is more than 75 ms ahead of the last video frame timestamp, or because that same sample is more than 75 ms behind it (and buffered audio data is waiting for playback).

It's only when the remaining data in the endpoint buffer is within 75ms of the last video frame timestamp that OBS would simply push the audio data into it, but audio data is always buffered internally.


At least from my naive POV I agree that we should always consider the padding in the endpoint buffer (not only when we have a video frame timestamp associated with the audio data) and thus detect if we produce audio quicker than the audio device seems to be able to consume it.

With an associated video source, its frame times are obvious points with which audio data needs to be synchronised; without those, the audio data itself becomes the sync point, and the padding seems to be the best indication of how much "out of sync" the output becomes.

@marko1616
Author

marko1616 commented Jan 19, 2026

> In scenarios like MSFS high workload, the audio is pushed into the WASAPI endpoint buffer immediately. Because our delay_buffer remains empty (size: 0), the dropping branch is never entered, even when the WASAPI pad is full and audio is clearly behind.
>
> So the root cause of the problem seems to be that potentially the output device lags behind and is not able to play back audio at the rate that OBS creates new audio samples and thus the "padding" value in the endpoint buffer increases over time?
>
> And if the monitored source has no video component, OBS simply ignores this padding value entirely and pushes more and more data into the endpoint buffer. By your description it seems that we got lucky so far that the endpoint buffer always had enough capacity to accept additional samples despite the growing padding (otherwise OBS would have effectively "reset" the output client).
>
> When a video source is monitored (and audio is not detached), OBS would indeed calculate a theoretical absolute nanosecond timestamp for the "oldest" remaining audio in the endpoint buffer (effectively the number of buffered frames multiplied by the duration of a sample at the current sampling rate) and "skip" rendering either because the oldest remaining sample is more than 75 ms ahead of the last video frame timestamp, or because that same sample is more than 75 ms behind it (and buffered audio data is waiting for playback).
>
> It's only when the remaining data in the endpoint buffer is within 75ms of the last video frame timestamp that OBS would simply push the audio data into it, but audio data is always buffered internally.
>
> At least from my naive POV I agree that we should always consider the padding in the endpoint buffer (not only when we have a video frame timestamp associated with the audio data) and thus detect if we produce audio quicker than the audio device seems to be able to consume it.
>
> With an associated video source, its frame times are obvious points with which audio data needs to be synchronised; without those, the audio data itself becomes the sync point, and the padding seems to be the best indication of how much "out of sync" the output becomes.

Looking at the current code, the maximum endpoint buffer length is set to 10,000,000 × 100ns = 1000ms = 1s (for reference on why 100ns is used as the base unit, see the Microsoft documentation). This is an exceptionally long delay—so much so that I suspect it might be a mistake. In scenarios where I need real-time audio monitoring (such as in-ear monitoring), this accumulated latency becomes quite uncomfortable to work with. Moreover, since we cannot directly control the endpoint buffer, this also means the video sync-related delay could take up to 1 second before a forced reset is triggered.

Based on my perceptual testing, I believe a maximum internal delay of 30 ms (20 ms for the endpoint buffer + 10 ms for OBS's internal buffer) would be a much better choice. Additionally, my tests suggest that simply resetting the buffer when the WASAPI endpoint buffer is relatively small doesn't cause any noticeable perceptual impact, and I've made the syncing work for audio-only sources as well, in case anyone needs it (I think that is the more appropriate behaviour).

By the way, I greatly appreciate everyone who has reviewed this PR and given their opinion, including yourself. I think this PR is good to go now.

@marko1616 marko1616 marked this pull request as ready for review January 19, 2026 14:58
@PatTheMav PatTheMav requested a review from pkviet January 19, 2026 15:08
This commit resolves audio lag that occurs when using audio monitor for extended periods on sources without video.
@marko1616
Author

marko1616 commented Jan 19, 2026

Considering there may be cases where a video source can't feed an audio frame within 10 ms, I've raised the WASAPI buffer to 50 ms as a trade-off. I'll continue testing related behaviour: investigating video source feed latency, examining whether it's necessary to decouple video and audio-only sources (they have really different latencies), and perhaps making the WASAPI buffer size dynamic.
For now I just want to keep it simple and focus on that huge 1 s lag.

@norihiro
Contributor

I assume WASAPI and OBS have different clocks; hence, in my opinion, we need to resample the audio.
I cannot find documentation for IAudioClock::SetRate, but FFmpeg provides a software resampler, as I implemented in #6351, which addresses the clock drift between the audio source and OBS.

For DeckLink output, we had an experiment that dynamically adjusts DeckLink's clock based on the offset between OBS and the hardware.
(I'm sorry that I haven't had enough time in recent months to address the DeckLink issue.)
Like this, I think we can dynamically adjust the resampler to keep WASAPI's buffer constant: get the buffer size, average it over recent seconds, take the difference between the averaged buffer size and a target buffer size, and adjust the resampler accordingly.

Regarding the difficulty of triggering and tracking this, I recommend plotting the buffer size and some other metrics on a graph.

@marko1616
Author

marko1616 commented Jan 20, 2026

> I assume WASAPI and OBS have different clocks; hence, in my opinion, we need to resample the audio. I cannot find documentation for IAudioClock::SetRate, but FFmpeg provides a software resampler, as I implemented in #6351, which addresses the clock drift between the audio source and OBS.
>
> For DeckLink output, we had an experiment that dynamically adjusts DeckLink's clock based on the offset between OBS and the hardware. (I'm sorry that I haven't had enough time in recent months to address the DeckLink issue.) Like this, I think we can dynamically adjust the resampler to keep WASAPI's buffer constant: get the buffer size, average it over recent seconds, take the difference between the averaged buffer size and a target buffer size, and adjust the resampler accordingly.
>
> Regarding the difficulty of triggering and tracking this, I recommend plotting the buffer size and some other metrics on a graph.

Yep, I just added the debug plots and outputs as we discussed earlier. In my opinion, your ideas are quite good (I hadn't noticed anyone else attempting these fixes before). I actually considered implementing a similar solution earlier but decided against it due to the complexity; for now, I'm using a simpler approach. I will try to test the performance of your PR's approach on this particular scene, and I'm glad to assist if you need to adapt it for audio monitoring.

@PatTheMav
Member

> I assume WASAPI and OBS have different clocks; hence, in my opinion, we need to resample the audio. I cannot find documentation for IAudioClock::SetRate, but FFmpeg provides a software resampler, as I implemented in #6351, which addresses the clock drift between the audio source and OBS.

That PR came to my mind as well, which is why I pinged you. Without having looked at it in too much detail, is there a possibility for OBS to do this work at a higher level than the platform-specific monitoring implementations?

It would be great if we could implement this just once in OBS rather than 3 times (for Pulse, WASAPI and CoreAudio).

Member

@pkviet pkviet left a comment


Thanks for your PR. My first reaction to this PR was the same as @norihiro that the better fix would be to resample instead of just dropping samples in order to preserve the audio integrity. Also it feels kind of risky to allow just a 10 ms lag so about 512 samples at 48 kHz. I would really be hesitant to merge the PR without extensive tests across a variety of monitoring devices to ensure audio quality is still fine. The monitoring sub-system is sometimes abused to provide an extra output instead of just monitoring so I'd be extra careful. IMO, I'd really prefer a dynamic resampling as done by FFmpeg aresample filter for instance, which can do a dynamic compensation: https://ffmpeg.org/ffmpeg-filters.html#Examples-16
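For reference, the dynamic compensation pkviet mentions can be exercised from the FFmpeg command line; this is only an illustration of the aresample filter's behaviour (using the async parameter from the linked filter examples), not OBS code:

```shell
# Stretch or squeeze the audio by at most 1000 samples per second to chase
# the reference timestamps, instead of dropping whole chunks of samples.
ffmpeg -i input.mkv -af "aresample=async=1000" output.mkv
```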


#define ACTUALLY_DEFINE_GUID(name, l, w1, w2, b1, b2, b3, b4, b5, b6, b7, b8) \
EXTERN_C const GUID DECLSPEC_SELECTANY name = {l, w1, w2, {b1, b2, b3, b4, b5, b6, b7, b8}}
Member


clang-format issue ? this should not belong to this PR


#define do_log(level, format, ...) \
blog(level, "[audio monitoring: '%s'] " format, obs_source_get_name(monitor->source), ##__VA_ARGS__)
Member


clang-format issue

blog(LOG_INFO,
     "diff: %lld, delay buffer size: %lu, "
     "v: %llu: a: %llu",
     diff, (int)monitor->delay_buffer.size, last_frame_ts, front_ts);
Member


clang-format issue

blog(LOG_INFO,
     "diff: %lld, delay buffer size: %lu, "
     "v: %llu: a: %llu",
     diff, (int)monitor->delay_buffer.size, last_frame_ts, front_ts);
Member


clang-format issue


/* ------------------------------------------ *
* Init device */
Member


clang-format issue


/* ------------------------------------------ *
* Init client */
Member


clang-format issue


/* ------------------------------------------ *
* Init resampler */
Member


clang-format issue


/* ------------------------------------------ *
* Init client */
Member


clang-format issue

}
} else {
diff = (int64_t)front_ts - (int64_t)cur_time;
if (diff > 10000000) {
Member


How did you come to that value of 10 ms? (Our audio tick is 1024 samples, so at 48 kHz it is 21.3 ms; I am assuming you picked roughly half?) Did you try other values, larger or smaller?

continue;
}
} else {
if (diff < -10000000 && monitor->delay_buffer.size > 0) {
Member


This is where samples are dropped; how did you reach that value of 10 ms? It might make drops more frequent and really affect audio quality for the benefit of better sync. I am not convinced that is a good trade-off (more on that later).

@RytoEX
Member

RytoEX commented Jan 20, 2026

> This is an exceptionally long delay—so much so that I suspect it might be a mistake.

I am pretty sure this was intentional at the time the audio sub-system was implemented and not a mistake. @Lain-B



Development

Successfully merging this pull request may close these issues.

Audio lag occurs when using audio monitor for long time on sources without video.

5 participants