Stream Stuck in "Busy" State When Forward Backend API Fails #4631

@jackboy1006

Description

Describe the bug

When the forward backend API returns an HTTP 500 error (or any error), the stream becomes permanently stuck in a "busy" state and cannot be republished until SRS is restarted. This creates a "ghost busy" stream that is not visible via the SRS API but blocks any new publish attempt for that stream.

Version

  • SRS Version: v6.0-r0 (also tested on v6.0.155)
  • Deployment: Kubernetes cluster
  • Feature: Dynamic Forward with Backend API

To Reproduce

Steps to reproduce the behavior:

  1. Configure SRS with dynamic forward backend:

vhost __defaultVhost__ {
    forward {
        enabled on;
        backend http://backend-service/api/NewStream;
    }
}

  2. Create a backend service that returns an HTTP 500 error.

  3. Push a stream using OBS:

    • Configure OBS with RTMP URL: rtmp://srs-server/live/test
    • Start streaming

  4. The backend returns HTTP 500, so the stream publish fails.

  5. Try to push the same stream again using OBS:

    • Stop and restart streaming in OBS

  6. See the error: "stream is busy".

  7. Check the SRS API - the stream doesn't appear:

curl http://srs-server:1985/api/v1/streams

  8. The stream is permanently stuck and only an SRS restart can recover it.

Expected behavior

When backend API fails, one of these should happen:

  • on_unpublish() should be called to reset the publish state
  • There should be a timeout mechanism to auto-clear the busy state
  • Backend API errors should not prevent the stream from being published (just skip forwarding)

Additional context

Root Cause Analysis (Based on Source Code)

The Issue Flow:

  1. Client publishes stream to SRS
  2. SRS calls SrsLiveSource::on_publish()
  3. on_publish() sets can_publish_ = false before calling hub_->on_publish()
  4. hub_->on_publish() → create_backend_forwarders() → on_forward_backend() calls the backend API
  5. Backend API returns HTTP 500 → error propagates back
  6. on_publish() returns error, but can_publish_ is already false
  7. Critical: If on_unpublish() is not called to reset can_publish_ = true, the stream is stuck

Source Code Evidence:

File: trunk/src/app/srs_app_rtmp_source.cpp

srs_error_t SrsLiveSource::on_publish()
{
    ...
    can_publish_ = false;  // ← Set to false BEFORE calling hub
    
    // Notify the hub about the publish event.
    if (hub_ && (err = hub_->on_publish()) != srs_success) {
        return srs_error_wrap(err, "hub publish");  // ← Returns error if backend fails
    }
    ...
}

void SrsLiveSource::on_unpublish()
{
    ...
    can_publish_ = true;  // ← Only place where it's reset to true
}

bool SrsLiveSource::can_publish(bool is_edge)
{
    ...
    return can_publish_;  // ← Checked by acquire_publish()
}

File: trunk/src/app/srs_app_rtmp_conn.cpp

// Check whether RTMP stream is busy.
if (!source->can_publish(info_->edge_)) {
    return srs_error_new(ERROR_SYSTEM_STREAM_BUSY, "rtmp: stream %s is busy", req->get_stream_url().c_str());
}

Why Auto-Cleanup Doesn't Work:

bool SrsLiveSource::stream_is_dead()
{
    // still publishing?
    if (!can_publish_ || !publish_edge_->can_publish()) {
        return false;  // ← If can_publish_ is false, source is NEVER cleaned up!
    }
    ...
}

Workaround

Ensure the backend API never returns errors:

[HttpPost]
public async Task<IResult> HandleNewStream(...)
{
    try
    {
        var response = await srsService.HandleNewStream(request);
        return Results.Ok(response);
    }
    catch (Exception ex)
    {
        logger.Error(ex, "Error processing request");
        
        // Always return HTTP 200 + Code 0, even on error
        return Results.Ok(new SrsForwardResponse
        {
            Code = 0,
            Data = new SrsForwardData { Urls = [] }  // Empty URLs = no forwarding
        });
    }
}
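
With this handler the backend always replies with code 0 and an empty url list (with ASP.NET Core's default JSON serialization, roughly {"code":0,"data":{"urls":[]}}), so SRS accepts the publish and simply creates no forwarders. The stuck-busy state never occurs, but genuine backend failures are hidden, so this is only a stopgap.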

Proposed Fix

Option 1: Ensure on_unpublish() is Always Called

In SrsRtmpConn::publishing(), ensure release_publish() properly calls on_unpublish() even when on_publish() fails.
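
A minimal sketch of this idea, assuming the rollback is done directly inside SrsLiveSource::on_publish() (calling on_unpublish() from the same error path would achieve the same reset):

srs_error_t SrsLiveSource::on_publish()
{
    ...
    can_publish_ = false;

    // Notify the hub about the publish event.
    if (hub_ && (err = hub_->on_publish()) != srs_success) {
        // Hypothetical fix: roll back the publish state before propagating
        // the error, so the next publish attempt is not rejected as "busy".
        can_publish_ = true;
        return srs_error_wrap(err, "hub publish");
    }
    ...
}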

Option 2: Add Timeout Mechanism

Add a timestamp to track when can_publish_ was set to false, and auto-reset after a timeout (e.g., 30 seconds).
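
A rough sketch of that mechanism, using SRS's existing time helpers (the field publish_started_at_ is hypothetical, and a real fix would also have to verify there is no healthy, active publisher before resetting the flag):

// In SrsLiveSource: remember when the last publish attempt started.
srs_utime_t publish_started_at_ = 0;

srs_error_t SrsLiveSource::on_publish()
{
    ...
    can_publish_ = false;
    publish_started_at_ = srs_get_system_time();
    ...
}

bool SrsLiveSource::can_publish(bool is_edge)
{
    // Hypothetical auto-recovery: if the source has been "busy" longer than
    // the timeout, assume the earlier publish attempt failed and allow a
    // new attempt.
    if (!can_publish_ && srs_get_system_time() - publish_started_at_ > 30 * SRS_UTIME_SECONDS) {
        can_publish_ = true;
    }
    return can_publish_;
}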

Option 3: Change Error Handling Strategy

Don't fail the entire publish when only the forward backend fails. Log the error and continue publishing without forwarding.
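
A sketch of what this could look like at the point where the hub creates the backend forwarders (the exact call site and the applied variable are assumptions for illustration; srs_warn, srs_freep and srs_error_desc are existing SRS helpers):

// Wherever hub_->on_publish() ends up invoking create_backend_forwarders():
// downgrade a backend failure from fatal to a warning, so the publish
// itself still succeeds, just without forwarding.
if ((err = create_backend_forwarders(applied)) != srs_success) {
    srs_warn("forward backend failed, continue without forwarding: %s",
        srs_error_desc(err).c_str());
    srs_freep(err);
}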

Production Impact

This issue affects production systems where:

  • Backend services may experience temporary failures
  • Streams need to be quickly republished after disconnection
  • Automatic recovery is essential for reliability

The current behavior requires manual intervention (a pod restart), which is not acceptable for high-availability systems.
