fix: add global deadline and mitigate PoC validation timeout attack #827

Open
ouicate wants to merge 2 commits into gonka-ai:upgrade-v0.2.11 from ouicate:fix/poc-validation-timeout

@ouicate ouicate commented Feb 28, 2026

Malicious participants could force the off-chain PoC validation pipeline to exceed its on-chain time budget by delaying HTTP proof requests. Workers would continue retrying indefinitely past the on-chain submission window, producing validation results that could never be submitted.

Root causes addressed:

  1. No global deadline linked to on-chain window:

    • Added computeValidationDeadline() that calculates remaining time from the epoch's EndOfPoCValidation block height (~5.41s/block)
    • ValidateAll now uses context.WithTimeout instead of context.WithCancel
    • 60s safety buffer (configurable via DeadlineBuffer) ensures time for final submission before window closes
    • Deadline-aware context propagates to all HTTP and ML node calls, cancelling in-flight requests when the window expires
  2. Worker context not propagated to HTTP calls:

    • validateParticipant now receives the deadline-aware ctx parameter instead of creating its own context.Background()
    • HTTP proof fetches and ML node requests respect the global deadline
  3. Busy-wait spin on retry-after:

    • Workers now sleep 100ms when encountering not-yet-ready items
    • Added ctx.Done() check during retry re-queue to prevent deadlock on context cancellation
  4. Retry-exhausted participants not reported:

    • Participants that exhaust all 15 retries are now reported to the chain as invalid, so attackers who stall proof requests no longer escape penalty

@ouicate ouicate changed the base branch from main to upgrade-v0.2.11 February 28, 2026 14:50
	case workChan <- work:
	case <-ctx.Done():
		return
	}

It's not bad to cancel here before we sleep.
But this was described as preventing a deadlock on context cancellation. How could a deadlock happen here? After the continue, won't it go back to the for loop and check ctx.Done() there?

akup commented Mar 5, 2026

> Workers would continue retrying indefinitely past the on-chain submission window

Why can workers keep retrying indefinitely?

Retries in the validation workers are limited by MaxRetries:
if work.attempt < v.config.MaxRetries-1

Moreover, a block can take longer than 5.41s to finish, so even adding 60 seconds does not guarantee that workers will not be cancelled before the validation window closes.

It may be more precise to cancel ctx when the phase switches, rather than adding context.WithTimeout:

	// Stop workers when the chain moves from PoCValidatePhase to the next phase (e.g. PoCValidateWindDown / Inference)
	phaseCheckInterval := v.config.PhaseCheckInterval
	if phaseCheckInterval <= 0 {
		phaseCheckInterval = 3 * time.Second
	}
	go func() {
		ticker := time.NewTicker(phaseCheckInterval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				state := v.phaseTracker.GetCurrentEpochState()
				if state == nil {
					continue
				}
				if state.CurrentPhase != types.PoCValidatePhase {
					logging.Info("OffChainValidator: validation phase ended, stopping workers", types.PoC,
						"currentPhase", state.CurrentPhase, "blockHeight", state.CurrentBlock.Height)
					cancel()
					return
				}
			}
		}
	}()
