@msrathore-db
Collaborator

Summary

Implements straggler download mitigation for CloudFetch to improve query performance by detecting and cancelling abnormally slow parallel downloads. The feature monitors active downloads and automatically retries stragglers when faster slots become available, with an optional fallback to sequential mode after a configurable threshold.

Changes

New Classes:

  • StragglerDownloadDetector (StragglerDetector.cs)

    • Core detection algorithm using median throughput analysis
    • Configurable multiplier (default 1.5x slower than median)
    • Minimum completion quantile (default 60%)
    • Straggler padding grace period (default 5 seconds)
    • Sequential fallback threshold tracking
    • Duplicate detection prevention via tracking dictionary
  • FileDownloadMetrics (FileDownloadMetrics.cs)

    • Tracks per-file download performance
    • Start time, completion time, and throughput calculation
    • Straggler cancellation flag
    • Thread-safe state management
  • CloudFetchStragglerMitigationConfig (CloudFetchStragglerMitigationConfig.cs)

    • Configuration management and validation
    • Connection property parsing
    • Default values management
    • Parameter range validation

CloudFetchDownloader Integration:

  • Background monitoring thread (runs every 5 seconds)
  • Per-file CancellationTokenSource for clean cancellation
  • Automatic retry mechanism for cancelled stragglers
  • Metrics tracking for all active downloads
  • Sequential fallback mode support
  • Thread-safe metrics and cancellation token management

Configuration Parameters:
All parameters use the adbc.databricks.cloudfetch. prefix:

  • straggler_mitigation_enabled (default: false) - Feature toggle
  • straggler_multiplier (default: 1.5) - Throughput multiplier for detection
  • straggler_quantile (default: 0.6) - Minimum fraction of downloads that must complete before detection starts
  • straggler_padding_seconds (default: 5) - Grace period before flagging as straggler
  • max_stragglers_per_query (default: 10) - Threshold to trigger sequential fallback
  • synchronous_fallback_enabled (default: true) - Enable automatic fallback to sequential mode
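
As an illustration only, the parameters above could be supplied as string key/value connection properties. The sketch below assumes the driver accepts properties as a string dictionary; the class and method names (StragglerConfigExample, Build) are hypothetical and not part of this PR. The values shown are the documented defaults, except that the feature toggle is switched on.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of this PR.
public static class StragglerConfigExample
{
    // Prefix shared by all straggler-mitigation parameters (from the list above).
    private const string Prefix = "adbc.databricks.cloudfetch.";

    // Builds the six properties with their documented defaults,
    // except that the feature toggle is set to "true" to opt in.
    public static Dictionary<string, string> Build() => new Dictionary<string, string>
    {
        [Prefix + "straggler_mitigation_enabled"] = "true",
        [Prefix + "straggler_multiplier"] = "1.5",
        [Prefix + "straggler_quantile"] = "0.6",
        [Prefix + "straggler_padding_seconds"] = "5",
        [Prefix + "max_stragglers_per_query"] = "10",
        [Prefix + "synchronous_fallback_enabled"] = "true",
    };
}
```

Because the feature is opt-in, leaving out straggler_mitigation_enabled (or setting it to false) keeps the existing behavior.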

Benefits

  • Performance: Up to 50% improvement in queries with straggler downloads (based on Java JDBC driver results)
  • Robustness: Handles network variability and slow storage automatically
  • Safety: Opt-in feature with zero overhead when disabled
  • Flexibility: All parameters are tunable for different network conditions
  • Backward Compatible: No changes to existing behavior when disabled

Technical Details

Detection Algorithm:

  1. Wait until 60% of downloads complete (configurable quantile)
  2. Calculate median throughput from completed downloads
  3. Identify downloads running more than 1.5x slower than the median (configurable multiplier)
  4. Apply 5-second grace period before flagging as straggler
  5. Cancel via per-file CancellationTokenSource
  6. Track cancelled downloads to prevent duplicate detection
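
The detection steps can be sketched roughly as follows. This is a simplified illustration, not the StragglerDownloadDetector code from this PR: DownloadSample and StragglerSketch are hypothetical names, and the rule used (elapsed time greater than multiplier times the time expected at median throughput, plus the padding) is one plausible reading of steps 3 and 4. Cancellation and duplicate tracking (steps 5 and 6) are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for per-file metrics (not the PR's FileDownloadMetrics).
public sealed record DownloadSample(long FileOffset, long SizeBytes, double ElapsedSeconds, bool IsCompleted);

public static class StragglerSketch
{
    // Returns the offsets of in-progress downloads that look like stragglers.
    public static List<long> FindStragglers(
        IReadOnlyList<DownloadSample> samples,
        double multiplier = 1.5,
        double quantile = 0.6,
        double paddingSeconds = 5.0)
    {
        var stragglers = new List<long>();
        if (samples.Count == 0) return stragglers;

        // Step 1: do nothing until the required fraction of downloads has completed.
        var completed = samples.Where(s => s.IsCompleted && s.ElapsedSeconds > 0).ToList();
        if (completed.Count == 0 || completed.Count < Math.Ceiling(samples.Count * quantile))
            return stragglers;

        // Step 2: median throughput (bytes per second) of the completed downloads.
        var throughputs = completed.Select(s => s.SizeBytes / s.ElapsedSeconds).OrderBy(t => t).ToList();
        int mid = throughputs.Count / 2;
        double median = throughputs.Count % 2 == 1
            ? throughputs[mid]
            : (throughputs[mid - 1] + throughputs[mid]) / 2.0;
        if (median <= 0) return stragglers;

        // Steps 3-4 (assumed rule): flag an active download once its elapsed time exceeds
        // multiplier * (time expected at median throughput) + the grace period.
        foreach (var s in samples.Where(s => !s.IsCompleted))
        {
            double allowedSeconds = multiplier * (s.SizeBytes / median) + paddingSeconds;
            if (s.ElapsedSeconds > allowedSeconds) stragglers.Add(s.FileOffset);
        }
        return stragglers;
    }
}
```

With the defaults, a 10 MB file is allowed 6.5 seconds when the median throughput is 10 MB/s (1.5 x 1 s + 5 s of padding) before it is flagged.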

Sequential Fallback:

  • Triggered after 10 stragglers detected in current batch (configurable)
  • Applies only to remaining files in current batch
  • Resets for next FetchResults call
  • Prevents thrashing in consistently slow network conditions
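
A minimal sketch of the fallback bookkeeping described above, assuming the threshold is inclusive (the tenth straggler triggers the switch). FallbackTrackerSketch and its members are hypothetical names for illustration, not the classes in this PR.

```csharp
using System.Threading;

// Hypothetical illustration of per-batch straggler counting and fallback.
public sealed class FallbackTrackerSketch
{
    private readonly int _maxStragglers;
    private readonly bool _fallbackEnabled;
    private long _stragglerCount;   // long, so overflow is not a practical concern
    private int _sequentialMode;    // 0 = parallel, 1 = sequential

    public FallbackTrackerSketch(int maxStragglers = 10, bool fallbackEnabled = true)
    {
        _maxStragglers = maxStragglers;
        _fallbackEnabled = fallbackEnabled;
    }

    public bool IsSequentialMode => Volatile.Read(ref _sequentialMode) == 1;

    // Called once per detected straggler; returns true if the batch is now sequential.
    public bool RecordStraggler()
    {
        long count = Interlocked.Increment(ref _stragglerCount);
        // Assumption: the threshold is inclusive.
        if (_fallbackEnabled && count >= _maxStragglers)
        {
            Interlocked.Exchange(ref _sequentialMode, 1);
        }
        return IsSequentialMode;
    }

    // Called at the start of the next batch so the fallback does not carry over.
    public void ResetForNextBatch()
    {
        Interlocked.Exchange(ref _stragglerCount, 0);
        Interlocked.Exchange(ref _sequentialMode, 0);
    }
}
```

Keeping the counter and the mode flag as separate atomics keeps each update cheap; the two are not updated as a single transaction, which is acceptable here because the flag only ever moves from parallel to sequential within a batch.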

Thread Safety:

  • ConcurrentDictionary for metrics and cancellation tokens
  • Atomic operations for counter increments
  • Proper cleanup in finally blocks
  • Linked cancellation tokens for cascading shutdown
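
The per-file cancellation pattern could look roughly like this. It is a sketch under assumptions, not the CloudFetchDownloader code: PerFileCancellationSketch, RunAsync, and CancelStraggler are hypothetical names, and the download itself is passed in as a delegate.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration of per-file CancellationTokenSource management.
public sealed class PerFileCancellationSketch
{
    private readonly ConcurrentDictionary<long, CancellationTokenSource> _perFile = new();

    // Runs one download. Returns true on success and false if it was cancelled
    // as a straggler (the caller may then retry). Shutdown cancellation propagates.
    public async Task<bool> RunAsync(
        long fileOffset,
        Func<CancellationToken, Task> download,
        CancellationToken shutdownToken)
    {
        // Linked source: cancelling the shutdown token also cancels this download.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(shutdownToken);
        _perFile[fileOffset] = cts;
        try
        {
            await download(cts.Token).ConfigureAwait(false);
            return true;
        }
        catch (OperationCanceledException) when (!shutdownToken.IsCancellationRequested)
        {
            return false;
        }
        finally
        {
            // Always unregister, even on exceptions, before the source is disposed.
            _perFile.TryRemove(fileOffset, out _);
        }
    }

    // Called by the monitor; returns false if the download already finished.
    public bool CancelStraggler(long fileOffset)
    {
        if (!_perFile.TryGetValue(fileOffset, out var cts)) return false;
        try
        {
            cts.Cancel();
            return true;
        }
        catch (ObjectDisposedException)
        {
            // The download completed between the lookup and the Cancel call.
            return false;
        }
    }
}
```

The exception filter separates the two cancellation causes: a straggler cancellation is reported as a retryable result, while a shutdown cancellation is allowed to propagate to the caller.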

Testing

38 tests total, all passing:

Unit Tests (19):

  • FileDownloadMetrics throughput calculation (before/after completion)
  • FileDownloadMetrics straggler flag setting
  • StragglerDownloadDetector parameter validation (multiplier, quantile)
  • Median calculation correctness (odd/even counts)
  • Quantile threshold enforcement
  • Fallback threshold triggering
  • Empty metrics list handling
  • Cancelled downloads filtering
  • Duplicate detection prevention (tracking dictionary)
  • CancellationTokenSource atomic replacement
  • Cleanup behavior under exceptions
  • Shutdown cancellation respect
  • Concurrent CTS cleanup
  • Counter overflow protection (long type)
  • Concurrent modification safety

E2E Tests (19):

  • Slow download identification and cancellation
  • Fast downloads not marked as stragglers
  • Minimum completion quantile requirement
  • Sequential fallback activation after threshold
  • Sequential mode enforcement (one download at a time)
  • No stragglers detected in sequential mode
  • Sequential fallback applies only to current batch
  • Monitoring thread respects cancellation
  • Parallel mode respects max parallel downloads
  • Cancelled straggler retry logic
  • Mixed speed download scenarios
  • Clean shutdown during monitoring
  • Feature disabled by default verification
  • Configuration parameter definitions
  • Configuration parameter naming convention
  • StragglerDownloadDetector creation
  • FileDownloadMetrics creation
  • Counter increments atomically

Documentation

  • straggler-mitigation-design.md - Comprehensive design doc with algorithm details, implementation notes, configuration guide, and usage examples
  • Inline code comments - Detailed documentation in all new classes

🤖 Generated with Claude Code

Collaborator

@eric-wang-1990 left a comment

I have 2 overall comments:

  1. This PR is too large for a reasonable review, especially since it contains many concurrency-control scenarios that need careful review and design. Is it possible to break it into smaller pieces?
  2. Have you been able to try this out in PowerBI E2E?

{
    try
    {
        await Task.Delay(StragglerMonitoringInterval, cancellationToken).ConfigureAwait(false);

Collaborator

Is this monitoring every 2 seconds? What does this mean for performance/speed?

Collaborator Author

I haven't benchmarked the monitoring interval yet. Once we confirm the design of the feature, I'll run benchmarks for this parameter.

_downloadSemaphore.Release();
bool shouldAcquireSequential = _isSequentialMode;
bool acquiredSequential = false;
if (shouldAcquireSequential)

Collaborator

Why do you need the extra shouldAcquireSequential variable to take a snapshot of _isSequentialMode? What if the value of _isSequentialMode changes between lines 404 and 406?

Collaborator Author

The reasoning was to avoid releasing the semaphore if it was never acquired. But this is a potential race condition, so I have updated the code. Thanks.

…apshot of isSequentialMode for acquiring the semaphore
@msrathore-db
Collaborator Author

I have 2 overall comments:

  1. This PR is too large for a reasonable review, especially since it contains many concurrency-control scenarios that need careful review and design. Is it possible to break it into smaller pieces?
  2. Have you been able to try this out in PowerBI E2E?

I can raise stacked PRs for this feature. That should make it easier to review.
