@msrathore-db msrathore-db commented Nov 17, 2025

🥞 Stacked PR

Use this link to review incremental changes.


Summary

Integrates straggler mitigation components into CloudFetchDownloader. Implementation only - tests in next PR.

CloudFetchDownloader Integration:

  • Background monitoring thread runs every 5 seconds to check for stragglers
  • Per-file CancellationTokenSource for clean cancellation of individual downloads
  • Automatic retry mechanism for cancelled stragglers when slots become available
  • Metrics tracking for all active downloads via FileDownloadMetrics
  • Sequential fallback mode after threshold (default: 10 stragglers)
  • Thread-safe management with ConcurrentDictionary and atomic operations
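
A minimal sketch of how per-file cancellation and the monitoring loop described above could fit together. This is not the code from this PR: the class name `PerFileCancellationSketch`, the `isStraggler` predicate, and the meaning of the `long` key are all assumptions made for illustration. Each per-file `CancellationTokenSource` is linked to the main token, so cancelling the main token still stops every download.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch; names and structure are assumptions, not the PR's implementation.
public sealed class PerFileCancellationSketch : IDisposable
{
    private readonly ConcurrentDictionary<long, CancellationTokenSource> _perFileTokens =
        new ConcurrentDictionary<long, CancellationTokenSource>();
    private readonly CancellationToken _mainToken;

    public PerFileCancellationSketch(CancellationToken mainToken)
    {
        _mainToken = mainToken;
    }

    // Registers a download. The returned token fires when either this file
    // or the main token is cancelled.
    public CancellationToken Register(long fileKey)
    {
        var cts = CancellationTokenSource.CreateLinkedTokenSource(_mainToken);
        if (!_perFileTokens.TryAdd(fileKey, cts))
        {
            cts.Dispose();
            throw new InvalidOperationException("Download already registered: " + fileKey);
        }
        return cts.Token;
    }

    // Cancels a single download without touching the others.
    public bool CancelFile(long fileKey)
    {
        if (!_perFileTokens.TryGetValue(fileKey, out var cts)) return false;
        try { cts.Cancel(); return true; }
        catch (ObjectDisposedException) { return false; } // download finished concurrently
    }

    // Removes and disposes the per-file source once the download ends,
    // so completed downloads do not accumulate.
    public void Complete(long fileKey)
    {
        if (_perFileTokens.TryRemove(fileKey, out var cts)) cts.Dispose();
    }

    // Periodically checks active downloads; exits when the main token is cancelled.
    public async Task MonitorAsync(Func<long, bool> isStraggler, TimeSpan interval)
    {
        try
        {
            while (!_mainToken.IsCancellationRequested)
            {
                await Task.Delay(interval, _mainToken).ConfigureAwait(false);
                foreach (var key in _perFileTokens.Keys)
                {
                    if (isStraggler(key)) CancelFile(key);
                }
            }
        }
        catch (OperationCanceledException)
        {
            // Expected on shutdown.
        }
    }

    public void Dispose()
    {
        foreach (var key in _perFileTokens.Keys) Complete(key);
    }
}
```

In this sketch a download would call `Register` before starting, pass the returned token to its HTTP call, and call `Complete` in a `finally` block.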

Key Implementation Details:

  • Monitoring thread respects main cancellation token
  • Per-file CTS cleanup prevents memory leaks
  • Sequential mode applies only to current batch, resets on next FetchResults call
  • Duplicate detection prevents same file from being flagged multiple times
  • Zero overhead when disabled (default state)
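
The duplicate detection and per-batch sequential fallback could look roughly like the sketch below. The class `StragglerCounterSketch` and its members are hypothetical, and the sketch assumes the fallback triggers once the count reaches the threshold (the description does not say whether the comparison is inclusive).

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical sketch; not the PR's implementation.
public sealed class StragglerCounterSketch
{
    private readonly ConcurrentDictionary<long, byte> _flagged =
        new ConcurrentDictionary<long, byte>();
    private readonly int _threshold;
    private int _count;
    private int _sequential; // 0 = parallel, 1 = sequential

    public StragglerCounterSketch(int threshold = 10)
    {
        if (threshold <= 0) throw new ArgumentOutOfRangeException(nameof(threshold));
        _threshold = threshold;
    }

    public bool IsSequentialMode => Volatile.Read(ref _sequential) == 1;

    // Returns true only the first time a file is flagged, so the same
    // file is never counted twice.
    public bool Flag(long fileKey)
    {
        if (!_flagged.TryAdd(fileKey, 0)) return false;

        // Assumption: reaching the threshold (inclusive) switches modes.
        if (Interlocked.Increment(ref _count) >= _threshold)
        {
            Interlocked.Exchange(ref _sequential, 1);
        }
        return true;
    }

    // Called at the start of the next batch so sequential mode does not persist.
    public void ResetForNextBatch()
    {
        _flagged.Clear();
        Interlocked.Exchange(ref _count, 0);
        Interlocked.Exchange(ref _sequential, 0);
    }
}
```

`ResetForNextBatch` is not atomic as a whole; the sketch assumes no downloads from the previous batch are still being flagged when it runs.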

Code Changes:

  • ✅ Add _stragglerMitigationConfig field
  • ✅ Initialize StragglerDownloadDetector when enabled
  • ✅ Add _activeDownloadMetrics dictionary for tracking
  • ✅ Add _perFileCancellationTokens for individual cancellation
  • ✅ Add _isSequentialMode flag and counter
  • ✅ Implement MonitorForStragglers() background thread
  • ✅ Update DownloadFileAsync() to track metrics and respect per-file CTS
  • ✅ Add cleanup logic in Dispose()

Builds on: stack/straggler-detector
// Straggler mitigation fields
private readonly bool _isStragglerMitigationEnabled;
private readonly StragglerDownloadDetector? _stragglerDetector;
private readonly ConcurrentDictionary<long, FileDownloadMetrics>? _activeDownloadMetrics;
Collaborator review comment:

should make these part of the detector for better class isolation, and use event notification to the download detector for reporting

var config = stragglerConfig ?? CloudFetchStragglerMitigationConfig.Disabled;
_isStragglerMitigationEnabled = config.Enabled;

if (config.Enabled)
Collaborator review comment:

again, this logic should be in the detector, and if mitigation is not enabled, the detector should just return nothing


_cancellationTokenSource?.Cancel();

// Stop straggler monitoring if running
Collaborator review comment:

can we move this to the straggler detector?

* See the License for the specific language governing permissions and
* limitations under the License.
*/

Collaborator review comment:

overall feedback on this file: the change here should be minimal. The downloader should just receive a download failure with a reason of straggler download detection and retry; the rest of the logic should be handled by the straggler download detector.
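
One possible reading of this suggestion, sketched under assumptions: the downloader only sees a failure tagged as a straggler and retries, while detection lives elsewhere. `StragglerDetectedException` and `StragglerRetrySketch` are hypothetical names, not types from this PR or the driver.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical failure type that a detector would surface to the downloader.
public sealed class StragglerDetectedException : Exception
{
    public StragglerDetectedException(string message) : base(message) { }
}

// Hypothetical sketch of the minimal downloader-side logic.
public static class StragglerRetrySketch
{
    // Runs the attempt and retries only when it failed because it was
    // flagged as a straggler. Any other failure propagates unchanged.
    public static async Task<T> RunAsync<T>(Func<int, Task<T>> attempt, int maxRetries)
    {
        for (int attemptNumber = 0; ; attemptNumber++)
        {
            try
            {
                return await attempt(attemptNumber).ConfigureAwait(false);
            }
            catch (StragglerDetectedException) when (attemptNumber < maxRetries)
            {
                // Flagged as a straggler: fall through and try again.
            }
        }
    }
}
```

Once `maxRetries` is exhausted, the exception filter no longer matches and the last `StragglerDetectedException` reaches the caller.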
