SFTP Pipeline

High-performance concurrent SFTP file transfer pipeline for Go.

Built for large-scale ingestion with:

  • Parallel SFTP downloads
  • Worker-pool based processing
  • SFTP connection pooling
  • Retry with exponential backoff + jitter
  • Context-aware cancellation
  • Configurable read strategy (ReadAll vs Streaming)
  • Benchmark suite for performance analysis

Architecture Overview

+---------------------+
|     Job Producer    |
+----------+----------+
           |
           v
+---------------------+      (N goroutines)
|    Download Pool    |
+----------+----------+
           |
      Buffered Channel
           |
           v
+---------------------+      (M goroutines)
|   Processing Pool   |
+----------+----------+
           |
           v
        ProcessFunc
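
The diagram above can be sketched as two worker pools joined by a buffered channel. The `job` and `result` types below are illustrative stand-ins, not the package's real API:

```go
package main

import (
	"fmt"
	"sync"
)

// job and result are illustrative stand-ins for the pipeline's real types.
type job struct{ path string }
type result struct {
	path string
	size int
}

// run wires a download pool and a processing pool together with a
// buffered channel and returns the total bytes "processed".
func run(paths []string) int {
	jobs := make(chan job)
	results := make(chan result, 4) // buffered channel between the pools

	// Download pool: N goroutines pull jobs and emit results.
	var dl sync.WaitGroup
	for i := 0; i < 3; i++ {
		dl.Add(1)
		go func() {
			defer dl.Done()
			for j := range jobs {
				results <- result{path: j.path, size: len(j.path)} // fake download
			}
		}()
	}

	// Processing pool: M goroutines consume results.
	var pr sync.WaitGroup
	var mu sync.Mutex
	total := 0
	for i := 0; i < 2; i++ {
		pr.Add(1)
		go func() {
			defer pr.Done()
			for r := range results {
				mu.Lock()
				total += r.size
				mu.Unlock()
			}
		}()
	}

	// Producer feeds jobs; channels are closed in stage order so every
	// goroutine exits cleanly and nothing leaks.
	for _, p := range paths {
		jobs <- job{path: p}
	}
	close(jobs)
	dl.Wait()
	close(results)
	pr.Wait()
	return total
}

func main() {
	fmt.Println(run([]string{"a.csv", "bb.csv", "ccc.csv"}))
}
```

Closing `jobs` first, waiting for the download pool, and only then closing `results` is what lets both pools drain without dropping work.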

Pipeline Stages

1. Job Producer
Feeds file download jobs into the pipeline.


2. Download Pool (I/O Bound)

  • Uses a configurable number of goroutines
  • Acquires SFTP clients from a connection pool
  • Retries transient errors using exponential backoff + jitter
  • Streams or fully reads files based on configuration

3. Processing Pool (CPU Bound)

  • Separate worker pool
  • Executes user-defined ProcessFunc
  • Tracks success / failure metrics
  • Ensures proper resource cleanup

Configuration

type Config struct {
    SFTPReaders   int
    Workers       int
    BufferSize    int
    MaxRetries    int
    ReadMode      ReadMode
    StreamBufSize int
}

Read Modes

const (
    ReadAllMode ReadMode = iota
    StreamingMode
)

Mode      | Memory Usage   | Speed           | Best For
----------|----------------|-----------------|----------------------
ReadAll   | O(file size)   | Faster          | Small–medium files
Streaming | O(buffer size) | Slightly slower | Large files (50MB+)

Retry Strategy

  • Exponential backoff
  • Random jitter to avoid thundering herd
  • Retries only transient errors
  • Permanent errors (e.g., file not found) fail immediately

Connection Pooling

  • Pre-creates SFTP clients
  • Reuses connections safely
  • Prevents repeated SSH handshakes
  • Avoids connection exhaustion
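
A buffered channel is one common way to build such a pool. The `client` type below stands in for `*sftp.Client`; the pattern, not the type, is the point:

```go
package main

import "fmt"

// client stands in for *sftp.Client.
type client struct{ id int }

// pool pre-creates clients and hands them out over a buffered channel:
// transfers reuse SSH sessions instead of re-handshaking, and the pool
// size caps how many connections can exist at once.
type pool struct{ clients chan *client }

func newPool(size int) *pool {
	p := &pool{clients: make(chan *client, size)}
	for i := 0; i < size; i++ {
		p.clients <- &client{id: i} // pre-create connections up front
	}
	return p
}

func (p *pool) acquire() *client  { return <-p.clients } // blocks when exhausted
func (p *pool) release(c *client) { p.clients <- c }

func main() {
	p := newPool(2)
	c := p.acquire()
	fmt.Println("using client", c.id)
	p.release(c)
}
```

Because `acquire` blocks when the channel is empty, the pool doubles as a natural concurrency limit on the SFTP server.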

Context-Aware Cancellation

The pipeline fully respects context.Context:

  • Stops job production
  • Cancels retry backoff
  • Stops workers gracefully
  • Prevents goroutine leaks

Benchmark Results (50MB File)

Environment

  • Local Docker SFTP server
  • Go 1.22
  • Intel i7-1355U
  • Single 50MB file

BenchmarkTransfer_ReadAll

~651ms/op
~754MB allocated
~22k allocs/op

BenchmarkTransfer_Streaming

~2.11s/op
~123MB allocated
~55k allocs/op

What This Means

ReadAllMode

  • Fastest throughput
  • High memory usage (O(file size))
  • Suitable for small/medium files
  • Loads entire file into memory

StreamingMode

  • ~6× lower memory usage
  • Slower due to SFTP packet round trips
  • Suitable for large files or memory-constrained systems
  • Memory usage bounded by buffer

Why This Matters

This benchmark demonstrates explicit performance tradeoffs between:

  • Throughput vs Memory
  • CPU vs I/O bound workloads
  • Allocation pressure
  • Real-world system constraints

Understanding and measuring these tradeoffs is more important than raw speed.


Usage Example

cfg := pipeline.DefaultConfig()
cfg.ReadMode = pipeline.StreamingMode

p := pipeline.New(cfg)

metrics, err := p.Transfer(
    ctx,
    sshCfg,
    jobs,
    func(ctx context.Context, r pipeline.FileResult) error {
        _, err := io.Copy(io.Discard, r.Reader)
        return err
    },
)

fmt.Println(metrics, err)

Metrics

type Metrics struct {
    Success   int32
    Failed    int32
    Cancelled int32
}

Metrics are updated atomically for thread safety.
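
A sketch of what "updated atomically" means in practice (`recordSuccesses` is illustrative, not part of the package):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Metrics mirrors the struct above; atomic increments keep the counters
// correct when many workers finish at the same time.
type Metrics struct {
	Success   int32
	Failed    int32
	Cancelled int32
}

// recordSuccesses simulates n workers each reporting one success.
// A plain m.Success++ here would race and lose updates.
func recordSuccesses(m *Metrics, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt32(&m.Success, 1)
		}()
	}
	wg.Wait()
}

func main() {
	var m Metrics
	recordSuccesses(&m, 100)
	fmt.Println(atomic.LoadInt32(&m.Success))
}
```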


Design Goals

This project demonstrates:

  • Advanced Go concurrency patterns
  • Worker pool architecture
  • Resource pooling
  • Backpressure handling
  • Retry classification
  • Memory vs throughput tradeoffs
  • Clean modular package design
  • Benchmark-driven performance analysis

Security Note

ssh.InsecureIgnoreHostKey() is used for development.

For production, implement strict host key verification.


Ideal Use Cases

  • High-volume SFTP ingestion
  • ETL pipelines
  • Background file processors
  • Memory-sensitive ingestion systems
  • Concurrent file transformation workflows

License

MIT
