S3 Garbage Collection Tool

A safe, online garbage collector for Docker Distribution registries using S3 storage.

Features

  • Safe deletion - Uses time-window approach to avoid race conditions
  • No downtime - Can run while registry is active
  • Cost optimized - Minimizes S3 API calls
  • State tracking - Remembers when blobs became unreferenced
  • Dry-run mode - Test before deleting
  • Detailed reporting - Shows exactly what was deleted

How It Works

The tool implements a time-based safety mechanism similar to the one used by the Zot registry:

  1. Mark Phase: Scans all manifests to build a set of referenced blobs
  2. Sweep Phase: Identifies unreferenced blobs
  3. Safety Check: Only deletes blobs that have been unreferenced for > N hours
  4. State Tracking: Maintains state file to track unreferenced duration
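The mark phase above can be sketched in a few lines. This assumes manifests follow the Docker image manifest v2 schema (a `config` object plus a `layers` array, each carrying a `digest`); manifest lists and other media types would need extra handling, and the function names here are illustrative, not the tool's actual API.

```python
def referenced_digests(manifest: dict) -> set[str]:
    """Return the set of blob digests one manifest references."""
    digests = set()
    config = manifest.get("config")
    if config and "digest" in config:
        digests.add(config["digest"])
    for layer in manifest.get("layers", []):
        if "digest" in layer:
            digests.add(layer["digest"])
    return digests


def mark(manifests: list[dict]) -> set[str]:
    """Mark phase: union of digests referenced by all manifests."""
    referenced = set()
    for manifest in manifests:
        referenced |= referenced_digests(manifest)
    return referenced
```

Any blob digest absent from the returned set is a sweep candidate, subject to the safety window below.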

Why This Is Safe

Timeline:
T0: GC starts, marks blob as unreferenced
T1: Client uploads manifest referencing that blob
T2: GC checks age - blob just became unreferenced (< 1 hour old)
T3: GC skips deletion (safety window = 48 hours)
T4: Next GC run - blob is now referenced again, safe!

The safety window must be longer than your maximum image push time.

Installation

cd tools/
pip install -r requirements.txt

Usage

First Run (Dry-Run Recommended)

# See what would be deleted without actually deleting
python s3-gc.py --bucket my-registry-bucket --dry-run

This will:

  • Scan all manifests
  • Identify unreferenced blobs
  • Show what would be deleted
  • Create gc-state.json to track unreferenced blobs

Production Run

# Actually delete blobs unreferenced for > 48 hours
python s3-gc.py --bucket my-registry-bucket --safety-hours 48

Scheduled Runs

Add to cron for regular garbage collection:

# Run every hour
0 * * * * cd /path/to/tools && python s3-gc.py --bucket my-registry-bucket --safety-hours 48 >> gc-cron.log 2>&1

Advanced Options

# Custom S3 prefix
python s3-gc.py --bucket my-registry-bucket \
                --prefix docker/registry/v2/ \
                --safety-hours 72

# Use specific AWS profile
AWS_PROFILE=production python s3-gc.py --bucket my-registry-bucket

# Verbose logging
python s3-gc.py --bucket my-registry-bucket --verbose

# Custom state file
python s3-gc.py --bucket my-registry-bucket \
                --state-file /var/lib/gc-state.json

Configuration

AWS Credentials

The tool uses boto3, so configure AWS credentials via:

# Environment variables
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1

# Or AWS CLI profile
export AWS_PROFILE=production

# Or ~/.aws/credentials
[default]
aws_access_key_id = ...
aws_secret_access_key = ...

IAM Permissions Required

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::my-registry-bucket",
        "arn:aws:s3:::my-registry-bucket/*"
      ]
    }
  ]
}

Safety Window Recommendations

Upload Frequency   Recommended Safety Window
Continuous         72-96 hours
Daily builds       48-72 hours
Infrequent         24-48 hours

Formula: Safety Window > (Max Upload Duration × 2) + GC Scan Interval
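A worked example of the formula, with illustrative numbers (not measurements from any particular registry):

```python
def min_safety_window(max_upload_hours: float, scan_interval_hours: float) -> float:
    """Safety Window > (Max Upload Duration x 2) + GC Scan Interval."""
    return max_upload_hours * 2 + scan_interval_hours

# A registry whose slowest push takes 2 hours, scanned every 6 hours:
window = min_safety_window(max_upload_hours=2, scan_interval_hours=6)
print(window)  # 10.0
```

So even in that case a 24-48 hour window leaves ample margin over the 10-hour minimum.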

Cost Optimization

This tool is designed to minimize S3 costs:

API Calls Per Run

For a registry with:

  • 100 repositories
  • 500 manifests total
  • 10,000 blobs

Estimated S3 API calls:

  • List repositories: ~1 LIST operation
  • List manifests: ~100 LIST operations
  • Read manifests: ~500 GET operations
  • List all blobs: ~10 LIST operations (with pagination)
  • Delete blobs: ~N DELETE operations (N = deletable blobs)

Total: ~611 + N operations

At AWS S3 pricing (us-east-1):

  • LIST: $0.005 per 1,000 requests
  • GET: $0.0004 per 1,000 requests
  • DELETE: free (no charge)

Cost per run: ~$0.001 (less than a penny!)
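The estimate can be reproduced as back-of-envelope arithmetic, using the us-east-1 request prices quoted above (DELETE requests are free):

```python
# Price per 1,000 requests, us-east-1, as quoted in this README.
PRICE_PER_1000 = {"LIST": 0.005, "GET": 0.0004, "DELETE": 0.0}

def run_cost(list_calls: int, get_calls: int, delete_calls: int) -> float:
    """Dollar cost of one GC run's S3 API calls."""
    return (list_calls * PRICE_PER_1000["LIST"]
            + get_calls * PRICE_PER_1000["GET"]
            + delete_calls * PRICE_PER_1000["DELETE"]) / 1000

# 1 + 100 + 10 = 111 LIST calls, 500 GET calls, say 128 DELETE calls:
print(round(run_cost(111, 500, 128), 6))  # 0.000755
```

Note that the cost is dominated by LIST and GET; the number of deletions N does not affect it.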

Optimization Tips

  1. Run less frequently: Hourly runs are usually overkill. Consider:

    • Every 6 hours for active registries
    • Daily for low-traffic registries
  2. Increase safety window: Longer windows mean fewer deletions per run

  3. Monitor state file size: The state file grows with unreferenced blobs. Consider cleanup:

    # Clean up state older than 30 days
    python s3-gc.py --cleanup-state --days 30
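One way the state cleanup could work is to drop entries whose "became unreferenced" timestamp is older than a cutoff. A sketch under that assumption, not the tool's actual implementation:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def prune_state(state: dict[str, str], max_age_days: int,
                now: Optional[datetime] = None) -> dict[str, str]:
    """Keep only entries newer than max_age_days; timestamps are ISO 8601."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return {digest: ts for digest, ts in state.items()
            if datetime.fromisoformat(ts) >= cutoff}
```

Entries older than the cutoff are almost always blobs that were deleted long ago, so dropping them keeps the state file bounded.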

State File

The tool maintains gc-state.json to track when blobs became unreferenced:

{
  "sha256:abc123...": "2025-12-13T10:30:00+00:00",
  "sha256:def456...": "2025-12-14T15:45:00+00:00"
}

  • Purpose: Track unreferenced duration
  • Location: Current directory (or use --state-file)
  • Backup: Recommended to backup this file
  • Cleanup: Automatically removes referenced blobs from tracking
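The sweep and safety check combine this state file with the current unreferenced set. A minimal sketch of that logic (function and variable names here are illustrative):

```python
from datetime import datetime, timedelta, timezone

def sweep(unreferenced: set[str], state: dict[str, str],
          safety_hours: int, now: datetime) -> tuple[list[str], dict[str, str]]:
    """Return (blobs safe to delete, updated state).

    - Tracked blobs unreferenced for longer than safety_hours: deletable.
    - Newly unreferenced blobs: recorded in state, not deleted yet.
    - Blobs referenced again simply never appear in `unreferenced`,
      so they drop out of the state automatically.
    """
    cutoff = now - timedelta(hours=safety_hours)
    deletable, new_state = [], {}
    for digest in unreferenced:
        first_seen = state.get(digest, now.isoformat())
        if datetime.fromisoformat(first_seen) <= cutoff:
            deletable.append(digest)        # unreferenced long enough
        else:
            new_state[digest] = first_seen  # keep tracking
    return deletable, new_state
```

This is what makes the timeline in "Why This Is Safe" work: a blob seen unreferenced for the first time is only recorded, never deleted, on that run.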

Monitoring

Log Files

# View real-time logs
tail -f gc.log

# Search for errors
grep ERROR gc.log

# Count deletions
grep "Deleting blob" gc.log | wc -l

Output Statistics

Each run produces a summary:

GC Statistics:
  Total blobs found:        10,234
  Referenced blobs:         8,456
  Unreferenced blobs:       1,778
  Skipped (too new):        1,650
  Deleted blobs:            128
  Bytes deleted:            5,234,567,890 (4.87 GB)
  Errors encountered:       0

Metrics to Track

  • Deleted blobs per run: Should stabilize over time
  • Skipped (too new): Should be > 0 (shows safety is working)
  • Errors: Should be 0
  • Bytes deleted: Monitor storage savings

Troubleshooting

"Too many blobs being deleted"

If you see a large number of deletions on first run:

# First run with dry-run
python s3-gc.py --bucket my-registry-bucket --dry-run

# Review what would be deleted
less gc.log

# If it looks wrong, increase safety window
python s3-gc.py --bucket my-registry-bucket --safety-hours 96

"NoSuchBucket error"

Check bucket name and AWS credentials:

aws s3 ls s3://my-registry-bucket/

"State file corruption"

Delete and recreate:

rm gc-state.json
python s3-gc.py --bucket my-registry-bucket --dry-run

"Out of memory"

For very large registries (100k+ blobs), consider:

  1. Run on a machine with more RAM
  2. Process repositories in batches
  3. Use --verbose to see progress
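Batching (option 2) can be as simple as walking repositories in fixed-size chunks instead of materializing everything at once. The helper below is a generic sketch, not a flag the tool currently exposes; note the referenced-digest set itself must still be global, since blobs are shared across repositories.

```python
from itertools import islice
from typing import Iterable, Iterator

def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

# e.g. scan manifests 25 repositories at a time:
# for repo_batch in batched(all_repositories, 25):
#     scan_manifests(repo_batch)
```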

Safety Features

  1. Dry-run by default: Must explicitly disable
  2. Confirmation prompt: Asks for confirmation before deleting
  3. Time-based safety: Won't delete recent blobs
  4. State persistence: Tracks deletion candidates across runs
  5. Detailed logging: Audit trail of all deletions
  6. Error handling: Continues on errors, reports at end

Comparison with Built-in GC

Feature                  Built-in GC   This Tool
Requires downtime        Yes           No
Read-only mode needed    Yes           No
Time-based safety        No            Yes
Direct S3 access         No            Yes
Cost optimized           No            Yes
State tracking           No            Yes

Contributing

Improvements welcome! Consider adding:

  • Progress bars for long operations
  • Prometheus metrics export
  • Slack/email notifications
  • Parallel processing for large registries
  • S3 lifecycle policy integration
  • Support for other storage backends

License

Use at your own risk. Test thoroughly in non-production first.
