[Feature][P1]: Investigate and Implement FSx for Persistent Caching #28653

@rzabarazesh

🚀 The feature, motivation and pitch

Description

Amazon FSx for Lustre is a high-performance shared file system optimized for HPC and build workloads, which makes it a promising candidate for Docker layer caching. This task investigates using FSx for Lustre as a shared persistent cache across all build instances, with ECR as a fallback.

What You'll Do

  1. Research FSx for Lustre configuration options and best practices
  2. Create Terraform configuration for FSx Lustre file system
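    • Determine optimal size (start with the 1.2 TiB minimum)
    • Choose throughput tier (125 MB/s per TiB is the lowest, and cheapest, SSD tier on Persistent 2 deployments; at 1.2 TiB that gives 150 MB/s of baseline throughput)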
    • Enable LZ4 compression for space savings
    • Configure security groups for Lustre traffic (TCP 988 and 1018-1023); FSx for Lustre is mounted with the Lustre client, not over NFS
  3. Develop mount automation:
    • Create systemd service to mount FSx on instance boot
    • Add to Packer AMI configuration
    • Handle mount failures gracefully (fall back to local storage)
  4. Configure Docker to use FSx-mounted storage
  5. Test performance with real builds
  6. Compare costs and performance vs ECR-only solution
  7. Document setup and troubleshooting procedures
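The mount-and-fallback behavior in step 3 could be sketched roughly as follows. This is a minimal illustration, not the planned `mount-fsx-cache.sh`: the function names, the 60-second timeout, and the idea of returning a cache directory are assumptions made for the example. The mount options (`-t lustre -o relatime,flock` with a `dns_name@tcp:/mount_name` source) follow the form shown in the AWS mounting instructions, and the DNS name and mount name must come from the created file system.

```python
import subprocess
from typing import Callable, List


def lustre_mount_command(dns_name: str, mount_name: str, mount_point: str) -> List[str]:
    """Build the argv for mounting an FSx for Lustre file system.

    dns_name and mount_name are placeholders for values reported by the
    file system after it is created.
    """
    return [
        "mount", "-t", "lustre", "-o", "relatime,flock",
        f"{dns_name}@tcp:/{mount_name}", mount_point,
    ]


def mount_or_fallback(
    cmd: List[str],
    mount_point: str,
    local_fallback: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Try to mount; return the directory the build cache should use.

    Returns mount_point when the mount command exits 0. On a non-zero exit,
    a timeout, or a missing mount binary, returns local_fallback so the
    instance can still build with a local cache.
    """
    try:
        result = run(cmd, check=False, timeout=60)  # timeout is an assumed value
        if result.returncode == 0:
            return mount_point
    except (OSError, subprocess.TimeoutExpired):
        pass
    return local_fallback
```

The `run` parameter exists only so the logic can be exercised without root privileges or a real file system; a systemd unit would call the function with the default `subprocess.run`.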

Deliverables

  • Terraform configuration for FSx file system
  • Mount automation script (mount-fsx-cache.sh)
  • Updated Packer AMI with FSx mount
  • Docker daemon configuration for FSx storage
  • Performance test results (build times, cache hit rates)
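For the performance results, one possible way to reduce raw build logs to comparable numbers is sketched below. The `BuildRun` record and its fields are hypothetical; the issue does not specify how build times or cache hits will be collected, so treat this as an assumed shape for the data rather than a defined format.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, Sequence


@dataclass
class BuildRun:
    seconds: float      # wall-clock build time
    layers_total: int   # layers the build needed
    layers_cached: int  # layers served from the cache


def summarize(runs: Sequence[BuildRun]) -> Dict[str, float]:
    """Return median build time and overall layer cache hit rate.

    Raises statistics.StatisticsError if runs is empty.
    """
    total = sum(r.layers_total for r in runs)
    cached = sum(r.layers_cached for r in runs)
    return {
        "median_seconds": median(r.seconds for r in runs),
        "cache_hit_rate": cached / total if total else 0.0,
    }
```

Running `summarize` once over the FSx-backed builds and once over the ECR-only builds would give two directly comparable summaries for step 6.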

Alternatives

No response

Additional context

No response

