Skip to content

CRT reaise AWS_ERROR_S3_REQUEST_HAS_COMPLETED when upload Llama 70b checkpoints. #219

@crazyguitar

Description

@crazyguitar

s3torchconnector version

s3torchconnector-1.2.3

s3torchconnectorclient version

s3torchconnectorclient-1.2.3

AWS Region

us-west-2

Describe the running environment

EC2 instance p4d.24xlarge
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
Amazon Linux release 2 (Karoo)

What happened?

Hi team,

I was running the Llama v2 70b model with 32 nodes using the Slurm job scheduler, based on the SageMaker Model Parallelism Library v2 (using a specific Docker image). To improve checkpoint performance, I tried using the s3connector. However, when writing checkpoints to S3, the writer stream encountered an error: "Unknown CRT error: CRT error 14366: aws-c-s3: AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already complete". Note that Llama v2 7b, 13b did not encounter this issue.

This error seemed to indicate that I had hit the S3 rate limit. The issue was resolved when I increased the part_size from 8Mb to 32Mb. However, the error code was ambiguous, and I was unsure if the problem was indeed related to the rate limit. Could the team help to explain this specific error. Thank you.

Relevant log output

9397  4: [rank34]:   File "/opt/conda/lib/python3.11/site-packages/s3torchconnector/s3writer.py", line 40, in write                                                                                      
9398  4: [rank34]:     self.stream.write(data)                                                                                                                                                           
9399  4: [rank34]: s3torchconnectorclient._mountpoint_s3_client.S3Exception: Client error: Unknown CRT error: CRT error 14366: aws-c-s3: AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already complete

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions