Description
s3torchconnector version
s3torchconnector-1.2.3
s3torchconnectorclient version
s3torchconnectorclient-1.2.3
AWS Region
us-west-2
Describe the running environment
EC2 instance p4d.24xlarge
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
Amazon Linux release 2 (Karoo)
What happened?
Hi team,
I was training the Llama v2 70b model on 32 nodes under the Slurm job scheduler, using the SageMaker Model Parallelism Library v2 (with a specific Docker image). To improve checkpoint performance, I tried s3torchconnector. However, when writing checkpoints to S3, the writer stream failed with: "Unknown CRT error: CRT error 14366: aws-c-s3: AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already complete". Note that Llama v2 7b and 13b did not hit this issue.
My first guess was that I had hit the S3 rate limit. The issue went away when I increased part_size from 8 MB to 32 MB. However, the error code is ambiguous, and I am not sure the problem was actually related to the rate limit. Could the team help explain this specific error? Thank you.
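One possibility worth ruling out, stated here as an assumption and not a confirmed cause: S3 multipart uploads allow at most 10,000 parts per object, so part_size caps how large a single written object can be. The sketch below only does that arithmetic. The 100 GiB shard size is a hypothetical figure for illustration, not a measurement from this run, and the helper names are made up for this example.

```python
# S3 multipart uploads are limited to 10,000 parts per object.
MAX_PARTS = 10_000
MiB = 1024 * 1024
GiB = 1024 * MiB


def max_object_size(part_size_bytes: int) -> int:
    """Largest object a single multipart upload can hold at this part size."""
    return part_size_bytes * MAX_PARTS


def parts_needed(object_size_bytes: int, part_size_bytes: int) -> int:
    """Number of parts required for an object (ceiling division)."""
    return -(-object_size_bytes // part_size_bytes)


# Hypothetical checkpoint shard size, for illustration only.
shard_size = 100 * GiB

for part_mib in (8, 32):
    part_size = part_mib * MiB
    needed = parts_needed(shard_size, part_size)
    fits = needed <= MAX_PARTS
    print(
        f"part_size={part_mib} MiB: max object "
        f"{max_object_size(part_size) // MiB} MiB, "
        f"parts needed={needed}, fits={fits}"
    )
```

If the per-rank checkpoint objects for 70b are larger than 10,000 times the part size while the 7b and 13b objects are not, that would be consistent with the larger part_size resolving the error, but it would not by itself prove the cause.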
Relevant log output
9397 4: [rank34]: File "/opt/conda/lib/python3.11/site-packages/s3torchconnector/s3writer.py", line 40, in write
9398 4: [rank34]: self.stream.write(data)
9399 4: [rank34]: s3torchconnectorclient._mountpoint_s3_client.S3Exception: Client error: Unknown CRT error: CRT error 14366: aws-c-s3: AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already complete
Code of Conduct
- I agree to follow this project's Code of Conduct