CRT raises AWS_ERROR_S3_REQUEST_HAS_COMPLETED when uploading Llama 70b checkpoints #219

### s3torchconnector version

s3torchconnector-1.2.3

### s3torchconnectorclient version

s3torchconnectorclient-1.2.3

### AWS Region

us-west-2

### Describe the running environment

EC2 instance p4d.24xlarge
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
Amazon Linux release 2 (Karoo)

### What happened?

Hi team, 

I was running the Llama v2 70b model on 32 nodes under the Slurm job scheduler, based on the [SageMaker Model Parallelism Library v2](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-model-parallel-support-v2.html) (using a specific Docker image). To improve checkpoint performance, I tried using s3torchconnector. However, when writing checkpoints to S3, the writer stream failed with: "Unknown CRT error: CRT error 14366: aws-c-s3: AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already complete". Note that Llama v2 7b and 13b did not hit this issue.

At first I took this error to mean that I had hit the S3 rate limit. The issue went away when I increased `part_size` from 8 MB to 32 MB. However, the error code is ambiguous, and I am not sure the problem is actually related to the rate limit. Could the team help explain this specific error? Thank you.
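One possibility I have not confirmed: S3 multipart uploads are capped at 10,000 parts per object, so the part size bounds how large a single uploaded object can be. The sketch below is plain arithmetic, not s3torchconnector code. It assumes the 8 and 32 figures are MiB, and the 100 GiB shard size is a hypothetical value for illustration, not a measurement from my run.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

# Documented S3 limit on the number of parts in one multipart upload.
S3_MAX_PARTS = 10_000


def parts_needed(object_size: int, part_size: int) -> int:
    """Number of parts required to upload object_size bytes (ceiling division)."""
    return max(1, -(-object_size // part_size))


def max_object_size(part_size: int) -> int:
    """Largest object, in bytes, that fits in one multipart upload at this part size."""
    return part_size * S3_MAX_PARTS


# Hypothetical size of one checkpoint object (assumption, not measured).
shard_size = 100 * GIB

for part_mib in (8, 32):
    part_size = part_mib * MIB
    needed = parts_needed(shard_size, part_size)
    limit_gib = max_object_size(part_size) / GIB
    status = "exceeds" if needed > S3_MAX_PARTS else "fits within"
    print(
        f"part_size={part_mib} MiB: max object {limit_gib:.3f} GiB, "
        f"{needed} parts needed, {status} the {S3_MAX_PARTS}-part limit"
    )
```

If the 70b checkpoint objects are larger than about 78 GiB while the 7b and 13b ones are smaller, that would be consistent with what I observed, but I have not verified the object sizes.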

### Relevant log output

```shell
9397  4: [rank34]:   File "/opt/conda/lib/python3.11/site-packages/s3torchconnector/s3writer.py", line 40, in write
9398  4: [rank34]:     self.stream.write(data)
9399  4: [rank34]: s3torchconnectorclient._mountpoint_s3_client.S3Exception: Client error: Unknown CRT error: CRT error 14366: aws-c-s3: AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already complete
```


### Code of Conduct

- [X] I agree to follow this project's Code of Conduct
