@huydhn huydhn commented Sep 18, 2024

As reported by @guangy10 in pytorch/executorch#5441 (comment), the job timeout can be set to a higher value (120 minutes), but the AWS credentials obtained via https://github.com/aws-actions/configure-aws-credentials still expire after 1 hour.

The fix here is to set the role duration to match the job timeout. However, the maximum it can go to is 18000 seconds (5 hours). I believe that's plenty of room already, but if a higher value is needed, we will need to update the server side: https://github.com/pytorch-labs/pytorch-gha-infra/blob/main/runners/gha_roles.tf#L1425
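A minimal sketch of the conversion and clamping described above, assuming the workflow derives the role duration from a `TIMEOUT_IN_MINUTES` input as in the `run:` step quoted in the review thread. The function `role_duration_seconds` and the constant `MAX_ROLE_DURATION_SECONDS` are hypothetical names; 18000 is the server-side maximum stated above. The resulting value would then be passed to the action's `role-duration-seconds` input.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Server-side maximum session duration for the runner role (18000 s = 5 h).
# Raising it requires changing gha_roles.tf on the infra side.
MAX_ROLE_DURATION_SECONDS=18000

# Hypothetical helper: convert a job timeout in minutes into a role
# duration in seconds, clamped to the server-side maximum.
role_duration_seconds() {
  local timeout_in_minutes=$1
  local seconds=$(( timeout_in_minutes * 60 ))
  if (( seconds > MAX_ROLE_DURATION_SECONDS )); then
    seconds=$MAX_ROLE_DURATION_SECONDS
  fi
  echo "$seconds"
}

role_duration_seconds 120   # 2-hour timeout
role_duration_seconds 360   # 6-hour timeout, exceeds the cap
```

With a 120-minute timeout this yields 7200 seconds, the value seen in the test run below; any timeout above 300 minutes is clamped to 18000.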

Testing

The role duration is set correctly to 7200 seconds (2 hours) https://github.com/pytorch/test-infra/actions/runs/10916800466/job/30298898748#step:4:4

@facebook-github-bot facebook-github-bot added the CLA Signed label Sep 18, 2024
@huydhn huydhn requested review from a team and guangy10 September 18, 2024 06:20
vercel bot commented Sep 18, 2024

The latest updates on your projects.

1 Skipped Deployment

| Name | Status | Updated (UTC) |
|---|---|---|
| torchci | ⬜️ Ignored | Sep 18, 2024 7:41pm |

run: |
  set -ex
  TIMEOUT_IN_SECONDS=$(( TIMEOUT_IN_MINUTES * 60 ))
A contributor commented on this step:

The role duration should probably be somewhat longer than the bare timeout, say 1 hour more. But I wouldn't go with a complex solution here; instead, just request the longest duration allowed. I don't see any downside to doing this, and it is less prone to failure.

The author (@huydhn) replied:

Sounds good. I'll just use the max duration here to keep it simple.
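The two options discussed in this thread can be compared with a short sketch. It assumes the same 18000-second cap and interprets the reviewer's "1h more than the timeout" as a fixed 3600-second buffer; `duration_with_buffer` and `duration_max` are hypothetical names, not code from this PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

MAX_ROLE_DURATION_SECONDS=18000   # server-side cap from the PR description
BUFFER_SECONDS=3600               # assumed: "1h more than the timeout"

# Option 1 (hypothetical): timeout plus a one-hour buffer, clamped to the cap.
duration_with_buffer() {
  local seconds=$(( $1 * 60 + BUFFER_SECONDS ))
  if (( seconds > MAX_ROLE_DURATION_SECONDS )); then
    seconds=$MAX_ROLE_DURATION_SECONDS
  fi
  echo "$seconds"
}

# Option 2: ignore the timeout and always request the maximum.
duration_max() {
  echo "$MAX_ROLE_DURATION_SECONDS"
}

for timeout_in_minutes in 60 120 300; do
  echo "timeout=${timeout_in_minutes}m buffer=$(duration_with_buffer "$timeout_in_minutes")s max=$(duration_max)s"
done
```

Option 2 has no arithmetic or clamping to get wrong and never requests less than Option 1, which is the reviewer's argument for keeping it simple.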

@huydhn huydhn merged commit 3baf8aa into main Sep 18, 2024
7 checks passed
huydhn added a commit to huydhn/test-infra that referenced this pull request Sep 24, 2024
kit1980 pushed a commit that referenced this pull request Sep 24, 2024
Labels

CLA Signed

4 participants