Skip to content

Conversation

akashveramd
Copy link

@akashveramd akashveramd commented Sep 8, 2025

Some runners do not have AWS CLI installed. Hence, in PyTorch we rely on github aws-actions instead of the CLI for authentication. To provide support for workflow files using linux_job_v2.yml, added aws-actions authentication to linux_job_v2.yml.

It is needed to get torchtitan PR going pytorch/torchtitan#1260. It currently faces issue with https://github.com/pytorch/torchtitan/actions/runs/16353043263/job/46204468985?pr=1260#step:8:221

Copy link

vercel bot commented Sep 8, 2025

@akashveramd is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 8, 2025
@huydhn
Copy link
Contributor

huydhn commented Sep 8, 2025

Unfortunately, rolling this would be hard work because all callers of linux_job_v2 would need to pass id-token: write permission explicitly. This is a popular workflow with a long list of callers https://github.com/search?type=code&q=org%3Apytorch+linux_job_v2.yml, so I'm not sure this is the approach we want to take

malfet
malfet previously approved these changes Sep 23, 2025
Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it fails lots of tests, isn't it?

@akashveramd akashveramd reopened this Oct 2, 2025
@pytorch-bot pytorch-bot bot dismissed malfet’s stale review October 2, 2025 18:59

This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.

@akashveramd
Copy link
Author

Hi @malfet, it seems you approved the request. But I closed the PR thinking I have found a work around where we don't need to change linux_job_v2.yaml. However, I think I was wrong about it. Having aws-setup as a part of linux_job_v2.yaml looks like an only option. Can you please re-approve the changes?

@akashveramd akashveramd requested a review from malfet October 3, 2025 16:51
@malfet
Copy link
Contributor

malfet commented Oct 3, 2025

@akashveramd I didn't approve it, I've requested changes as it's break all existing CI workflows

@akashveramd
Copy link
Author

akashveramd commented Oct 3, 2025

@akashveramd I didn't approve it, I've requested changes as it's break all existing CI workflows

@malfet: I think those failures are happening because the workflow is running from a forked PR and the permission to gha_workflow_s3_and_ecr_read_only is not granted to it. I had a similar issue in my torchtitan PR where I was trying to do aws authentication inside the workflow. @huydhn pointed to this issue in the slack channel discussion.

As a short-term workaround, Huy gave me write permission to the pytorch/torchtitan repo and I migrated my torchtitan PR directly inside pytorch/torchtitan, instead of my fork. With that aws issues disappeared.

@malfet
Copy link
Contributor

malfet commented Oct 3, 2025

@huydhn do you mind validating those then?

@huydhn
Copy link
Contributor

huydhn commented Oct 3, 2025

Um, no, my comment here #7115 (comment) still stands. This workflow does have id-token: write which is needed by OIDC. If we add it here to linux_job_v2, we need to do the same for all callers of the workflow, which are many https://github.com/search?type=code&q=org%3Apytorch+linux_job_v2.yml. So, it's better to not use this for ROCm jobs, or create a new reusable workflow for ROCm

You might have fixed it on torchtitan, but other users of linux_job_v2 would need to do the same on their end

@akashveramd
Copy link
Author

akashveramd commented Oct 3, 2025

Um, no, my comment here #7115 (comment) still stands. This workflow does have id-token: write which is needed by OIDC. If we add it here to linux_job_v2, we need to do the same for all callers of the workflow, which are many https://github.com/search?type=code&q=org%3Apytorch+linux_job_v2.yml. So, it's better to not use this for ROCm jobs, or create a new reusable workflow for ROCm

You might have fixed it on torchtitan, but other users of linux_job_v2 would need to do the same on their end

@huydhn: Instead of modifying this workflow, I can create a new reusable workflow for ROCm with aws authentication. Something like linux_job_v2_rocm.yml

@akashveramd akashveramd force-pushed the av_linux_job_v2_main branch from fdc2593 to f7ee220 Compare October 3, 2025 22:37
@akashveramd
Copy link
Author

@huydhn, @malfet: Created a new reusable workflow linux_job_v2_rocm.yml and rolled back changes to linux_job_v2.yml.

@huydhn
Copy link
Contributor

huydhn commented Oct 3, 2025

@huydhn: Instead of modifying this workflow, I can create a new reusable workflow for ROCm with aws authentication. Something like linux_job_v2_rocm.yml

Yes, that sounds reasonable

Copy link

vercel bot commented Oct 3, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Preview Updated (UTC)
torchci Ignored Ignored Preview Oct 3, 2025 10:46pm

@huydhn
Copy link
Contributor

huydhn commented Oct 3, 2025

A suggestion here is to add a test workflow like https://github.com/pytorch/test-infra/blob/main/.github/workflows/test_linux_job_v2.yml for this new one. It's hard to avoid failures otherwise

@akashveramd
Copy link
Author

A suggestion here is to add a test workflow like https://github.com/pytorch/test-infra/blob/main/.github/workflows/test_linux_job_v2.yml for this new one. It's hard to avoid failures otherwise

@huydhn: Sorry, are you suggesting that I should create something like .github/workflows/test_linux_job_v2_rocm.yml and that will run for linux_job_v2_rocm.yml? Similar to how .github/workflows/test_linux_job_v2.yml runs for linux_job_v2.yml.

@huydhn
Copy link
Contributor

huydhn commented Oct 3, 2025

@huydhn: Sorry, are you suggesting that I should create something like .github/workflows/test_linux_job_v2_rocm.yml and that will run for linux_job_v2_rocm.yml? Similar to how .github/workflows/test_linux_job_v2.yml runs for linux_job_v2.yml.

Yup, exactly, you can call this new workflows rocm_job.yml and test_rocm_job.yml IMO. linux_job_v2_rocm.yml and test_linux_job_v2_rocm.yml is too lengthy :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants