Skip to content

Conversation

@sarnex
Copy link
Contributor

@sarnex sarnex commented Nov 26, 2024

We now have two AMDGPU runners, but we had to use one of them for build because we lost the PVC build machines.

I just readded the PVC machines and made them our only build systems (they are really fast too), so now both AMDGPU runners only do AMDGPU testing.

The AMDGPU tests run in about 10 minutes, and with two runners, I think we have enough runner bandwidth to justify re-enabling it in precommit testing (we also have two NVIDIA runners and we run that in precommit and it takes double the time to run).

These two machines are stable, we don't use the unstable runner anymore.

@sarnex sarnex marked this pull request as ready for review November 26, 2024 17:10
@sarnex sarnex requested a review from a team as a code owner November 26, 2024 17:10
@sarnex sarnex merged commit 9ccc9ee into intel:sycl Nov 26, 2024
12 checks passed
@npmiller
Copy link
Contributor

npmiller commented Dec 3, 2024

We've had a PR where testing failed on the gfx1031 failed, but passed on the gfx1030, since gfx1031 is not officially supported by ROCm, maybe it make more sense to only have the one gfx1030 machine in pre-commit, what do you think? Would it still be enough bandwidth with just one machine?

@sarnex
Copy link
Contributor Author

sarnex commented Dec 3, 2024

@npmiller I think one machine is too little. If that test fails either consistently or sporadically and we're confident it's because of the architecture, we can just make it UNSUPPORTED on that architecture IMO. All of the other tests pass on both machines so I don't think we need to get rid of the 1031 machine.

@aelovikov-intel
Copy link
Contributor

@npmiller I think one machine is too little. If that test fails either consistently or sporadically and we're confident it's because of the architecture, we can just make it UNSUPPORTED on that architecture IMO. All of the other tests pass on both machines so I don't think we need to get rid of the 1031 machine.

That makes the result flaky (UNSUPPORTED/PASS) though... I'm not blocking anything, just highlighting this drawback.

@sarnex
Copy link
Contributor Author

sarnex commented Dec 3, 2024

@aelovikov-intel Ah yeah I see, whether the test runs or not will depend on what runner gets allocated. Since we don't have that many changes that are likely to affect HIP only nor tests where we expected to have to restrict based on AMD arch, IMO it's fine because any issue should get caught eventually. If we do this for many tests then we definitely need a better solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants