Our default ABI for amdgpu is undocumented and doesn't exist on Debian. The error diagnostics are dreadful. #152933

@JonChesterfield

If we compile with clang today and run on Debian today, the error message is HSA_STATUS_ERROR_INVALID_CODE_OBJECT. I've raised ROCm/ROCR-Runtime#321 in the hope that HSA might produce a more useful diagnostic.

The primary context is #118515, "Use COV6 by default". I think that was premature, partly because Debian is still on ROCm 6.1, which can't run cov6 binaries, and partly because we seem to be waiting on information about what cov6 actually is relative to cov5. It seems better not to default to cov6 until both of those change.

AMD's documentation on calling conventions is pretty sparse in general. There isn't a "stable ABI" as such, more a continual drift at the IR level and jumpy changes at the binary level. There's some information at https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-code-object-metadata-v5 for previous binary ABI versions. The only references to v6 are in the context of a "generic processor" which refers to a "generic code object version number" but doesn't say what that might be. I've been trying to find information on what changed for v6 and found nothing.

For non-AMD people, the "code object version" nomenclature here means "ABI version", particularly as it relates to the boundary between x64 and amdgpu, i.e. the runtime code that calls functions on the GPU from a host ("launches kernels"). Code built for version N and code built for version M will have some things in common and some things different: the number and order of parameters passed to kernel functions, whether the x64 machine is expected to have allocated a heap or not, how other information is encoded in side channels, and probably more. This is designed carefully, then thrown over the wall. It isn't encoded in the triple or the data layout. I don't seek to change any of that here, merely to ask whether we can roll back to a version that runs on most Linux boxes out there, and thus unbreak everyone who was broken by #118515, or at least document what v6 is.
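Since the version isn't visible in the triple, one place to look is the ELF header of the device code object. A minimal sketch, assuming the `EI_OSABI` and `EI_ABIVERSION` values listed in the AMDGPUUsage ELF header tables (64 for `ELFOSABI_AMDGPU_HSA`, and 0 through 4 for code object v2 through v6) are current. The function names are hypothetical, and the input has to be the extracted device ELF, not the host executable or an offload bundle:

```python
# Assumed values, taken from the ELF header tables in AMDGPUUsage.
ELFOSABI_AMDGPU_HSA = 64
# EI_ABIVERSION byte -> code object version (assumed mapping).
ABIVERSION_TO_COV = {0: 2, 1: 3, 2: 4, 3: 5, 4: 6}


def code_object_version(ident: bytes):
    """Return the code object version encoded in the ELF ident bytes.

    Returns None if the bytes are not an ELF header, not an amdhsa
    code object, or carry an ABI version this sketch doesn't know.
    """
    if len(ident) < 9 or ident[:4] != b"\x7fELF":
        return None
    if ident[7] != ELFOSABI_AMDGPU_HSA:  # byte 7 is EI_OSABI
        return None
    return ABIVERSION_TO_COV.get(ident[8])  # byte 8 is EI_ABIVERSION


def code_object_version_of_file(path):
    """Read the first 16 bytes (e_ident) of a file and decode them."""
    with open(path, "rb") as f:
        return code_object_version(f.read(16))
```

Running `code_object_version_of_file` on a device ELF before launch would at least show which version is about to be handed to a runtime that may reject it.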

Thanks

(edit: if it turns out that all v6 contains is a "generic ISA" approximation thing, which every user would have to opt into deliberately, then we should default to v5 and leave it there until v7 comes along, and so should ROCm. HSA probably also shouldn't be refusing to run v6.)
