
[AMDGPU] missed add-with-carry optimization #152992

@ahorek

Description

Consider this example:

typedef uint uint32;
typedef ulong uint64;
typedef struct { uint64 s0; uint32 s1; } uint96;

uint96 uint96_add_64(const uint96 x, const uint64 y) {
	uint96 r;
	const uint64 s0 = x.s0 + y;
	r.s0 = s0;
	r.s1 = x.s1 + (s0 < y);
	return r;
}

It gets compiled to:

	v_add_co_u32 v0, vcc_lo, v3, v0
	v_add_co_ci_u32_e32 v1, vcc_lo, v4, v1, vcc_lo
	v_cmp_lt_u64_e32 vcc_lo, v[0:1], v[3:4]
	v_add_co_ci_u32_e32 v2, vcc_lo, 0, v2, vcc_lo

The v_cmp_lt_u64_e32 shouldn't be needed: after the second add, vcc_lo already contains the carry-out of the 64-bit addition, which is exactly what the compare recomputes.
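
As a sanity check, here is a small standalone C program (my addition, not from the issue; it assumes GCC/Clang for __builtin_add_overflow) showing that (s0 < y) is exactly the carry-out of the unsigned 64-bit add, i.e. the same bit the first two instructions already leave in vcc_lo:

#include <stdint.h>
#include <stdio.h>

int main(void) {
	/* one case that carries, one that doesn't */
	const uint64_t cases[][2] = { { UINT64_MAX, 2 }, { 1, 2 } };
	for (int i = 0; i < 2; i++) {
		uint64_t x0 = cases[i][0], y = cases[i][1], s0;
		/* carry-out of the unsigned 64-bit add, as a flag */
		int carry = __builtin_add_overflow(x0, y, &s0);
		/* the source-level test recomputes exactly that flag */
		printf("carry=%d  (s0 < y)=%d\n", carry, (int)(s0 < y));
	}
	return 0;
}

This prints "carry=1  (s0 < y)=1" and then "carry=0  (s0 < y)=0"; the two expressions agree for all inputs, since unsigned addition wraps iff the sum is smaller than either operand.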

Expected

	v_add_co_u32 v0, vcc_lo, v3, v0
	v_add_co_ci_u32_e32 v1, vcc_lo, v4, v1, vcc_lo
	v_add_co_ci_u32_e32 v2, vcc_lo, 0, v2, vcc_lo

See these issues for more examples:
ROCm/ROCm#4717
ROCm/ROCm#477 (comment)

There's already an optimization for uint (32-bit + 32-bit), which generates optimal code:

v_add_co_u32_e32 v2, vcc, v3, v2
v_addc_co_u32_e32 v2, vcc, 0, v2, vcc
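
For reference, the 32-bit source pattern that already triggers this optimization presumably looks like the following (my sketch, reusing the typedefs from the example above; the uint64s struct and function names are hypothetical):

typedef struct { uint32 s0; uint32 s1; } uint64s;

uint64s uint64s_add_32(const uint64s x, const uint32 y) {
	uint64s r;
	const uint32 s0 = x.s0 + y;
	r.s0 = s0;
	r.s1 = x.s1 + (s0 < y);  // (s0 < y) is the carry-out of the 32-bit add
	return r;
}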

However, the generated code is not optimized for ulong combinations (32-bit + 64-bit or 64-bit + 64-bit); see 1c9a93a. Could the optimization be extended to those cases? Thanks!
