
[AMDGPU] missed add-with-carry optimization #152992

@ahorek

Description

Consider this example:

typedef uint uint32;
typedef ulong uint64;
typedef struct { uint64 s0; uint32 s1; } uint96;

uint96 uint96_add_64(const uint96 x, const uint64 y) {
	uint96 r;
	const uint64 s0 = x.s0 + y;
	r.s0 = s0;
	r.s1 = x.s1 + (s0 < y);
	return r;
}

It gets compiled to:

	v_add_co_u32 v0, vcc_lo, v3, v0
	v_add_co_ci_u32_e32 v1, vcc_lo, v4, v1, vcc_lo
	v_cmp_lt_u64_e32 vcc_lo, v[0:1], v[3:4]
	v_add_co_ci_u32_e32 v2, vcc_lo, 0, v2, vcc_lo

The v_cmp_lt_u64_e32 shouldn't be needed: after the second add, vcc_lo already contains the carry-out of the 64-bit addition, which is exactly what the compare recomputes.
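
As a sanity check, here is a small standalone C program (my addition, not from the issue; it assumes GCC/Clang for __builtin_add_overflow) showing that (s0 < y) is exactly the carry-out of the unsigned 64-bit add, i.e. the same bit the first two instructions already leave in vcc_lo:

#include <stdint.h>
#include <stdio.h>

int main(void) {
	/* one case that carries, one that doesn't */
	const uint64_t cases[][2] = { { UINT64_MAX, 2 }, { 1, 2 } };
	for (int i = 0; i < 2; i++) {
		uint64_t x0 = cases[i][0], y = cases[i][1], s0;
		/* carry-out of the unsigned 64-bit add, as a flag */
		int carry = __builtin_add_overflow(x0, y, &s0);
		/* the source-level test recomputes exactly that flag */
		printf("carry=%d  (s0 < y)=%d\n", carry, (int)(s0 < y));
	}
	return 0;
}

This prints "carry=1  (s0 < y)=1" and then "carry=0  (s0 < y)=0"; the two expressions agree for all inputs, since unsigned addition wraps iff the sum is smaller than either operand.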

Expected

	v_add_co_u32 v0, vcc_lo, v3, v0
	v_add_co_ci_u32_e32 v1, vcc_lo, v4, v1, vcc_lo
	v_add_co_ci_u32_e32 v2, vcc_lo, 0, v2, vcc_lo

See these issues for more examples:
ROCm/ROCm#4717
ROCm/ROCm#477 (comment)

There's already an optimization for uint (32-bit + 32-bit), which generates optimal code:

v_add_co_u32_e32 v2, vcc, v3, v2
v_addc_co_u32_e32 v2, vcc, 0, v2, vcc
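
For reference, the 32-bit source pattern that already triggers this optimization presumably looks like the following (my sketch, reusing the typedefs from the example above; the uint64s struct and function names are hypothetical):

typedef struct { uint32 s0; uint32 s1; } uint64s;

uint64s uint64s_add_32(const uint64s x, const uint32 y) {
	uint64s r;
	const uint32 s0 = x.s0 + y;
	r.s0 = s0;
	r.s1 = x.s1 + (s0 < y);  // (s0 < y) is the carry-out of the 32-bit add
	return r;
}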

However, the generated code is not optimized for ulong combinations (32-bit + 64-bit or 64-bit + 64-bit); see 1c9a93a. Could the optimization be extended to those cases? Thanks!
