
[Draft] Accelerate Half with FP16 ISA#122649

Draft
anthonycanino wants to merge 2 commits into dotnet:main from anthonycanino:half-xmm-struct-abi

Conversation

@anthonycanino
Contributor

Draft PR for in-progress work to accelerate System.Half with FP16 ISA.

Current work done:

  1. Add a TYP_HALF to the .NET runtime, which is treated like a TYP_SIMDXX, but with some notable differences. Namely, a TYP_HALF is passed around via the xmm registers, and while it will pass a varTypeIsStruct test, it must be treated as a primitive in other places.

  2. Accelerate System.Half operations with the TYP_HALF and some FP16 intrinsics. Not every possible function has been accelerated yet.

For discussion:

  1. I have currently worked around some checks to make TYP_HALF behave like a struct and a primitive. It's very ad-hoc at the moment.

  2. Much of the work to transform the named System.Half intrinsics into a sequence of intrinsic nodes is done in importercalls.cpp and might be better moved up into some of the gtNewSimdXX nodes.

@github-actions bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Dec 18, 2025
@anthonycanino
Contributor Author

@tannergooding @jakobbotsch please take a look when you get a chance.

@anthonycanino
Contributor Author

@dotnet/intel @tannergooding may I get some high-level feedback on the structure of the PR?

Comment on lines 578 to 581
if (!compOpportunisticallyDependsOn(InstructionSet_AVX10v1))
{
    return false;
}
Member

We need this check last, not first; otherwise code gets tagged as benefiting from AVX10v1 unnecessarily.
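
The ordering concern can be sketched as follows. This is only an illustration: `CompilerSketch`, `isNativeHalfIsaFirst`, and `isNativeHalfIsaLast` are hypothetical stand-ins, and the sketch assumes that a successful `compOpportunisticallyDependsOn` query also records the ISA as a dependency of the code being compiled.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-ins; none of these are the real JIT declarations.
enum InstructionSetSketch { ISA_AVX10v1 };

struct CompilerSketch
{
    bool                           isaAvailable = true;
    std::set<InstructionSetSketch> recordedDependencies;

    // Assumed behaviour: a successful query also records that the compiled
    // code depends on the ISA, whether or not the caller ends up using it.
    bool compOpportunisticallyDependsOn(InstructionSetSketch isa)
    {
        if (!isaAvailable)
        {
            return false;
        }
        recordedDependencies.insert(isa);
        return true;
    }

    // ISA query first: the dependency is recorded even for non-Half types.
    bool isNativeHalfIsaFirst(bool isIntrinsicType, unsigned sizeBytes)
    {
        if (!compOpportunisticallyDependsOn(ISA_AVX10v1))
        {
            return false;
        }
        return isIntrinsicType && (sizeBytes == 2);
    }

    // ISA query last: the dependency is only recorded once every other
    // check has already passed.
    bool isNativeHalfIsaLast(bool isIntrinsicType, unsigned sizeBytes)
    {
        if (!isIntrinsicType || (sizeBytes != 2))
        {
            return false;
        }
        return compOpportunisticallyDependsOn(ISA_AVX10v1);
    }
};
```

With the ISA-first ordering, asking about an 8-byte non-intrinsic struct still leaves `ISA_AVX10v1` in `recordedDependencies`; with the ISA-last ordering the set stays empty until a real Half-shaped type is seen.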

Comment on lines -5393 to -5394
// kmov instructions reach this path with EA_8BYTE size, even on x86
|| IsKMOVInstruction(ins)
Member

What's the reason for removing this part of the assert?

Contributor Author

Think that was an error, will fix.


case INS_vmovsh:
{
    hasSideEffect = false;
Member

Doesn't this have a side effect of clearing the upper-bits?

That is, it always does DEST[MAXVL:128] := 0
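
To make the upper-bits point concrete, here is a small software model of the register-to-register form. It is a sketch under stated assumptions: `VecReg` and `vmovshRegSketch` are hypothetical names, MAXVL is taken to be 512 bits, and the merge of bits 127:16 from the first source follows the usual scalar-move pattern rather than anything quoted in this thread.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// A 512-bit vector register modelled as 32 lanes of 16 bits (MAXVL = 512 assumed).
using VecReg = std::array<uint16_t, 32>;

// Model of the register form "vmovsh dest, src1, src2":
//   dest[15:0]      = src2[15:0]
//   dest[127:16]    = src1[127:16]
//   dest[MAXVL:128] = 0
VecReg vmovshRegSketch(const VecReg& src1, const VecReg& src2)
{
    VecReg dest{}; // value-initialised, so every lane starts at zero

    dest[0] = src2[0];
    for (std::size_t lane = 1; lane < 8; lane++)
    {
        dest[lane] = src1[lane];
    }
    // Lanes 8..31 (bits 128 and above) are left at zero.
    return dest;
}
```

In this model, moving a register onto itself is not a no-op whenever any bit above 127 is set, which is why marking the instruction as free of side effects would be unsafe.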

Contributor Author

You are correct, I will change.


#if defined(TARGET_AMD64)
case INS_movsxd:
case INS_vmovsh:
Member

This isn't TARGET_AMD64 exclusive as vmovsh is listed with V/V for support, so is valid for both 64 and 32-bit mode.

Comment on lines +11819 to +11822
if (IsXMMReg(reg))
{
    return emitXMMregName(reg);
}
Member

This shouldn't be TARGET_AMD64 exclusive either.

else if (code & 0xFF000000)
{
- if (size == EA_2BYTE)
+ if (size == EA_2BYTE && (ins != INS_vmovsh && ins != INS_vaddsh))
Member

@tannergooding Jan 6, 2026

Can we just use && !IsSimdInstruction(ins)?

case INS_movapd:
case INS_movupd:
// todo-xarch-half: come back to fix
case INS_vmovsh:
Member

Shouldn't this be grouped with vmovss and vmovsd? While we may not have exact numbers, I'd expect it to have identical perf/latency to those rather than the more general movaps and friends.

float insLatency = insLatencyInfos[ins];

// todo-xarch-half: hacking an exit on the unhandled ins to make prototyping easier
if (ins == INS_vcvtss2sh || ins == INS_vcvtsh2ss || ins == INS_vaddsh || ins == INS_vsubsh ||
Member

I think we want to put most of these with the v*ss and v*sd equivalents prior to merging this PR.

Contributor Author

Yes, and for the above, I will get the proper numbers before putting the PR in non-draft.

Comment on lines 23081 to 23082
// todo-half: this is only to create zero constant half nodes for use in intrinsics, anything
// else will not work
Member

Not sure I understand this comment.

Presumably we just need a FloatingPointUtils::convertDoubleToHalf(...) method which returns a float16_t type (these were added in C++23, which is newer than our baseline, so we'd just typedef uint16_t float16_t; for the time being).

We then set vecCon->gtSimdVal.f16[i] = cnsVal.

{
    if (arg->IsCnsFltOrDbl())
    {
        simdVal.f16[argIdx] = static_cast<uint16_t>(arg->AsDblCon()->DconValue());
Member

This looks incorrect as it does a double->uint16_t cast, when we rather need double->float16_t
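
The difference between the two conversions can be seen in a small software sketch. It is not the conversion routine from this PR: `doubleToHalfBitsSketch` and `shiftRightRoundToEven` are hypothetical names, the `float16_t` typedef mirrors the one proposed in the review, and round-to-nearest-even is an assumption about the desired rounding mode.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

typedef uint16_t float16_t; // bit container only, as proposed in the review

// Shift right by `shift` bits (1..63), rounding to nearest with ties to even.
static uint64_t shiftRightRoundToEven(uint64_t value, int shift)
{
    uint64_t result    = value >> shift;
    uint64_t remainder = value & ((1ull << shift) - 1);
    uint64_t halfway   = 1ull << (shift - 1);
    if ((remainder > halfway) || ((remainder == halfway) && ((result & 1) != 0)))
    {
        result++;
    }
    return result;
}

// Converts a double to IEEE 754 binary16 bits (hypothetical helper).
float16_t doubleToHalfBitsSketch(double value)
{
    uint64_t bits;
    std::memcpy(&bits, &value, sizeof(bits));

    const uint32_t sign     = static_cast<uint32_t>((bits >> 48) & 0x8000);
    const int      exponent = static_cast<int>((bits >> 52) & 0x7FF);
    const uint64_t mantissa = bits & 0x000FFFFFFFFFFFFFull;

    if (exponent == 0x7FF)
    {
        // Infinity or NaN; a NaN is returned as a quiet NaN.
        return static_cast<float16_t>(sign | 0x7C00u | ((mantissa != 0) ? 0x0200u : 0u));
    }

    const int halfExponent = exponent - 1023 + 15; // rebias from double to half

    if (halfExponent >= 0x1F)
    {
        return static_cast<float16_t>(sign | 0x7C00u); // overflow to infinity
    }

    if (halfExponent <= 0)
    {
        if (halfExponent < -10)
        {
            return static_cast<float16_t>(sign); // below half the smallest subnormal
        }
        // Subnormal result: shift the full 53-bit significand into place.
        const uint64_t significand = mantissa | (1ull << 52);
        return static_cast<float16_t>(sign | shiftRightRoundToEven(significand, 43 - halfExponent));
    }

    // Normal result: a mantissa carry from rounding flows into the exponent field.
    const uint64_t rounded = (static_cast<uint64_t>(halfExponent) << 10) + shiftRightRoundToEven(mantissa, 42);
    return static_cast<float16_t>(sign | rounded);
}
```

For the constant 1.0, `static_cast<uint16_t>(1.0)` produces the integer 1, which read as half-precision bits is the smallest subnormal, while `doubleToHalfBitsSketch(1.0)` produces 0x3C00, the half-precision encoding of 1.0.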

}
}
- else if (node->TypeIs(TYP_VOID))
+ else if (node->TypeIs(TYP_VOID) || node->TypeIs(TYP_INT))
Member

What's the reason for this change?

Contributor Author

Think it was also a bug, I have removed.

if (sizeBytes < getMinVectorByteLength())
{
*pSimdBaseJitType = simdBaseType;
// The struct itself is accelerated, in this case, it is `Half`.
Member

Add an assert(sizeBytes == 2) in case we add other sizes in the future?

break;
}

case NI_System_Half_op_Increment:
Member

Some of these, like Increment/Decrement, could be merged as well using lookupHalfIntrinsic

Comment on lines 2224 to 2225
if (srcSize == 2)
return INS_vmovsh;
Member

General convention is to have braces, particularly if it is part of an if/else chain:

Suggested change
- if (srcSize == 2)
-     return INS_vmovsh;
+ if (srcSize == 2)
+ {
+     return INS_vmovsh;
+ }

Comment on lines 9781 to 9784
// if (node->TypeGet() == TYP_HALF)
//{
// return false;
// }
Member

dead code?

Comment on lines 4387 to 4393
case TYP_HALF:
#ifdef TARGET_X86
useCandidates = RBM_FLOATRET;
#else
useCandidates = RBM_FLOATRET.GetFloatRegSet();
#endif
break;
Member

This looks to be identical to the TYP_FLOAT path and can be collapsed to share it:

Suggested change
- case TYP_HALF:
- #ifdef TARGET_X86
-     useCandidates = RBM_FLOATRET;
- #else
-     useCandidates = RBM_FLOATRET.GetFloatRegSet();
- #endif
-     break;
+ case TYP_HALF:
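
The suggestion relies on a bare case label sharing the body of the case that follows it. A minimal sketch of that shape, with hypothetical enum and function names that are not the JIT's own, and assuming the float case immediately follows the half case:

```cpp
#include <cassert>

// Hypothetical stand-ins for the JIT's type and register-mask names.
enum VarTypeSketch { SKETCH_INT, SKETCH_HALF, SKETCH_FLOAT, SKETCH_DOUBLE };
enum RetRegsSketch { RET_INT_REGS, RET_FLOAT_REGS, RET_DOUBLE_REGS };

RetRegsSketch returnCandidatesSketch(VarTypeSketch type)
{
    switch (type)
    {
        case SKETCH_HALF: // no body of its own: shares the float path below
        case SKETCH_FLOAT:
            return RET_FLOAT_REGS;

        case SKETCH_DOUBLE:
            return RET_DOUBLE_REGS;

        default:
            return RET_INT_REGS;
    }
}
```

Because the half label has no statements, half and float cannot drift apart if the float path changes later.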

Comment on lines 4402 to 4407
// We ONLY want the valid double register in the RBM_DOUBLERET mask.
#ifdef TARGET_AMD64
useCandidates = (RBM_DOUBLERET & RBM_ALLDOUBLE).GetFloatRegSet();
#else
useCandidates = (RBM_DOUBLERET & RBM_ALLDOUBLE).GetFloatRegSet();
#endif // TARGET_AMD64
Member

not related to this PR, but these two paths are the same

@anthonycanino
Contributor Author

@tannergooding I've made a number of changes for the PR.

I think I will go ahead and add the F16C conversions, and then bottle this up as one PR. That should cover most of the initial acceleration for Half, which would address #123017 and #123018

@anthonycanino
Contributor Author

I was incorrect about F16C: it looks like it is for vectorized fp16 conversions.

I think we are converging on a first PR now. I am looking into whether any remaining operations can be covered with the FP16 ISA instructions.

@tannergooding
Member

> I was incorrect about F16C: it looks like it is for vectorized fp16 conversions.

It can still be used to accelerate a lot of functionality for scalars (and is essentially the same support needed where we generate ConvertToHalf(CreateScalarUnsafe(value)).ToScalar())

However, I think it's fine to wait for a subsequent PR to do that work (we do want to do it since that covers all x86-64-v3, i.e. AVX2 capable, hardware).

@dotnet-policy-service
Contributor

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@anthonycanino reopened this on Feb 19, 2026
Copilot AI review requested due to automatic review settings February 19, 2026 20:56
Contributor

Copilot AI left a comment

Pull request overview

Draft work to enable System.Half acceleration on xarch by introducing a dedicated TYP_HALF JIT type and mapping key Half operations/conversions to AVX10v1 FP16 scalar instructions, while updating VM calling-convention plumbing to match the new ABI behavior.

Changes:

  • Mark System.Half and several operators/properties/conversions as [Intrinsic] to enable JIT recognition and expansion.
  • Extend CoreCLR VM + JIT ABI paths so Half can be passed/returned in FP registers on xarch when AVX10v1 is available.
  • Add broad JIT support for TYP_HALF across SIMD/type normalization, codegen/emitter, HW intrinsics tables, and value numbering.

Reviewed changes

Copilot reviewed 44 out of 45 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/libraries/System.Private.CoreLib/src/System/Half.cs | Marks Half members as intrinsics to enable JIT recognition/expansion. |
| src/coreclr/vm/reflectioninvocation.cpp | Adjusts reg type map so reflection invocation passes Half like float on xarch. |
| src/coreclr/vm/methodtable.h | Declares MethodTable::IsNativeHalfType() for xarch ABI checks. |
| src/coreclr/vm/invokeutil.cpp | Ensures Half args are copied/extended appropriately for call dispatch on xarch. |
| src/coreclr/vm/class.cpp | Implements IsNativeHalfType() gated by intrinsic-ness, layout, and AVX10v1. |
| src/coreclr/vm/callingconvention.h | Treats Half as FP-reg passed/returned in arg iterator/return flags on xarch. |
| src/coreclr/vm/callhelpers.cpp | Updates call descriptor reg map generation to treat Half like R4 when applicable. |
| src/coreclr/vm/amd64/profiler.cpp | Updates profiler arg/return handling to treat native Half as FP register data. |
| src/coreclr/jit/vartype.h | Adds helper varTypeIsAccelerated and updates float arg-reg usage for TYP_HALF. |
| src/coreclr/jit/valuenumfuncs.h | Expands xarch HW intrinsic VN macro shape to include a TYP_HALF slot. |
| src/coreclr/jit/valuenum.h | Adds VNForHalfCon and type conversion traits for TYP_HALF. |
| src/coreclr/jit/valuenum.cpp | Implements Half constant VN allocation and extends various VN helpers for TYP_HALF. |
| src/coreclr/jit/utils.h | Adds FloatingPointUtils::convertDoubleToFloat16 declaration. |
| src/coreclr/jit/utils.cpp | Implements software double -> float16 conversion used in vector constant materialization. |
| src/coreclr/jit/typelist.h | Defines TYP_HALF in the core type list with FP-reg classification. |
| src/coreclr/jit/simd.h | Adds float16_t lanes to SIMD value unions; introduces SIZE_UNKNOWN. |
| src/coreclr/jit/simd.cpp | Extends SIMD type recognition to normalize System.Half to TYP_HALF (xarch+AVX10v1). |
| src/coreclr/jit/scopeinfo.cpp | Extends variable location encoding to handle TYP_HALF in stack/register locs. |
| src/coreclr/jit/registeropswasm.cpp | Marks TYP_HALF as invalid for wasm value types. |
| src/coreclr/jit/regalloc.cpp | Allows register allocation candidacy for TYP_HALF. |
| src/coreclr/jit/namedintrinsiclist.h | Adds NI_System_Half_* named intrinsics and expands HW intrinsic macro shape. |
| src/coreclr/jit/morph.cpp | Updates struct/SIMD size checks to include accelerated types; excludes TYP_HALF from struct promotion. |
| src/coreclr/jit/lsraxarch.cpp | Extends LSR handling for new AVX10v1 FP16 FMA scalar intrinsic. |
| src/coreclr/jit/lsrabuild.cpp | Ensures return handling includes TYP_HALF in float return candidates. |
| src/coreclr/jit/lowerxarch.cpp | Adds lowering for AVX10v1 half-compare helpers and updates scalar base-type asserts. |
| src/coreclr/jit/lower.cpp | Treats TYP_HALF similarly to SIMD for some lowering paths; excludes from FP store retyping. |
| src/coreclr/jit/lclvars.cpp | Updates struct promotion helper to use accelerated-type sizing predicate. |
| src/coreclr/jit/instrsxarch.h | Updates instruction metadata/flags for FP16 scalar ops and defines AVX10v1 FMA range markers. |
| src/coreclr/jit/instr.cpp | Selects INS_vmovsh for 2-byte FP-reg load/store/copy (TYP_HALF) on xarch. |
| src/coreclr/jit/importercalls.cpp | Adds importer expansions for System.Half ops/conversions/properties to AVX10v1 scalar intrinsics; adjusts Half arg normalization. |
| src/coreclr/jit/importer.cpp | Extends struct normalization logic to treat intrinsic 2-byte Half as accelerated TYP_HALF. |
| src/coreclr/jit/hwintrinsiccodegenxarch.cpp | Enables AVX10v1 family codegen path and relaxes base-type asserts for TYP_HALF. |
| src/coreclr/jit/hwintrinsic.h | Expands instruction table storage on xarch to include a TYP_HALF instruction slot. |
| src/coreclr/jit/hwintrinsic.cpp | Updates HW intrinsic macro expansion and type-range checks to include TYP_HALF. |
| src/coreclr/jit/gentree.h | Allows TYP_HALF in some floating-constant assertions and adds vector-constant population for half lanes. |
| src/coreclr/jit/gentree.cpp | Extends zero constants, scalar create, to-scalar asserts, and embedded rounding handling to include TYP_HALF. |
| src/coreclr/jit/float16.h | Adds shared float16_t typedef for JIT components without relying on C++23. |
| src/coreclr/jit/emitxarch.cpp | Extends xarch emitter for AVX10v1 ranges, EVEX prefix maps, vmovsh, and perf scoring for FP16 instructions. |
| src/coreclr/jit/emit.h | Adds perf-score throughput constants used by new FP16 perf modeling. |
| src/coreclr/jit/compiler.h | Adds Half intrinsic helper declarations and renames SIMD-size predicate to “accelerated”. |
| src/coreclr/jit/compiler.cpp | Implements isNativeHalfStructType and uses it to map 2-byte structs to TYP_HALF when applicable. |
| src/coreclr/jit/codegenxarch.cpp | Treats TYP_HALF like floating for return registers and stack arg emission in key paths. |
| src/coreclr/jit/codegencommon.cpp | Updates struct-return assertions to allow TYP_HALF special casing. |
| src/coreclr/jit/abi.cpp | Maps 2-byte ABI passing segments to TYP_HALF. |

Both simd.cpp, gentree.cpp, and utils.cpp need a definition of float16_t
but do not share a common header.

Defining here so as to not create accidental implict include dependencies.
Copilot AI Feb 19, 2026

Typo in comment: "implict" should be "implicit".

Suggested change
- Defining here so as to not create accidental implict include dependencies.
+ Defining here so as to not create accidental implicit include dependencies.

Comment on lines +4445 to +4452
case NI_System_Half_FusedMultiplyAdd:
{
#if defined(TARGET_XARCH)
    if (compOpportunisticallyDependsOn(InstructionSet_AVX10v1))
    {
        // We are constructing a chain of intrinsics similar to:
        //    return FMA.MultiplyAddScalar(
        //        Vector128.CreateScalarUnsafe(x),
Copilot AI Feb 19, 2026

New JIT intrinsic expansion for System.Half is introduced here (AVX10v1-based lowering), but the PR doesn't add corresponding JIT/HardwareIntrinsics tests. Please add targeted tests (correctness + codegen) under the existing AVX10v1 HW-intrinsics test projects so regressions/call-conv mismatches are caught.

using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
Copilot AI Feb 19, 2026

using System.Runtime.Intrinsics; appears unused in this file (the [Intrinsic] attribute comes from System.Runtime.CompilerServices). Unused using directives typically trigger CS8019, which may fail the build if warnings are treated as errors.

Consider removing this using unless another Intrinsics type is going to be referenced from Half.cs.

if (sigType == TYP_STRUCT)
{
    var_types normSigType = impNormStructType(classHnd);
    sigType == (normSigType == TYP_HALF) ? TYP_HALF : sigType;
Copilot AI Feb 19, 2026

sigType is not updated here because the ternary expression uses == instead of assignment. This means Half arguments will still be treated as TYP_STRUCT, which can cause incorrect implicit-coercion decisions and GDV incompatibility checks.

Change this to an assignment (and keep the conditional expression) so the signature type is actually normalized to TYP_HALF when appropriate.

Suggested change
- sigType == (normSigType == TYP_HALF) ? TYP_HALF : sigType;
+ sigType = (normSigType == TYP_HALF) ? TYP_HALF : sigType;
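
Why the `==` form is a silent no-op can be shown in isolation. The enum and function names below are hypothetical stand-ins rather than the JIT's `var_types`, and the cast to void is added only to keep the sketch free of unused-value warnings; the flagged line had no such cast.

```cpp
#include <cassert>

// Hypothetical stand-in for the JIT's var_types enum.
enum VarTypeSketch { SK_STRUCT, SK_HALF, SK_INT };

// Mirrors the flagged statement: it parses as
//   (sigType == (normSigType == SK_HALF)) ? SK_HALF : sigType;
// which computes a value and throws it away, so sigType never changes.
VarTypeSketch normalizeWithComparison(VarTypeSketch sigType, VarTypeSketch normSigType)
{
    if (sigType == SK_STRUCT)
    {
        static_cast<void>(sigType == (normSigType == SK_HALF) ? SK_HALF : sigType);
    }
    return sigType;
}

// Mirrors the suggested change: the conditional result is assigned back.
VarTypeSketch normalizeWithAssignment(VarTypeSketch sigType, VarTypeSketch normSigType)
{
    if (sigType == SK_STRUCT)
    {
        sigType = (normSigType == SK_HALF) ? SK_HALF : sigType;
    }
    return sigType;
}
```

Only the assignment form turns a struct signature type into the half type when normalization reports a half.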

Comment on lines +1868 to +1872
// todo-half: we need to make this work properly
ValueNum ValueNumStore::VNForHalfCon(float cnsVal)
{
    return VnForConst(cnsVal, GetFloatCnsMap(), TYP_HALF);
}
Copilot AI Feb 19, 2026

VNForHalfCon currently uses GetFloatCnsMap() as its lookup table. This will cause Half constants and float constants with the same bit pattern to share the same ValueNum, which breaks the invariant that a VN’s constant storage type matches its var_types (e.g., you can end up with a VN allocated in a TYP_FLOAT chunk but later used as TYP_HALF).

Introduce a dedicated Half constant map (keyed by the 16-bit Half payload or a distinct key type) so Half constants cannot collide with float constants.
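
A toy model of the collision and of the proposed fix is given below. It is a sketch only: `ValueNumStoreSketch` and its members are hypothetical, `std::map` stands in for whatever map type the real value-number store uses, and keying half constants by their 16-bit payload is one of the two options the comment mentions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <map>

using ValueNumSketch = uint32_t;

class ValueNumStoreSketch
{
    ValueNumSketch                     m_next = 1;
    std::map<uint32_t, ValueNumSketch> m_floatCnsMap; // keyed by float bit pattern
    std::map<uint16_t, ValueNumSketch> m_halfCnsMap;  // keyed by 16-bit half payload

    // Returns the existing value number for `key`, or allocates a fresh one.
    template <typename TKey>
    ValueNumSketch lookupOrAdd(std::map<TKey, ValueNumSketch>& map, TKey key)
    {
        auto it = map.find(key);
        if (it != map.end())
        {
            return it->second;
        }
        ValueNumSketch vn = m_next++;
        map[key]          = vn;
        return vn;
    }

    static uint32_t floatBits(float value)
    {
        uint32_t bits;
        std::memcpy(&bits, &value, sizeof(bits));
        return bits;
    }

public:
    ValueNumSketch VNForFloatCon(float value)
    {
        return lookupOrAdd(m_floatCnsMap, floatBits(value));
    }

    // Shape of the draft: half constants go through the float map.
    ValueNumSketch VNForHalfConSharedMap(float value)
    {
        return lookupOrAdd(m_floatCnsMap, floatBits(value));
    }

    // Shape of the suggestion: half constants get their own map.
    ValueNumSketch VNForHalfConOwnMap(uint16_t halfBits)
    {
        return lookupOrAdd(m_halfCnsMap, halfBits);
    }
};
```

In this model a half constant looked up through the shared map comes back with the value number already allocated for the float constant 1.0f, while the dedicated map always hands out a value number that no float constant owns.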

Comment on lines +2278 to +2280
static constexpr uint64_t HALF_POSITIVE_INFINITY_BITS = 0x7C00;
static constexpr uint64_t HALF_NEGATIVE_INFINITY_BITS = 0xFC00;

Copilot AI Feb 19, 2026

HALF_POSITIVE_INFINITY_BITS/HALF_NEGATIVE_INFINITY_BITS are declared as uint64_t but are returned from helpers that return float16_t (uint16_t). This introduces implicit narrowing conversions that are likely to trigger warnings (and may fail the build under /WX).

Consider making these constants uint16_t (or explicitly casting at the return sites) so the return type matches without narrowing.
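
A minimal sketch of the first option, with the constants declared at the width of the type they are returned as. The helper `halfInfinityBitsSketch` is hypothetical, and `float16_t` is assumed to be the `uint16_t` typedef described earlier in the review.

```cpp
#include <cassert>
#include <cstdint>

typedef uint16_t float16_t; // assumed to match the JIT's bit-container typedef

// Same bit patterns as in the diff, but declared as 16-bit values so that
// returning them as float16_t involves no narrowing conversion.
static constexpr float16_t HALF_POSITIVE_INFINITY_BITS = 0x7C00;
static constexpr float16_t HALF_NEGATIVE_INFINITY_BITS = 0xFC00;

static_assert(sizeof(HALF_POSITIVE_INFINITY_BITS) == sizeof(float16_t), "constant width should match float16_t");

// Hypothetical helper returning the infinity of the requested sign.
float16_t halfInfinityBitsSketch(bool negative)
{
    if (negative)
    {
        return HALF_NEGATIVE_INFINITY_BITS;
    }
    return HALF_POSITIVE_INFINITY_BITS;
}
```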


Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), community-contribution (Indicates that the PR has been added by a community member)
