
@jurahul
Contributor

@jurahul jurahul commented Jul 1, 2025

This change attempts to reduce the size of the disassembler code generated by DecoderEmitter.

Current state:

  1. Currently, the code generated by the decoder emitter consists of two key functions: decodeInstruction, which is the entry point into the generated code, and decodeToMCInst, which is invoked when a decode op is reached while traversing the decoder table. Both functions are templated on InsnType, the type of the raw instruction bits that are supplied to decodeInstruction.
  2. Several backends call decodeInstruction with different types, leading to several template instantiations of this function in the final code. As an example, AMDGPU instantiates this function with DecoderUInt128 for decoding 96/128-bit instructions, uint64_t for decoding 64-bit instructions, and uint32_t for decoding 32-bit instructions.
  3. Since there is just one decodeToMCInst generated, it contains code that handles all instruction sizes. The decoders emitted for different instruction sizes rarely have any intersection with each other. That means, in the AMDGPU case, the instantiation with InsnType == DecoderUInt128 carries decoder code for 32/64-bit instructions that is never exercised; conversely, the instantiation with InsnType == uint64_t carries decoder code for 128/96/32-bit instructions that is never exercised. This leads to unnecessary dead code in the generated disassembler binary.

With this change, the DecoderEmitter will stop generating a single templated decodeInstruction and will instead generate several overloaded versions of this function and of the associated decodeToMCInst function. Instead of using the templated InsnType, it will use an auto-inferred type, which can be a standard C++ integer type, APInt, or a std::bitset. As a result, decoders for 32-bit instructions will appear only in the 32-bit variant of decodeToMCInst and 64-bit decoders only in the 64-bit variant, which eliminates the code duplication present in the templated variant.

Additionally, the DecodeIndex will now be computed per instruction bitwidth instead of being computed globally across all bitwidths as before. The values will generally be smaller and hence consume fewer bytes in their ULEB128 encoding in the decoder tables, resulting in a further reduction in the size of the decode tables.

Since this non-templated decoder also needs some changes in the C++ code, an option GenerateTemplatedDecoder was added to InstrInfo. It defaults to false, but targets can set it to true to fall back to the templated code. The goal is to migrate all targets to the non-templated decoder and deprecate this option in the future.

Adopt this feature for the AMDGPU backend. In a release build, this results in a net 35% reduction in the .text size of libLLVMAMDGPUDisassembler.so and a 5% reduction in the .rodata size. Actual numbers, measured locally for a Linux x86_64 build using a clang-18.1.3 toolchain, are:

.text: 378780 -> 244684, i.e., a 35% reduction in size
.rodata: 352648 -> 334960, i.e., a 5% reduction in size

For targets that do not use multiple instantiations of decodeInstruction, opting into this feature may not result in code/data size improvements, but it may yield compile-time improvements by avoiding the use of templated code.

@topperc
Collaborator

topperc commented Jul 1, 2025

What's the motivation?

@jurahul
Contributor Author

jurahul commented Jul 1, 2025

Please see the PR description that I just added. I also have data to support this and will add it soon (tabulating it ATM).

@jurahul
Contributor Author

jurahul commented Jul 1, 2025

The failure is in a unit test (TableGen/VarLenDecoder.td); I need to update it.

@topperc
Collaborator

topperc commented Jul 1, 2025

Why can't the disassembler emitter figure out the bitwidths from the tablegen input? It already makes separate tables for each instruction size.

@topperc
Collaborator

topperc commented Jul 1, 2025

Why can't the disassembler emitter figure out the bitwidths from the tablegen input? It already makes separate tables for each instruction size.

Oh is it because RISC-V uses 3 bit widths, but only 2 types for DecodeInstruction?

@topperc
Collaborator

topperc commented Jul 1, 2025

Can we store the type in the Instruction class in the .td files like the bitwidth instead of introducing a complex command line argument?

@jurahul
Contributor Author

jurahul commented Jul 1, 2025

Right, this is a POC at this point which shows that the proposed optimization works. I am open to changing the interface here as well. The command-line option was simple enough to avoid messing with the TableGen Instruction class etc., but that is an option too, though it feels more intrusive. The command-line option is moderately complex but localized to the decoder emitter.

@jurahul
Contributor Author

jurahul commented Jul 1, 2025

Repeating the type per instruction record might be redundant (and we would need more verification as well, to ensure that for a given size all instructions of that size have the C++ type specified and that it's consistent). One option is to add a new InstructionDecoderTypeAndSize class that records this information, which DecoderEmitter can use if it's present, else fall back to templated code. Something like

class InstructionDecoderTypeAndSize<string CPPType, list<int> Bitwidths> {
}

class InstructionDecoderTypeAndSizes<list<InstructionDecoderTypeAndSize>> {
}

and a particular backend can define a single record of type InstructionDecoderTypeAndSizes<> which the DecoderEmitter will use. This is essentially encoding the command line option as a record.

// RISCV.td
// Opt-in to non-templated decoder code.
def : InstructionDecoderTypeAndSizes<[
                InstructionDecoderTypeAndSize<"uint64_t", [48]>,
                InstructionDecoderTypeAndSize<"uint32_t", [16,32]>]>;

or more simply

class InstructionDecoderTypeAndSizes<list<string> CPPTypes, list<list<int>> Bitwidths> {
}

def : InstructionDecoderTypeAndSizes<
           [ "uint32_t", uint64_t"],
           [ [16,32],    [64]     ]>;

@jurahul jurahul force-pushed the decoder_emitter_type_specialization branch from 30d0838 to 2d7d1dc on July 1, 2025 22:19
@topperc
Collaborator

topperc commented Jul 1, 2025

Repeating the type per-instruction record might be redundant (and we would need more verification as well to verify for a given size, all insts of that size have the C++ type specified and its consistent). One option is to add a new InstructionTypeAndSize class that records this information, and DecoderEmitter can use that if its present else fall back to templated code. Something like

class InstructionDecoderTypeAndSize<string CPPType, list<int> Bitwidths> {
}

class InstructionDecoderTypeAndSizes<list<InstructionDecoderTypeAndSize>> {
}

and a particular backend can define a single record of type InstructionDecoderTypeAndSizes<> which the DecoderEmitter will use. This is essentially encoding the command line option as a record.

// RISCV.td
// Opt-in to non-templated decoder code.
def : InstructionDecoderTypeAndSizes<[
                InstructionDecoderTypeAndSize<"uint64_t", [48]>,
                InstructionDecoderTypeAndSize<"uint32_t", [16,32]>]>;

or more simply

class InstructionDecoderTypeAndSizes<list<string> CPPTypes, list<list<int>> Bitwidths> {
}

def : InstructionDecoderTypeAndSizes<
           [ "uint32_t", uint64_t"],
           [ [16,32],    [64]     ]>;

RISCV uses a common base class for each of the 3 instruction sizes. Other targets may be similar.

class RVInst<dag outs, dag ins, string opcodestr, string argstr,                 
             list<dag> pattern, InstFormat format>                               
    : RVInstCommon<outs, ins, opcodestr, argstr, pattern, format> {              
  field bits<32> Inst;                                                           
  // SoftFail is a field the disassembler can use to provide a way for           
  // instructions to not match without killing the whole decode process. It is   
  // mainly used for ARM, but Tablegen expects this field to exist or it fails   
  // to build the decode table.                                                  
  field bits<32> SoftFail = 0;                                                   
  let Size = 4;                                                                  
}                                                                                
                                                                                 
class RVInst48<dag outs, dag ins, string opcodestr, string argstr,               
               list<dag> pattern, InstFormat format>                             
    : RVInstCommon<outs, ins, opcodestr, argstr, pattern, format> {              
  field bits<48> Inst;                                                           
  field bits<48> SoftFail = 0;                                                   
  let Size = 6;                                                                  
}                                                                                
                                                                                 
class RVInst64<dag outs, dag ins, string opcodestr, string argstr,               
               list<dag> pattern, InstFormat format>                             
    : RVInstCommon<outs, ins, opcodestr, argstr, pattern, format> {              
  field bits<64> Inst;                                                           
  field bits<64> SoftFail = 0;                                                   
  let Size = 8;                                                                  
}

@jurahul
Contributor Author

jurahul commented Jul 1, 2025

Repeating the type per-instruction record might be redundant (and we would need more verification as well to verify for a given size, all insts of that size have the C++ type specified and its consistent). One option is to add a new InstructionTypeAndSize class that records this information, and DecoderEmitter can use that if its present else fall back to templated code. Something like

class InstructionDecoderTypeAndSize<string CPPType, list<int> Bitwidths> {
}

class InstructionDecoderTypeAndSizes<list<InstructionDecoderTypeAndSize>> {
}

and a particular backend can define a single record of type InstructionDecoderTypeAndSizes<> which the DecoderEmitter will use. This is essentially encoding the command line option as a record.

// RISCV.td
// Opt-in to non-templated decoder code.
def : InstructionDecoderTypeAndSizes<[
                InstructionDecoderTypeAndSize<"uint64_t", [48]>,
                InstructionDecoderTypeAndSize<"uint32_t", [16,32]>]>;

or more simply

class InstructionDecoderTypeAndSizes<list<string> CPPTypes, list<list<int>> Bitwidths> {
}

def : InstructionDecoderTypeAndSizes<
           [ "uint32_t", uint64_t"],
           [ [16,32],    [64]     ]>;

RISCV uses a common base class for each of the 3 instruction sizes. Other targets may be similar.

class RVInst<dag outs, dag ins, string opcodestr, string argstr,                 
             list<dag> pattern, InstFormat format>                               
    : RVInstCommon<outs, ins, opcodestr, argstr, pattern, format> {              
  field bits<32> Inst;                                                           
  // SoftFail is a field the disassembler can use to provide a way for           
  // instructions to not match without killing the whole decode process. It is   
  // mainly used for ARM, but Tablegen expects this field to exist or it fails   
  // to build the decode table.                                                  
  field bits<32> SoftFail = 0;                                                   
  let Size = 4;                                                                  
}                                                                                
                                                                                 
class RVInst48<dag outs, dag ins, string opcodestr, string argstr,               
               list<dag> pattern, InstFormat format>                             
    : RVInstCommon<outs, ins, opcodestr, argstr, pattern, format> {              
  field bits<48> Inst;                                                           
  field bits<48> SoftFail = 0;                                                   
  let Size = 6;                                                                  
}                                                                                
                                                                                 
class RVInst64<dag outs, dag ins, string opcodestr, string argstr,               
               list<dag> pattern, InstFormat format>                             
    : RVInstCommon<outs, ins, opcodestr, argstr, pattern, format> {              
  field bits<64> Inst;                                                           
  field bits<64> SoftFail = 0;                                                   
  let Size = 8;                                                                  
}

Right, but nonetheless we would have the type specified per instruction instance, and we would still need to validate, for example, that for all instructions of a particular size the type string is the same. To me that seems like unnecessary duplication of this information, plus additional verification to make sure it's consistent. Also, unlike the size in bytes, which is a core property of the instruction, the C++ type used to represent its bits in memory does not seem like a core property. Many backends seem to choose the same type (for example uint64_t) for all their 16/32/48/64-bit instructions. Adoption-wise as well, sticking it in the per-instruction record seems more invasive (for example, in our and several other downstream backends the core instruction records are auto-generated, so the adoption curve for this increases further).

@jurahul
Contributor Author

jurahul commented Jul 2, 2025

Requesting not a review per se, but an opinion on the user interface for this optimization. Choices proposed are:

  • Command-line option, as in this PR. +: non-intrusive in terms of .td files. -: needs to be parsed, and parsing can be flaky.
  • Per-instruction record carries the C++ type (@topperc 's suggestion). +: no command-line parsing flakiness. -: (IMO) too invasive; I see difficulty or increased complexity in adoption for auto-generated instruction defs used in several downstream backends, and it needs additional validation for consistency across all instructions of a given size.
  • Option embedded as a new singleton record in the .td file. +: no command-line parsing flakiness; less intrusive than the option above (as the single def is standalone, not attached to anything else); no consistency checks. -: ?

@topperc
Collaborator

topperc commented Jul 2, 2025

Many backends seems to choose the same type (for example uint64_t) for all their 16/32/48/64 bit insts.

I'm probably going to change to uint64_t for RISC-V. The 48-bit instructions are only used by one vendor and are relatively recent additions. I think the duplication cost just wasn't considered when they were added.

I agree adding to the Inst class might be too invasive. I still think it should be in .td files somehow. Needing to change a CMake file and replicating to GN and the other build systems when a new instruction width is added seems bad.

@jurahul
Contributor Author

jurahul commented Jul 2, 2025

Many backends seems to choose the same type (for example uint64_t) for all their 16/32/48/64 bit insts.

I'm probably going to change to uint64_t for RISC-V. The 48-bit instructions are only used by one vendor and are relatively recent additions. I think the duplication cost just wasn't considered when they were added.

I agree adding to the Inst class might be too invasive. I still think it should be in .td files somehow. Needing to change a CMake file and replicating to GN and the other build systems when a new instruction width is added seems bad.

Right, is option #3 above palatable? We essentially encode it as a standalone record that the DecoderEmitter will look for.

@topperc
Collaborator

topperc commented Jul 2, 2025

Many backends seems to choose the same type (for example uint64_t) for all their 16/32/48/64 bit insts.

I'm probably going to change to uint64_t for RISC-V. The 48-bit instructions are only used by one vendor and are relatively recent additions. I think the duplication cost just wasn't considered when they were added.
I agree adding to the Inst class might be too invasive. I still think it should be in .td files somehow. Needing to change a CMake file and replicating to GN and the other build systems when a new instruction width is added seems bad.

Right, is the option #3 above palatable? We essentially encode it as a standalone record that the DecoderEmitter will look for.

Maybe it should be stored in the InstrInfo class like isLittleEndianEncoding or the Target class?

topperc added a commit to topperc/llvm-project that referenced this pull request Jul 2, 2025
…6. NFC

Insn is passed to decodeInstruction which is a template function
based on the type of Insn. By using uint64_t we ensure only one
version of decodeInstruction is created. This reduces the file size
of RISCVDisassembler.cpp.o by ~25% in my local build.

This should get even more size benefit than llvm#146593.
@jurahul
Contributor Author

jurahul commented Jul 2, 2025

InstrInfo seems reasonable. Let me rework the PR to add that.

@jurahul jurahul force-pushed the decoder_emitter_type_specialization branch from 2d7d1dc to 04366ee on July 2, 2025 14:38
@jurahul
Contributor Author

jurahul commented Jul 2, 2025

@topperc Please let me know if this new scheme looks ok. If yes, I'll migrate the rest of the targets (Right now I just changed AARCH64 and RISCV) to use this, and add some unit tests for a final review.

@topperc
Collaborator

topperc commented Jul 2, 2025

@topperc Please let me know if this new scheme looks ok. If yes, I'll migrate the rest of the targets (Right now I just changed AARCH64 and RISCV) to use this, and add some unit tests for a final review.

Does this have any binary size effect on RISCV after #146619?

@jurahul
Contributor Author

jurahul commented Jul 2, 2025

@topperc Please let me know if this new scheme looks ok. If yes, I'll migrate the rest of the targets (Right now I just changed AARCH64 and RISCV) to use this, and add some unit tests for a final review.

Does this have any binary size effect on RISCV after #146619?

I have not tested. My speculation is no binary size change, just a minor compile-time improvement from avoiding template specialization. I'll check and report back.

@jurahul
Contributor Author

jurahul commented Jul 2, 2025

Looks like templating adds a little bit to the code size. Building the RISCVDisassembler.cpp.o in a release config with/without this change results in the following:

Old:
196112 ./build/lib/Target/RISCV/Disassembler/CMakeFiles/LLVMRISCVDisassembler.dir/RISCVDisassembler.cpp.o 
New:
196096 ./build/lib/Target/RISCV/Disassembler/CMakeFiles/LLVMRISCVDisassembler.dir/RISCVDisassembler.cpp.o

So, 16 bytes less. Not significant though.

@topperc
Collaborator

topperc commented Jul 2, 2025

Looks like templating adds a little bit to the code size. Building the RISCVDisassembler.cpp.o in a release config with/without this change results in the following:

Old:
196112 ./build/lib/Target/RISCV/Disassembler/CMakeFiles/LLVMRISCVDisassembler.dir/RISCVDisassembler.cpp.o 
New:
196096 ./build/lib/Target/RISCV/Disassembler/CMakeFiles/LLVMRISCVDisassembler.dir/RISCVDisassembler.cpp.o

So, 16 bytes less. Not significant though.

Could just be a difference in the name mangling of the function name? Or are you checking the .text size?

@jurahul
Contributor Author

jurahul commented Jul 2, 2025

Yeah, your guess was right. I dumped the sizes with size -A and I see:

New:

.text._ZN12_GLOBAL__N_114decodeToMCInstEjN4llvm14MCDisassembler12DecodeStatusEmRNS0_6MCInstEmPKS1_Rb     27023

.text._ZN12_GLOBAL__N_117decodeInstructionEPKhRN4llvm6MCInstEmmPKNS2_14MCDisassemblerERKNS2_15MCSubtargetInfoE        9238 

Old:                                                                                                                                                                                                                                                  

.text._ZN12_GLOBAL__N_114decodeToMCInstImEEN4llvm14MCDisassembler12DecodeStatusEjS3_T_RNS1_6MCInstEmPKS2_Rb                             27023
 .text._ZN12_GLOBAL__N_117decodeInstructionImEEN4llvm14MCDisassembler12DecodeStatusEPKhRNS1_6MCInstET_mPKS2_RKNS1_15MCSubtargetInfoE      9238

That is, the .text sizes are the same, but the mangled names are different, which likely leads to the larger object file size.

@jurahul
Contributor Author

jurahul commented Jul 2, 2025

Note though that what you did for RISCV may not be applicable/desirable for all targets. For example, AMDGPU has 128-bit instructions, so I am assuming that if we just use a 128-bit type for all instructions, we may pay a penalty in bit-extraction costs (32- vs 64-bit may not be as bad).

@jurahul
Contributor Author

jurahul commented Jul 2, 2025

@topperc My question is still unanswered. WDYT of this new interface to opt in to this optimization?

// CHECK-LARGE-NEXT: /* 25 */ MCD::OPC_Fail,

// CHECK-LARGE: if (!Check(S, DecodeInstB(MI, insn, Address, Decoder))) { DecodeComplete = false; return MCDisassembler::Fail; }
// CHECK-LARGE: if (!Check(S, DecodeInstB(MI, Insn, Address, Decoder))) { DecodeComplete = false; return MCDisassembler::Fail; }
Contributor

These changes could have been avoided / made into a separate PR.

// Helper macro to disable inlining of `fieldFromInstruction` for integer types.
#if defined(_MSC_VER) && !defined(__clang__)
__declspec(noinline)
#define DEC_EMIT_NO_INLINE __declspec(noinline)
Contributor

Adding the macro doesn't seem necessary?

@jurahul
Contributor Author

jurahul commented Aug 19, 2025

FYI I'm seeing linker errors with the current PR:

ld.lld: error: undefined symbol: llvm::MCDisassembler::DecodeStatus DecodeDisp<19u>(llvm::MCInst&, unsigned int, unsigned long, llvm::MCDisassembler const*)

They are preceded by compiler warnings:

[2106/2314] Building CXX object lib/Target/Sparc/Disassembler/CMakeFiles/LLVMSparcDisassembler.dir/SparcDisassembler.cpp.o
../llvm/lib/Target/Sparc/Disassembler/SparcDisassembler.cpp:270:31: warning: function 'DecodeDisp<19U>' has internal linkage but is not defined [-Wundefined-internal]
constexpr static DecodeStatus DecodeDisp(MCInst &MI, uint32_t ImmVal,
                              ^
lib/Target/Sparc/SparcGenDisassemblerTables.inc:2053:19: note: used here
    if (!Check(S, DecodeDisp<19>(MI, tmp, Address, Decoder))) { return MCDisassembler::Fail; }
                  ^

I'm using stock Ubuntu-22.04 clang/lld (14.0.0).

(resolved conflicts once again, sorry)

I cannot reproduce it. I am on Ubuntu 24.04.1 LTS and clang 18.1.3.

@s-barannikov
Contributor

Something along the lines of:
https://gcc.godbolt.org/z/v74Gaeb5c

I think we need to move DecodeDisp definition above the #include

@jurahul
Contributor Author

jurahul commented Aug 20, 2025

Ok, I finally have some perf data. I tested 2 configurations: one where there is a single decodeInstructionImpl function that operates on the highest bitwidth and that all the entry points call (less code size), and a second where it's specialized for each bitwidth (more code size), and ran them against the AMDGPU and RISCV mc-disassembler tests (~250). Note that in either case the final call to decodeToMCInst still uses the actual bitwidth, so the difference is that the OPC_ExtractField and OPC_CheckField ops in the decoder tables operate at the highest bitwidth vs. being specialized for the actual bitwidth.

Ignoring any outliers, I see up to a 10% perf regression if we always use the highest bitwidth. Here's an example:

 for i in {0..10}; do  ./build/bin/llvm-mc -triple=amdgcn -mcpu=gfx950 -disassemble -show-encoding --runs=3000 --time-trace --time-trace-file=/tmp/xyz -o /dev/null  < /extra_drive/upstream-llvm/llvm-project/llvm/test/MC/Disassembler/AMDGPU/gfx950_dasm_xdlops.txt && python3 -m json.tool /tmp/xyz | grep -A 3 "Total getInstruction" | grep "avg us"; done
                "avg us": 527
                "avg us": 516
                "avg us": 520
                "avg us": 517
                "avg us": 521
                "avg us": 525
                "avg us": 518
                "avg us": 508
                "avg us": 509
                "avg us": 509
                "avg us": 509

vs

for i in {0..10}; do  ./build/bin/llvm-mc -triple=amdgcn -mcpu=gfx950 -disassemble -show-encoding --runs=3000 --time-trace --time-trace-file=/tmp/xyz -o /dev/null  < /extra_drive/upstream-llvm/llvm-project/llvm/test/MC/Disassembler/AMDGPU/gfx950_dasm_xdlops.txt && python3 -m json.tool /tmp/xyz | grep -A 3 "Total getInstruction" | grep "avg us"; done
                "avg us": 479
                "avg us": 481
                "avg us": 478
                "avg us": 479
                "avg us": 479
                "avg us": 478
                "avg us": 476
                "avg us": 470
                "avg us": 456
                "avg us": 470
                "avg us": 464

This is by changing the TimeProfiler to report microseconds instead of milliseconds. Based on this data, I am concluding that we should prefer decode speed over code size and go with a specialized decodeInstruction per bitwidth. @topperc and @s-barannikov, please let me know if this makes sense. I can then work on addressing outstanding comments and resurrecting this for another round of reviews.

@jurahul
Contributor Author

jurahul commented Aug 20, 2025

MC decoder impl vs no-impl perf data - Sheet2.pdf

Attaching the data as a PDF (not sure how to share it directly). Basically, ignoring the first and last few entries as outliers, most other entries are a positive number (a regression).

@topperc
Collaborator

topperc commented Aug 20, 2025

one where there is a single decodeInstructionImpl function that operates on the highest bitwidth

Does that mean a std::bitset is used on AMDGPU even for things that fit in uint64_t or uint32_t?

@jurahul
Contributor Author

jurahul commented Aug 20, 2025

one where there is a single decodeInstructionImpl function that operates on the highest bitwidth

Does that mean a std::bitset is used on AMDGPU even for things that fit in uint64_t or uint32_t?

Right, if we generate a single "impl" function for decodeInstruction, it always operates on the highest bitwidth, so bitset<128> for AMDGPU. Just to refresh our memory, this was proposed as a way to control the code-size increase we saw for RISCV. That's because today, for RISCV, we instantiate the code once using uint64_t only, whereas these changes could create more than one instance of decodeInstruction (without the impl function), which resulted in a code-size increase. Using a single impl function that operates on the highest bitwidth was a potential way to control that increase, but the data above suggests it comes at a runtime cost.

@jurahul
Contributor Author

jurahul commented Aug 20, 2025

Here's what the code looks like with a single impl function:

static DecodeStatus decodeInstructionImpl(
    const uint8_t DecodeTable[], MCInst &MI, const std::bitset<128> &Insn,
    uint64_t Address, const MCDisassembler *DisAsm, const MCSubtargetInfo &STI,
    function_ref<DecodeStatus(unsigned, DecodeStatus, MCInst &, uint64_t,
                              const MCDisassembler *, bool &)>
        decodeToMCInstPtr) {
  const FeatureBitset &Bits = STI.getFeatureBits();

  const uint8_t *Ptr = DecodeTable;
  uint64_t CurFieldValue = 0;
  DecodeStatus S = MCDisassembler::Success;
  while (true) {
    ptrdiff_t Loc = Ptr - DecodeTable;
    const uint8_t DecoderOp = *Ptr++;
    switch (DecoderOp) {
    default:
      errs() << Loc << ": Unexpected decode table opcode: "
             << (int)DecoderOp << '\n';
      return MCDisassembler::Fail;
    case MCD::OPC_ExtractField: {

...
    case MCD::OPC_Decode: {
      // Decode the Opcode value.
      unsigned Opc = decodeULEB128AndIncUnsafe(Ptr);
      unsigned DecodeIdx = decodeULEB128AndIncUnsafe(Ptr);

      MI.clear();
      MI.setOpcode(Opc);
      bool DecodeComplete;
      S = decodeToMCInstPtr(DecodeIdx, S, MI, Address, DisAsm, DecodeComplete);

...
}

static DecodeStatus decodeInstruction(const uint8_t DecodeTable[], MCInst &MI,
    uint32_t Insn, uint64_t Address, const MCDisassembler *DisAsm,
    const MCSubtargetInfo &STI) {
  std::bitset<128> InsnMaxWidth = Insn;
  auto DecodeToMCInst = [Insn](unsigned DecodeIdx, DecodeStatus S, MCInst &MI,
                                 uint64_t Address, const MCDisassembler *DisAsm,
                                 bool &DecodeComplete) {
    return decodeToMCInst32(DecodeIdx, S, Insn, MI, Address, DisAsm, DecodeComplete);
  };
  return decodeInstructionImpl(DecodeTable, MI, InsnMaxWidth, Address, DisAsm, STI, DecodeToMCInst);
}

static DecodeStatus decodeInstruction(const uint8_t DecodeTable[], MCInst &MI,
    uint64_t Insn, uint64_t Address, const MCDisassembler *DisAsm,
    const MCSubtargetInfo &STI) {
  std::bitset<128> InsnMaxWidth = Insn;
  auto DecodeToMCInst = [Insn](unsigned DecodeIdx, DecodeStatus S, MCInst &MI,
                                 uint64_t Address, const MCDisassembler *DisAsm,
                                 bool &DecodeComplete) {
    return decodeToMCInst64(DecodeIdx, S, Insn, MI, Address, DisAsm, DecodeComplete);
  };
  return decodeInstructionImpl(DecodeTable, MI, InsnMaxWidth, Address, DisAsm, STI, DecodeToMCInst);
}

static DecodeStatus decodeInstruction(const uint8_t DecodeTable[], MCInst &MI,
    const std::bitset<96> &Insn, uint64_t Address, const MCDisassembler *DisAsm,
    const MCSubtargetInfo &STI) {
  const std::bitset<96> Mask(maskTrailingOnes<uint64_t>(64));
  std::bitset<128> InsnMaxWidth((Insn & Mask).to_ullong());
  InsnMaxWidth |= std::bitset<128>(((Insn >> 64) & Mask).to_ullong()) << 64;

  auto DecodeToMCInst = [&Insn](unsigned DecodeIdx, DecodeStatus S, MCInst &MI,
                                 uint64_t Address, const MCDisassembler *DisAsm,
                                 bool &DecodeComplete) {
    return decodeToMCInst96(DecodeIdx, S, Insn, MI, Address, DisAsm, DecodeComplete);
  };
  return decodeInstructionImpl(DecodeTable, MI, InsnMaxWidth, Address, DisAsm, STI, DecodeToMCInst);
}

static DecodeStatus decodeInstruction(const uint8_t DecodeTable[], MCInst &MI,
    const std::bitset<128> &Insn, uint64_t Address, const MCDisassembler *DisAsm,
    const MCSubtargetInfo &STI) {
  std::bitset<128> InsnMaxWidth = Insn;
  auto DecodeToMCInst = [&Insn](unsigned DecodeIdx, DecodeStatus S, MCInst &MI,
                                 uint64_t Address, const MCDisassembler *DisAsm,
                                 bool &DecodeComplete) {
    return decodeToMCInst128(DecodeIdx, S, Insn, MI, Address, DisAsm, DecodeComplete);
  };
  return decodeInstructionImpl(DecodeTable, MI, InsnMaxWidth, Address, DisAsm, STI, DecodeToMCInst);
}

So each bitwidth-specific decodeInstruction up-converts the instruction bits to the max width and then calls decodeInstructionImpl. In the final decode step (the handling of OPC_Decode), we call the passed-in lambda, so that code still operates on the native, smaller instruction width. This helps with code size. Without the impl function, the while (true) loop that traverses the decoder table would be inlined into each decodeInstruction overload, increasing code size, but OPC_CheckField and OPC_ExtractField would operate on the smaller actual instruction bitwidth, leading to faster decode.

@s-barannikov
Contributor

Am I right that this is getInstruction execution time, not llvm-mc execution time? Maybe it makes more sense to measure llvm-mc execution time? I think getInstruction has less than 1% impact on llvm-mc execution time (due to I/O, string formatting, etc.).

I don't see many RISCV tests in the table. What is the avg difference there?

Could the difference be related not to the use of std::bitset, but to the use of lambdas?

@jurahul
Contributor Author

jurahul commented Aug 20, 2025

Yes, this is measuring getInstruction time and not the other stuff. If we include llvm-mc execution time, I am not sure we will get any meaningful data, as the cost of I/O and string formatting will dominate the execution time. For RISCV there are only a handful of tests in the check-llvm-mc-disassembler-riscv target, so those are the ones that I included in the runs.

I am not sure if we can implement this without the lambda (which, I think, will be executed exactly once per getInstruction call, as the final decode step).

@jurahul
Contributor Author

jurahul commented Aug 20, 2025

One option is to have backends choose between the two versions (i.e., support 3 modes in total: the current templated one, one with a decode impl function for less code size, and one with a per-bitwidth specialized decodeInstruction), but I feel that's too many options and too much complexity and maintenance over time. I think we want to drill down to just one form that works well for most backends and eventually deprecate the templated support.

@s-barannikov
Contributor

I am not sure we will get any meaningful data as the cost of I/O, string formatting will dominate the execution time.

Yes, but that's the time the user perceives. If the overall execution time of llvm-mc doesn't change (or changes on the order of microseconds), then what's the difference?

which will be execute exactly once per getInstruction call I think as the final deocde step

Ah, right...

@jurahul
Contributor Author

jurahul commented Aug 20, 2025

I mean, it depends on what you do with the disassembled instruction. In llvm-mc we print it, but maybe some other piece of code does something else (say, a binary analysis tool that uses the disassembler just to build an MCInst representation of the binary and then does something with it). But yeah, I don't know how critical the execution time for this is in a larger context. If we are ok with a small regression in it for smaller code size, we go with the impl version; else we stick to the per-bitwidth specialization all the way (note decodeToMCInst will already be specialized) and eat the extra code size cost.

@jurahul
Contributor Author

jurahul commented Aug 20, 2025

Also, because getInstruction is so small, the measurement is noisy (in the spreadsheet, I have run0-4 numbers and then the average, and those run0-4 numbers sometimes have large variation), but the command-line dumps above do show a definite slowdown in that function with the Impl version.

@s-barannikov
Contributor

s-barannikov commented Aug 20, 2025

One option to have backends choose between the two versions (i.e., in total support 3 modes currently, templated, with a decode impl function for less code size, and with a per-bitwidth specialized decodeInstruction) but I feel that too many options and too much complexity and maintenance over time. I think we want to drill down to just one form that works well for most backends and deprecate the templated support eventually

(Just thoughts)
After spending some time with the disassembler backend, I think in the end it might make sense to emit one table and one function that works like the one for variable-length encodings, processing all possible sizes for the namespace+hwmode combination. That is, it consumes data as it goes. That would relieve us from having to maintain multiple tables, one for each size; the decoder function would determine the size itself based on the encoding in the low bits of an instruction. The Size field could then just be ignored for the purposes of disassembly (or used in an assert that the decoder consumed the right number of bytes).

@jurahul
Contributor Author

jurahul commented Aug 20, 2025

Hmm.. I haven't really looked closely into how the var-len decoder works, so no comments; I might need to investigate more. Another thought: the origin of this change was code duplication in decodeToMCInst when the decoder is instantiated multiple times. How about a smaller-scoped fix, if possible? Still generate templated code, but:

  1. Add a condition that the type has to be one of the CPPTypes that the decoder emitter decides. That is, for AMDGPU, we will pre-determine that the InsnType has to be one of uint32_t, uint64_t, bitset<96>, and bitset<128>, and just add a static_assert() in the generated decodeInstruction() to check that.
  2. For each decoder, track which bitwidths it is valid for, and then in the generated (templated) decodeToMCInst, add appropriate compile-time guards so that for a given instantiation, the dead cases are stripped out. It could be something like:
if (size_in_bits == 32 && Idx not in <list of decoder indexes for 32-bit insts>)
  llvm_unreachable("");

switch (Idx) {
case X:
  ...
}

Assuming that 'X' is not listed in the <list of decoder indexes for 32-bit insts>, the compiler can dead-code-eliminate case X as unreachable in the switch. Something like that.

It seems this will still achieve similar results in terms of eliminating cases that are dead, while not being as disruptive/intrusive as the current change. And then we can potentially explore the new direction that @s-barannikov suggested.
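Idea (1) above could look something like the following sketch. The type list (AMDGPU's set) and the function shape are illustrative, not the actual generated signature, and the body is a placeholder:

```cpp
#include <bitset>
#include <cstdint>
#include <type_traits>

// Illustrative sketch: constrain the templated entry point to the exact set
// of instruction types the emitter pre-determined for a target.
template <typename InsnType>
unsigned decodeInstruction(const InsnType &Insn) {
  static_assert(std::is_same_v<InsnType, uint32_t> ||
                    std::is_same_v<InsnType, uint64_t> ||
                    std::is_same_v<InsnType, std::bitset<96>> ||
                    std::is_same_v<InsnType, std::bitset<128>>,
                "InsnType must be one of the types chosen by the emitter");
  (void)Insn;
  // Placeholder body: returns the storage width in bits (not meaningful for
  // bitset, whose sizeof is implementation-defined).
  return sizeof(InsnType) * 8;
}
```

Instantiating it with any other type (say, uint16_t) fails to compile with the static_assert message, which surfaces misuse at build time instead of silently instantiating dead decoder code.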

@s-barannikov
Contributor

I'm all for the simplest code.

if (size_in_bits == 32 && Idx != <list of decoder indexes for 32-bit insts>)
llvm_unreachable("");

There might be an Idx that is valid for several sizes (since decoders are shared).
Also, the list can be quite large, can't it? It would make it more difficult to debug the generated code.

@s-barannikov
Contributor

s-barannikov commented Aug 20, 2025

decoders are shared

This fact might ruin the whole idea...

Assuming we have a solution for this (like clearing TableInfo.Decoders), how about emitting several decodeToMCInst and choosing the right one at call site via constexpr if? Can this work?

@jurahul
Contributor Author

jurahul commented Aug 21, 2025

Ok, I am seeing some success with a variation of the idea I had before, and implementation-wise it seems much simpler than this PR. Basically, we add C++ code to "cull" cases in the generated decodeToMCInst based on the bitwidth. So, say, if a particular decoder is used only for 32-bit insts, we modify the generated case to use:

case 300:
  if constexpr (InsnBitWidth != 32)
    llvm_unreachable("Decoder not valid for this instruction bitwidth");

So, if the decoder is instantiated multiple times with different InsnType (the root cause of this issue), specific instantiations will have this code deleted without us having to actually generate different versions of the code. This is essentially template-instantiation-time culling of decoders that are not active/relevant for the current bitwidth. Some data is as follows:

|                | Trunk  | DecoderCulling OFF | DecoderCulling ON | % reduction/increase |
|----------------|--------|--------------------|-------------------|----------------------|
| AMDGPU .text   | 611351 | 820791             | 276887            | 55% decrease         |
| AMDGPU .rodata | 373928 | 380744             | 365592            | 3% decrease          |
| RISCV .text    | 47279  | 98023              | 49091             | 3% increase          |
| RISCV .rodata  | 35850  | 38618              | 37306             | 4% increase          |

POC PR is here: #154775

Note that "DecoderCulling OFF" just means the above PR but with CullDecoders set to false. This shows the increase in code size, as we now need to instantiate per instruction-bitwidth type and actually use different types for different bitwidths, resulting in additional instantiations. But that's just an intermediate data point.
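Standalone, the culling trick reduces to something like the sketch below. The indexes (300/301) and bitwidths are made up for illustration, and llvm_unreachable is stubbed with the compiler builtin it expands to:

```cpp
#include <cstdint>

// Sketch: one templated decodeToMCInst, where each case is guarded by the
// bitwidths its decoder is valid for. In instantiations for other bitwidths,
// `if constexpr` makes the case body trivially unreachable, so the compiler
// can drop it entirely.
template <unsigned InsnBitWidth>
unsigned decodeToMCInst(unsigned Idx) {
  switch (Idx) {
  case 300: // decoder used only by 32-bit instructions
    if constexpr (InsnBitWidth != 32)
      __builtin_unreachable(); // llvm_unreachable in the generated code
    return 32;
  case 301: // decoder shared by 32-bit and 64-bit instructions
    if constexpr (InsnBitWidth != 32 && InsnBitWidth != 64)
      __builtin_unreachable();
    return InsnBitWidth;
  default:
    return 0;
  }
}
```

Because decoders can be shared across bitwidths, the guard has to list every width a decoder is valid for (as in case 301), rather than assuming one width per decoder.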

@jurahul
Contributor Author

jurahul commented Aug 21, 2025

@s-barannikov and @topperc, I am proposing moving in this direction if you guys agree, so as to not make the decoder emitter code too complex. We are still not doing the per-bitwidth decode index assignment (which we found gives us ~5% savings in the space for the decoder tables), but we can revisit that and see if there is a way to fit it in here. That PR also implements @s-barannikov's suggestion to move some of the fixed generated code to a common header; I'll actually do that as an NFC PR first.

@s-barannikov
Contributor

s-barannikov commented Aug 21, 2025

One option to have backends choose between the two versions (i.e., in total support 3 modes currently, templated, with a decode impl function for less code size, and with a per-bitwidth specialized decodeInstruction) but I feel that too many options and too much complexity and maintenance over time. I think we want to drill down to just one form that works well for most backends and deprecate the templated support eventually

(Just thoughts) After spending some time with the disassembler backend, I think in the end it might make sense to emit one table and one function that works like the one for variable-length encodings, processing all possible sizes for the namespace+hwmode combination. That is, consumes data as we go. That would relieve us from having to have multiple tables for each size, the decoder function would determine the size itself based on the encoding in the low bits of an instruction. The Size field then could just be ignored for the purposes of disassembly (or used in an assert that the decoder consumed the right number of bytes).

I gave it a try. It works for some backends (e.g. RISCV), but not for others. E.g., on AVR, instructions are encoded in a PDP-endian way for some reason, and merging the tables for 16-bit and 32-bit instructions leads to conflicts. Conflicts also start appearing on at least Mips and AMDGPU, but I didn't investigate why. The single table is a few bytes larger than the sum of the separate tables, so there are no space savings anywhere (decoders are already shared). The only benefit is that it allows backends to eliminate the simple logic for selecting the right table, so it is probably not worth it.

@jurahul
Contributor Author

jurahul commented Aug 25, 2025

Converting to draft as I don't think this is going in as is.

@jurahul jurahul closed this Aug 30, 2025
@jurahul jurahul deleted the decoder_emitter_type_specialization branch August 30, 2025 22:36