[MC][ARM] Fix crash when assembling Thumb 'movs r0,#foo'. #115026

statham-arm · 2024-11-05T17:11:34Z

If the assembler sees this instruction, understanding foo to be an external symbol, there's no relocation it can write that will put the whole value of foo into the 8-bit immediate field of the 16-bit Thumb add instruction. So it should report an error message pointing at the source line, and in LLVM 18, it did exactly that. But now the error is not reported, due to an indexing error in the operand list in validateInstruction, and instead the code continues to attempt assembly, ending up aborting at the llvm_unreachable at the end of getHiLoImmOpValue.

In this commit I've fixed the index in the ARM::tMOVi8 case of validateInstruction, and also the one for tADDi8 which must cope with either the 2- or 3-operand form in the input assembly source. But also, while writing the test, I found that if you assemble for Armv7-M instead of Armv6-M, the instruction has opcode t2ADDri when it goes through validateInstruction, and only turns into tMOVi8 later in processInstruction. Then it's too late for validateInstruction to report that error. So I've adjusted processInstruction to spot that case and inhibit the conversion.

If the assembler sees this instruction, understanding 'foo' to be an external symbol, there's no relocation it can write that will put the whole value of 'foo' into the 8-bit immediate field of the 16-bit Thumb add instruction. So it should report an error message pointing at the source line, and in LLVM 18, it did exactly that. But now the error is not reported, due to an indexing error in the operand list in `validateInstruction`, and instead the code continues to attempt assembly, ending up aborting at the `llvm_unreachable` at the end of `getHiLoImmOpValue`. In this commit I've fixed the index in the `ARM::tMOVi8` case of `validateInstruction`, and also the one for `tADDi8` which must cope with either the 2- or 3-operand form in the input assembly source. But also, while writing the test, I found that if you assemble for Armv7-M instead of Armv6-M, the instruction has opcode `t2ADDri` when it goes through `validateInstruction`, and only turns into `tMOVi8` later in `processInstruction`. Then it's too late for `validateInstruction` to report that error. So I've introduced a new `revalidateInstruction` which is run _after_ modifying the instruction, and repeated the same check there.

llvmbot · 2024-11-05T17:12:12Z

@llvm/pr-subscribers-backend-arm

@llvm/pr-subscribers-mc

Author: Simon Tatham (statham-arm)

Changes

If the assembler sees this instruction, understanding foo to be an external symbol, there's no relocation it can write that will put the whole value of foo into the 8-bit immediate field of the 16-bit Thumb add instruction. So it should report an error message pointing at the source line, and in LLVM 18, it did exactly that. But now the error is not reported, due to an indexing error in the operand list in validateInstruction, and instead the code continues to attempt assembly, ending up aborting at the llvm_unreachable at the end of getHiLoImmOpValue.

In this commit I've fixed the index in the ARM::tMOVi8 case of validateInstruction, and also the one for tADDi8 which must cope with either the 2- or 3-operand form in the input assembly source. But also, while writing the test, I found that if you assemble for Armv7-M instead of Armv6-M, the instruction has opcode t2ADDri when it goes through validateInstruction, and only turns into tMOVi8 later in processInstruction. Then it's too late for validateInstruction to report that error. So I've introduced a new revalidateInstruction which is run after modifying the instruction, and repeated the same check there.

Full diff: https://github.com/llvm/llvm-project/pull/115026.diff

2 Files Affected:

(modified) llvm/lib/Target/ARM/AsmParser/ARMAsmParser.cpp (+65-9)
(added) llvm/test/MC/ARM/lower-upper-errors.s (+20)

diff --git a/llvm/lib/Target/ARM/AsmParser/ARMAsmParser.cpp b/llvm/lib/Target/ARM/AsmParser/ARMAsmParser.cpp
index 0df1c336a22146..8e3e27dec76ff1 100644
--- a/llvm/lib/Target/ARM/AsmParser/ARMAsmParser.cpp
+++ b/llvm/lib/Target/ARM/AsmParser/ARMAsmParser.cpp
@@ -659,6 +659,8 @@ class ARMAsmParser : public MCTargetAsmParser {
 
   bool validateInstruction(MCInst &Inst, const OperandVector &Ops,
                            unsigned MnemonicOpsEndInd);
+  bool revalidateInstruction(MCInst &Inst, const OperandVector &Ops,
+                             unsigned MnemonicOpsEndInd, unsigned OrigOpcode);
   bool processInstruction(MCInst &Inst, const OperandVector &Ops,
                           unsigned MnemonicOpsEndInd, MCStreamer &Out);
   bool shouldOmitVectorPredicateOperand(StringRef Mnemonic,
@@ -8294,7 +8296,9 @@ bool ARMAsmParser::validateInstruction(MCInst &Inst,
     break;
   }
   case ARM::tADDi8: {
-    MCParsedAsmOperand &Op = *Operands[MnemonicOpsEndInd + 1];
+    int i = (Operands[MnemonicOpsEndInd + 1]->isImm()) ? MnemonicOpsEndInd + 1
+                                                       : MnemonicOpsEndInd + 2;
+    MCParsedAsmOperand &Op = *Operands[i];
     if (isARMMCExpr(Op) && !isThumbI8Relocation(Op))
       return Error(Op.getStartLoc(),
                    "Immediate expression for Thumb adds requires :lower0_7:,"
@@ -8302,7 +8306,7 @@ bool ARMAsmParser::validateInstruction(MCInst &Inst,
     break;
   }
   case ARM::tMOVi8: {
-    MCParsedAsmOperand &Op = *Operands[MnemonicOpsEndInd];
+    MCParsedAsmOperand &Op = *Operands[MnemonicOpsEndInd + 1];
     if (isARMMCExpr(Op) && !isThumbI8Relocation(Op))
       return Error(Op.getStartLoc(),
                    "Immediate expression for Thumb movs requires :lower0_7:,"
@@ -8653,6 +8657,41 @@ bool ARMAsmParser::validateInstruction(MCInst &Inst,
   return false;
 }
 
+// After processInstruction has transformed an instruction being assembled into
+// a different opcode, do any further validity checks that the new opcode
+// depends on.
+//
+// `Inst` contains the final modified form of the instruction, but `Operands`
+// contains the parsed operands from the _original_ instruction, because
+// nothing has updated them (`processInstruction` received them as const).
+// `OrigOpcode` contains the original value of `Inst.getOpcode()`, which should
+// give enough context to know how to understand the original operands list.
+bool ARMAsmParser::revalidateInstruction(MCInst &Inst,
+                                         const OperandVector &Operands,
+                                         unsigned MnemonicOpsEndInd,
+                                         unsigned OrigOpcode) {
+  const unsigned NewOpcode = Inst.getOpcode();
+
+  if (OrigOpcode == ARM::t2ADDri && NewOpcode == ARM::tADDi8) {
+    // t2ADDri is the Thumb 32-bit immediate add instruction, for example
+    // 'add[s] r0,r1,#0xff00'. If its immediate argument isn't a constant
+    // requiring shifting, then processInstruction can turn it into tADDi8, the
+    // simpler 16-bit Thumb immediate add (provided all the other conditions
+    // for that transformation are met). That makes it too late for
+    // validateInstruction to do this check, which it would have done if it had
+    // known from the start that the instruction was tADDi8.
+    int i = (Operands[MnemonicOpsEndInd + 1]->isImm()) ? MnemonicOpsEndInd + 1
+                                                       : MnemonicOpsEndInd + 2;
+    MCParsedAsmOperand &Op = *Operands[i];
+    if (isARMMCExpr(Op) && !isThumbI8Relocation(Op))
+      return Error(Op.getStartLoc(),
+                   "Immediate expression for Thumb adds requires :lower0_7:,"
+                   " :lower8_15:, :upper0_7: or :upper8_15:");
+  }
+
+  return false;
+}
+
 static unsigned getRealVSTOpcode(unsigned Opc, unsigned &Spacing) {
   switch(Opc) {
   default: llvm_unreachable("unexpected opcode!");
@@ -11399,7 +11438,7 @@ bool ARMAsmParser::matchAndEmitInstruction(SMLoc IDLoc, unsigned &Opcode,
   unsigned MnemonicOpsEndInd = getMnemonicOpsEndInd(Operands);
 
   switch (MatchResult) {
-  case Match_Success:
+  case Match_Success: {
     LLVM_DEBUG(dbgs() << "Parsed as: ";
                Inst.dump_pretty(dbgs(), MII.getName(Inst.getOpcode()));
                dbgs() << "\n");
@@ -11414,15 +11453,31 @@ bool ARMAsmParser::matchAndEmitInstruction(SMLoc IDLoc, unsigned &Opcode,
       return true;
     }
 
-    {
-      // Some instructions need post-processing to, for example, tweak which
-      // encoding is selected. Loop on it while changes happen so the
-      // individual transformations can chain off each other. E.g.,
-      // tPOP(r8)->t2LDMIA_UPD(sp,r8)->t2STR_POST(sp,r8)
-      while (processInstruction(Inst, Operands, MnemonicOpsEndInd, Out))
+    // Some instructions need post-processing to, for example, tweak which
+    // encoding is selected. Loop on it while changes happen so the individual
+    // transformations can chain off each other. E.g.,
+    // tPOP(r8)->t2LDMIA_UPD(sp,r8)->t2STR_POST(sp,r8)
+    //
+    // This is written as a do-while inside an if, instead of the more obvious
+    // while loop, so that after postprocessing has completed the revised
+    // instruction can be revalidated, but (to save time) only if any changes
+    // had to be made at all.
+    unsigned OrigOpcode = Inst.getOpcode();
+    if (processInstruction(Inst, Operands, MnemonicOpsEndInd, Out)) {
+      do {
         LLVM_DEBUG(dbgs() << "Changed to: ";
                    Inst.dump_pretty(dbgs(), MII.getName(Inst.getOpcode()));
                    dbgs() << "\n");
+      } while (processInstruction(Inst, Operands, MnemonicOpsEndInd, Out));
+
+      MnemonicOpsEndInd = getMnemonicOpsEndInd(Operands);
+      if (revalidateInstruction(Inst, Operands, MnemonicOpsEndInd,
+                                OrigOpcode)) {
+        // As above, advance IT/VPT positions if we're exiting early.
+        forwardITPosition();
+        forwardVPTPosition();
+        return true;
+      }
     }
 
     // Only move forward at the very end so that everything in validate
@@ -11445,6 +11500,7 @@ bool ARMAsmParser::matchAndEmitInstruction(SMLoc IDLoc, unsigned &Opcode,
       Out.emitInstruction(Inst, getSTI());
     }
     return false;
+  }
   case Match_NearMisses:
     ReportNearMisses(NearMisses, IDLoc, Operands);
     return true;
diff --git a/llvm/test/MC/ARM/lower-upper-errors.s b/llvm/test/MC/ARM/lower-upper-errors.s
new file mode 100644
index 00000000000000..34fa3c7155ba1e
--- /dev/null
+++ b/llvm/test/MC/ARM/lower-upper-errors.s
@@ -0,0 +1,20 @@
+// RUN: not llvm-mc --triple thumbv6m -show-encoding %s 2>&1 | FileCheck %s
+// RUN: not llvm-mc --triple thumbv7m -show-encoding %s 2>&1 | FileCheck %s --check-prefixes=CHECK,THUMB2
+
+// Check reporting of errors of the form "you should have used
+// :lower16: in this immediate field".
+
+// CHECK: :[[@LINE+1]]:10: error: Immediate expression for Thumb movs requires :lower0_7:, :lower8_15:, :upper0_7: or :upper8_15:
+movs r0, #foo
+
+// CHECK: :[[@LINE+1]]:10: error: Immediate expression for Thumb adds requires :lower0_7:, :lower8_15:, :upper0_7: or :upper8_15:
+adds r0, #foo
+
+// CHECK: :[[@LINE+1]]:14: error: Immediate expression for Thumb adds requires :lower0_7:, :lower8_15:, :upper0_7: or :upper8_15:
+adds r0, r0, #foo
+
+// THUMB2: :[[@LINE+1]]:10: error: immediate expression for mov requires :lower16: or :upper16
+movw r0, #foo
+
+// THUMB2: :[[@LINE+1]]:10: error: immediate expression for mov requires :lower16: or :upper16
+movt r0, #foo

AlfieRichardsArm

LGTM! The revalidate seems like a good pattern as there's more I can imagine could use this pattern

efriedma-quic · 2024-11-08T01:48:54Z

llvm/lib/Target/ARM/AsmParser/ARMAsmParser.cpp

+    // simpler 16-bit Thumb immediate add (provided all the other conditions
+    // for that transformation are met). That makes it too late for
+    // validateInstruction to do this check, which it would have done if it had
+    // known from the start that the instruction was tADDi8.


Could the handling for t2ADDri just skip transforming the instruction if we know the end result isn't a valid Thumb1 instruction? Seems weird to transform to a Thumb1 instruction we know won't work, then produce an error referencing the Thumb1 instruction instead of the original Thumb2 instruction.

Some additional testcases:

add r9, r0, #foo movs r0, :lower16:#foo movs r11, :upper8_15:#foo

It's not quite as weird, because the original instruction was only Thumb2 because of LLVM's decision – the user asked for something much more general-purpose. It doesn't seem totally unreasonable to me that "adds r0, #extern_symbol" should give the same error whether or not Thumb2 is available. It looks like the same instruction, after all!

I did wonder about trying to do a more complete fix. The real problem case would be something like this (targeting v7-M or v8-M Mainline):

adds r0, r0, #foo adds r1, r1, #bar .equ foo, :lower0_7:some_extern_symbol .equ bar, 0xff00

If this were ever to assemble successfully, it would have to be done by making the adds to r0 tADDi8 which can accept the necessary relocation, but the one to r1 t2ADDri which can accept the shifted immediate. But the assembler must commit to each instruction's opcode before it sees the necessary data further down the file.

So I don't think a really complete fix is possible without changing the entire structure of how the assembler works. But I can look into the alternative approach of trying to make the t2ADDri → tADDi8 transformation more discriminating, in the hope that we can at least fix cases that don't forward-refer to an equ.

(I also think this test for :lower0_7: and friends is done in the wrong place, because it prevents even a fixup being generated, whereas it should only worry about relocations. Fixups relative to symbols defined later in the same source file, so they can be sorted out by the time assembly finishes, could be treated more leniently. But again that involves a big rework that I didn't want to block fixing this one bug.)

OK, here's the revised version which does it that way, and also has your expanded tests.

The important things here are:

We don't crash

If an instruction is encodable in Thumb1 mode, it also has the same encoding in Thumb2 mode.

If we have to reject certain cases because they would introduce some ambiguity, that's fine, especially if there's some easy way for the user to rewrite their code.

Given that, current version seems okay. Maybe we should do relaxation as part of processing fixups, but that's obviously a much bigger change.

I've reverted the part of the previous commit that introduced and used `revalidateInstruction`, replacing it with a more careful check in `processInstruction` of whether to transform `t2ADDri` → `tADDi8` in the first place. I've kept the index fixes in the existing `validateInstruction`. Also expanded the testing considerably, including making two test files to deal with the fact that some of these unemittable instructions aren't detected until a later phase of assembly, which won't be reached if there are any errors in the first phase.

github-actions · 2024-11-08T15:29:28Z

✅ With the latest revision this PR passed the C/C++ code formatter.

efriedma-quic

LGTM

statham-arm requested review from AlfieRichardsArm, efriedma-quic, ostannard and stuij November 5, 2024 17:11

llvmbot added backend:ARM llvm:mc Machine (object) code labels Nov 5, 2024

AlfieRichardsArm approved these changes Nov 6, 2024

View reviewed changes

efriedma-quic reviewed Nov 8, 2024

View reviewed changes

statham-arm requested a review from AlfieRichardsArm November 8, 2024 15:26

git-clang-format

b6606df

efriedma-quic approved these changes Nov 13, 2024

View reviewed changes

statham-arm merged commit d97f17a into llvm:main Nov 14, 2024
8 checks passed

statham-arm deleted the lower0_7-error-fix branch November 14, 2024 09:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[MC][ARM] Fix crash when assembling Thumb 'movs r0,#foo'. #115026

[MC][ARM] Fix crash when assembling Thumb 'movs r0,#foo'. #115026

Uh oh!

statham-arm commented Nov 5, 2024 •

edited

Loading

Uh oh!

llvmbot commented Nov 5, 2024 •

edited

Loading

Uh oh!

AlfieRichardsArm left a comment

Uh oh!

efriedma-quic Nov 8, 2024

Uh oh!

statham-arm Nov 8, 2024

Uh oh!

statham-arm Nov 8, 2024

Uh oh!

statham-arm Nov 8, 2024

Uh oh!

efriedma-quic Nov 13, 2024

Uh oh!

github-actions bot commented Nov 8, 2024 •

edited

Loading

Uh oh!

efriedma-quic left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[MC][ARM] Fix crash when assembling Thumb 'movs r0,#foo'. #115026

[MC][ARM] Fix crash when assembling Thumb 'movs r0,#foo'. #115026

Uh oh!

Conversation

statham-arm commented Nov 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Nov 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AlfieRichardsArm left a comment

Choose a reason for hiding this comment

Uh oh!

efriedma-quic Nov 8, 2024

Choose a reason for hiding this comment

Uh oh!

statham-arm Nov 8, 2024

Choose a reason for hiding this comment

Uh oh!

statham-arm Nov 8, 2024

Choose a reason for hiding this comment

Uh oh!

statham-arm Nov 8, 2024

Choose a reason for hiding this comment

Uh oh!

efriedma-quic Nov 13, 2024

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Nov 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

efriedma-quic left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

statham-arm commented Nov 5, 2024 •

edited

Loading

llvmbot commented Nov 5, 2024 •

edited

Loading

github-actions bot commented Nov 8, 2024 •

edited

Loading