
Formalized Intel Syntax for x86

LIU Hao edited this page Jan 18, 2024 · 33 revisions

The Motivation

The assembly language for x86 and x86-64 has two major syntax variants: the Microsoft assembler (MASM) syntax and the GNU assembler (GAS) syntax. The MASM syntax, also known as the Intel syntax, is the one prescribed in the Intel Software Developer's Manual, and is used extensively by many non-GNU tools. The GNU syntax, also known as the AT&T syntax, derives from the PDP-11 assembly language that was used to create Unix, and is the default and dominant syntax in the post-Unix world.

The advantages of the MASM syntax are:

  1. It looks more modern and is closer to many other assembly languages, such as those for ARM, MIPS and RISC-V.
  2. It is the syntax in Intel and AMD documentation.

The disadvantages of the MASM syntax are:

  1. MASM is proprietary software.
  2. The syntax has not been formally defined, which sometimes causes ambiguity.

For instance, the Intel Software Developer's Manual contains this line:

MOV EBX, RAM_START

This is ambiguous in two ways. First, it could be interpreted as either of

MOV EBX, OFFSET RAM_START         ; `movl $RAM_START, %ebx`
MOV EBX, DWORD PTR [RAM_START]    ; `movl RAM_START, %ebx`

Second, on x86-64 the address might be RIP-relative or absolute, as in

MOV EBX, DWORD PTR [RAM_START]
         ; x86    absolute       ; 8B 1D    RAM_START   ; `movl RAM_START, %ebx`
         ; x86-64 RIP-relative   ; 8B 1D    RAM_START   ; `movl RAM_START(%rip), %ebx`
         ; x86-64 absolute       ; 8B 1C 25 RAM_START   ; `movl RAM_START, %ebx`
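
The byte patterns in the listing above follow from how the ModRM and SIB bytes are assembled. Below is a minimal Python sketch of that arithmetic, assuming the standard `8B /r` encoding of `MOV r32, r/m32`; the helper names `modrm`, `sib` and `mov_ebx_disp32` are hypothetical and exist only for illustration. The displacement is passed in as the raw 32-bit field, so for the RIP-relative form the caller would already have subtracted the address of the next instruction.

```python
import struct

EBX = 3  # register number of EBX in the ModRM.reg field


def modrm(mod, reg, rm):
    """Pack the three ModRM fields into one byte."""
    return (mod << 6) | (reg << 3) | rm


def sib(scale_bits, index, base):
    """Pack the three SIB fields into one byte."""
    return (scale_bits << 6) | (index << 3) | base


def mov_ebx_disp32(disp32, force_sib=False):
    """Encode `MOV EBX, DWORD PTR [disp32]` (hypothetical helper)."""
    out = bytearray([0x8B])
    if force_sib:
        # rm=100 requests a SIB byte; index=100 means "no index" and
        # base=101 with mod=00 means "disp32 only". On x86-64 this is
        # the absolute form.
        out.append(modrm(0b00, EBX, 0b100))
        out.append(sib(0b00, 0b100, 0b101))
    else:
        # mod=00, rm=101: absolute disp32 on x86, but RIP-relative
        # disp32 on x86-64. The bytes are identical in both modes.
        out.append(modrm(0b00, EBX, 0b101))
    out += struct.pack("<I", disp32 & 0xFFFFFFFF)
    return bytes(out)


print(mov_ebx_disp32(0x1000).hex(" "))                  # 8b 1d 00 10 00 00
print(mov_ebx_disp32(0x1000, force_sib=True).hex(" "))  # 8b 1c 25 00 10 00 00
```

The same `8B 1D` prefix therefore means two different things depending on the mode the processor is in, which is why the source text alone cannot tell an assembler which addressing form was intended.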

The first issue is resolved by always interpreting the operand as an indirect reference, but ambiguity can still arise when the symbol comes from a high-level language such as C.

When targeting x86, the Microsoft compiler decorates C identifiers: external names that denote objects, or functions with the __cdecl or __stdcall calling convention, are prefixed with an underscore _; external names that denote functions with the __fastcall calling convention are prefixed with an at sign @, and those with the __vectorcall calling convention are suffixed with @@ and the size of their arguments. This decoration prevents symbols from conflicting with keywords in assembly.

This is no longer the case for x86-64 (nor for ARM and ARM64), where such names are emitted without a prefix. If a user declares an external variable named RSI, the compiler may generate the ambiguous and incorrect

MOV EAX, DWORD PTR [RSI]    ; parsed as `movl (%rsi), %eax`
                            ; should have been `movl rsi, %eax`
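
The effect of decoration on this clash can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `decorate` and `clashes_with_register` are hypothetical helpers, the register set is a small subset, only objects and the `__cdecl`, `__stdcall` and `__fastcall` conventions are modeled, and `__vectorcall` is ignored.

```python
# Illustrative subset of register names; a real assembler knows many more.
REGISTERS = {
    "rax", "rcx", "rdx", "rbx", "rsi", "rdi", "rsp", "rbp",
    "eax", "ecx", "edx", "ebx", "esi", "edi", "esp", "ebp",
}


def decorate(name, target, convention=None, arg_bytes=0):
    """Return the assembly-level name of a C identifier (hypothetical model).

    convention=None stands for an object; functions pass "cdecl",
    "stdcall" or "fastcall" together with the size of their arguments.
    """
    if target != "x86":
        return name  # simplification: no prefix outside 32-bit x86
    if convention in (None, "cdecl"):
        return "_" + name
    if convention == "stdcall":
        return f"_{name}@{arg_bytes}"
    if convention == "fastcall":
        return f"@{name}@{arg_bytes}"
    raise ValueError(f"unmodeled calling convention: {convention}")


def clashes_with_register(symbol):
    """Register names are matched without regard to case."""
    return symbol.lower() in REGISTERS


for target in ("x86", "x86-64"):
    symbol = decorate("RSI", target)
    print(target, symbol, clashes_with_register(symbol))
```

On x86 the variable becomes `_RSI`, which cannot be mistaken for a register; on x86-64 it stays `RSI`, and an assembler that sees it inside brackets has no way to tell it apart from the register.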

This RFC proposes a formalization of the Intel syntax that resolves these ambiguities by disallowing certain constructions.

The Proposal

  1. Indirect references shall always contain a mode specifier. Plain brackets are no longer allowed.

    MOV EAX, [RCX]                         ; invalid syntax
    MOV EAX, DWORD [RCX]                   ; invalid syntax
    MOV EAX, DWORD PTR [RCX]               ; valid: `movl (%rcx), %eax`
    VMULPD ZMM0, ZMM1, QWORD BCST [RCX]    ; valid: `vmulpd (%rcx){1to8}, %zmm1, %zmm0`
  2. If an identifier follows PTR, BCST or OFFSET, then it is always treated as a symbol, even when it is a keyword. In other words, only registers are enclosed within brackets. This idea is shared with GAS syntax.

    MOV EAX, printf                        ; invalid: `printf` is not a known register
    MOV EAX, OFFSET printf                 ; valid: `movl $printf, %eax`
    MOV EAX, RCX                           ; invalid: operand size mismatch
    MOV EAX, OFFSET RCX                    ; valid: `movl $RCX, %eax`
    MOV EAX, DWORD PTR [RCX]               ; valid: `movl (%rcx), %eax`
    MOV EAX, DWORD PTR RCX                 ; valid: `movl rcx, %eax`
    MOV EAX, DWORD PTR RCX[RIP+10]         ; valid: `movl rcx+10(%rip), %eax`
  3. For instructions with a dummy memory operand (LEA, NOP, etc.) and those with an uncommon size (FXSAVE/FXRSTOR, FNSAVE/FNRSTOR, etc.), BYTE PTR shall be used.

    NOP DWORD PTR [RAX], EAX               ; valid: 0F 1F 00
  4. The base, index, scale and displacement parts of a memory operand shall appear in a uniform order. The displacement comes first, immediately following the mode specifier. If there is a base or index register, all registers are placed in a single pair of square brackets. This idea is also shared with GAS syntax.

    MOV ECX, DWORD PTR [RSI+RDI*4+field]   ; invalid: `field` is not a known register
    MOV ECX, DWORD PTR field[RSI+RDI*4]    ; valid: `movl field(%rsi,%rdi,4), %ecx`
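
To show how rules 1, 2 and 4 remove the need for guessing, here is a minimal Python sketch of a translator from the proposed memory-operand form to AT&T syntax. It is not part of the proposal: the function `to_att`, the regular expression and the register subset are assumptions made for illustration. Only `PTR` operands of the shape `disp[base+index*scale]` are handled, so `BCST`, `OFFSET` and constants inside the brackets are out of scope.

```python
import re

# Illustrative subset of 64-bit register names (not exhaustive).
REGS = {"RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI", "RIP"}
SIZES = {"BYTE": "b", "WORD": "w", "DWORD": "l", "QWORD": "q"}

# Rule 1: a mode specifier is mandatory.
# Rule 4: an optional displacement comes first, then optional brackets.
OPERAND = re.compile(
    r"^(BYTE|WORD|DWORD|QWORD)\s+PTR\s+"
    r"([A-Za-z_.$][\w.$]*|\d+)?\s*(?:\[([^\]]*)\])?$"
)


def to_att(operand):
    """Return (AT&T size suffix, AT&T memory operand) or raise ValueError."""
    m = OPERAND.match(operand.strip())
    if not m:
        raise ValueError("missing mode specifier or malformed operand")
    size, disp, inner = m.groups()
    if inner is None:
        # Rule 2: an identifier right after PTR is a symbol, even if it
        # is spelled like a register. It is kept exactly as written.
        return SIZES[size], disp
    base = index = None
    scale = 1
    for term in inner.split("+"):
        reg, _, factor = term.strip().partition("*")
        reg = reg.strip().upper()
        if reg not in REGS:
            # Rule 2 again: only registers may appear inside brackets.
            raise ValueError(f"`{term.strip()}` is not a known register")
        if factor:
            if index is not None or factor.strip() not in ("1", "2", "4", "8"):
                raise ValueError("bad index or scale")
            index, scale = reg, int(factor)
        elif base is None:
            base = reg
        elif index is None:
            index = reg
        else:
            raise ValueError("too many registers")
    regs = f"%{base.lower()}" if base else ""
    if index:
        regs += f",%{index.lower()},{scale}"
    return SIZES[size], f"{disp or ''}({regs})"


print(to_att("DWORD PTR field[RSI+RDI*4]"))  # ('l', 'field(%rsi,%rdi,4)')
print(to_att("DWORD PTR [RCX]"))             # ('l', '(%rcx)')
print(to_att("DWORD PTR RCX"))               # ('l', 'RCX')
```

Because the position of a name alone decides whether it is a symbol or a register, the translator never needs a symbol table: `to_att("DWORD PTR [RSI+RDI*4+field]")` and `to_att("[RCX]")` both raise `ValueError` in this sketch.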

External Links

  1. GCC Bug 53929 - [meta-bug] -masm=intel with global symbol