This document explains how the MicroASM compiler (microasm_compiler.cpp) and interpreter (microasm_interpreter.cpp) work together, transforming human-readable assembly code into an executable format and running it.
The compiler's primary role is to translate the symbolic .masm source code into a compact, machine-understandable binary format (.bin). This involves several steps:
- Parsing (`parse` and `parseLine`):
  - The source file is read line by line.
  - Each line is processed:
    - Leading/trailing whitespace is implicitly handled by stream extraction.
    - Lines starting with `;` (comments) or empty lines are ignored.
    - The first word on the line is treated as a potential instruction mnemonic or directive. It is converted to uppercase so that mnemonics are case-insensitive (e.g., `mov` becomes `MOV`).
    - Subsequent words on the line are treated as operands for the instruction/directive.
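The line-splitting step described above can be sketched as follows. This is a minimal illustration, not the compiler's actual `parseLine`: the `ParsedLine` type and the name `parseLineSketch` are hypothetical, and the sketch does not handle quoted strings containing spaces or trailing comments.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type; the real parser may store this differently.
struct ParsedLine {
    std::string mnemonic;               // uppercased first word; empty if the line is skipped
    std::vector<std::string> operands;  // remaining whitespace-separated words
};

ParsedLine parseLineSketch(const std::string& line) {
    ParsedLine result;
    std::istringstream in(line);
    std::string word;
    // operator>> skips leading whitespace; a failed read means the line is blank.
    if (!(in >> word) || word[0] == ';') {
        return result;  // blank line or comment: nothing to record
    }
    // Uppercase the mnemonic so that `mov` and `MOV` are treated the same.
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    result.mnemonic = word;
    // Every following word is kept verbatim as an operand string.
    while (in >> word) {
        result.operands.push_back(word);
    }
    return result;
}
```

Calling `parseLineSketch("  mov RAX 5")` yields the mnemonic `MOV` with operands `RAX` and `5`, while a comment or blank line yields an empty mnemonic.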
- Label Handling (`LBL` Directive):
  - If the first word is `LBL`, the next word is taken as the label name.
  - The compiler maintains a `labelMap` (e.g., `std::unordered_map<std::string, int>`).
  - It stores an entry in `labelMap` where the key is the label name prefixed with `#` (e.g., `#loop_start`) and the value is the `currentAddress`. `currentAddress` tracks the byte offset within the code segment where the next instruction would be placed. This ensures jumps point to the correct location in the final binary's code section.
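A small sketch of the label bookkeeping, assuming the label name appears without the `#` prefix after `LBL`. The `CodeLayout` type and its member functions are hypothetical names used only for illustration; `labelMap` and `currentAddress` follow the description above.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical holder for the two pieces of state the text describes.
struct CodeLayout {
    std::unordered_map<std::string, int> labelMap;  // "#name" -> code offset
    int currentAddress = 0;                         // offset of the next instruction

    // Called for `LBL name`: record where the next instruction will start.
    void defineLabel(const std::string& name) {
        labelMap["#" + name] = currentAddress;
    }

    // Called after each real instruction with its encoded size in bytes.
    void advance(int encodedBytes) {
        currentAddress += encodedBytes;
    }
};
```

Because a label records the offset of the instruction that follows it, a label defined after an 11-byte instruction maps to offset 11, which is the value a jump to that label needs.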
- Data Handling (`DB` Directive):
  - If the first word is `DB`, the next word is expected to be a data label (e.g., `$1`, `$myString`). The word after that is expected to be a double-quoted string literal.
  - The compiler maintains a `dataLabels` map (e.g., `std::unordered_map<std::string, int>`) and a `dataSegment` buffer (e.g., `std::vector<char>`).
  - The current `dataAddress` (tracking the size of `dataSegment`) is stored in `dataLabels` with the data label as the key. This records the starting offset of this data within the data segment.
  - The string literal is processed:
    - Quotes are removed.
    - Escape sequences (like `\n`, `\t`, `\\`, `\"`) are converted into their corresponding single byte values.
    - The resulting bytes are appended to the `dataSegment` buffer.
    - A null terminator (`\0`) is appended to the `dataSegment` buffer.
  - The `dataAddress` counter is incremented by the total number of bytes added (processed string length + 1 for null terminator).
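The `DB` processing above might look roughly like this. The `DataSection` type and `defineBytes` function are hypothetical; only the four escape sequences named in the text are handled, and rejecting anything else is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

struct DataSection {
    std::unordered_map<std::string, int> dataLabels;  // "$label" -> data offset
    std::vector<char> dataSegment;                    // raw bytes of all strings
    int dataAddress = 0;                              // always equals dataSegment.size()

    // `quoted` is the literal as written in the source, including both quotes.
    void defineBytes(const std::string& label, const std::string& quoted) {
        if (quoted.size() < 2 || quoted.front() != '"' || quoted.back() != '"') {
            throw std::runtime_error("DB expects a double-quoted string");
        }
        dataLabels[label] = dataAddress;  // starting offset of this string
        int added = 0;
        // Walk the characters between the quotes.
        for (std::size_t i = 1; i + 1 < quoted.size(); ++i) {
            char c = quoted[i];
            if (c == '\\') {
                if (i + 2 >= quoted.size()) {
                    throw std::runtime_error("dangling backslash in string");
                }
                switch (quoted[++i]) {  // consume the escaped character
                    case 'n':  c = '\n'; break;
                    case 't':  c = '\t'; break;
                    case '\\': c = '\\'; break;
                    case '"':  c = '"';  break;
                    default: throw std::runtime_error("unknown escape sequence");
                }
            }
            dataSegment.push_back(c);
            ++added;
        }
        dataSegment.push_back('\0');  // null terminator
        dataAddress += added + 1;     // processed length + terminator
    }
};
```

With this sketch, `DB $1 "Hi\n"` contributes four bytes (`H`, `i`, newline, `\0`), so a following `DB` starts at data offset 4.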
- Instruction & Operand Processing:
  - If the first word is a recognized instruction mnemonic (not `LBL` or `DB`):
    - The mnemonic is looked up in an `opcodeMap` to get its numerical `Opcode` value (e.g., `MOV` -> `0x01`).
    - The remaining words on the line are stored as operand strings associated with this instruction.
    - An internal `Instruction` structure (holding the opcode and operand strings) is added to a list (`instructions`).
    - The `currentAddress` (code offset) is incremented. The size added is 1 byte for the opcode, plus `N * (1 + 4)` bytes for the operands, where `N` is the number of operands (1 byte for `OperandType`, 4 bytes for `value`). This calculation anticipates the final binary size.
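To make the size arithmetic concrete, here is a hedged sketch of the opcode lookup and the per-instruction size. Only `MOV -> 0x01` comes from the text; the other opcode values, and the function names `makeInstruction` and `encodedSize`, are placeholders invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Instruction {
    std::uint8_t opcode;
    std::vector<std::string> operands;  // still unresolved strings at this stage
};

// Only MOV -> 0x01 is taken from the text; ADD and HLT values are placeholders.
const std::unordered_map<std::string, std::uint8_t> opcodeMap = {
    {"MOV", 0x01}, {"ADD", 0x02}, {"HLT", 0xFF}};

Instruction makeInstruction(const std::string& mnemonic,
                            std::vector<std::string> operands) {
    auto it = opcodeMap.find(mnemonic);
    if (it == opcodeMap.end()) {
        throw std::runtime_error("unknown instruction: " + mnemonic);
    }
    return Instruction{it->second, std::move(operands)};
}

// 1 opcode byte, then (1 type byte + 4 value bytes) per operand.
int encodedSize(const Instruction& ins) {
    return 1 + static_cast<int>(ins.operands.size()) * (1 + 4);
}
```

For example, `MOV RAX 5` has two operands and therefore occupies 1 + 2 * 5 = 11 bytes, while an operand-less instruction occupies a single byte.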
- Operand Resolution (`resolveOperand`):
  - This helper function is used during parsing (for address calculation, where it is less critical) and, most importantly, during binary generation. It takes an operand string and determines its type and numerical value for the bytecode.
  - Labels (`#label_name`): Looks up `#label_name` in `labelMap`. Returns type `LABEL_ADDRESS` and the stored code offset value. Throws an error if the label is not found.
  - Data Labels (`$data_label`): Looks up `$data_label` in `dataLabels`. Returns type `DATA_ADDRESS` and the stored data offset value. Throws an error if not found.
  - Registers (`RAX`, `R0`, etc.): Converts the name to uppercase. Looks up the name in a `regMap`. Returns type `REGISTER` and the register's index (0-23). Throws an error if not a valid register.
  - Immediates (`123`, `$500`): Attempts to parse the string (after removing a leading `$` if present) as an integer. Returns type `IMMEDIATE` and the integer value. Throws an error if parsing fails or the value is out of the 32-bit signed range.
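One way the resolution order could work is sketched below. Several details are assumptions rather than facts about `resolveOperand`: the numeric values of the operand types other than `REGISTER = 0x01`, the `SymbolTables` grouping, and in particular the rule that a `$name` not found in `dataLabels` falls through to be tried as an immediate such as `$500`.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>
#include <unordered_map>

// REGISTER = 1 matches the binary layout example; the other values are assumed.
enum class OperandType : std::uint8_t {
    REGISTER = 1, IMMEDIATE = 2, LABEL_ADDRESS = 3, DATA_ADDRESS = 4
};

struct ResolvedOperand { OperandType type; std::int32_t value; };

struct SymbolTables {  // hypothetical grouping of the three maps
    std::unordered_map<std::string, int> labelMap, dataLabels, regMap;
};

ResolvedOperand resolveOperandSketch(const std::string& text, const SymbolTables& t) {
    if (text.empty()) throw std::runtime_error("empty operand");
    if (text[0] == '#') {  // code label
        auto it = t.labelMap.find(text);
        if (it == t.labelMap.end()) throw std::runtime_error("undefined label: " + text);
        return {OperandType::LABEL_ADDRESS, static_cast<std::int32_t>(it->second)};
    }
    if (text[0] == '$') {  // data label, if one with this name exists
        auto it = t.dataLabels.find(text);
        if (it != t.dataLabels.end()) {
            return {OperandType::DATA_ADDRESS, static_cast<std::int32_t>(it->second)};
        }
        // Assumption: otherwise fall through and try it as an immediate ($500).
    }
    std::string upper = text;  // register names are case-insensitive
    std::transform(upper.begin(), upper.end(), upper.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    auto reg = t.regMap.find(upper);
    if (reg != t.regMap.end()) {
        return {OperandType::REGISTER, static_cast<std::int32_t>(reg->second)};
    }
    const std::string digits = (text[0] == '$') ? text.substr(1) : text;
    std::size_t used = 0;
    long long v = 0;
    try {
        v = std::stoll(digits, &used);
    } catch (const std::exception&) {
        throw std::runtime_error("invalid operand: " + text);
    }
    if (used != digits.size()) throw std::runtime_error("invalid operand: " + text);
    if (v < std::numeric_limits<std::int32_t>::min() ||
        v > std::numeric_limits<std::int32_t>::max()) {
        throw std::runtime_error("immediate out of 32-bit range: " + text);
    }
    return {OperandType::IMMEDIATE, static_cast<std::int32_t>(v)};
}
```

Checking `dataLabels` before attempting the integer parse is what lets `$1` name a data label while `$500` is still accepted as a number, provided no data label named `$500` exists.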
- Binary File Generation (`compile` function):
  - This function orchestrates writing the final `.bin` file.
  - Calculate Sizes: The actual code size is recalculated by iterating through the `instructions` list and summing the sizes (1 for opcode + N*(1+4) for operands), skipping any `DB` pseudo-instructions encountered during the initial parse. The data size is simply the final size of the `dataSegment` buffer.
  - Create Header: A `BinaryHeader` struct is populated with the magic number (`0x4D53414D`), version, calculated `codeSize`, calculated `dataSize`, and the `entryPoint` (currently hardcoded to 0, meaning execution starts at the beginning of the code segment).
  - Write Header: The `BinaryHeader` struct is written as raw bytes to the beginning of the output `.bin` file.
  - Write Code Segment: The compiler iterates through the `instructions` list again:
    - Pseudo-instructions like `DB` are skipped.
    - For each actual instruction:
      - The `Opcode` byte is written.
      - For each operand string associated with the instruction:
        - `resolveOperand` is called to get the final `OperandType` and `value`.
        - The `OperandType` byte is written.
        - The `value` (as a 4-byte integer) is written.
  - Write Data Segment: The entire contents of the `dataSegment` buffer (containing all processed strings and null terminators from `DB` directives) are written to the file immediately following the code segment.
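The code-segment step can be illustrated with a sketch that encodes already-resolved instructions into a byte buffer instead of a file. `EncodedOperand`, `EncodedInstruction`, and `encodeCodeSegment` are hypothetical names; the 4-byte value is copied in host byte order, which is what a raw write of an `int` would produce.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Operands after resolution: a type tag plus a 32-bit value (hypothetical types).
struct EncodedOperand { std::uint8_t type; std::int32_t value; };
struct EncodedInstruction {
    std::uint8_t opcode;
    std::vector<EncodedOperand> operands;
};

std::vector<std::uint8_t> encodeCodeSegment(
        const std::vector<EncodedInstruction>& program) {
    std::vector<std::uint8_t> out;
    for (const auto& ins : program) {
        out.push_back(ins.opcode);               // 1 byte: opcode
        for (const auto& op : ins.operands) {
            out.push_back(op.type);              // 1 byte: operand type
            std::uint8_t raw[4];
            std::memcpy(raw, &op.value, 4);      // 4 bytes: value, host byte order
            out.insert(out.end(), raw, raw + 4);
        }
    }
    return out;
}
```

The length of the returned buffer is exactly the `codeSize` that goes into the header, so the recalculated size and the bytes written cannot drift apart in this formulation.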
The compiler produces a .bin file with a specific layout, essential for the interpreter to understand:
+-----------------------+ <-- File Start
| BinaryHeader          | (Fixed size: 20 bytes for the fields below)
| - magic: uint32_t     | (e.g., 0x4D53414D for "MASM")
| - version: uint16_t   | (e.g., 1)
| - reserved: uint16_t  | (Padding/Future use, currently 0)
| - codeSize: uint32_t  | (Size of the Code Segment in bytes)
| - dataSize: uint32_t  | (Size of the Data Segment in bytes)
| - entryPoint: uint32_t| (Offset within Code Segment to start execution)
+-----------------------+ <-- Header End / Code Segment Start
| Code Segment          | (Variable size: header.codeSize bytes)
| - Opcode (1 byte)     | (e.g., 0x01 for MOV)
| - OperandType (1 byte)| (e.g., 0x01 for REGISTER)
| - OperandValue (4 B)  | (e.g., 0 for RAX, or an immediate value, or an offset)
| - OperandType (1 byte)| (If instruction has >1 operand)
| - OperandValue (4 B)  | (...)
| - Opcode (1 byte)     | (Next instruction)
| - ...                 |
+-----------------------+ <-- Code Segment End / Data Segment Start
| Data Segment          | (Variable size: header.dataSize bytes)
| - Raw bytes from DB   | (e.g., 'H','e','l','l','o','\0', 'W','o','r',...)
| - ...                 |
+-----------------------+ <-- Data Segment End / File End
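The header fields listed in the layout total 4 + 2 + 2 + 4 + 4 + 4 = 20 bytes. A struct matching that field list, with compile-time checks on its size, might be declared as below; the helper function names are hypothetical, and the checks assume a platform where these fixed-width fields need no padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct BinaryHeader {
    std::uint32_t magic;       // 0x4D53414D
    std::uint16_t version;     // e.g., 1
    std::uint16_t reserved;    // currently 0
    std::uint32_t codeSize;    // bytes in the code segment
    std::uint32_t dataSize;    // bytes in the data segment
    std::uint32_t entryPoint;  // offset into the code segment
};

// Fail the build if the compiler lays the struct out differently.
static_assert(sizeof(BinaryHeader) == 20, "BinaryHeader must be 20 bytes");
static_assert(offsetof(BinaryHeader, codeSize) == 8, "unexpected field offset");

// File offsets of the two segments, derived from the header alone.
constexpr std::size_t codeSegmentOffset() { return sizeof(BinaryHeader); }

inline std::size_t dataSegmentOffset(const BinaryHeader& h) {
    return sizeof(BinaryHeader) + h.codeSize;
}

inline std::size_t expectedFileSize(const BinaryHeader& h) {
    return dataSegmentOffset(h) + h.dataSize;
}
```

Since the header is written and read as raw bytes, a size check like this guards against a compiler or platform change silently shifting where the code segment begins.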
The interpreter executes the instructions contained within the .bin file.
- Loading (`load` function):
  - The specified `.bin` file is opened in binary mode.
  - Read Header: The first `sizeof(BinaryHeader)` bytes are read from the file into a `BinaryHeader` struct.
  - Validate Header: The `magic` number is checked against `0x4D53414D`. If it doesn't match, the file is not a valid MicroASM binary, and an error is thrown. Version checks could also be performed here.
  - Allocate RAM: A `std::vector<char>` named `ram` is created with a specified size (e.g., 65536 bytes for 64KB). This simulates the computer's main memory.
  - Determine Data Segment Base: A variable `dataSegmentBase` is calculated. This is the starting address within the simulated RAM where the data segment will be loaded (e.g., `ram.size() / 2`). This base address is crucial for resolving `DATA_ADDRESS` operands later.
  - Load Code Segment: `header.codeSize` bytes are read from the file (immediately following the header) into the `bytecode_raw` buffer (`std::vector<uint8_t>`). This buffer now contains only the executable instructions and their operands.
  - Load Data Segment: `header.dataSize` bytes are read from the file (immediately following the code segment) directly into the `ram` vector, starting at the `dataSegmentBase` offset.
  - Set Instruction Pointer: The interpreter's instruction pointer (`ip`, an integer index into `bytecode_raw`) is initialized to the value specified by `header.entryPoint`.
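The loading steps can be sketched against an in-memory copy of the file. Note the differences from the description: rather than reading a raw `BinaryHeader` struct, this sketch decodes the header fields at fixed offsets (assuming the 20-byte field layout and host byte order), and it adds bounds checks that the text does not mention. `LoadedProgram`, `readU32`, and `loadImage` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

struct LoadedProgram {
    std::vector<std::uint8_t> bytecode_raw;  // code segment only
    std::vector<char> ram;                   // simulated memory
    std::size_t dataSegmentBase = 0;         // where the data segment lives in ram
    std::uint32_t ip = 0;                    // starts at header.entryPoint
};

std::uint32_t readU32(const std::vector<std::uint8_t>& bytes, std::size_t offset) {
    if (offset + 4 > bytes.size()) throw std::runtime_error("truncated file");
    std::uint32_t v = 0;
    std::memcpy(&v, bytes.data() + offset, 4);
    return v;
}

LoadedProgram loadImage(const std::vector<std::uint8_t>& file,
                        std::size_t ramSize = 65536) {
    constexpr std::size_t kHeaderSize = 20;  // magic, version, reserved, 3 sizes
    if (readU32(file, 0) != 0x4D53414Du) throw std::runtime_error("bad magic number");
    const std::size_t codeSize = readU32(file, 8);
    const std::size_t dataSize = readU32(file, 12);

    LoadedProgram p;
    p.ip = readU32(file, 16);
    p.ram.assign(ramSize, 0);
    p.dataSegmentBase = ramSize / 2;

    if (kHeaderSize + codeSize + dataSize > file.size()) {
        throw std::runtime_error("segments extend past end of file");
    }
    if (p.dataSegmentBase + dataSize > ramSize) {
        throw std::runtime_error("data segment does not fit in RAM");
    }
    const auto codeBegin = file.begin() + static_cast<std::ptrdiff_t>(kHeaderSize);
    const auto dataBegin = codeBegin + static_cast<std::ptrdiff_t>(codeSize);
    p.bytecode_raw.assign(codeBegin, dataBegin);
    std::copy(dataBegin, dataBegin + static_cast<std::ptrdiff_t>(dataSize),
              p.ram.begin() + static_cast<std::ptrdiff_t>(p.dataSegmentBase));
    return p;
}
```

Keeping the code in `bytecode_raw` and the data in `ram` means a `DATA_ADDRESS` operand only becomes a usable address once `dataSegmentBase` is added to it.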
- Execution Loop (`execute` function):
  - The interpreter enters a `while` loop that continues as long as `ip` points to a valid location within the `bytecode_raw` buffer (`ip < bytecode_raw.size()`).
  - Fetch Opcode: The byte at `bytecode_raw[ip]` is read. This is the `Opcode` for the current instruction. `ip` is incremented by 1.
  - Decode & Fetch Operands (`nextRawOperand`):
    - Based on the fetched `Opcode`, the interpreter knows how many operands to expect (using a lookup table or switch statement).
    - For each expected operand, `nextRawOperand` is called.
    - `nextRawOperand` reads the `OperandType` byte from `bytecode_raw[ip]` (incrementing `ip`) and then reads the next 4 bytes as the operand's raw `value` (incrementing `ip` by 4). It returns a `BytecodeOperand` struct containing the type and value.
  - Resolve Operand Values (`getValue`):
    - Before the instruction logic uses an operand, `getValue` is often called on the `BytecodeOperand` struct obtained from `nextRawOperand`.
    - If type is `REGISTER`: It returns the integer value currently stored in the `registers` vector at the index specified by `operand.value`.
    - If type is `IMMEDIATE` or `LABEL_ADDRESS`: It returns `operand.value` directly, as this value represents a literal number or a code offset (which is used directly as the target for jumps/calls).
    - If type is `DATA_ADDRESS`: It calculates and returns the absolute RAM address: `dataSegmentBase + operand.value`. This translates the data offset (from the bytecode) into a usable memory address within the simulated `ram`.
  - Execute Instruction: A large `switch` statement based on the `Opcode` performs the required action:
    - Register Modification: Instructions like `MOV`, `ADD`, `SUB`, `POP` read values (using `getValue` if necessary) and write results directly into the `registers` vector.
    - Memory Modification: Instructions like `MOVTO`, `FILL`, `COPY` calculate absolute RAM addresses (using `getValue` for base addresses and offsets) and use helper functions (`writeRamInt`, `writeRamChar`, `memcpy`, `memset`) to modify the `ram` vector. `MOVADDR` reads from RAM using `readRamInt`.
    - Flow Control: `JMP`, `CALL`, and conditional jumps (`JE`, `JNE`, etc.) modify the `ip` register directly. `CALL` also pushes the next instruction's address (`ip` after fetching operands) onto the stack before changing `ip`. `RET` pops an address from the stack into `ip`.
    - Flags: `CMP` and `CMP_MEM` calculate results and update internal boolean flags (`zeroFlag`, `signFlag`). Conditional jumps read these flags.
    - Stack Pointer: `PUSH`, `POP`, `CALL`, `RET`, `ENTER`, `LEAVE` modify the `RSP` register (index 7) and interact with RAM via `readRamInt`/`writeRamInt` at the `RSP` address (adjusting `RSP` before/after).
    - I/O: `OUT`, `COUT`, etc., read values/addresses (using `getValue`), potentially read strings/chars from `ram` using helpers (`readRamString`, `readRamChar`), and print to `std::cout` or `std::cerr`.
  - Loop Continuation: The loop fetches the next opcode unless `HLT` was executed (which terminates the loop/program) or an error occurred.
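The fetch, decode, resolve, and execute steps fit together roughly as in the following miniature interpreter. It is a sketch under several assumptions: only four instructions are implemented, every opcode and operand-type value except `MOV = 0x01` and `REGISTER = 0x01` is a placeholder, and `MiniVM` with its `reg` helper is a hypothetical stand-in for the real interpreter class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Values other than REGISTER = 1 and MOV = 0x01 are placeholders.
enum OperandType : std::uint8_t { REGISTER = 1, IMMEDIATE = 2, LABEL_ADDRESS = 3, DATA_ADDRESS = 4 };
enum Opcode : std::uint8_t { MOV = 0x01, ADD = 0x02, JMP = 0x03, HLT = 0xFF };

struct BytecodeOperand { std::uint8_t type; std::int32_t value; };

struct MiniVM {
    std::vector<std::uint8_t> bytecode_raw;
    std::vector<int> registers = std::vector<int>(24, 0);
    std::size_t ip = 0;
    int dataSegmentBase = 32768;  // e.g., half of a 64KB RAM

    // Reads 1 type byte and 4 value bytes, advancing ip by 5.
    BytecodeOperand nextRawOperand() {
        if (ip + 5 > bytecode_raw.size()) throw std::runtime_error("operand past end of code");
        BytecodeOperand op{};
        op.type = bytecode_raw[ip];
        std::memcpy(&op.value, &bytecode_raw[ip + 1], 4);
        ip += 5;
        return op;
    }

    // Turns a raw operand into the number the instruction should use.
    int getValue(const BytecodeOperand& op) const {
        switch (op.type) {
            case REGISTER:      return registers.at(static_cast<std::size_t>(op.value));
            case IMMEDIATE:
            case LABEL_ADDRESS: return op.value;
            case DATA_ADDRESS:  return dataSegmentBase + op.value;
            default: throw std::runtime_error("unknown operand type");
        }
    }

    // Destination operands must name a register.
    int& reg(const BytecodeOperand& op) {
        if (op.type != REGISTER) throw std::runtime_error("destination must be a register");
        return registers.at(static_cast<std::size_t>(op.value));
    }

    void execute() {
        while (ip < bytecode_raw.size()) {
            const std::uint8_t opcode = bytecode_raw[ip++];  // fetch
            switch (opcode) {
                case MOV: { auto dst = nextRawOperand(); auto src = nextRawOperand();
                            reg(dst) = getValue(src); break; }
                case ADD: { auto dst = nextRawOperand(); auto src = nextRawOperand();
                            reg(dst) += getValue(src); break; }
                case JMP: { auto target = nextRawOperand();
                            ip = static_cast<std::size_t>(getValue(target)); break; }
                case HLT: return;  // stop the loop
                default: throw std::runtime_error("unknown opcode");
            }
        }
    }
};
```

Running a three-instruction program (`MOV R0 5`, `ADD R0 7`, `HLT`) through this sketch leaves 12 in register 0, and a `DATA_ADDRESS` operand with offset 4 resolves to `dataSegmentBase + 4`.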
This detailed process ensures that the symbolic assembly code is correctly translated into executable bytecode, and the interpreter can accurately load and run that bytecode by managing registers, simulated RAM, and the instruction pointer according to the defined instruction set.