From assembler to bytecode

gmolin edited this page Apr 17, 2020 · 3 revisions

File structure

To understand how assembly code is translated into bytecode, we first need to look at the structure of a file with the .cor extension.

To do this, we will translate the following champion using the asm program provided with the task:

.name "Batman"
.comment "This city needs me"

loop:
        sti r1, %:live, %1
live:
        live %0
        ld %0, r2
        zjmp %:loop

The resulting file will have the following structure:

[Figure: Bytecode file structure]

Magic header

The first 4 bytes in the file are the “magic number”.

It is defined by the constant COREWAR_EXEC_MAGIC, which is set to 0xea83f3 in the example op.h file.

What is a magic header and why is it needed?

A magic header is a special number whose job is to signal that the given file is binary.

Without such a marker, the file could be interpreted as a text file.
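As a minimal sketch, the header could be written and checked like this. The byte order (most significant byte first) is an assumption that matches how the 2-byte argument 0x0007 is laid out later on this page; the helper names `magic_header` and `has_magic` are hypothetical.

```python
import struct

COREWAR_EXEC_MAGIC = 0xea83f3  # value from the example op.h


def magic_header(magic: int = COREWAR_EXEC_MAGIC) -> bytes:
    """Return the magic number as 4 bytes (big-endian is assumed here)."""
    # ">I" means: big-endian, unsigned 32-bit integer
    return struct.pack(">I", magic)


def has_magic(data: bytes, magic: int = COREWAR_EXEC_MAGIC) -> bool:
    """Check whether a byte string starts with the expected magic header."""
    return data[:4] == magic_header(magic)


print(magic_header().hex())
```

Note that the value 0xea83f3 only needs 3 bytes, so the first of the 4 bytes is zero.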

Champion name

The next 128 bytes in the file hold the champion's name. Why 128? This is the value of the constant PROG_NAME_LENGTH, which determines the maximum length of the name string.

The name of our champion is much shorter, but in the bytecode it still takes up all 128 bytes.

According to the translation rules, if the name is shorter than the established limit, the missing characters are filled in with zero bytes.

Each character of the name is converted into its 1-byte ASCII code, shown here in hexadecimal notation:

| Symbol | B | a | t | m | a | n |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ASCII code | 0x42 | 0x61 | 0x74 | 0x6d | 0x61 | 0x6e |

Instead of the missing characters, we write zero bytes.
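The padding rule can be sketched as follows. The function name `encode_name` is hypothetical, and rejecting names that are too long is an assumption about how an assembler might react, not something this page specifies.

```python
PROG_NAME_LENGTH = 128  # value of the constant named above


def encode_name(name: str, length: int = PROG_NAME_LENGTH) -> bytes:
    """Encode a champion name as ASCII and pad it with zero bytes."""
    raw = name.encode("ascii")
    if len(raw) > length:
        # Assumption: an over-long name is treated as an error
        raise ValueError(f"name is longer than {length} bytes")
    # bytes.ljust fills the remaining space on the right with zero bytes
    return raw.ljust(length, b"\x00")


field = encode_name("Batman")
print(field[:6].hex(), len(field))
```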

NULL

The next 4 bytes in the file structure are reserved for a kind of checkpoint: four zero octets.

They carry no information. Their only job is to be in the right place.

Champion exec code size

These 4 bytes contain important information - the size of the champion's executable code in bytes.

As we recall, the virtual machine must make sure that the size of the executable code does not exceed the limit set by the constant CHAMP_MAX_SIZE. In the provided op.h file, it is equal to 682.
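A possible way to produce this field is sketched below. The name `encode_code_size` is hypothetical, the big-endian byte order is an assumption, and the size check is placed in the writer purely for illustration (the text above only says the virtual machine performs it).

```python
import struct

CHAMP_MAX_SIZE = 682  # value from the provided op.h


def encode_code_size(code: bytes) -> bytes:
    """Return the 4-byte size field for a champion's executable code."""
    if len(code) > CHAMP_MAX_SIZE:
        raise ValueError(f"executable code exceeds {CHAMP_MAX_SIZE} bytes")
    # ">I" means: big-endian, unsigned 32-bit integer
    return struct.pack(">I", len(code))


# The Batman champion on this page has 22 bytes of executable code
print(encode_code_size(bytes(22)).hex())
```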

Champion comment

The next 2048 bytes are occupied by the champion's comment. This part is completely analogous to the champion name part, except that the maximum length is set by the constant COMMENT_LENGTH.

NULL

And again 4 zero octets.

Champion exec code

The last part of the file is the champion's executable code.

Unlike a name or comment, it is not padded with null bytes.
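Putting the six header fields together, the layout can be read back with a small parser. This is only a sketch: `parse_cor` and the dictionary keys are hypothetical names, and big-endian integers are assumed.

```python
import struct

PROG_NAME_LENGTH = 128
COMMENT_LENGTH = 2048


def parse_cor(data: bytes) -> dict:
    """Split the contents of a .cor file into the parts described above."""
    offset = 0

    def take(count: int) -> bytes:
        # Read the next `count` bytes and advance the read position
        nonlocal offset
        chunk = data[offset:offset + count]
        if len(chunk) != count:
            raise ValueError("file is truncated")
        offset += count
        return chunk

    magic = struct.unpack(">I", take(4))[0]
    name = take(PROG_NAME_LENGTH).rstrip(b"\x00").decode("ascii")
    take(4)  # first block of four zero bytes
    code_size = struct.unpack(">I", take(4))[0]
    comment = take(COMMENT_LENGTH).rstrip(b"\x00").decode("ascii")
    take(4)  # second block of four zero bytes
    code = data[offset:]  # everything that is left, not padded
    return {"magic": magic, "name": name, "code_size": code_size,
            "comment": comment, "code": code}
```

With these constants the executable code always starts at offset 4 + 128 + 4 + 4 + 2048 + 4 = 2192.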

Encoding statements

In order to understand how statement coding works, we need two tables.

Statement table

First of all, we need an updated table of statements with the columns "Argument Type Code" and "Size T_DIR".

| Code | Name | Argument #1 | Argument #2 | Argument #3 | Argument Type Code | Size T_DIR |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0x01 | live | T_DIR | - | - | No | 4 |
| 0x02 | ld | T_DIR / T_IND | T_REG | - | Yes | 4 |
| 0x03 | st | T_REG | T_REG / T_IND | - | Yes | 4 |
| 0x04 | add | T_REG | T_REG | T_REG | Yes | 4 |
| 0x05 | sub | T_REG | T_REG | T_REG | Yes | 4 |
| 0x06 | and | T_REG / T_DIR / T_IND | T_REG / T_DIR / T_IND | T_REG | Yes | 4 |
| 0x07 | or | T_REG / T_DIR / T_IND | T_REG / T_DIR / T_IND | T_REG | Yes | 4 |
| 0x08 | xor | T_REG / T_DIR / T_IND | T_REG / T_DIR / T_IND | T_REG | Yes | 4 |
| 0x09 | zjmp | T_DIR | - | - | No | 2 |
| 0x0a | ldi | T_REG / T_DIR / T_IND | T_REG / T_DIR | T_REG | Yes | 2 |
| 0x0b | sti | T_REG | T_REG / T_DIR / T_IND | T_REG / T_DIR | Yes | 2 |
| 0x0c | fork | T_DIR | - | - | No | 2 |
| 0x0d | lld | T_DIR / T_IND | T_REG | - | Yes | 4 |
| 0x0e | lldi | T_REG / T_DIR / T_IND | T_REG / T_DIR | T_REG | Yes | 2 |
| 0x0f | lfork | T_DIR | - | - | No | 2 |
| 0x10 | aff | T_REG | - | - | Yes | 4 |

Why do we need a Size T_DIR for statements that do not accept arguments of this type?

At this stage, this information really is of no use. But it will be needed while the virtual machine is running, as discussed in the relevant section, Virtual machine.

The complete statement table can be viewed on Google Sheets.

Argument table

The second table we need contains information about the codes of argument types and their sizes.

| Type | Sign | Code | Size |
| :---: | :---: | :---: | :---: |
| T_REG | r | 01 | 1 byte |
| T_DIR | % | 10 | Size T_DIR |
| T_IND | - | 11 | 2 bytes |

The complete argument table can be viewed on Google Sheets.
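The two tables combine into a simple size lookup. In this sketch the dictionary `T_DIR_SIZE` is copied from the "Size T_DIR" column of the statement table, while the function name `arg_size` and the string type names are hypothetical choices made for illustration.

```python
# "Size T_DIR" column of the statement table
T_DIR_SIZE = {
    "live": 4, "ld": 4, "st": 4, "add": 4, "sub": 4, "and": 4,
    "or": 4, "xor": 4, "zjmp": 2, "ldi": 2, "sti": 2, "fork": 2,
    "lld": 4, "lldi": 2, "lfork": 2, "aff": 4,
}


def arg_size(statement: str, arg_type: str) -> int:
    """Return the size in bytes of one encoded argument."""
    if arg_type == "T_REG":
        return 1  # only the register number is stored
    if arg_type == "T_IND":
        return 2
    if arg_type == "T_DIR":
        return T_DIR_SIZE[statement]  # depends on the statement
    raise ValueError(f"unknown argument type: {arg_type}")


print(arg_size("sti", "T_DIR"), arg_size("ld", "T_DIR"))
```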

About registers and their sizes

It is important to distinguish two size characteristics of a register.

The name of a register (r1, r2 ...) takes 1 byte in the bytecode. But the register itself holds 4 bytes, as indicated by the REG_SIZE constant.

About the size of arguments of type T_DIR

As you can see in the statement table, the size of arguments of type T_DIR is not fixed and depends on the statement. But in the file op.h there is a preprocessor constant which says that the size of T_DIR is equal to the size of a register - 4 bytes:

# define IND_SIZE 2
# define REG_SIZE 4
# define DIR_SIZE REG_SIZE

Where is the logic here?

There is logic here, but it is rather vague.

As we have already said, in order to write a number of type T_DIR into a register and to store that number back into memory, the sizes of registers and of arguments of type T_DIR must match.

This claim is correct. If you look at the statements that load a value into a register, the size of T_DIR is equal to 4 for all of them.

The size of T_DIR is also equal to 4 for the live statement.

As for the statements for which the size of T_DIR is equal to 2, in these cases an argument of this type is only used to form an address. For that task, 4 bytes are redundant: the amount of memory is only 4096 bytes, as indicated by the constant MEM_SIZE, and any address that exceeds this limit will be reduced modulo MEM_SIZE, if that has not already been done using IDX_MOD.

In this situation, an argument of type T_DIR plays the role of a relative address, that is, of an argument of type T_IND. This is why its size is equal to IND_SIZE (2 bytes).

Encoding Algorithm

Each statement presented in bytecode has the following structure:

  1. The statement code - 1 byte
  2. The argument type code (not needed for all statements) - 1 byte
  3. Arguments

Argument Type Code

As indicated above in the structure of the encoded statement, the second component, the argument type code, may be missing.

Whether it is present depends on the specific statement. If the statement takes only one argument and its type is uniquely defined as T_DIR, then the code with information about the argument types is not needed. For all other statements, this component is required.

You can check whether this code is needed for a specific statement in the "Argument Type Code" column of the statement table.
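The three-part structure gives a direct formula for the size of an encoded instruction. The sketch below assumes the argument sizes are already known; `NO_TYPE_CODE` lists the statements marked "No" in the statement table, and `instruction_size` is a hypothetical name.

```python
# Statements marked "No" in the "Argument Type Code" column
NO_TYPE_CODE = {"live", "zjmp", "fork", "lfork"}


def instruction_size(statement: str, arg_sizes: list) -> int:
    """Return the total size in bytes of one encoded instruction."""
    size = 1  # the statement code
    if statement not in NO_TYPE_CODE:
        size += 1  # the argument type code
    return size + sum(arg_sizes)  # the arguments themselves


# The four instructions of the example champion
sizes = [
    instruction_size("sti", [1, 2, 2]),
    instruction_size("live", [4]),
    instruction_size("ld", [4, 1]),
    instruction_size("zjmp", [2]),
]
print(sizes, sum(sizes))
```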

Let's look at how the instructions of our executable code are encoded:

loop:
        sti r1, %:live, %1
live:
        live %0
        ld %0, r2
        zjmp %:loop

Instruction #1

The first instruction to translate is:

loop:
        sti r1, %:live, %1

First, determine the size of each component.

| Statement Code | Argument Type Code | Argument #1 | Argument #2 | Argument #3 |
| :---: | :---: | :---: | :---: | :---: |
| 1 byte | 1 byte | 1 byte | 2 bytes | 2 bytes |

Now let's look at how we get the bytecode of each part of the instruction.

Statement code

The code for each statement is listed in the statement table. For sti, it is equal to 0x0b.

Argument type code

To generate this code, represent 1 byte in binary. The first two bits on the left hold the type code of argument #1. The next two hold the type code of the second argument, and so on. The last, fourth pair is always equal to 00.

Codes for each type are listed in the argument table.

| Argument #1 | Argument #2 | Argument #3 | - | Result Code |
| :---: | :---: | :---: | :---: | :---: |
| T_REG | T_DIR | T_DIR | - | - |
| 01 | 10 | 10 | 00 | 0x68 |
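The bit packing described above can be sketched with shifts. The type codes come from the argument table; the function name `argument_type_code` is hypothetical.

```python
# Two-bit codes from the argument table
TYPE_CODES = {"T_REG": 0b01, "T_DIR": 0b10, "T_IND": 0b11}


def argument_type_code(arg_types: list) -> int:
    """Pack up to three argument types into one byte, left to right."""
    if len(arg_types) > 3:
        raise ValueError("a statement takes at most three arguments")
    code = 0
    for position, arg_type in enumerate(arg_types):
        # Argument #1 goes to bits 7-6, #2 to bits 5-4, #3 to bits 3-2
        shift = 6 - 2 * position
        code |= TYPE_CODES[arg_type] << shift
    return code  # bits 1-0 are never set, so the last pair stays 00


print(hex(argument_type_code(["T_REG", "T_DIR", "T_DIR"])))
```

For the sti instruction above this yields 0b01101000, which is 0x68.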

Argument of type T_REG

In this case, the register number is translated into hexadecimal. For register r1 it is 0x01.

Argument-label of type T_DIR

As we already know, a label must be turned into a number that contains the relative address in bytes.

Since the label live points to the next instruction, and we already know the size in bytes of the current instruction, the required distance is easy to calculate - 7 bytes.

The resulting address must fit in 2 bytes - 0x0007.
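One way to compute such distances is to record the byte offset of every label first and then subtract the offset of the instruction that uses it. This is a sketch under stated assumptions: the program is given as a hypothetical list of (label, size) pairs, with sizes taken from the component tables on this page, and the function names are made up for illustration.

```python
def label_offsets(program: list) -> dict:
    """Map each label to the byte offset of the instruction it marks."""
    offsets = {}
    position = 0
    for label, size in program:
        if label is not None:
            offsets[label] = position
        position += size
    return offsets


def relative_address(label: str, instruction_start: int, offsets: dict) -> int:
    """Distance in bytes from the start of an instruction to a label."""
    return offsets[label] - instruction_start


# (label, instruction size in bytes) for the example champion
program = [("loop", 7), ("live", 5), (None, 7), (None, 3)]
offsets = label_offsets(program)
print(relative_address("live", 0, offsets))   # used by sti at offset 0
print(relative_address("loop", 19, offsets))  # used by zjmp at offset 19
```

A label that lies before the instruction using it produces a negative distance.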

Argument-number of type T_DIR

This case is simpler. You just need to write the decimal number as hexadecimal bytecode - 0x0001.

The final bytecode of the instruction is 0b 68 01 0007 0001.

Instruction #2

For the following instructions, everything is similar:

live:
        live %0

The only significant change is that the argument type code is not needed for this statement:

| Statement Code | Argument #1 |
| :---: | :---: |
| 1 byte | 4 bytes |

The final bytecode for the second instruction is 01 00000000.

Instruction #3

The third instruction:

        ld %0, r2

| Statement Code | Argument Type Code | Argument #1 | Argument #2 |
| :---: | :---: | :---: | :---: |
| 1 byte | 1 byte | 4 bytes | 1 byte |

There are no surprises here - 02 90 00000000 02.

Instruction #4

        zjmp %:loop

The statement code of zjmp is 0x09.

The argument type code is not needed for this statement.

| Statement Code | Argument #1 |
| :---: | :---: |
| 1 byte | 2 bytes |

The label loop points 19 bytes back. But how do we represent the number -19 in hexadecimal?

To do this, first write the number 19 in binary:

0000 0000 0001 0011

From this representation of 19, we can get the number -19 in two's complement.

Two's complement is the customary way to represent negative integers in computers.

To get the two's complement representation of -19, follow these steps:

  1. Invert all the bits

That is, change each one to zero, and vice versa:

1111 1111 1110 1100

  2. Add one to the number

1111 1111 1110 1101

Done. We now have the number -19 in two's complement.

Let's translate it from binary to hexadecimal - 0xffed.
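The same two steps can be reproduced in a few lines. This is a minimal sketch for a 2-byte argument; the function name `to_twos_complement` is hypothetical, and the big-endian byte order of the result is an assumption consistent with the other arguments on this page.

```python
def to_twos_complement(value: int, size: int) -> bytes:
    """Encode a signed integer on `size` bytes using two's complement."""
    bits = 8 * size
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError(f"{value} does not fit in {size} bytes")
    # Masking with 2**bits - 1 maps a negative value to its two's complement
    return (value & ((1 << bits) - 1)).to_bytes(size, "big")


# The manual steps for -19 on 16 bits
inverted = 19 ^ 0xffff             # step 1: invert all the bits
plus_one = (inverted + 1) & 0xffff  # step 2: add one
print(f"{inverted:016b}", f"{plus_one:016b}", hex(plus_one))
print(to_twos_complement(-19, 2).hex())
```

Python's built-in `(-19).to_bytes(2, "big", signed=True)` gives the same two bytes.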

The final bytecode for instruction #4 is 09 ffed.

Summary

The entire executable code of the champion in question will look like this:

0b68 0100 0700 0101 0000 0000 0290 0000
0000 0209 ffed
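To cross-check this dump, the four instructions can be packed by hand. The sketch below hard-codes the statement codes, type codes and argument values derived in the sections above; the `dump` helper and its grouping into 2-byte words are hypothetical, chosen only to mimic the layout shown here.

```python
import struct

# ">" = big-endian without padding, B = 1 byte, h = signed 2 bytes,
# i = signed 4 bytes
code = b"".join([
    struct.pack(">BBBhh", 0x0b, 0x68, 0x01, 7, 1),  # sti r1, %:live, %1
    struct.pack(">Bi", 0x01, 0),                     # live %0
    struct.pack(">BBiB", 0x02, 0x90, 0, 0x02),       # ld %0, r2
    struct.pack(">Bh", 0x09, -19),                   # zjmp %:loop
])


def dump(data: bytes, words_per_line: int = 8) -> str:
    """Format bytes as groups of two bytes, eight groups per line."""
    hexed = data.hex()
    words = [hexed[i:i + 4] for i in range(0, len(hexed), 4)]
    lines = [" ".join(words[i:i + words_per_line])
             for i in range(0, len(words), words_per_line)]
    return "\n".join(lines)


print(len(code))
print(dump(code))
```

The total of 22 bytes is the value that goes into the "Champion exec code size" field of the header.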