
Assembler

gmolin edited this page Apr 17, 2020 · 4 revisions

Assembler syntax

The assembly language obeys the following rule: "One line - one statement."

A statement is the smallest autonomous part of a programming language: command or set of commands. A program is usually a sequence of instructions.

Source: Statement (computer science) - Wikipedia

Empty lines, comments, extra tabs or spaces are all ignored.

A comment

The constant COMMENT_CHAR in the header op.h determines which character marks the beginning of a comment.

In the file provided, it is the hash character, #.

This means that everything between the # character and the end of the line is treated as a comment.

A comment can be located anywhere in the file.

Example # 1:

# this is a comment
# and this is a comment too

Example # 2:

ld %0, r2 # this is a comment

Alternative comment

In the provided archive vm_champs.tar along the path champs/examples you can find the file bee_gees.s with the champion code, which the original program asm translates into byte code without errors.

There are two kinds of comments in this champion:

  • standard, which was discussed above;
  • alternative, which is not described in the subject.

This alternative form differs from the standard one only in the character that begins the comment: ; is used instead of the hash #.

An example of using a comment of this type:

sti r1, %:live, %1; this is a comment

How to deal with it?

This type of comment is not described in the subject, but it is supported by the original translator. So we are most likely not required to handle it.

However, we can still add this to our project. To do this, the following line could be added to the op.h header file:

# define ALT_COMMENT_CHAR ';'

This is the second and last change we will make to the file op.h (the first being the Norm formatting).
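As a rough illustration of handling both comment characters, here is a minimal sketch. The helper name strip_comment is hypothetical, and the function deliberately ignores quoted strings, so a # or ; inside a .name or .comment string would need extra handling in a real translator.

```c
#include <stddef.h>

#define COMMENT_CHAR '#'
#define ALT_COMMENT_CHAR ';'

/*
** Hypothetical helper: cuts the line at the first comment character
** by overwriting it with '\0'. Quoted strings are NOT treated specially.
*/
void	strip_comment(char *line)
{
	size_t	i;

	i = 0;
	while (line[i] != '\0')
	{
		if (line[i] == COMMENT_CHAR || line[i] == ALT_COMMENT_CHAR)
		{
			line[i] = '\0';
			return ;
		}
		i++;
	}
}
```

Whitespace left before the comment character stays in the line; since extra spaces are ignored anyway, a later parsing step can skip it.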

Champion Name

The champion's name must be defined in the file with the champion's code. For this, the assembler has a statement whose name is defined in the constant NAME_CMD_STRING. In the provided op.h file, it is .name.

The name of our champion should follow the .name statement:

.name "Batman"

The string length must not exceed the number specified in the constant PROG_NAME_LENGTH. In the provided file, it is equal to 128.

By the way, an empty string can also be used as a champion name:

.name ""

But the complete absence of a string is an error:

.name

Champion Comment

The file with the extension .s must also define the champion comment.

The statement to do this is contained in the constant COMMENT_CMD_STRING of the file op.h. In the file provided, this is .comment.

The length of the comment string is limited by the constant COMMENT_LENGTH. In the provided op.h file, its value is 2048.

Overall, the .comment statement is very similar to .name and behaves the same way when the string is empty or completely absent.

Other statements

Some of the .s files provided to us as examples contain a statement such as .extend.

This statement, like any statement other than .name and .comment, is not described in the subject and is treated as an error by the original translator.

We will handle this and other such statements in the same manner.

Executable code

The champion executable code consists of instructions.

For the assembler language, the “One line - one statement” rule applies. And the symbol for ending an instruction in this language is a newline. That is, instead of the ; character familiar in the C language, the \n character appears here.

Because of this rule, a newline must follow even the last instruction. Otherwise, asm will display an error message.

Each instruction consists of several components:

label

The label consists of characters that are defined in the constant LABEL_CHARS. In the sample file, this is abcdefghijklmnopqrstuvwxyz_0123456789.

That means the label cannot contain characters that are not specified in LABEL_CHARS.

And the label itself must be followed by the symbol defined in the constant LABEL_CHAR. In the sample file, this is the : character.
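The two rules above can be sketched as a small check. This is only an illustration: the function name is_label_definition is hypothetical, and requiring a non-empty name before the LABEL_CHAR is an assumption that this page does not state.

```c
#include <string.h>

#define LABEL_CHARS "abcdefghijklmnopqrstuvwxyz_0123456789"
#define LABEL_CHAR ':'

/*
** Hypothetical helper: returns 1 if token looks like "name:", where every
** character of name is in LABEL_CHARS and name is not empty (assumption).
*/
int	is_label_definition(const char *token)
{
	size_t	len;
	size_t	i;

	len = strlen(token);
	if (len < 2 || token[len - 1] != LABEL_CHAR)
		return (0);
	i = 0;
	while (i < len - 1)
	{
		if (strchr(LABEL_CHARS, token[i]) == NULL)
			return (0);
		i++;
	}
	return (1);
}
```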

Why do we need labels?

A label points to the statement that immediately follows it: to a single statement, not to a block of them:

.name "Batman"
.comment "This city needs me"

loop:
        sti r1, %:live, %1 # <- The loop label indicates this statement
live:
        live %0 # <- The live label indicates this statement.
        ld %0, r2 # <- And no label indicates this statement
        zjmp %:loop

The task of the labels is to simplify the process of writing code.

To fully appreciate their role, let's imagine a world without labels.

As we know, translation turns the champion code written in assembly language into a sequence of bytes (usually shown in hexadecimal notation). It is this bytecode that the virtual machine executes.

Suppose we need to organise a loop in which the live statement is performed over and over again. For this we have the zjmp statement, which can move us N bytes forward or backward.

In this case, we need to return to the live statement after each iteration of the loop. How many bytes back is that? To find out, we need to count how many bytes the statement code and its argument occupy in the bytecode.

As we will learn later, the code of the live statement takes 1 byte, and its only argument takes 4 bytes.

So we need to go back 5 bytes:

live %1
zjmp %-5

Not so difficult, but such calculations still take time. It would be much simpler to write "go to the live statement". Labels exist to do exactly that.

We simply create a label for the live statement we need and pass it to the zjmp statement:

loop: live %1
         zjmp %:loop

From the assembler's point of view, both examples are identical. While it runs, the asm program calculates how many bytes back the loop label lies and replaces the label with the number -5.

So the final result does not depend on which form was used. Writing code with labels is simply much more convenient.
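The replacement of a label by a number can be sketched as follows. The struct label layout and the function label_offset are hypothetical names used only for illustration; the sketch assumes the translator has already recorded the byte position of every label and of the statement that refers to it.

```c
#include <string.h>

/* Hypothetical record: a label name and its byte position in the code. */
struct label
{
	const char	*name;
	int			pos;
};

/*
** Looks up name and stores (label position - statement position) in *offset.
** Returns 1 if the label exists, 0 otherwise.
*/
int	label_offset(const struct label *labels, int count, const char *name,
		int statement_pos, int *offset)
{
	int	i;

	i = 0;
	while (i < count)
	{
		if (strcmp(labels[i].name, name) == 0)
		{
			*offset = labels[i].pos - statement_pos;
			return (1);
		}
		i++;
	}
	return (0);
}
```

With the loop label at byte 0 and the zjmp statement at byte 5 (after the 5 bytes of live), the computed offset is 0 - 5 = -5, the same number that was written by hand in the example without labels.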

Formatting

There are several approaches to writing a label:

Example # 1:

label:
        live %0

Example # 2:

label:


        live %0

Example # 3:

label: live %0

All the above examples mean the same thing for the translator program.

Therefore, you can choose any of the options presented.

Many labels for one statement

This is also possible:

example1:

example2:
        live %0

This means that both the example1 and example2 labels point to the same statement.

No statement

A situation may arise when a label has no statement to point to:

label:
# End of file

In this case, the label points to the place immediately after the champion’s executable code.

The main thing is that the line on which the label is located ends with \n. Otherwise, the translator will report an error.

Statements and their arguments

The assembly language has a fixed set of 16 statements, each of which takes from one to three arguments.

Information about the name of the statement, its code, and the arguments that it takes is given in the file op.c provided by the task.

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 1 | live | T_DIR | - | - |
| 2 | ld | T_DIR / T_IND | T_REG | - |
| 3 | st | T_REG | T_REG / T_IND | - |
| 4 | add | T_REG | T_REG | T_REG |
| 5 | sub | T_REG | T_REG | T_REG |
| 6 | and | T_REG / T_DIR / T_IND | T_REG / T_DIR / T_IND | T_REG |
| 7 | or | T_REG / T_DIR / T_IND | T_REG / T_DIR / T_IND | T_REG |
| 8 | xor | T_REG / T_DIR / T_IND | T_REG / T_DIR / T_IND | T_REG |
| 9 | zjmp | T_DIR | - | - |
| 10 | ldi | T_REG / T_DIR / T_IND | T_REG / T_DIR | T_REG |
| 11 | sti | T_REG | T_REG / T_DIR / T_IND | T_REG / T_DIR |
| 12 | fork | T_DIR | - | - |
| 13 | lld | T_DIR / T_IND | T_REG | - |
| 14 | lldi | T_REG / T_DIR / T_IND | T_REG / T_DIR | T_REG |
| 15 | lfork | T_DIR | - | - |
| 16 | aff | T_REG | - | - |

Understanding the role of each statement and how it interprets arguments of different types is the most important task for understanding the basics of the Corewar project.


Arguments

Each argument corresponds to one of three types:

1. Registry - T_REG

A registry is a variable that can store any data. Its size in octets is given by the constant REG_SIZE, which the sample op.h file initializes with the value 4.

An octet in computer science is eight binary digits. Usually referred to as "byte".

Source: Octet (computing) - Wikipedia

The number of registries is limited by the constant REG_NUMBER, which the sample op.h file initializes with the value 16.

That is, the registries available to us are r1, r2, r3 ... r16.

Registry values

When the virtual machine starts, all registries except r1 are initialized to zero.

The number of the champion player will be written into r1 with a minus sign.

This number is unique within the game and it is needed for the statement live to inform that a particular player is alive.

For example, the carriage placed at the beginning of the code of player number 2 receives the value -2 in r1.

If the statement live is performed with the argument -2, the virtual machine will consider that this player is alive:

live %-2

2. Direct - T_DIR

A direct argument consists of two parts: the symbol specified in the constant DIRECT_CHAR (%), followed by a number or a label that represents the direct value.

If it is a label, the symbol from the constant LABEL_CHAR (:) must also precede its name:

sti r1, %:marker, %1

What are the direct and indirect values?

To understand the difference between direct and indirect value, it is worth seeing one very simple example:

Imagine that we have the number 5. In its direct meaning, it represents itself. That is, the number 5 is the number 5.

But in the indirect sense, it is not a number but a relative address that points 5 bytes ahead.

Label in direct and indirect meaning

If numbers in the direct and indirect sense are clear, what about labels? What is the difference?

It is quite simple. As we know, the statements are performed by the virtual machine, and what matters to it is which type of argument it received: direct or indirect.

But remember that labels never reach the virtual machine. During translation into bytecode, they are all replaced by their numerical equivalents.

Labels are therefore numbers too, just written in another form.

The process of replacing labels with numbers is described in the chapter “Why do we need labels?”.

3. Indirect - T_IND

An argument of this type can be either a number or a label that represents an indirect value.

If a number is used as an argument of type T_IND, then no additional characters are needed:

ld    5, r7

If a label is the argument, then the symbol from the constant LABEL_CHAR (:) must precede its name:

ld    :label, r7
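The three textual forms described above can be told apart by their first character. The following is a minimal sketch under several assumptions: the enum arg_kind and the functions is_number and classify_arg are hypothetical names, label names are not checked against LABEL_CHARS here, and registry numbers are accepted only as one or two digits.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define DIRECT_CHAR '%'
#define LABEL_CHAR ':'
#define REG_NUMBER 16

/* Hypothetical result type, not the T_REG / T_DIR / T_IND values of op.h. */
enum arg_kind { ARG_INVALID, ARG_REG, ARG_DIR, ARG_IND };

/* Returns 1 for an optional '-' followed by at least one digit. */
static int	is_number(const char *s)
{
	if (*s == '-')
		s++;
	if (*s == '\0')
		return (0);
	while (*s != '\0')
	{
		if (!isdigit((unsigned char)*s))
			return (0);
		s++;
	}
	return (1);
}

enum arg_kind	classify_arg(const char *arg)
{
	int	n;

	if (arg[0] == 'r')
	{
		if (arg[1] == '-' || strlen(arg + 1) > 2 || !is_number(arg + 1))
			return (ARG_INVALID);
		n = atoi(arg + 1);
		return ((n >= 1 && n <= REG_NUMBER) ? ARG_REG : ARG_INVALID);
	}
	if (arg[0] == DIRECT_CHAR)
	{
		if (arg[1] == LABEL_CHAR)
			return (arg[2] != '\0' ? ARG_DIR : ARG_INVALID);
		return (is_number(arg + 1) ? ARG_DIR : ARG_INVALID);
	}
	if (arg[0] == LABEL_CHAR)
		return (arg[1] != '\0' ? ARG_IND : ARG_INVALID);
	return (is_number(arg) ? ARG_IND : ARG_INVALID);
}
```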

Separator

To separate one argument from another within a statement, the assembler uses a special delimiter character. It is defined by the preprocessor constant SEPARATOR_CHAR, and in the example op.h file it is the comma (,):

ld    21, r7

Statements

Statement live

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 1 | live | T_DIR | - | - |

Description

The live statement has two functions:

  1. It records that the carriage performing the live statement is alive.

  2. If the number given as the argument of live matches a player's number with a minus sign, the virtual machine records that this player is alive. For example, if the argument value is -2, then the player with the number 2 is alive.
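The two functions of live can be sketched like this. All names here (struct vm, struct carriage, op_live, and their fields) are hypothetical, and the sketch assumes players are numbered from 1 to player_count.

```c
/* Hypothetical minimal state, for illustration only. */
struct vm
{
	long	cycle;
	int		player_count;
	int		last_alive_player;
};

struct carriage
{
	long	last_live_cycle;
};

void	op_live(struct vm *vm, struct carriage *carriage, int arg)
{
	/* 1. The carriage that performs live is alive. */
	carriage->last_live_cycle = vm->cycle;
	/* 2. An argument of -N reports player N as alive, if N exists. */
	if (arg < 0 && arg >= -vm->player_count)
		vm->last_alive_player = -arg;
}
```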

What is a carriage?

A detailed explanation of this term is given in the Virtual Machine section, since the carriage belongs specifically to that part of the Corewar project.

However, since basic knowledge of this term is a prerequisite for understanding the work of statements, we will briefly consider what it is.

A carriage is a process that performs the statement on which it stands.

Suppose we run a virtual machine with three champion players who have to fight for the victory.

The executable codes of the champions will be placed in the memory of the virtual machine. A carriage will be placed at the beginning of each of the memory sections.

3 champions. 3 sections of memory with executable codes placed on them. 3 carriages.

Each carriage contains several important elements:

  • PC (Program Counter)

A variable that contains the carriage position.

  • registries

The registries, the number of which is determined by the constant REG_NUMBER.

  • carry flag

A special variable that affects the behavior of the zjmp statement and can take one of two values: 1 or 0 (true or false).

  • The number of the cycle in which the live statement was last performed by this carriage

This information is needed to determine if the carriage is alive.

In fact, the carriage contains a lot more elements, but they will be considered later.
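The elements listed above could be grouped in a structure such as the one below. This is only a sketch: the struct layout, the field names, and the function carriage_init are hypothetical, and the initial carry value of 0 is an assumption that this page does not state.

```c
#include <stdint.h>

#define REG_NUMBER 16

/* Hypothetical layout holding only the elements listed above. */
struct carriage
{
	int		pc;
	int32_t	reg[REG_NUMBER];
	int		carry;
	long	last_live_cycle;
};

/* Creates the carriage placed at the start of a player's code. */
struct carriage	carriage_init(int player_number, int start_pc)
{
	struct carriage	c = {0};

	c.pc = start_pc;
	/* reg[0] stands for r1: the player number with a minus sign. */
	c.reg[0] = -player_number;
	return (c);
}
```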

Statement ld

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 2 | ld | T_DIR / T_IND | T_REG | - |

Description

The task of this statement is to load the value into the registry. But its behavior differs depending on the type of the first argument:

  • Argument #1 - T_DIR

If the type of the first argument is T_DIR, then the number passed as the argument is taken "as is".

The tasks of the statement:

  1. Write the number to the registry passed as the second argument.

  2. If the number written to the registry is 0, set carry to 1. If a non-zero value was written, set carry to 0.

  • Argument #1 - T_IND

If the type of the first argument is T_IND, then the number represents an address.

An argument of this type is first truncated by modulo: <FIRST_ARGUMENT> % IDX_MOD.

What is IDX_MOD?

IDX_MOD is another constant from the file op.h. Its value is defined by the expression (MEM_SIZE / 8), where MEM_SIZE is the size in bytes of the virtual machine memory in which the champions fight.

So why do we need the IDX_MOD constant? It keeps the carriage from jumping too far in memory. In the sample file, the constant MEM_SIZE is initialized with the value (4 * 1024), so IDX_MOD is 512.

This means that the carriage always moves fewer than 512 bytes in one jump.

After an argument of type T_IND has been truncated by modulo, the result is used as a relative address: the number of bytes forward or backward from the current position of the carriage to the position we need.

The tasks of the ld statement:

  1. Compute the address: current position + <FIRST_ARGUMENT> % IDX_MOD.

  2. Read 4 bytes starting at that address.

  3. Write the number read to the registry passed as the second argument.

  4. If the number written to the registry is 0, set carry to 1. If a non-zero value was written, set carry to 0.
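The four steps can be sketched as follows. Several things here are assumptions rather than facts from this page: the function names (wrap, read_int32, ld_indirect) are hypothetical, memory is treated as circular, and the 4 bytes are combined in big-endian order. The C operator % keeps the sign of its left operand, so a negative argument gives a negative offset, that is, a move backward.

```c
#include <stdint.h>

#define MEM_SIZE (4 * 1024)
#define IDX_MOD (MEM_SIZE / 8)
#define REG_SIZE 4

/* Assumption: memory is circular, so addresses wrap into [0, MEM_SIZE). */
static int	wrap(int addr)
{
	addr %= MEM_SIZE;
	return (addr < 0 ? addr + MEM_SIZE : addr);
}

/* Assumption: the REG_SIZE bytes are stored most significant byte first. */
static int32_t	read_int32(const uint8_t *mem, int addr)
{
	uint32_t	value;
	int			i;

	value = 0;
	i = 0;
	while (i < REG_SIZE)
	{
		value = (value << 8) | mem[wrap(addr + i)];
		i++;
	}
	return ((int32_t)value);
}

/* ld with a first argument of type T_IND. */
void	ld_indirect(const uint8_t *mem, int pc, int arg,
		int32_t *reg, int *carry)
{
	*reg = read_int32(mem, pc + arg % IDX_MOD);
	*carry = (*reg == 0);
}
```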

Why are we reading exactly 4 bytes?

As we know, the size of each registry is 4 bytes. More precisely, it is the number of bytes defined in the file op.h by the constant REG_SIZE.

The size of the argument of type T_DIR is defined in the same file:

# define REG_SIZE 4
# define DIR_SIZE REG_SIZE

And it also makes 4 bytes.

We go to the address specified by the argument of type T_IND to read a value, and we read the number "as is", just as we would take a number of type T_DIR. Then we must write that number to the registry. For the write to succeed and the number to fit, the size of the number and the size of the registry must be compatible.

During the analysis of the following statements, it will become visible that not only can we read the value and write it to the registry, but also carry out the opposite action: take the value from the registry and write it to the address.

Therefore, the size of the number and registry must be compatible in both directions. In this case, the only possible solution is to make the number of bytes that we read (or write to) equal to the number of bytes that we can store in the registry.

In short, we read as many bytes as the registry can hold.

Statement st

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 3 | st | T_REG | T_REG / T_IND | - |

Description

This statement writes the value of the registry passed as the first argument. Where it is written depends on the type of the second argument:

  • Argument #2 - T_REG

If the second argument matches the type T_REG, then the value is written to the registry.

For example, in this case, the value from registry number 7 is written to registry with number 11:

st    r7, r11

  • Argument #2 - T_IND

As we recall, arguments of type T_IND are relative addresses. In this case, the st statement works as follows:

  1. Truncate the value of the second argument by the modulo IDX_MOD.

  2. Compute the address: current position + <SECOND_ARGUMENT> % IDX_MOD.

  3. Write the value of the registry passed as the first argument into memory at that address.
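A sketch of these steps is given below. As before, the function names are hypothetical, and the circular memory and big-endian byte order are assumptions that this page does not state.

```c
#include <stdint.h>

#define MEM_SIZE (4 * 1024)
#define IDX_MOD (MEM_SIZE / 8)
#define REG_SIZE 4

/* Assumption: memory is circular, so addresses wrap into [0, MEM_SIZE). */
static int	wrap(int addr)
{
	addr %= MEM_SIZE;
	return (addr < 0 ? addr + MEM_SIZE : addr);
}

/* Assumption: the value is stored most significant byte first. */
static void	write_int32(uint8_t *mem, int addr, int32_t value)
{
	uint32_t	bits;
	int			i;

	bits = (uint32_t)value;
	i = 0;
	while (i < REG_SIZE)
	{
		mem[wrap(addr + i)] = (uint8_t)(bits >> (8 * (REG_SIZE - 1 - i)));
		i++;
	}
}

/* st with a second argument of type T_IND. */
void	st_indirect(uint8_t *mem, int pc, int arg, int32_t reg_value)
{
	write_int32(mem, pc + arg % IDX_MOD, reg_value);
}
```

Under the circular-memory assumption, a carriage at position 0 that stores with the argument -3 writes its first three bytes at the very end of memory and the fourth byte at position 0.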

Statement add

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 4 | add | T_REG | T_REG | T_REG |

Description

Fortunately, all arguments of this statement are of the same type, so it is simple.

The tasks of add:

  1. Add the value of the registry passed as the first argument to the value of the registry passed as the second argument.

  2. Write the result to the registry passed as the third argument.

  3. If the sum written to the third argument is zero, set carry to 1. If the sum is not zero, set carry to 0.
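These three tasks fit in a few lines. The function name op_add and the convention of passing registry numbers 1 to 16 are hypothetical; the sum is computed in unsigned arithmetic so that it wraps around on overflow, which is an assumption about the intended behavior.

```c
#include <stdint.h>

#define REG_NUMBER 16

/*
** a, b and dst are registry numbers from 1 to REG_NUMBER (r1 is reg[0]).
** The sum wraps around on overflow (assumption).
*/
void	op_add(int32_t *reg, int a, int b, int dst, int *carry)
{
	uint32_t	sum;

	sum = (uint32_t)reg[a - 1] + (uint32_t)reg[b - 1];
	reg[dst - 1] = (int32_t)sum;
	*carry = (sum == 0);
}
```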

Statement sub

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 5 | sub | T_REG | T_REG | T_REG |

Description

The arguments of this statement are equally unambiguous.

Its tasks:

  1. From the value of the registry passed as the first argument, subtract the value of the registry passed as the second argument.

  2. Write the result to the registry passed as the third argument.

  3. If the result written is zero, set carry to 1. If the result is not zero, set carry to 0.

Statement and

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 6 | and | T_REG / T_DIR / T_IND | T_REG / T_DIR / T_IND | T_REG |

Description

and performs a "bitwise AND" on the values of the first two arguments and writes the result to the registry passed as the third argument.

If the result written is zero, carry is set to 1. If the result is not zero, carry is set to 0.

Since the first and second arguments can be one of three types, we will consider how to get the value of each of them:

  • Argument #1 / Argument #2 - T_REG

In this case, the value is taken from the registry passed as an argument.

  • Argument #1 / Argument #2 - T_DIR

In this case, the numerical value passed as an argument is used.

  • Argument #1 / Argument #2 - T_IND

If the argument type is T_IND, we first need to compute the address from which 4 bytes will be read.

The address is computed as follows: current position + <ARGUMENT> % IDX_MOD.

The 4-byte number read at this address is the required value.

Statement or

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 7 | or | T_REG / T_DIR / T_IND | T_REG / T_DIR / T_IND | T_REG |

Description

At its core, this statement is completely analogous to the and statement. Only in this case, "bitwise AND" is replaced by "bitwise OR".

Statement xor

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 8 | xor | T_REG / T_DIR / T_IND | T_REG / T_DIR / T_IND | T_REG |

Description

At its core, this statement is completely analogous to the and statement. Only in this case, "bitwise AND" is replaced by "bitwise exclusive OR".

How does the bitwise exclusive OR (XOR) work?

| A | B | A ^ B |
| --- | --- | --- |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

Statement zjmp

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 9 | zjmp | T_DIR | - | - |

Description

This statement is affected by the value of the carry flag.

If it is equal to 1, the statement sets PC to the address: current position + <FIRST_ARGUMENT> % IDX_MOD.

That is, zjmp determines where the carriage moves to perform the next statement. This lets us jump to the desired position in memory instead of performing everything in order.

If the carry value is zero, no movement is performed.
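The behavior of zjmp can be sketched as a function that returns the new position. The names op_zjmp and wrap are hypothetical, and the circular memory is an assumption. When carry is 0 the function returns the position unchanged, meaning only that no jump happens; stepping past the zjmp statement itself is outside this sketch.

```c
#define MEM_SIZE (4 * 1024)
#define IDX_MOD (MEM_SIZE / 8)

/* Assumption: memory is circular, so addresses wrap into [0, MEM_SIZE). */
static int	wrap(int addr)
{
	addr %= MEM_SIZE;
	return (addr < 0 ? addr + MEM_SIZE : addr);
}

/* Returns the jump target if carry is 1, otherwise the unchanged position. */
int	op_zjmp(int pc, int arg, int carry)
{
	if (!carry)
		return (pc);
	return (wrap(pc + arg % IDX_MOD));
}
```

For the loop from the labels chapter, a zjmp located at byte 5 with the argument -5 and carry equal to 1 returns position 0, where the live statement starts.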

Statement ldi

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 10 | ldi | T_REG / T_DIR / T_IND | T_REG / T_DIR | T_REG |

Description

This statement writes a value to the registry passed as the third argument. The value is 4 bytes long, read from the address formed as follows: current position + (<FIRST_ARGUMENT_VALUE> + <SECOND_ARGUMENT_VALUE>) % IDX_MOD.

Since the first and second arguments can be of different types, let us consider how to get the value for each type:

  • Argument # 1 / Argument # 2 - T_REG

The value is contained in the registry passed as the argument.

  • Argument # 1 / Argument # 2 - T_DIR

In this case, the argument itself contains the value.

  • Argument # 1 - T_IND

To get the value of this argument, read 4 bytes at the address: current position + <FIRST_ARGUMENT> % IDX_MOD.
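Once both argument values are known, the address that ldi reads from can be computed as in this sketch. The function name ldi_address is hypothetical, the circular memory is an assumption, and the sum is widened to 64 bits only to avoid overflow in the illustration.

```c
#include <stdint.h>

#define MEM_SIZE (4 * 1024)
#define IDX_MOD (MEM_SIZE / 8)

/* Assumption: memory is circular, so addresses wrap into [0, MEM_SIZE). */
static int	wrap(int addr)
{
	addr %= MEM_SIZE;
	return (addr < 0 ? addr + MEM_SIZE : addr);
}

/* current position + (first value + second value) % IDX_MOD */
int	ldi_address(int pc, int32_t first, int32_t second)
{
	int64_t	sum;

	sum = (int64_t)first + (int64_t)second;
	return (wrap(pc + (int)(sum % IDX_MOD)));
}
```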

Statement sti

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 11 | sti | T_REG | T_REG / T_DIR / T_IND | T_REG / T_DIR |

Description

This statement writes the value of the registry passed as the first argument to the address: current position + (<SECOND_ARGUMENT_VALUE> + <THIRD_ARGUMENT_VALUE>) % IDX_MOD.

How to get the value for each type of argument is described above.

Statement fork

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 12 | fork | T_DIR | - | - |

Description

The fork statement makes a copy of the carriage and places this copy at the address: current position + <FIRST_ARGUMENT> % IDX_MOD.

What data is copied?

  • The values of all registries

  • The value of carry

  • The number of the cycle in which the live statement was last performed

  • And something else, but more on that later.
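Copying a carriage can be sketched with a plain structure assignment. The struct carriage layout and the function op_fork are hypothetical; the sketch also assumes that the target address is taken relative to the current position of the parent carriage and that memory is circular.

```c
#include <stdint.h>

#define MEM_SIZE (4 * 1024)
#define IDX_MOD (MEM_SIZE / 8)
#define REG_NUMBER 16

/* Hypothetical layout holding only the copied elements and the position. */
struct carriage
{
	int		pc;
	int32_t	reg[REG_NUMBER];
	int		carry;
	long	last_live_cycle;
};

/* Assumption: memory is circular, so addresses wrap into [0, MEM_SIZE). */
static int	wrap(int addr)
{
	addr %= MEM_SIZE;
	return (addr < 0 ? addr + MEM_SIZE : addr);
}

/* Returns a copy of the parent placed at its new position. */
struct carriage	op_fork(const struct carriage *parent, int arg)
{
	struct carriage	child;

	child = *parent;
	child.pc = wrap(parent->pc + arg % IDX_MOD);
	return (child);
}
```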

Statement lld

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 13 | lld | T_DIR / T_IND | T_REG | - |

Description

This statement behaves in much the same way as the ld statement. That is, it writes the value obtained from the first argument to the registry passed as the second argument.

The only difference between the two statements is whether truncation by modulo is applied.

If the first argument is of type T_IND, this statement reads the 4 bytes of the value at the address current position + <FIRST_ARGUMENT>, without truncating by modulo.
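The difference between the two address computations can be seen side by side in this sketch. The function names ld_address and lld_address are hypothetical, and the circular memory is an assumption.

```c
#define MEM_SIZE (4 * 1024)
#define IDX_MOD (MEM_SIZE / 8)

/* Assumption: memory is circular, so addresses wrap into [0, MEM_SIZE). */
static int	wrap(int addr)
{
	addr %= MEM_SIZE;
	return (addr < 0 ? addr + MEM_SIZE : addr);
}

/* ld: the argument is truncated by modulo IDX_MOD. */
int	ld_address(int pc, int arg)
{
	return (wrap(pc + arg % IDX_MOD));
}

/* lld: the argument is used without truncation. */
int	lld_address(int pc, int arg)
{
	return (wrap(pc + arg));
}
```

With a carriage at position 0 and the argument 600, ld reads at address 88 (600 % 512), while lld reads at address 600.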

Problems of the original virtual machine

Unfortunately, the original corewar virtual machine does not work correctly here: it reads 2 bytes, not 4. Perhaps this bug has the same explanation as the problems in the provided files:

... we might have mistaken a bottle of water for a bottle of vodka.

For an argument of type T_DIR there are no changes.

Statement lldi

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 14 | lldi | T_REG / T_DIR / T_IND | T_REG / T_DIR | T_REG |

Description

At its core, this statement is similar to the ldi statement.

It writes a value to the registry passed as the third argument. The value is the 4 bytes that the statement has read.

The bytes are read at the address formed as follows: current position + (<FIRST_ARGUMENT_VALUE> + <SECOND_ARGUMENT_VALUE>).

Unlike the ldi statement, truncation by modulo IDX_MOD is not applied when forming this address.

For arguments of type T_IND everything remains as before

If the first argument is of type T_IND (the second argument cannot be of this type), we still read the 4 bytes of its value at the address current position + <ARGUMENT> % IDX_MOD. Truncation by modulo is preserved.

Statement lfork

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 15 | lfork | T_DIR | - | - |

Description

At its core, this statement is similar to the fork statement.

The difference is that the new carriage is created at the address: current position + <FIRST_ARGUMENT>. In the lfork statement, truncation by modulo is not applied.

Statement aff

| Statement Code | Statement Name | Argument # 1 | Argument # 2 | Argument # 3 |
| --- | --- | --- | --- | --- |
| 16 | aff | T_REG | - | - |

Description

This statement takes the value of the registry passed as its only argument, converts it to type char, and displays it as an ASCII character.

In the subject, the aff statement is described as:

aff: The opcode is 10 in the hexadecimal. There is an argument’s coding byte, even if it’s a bit silly because there is only 1 argument that is a registry, which is a registry, and its content is interpreted by the character’s ASCII value to display on the standard output. The code is modulo 256.

But in fact, there is no need to perform additional truncation by modulo 256.

After all, the result of these two calculations will be identical:

(char) (value % 256)
(char) (value)
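The equality of the two expressions can be checked with a short sketch. The function names are hypothetical, and unsigned char is used instead of char because conversion to an unsigned type is fully defined by the C standard as reduction modulo 256, which is what makes the explicit % 256 redundant.

```c
#include <stdint.h>

/* Conversion after an explicit truncation by modulo 256. */
unsigned char	aff_with_modulo(int32_t value)
{
	return ((unsigned char)(value % 256));
}

/* Conversion alone: the cast already reduces the value modulo 256. */
unsigned char	aff_plain(int32_t value)
{
	return ((unsigned char)value);
}
```

For example, both functions turn the value 321 into the character 'A', since 321 - 256 = 65.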

Display mode of aff output in the original corewar

In the original corewar virtual machine, the display mode of the output of the aff statement is off by default. To enable it, use the -a flag.

Updated statement table

| Code | Name | Argument # 1 | Argument # 2 | Argument # 3 | Modifies carry | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | live | T_DIR | - | - | No | alive |
| 2 | ld | T_DIR / T_IND | T_REG | - | Yes | load |
| 3 | st | T_REG | T_REG / T_IND | - | No | store |
| 4 | add | T_REG | T_REG | T_REG | Yes | addition |
| 5 | sub | T_REG | T_REG | T_REG | Yes | subtraction |
| 6 | and | T_REG / T_DIR / T_IND | T_REG / T_DIR / T_IND | T_REG | Yes | bitwise AND (&) |
| 7 | or | T_REG / T_DIR / T_IND | T_REG / T_DIR / T_IND | T_REG | Yes | bitwise OR (\|) |
| 8 | xor | T_REG / T_DIR / T_IND | T_REG / T_DIR / T_IND | T_REG | Yes | bitwise XOR (^) |
| 9 | zjmp | T_DIR | - | - | No | jump if non-zero |
| 10 | ldi | T_REG / T_DIR / T_IND | T_REG / T_DIR | T_REG | No | load index |
| 11 | sti | T_REG | T_REG / T_DIR / T_IND | T_REG / T_DIR | No | store index |
| 12 | fork | T_DIR | - | - | No | fork |
| 13 | lld | T_DIR / T_IND | T_REG | - | Yes | long load |
| 14 | lldi | T_REG / T_DIR / T_IND | T_REG / T_DIR | T_REG | Yes | long load index |
| 15 | lfork | T_DIR | - | - | No | long fork |
| 16 | aff | T_REG | - | - | No | aff |

Cycles before execution

But that's not all you need to know about statements.

There is another important parameter: cycles before execution.

This is the number of cycles that the carriage must wait before it performs a statement.

For example, a carriage standing on the fork statement waits 800 cycles before performing it, while for the ld statement the wait is only 5 cycles.

This parameter was introduced to create game mechanics in which the most effective and useful statements have the highest cost.

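| Code | Name | Cycles before execution |
| --- | --- | --- |
| 1 | live | 10 |
| 2 | ld | 5 |
| 3 | st | 5 |
| 4 | add | 10 |
| 5 | sub | 10 |
| 6 | and | 6 |
| 7 | or | 6 |
| 8 | xor | 6 |
| 9 | zjmp | 20 |
| 10 | ldi | 25 |
| 11 | sti | 25 |
| 12 | fork | 800 |
| 13 | lld | 10 |
| 14 | lldi | 50 |
| 15 | lfork | 1000 |
| 16 | aff | 2 |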
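In code, these waits can be kept in a lookup table indexed by the statement code. The array g_cycles and the function cycles_before_execution are hypothetical names, and returning -1 for an unknown code is a choice made only for this sketch.

```c
/* Index 0 is unused so that the statement code can be the index. */
static const int	g_cycles[17] = {
	0, 10, 5, 5, 10, 10, 6, 6, 6, 20, 25, 25, 800, 10, 50, 1000, 2
};

/* Returns the wait in cycles, or -1 for a code outside 1..16. */
int	cycles_before_execution(int code)
{
	if (code < 1 || code > 16)
		return (-1);
	return (g_cycles[code]);
}
```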

The complete statement table can be viewed on Google Sheets.
