From assembler to bytecode
To understand the rules for translating assembly code into bytecode, you need to look at the structure of a file with the .cor extension.
To do this, we will translate the following champion using the asm program provided with the task:
.name "Batman"
.comment "This city needs me"
loop:
sti r1, %:live, %1
live:
live %0
ld %0, r2
zjmp %:loop
The resulting file will have the following structure:
The first 4 bytes in the file are the “magic number”.
It is defined using the constant COREWAR_EXEC_MAGIC, with the value in the example file op.h being equal to 0xea83f3.
The magic header is a special number whose task is to signal that the given file is binary.
If there is no such "message" in the file, it can be interpreted as a text file.
The next 128 bytes in the file hold the champion's name. Why 128? This is the value of the constant PROG_NAME_LENGTH, which determines the maximum length of the name string.
The name of our champion is much shorter. But in the bytecode, it still takes up all 128 bytes.
According to the translation rules, if the name is shorter than the established limit, the missing characters are filled in with zero bytes.
In general, each character of the name is converted into an ASCII code of 1 byte size, written in hexadecimal notation:
| Symbol | B | a | t | m | a | n |
|---|---|---|---|---|---|---|
| **ASCII code** | 0x42 | 0x61 | 0x74 | 0x6d | 0x61 | 0x6e |
Instead of the missing characters, we write zero bytes.
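The padding rule can be sketched in a few lines of Python. This is only an illustration: the helper name `encode_name` is hypothetical, and the limit is the PROG_NAME_LENGTH value quoted above.

```python
PROG_NAME_LENGTH = 128  # value of the constant in the provided op.h


def encode_name(name: str) -> bytes:
    """Encode a champion name as ASCII and pad it with zero bytes."""
    raw = name.encode("ascii")
    if len(raw) > PROG_NAME_LENGTH:
        raise ValueError("champion name is too long")
    # ljust appends zero bytes until the field is exactly 128 bytes long.
    return raw.ljust(PROG_NAME_LENGTH, b"\x00")


field = encode_name("Batman")
print(field[:6].hex(" "))  # 42 61 74 6d 61 6e
print(len(field))          # 128
```

The first six bytes match the ASCII table above, and the remaining 122 bytes are zero.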
The next 4 bytes in the file are reserved for a control point - four zero octets.
They carry no information. Their only task is to be in the right place.
The 4 bytes after that contain important information - the size of the champion's executable code in bytes.
As we recall, the virtual machine must make sure that the size of the executable code does not exceed the limit set by the constant CHAMP_MAX_SIZE. In the provided op.h file, it is equal to 682.
The next 2048 bytes are occupied by the champion's comment. This part is completely analogous to the champion name part.
The only difference is that the maximum length is set by the constant COMMENT_LENGTH.
And again 4 zero octets.
The last part of the file is the champion's executable code.
Unlike a name or comment, it is not padded with null bytes.
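Putting the parts together, the layout of a .cor file can be sketched as below. The function name `build_cor` is hypothetical, and the big-endian byte order for the magic number and the code size is an assumption inferred from the bytecode examples in this section (for instance, 7 is stored as 00 07), not something op.h states.

```python
import struct

# Values of the constants from the provided op.h
COREWAR_EXEC_MAGIC = 0xEA83F3
PROG_NAME_LENGTH = 128
COMMENT_LENGTH = 2048
CHAMP_MAX_SIZE = 682


def build_cor(name: str, comment: str, code: bytes) -> bytes:
    """Lay out magic, name, NULL, code size, comment, NULL and code."""
    raw_name = name.encode("ascii")
    raw_comment = comment.encode("ascii")
    if len(raw_name) > PROG_NAME_LENGTH or len(raw_comment) > COMMENT_LENGTH:
        raise ValueError("name or comment is too long")
    if len(code) > CHAMP_MAX_SIZE:
        raise ValueError("executable code is too large")
    return b"".join([
        struct.pack(">I", COREWAR_EXEC_MAGIC),     # 4 bytes, big-endian (assumed)
        raw_name.ljust(PROG_NAME_LENGTH, b"\x00"),  # 128 bytes, zero-padded
        bytes(4),                                   # four zero octets
        struct.pack(">I", len(code)),               # 4 bytes, big-endian (assumed)
        raw_comment.ljust(COMMENT_LENGTH, b"\x00"), # 2048 bytes, zero-padded
        bytes(4),                                   # four zero octets
        code,                                       # executable code, not padded
    ])


data = build_cor("Batman", "This city needs me", bytes(22))
print(len(data))           # 2214 = 2192 header bytes + 22 code bytes
print(data[:4].hex(" "))   # 00 ea 83 f3
```

With these constants the header is always 4 + 128 + 4 + 4 + 2048 + 4 = 2192 bytes, so the executable code starts at offset 2192.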
In order to understand how statement coding works, we need two tables.
First of all, we need an updated table of statements with the columns "Argument Type Code" and "Size of T_DIR".
| Code | Name | Argument # 1 | Argument # 2 | Argument # 3 | Argument Type Code | Size of T_DIR |
|---|---|---|---|---|---|---|
| 0x01 | live | T_DIR | - | - | No | 4 |
| 0x02 | ld | T_DIR / T_IND | T_REG | - | Yes | 4 |
| 0x03 | st | T_REG | T_REG / T_IND | - | Yes | 4 |
| 0x04 | add | T_REG | T_REG | T_REG | Yes | 4 |
| 0x05 | sub | T_REG | T_REG | T_REG | Yes | 4 |
| 0x06 | and | T_REG / T_DIR / T_IND | T_REG / T_DIR / T_IND | T_REG | Yes | 4 |
| 0x07 | or | T_REG / T_DIR / T_IND | T_REG / T_DIR / T_IND | T_REG | Yes | 4 |
| 0x08 | xor | T_REG / T_DIR / T_IND | T_REG / T_DIR / T_IND | T_REG | Yes | 4 |
| 0x09 | zjmp | T_DIR | - | - | No | 2 |
| 0x0a | ldi | T_REG / T_DIR / T_IND | T_REG / T_DIR | T_REG | Yes | 2 |
| 0x0b | sti | T_REG | T_REG / T_DIR / T_IND | T_REG / T_DIR | Yes | 2 |
| 0x0c | fork | T_DIR | - | - | No | 2 |
| 0x0d | lld | T_DIR / T_IND | T_REG | - | Yes | 4 |
| 0x0e | lldi | T_REG / T_DIR / T_IND | T_REG / T_DIR | T_REG | Yes | 2 |
| 0x0f | lfork | T_DIR | - | - | No | 2 |
| 0x10 | aff | T_REG | - | - | Yes | 4 |
Why do we need a Size of T_DIR for statements that do not accept arguments of this type?
At this stage, this information really is useless. But it will be needed while the virtual machine is running, as discussed in the [relevant section](Virtual machine).
The complete statement table can be viewed on the Google Sheets service.
The second table we need contains information about the codes of argument types and their sizes.
| Type | Sign | Code | Size |
|---|---|---|---|
| T_REG | r | 01 | 1 byte |
| T_DIR | % | 10 | Size of T_DIR |
| T_IND | - | 11 | 2 bytes |
The complete argument table can be viewed on the Google Sheets service.
About registers and their sizes
It is important to distinguish two size characteristics of registers.
The name of a register (r1, r2, ...) takes 1 byte in the bytecode. But the register itself holds 4 bytes, as indicated by the REG_SIZE constant.
About the size of arguments of type T_DIR
As you can see in the statement table, the size of arguments of type T_DIR is not fixed and depends on the statement. But the file op.h contains a preprocessor constant which says that the size of T_DIR is equal to the size of a register - 4 bytes:
# define IND_SIZE 2
# define REG_SIZE 4
# define DIR_SIZE REG_SIZE
Where is the logic here?
There is logic here, but it is rather vague.
As we have already said, to write a number of type T_DIR into a register and to store that number in memory, the size of a register and the size of a T_DIR argument must match.
This claim holds. If you look at the statements that load a value into a register, the size of T_DIR for them is equal to 4.
The size of T_DIR is also equal to 4 for the live statement.
As for the statements for which the size of T_DIR is equal to 2, in these cases an argument of this type only takes part in forming an address. For such a task, 4 bytes are redundant. After all, the amount of memory is only 4096 bytes, as indicated by the constant MEM_SIZE. Any address that exceeds this limit will be reduced modulo MEM_SIZE, if that has not already been done using IDX_MOD.
In this situation, an argument of type T_DIR plays the role of a relative address, that is, of an argument of type T_IND. So its size is equal to IND_SIZE (2 bytes).
Each statement presented in bytecode has the following structure:
- The statement code - 1 byte
- The argument type code (not needed for all statements) - 1 byte
- Arguments
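Given this structure and the two tables above, the size of any encoded statement can be computed before a single byte is written. Here is a minimal sketch covering only the four statements used by our champion; the names `OPS`, `arg_size` and `instruction_size` are hypothetical.

```python
# name: (statement code, has argument type code, size of T_DIR)
# Only the rows of the statement table needed for the example champion.
OPS = {
    "live": (0x01, False, 4),
    "ld":   (0x02, True, 4),
    "zjmp": (0x09, False, 2),
    "sti":  (0x0B, True, 2),
}


def arg_size(op: str, arg_type: str) -> int:
    """Size in bytes of one argument, taken from the argument table."""
    if arg_type == "T_REG":
        return 1
    if arg_type == "T_IND":
        return 2
    if arg_type == "T_DIR":
        return OPS[op][2]  # depends on the statement
    raise ValueError(f"unknown argument type: {arg_type}")


def instruction_size(op: str, arg_types: list) -> int:
    """Statement code + optional argument type code + arguments."""
    has_type_code = OPS[op][1]
    return 1 + (1 if has_type_code else 0) + sum(arg_size(op, t) for t in arg_types)


print(instruction_size("sti", ["T_REG", "T_DIR", "T_DIR"]))  # 7
print(instruction_size("live", ["T_DIR"]))                   # 5
print(instruction_size("ld", ["T_DIR", "T_REG"]))            # 7
print(instruction_size("zjmp", ["T_DIR"]))                   # 3
```

Knowing these sizes in advance is what makes it possible to turn labels into relative addresses.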
Argument Type Code
As indicated above in the structure of the encoded statement, the second component, the argument type code, may be missing.
Whether it is present depends on the specific statement. If the statement takes only one argument and its type is uniquely defined as T_DIR, then the code with information about the argument types is not needed. For all other statements, this component is required.
You can check whether this code is needed for a specific statement in the "Argument Type Code" column of the statement table.
Let's look at how the instructions of the executable code are encoded:
loop:
sti r1, %:live, %1
live:
live %0
ld %0, r2
zjmp %:loop
The first instruction to translate is:
loop:
sti r1, %:live, %1
First, determine the size of each component.
| Statement Code | Argument Type Code | Argument # 1 | Argument # 2 | Argument # 3 |
|---|---|---|---|---|
| 1 byte | 1 byte | 1 byte | 2 bytes | 2 bytes |
Now let's look at how we get the bytecode of each part of the instruction.
Statement code
The code for each statement is listed in the statement table. For sti, it is equal to 0x0b.
Argument type code
To generate this code, represent 1 byte in binary. The first two bits on the left are occupied by the type code of argument # 1. The next two hold the type code of the second argument, and so on. The last, fourth pair is always equal to 00.
Codes for each type are listed in the argument table.
| Argument # 1 | Argument # 2 | Argument # 3 | - | Result Code |
|---|---|---|---|---|
| T_REG | T_DIR | T_DIR | - | - |
| 01 | 10 | 10 | 00 | 0x68 |
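The same packing can be expressed with bit shifts. This is a small sketch; the function name `argument_type_code` is hypothetical, and the type codes come from the argument table.

```python
# Two-bit codes from the argument table
TYPE_CODES = {"T_REG": 0b01, "T_DIR": 0b10, "T_IND": 0b11}


def argument_type_code(arg_types: list) -> int:
    """Pack up to three 2-bit type codes into one byte, left to right."""
    code = 0
    for position, arg_type in enumerate(arg_types):
        # Argument 1 goes to bits 7-6, argument 2 to bits 5-4, argument 3 to bits 3-2.
        shift = 6 - 2 * position
        code |= TYPE_CODES[arg_type] << shift
    return code  # bits 1-0 stay 00


print(hex(argument_type_code(["T_REG", "T_DIR", "T_DIR"])))  # 0x68
```

Unused argument positions simply remain 00, so a two-argument statement leaves the last two pairs empty.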
Argument of type T_REG
In this case, the register number is written as a hexadecimal code. For register r1 it is 0x01.
Argument-label of type T_DIR
As we already know, a label must be turned into a number that holds the relative address in bytes.
Since the label live points to the next instruction, and the size in bytes of the current instruction is already known, the required distance is easy to calculate - 7 bytes.
The resulting address must be written in 2 bytes - 0x0007.
Argument-number of type T_DIR
This case is easier. You just need to write the decimal number from the source as hexadecimal bytecode - 0x0001.
The final bytecode of the instruction is 0b 68 01 0007 0001.
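The steps for this instruction can be reproduced as follows. It is a sketch with hard-coded values for this one instruction; the variable names are hypothetical, and the relative address is taken as the label address minus the address of the instruction that uses it.

```python
STI_CODE = 0x0B        # from the statement table
TYPE_CODE = 0x68       # T_REG, T_DIR, T_DIR
T_DIR_SIZE = 2         # size of T_DIR for sti

instruction_address = 0                       # sti is the first instruction
instruction_size = 1 + 1 + 1 + T_DIR_SIZE + T_DIR_SIZE
label_address = instruction_address + instruction_size  # live: follows immediately
offset = label_address - instruction_address  # relative address of the label

encoded = (
    bytes([STI_CODE, TYPE_CODE])
    + bytes([1])                              # register r1
    + offset.to_bytes(T_DIR_SIZE, "big")      # %:live
    + (1).to_bytes(T_DIR_SIZE, "big")         # %1
)

print(offset)            # 7
print(encoded.hex(" "))  # 0b 68 01 00 07 00 01
```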
For the following instructions, everything is similar:
live:
live %0
The only significant change is that the argument type code is not needed for this statement:
| Statement Code | Argument # 1 |
|---|---|
| 1 byte | 4 bytes |
The final bytecode for the second instruction is 01 00000000.
The third instruction:
ld %0, r2
| Statement Code | Argument Type Code | Argument # 1 | Argument # 2 |
|---|---|---|---|
| 1 byte | 1 byte | 4 bytes | 1 byte |
There are no surprises here - 02 90 00000000 02.
The fourth instruction:
zjmp %:loop
The statement code zjmp is 0x09.
The argument type code is not needed for this statement.
| Statement Code | Argument # 1 |
|---|---|
| 1 byte | 2 bytes |
The label loop points 19 bytes back. But how do we represent the number -19 in hexadecimal?
To do this, first write the number 19 in binary as a plain 16-bit value:
0000 0000 0001 0011
From this representation of 19, we can get the number -19 in two's complement.
Two's complement is the customary way to represent negative integers in computers.
To get the two's complement of the number -19, follow these steps:
- Invert all bits
That is, change every one to zero, and vice versa:
1111 1111 1110 1100
- Add one to the number
1111 1111 1110 1101
Done. We have obtained the number -19 in two's complement.
Let's translate it from binary to hexadecimal - 0xffed.
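Both steps are easy to check in Python. The sketch below repeats the manual procedure for a 2-byte value and then compares it with the built-in signed conversion; the variable names are only for illustration.

```python
BITS = 16                    # the zjmp argument takes 2 bytes
MASK = (1 << BITS) - 1       # 0xffff

inverted = 19 ^ MASK         # step 1: invert all bits
result = (inverted + 1) & MASK  # step 2: add one, keep 16 bits

print(format(inverted, "016b"))  # 1111111111101100
print(format(result, "016b"))    # 1111111111101101
print(hex(result))               # 0xffed

# The built-in conversion gives the same two bytes.
print((-19).to_bytes(2, "big", signed=True).hex())  # ffed
```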
The final bytecode for the fourth instruction is 09 ffed.
The entire executable code of the champion in question will look like this:
0b68 0100 0700 0101 0000 0000 0290 0000
0000 0209 ffed
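To tie the whole walkthrough together, here is a sketch of a tiny two-pass encoder for this one champion. It is not the asm program from the task: parsing is skipped, the program is given as ready-made tuples, only the four statements used here are supported, and all names (`OPS`, `PROGRAM`, `assemble` and so on) are hypothetical.

```python
# name: (statement code, has argument type code, size of T_DIR)
OPS = {
    "live": (0x01, False, 4),
    "ld":   (0x02, True, 4),
    "zjmp": (0x09, False, 2),
    "sti":  (0x0B, True, 2),
}
TYPE_CODES = {"reg": 0b01, "dir": 0b10, "ind": 0b11}

# (label or None, statement, arguments); a string value refers to a label.
PROGRAM = [
    ("loop", "sti", [("reg", 1), ("dir", "live"), ("dir", 1)]),
    ("live", "live", [("dir", 0)]),
    (None, "ld", [("dir", 0), ("reg", 2)]),
    (None, "zjmp", [("dir", "loop")]),
]


def arg_size(op, kind):
    return {"reg": 1, "ind": 2, "dir": OPS[op][2]}[kind]


def instruction_size(op, args):
    return 1 + int(OPS[op][1]) + sum(arg_size(op, kind) for kind, _ in args)


def assemble(program):
    # Pass 1: record the address of every label.
    labels, address = {}, 0
    for label, op, args in program:
        if label is not None:
            labels[label] = address
        address += instruction_size(op, args)

    # Pass 2: emit the bytes of every instruction.
    out, address = bytearray(), 0
    for _, op, args in program:
        opcode, has_type_code, _ = OPS[op]
        out.append(opcode)
        if has_type_code:
            type_code = 0
            for position, (kind, _) in enumerate(args):
                type_code |= TYPE_CODES[kind] << (6 - 2 * position)
            out.append(type_code)
        for kind, value in args:
            if isinstance(value, str):
                # Relative address: label address minus current instruction address.
                value = labels[value] - address
            size = arg_size(op, kind)
            # The modulo turns negative numbers into two's complement.
            out += (value % (1 << (8 * size))).to_bytes(size, "big")
        address += instruction_size(op, args)
    return bytes(out)


code = assemble(PROGRAM)
print(len(code))   # 22
print(code.hex())  # 0b68010007000101000000000290000000000209ffed
```

The first pass is needed because a label such as live is used before the address it points to is known.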