Lexical analysis

Before translating assembly code into bytecode, the translator program performs lexical analysis of the source code.

In computer science, lexical analysis (tokenising) is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens.

Source: Lexical Analysis - Wikipedia

This process splits the champion's code into separate components (tokens) and assigns each one a type.

For example, lexical analysis turns the following line:

loop: sti r1, %:live, %1

into this token list:

Number	Content	Type
1	`loop:`	`LABEL`
2	`sti`	`INSTRUCTION`
3	`r1`	`REGISTER`
4	`,`	`SEPARATOR`
5	`%:live`	`DIRECT_LABEL`
6	`,`	`SEPARATOR`
7	`%1`	`DIRECT`
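A token list like the one above could be produced by a small regular-expression scanner. This is only a minimal sketch, not the algorithm of the original asm: the patterns, the order in which they are tried, and the label alphabet are assumptions, and `tokenize` is a hypothetical helper name.

```python
import re

# Patterns are tried in order. The type names come from the table above;
# the regular expressions themselves are assumptions for illustration.
TOKEN_SPEC = [
    ("LABEL",        r"[a-z_0-9]+:"),
    ("DIRECT_LABEL", r"%:[a-z_0-9]+"),
    ("DIRECT",       r"%-?[0-9]+"),
    ("REGISTER",     r"r[0-9]{1,2}(?![a-z_0-9])"),
    ("INSTRUCTION",  r"[a-z]+"),
    ("SEPARATOR",    r","),
    ("SKIP",         r"[ \t]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(line):
    """Split one source line into (type, content) pairs."""
    tokens = []
    pos = 0
    while pos < len(line):
        match = MASTER.match(line, pos)
        if match is None:
            raise SyntaxError(f"lexical error at column {pos + 1}: {line[pos]!r}")
        if match.lastgroup != "SKIP":  # whitespace separates tokens but is not one
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for number, (kind, content) in enumerate(tokenize("loop: sti r1, %:live, %1"), start=1):
    print(number, content, kind)
```

Because the scanner matches patterns rather than splitting on whitespace, a line such as `live%42` is tokenized the same way as `live %42`.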

The error messages printed by the provided translator make it possible to study in detail how it identifies each type of token.

Features

Thanks to these error reports and the "method of scientific brute-force search", you can find out the following:

Order of commands

The commands that set the name and the comment can appear in either order:

.comment "This city needs me"
.name "Batman"

String

A STRING token includes everything from the opening quotation mark to the closing one. Neither line breaks nor comment characters end it.

Therefore, the example below is perfectly valid:

.name "one
#two
three "
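The rule can be sketched as a scanner that ignores everything except the closing quotation mark. This is a minimal illustration under the assumption that the string has no escape sequences; `read_string` is a hypothetical helper, not part of the original asm.

```python
def read_string(source, start):
    """Read a STRING token from `source`, the whole file as one string.

    `start` is the index of the opening quotation mark. Returns the content
    between the quotes and the index just after the closing quote.
    """
    if source[start] != '"':
        raise ValueError("a STRING token must start with a quotation mark")
    # Line breaks and '#' get no special treatment: only '"' ends the token.
    end = source.find('"', start + 1)
    if end == -1:
        raise SyntaxError("unterminated string")
    return source[start + 1:end], end + 1


source = '.name "one\n#two\nthree "\n'
content, next_index = read_string(source, source.index('"'))
print(repr(content))
```

The design consequence is that comment stripping cannot be done line by line before the strings are read, because `#two` here is part of the name.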

Optional spaces and tabs

In cases where the translator can unambiguously separate the components from each other, whitespace between them can be omitted:

live%42

loop:sti r1, %:live, %1

.name"Batman"
.comment"This city needs me"

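But there are cases when you cannot go without at least one space or tab, for example between sti and r1 in the label example above: written together, stir1 would be read as a single word.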

± 0

Numeric arguments may carry a minus sign, but a plus sign causes an error: it cannot appear in a number.

However, leading zeros may be present:

live %0000042
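Both rules fit into one pattern: an optional minus sign followed by one or more digits. The sketch below assumes exactly that rule; `parse_number` is a hypothetical helper name.

```python
import re

# Assumed rule: optional '-', then at least one digit. No '+' is allowed.
NUMBER = re.compile(r"-?[0-9]+")


def parse_number(text):
    """Convert the numeric part of an argument to an int, or raise ValueError."""
    if NUMBER.fullmatch(text) is None:
        raise ValueError(f"invalid number: {text!r}")
    return int(text)  # int() accepts leading zeros in decimal strings


print(parse_number("0000042"))
print(parse_number("-7"))
```

The explicit pattern check matters because Python's own `int("+42")` succeeds, while the translator described here rejects the plus sign.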

Registers

For the provided asm program, a register is a string consisting of the character r followed by one or two digits.

Therefore, all of the following examples will be successfully translated into bytecode:

r2

r01

r99

Pay attention to the last one.

The fact that its translation does not cause any errors indicates that asm does not know anything about the virtual machine configuration and the value of the REG_NUMBER constant.

This behavior of the program is absolutely logical. After all, the translator and the virtual machine are separate entities that are not aware of each other's configuration (or even existence).

If the value of the REG_NUMBER constant changes, bytecode that was invalid for the virtual machine can become valid, and vice versa.

If REG_NUMBER is equal to 16, an operation with the register argument r42 is incorrect. With the value 49, it executes without problems.

Two special cases

But two special cases in the behavior of asm raise questions: r0 and r00.

The original translator handles these two examples without any problems, although this contradicts the logic of the assignment, which says:

Registry: (r1 <–> rx with x = REG_NUMBER)

The program's ignorance of the upper bound (REG_NUMBER) is logical and easily explained: when the configuration of the virtual machine changes, the same section of bytecode can become valid from invalid. But the translation result of r0 and r00 will always be incorrect, because the lower bound r1 does not depend on the configuration.

Therefore, when implementing your own asm, this problem should be fixed.
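One possible fix is sketched here, assuming the lexical rule stays "r plus one or two digits" and only register number zero is rejected. Both function names are hypothetical.

```python
import re

REGISTER = re.compile(r"r([0-9]{1,2})")


def is_register_original(token):
    """Rule of the provided asm as described above: 'r' plus one or two digits."""
    return REGISTER.fullmatch(token) is not None


def is_register_fixed(token):
    """Same rule, but r0 and r00 are rejected because numbering starts at r1."""
    match = REGISTER.fullmatch(token)
    return match is not None and int(match.group(1)) >= 1


for token in ("r2", "r01", "r99", "r0", "r00", "r100"):
    print(token, is_register_original(token), is_register_fixed(token))
```

The upper bound is deliberately left unchecked, so that the translator stays independent of REG_NUMBER.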

Size of executable code

The translator program does not limit the size of the executable code in any way. That is, asm works equally well with source code consisting of one instruction and with code containing hundreds of thousands of them.

The virtual machine is a little different. It limits the maximum size of the executable code, in bytes, with the constant CHAMP_MAX_SIZE.

In the provided op.h file, this constant is defined by the following lines:

# define MEM_SIZE (4 * 1024)
// ...
# define CHAMP_MAX_SIZE (MEM_SIZE / 6)

In this case, the champion's executable code must not exceed 682 bytes (4096 / 6, rounded down).
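The arithmetic behind this limit can be checked with a short sketch. The constants mirror the op.h lines above; `fits_in_arena` is a hypothetical helper, not a function of the original virtual machine.

```python
MEM_SIZE = 4 * 1024
# Integer division in C truncates for positive operands, like // in Python.
CHAMP_MAX_SIZE = MEM_SIZE // 6


def fits_in_arena(code_size):
    """Return True if the executable code size (in bytes) is within the limit."""
    return 0 <= code_size <= CHAMP_MAX_SIZE


print(CHAMP_MAX_SIZE)
print(fits_in_arena(682), fits_in_arena(683))
```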

The lower bound is a little more interesting.

The virtual machine readily accepts a .cor file in which the size of the executable code is zero.

But producing such a file with the provided translator is not so simple.

If nothing is written to the .s file apart from the champion's name and comment, converting it to bytecode fails.

But if you add just one label at the end, you get the desired file without executable code:

.comment "This city needs me"
.name "Batman"

loop:
# End of file

Perhaps the creators of the original asm wanted to forbid translating a champion without executable code in this way, so that at least one operation is always present in the file.

But if so, they did not account for every case, and such a file can still be created.

What to do with it?

In your work, you can improve this protection to guarantee the presence of at least one operation.

You can also remove it completely, which looks like a more logical solution. After all, the original virtual machine works without problems with .cor files without executable code.
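If you choose the first option, the stricter check could look like the sketch below. It assumes the lexer yields (type, content) pairs with the type names used earlier on this page; `has_operation` is a hypothetical helper name.

```python
def has_operation(tokens):
    """Return True if at least one token in the code part is an INSTRUCTION."""
    return any(kind == "INSTRUCTION" for kind, _content in tokens)


# Code part of a file that contains nothing but a label.
label_only = [("LABEL", "loop:")]
# Code part of a file with one real operation.
with_live = [("LABEL", "loop:"), ("INSTRUCTION", "live"), ("DIRECT", "%42")]

print(has_operation(label_only))
print(has_operation(with_live))
```

A translator that applies this check would reject the label-only file instead of producing an empty champion.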
