Engine Performance: Storing DFAs in Data Section #7

tajmone · 2021-08-20T11:50:29Z

@SicroAtGit, I think that to improve the performance of the RegEx engine you could benefit from storing pre-compiled RegExs (DFAs) in Data sections (i.e. the read-only .rodata section for constants), which should make memory access faster.

I'm trying this approach in the Lemon PB project, as you can see in the PoC code in lemon-pb/poc/calc/calc.pbi:

DataSection
  yyArrTableData:
  ; yy_action[]
  Data.YYACTIONTYPE   20,   3,   4,   1,   2,  18,   5,   1,   2,  24 ;  0
  Data.YYACTIONTYPE   16,  19,  19,  23,   6,   7                     ; 10
  ; yy_lookahead[]:
  Data.YYCODETYPE      0,   1,   2,   3,   4,   6,   7,   3,   4,   7 ; 0
  Data.YYCODETYPE      5,   8,   8,   7,   7,   7,   8,   6,   6,   6 ; 10
  Data.YYCODETYPE      6,   6                                         ; 20
  ; yy_reduce_ofst[]: (always signed char)
  Data.b              -1,   2,   6,   7,   8                          ;  0
  ; yy_default[]:
  Data.YYACTIONTYPE   17,  17,  17,  17,  17,  17,  22,  21           ;  0
  ; yy_shift_ofst[]: (always unsigned char)
  Data.a               5,   5,   5,   5,   5,   0,   4,   4           ;  0
EndDataSection

Structure yyArraysStruct
  action.YYACTIONTYPE[#YY_ACTTAB_COUNT]
  lookahead.YYCODETYPE[#YY_LOOKAHEADTAB_COUNT]
  reduce_ofst.b[#YY_REDOFSTTAB_COUNT]
  default.YYACTIONTYPE[#YY_DEFAULTTAB_COUNT]
  shift_ofst.a[#YY_SHIFTOFSTTAB_COUNT]
EndStructure

*yy.yyArraysStruct = ?yyArrTableData

Since these C-style arrays are read-only data about the parse table, I'm planning to make the parser generator emit them as PB code in a DataSection with labels, and then have the template create structured arrays that point to the label. Preliminary tests showed that this works.

You'll probably have to fiddle with pointers though, since from what I understood so far the data structures here contain pointers to other sub-elements, which means you'll need to do some pointer arithmetic to get this right.

Also, you'd probably need to add a tool that generates a .pbi include file with all the required DataSections, which end users would then include via an alternative module of this engine designed specifically for working with pre-compiled RegExs. But I think it's worth it, especially for projects that don't need to define RegExs at run-time (e.g. from user input) but rely on a predefined set of RegExs.

I'm not sure how much faster accessing data stored in data sections will be compared to allocating the same data at run-time, but I think it could bring a significant improvement, since the DFAs won't need to be managed on the heap and data sections are globally reachable. Of course, it's also true that for each type of data section in the binary the OS will allocate memory in chunks of at least 4 KB (or multiples thereof), so for a very small RegEx you could see some memory bloat at run-time. Heap-allocated memory is probably worse, though, since the OS also uses fixed-size chunks to allocate memory for structures that could grow.

SicroAtGit · 2021-08-21T13:14:41Z

Maintainer

> I think that to improve performance of the RegEx engine you could benefit from pre-compiled RegExs (DFAs) being stored in Data sections (i.e. the read only .rodata for constants), which should make memory access faster.

This is not possible with my current DFA data structure (at least not easily), because it is an array of maps, and maps cannot be static.

I use a map because my regex engine generates an incomplete DFA.

A complete DFA could be implemented with a static array of static arrays. But since a complete DFA must have a transition for every symbol of the language's alphabet in each state, memory consumption grows very quickly.

Example: For a language that supports all characters of the character set that PureBasic uses, 65536 symbol transitions must be in each state.
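
To put a rough number on this, the following sketch estimates the size of such a table. The state count (100) and the 2-byte state index are assumptions chosen for illustration only; they are not taken from the engine.

```purebasic
; Rough size estimate for a complete DFA transition table.
; Assumptions (not from the engine): 100 states, 2-byte state indices.
#AlphabetSize = 65536 ; one transition per 16-bit character value
#IndexSize    = 2     ; bytes per stored target state
#StateCount   = 100   ; hypothetical DFA size

Define.q tableBytes = #StateCount * #AlphabetSize * #IndexSize

Debug "Bytes: " + Str(tableBytes)
Debug "MiB:   " + StrD(tableBytes / 1048576.0, 1)
```

Under these assumptions the table alone takes 13,107,200 bytes (12.5 MiB), even if almost every entry points to the same dead state.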

To implement character classes in a memory-saving way, I will use a character range (min/max character) at transitions instead of one character as a symbol. The dot-symbol (any character) is then [1, 65536], instead of (1|2|3|...|65536). Thus, the data structures of NFA and DFA will change soon. Therefore, it is still too early to discuss performance improvements.
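
A minimal sketch of what such range-based transitions could look like. The structure, constant and procedure names are hypothetical and only illustrate the idea described above; they are not the engine's actual data structures.

```purebasic
; Sketch of range-based transitions. All names here are hypothetical
; and not taken from the engine.
EnableExplicit

#NoTransition = -1

Structure RangeTransition
  minChar.u     ; first character of the range (inclusive)
  maxChar.u     ; last character of the range (inclusive)
  nextState.i   ; index of the target state
EndStructure

; Returns the target state for character c, or #NoTransition
; if no range of the given state covers c.
Procedure.i FindNextState(List transitions.RangeTransition(), c.u)
  ForEach transitions()
    If c >= transitions()\minChar And c <= transitions()\maxChar
      ProcedureReturn transitions()\nextState
    EndIf
  Next
  ProcedureReturn #NoTransition
EndProcedure

; One state with a single transition for the class [0-9]
NewList digitState.RangeTransition()
AddElement(digitState())
digitState()\minChar   = '0'
digitState()\maxChar   = '9'
digitState()\nextState = 1

Debug FindNextState(digitState(), '5') ; inside the range: state 1
Debug FindNextState(digitState(), 'x') ; outside the range: #NoTransition
```

With this layout a class such as [0-9] costs one entry instead of ten, at the price of a linear scan over the ranges of a state.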

> since from what I understood so far the data structures here contain pointers to other sub-elements

Only the NFA data structure uses memory pointers to the next states, whereas the DFA data structure uses index numbers.

> a tool that generates a .pbi include file with all the required DataSections, which end users then would have to include using an alternative module of this engine, designed specifically for working with pre-compiled RegExs. But I think it's worth it, especially for projects that don't need to define RegEx at run-time (e.g. from user input) but rely on a predefined set of RegExs.

This is planned, but in a different way.

The DFA is then PureBasic code: a Procedure/EndProcedure block in which a Goto or a Select/EndSelect block performs the state changes, and Select/EndSelect or If/EndIf blocks perform the transitions, just like re2c does.
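
A hand-written sketch of that idea, assuming the regex "ab*c" and a hypothetical procedure name. It uses the Select/EndSelect variant for the state changes; code produced by a real generator may look quite different.

```purebasic
; DFA for the regex "ab*c" written directly as code.
; The regex and the procedure name are illustrative assumptions.
EnableExplicit

Procedure.i MatchAbStarC(text$)
  Protected *c.Character = @text$
  Protected state.i = 0

  If *c = 0 : ProcedureReturn #False : EndIf ; guard against a null string pointer

  Repeat
    Select state
      Case 0 ; start state: expects 'a'
        If *c\c = 'a'
          state = 1
        Else
          ProcedureReturn #False
        EndIf

      Case 1 ; after 'a': loops on 'b', moves on with 'c'
        Select *c\c
          Case 'b' : state = 1
          Case 'c' : state = 2
          Default  : ProcedureReturn #False
        EndSelect

      Case 2 ; accepting state: the input must end here
        ProcedureReturn Bool(*c\c = 0)
    EndSelect

    *c + SizeOf(Character) ; advance to the next character
  ForEver
EndProcedure

Debug MatchAbStarC("abbbc") ; matches
Debug MatchAbStarC("ac")    ; matches
Debug MatchAbStarC("abx")   ; does not match
```

The transitions are encoded in the control flow itself, so no table has to be stored or looked up at run-time.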

1 reply

SicroAtGit Dec 25, 2021
Maintainer

@tajmone

I've crossed out a few things above that I no longer want to implement this way.

Because I am now creating a complete DFA (each state has defined transitions for each byte), character ranges will be implemented differently.

I ran a speed test to see which is faster, the DFA as a DataSection block or as PB code (using If...EndIf and Goto), and the DataSection variant was much faster. So I have now implemented your suggestion and offer the option of storing the DFA in a DataSection block. Next, I will implement a function that automatically generates the DataSection block from a created DFA.

The example code that uses a DFA stored in a DataSection block.

SicroAtGit · 2021-12-29T11:46:20Z

Maintainer

The RegEx engine can now use a DFA directly from memory, i.e. also from a PureBasic DataSection block. Additionally, the new function ExportDfa() can be used to create a PureBasic include file containing the DFA as a DataSection block.
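
To illustrate the general technique of running a complete DFA from a DataSection block, here is a small self-contained sketch. It is not the engine's code and does not reflect the layout that ExportDfa() writes: the table layout, the constants, the procedure name and the toy alphabet of four symbols ('a' to 'd', instead of 256 byte values) are all assumptions made to keep the listing short. The DFA accepts the regex "ab*c".

```purebasic
; Table-driven complete DFA read from a DataSection block.
; Layout, names and the 4-symbol alphabet are assumptions for illustration.
EnableExplicit

#SymbolCount = 4 ; 'a'..'d'; a byte-based table would have 256 columns
#DeadState   = 0
#StartState  = 1
#AcceptState = 3

; *table points to #SymbolCount unsigned bytes per state, so the same
; procedure works for a DataSection label or an allocated memory buffer.
Procedure.i MatchWithTable(*table, text$)
  Protected *c.Character = @text$
  Protected state.i = #StartState

  If *c = 0 : ProcedureReturn #False : EndIf ; guard against a null string pointer

  While *c\c <> 0 And state <> #DeadState
    If *c\c < 'a' Or *c\c > 'd'
      ProcedureReturn #False ; symbol outside the toy alphabet
    EndIf
    state = PeekA(*table + state * #SymbolCount + (*c\c - 'a'))
    *c + SizeOf(Character)
  Wend

  ProcedureReturn Bool(state = #AcceptState)
EndProcedure

Debug MatchWithTable(?DfaTable, "abbc") ; matches
Debug MatchWithTable(?DfaTable, "abd")  ; does not match

DataSection
  DfaTable:
  ;       a  b  c  d
  Data.a  0, 0, 0, 0 ; state 0: dead state
  Data.a  2, 0, 0, 0 ; state 1: start state
  Data.a  0, 2, 3, 0 ; state 2: after 'a', loops on 'b'
  Data.a  0, 0, 0, 0 ; state 3: accepting state
EndDataSection
```

Because every state has an entry for every symbol, each step is a single indexed read with no search and no heap allocation.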

2 replies

tajmone Dec 31, 2021
Author

That's going to be very useful for third party tools like lexer generators, etc. Well done!

SicroAtGit Jan 1, 2022
Maintainer

Thanks.

Once I have implemented #16, the RegEx engine will be usable as a lexer.
