The problem
Hey, I'm trying to use this parser on a very large file, but it crashes with an out-of-memory exception. Example files can be found here:
https://github.com/nitotm/efficient-language-detector/tree/main/resources/ngrams
Context
I'm trying to use phpactor (which uses this parser) to index a large file, and when this parser runs, it crashes the language server with an out-of-memory exception (phpactor/phpactor#2978).
I've traced it down to a function in this project:
tolerant-php-parser/src/PhpTokenizer.php, line 221 (at 457738c):

```php
protected static function tokenGetAll(string $content, $parseContext): array
```
The doc comment of this function states that caching the result is up to the user of this parser, but I think a streamed approach for the tokens would probably be better.
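For illustration, here is a minimal sketch of what a streamed token API could look like, using a PHP generator. This is only an assumption about the API shape, not the parser's actual code, and `tokenStream` is a made-up name. Note that `token_get_all()` itself still materializes the full token array internally, so a real implementation would need to tokenize incrementally (or in chunks) to actually lower peak memory:

```php
<?php

/**
 * Hypothetical streaming wrapper: yields tokens one at a time instead of
 * returning the whole array, so callers can drop each token after use.
 */
function tokenStream(string $content): \Generator
{
    // token_get_all() still builds the full array here; a real streaming
    // tokenizer would have to produce tokens incrementally instead.
    foreach (token_get_all($content, TOKEN_PARSE) as $token) {
        yield $token;
    }
}

foreach (tokenStream(file_get_contents('huge-file.php')) as $token) {
    // Process one token at a time; nothing forces the caller to keep them all.
}
```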
Ideas
Maybe this should be configurable, or depend on the size of the file being parsed. For small files just returning an array is probably faster, but for big files streaming the tokens would make more sense.
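As a rough sketch of that idea (the function name, threshold name, and cutoff value are all made up, and `tokenStream` is the hypothetical generator from the sketch above):

```php
<?php

const STREAM_THRESHOLD_BYTES = 1_000_000; // hypothetical cutoff, would need tuning

function getTokens(string $content): iterable
{
    if (strlen($content) < STREAM_THRESHOLD_BYTES) {
        // Small file: a plain array is simplest and probably fastest.
        return token_get_all($content, TOKEN_PARSE);
    }
    // Large file: fall back to a streamed approach instead.
    return tokenStream($content);
}
```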
What I would suggest is some kind of save-and-restore mechanism in the tokenizer: you set a save point, try tokenizing one way, and if that doesn't work, restore and try a different way. That way only the tokens since the last save point have to be kept in memory.
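To make that idea concrete, here is a rough sketch (all class and method names are hypothetical) of a token stream that buffers tokens only while a save point is active, so memory use is bounded by the distance back to the oldest save point rather than by the whole file:

```php
<?php

/**
 * Hypothetical token stream with save/restore. Tokens are buffered only
 * while at least one save point is active; once all save points are
 * released, consumed tokens can be discarded.
 */
class BufferedTokenStream
{
    /** @var \Iterator<array|string> source of raw tokens, e.g. a generator */
    private \Iterator $source;
    /** @var array<int, array|string> tokens kept for possible rollback */
    private array $buffer = [];
    private int $pos = 0; // current position within $buffer
    /** @var int[] stack of buffer positions, allowing nested save points */
    private array $savePoints = [];

    public function __construct(\Iterator $source)
    {
        $this->source = $source;
    }

    /** Returns the next token, or null at end of input. */
    public function next(): array|string|null
    {
        if ($this->pos === count($this->buffer)) {
            if (!$this->source->valid()) {
                return null;
            }
            $this->buffer[] = $this->source->current();
            $this->source->next();
        }
        $token = $this->buffer[$this->pos++];
        if ($this->savePoints === []) {
            // No active save point: nothing can roll back past here,
            // so the already-consumed tokens can be dropped immediately.
            $this->buffer = [];
            $this->pos = 0;
        }
        return $token;
    }

    /** Marks the current position so it can be restored later. */
    public function save(): void
    {
        $this->savePoints[] = $this->pos;
    }

    /** Rewinds to the most recent save point. Must be paired with save(). */
    public function restore(): void
    {
        $this->pos = array_pop($this->savePoints);
    }

    /** Drops the most recent save point but keeps the current position. */
    public function commit(): void
    {
        array_pop($this->savePoints);
    }
}
```

With something like this, the speculative "try one interpretation, roll back, try another" pattern keeps only a bounded window of tokens alive instead of the entire token array.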