
Parsing large files (a lot of tokens) crashes with an out-of-memory error. #423

@mamazu

Description

The problem

Hey, I'm trying to use this parser on quite a large file, but it crashes with an out-of-memory error. Example files can be found here:
https://github.com/nitotm/efficient-language-detector/tree/main/resources/ngrams

Context

I'm trying to use phpactor (which uses this parser) to index a large file, and running this parser crashes the language server with an out-of-memory error (phpactor/phpactor#2978).

I've traced it down to a function in this project:

protected static function tokenGetAll(string $content, $parseContext): array

The doc comment of this function states that caching the result is up to the user of this parser, but I think a streamed approach for tokens would probably be better.
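
Something like this is what I have in mind (just a sketch; tokenStream is a made-up name, not part of this parser's API):

<?php

// Sketch only: expose tokens lazily via a generator instead of one big array,
// so callers only hold the tokens they haven't consumed yet.
// Note: \PhpToken::tokenize() itself still builds the full array internally,
// so a real fix would also need an incremental tokenizer underneath.
function tokenStream(string $content): \Generator
{
    foreach (\PhpToken::tokenize($content) as $token) {
        yield $token;
    }
}

foreach (tokenStream(file_get_contents('big-file.php')) as $token) {
    // handle one token at a time rather than indexing into a huge array
}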

Ideas

Maybe this should be configurable, or depend on the size of the file being parsed. For small files, just returning an array is probably faster, but for big files, streaming the tokens would make more sense.

What I would suggest is some kind of save-and-restore mechanism in the tokenizer. That way you can save a point in the tokenizer, try tokenizing one way, and if that doesn't work, restore and try a different way. Only the tokens since the last save point would have to be kept in memory.
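
A rough sketch of that save/restore idea (all names here are hypothetical, not this parser's actual API):

<?php

// Hypothetical sketch: buffer only the tokens produced since the last committed
// save point, so backtracking works without keeping the whole file's tokens in memory.
final class CheckpointTokenizer
{
    /** @var \PhpToken[] tokens buffered since the last commit */
    private array $buffer = [];
    private int $position = 0;

    public function __construct(private \Iterator $tokens)
    {
    }

    /** Remember the current position so the parser can rewind to it later. */
    public function save(): int
    {
        return $this->position;
    }

    /** Rewind to a saved position to try parsing a different way. */
    public function restore(int $savePoint): void
    {
        $this->position = $savePoint;
    }

    /** Drop buffered tokens that are no longer needed (invalidates earlier save points). */
    public function commit(): void
    {
        $this->buffer = array_slice($this->buffer, $this->position);
        $this->position = 0;
    }

    /** Return the next token, pulling from the underlying stream as needed. */
    public function next(): ?\PhpToken
    {
        if ($this->position < count($this->buffer)) {
            return $this->buffer[$this->position++];
        }
        if (!$this->tokens->valid()) {
            return null;
        }
        $token = $this->tokens->current();
        $this->tokens->next();
        $this->buffer[] = $token;
        $this->position++;
        return $token;
    }
}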
