Tokenization Poor Performance With Large Number of Token Instances #115

@aolszowka

Description

The Tokenizer appears to perform very poorly when you have a large number of replacement token instances.

For example, in one file, this line:

$matches = select-string -Path $tempFile -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value }

returns 8600 instances for a file I am attempting to run the replacement against.

Based on the logic of this loop:

ForEach ($match in $matches) {

This will attempt to perform this operation 8600 times. If you look at the code, this loops through the file row by row, attempting a replacement of all of the variables that are found.
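
Pieced together from the lines quoted here, the overall flow is effectively the following (a simplified sketch, not the exact project code; $variableValue stands in for however the value for each token is looked up):

ForEach ($match in $matches) {
    # One full read/replace/write pass over the file per match,
    # so roughly 8600 passes for the file described above.
    (Get-Content $tempFile -Encoding $encoding) |
        Foreach-Object { $_ -replace $match, $variableValue } |
        Set-Content $tempFile -Encoding $encoding -Force
}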

This is inefficient; instead, the select-string call above should gather the distinct values, like so (the following tries to follow the PowerShell idioms and is not 100% efficient):

$matches = select-string -Path $tempFile -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } | Sort-Object | Get-Unique

Note that in order to use Get-Unique, the documentation states that the list must be sorted, which is why Sort-Object is called beforehand.

Running this on that same file returns a mere 38 distinct values to attempt to replace, which is more than two orders of magnitude fewer than before.
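
As a side note, the Sort-Object / Get-Unique pair could also be collapsed into a single cmdlet, since Sort-Object has a -Unique switch; the result is the same distinct list:

$matches = select-string -Path $tempFile -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } | Sort-Object -Unique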

There are still other performance issues; for example, the row-by-row replacement per variable seen here:

(Get-Content $tempFile -Encoding $encoding) |
    Foreach-Object {
        $_ -replace $match, $variableValue
    } |
    Set-Content $tempFile -Encoding $encoding -Force

This becomes painful as the number of lines in the file increases. However, the simple fix above would resolve the most obvious performance issue.
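
As a possible follow-up (not part of the fix proposed here), the row-by-row cost could be avoided by reading the file into memory once, applying every distinct replacement to the whole content, and writing it back once. A rough sketch of that idea, reusing the variable names from above:

# Read the entire file once as a single string (requires PowerShell 3+ for -Raw).
$content = Get-Content $tempFile -Encoding $encoding -Raw

# Apply each distinct token replacement in memory.
# ($variableValue again stands in for however the value for this token is resolved.)
ForEach ($match in $matches) {
    $content = $content -replace $match, $variableValue
}

# Write the result back in a single pass.
Set-Content $tempFile -Value $content -Encoding $encoding -Force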
