Tokenization Poor Performance With Large Number of Token Instances #115

@aolszowka

Description

The Tokenizer appears to perform very poorly when you have a large number of replacement token instances.

For example, in one file, this line:

$matches = select-string -Path $tempFile -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value }

returns 8600 instances for a file I am attempting to run the replacement against.

Based on the logic of this loop:

ForEach ($match in $matches) {

This will attempt to perform this operation 8600 times. If you look at the code, this loops through the file row by row, attempting a replacement of all of the variables that are found.
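
Pieced together from the lines quoted here, the overall flow is effectively the following (a simplified sketch, not the exact project code; $variableValue stands in for however the value for each token is looked up):

ForEach ($match in $matches) {
    # One full read/replace/write pass over the file per match,
    # so roughly 8600 passes for the file described above.
    (Get-Content $tempFile -Encoding $encoding) |
        Foreach-Object { $_ -replace $match, $variableValue } |
        Set-Content $tempFile -Encoding $encoding -Force
}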

This is inefficient; instead, the select-string call above should gather the distinct values, like so (the following tries to follow the PowerShell idioms and is not 100% efficient):

$matches = select-string -Path $tempFile -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } | Sort-Object | Get-Unique

Note that in order to use Get-Unique, the documentation states that the list must be sorted, which is why Sort-Object is called beforehand.

Running this on that same file returns a mere 38 distinct values to attempt to replace, which is more than two orders of magnitude fewer than before.
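
As a side note, the Sort-Object / Get-Unique pair could also be collapsed into a single cmdlet, since Sort-Object has a -Unique switch; the result is the same distinct list:

$matches = select-string -Path $tempFile -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } | Sort-Object -Unique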

There are still other performance issues; for example, the row-by-row replacement per variable seen here:

(Get-Content $tempFile -Encoding $encoding) |
    Foreach-Object {
        $_ -replace $match, $variableValue
    } |
    Set-Content $tempFile -Encoding $encoding -Force

This becomes painful as the number of lines in the file increases. However, the simple fix above would resolve the most obvious performance issue.
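
As a possible follow-up (not part of the fix proposed here), the row-by-row cost could be avoided by reading the file into memory once, applying every distinct replacement to the whole content, and writing it back once. A rough sketch of that idea, reusing the variable names from above:

# Read the entire file once as a single string (requires PowerShell 3+ for -Raw).
$content = Get-Content $tempFile -Encoding $encoding -Raw

# Apply each distinct token replacement in memory.
# ($variableValue again stands in for however the value for this token is resolved.)
ForEach ($match in $matches) {
    $content = $content -replace $match, $variableValue
}

# Write the result back in a single pass.
Set-Content $tempFile -Value $content -Encoding $encoding -Force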
