Skip to content

tokenizeString performance improved by 57%#781

Open
ahfuzhang wants to merge 4 commits intoVictoriaMetrics:masterfrom
ahfuzhang:dev/ahfuzhang_20251028_for_token
Open

tokenizeString performance improved by 57%#781
ahfuzhang wants to merge 4 commits intoVictoriaMetrics:masterfrom
ahfuzhang:dev/ahfuzhang_20251028_for_token

Conversation

@ahfuzhang
Copy link
Contributor

@ahfuzhang ahfuzhang commented Oct 28, 2025

Describe Your Changes

  1. Scan the string only once.
  2. Avoid using isAscii() by checking whether the string contains Unicode.
  3. Use a table lookup instead of a complex Boolean expression.

Checklist

The following checks are mandatory:

@valyala
I'm sorry, I don't participate much in open source projects. It seems I'm not following the proper procedures and etiquette enough. I'd appreciate some guidance if I find myself doing things inappropriately.

Could you please take a moment to review my PR? My idol

@ahfuzhang
Copy link
Contributor Author

@func25 could you review this again ?

Copy link
Contributor

@func25 func25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Could you add more tests to cover the cases we discussed?

// Register the token.
token := unsafe.String((*byte)(unsafe.Add(ptr, start)), end-start)
if curUnicodeFlag == 1 {
// Only perform tokenizeStringUnicode on very short substrings if the string contains Unicode characters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Only perform tokenizeStringUnicode on very short substrings if the string contains Unicode characters
// If the current token contains non-ASCII bytes, delegate to tokenizeStringUnicode for rune-aware tokenization. Otherwise hash directly

@ahfuzhang
Copy link
Contributor Author

@valyala I don't understand why this project has such low activity levels. Has the core team moved on to another project?

@func25
Copy link
Contributor

func25 commented Nov 27, 2025

The team is focused on higher-priority work before the end of the year. + we agreed to merge this PR if no one else reviews it (it may be reverted later), but it still needs test cases, so I can't merge it right now

@valyala
Copy link
Contributor

valyala commented Feb 5, 2026

@ahfuzhang , could you provide benchmark results for the BenchmarkTokenizeStrings before and after the optimization applied in this pull request?

@ahfuzhang
Copy link
Contributor Author

@ahfuzhang , could you provide benchmark results for the BenchmarkTokenizeStrings before and after the optimization applied in this pull request?

/*
go test -test.fullpath=true -benchmem -run=^$ -bench ^BenchmarkTokenizeHashes$ github.com/VictoriaMetrics/VictoriaLogs/lib/logstorage

goos: darwin
goarch: arm64
pkg: github.com/VictoriaMetrics/VictoriaLogs/lib/logstorage
cpu: Apple M2
=== RUN BenchmarkTokenizeHashes
BenchmarkTokenizeHashes
BenchmarkTokenizeHashes-8 774218 1476 ns/op 2267.84 MB/s 0 B/op 0 allocs/op
PASS
*/

/*
go test -test.fullpath=true -benchmem -run=^$ -bench ^BenchmarkTokenizeHashesOld$ github.com/VictoriaMetrics/VictoriaLogs/lib/logstorage

goos: darwin
goarch: arm64
pkg: github.com/VictoriaMetrics/VictoriaLogs/lib/logstorage
cpu: Apple M2
=== RUN BenchmarkTokenizeHashesOld
BenchmarkTokenizeHashesOld
BenchmarkTokenizeHashesOld-8 665464 1864 ns/op 1796.11 MB/s 0 B/op 0 allocs/op
PASS

2267.84 MB/s vs 1796.11 MB/s, about 26% speedup
*/

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 2 files

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants