all: Improve and consolidate calculation of word count #10032
Conversation
This PR has been automatically marked as stale because it has not had recent activity. The resources of the Hugo team are limited, and so we are asking for your help.
Yes, I believe it fits the description, including closing an open issue.
Force-pushed from b200173 to fde4893.
I have updated this branch to resolve conflicts, and have also improved the implementation slightly.
I'm not a linguistics expert, but my understanding is that CJK languages do not have explicit word separators like spaces in English, yet we're relying on such separators to count words. cc: @davidsneighbour
@jmooring no, there are no word separators in Chinese/Japanese. The concept of a "word" is not even clearly defined in those languages, which is why counting characters is the best one can do. In either case, I don't think this affects the resolution of #10031: the problem is that, with the existing setting, the entire text is treated as CJK (or as non-CJK) for the reading-time calculation. This PR resolves that issue for mixed text by computing separate counts: CJK characters via rune count, and everything else via word count. This is still not perfect, obviously (is Korean read at a similar speed as Chinese? what about Arabic anyway? etc.), but it's an improvement over the current situation at least, and it eliminates an unnecessary setting.
@jmooring is there any more feedback or changes you'd like to see here?
I've fixed the formatting issues reported by the CI.
Performance

I ran a quick performance comparison (before and after).
For example, here is an optimized version of the code:

```go
// CountWordsOptimized returns the approximate word count in s.
func (ns *Namespace) CountWordsOptimized(s any) (int, error) {
	ss, err := cast.ToStringE(s)
	if err != nil {
		return 0, fmt.Errorf("failed to convert content to string: %w", err)
	}
	sss := tpl.StripHTML(ss)
	n := 0
	if hasCJK(sss) {
		// Mixed or CJK text: count each rune of a CJK field individually,
		// and each non-CJK field as a single word.
		for _, word := range strings.Fields(sss) {
			if hasCJK(word) {
				n += utf8.RuneCountInString(word)
			} else {
				n++
			}
		}
	} else {
		// Pure non-CJK text: count transitions from space to non-space
		// (word starts) without allocating a slice of fields.
		inWord := false
		for _, r := range sss {
			wasInWord := inWord
			inWord = !unicode.IsSpace(r)
			if inWord && !wasInWord {
				n++
			}
		}
	}
	return n, nil
}

// hasCJK reports whether the string s contains one or more Chinese, Japanese,
// or Korean (CJK) characters.
func hasCJK(s string) bool {
	for _, r := range s {
		if unicode.In(r, unicode.Han, unicode.Hangul, unicode.Hiragana, unicode.Katakana) {
			return true
		}
	}
	return false
}
```

A performance comparison for a string (approximately 3400 words) that does not contain any CJK characters:
A performance comparison for a string (approximately 3400 words) containing 2 CJK characters near the middle:
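For reference, a comparison like this can be driven with Go's standard benchmark harness. A minimal sketch; the package name, the sample inputs, and the countWords placeholder are illustrative, not taken from the PR:

```go
package hstrings

import (
	"strings"
	"testing"
)

// countWords stands in for the implementation under test; substitute the
// optimized version from the comment above.
func countWords(s string) int {
	return len(strings.Fields(s))
}

// benchInputs are illustrative sample texts of roughly comparable size.
var benchInputs = map[string]string{
	"english": strings.Repeat("the quick brown fox jumps over the lazy dog ", 400),
	"mixed":   strings.Repeat("the quick brown fox 敏捷的棕色狐狸 ", 400),
	"chinese": strings.Repeat("敏捷的棕色狐狸跳过了懒惰的狗", 400),
}

func BenchmarkCountWords(b *testing.B) {
	for name, input := range benchInputs {
		b.Run(name, func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				_ = countWords(input)
			}
		})
	}
}
```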
Is the performance hit worth the improved accuracy? Probably, but I'm not sure.

Duplicate code

We calculate word count in several places:
It would be nice if we could DRY this up a bit.

Commit message

See https://github.com/gohugoio/hugo/blob/master/CONTRIBUTING.md#git-commit-message-guidelines. I suggest something like:
Other

This resolves #10031, but it does not improve the CJK word count itself: a Chinese word, for example, may consist of one or more runes, so our CJK word count is too high in some (most?) cases.
Thank you very much for your detailed feedback! I apologize for the delay; I was rather busy. Regarding the points you raised:

Performance

I've adopted your optimized version. I also adapted it slightly: for CJK texts, it only looks at the first character of each word to determine whether it is a CJK word. This should be accurate enough for normal text and leads to a further speedup. I've also added benchmarks for English text, mixed text, and Chinese text; the results looked something like this (times in nanoseconds):
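That adaptation might look something like the following; a minimal sketch with a helper name of my choosing rather than the PR's actual code. It would replace hasCJK(word) in the field loop of the earlier snippet:

```go
// startsWithCJK reports whether the first rune of s is a CJK character.
// Checking only the first rune of each field avoids scanning every rune
// of every word when classifying it.
func startsWithCJK(s string) bool {
	for _, r := range s {
		return unicode.In(r, unicode.Han, unicode.Hangul, unicode.Hiragana, unicode.Katakana)
	}
	return false
}
```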
To summarize, this is faster for pure CJK or English text, and slightly slower for mixed text. Note that the "counting spaces instead of words" technique will usually lead to a count that is off by one, since a string of n words contains only n − 1 separating spaces.

I've also changed the reading-time computation to go without an in-between float64 conversion, as you suggested. Note that this now leads to a reading time of zero (due to integer division) when both the CJK count is below 501 and the non-CJK count is below 213. This necessitated a few more test adaptations. I'm wondering if this is undesirable in any case?

Duplicate code

I eliminated the duplication in the places you mentioned, except for truncate.go: since it doesn't just count words but iterates through them one by one with custom behavior, I don't see the common function being usable there.

Other

I've adapted the PR title/description as you suggested. Also:
As I mentioned before, "word" is not a clearly defined concept in Chinese (or Japanese, for that matter; I am not at all familiar with Korean). Counting characters is really the best you can do. In contrast to Western languages, reading speed or length limitations for documents etc. are, as far as I know, always defined via character counts.
I've seen the CI failure at https://github.com/gohugoio/hugo/actions/runs/15906934442/job/44874052659?pr=10032
It seems like merging master has fixed the CI failure.
It's on my list, but not at the top.
I only spent a couple of minutes looking at this...
I wouldn't spend any time on this until I've had a chance to spend more time with it.
Understood!

Perhaps I am misunderstanding something, but I don't see any use case for the setting. To illustrate this with four hypothetical users:
Specifically about performance: imo the benchmarks show that the differences are rather small, but I assume you are more familiar with typical use cases and bottlenecks in Hugo than I am. To summarize, we see no change in performance for pure CJK and pure non-CJK texts, and ~25% worse performance on mixed texts. Do you think the word-counting algorithm is a significant factor in performance on a large site? Some numbers I used to put this into perspective for myself:
That was the initial point of this PR: Hugo already differentiated between CJK and non-CJK sections within one text (if the setting was activated), but it summed them both together for the final reading-time computation. Assume you have a mixed text containing 1500 non-CJK words and 1500 characters of Chinese text. Previously, with the setting off, this would calculate a word count of 1500 + 1 = 1501 (because there are usually no spaces in Chinese text) and then a reading time of 1501 / 213 ≈ 7 mins. With the setting on, it would compute a word count of 1500 + 1500 = 3000 and a reading time of 3000 / 501 ≈ 6 mins, i.e. treating the entire text as Chinese. I changed this to compute the counts separately (1500 English words and 1500 Chinese characters) and then compute the reading time with the correct formula for each, giving us 1500/213 + 1500/501 ≈ 10 mins. Since the reading-time computation formula changed, I also had to adapt the test results.
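Spelled out in code, with the rates from the discussion (213 non-CJK words and 501 CJK characters per minute; Hugo itself rounds to whole minutes, and the variable names here are illustrative):

```go
package main

import "fmt"

func main() {
	plainWords, cjkRunes := 1500.0, 1500.0

	oldOff := (plainWords + 1) / 213          // setting off: ≈ 7 mins
	oldOn := (plainWords + cjkRunes) / 501    // setting on: ≈ 6 mins
	newSplit := plainWords/213 + cjkRunes/501 // this PR: ≈ 10 mins

	fmt.Printf("%.1f %.1f %.1f\n", oldOff, oldOn, newSplit) // 7.0 6.0 10.0
}
```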
Okay, will revert this.
Fully agreed, I just didn't have an idea for a better name... happy to take suggestions though. Maybe …
Sorry, that question was unclear. Before, we did this:

`(wordcount + 212) / 213`

You changed it to this:

`wordcount / 213`

The original one is correct. Assume there are 200 words: the original calculation results in 1, yours results in zero. Integer division discards the remainder; there's no rounding.
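A minimal illustration of the difference; `(wordcount + 212) / 213` is the usual integer ceiling-division idiom (n + d − 1) / d:

```go
package main

import "fmt"

func main() {
	const wordsPerMinute = 213
	wordcount := 200

	ceil := (wordcount + wordsPerMinute - 1) / wordsPerMinute // rounds up: 1
	trunc := wordcount / wordsPerMinute                       // truncates: 0

	fmt.Println(ceil, trunc) // prints: 1 0
}
```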
How about just `hstrings.CountWords`? That's what it does.
True, but that's not what the code does. It uses …

Also, please squash commits with a final commit message of:
Then just force push changes thereafter. I'll look at the rest of this later.
Force-pushed from 3233d88 to 3e467b9.
Ah, sorry, that makes more sense.
Okay, done 👍🏼
Also done, but what is the point of this? Commits should be squashed and the commit message taken from the PR title and description automatically when this is merged anyway, no?
Closes #10031