
Conversation

@DeinAlptraum commented Jun 18, 2022

Closes #10031

@CLAassistant commented Jun 18, 2022

CLA assistant check
All committers have signed the CLA.

@github-actions commented

This PR has been automatically marked as stale because it has not had recent activity. The resources of the Hugo team are limited, and so we are asking for your help.
Please check https://github.com/gohugoio/hugo/blob/master/CONTRIBUTING.md#code-contribution and verify that this code contribution fits with the description. If yes, tell us in a comment.
This PR will automatically be closed in the near future if no further activity occurs. Thank you for all your contributions.

@github-actions github-actions bot added the Stale label Jun 19, 2023
@DeinAlptraum (Author) commented

Yes, I believe it fits the description, including closing an open issue.

@github-actions github-actions bot removed the Stale label Jun 20, 2023
@DeinAlptraum DeinAlptraum force-pushed the fix-reading-time-mixed-cjk branch from b200173 to fde4893 Compare May 23, 2025 09:23
@DeinAlptraum DeinAlptraum changed the title Fix reading time on mixed CJK/non-CJK text. Fixes #10031 Remove CJK language setting and fix reading time on mixed CJK/non-CJK text. Fixes #10031 May 23, 2025
@DeinAlptraum DeinAlptraum changed the title Remove CJK language setting and fix reading time on mixed CJK/non-CJK text. Fixes #10031 Remove CJK language setting and fix reading time on mixed text. Fixes #10031 May 23, 2025
@DeinAlptraum (Author) commented

I have updated this branch to resolve conflicts, and have also improved the implementation slightly.

@jmooring (Member) commented May 25, 2025

I'm not a linguistics expert, but my understanding is that CJK languages do not have explicit word separators like spaces in English. But we're relying on strings.Fields to separate words, so I'm not sure how this addresses #10031.

cc: @davidsneighbour

@DeinAlptraum (Author) commented

@jmooring no, there are no word separators in Chinese or Japanese. The concept of a "word" is not even clearly defined in those languages, which is why counting characters is the best one can do.
I just looked up Korean, and it seems it does use spaces. We could use a different formula there, or treat only Chinese/Japanese differently, if you prefer.

In either case, I don't think this affects the resolution of #10031: the problem is that, with hasCJKLanguage enabled, the reading time is computed as if the text consisted entirely of CJK characters. For example, a text containing one CJK character and 10,000 non-CJK words currently yields a reading time of about 20 minutes rather than the expected ~47 minutes, because the CJK formula ("501 runes per minute") is applied to the whole text instead of applying it only to the one CJK character and using "213 words per minute" for the remaining 10,000 words.
Splitting at spaces is not a problem even for CJK languages that do use spaces (like Korean), because we only count the number of runes, so spaces do not affect the result.

This PR resolves the issue with mixed text by computing separate counts: CJK characters are counted as runes, and everything else via word count. This is obviously still not perfect (is Korean read at a similar speed to Chinese? what about Arabic, for that matter?), but it is at least an improvement over the current situation, and it eliminates an unnecessary setting.
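
To sketch the idea (illustrative only, not the exact PR code; the constants are the reading speeds quoted above):

```go
// readingTimeMinutes sketches the mixed-text computation: count CJK runes
// and non-CJK words separately, then apply each group's own reading speed.
func readingTimeMinutes(cjkRunes, otherWords int) float64 {
	const cjkRunesPerMinute = 501.0
	const wordsPerMinute = 213.0
	return float64(cjkRunes)/cjkRunesPerMinute + float64(otherWords)/wordsPerMinute
}
```

With the example above, readingTimeMinutes(1, 10000) gives roughly 47 minutes instead of 20.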

@DeinAlptraum (Author) commented

@jmooring is there any more feedback or changes you'd like to see here?

@DeinAlptraum (Author) commented

I've fixed the formatting issues reported by the CI.

@jmooring (Member) commented Jun 5, 2025

Performance

I ran a quick performance comparison (before and after) of strings.WordCount on a page with 3430 words, none of which contain CJK characters. The new implementation takes 20-30% longer.

For strings.WordCount as well as *cachedContentScope.contentPlain, create some realistic benchmark tests (e.g., somewhere around 3000 words). Some ideas for optimization:

  • Test for rune existence in Unicode ranges instead of using a regular expression
  • If the string doesn't contain CJK, count spaces instead of splitting into words
  • My understanding is that integer division is faster than converting to float then dividing (look at reading time)

For example, here is an optimized version of strings.WordCount:

```go
// CountWordsOptimized returns the approximate word count in s.
func (ns *Namespace) CountWordsOptimized(s any) (int, error) {
	ss, err := cast.ToStringE(s)
	if err != nil {
		return 0, fmt.Errorf("failed to convert content to string: %w", err)
	}

	sss := tpl.StripHTML(ss)

	n := 0
	if hasCJK(sss) {
		for _, word := range strings.Fields(sss) {
			if hasCJK(word) {
				n += utf8.RuneCountInString(word)
			} else {
				n++
			}
		}
	} else {
		inWord := false
		for _, r := range sss {
			wasInWord := inWord
			inWord = !unicode.IsSpace(r)
			if inWord && !wasInWord {
				n++
			}
		}
	}

	return n, nil
}

// hasCJK reports whether the string s contains one or more Chinese, Japanese,
// or Korean (CJK) characters.
func hasCJK(s string) bool {
	for _, r := range s {
		if unicode.In(r, unicode.Han, unicode.Hangul, unicode.Hiragana, unicode.Katakana) {
			return true
		}
	}
	return false
}
```

A performance comparison for a string (approximately 3400 words) that does not contain any CJK characters:

```
INFO  timer:  name CountWordsOptimized count 100 duration 68.389745ms average 683.897µs median 597.067µs
INFO  timer:  name CountWordsOriginal count 100 duration 81.885859ms average 818.858µs median 698.433µs
INFO  timer:  name CountWordsFromPR10032 count 100 duration 102.590132ms average 1.025901ms median 869.076µs
```

A performance comparison for a string (approximately 3400 words) containing 2 CJK characters near the middle:

```
INFO  timer:  name CountWordsOriginal count 100 duration 62.440273ms average 624.402µs median 533.429µs
INFO  timer:  name CountWordsOptimized count 100 duration 88.763503ms average 887.635µs median 785.199µs
INFO  timer:  name CountWordsFromPR10032 count 100 duration 103.849739ms average 1.038497ms median 909.551µs
```

Is the performance hit worth the improved accuracy? Probably, but not sure.

Duplicate code

We calculate word count in several places:

  • hugolib/page__content.go
  • resources/page/page_markup.go
  • tpl/strings/strings.go
  • tpl/strings/truncate.go
  • helpers/content.go

It would be nice if we could DRY this up a bit.

Commit message

See https://github.com/gohugoio/hugo/blob/master/CONTRIBUTING.md#git-commit-message-guidelines. I suggest something like:

```
all: Improve and consolidate calculation of word count

Closes #10031
```

Other

This resolves #10031 but it does not improve the CJK word count itself as, for example, a Chinese word may consist of one or more runes. So our CJK word count is too high in some (most?) cases.

@DeinAlptraum DeinAlptraum changed the title Remove CJK language setting and fix reading time on mixed text. Fixes #10031 all: Improve and consolidate calculation of word count Jun 25, 2025
@DeinAlptraum (Author) commented Jun 25, 2025

Thank you very much for your detailed feedback! I apologize for the delay; I was rather busy.

Regarding the points you raised:

Performance

I've adopted your optimized version. I also adapted it slightly: for CJK texts, it only looks at the first character of each word to determine whether it is a CJK word. This should be accurate enough for normal text and leads to a further speedup.
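
A sketch of what that looks like (illustrative, not the exact PR code):

```go
import (
	"strings"
	"unicode"
	"unicode/utf8"
)

// countMixed classifies each whitespace-separated field by its first rune
// only: CJK fields contribute their rune count, everything else counts as
// one word.
func countMixed(s string) int {
	n := 0
	for _, field := range strings.Fields(s) {
		r, _ := utf8.DecodeRuneInString(field)
		if unicode.In(r, unicode.Han, unicode.Hangul, unicode.Hiragana, unicode.Katakana) {
			n += utf8.RuneCountInString(field) // CJK field: count runes
		} else {
			n++ // non-CJK field: one word
		}
	}
	return n
}
```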

I've also added benchmarks for English text, mixed text, and Chinese text; the results looked something like this (times in nanoseconds):

```
                            ASCII   Mixed   Chinese
Current master                400     170        73
This PR (first version)       520     550        73
Mooring Opt                   300     340        70
This PR (current version)     300     210        67
```

To summarize, this is faster for pure CJK or English text, and slightly slower for mixed text.
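
For reference, the benchmarks use Go's testing package; a minimal sketch (the input text and the function name/signature are illustrative assumptions, not the exact test code):

```go
package hstrings_test

import (
	"strings"
	"testing"

	"github.com/gohugoio/hugo/common/hstrings"
)

// mixedText approximates a ~4000-word post mixing English and Chinese.
var mixedText = strings.Repeat("The quick brown fox 跳过 the lazy dog. ", 500)

func BenchmarkCountWordsMixed(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = hstrings.CountWordsCJK(mixedText) // assumed name and signature
	}
}
```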

Note that the "counting spaces instead of words" technique will usually lead to a count off by 1 (since n words are separated by n-1 spaces).

I've also changed the reading time computation to avoid the intermediate float64 conversion, as you suggested. Note that this now leads to a reading time of zero (due to integer division) when the CJK count is below 501 and the non-CJK count is below 213, which necessitated a few more test adaptations. I'm wondering whether that is undesirable in any case?

Duplicate code

I eliminated the duplication in the places you mentioned, except for truncate.go: since it doesn't just count words but iterates through them one by one with custom behavior, I don't see how the common function could be used there.
I've put the common word-counting function into common/hstrings/strings.go; I'm not that familiar with the structure of the project, so I hope this is an acceptable place.

Other

I've adapted the PR title/description as you suggested.

Also:

This resolves #10031 but it does not improve the CJK word count itself as, for example, a Chinese word may consist of one or more runes. So our CJK word count is too high in some (most?) cases.

As I mentioned before, "word" is not a clearly defined concept in Chinese (or Japanese, for that matter; I am not at all familiar with Korean), so counting characters is really the best you can do. In contrast to Western languages, reading speeds and length limits for documents are, afaik, always defined via character counts.
The only problem I see is that labeling this a "word count" is then a bit confusing, but it is still an improvement over counting a space-less block of 1000 Chinese characters as a single word, which can happen on current master. And at least for the reading-time computation, this is already a complete solution.

@DeinAlptraum (Author) commented

I've seen the CI failure at https://github.com/gohugoio/hugo/actions/runs/15906934442/job/44874052659?pr=10032, but it doesn't seem related to my change, or at least I don't know how to interpret the error. I'll update the branch just in case.

@DeinAlptraum (Author) commented

Seems like merging master has fixed the CI failure.
@jmooring do you have any further feedback on this?

@jmooring (Member) commented Jul 3, 2025

do you have any further feedback on this

It's on my list, but not at the top.

@jmooring (Member) commented Jul 3, 2025

I only spent a couple of minutes looking at this...

  • Why remove the site config value? If hasCJKLanguage is false (the default) we can bypass detection. Think about a site with 1000 pages in 12 languages, and only one of the languages is CJK. That's a lot of unnecessary detection.
  • I haven't thought this through, but it seems like deprecating the isCJKLanguage front matter field in favor of a hasCJKLanguage front matter field would be the right thing to do. For example, you could set the site level hasCJKLanguage to false (the default), and set the hasCJKLanguage front matter field to true (override) where needed, either directly in front matter or via cascade. Again, a way to bypass the detection logic.
  • Why did you change the reading time calculation and expected test results?
  • Don't bother editing docs.yaml; it's generated from code.
  • I think we need a better name for hstrings.CountWordsCJK, because that's not what it does.

I wouldn't spend any time on this until I've had a chance to spend more time with it.

@DeinAlptraum (Author) commented

I wouldn't spend any time on this until I've had a chance to spend more time with it.

Understood!
Regarding your questions:

Why remove the site config value? If hasCJKLanguage is false (the default) we can bypass detection. Think about a site with 1000 pages in 12 languages, and only one of the languages is CJK. That's a lot of unnecessary detection.

and

I haven't thought this through, but it seems like deprecating the isCJKLanguage front matter field in favor of a hasCJKLanguage front matter field would be the right thing to do. For example, you could set the site level hasCJKLanguage to false (the default), and set the hasCJKLanguage front matter field to true (override) where needed, either directly in front matter or via cascade. Again, a way to bypass the detection logic.

Perhaps I am misunderstanding something, but I don't see any use case for setting hasCJKLanguage to false. When I initially started writing this PR, I wasn't planning to remove the setting, but I eventually did so after realizing that there was no longer any point in having it, since Hugo could now differentiate between CJK and non-CJK text and treat each appropriately.

To illustrate this with four hypothetical users:

  1. Pure English writer: has no need for any CJK support, so they don't care about this setting. You could say this change doesn't affect them, though technically they benefit from faster word counts, since your optimization also speeds up counting for non-CJK text.
  2. English writer with occasional CJK in their texts: the amount of CJK characters is small enough that it has no significant effect on reading times and word counts, so the setting is irrelevant to them, and with this PR they would only see slightly slower word counting (the Mixed case in my benchmarks above).
  3. English writer with significant CJK in their texts (e.g. someone discussing literature and citing passages for a bilingual audience): this is the niche group benefiting most from this change. On current master, it is impossible for them to get accurate reading times (the text is treated either as entirely English or entirely CJK), so this PR improves their situation. They also get more sensible word counts, at the cost of slightly worse performance.
  4. Pure CJK writer: with this change they get accurate word counts and reading times without having to change the setting first. Otherwise they are unaffected.

Specifically about

That's a lot of unnecessary detection.

Imo the benchmarks show that the differences are rather small, but I assume you are more familiar with typical use cases and bottlenecks in Hugo than I am. To summarize: there is no change in performance for pure CJK and pure non-CJK texts, and ~25% worse performance on mixed texts. Do you think the word-counting algorithm is a significant factor in the performance of a large site?

Some numbers I used to put this into perspective for myself:

  • On my laptop, word counting on a ~4000-word mixed text takes 170 microseconds before this change and 210 microseconds after. That is a difference of 40 microseconds for what I assume is a post of fairly average length, which is very small imo.
  • My own blog, consisting of 84 pages, takes ~750 ms to build. If all 84 pages were ~4000-word mixed texts, the build would be slowed down by about 40 * 84 = 3360 microseconds ≈ 3 ms. In other words, the effect is completely negligible, and that is under worst-case assumptions that will never hold (the vast majority of pages are not 4000-word mixed texts).
  • Finally, consider the more realistic example you provided: a site with 1000 pages in 12 languages, of which one is CJK. For simplicity's sake, assume all pages are ~4000 words long and the CJK pages make up 1/12 of the total (~83 pages). This PR does not hurt performance on the 917 non-CJK pages (technically it will be about 25% faster thanks to your optimization), so those cost about 400 microseconds per page before and after. For the 83 CJK pages, performance is unaffected if they are pure CJK text, so assume mixed text again for the worst case: 170 microseconds per page before and 210 after. That gives 917 * 400 + 83 * 170 ≈ 381 ms before and 917 * 400 + 83 * 210 ≈ 384 ms after, a difference of 3 ms, or a bit less than 1%. And that is for the word counting alone; the proportion is even smaller relative to the entire site build.

Why did you change the reading time calculation and expected test results?

That was the initial point of this PR: Hugo already differentiated between CJK and non-CJK sections within one text (if the setting was enabled), but it summed both counts together for the final reading-time computation. Assume a mixed text containing 1500 non-CJK words and 1500 characters of Chinese text. Previously, with the setting off, this produced a word count of 1500 + 1 = 1501 (Chinese text usually contains no spaces) and a reading time of 1501 / 213 ≈ 7 minutes. With the setting on, it produced a word count of 1500 + 1500 = 3000 and a reading time of 3000 / 501 ≈ 6 minutes, i.e. the entire text was treated as Chinese. I changed this to compute the counts separately (1500 English words and 1500 Chinese characters) and apply the correct formula to each, giving 1500/213 + 1500/501 ≈ 10 minutes.

Since the reading time computation formula changed, I also had to adapt the test results.

Don't bother editing docs.yaml; it's generated from code.

Okay, will revert this.

I think we need a better name for hstrings.CountWordsCJK, because that's not what it does.

Fully agreed, I just didn't have an idea for a better name... happy to take suggestions though. Maybe CountWordsCJKAware or something similar?

@jmooring (Member) commented Jul 3, 2025

Why did you change the reading time calculation

Sorry, that question was unclear. Before we did this:

(wordcount + 212) / 213

You changed it to this:

wordcount / 213

The original one is correct. Assume there are 200 words. The original calc results in 1, yours results in zero. Integer division discards the remainder; there's no rounding.
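
In other words, adding divisor minus one before an integer division rounds up; a minimal illustration:

```go
// (n + d - 1) / d is ceiling division in integer arithmetic. With d = 213,
// (wordCount + 212) / 213 turns 200 words into 1 minute, not 0.
func readingTimeCeil(wordCount int) int {
	return (wordCount + 212) / 213
}
```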

idea for a better name

How about just `hstrings.CountWords`? That's what it does.

Note that the "counting spaces instead of words" technique will usually lead to a count off by 1

True, but that's not what the code does. It uses unicode.IsSpace to detect word boundaries and counts the words.

Also, please squash commits with a final commit message of

```
all: Improve and consolidate calculation of word count

Closes #10031
```

Then just force push changes thereafter.

I'll look at the rest of this later.

@DeinAlptraum DeinAlptraum force-pushed the fix-reading-time-mixed-cjk branch from 3233d88 to 3e467b9 Compare July 7, 2025 20:36
@DeinAlptraum (Author) commented Jul 7, 2025

Sorry, that question was unclear.

Ah, sorry, that makes more sense.
Since we now have two separate counts, I changed this because the ceiling-style formula doesn't quite make sense when there is a low non-zero count of both kinds: e.g. one English word and one Chinese character would lead to a reading time of 2 minutes. But I guess this is an extremely unlikely edge case... so I'll change this back and compute both terms as before.
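
A sketch of the restored computation (constants as discussed above; illustrative, not the exact PR code):

```go
// readingTime applies the original ceiling-style division to each count
// separately; one English word plus one Chinese character gives 1 + 1 = 2
// minutes, the edge case mentioned above.
func readingTime(cjkRunes, words int) int {
	return (cjkRunes+500)/501 + (words+212)/213
}
```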

How about just `hstrings.CountWords`?

Okay, done 👍🏼

Also, please squash commits with a final commit message of

Also done, but what is the point of this? Commits should be squashed and the commit message taken from the PR title and description automatically when this is merged anyway, no?
