Commit d2a1853
Optimize traversing the DOM when analyzing text content
Previously, this was an often-repeated construct in the readability implementation:
charCount(ps.getInnerText(node, true)) == 0
What this would do is:
- Call `dom.TextContent(node)` to append the contents of all individual text nodes together;
- Pass the result through `strings.TrimSpace`;
- Pass the result through the NormalizeSpaces regex which squashes consecutive runs of whitespace;
- Count the Unicode runes of the result;
- Finally, if the count is zero, the element would be considered "empty".
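The steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `Node` type is a hypothetical stand-in for a DOM node, `textContent` and `isEmptySlow` are made-up names, and the `\s{2,}` pattern is only an assumption about what the NormalizeSpaces regex looks like.

```go
package main

import (
	"regexp"
	"strings"
	"unicode/utf8"
)

// Node is a hypothetical, simplified stand-in for a DOM node.
// A text node carries Text; an element node carries Children.
type Node struct {
	Text     string
	Children []*Node
}

// rxNormalizeSpaces is an assumed pattern for the NormalizeSpaces regex.
var rxNormalizeSpaces = regexp.MustCompile(`\s{2,}`)

// textContent appends the contents of every text node in the subtree,
// allocating a string proportional to the size of the subtree.
func textContent(n *Node) string {
	var sb strings.Builder
	var walk func(*Node)
	walk = func(cur *Node) {
		sb.WriteString(cur.Text)
		for _, child := range cur.Children {
			walk(child)
		}
	}
	walk(n)
	return sb.String()
}

// isEmptySlow mirrors the old steps: concatenate, trim, normalize,
// count runes, and compare the count with zero.
func isEmptySlow(n *Node) bool {
	text := strings.TrimSpace(textContent(n))
	text = rxNormalizeSpaces.ReplaceAllString(text, " ")
	return utf8.RuneCountInString(text) == 0
}
```

Every call builds the full text of the subtree and runs a regex over it, even though the only thing the caller wants is a yes/no answer.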
The above is an example of an incredibly costly operation that could be done much more efficiently, for example: walk the DOM subtree until the first non-space character is found, then bail out and conclude that the element has content. This barely needs any memory allocations, and is the approach taken in this PR to address a variety of counting or detecting tasks that share a similar purpose.
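A rough sketch of the early-exit idea, under the same assumptions as before: `Node` is a hypothetical simplified DOM node and `hasContent` is an illustrative name, not necessarily the function introduced by this commit.

```go
package main

import "unicode"

// Node is a hypothetical, simplified stand-in for a DOM node.
// A text node carries Text; an element node carries Children.
type Node struct {
	Text     string
	Children []*Node
}

// hasContent walks the subtree depth-first and returns true as soon as
// it sees the first non-space character. It builds no intermediate
// strings, so it needs no heap allocations of its own.
func hasContent(n *Node) bool {
	for _, r := range n.Text {
		if !unicode.IsSpace(r) {
			return true // bail out: the element is not empty
		}
	}
	for _, child := range n.Children {
		if hasContent(child) {
			return true
		}
	}
	return false
}
```

With this shape, a check such as `charCount(...) == 0` becomes `!hasContent(node)`, and the cost is bounded by the distance to the first non-space character rather than by the size of the whole subtree.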
A before vs. after benchmark on a large HTML document shows a significant saving in memory allocations (roughly 81% fewer bytes and 40% fewer allocations per operation):
variant | times | ns/op | Bytes/op | allocs/op
--------|-------|------------|------------|----------
before | 30 | 38,986,203 | 59,623,683 | 199,876
after | 36 | 31,910,769 | 11,449,004 | 119,810
1 parent 9f5bf5c commit d2a1853
7 files changed, +355 -285 lines changed
File tree: internal/re2go