Issue #688 ignore extracted strings longer than 2048 characters #697
ato merged 3 commits into internetarchive:master
CharSequence attrName = cs.subSequence(attr.start(1),attr.end(1));
value = TextUtils.unescapeHtml(value);
if (attr.start(2) > -1) {
if (value.length() == getMaxAttributeValLength() && end < cs.length()) {
If value contains any HTML escapes (such as &amp;), then its length will be changed by the value = TextUtils.unescapeHtml(value) line above and won't match getMaxAttributeValLength(). Maybe we should check the original length from the regex (end - start), not the length after unescaping?
You're right, the check must be done before the unescaping.
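To illustrate the point above, here is a minimal sketch of why the comparison has to use the raw matched length. It is not the code from this PR: MAX_ATTRIBUTE_VAL_LENGTH and unescapeAmp are hypothetical stand-ins for getMaxAttributeValLength() and TextUtils.unescapeHtml, and the limit is set to 10 only to keep the example short.

```java
class RawLengthCheckSketch {
    // Hypothetical stand-in for getMaxAttributeValLength(); tiny on purpose.
    static final int MAX_ATTRIBUTE_VAL_LENGTH = 10;

    // Hypothetical, deliberately tiny stand-in for TextUtils.unescapeHtml:
    // it only handles the &amp; entity.
    static String unescapeAmp(String s) {
        return s.replace("&amp;", "&");
    }

    // Compare the raw span matched by the regex (end - start) against the limit,
    // so that unescaping cannot hide the fact that the limit was reached.
    static boolean hitLengthLimit(int start, int end) {
        return (end - start) == MAX_ATTRIBUTE_VAL_LENGTH;
    }

    public static void main(String[] args) {
        String raw = "a&amp;bcde";            // 10 raw characters: exactly at the limit
        String unescaped = unescapeAmp(raw);  // "a&bcde": only 6 characters

        // Checking after unescaping misses the fact that the limit was reached.
        System.out.println(unescaped.length() == MAX_ATTRIBUTE_VAL_LENGTH);
        // Checking the raw span still detects it.
        System.out.println(hitLengthLimit(0, raw.length()));
    }
}
```

With this ordering, the unescaping can still happen afterwards on values that pass the check.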
// Check if it's really the end of the string
if (nextChar != '"' && nextChar != '\'' && !Character.isWhitespace(nextChar)) {
This check seems to have no effect other than the log message so I'm a bit confused why we're doing it. It feels like there's something missing here, like a continue or being paired with an else?
The CharSequence cs contains all of the tag content (tag name, all the attributes and their values). Even if the length of value equals getMaxAttributeValLength() and end is less than the length of cs, we cannot be sure that the value is truncated, so we need to check whether the next character is a single quote, a double quote, or whitespace. However, the two if statements need to be merged, or something like that.
In the new version of this fix, after checking whether the value is really truncated, we move the matcher's pointer to the actual end of the value. That way, if there is another attribute to parse, the truncated part of the value is not treated as a potential attribute.

Thanks!
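For readers following along, the approach described in this thread can be sketched roughly as below. This is a simplified illustration, not the code merged in this PR: the ATTR pattern, MAX_VAL_LENGTH (set to 8 instead of 2048) and the parse method are assumptions made up for the example, and dropping truncated values entirely is also an assumption based on the PR title.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TruncatedAttributeSketch {
    // Hypothetical limit standing in for getMaxAttributeValLength().
    static final int MAX_VAL_LENGTH = 8;

    // Simplified, hypothetical attribute pattern: name, '=', optional opening
    // double quote, then at most MAX_VAL_LENGTH characters of value.
    static final Pattern ATTR =
            Pattern.compile("(\\w+)=\"?([^\"'\\s]{0," + MAX_VAL_LENGTH + "})");

    static boolean isTerminator(char c) {
        return c == '"' || c == '\'' || Character.isWhitespace(c);
    }

    static Map<String, String> parse(CharSequence cs) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher attr = ATTR.matcher(cs);
        int pos = 0;
        while (attr.find(pos)) {
            int start = attr.start(2);
            int end = attr.end(2);
            // The value is truncated only if it reached the limit on its raw
            // length AND the next character does not terminate the value.
            boolean truncated = (end - start) == MAX_VAL_LENGTH
                    && end < cs.length()
                    && !isTerminator(cs.charAt(end));
            if (truncated) {
                // Move to the actual end of the value so that its tail is not
                // parsed as another attribute.
                while (end < cs.length() && !isTerminator(cs.charAt(end))) {
                    end++;
                }
            } else {
                attrs.put(attr.group(1), attr.group(2));
            }
            pos = end; // resume the search after the value
        }
        return attrs;
    }

    public static void main(String[] args) {
        // Without the skip, "q=1&id=5" inside the URL would match as an attribute.
        System.out.println(parse("href=\"http://e.com/?q=1&id=5\" alt=\"ok\"")); // prints {alt=ok}
    }
}
```

A value that is exactly MAX_VAL_LENGTH characters long and is followed by its closing quote is kept, which is the case the next-character check is meant to protect.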
Fix #688 + unit tests (#689)