Skip to content

Issue #688 ignore extracted strings longer than 2048 characters#697

Merged
ato merged 3 commits intointernetarchive:masterfrom
bnfleb:bnf_2025
Dec 9, 2025
Merged

Issue #688 ignore extracted strings longer than 2048 characters#697
ato merged 3 commits intointernetarchive:masterfrom
bnfleb:bnf_2025

Conversation

@bnfleb
Copy link
Copy Markdown
Contributor

@bnfleb bnfleb commented Dec 4, 2025

Fix #688 + unit tests (#689)

CharSequence attrName = cs.subSequence(attr.start(1),attr.end(1));
value = TextUtils.unescapeHtml(value);
if (attr.start(2) > -1) {
if (value.length() == getMaxAttributeValLength() && end < cs.length()) {
Copy link
Copy Markdown
Collaborator

@ato ato Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If value contains any HTML escapes (such as &amp;) then the length of value will be changed by value = TextUtils.unescapeHtml(value) line above and won't match getMaxAttributeValLength(). Maybe we should be checking the original length from the regex (end - start) not the length after unescaping?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, the check must be done before the unescaping

Comment on lines +441 to +442
// Check if it's really the end of the string
if (nextChar != '"' && nextChar != '\'' && !Character.isWhitespace(nextChar)) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check seems to have no effect other than the log message so I'm a bit confused why we're doing it. It feels like there's something missing here, like a continue or being paired with an else?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CharSequence cs contains all of the tag content (tag name, all the attributes and their values). Even if the length of value equals getMaxAttributeValLength() and the value of end is lesser than the length of cs, we are not sure if the value is truncated, so we need to check if the next value is a quote, double quote or a space. However, the two if needs to be merged or something like that.

@bnfleb
Copy link
Copy Markdown
Contributor Author

bnfleb commented Dec 8, 2025

In the new version of this fix, after checking if the value is really truncated, we move the pointer of the matcher at the actual end of the value. Like this, if there is another attribute to parse, the truncated part of the value is not considered as a potential attribute.

@ato ato merged commit 4bcd5b3 into internetarchive:master Dec 9, 2025
3 checks passed
@ato
Copy link
Copy Markdown
Collaborator

ato commented Dec 9, 2025

Thanks!

@ato ato linked an issue Dec 21, 2025 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fixes on unit test on HTML parser Heritrix should not consider strings starting with "data:" as potential URLs

2 participants