Issue #688 ignore extracted strings longer than 2048 characters by bnfleb · Pull Request #697 · internetarchive/heritrix3

bnfleb · 2025-12-04T13:56:37Z

Fix #688 + unit tests (#689)

ato · 2025-12-05T01:45:12Z

modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java

            CharSequence attrName = cs.subSequence(attr.start(1),attr.end(1));
            value = TextUtils.unescapeHtml(value);
-            if (attr.start(2) > -1) {
+            if (value.length() == getMaxAttributeValLength() && end < cs.length()) {


If value contains any HTML escapes (such as &amp;), then its length will be changed by the value = TextUtils.unescapeHtml(value) line above and won't match getMaxAttributeValLength(). Maybe we should be checking the original length from the regex (end - start), not the length after unescaping?

You're right, the check must be done before the unescaping.
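To illustrate the point above, here is a minimal sketch of comparing the raw matched length (end - start) against the limit before any unescaping happens. The class name, the helper names, and the tiny unescapeAmp stand-in are hypothetical and only handle &amp;; this is not the code from ExtractorHTML or TextUtils.

```java
final class RawLengthCheck {
    // Hypothetical stand-in for TextUtils.unescapeHtml: handles only "&amp;".
    static String unescapeAmp(String s) {
        return s.replace("&amp;", "&");
    }

    // Compare the raw span length from the regex with the limit,
    // before unescaping can shrink the value.
    static boolean hitLimit(CharSequence cs, int start, int end, int maxLen) {
        return (end - start) == maxLen && end < cs.length();
    }

    // The problematic variant: the length is measured after unescaping,
    // so a value containing escapes no longer equals the limit.
    static boolean hitLimitAfterUnescape(CharSequence cs, int start, int end, int maxLen) {
        String value = unescapeAmp(cs.subSequence(start, end).toString());
        return value.length() == maxLen && end < cs.length();
    }
}
```

With a raw value of a&amp;b (7 characters) and a limit of 7, the first check reports that the limit was reached, while the second does not, because the unescaped value is only 3 characters long.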

ato · 2025-12-05T01:52:59Z

modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java

+                // Check if it's really the end of the string
+                if (nextChar != '"' && nextChar != '\'' && !Character.isWhitespace(nextChar)) {


This check seems to have no effect other than the log message so I'm a bit confused why we're doing it. It feels like there's something missing here, like a continue or being paired with an else?

The CharSequence cs contains all of the tag content (the tag name plus every attribute and its value). Even if the length of value equals getMaxAttributeValLength() and end is less than the length of cs, we cannot be sure the value was truncated, so we need to check whether the next character is a single quote, a double quote, or whitespace. However, the two if statements need to be merged, or something like that.
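One possible way to merge the two conditions is sketched below, under the assumption that a value counts as truncated only when it reached the limit and the character right after it is not a quote or whitespace. The class and method names are hypothetical; rawLen stands for end - start measured before unescaping.

```java
final class TruncationCheck {
    // Returns true only when the value reached the limit AND the character
    // following it does not terminate an attribute value.
    static boolean looksTruncated(CharSequence cs, int end, int rawLen, int maxLen) {
        if (rawLen != maxLen || end >= cs.length()) {
            return false; // shorter than the limit, or nothing follows the value
        }
        char nextChar = cs.charAt(end);
        return nextChar != '"' && nextChar != '\'' && !Character.isWhitespace(nextChar);
    }
}
```

For src="abcdefgh" with a limit of 4, the value abcd is followed by e, so it is reported as truncated. With a limit of 8, the value is followed by the closing quote and is treated as complete.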

bnfleb · 2025-12-08T16:09:36Z

In the new version of this fix, after checking whether the value is really truncated, we move the matcher's pointer to the actual end of the value. This way, if there is another attribute to parse, the truncated part of the value is not treated as a potential attribute.
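The idea can be sketched with java.util.regex as follows. This is a simplified illustration, not the ExtractorHTML implementation: the pattern, the class name, and the skipRemainder flag are assumptions made for the example, and only double-quoted or unquoted values are handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SkipTruncatedValue {
    // Collects attribute names, ignoring attributes whose value exceeds maxLen.
    // When skipRemainder is true, the matcher is moved to the real end of the
    // over-long value so its tail is not parsed as another attribute.
    static List<String> attributeNames(CharSequence cs, int maxLen, boolean skipRemainder) {
        Pattern p = Pattern.compile("(\\w+)=\"?([^\"\\s]{0," + maxLen + "})");
        Matcher m = p.matcher(cs);
        List<String> names = new ArrayList<>();
        while (m.find()) {
            int end = m.end(2);
            boolean truncated = (end - m.start(2)) == maxLen
                    && end < cs.length()
                    && cs.charAt(end) != '"'
                    && !Character.isWhitespace(cs.charAt(end));
            if (truncated) {
                if (skipRemainder) {
                    // Scan forward to the closing quote or whitespace.
                    while (end < cs.length() && cs.charAt(end) != '"'
                            && !Character.isWhitespace(cs.charAt(end))) {
                        end++;
                    }
                    // region() resets the matcher; the next find() starts at 'end'.
                    m.region(end, cs.length());
                }
                continue; // the over-long value itself is ignored
            }
            names.add(m.group(1));
        }
        return names;
    }
}
```

For the input src="aaaaaaaa?b=c" alt="ok" with a limit of 4, the version without the skip reports a bogus attribute b taken from the tail of the src value, while the version with the skip reports only alt.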

ato · 2025-12-09T07:58:25Z

Thanks!

bnfleb added 2 commits December 4, 2025 14:54

Issue internetarchive#688 ignore extracted strings longer than 2048 characters

fb791e2

Issue internetarchive#689 unit tests for image attributes

305204a

ato reviewed Dec 5, 2025

when skipping too long value, move matcher pointer to the end of value

771452e

ato merged commit 4bcd5b3 into internetarchive:master Dec 9, 2025
3 checks passed

ato linked an issue Dec 21, 2025 that may be closed by this pull request

Fixes on unit test on HTML parser #689

Closed

ato merged 3 commits into internetarchive:master from bnfleb:bnf_2025

2 participants