Commit 4d50f4e

ehsandeep, GDATTACKER-RESEARCHER, dependabot[bot], and Mzack9999 authored
bug fixes (#2334)
* Improve error handling in htmlToText function

  Enhance htmlToText function to handle panics and errors safely.

  panic: html: open stack of elements exceeds 512 nodes

  goroutine 5523922 [running]:
  github.com/projectdiscovery/httpx/common/pagetypeclassifier.htmlToText(...)
      /home/runner/work/httpx/httpx/common/pagetypeclassifier/pagetypeclassifier.go:36
  github.com/projectdiscovery/httpx/common/pagetypeclassifier.(*PageTypeClassifier).Classify(0xc0005164d8, {0xc0ba03a000?, 0xd?})
      /home/runner/work/httpx/httpx/common/pagetypeclassifier/pagetypeclassifier.go:26 +0x6f
  github.com/projectdiscovery/httpx/runner.(*Runner).analyze(_, _, {_, _}, {{0xc00470c450, 0xb}, {0x0, 0x0}, {0x0, 0x0}}, ...)
      /home/runner/work/httpx/httpx/runner/runner.go:2349 +0x7555
  github.com/projectdiscovery/httpx/runner.(*Runner).process.func1({{0xc00470c450, 0xb}, {0x0, 0x0}, {0x0, 0x0}}, {0x1686161?, 0x10?}, {0x16ace2d, 0xa})
      /home/runner/work/httpx/httpx/runner/runner.go:1444 +0x125
  created by github.com/projectdiscovery/httpx/runner.(*Runner).process in goroutine 1
      /home/runner/work/httpx/httpx/runner/runner.go:1442 +0x8a6

* chore(deps): bump golang.org/x/text from 0.30.0 to 0.31.0

  Bumps [golang.org/x/text](https://github.com/golang/text) from 0.30.0 to 0.31.0.
  - [Release notes](https://github.com/golang/text/releases)
  - [Commits](golang/text@v0.30.0...v0.31.0)

  ---
  updated-dependencies:
  - dependency-name: golang.org/x/text
    dependency-version: 0.31.0
    dependency-type: direct:production
    update-type: version-update:semver-minor
  ...

  Signed-off-by: dependabot[bot] <[email protected]>

* chore(deps): bump golang.org/x/net from 0.46.0 to 0.47.0

  Bumps [golang.org/x/net](https://github.com/golang/net) from 0.46.0 to 0.47.0.
  - [Commits](golang/net@v0.46.0...v0.47.0)

  ---
  updated-dependencies:
  - dependency-name: golang.org/x/net
    dependency-version: 0.47.0
    dependency-type: direct:production
    update-type: version-update:semver-minor
  ...
Signed-off-by: dependabot[bot] <[email protected]> * chore(deps): bump github.com/PuerkitoBio/goquery from 1.10.3 to 1.11.0 Bumps [github.com/PuerkitoBio/goquery](https://github.com/PuerkitoBio/goquery) from 1.10.3 to 1.11.0. - [Release notes](https://github.com/PuerkitoBio/goquery/releases) - [Commits](PuerkitoBio/goquery@v1.10.3...v1.11.0) --- updated-dependencies: - dependency-name: github.com/PuerkitoBio/goquery dependency-version: 1.11.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> * chore(deps): bump the modules group with 5 updates Bumps the modules group with 5 updates: | Package | From | To | | --- | --- | --- | | [github.com/projectdiscovery/cdncheck](https://github.com/projectdiscovery/cdncheck) | `1.2.9` | `1.2.10` | | [github.com/projectdiscovery/gologger](https://github.com/projectdiscovery/gologger) | `1.1.59` | `1.1.60` | | [github.com/projectdiscovery/networkpolicy](https://github.com/projectdiscovery/networkpolicy) | `0.1.27` | `0.1.28` | | [github.com/projectdiscovery/utils](https://github.com/projectdiscovery/utils) | `0.6.1-0.20251030144701-ce5c4b44e1e6` | `0.6.1` | | [github.com/projectdiscovery/wappalyzergo](https://github.com/projectdiscovery/wappalyzergo) | `0.2.54` | `0.2.55` | Updates `github.com/projectdiscovery/cdncheck` from 1.2.9 to 1.2.10 - [Release notes](https://github.com/projectdiscovery/cdncheck/releases) - [Changelog](https://github.com/projectdiscovery/cdncheck/blob/main/.goreleaser.yaml) - [Commits](projectdiscovery/cdncheck@v1.2.9...v1.2.10) Updates `github.com/projectdiscovery/gologger` from 1.1.59 to 1.1.60 - [Release notes](https://github.com/projectdiscovery/gologger/releases) - [Commits](projectdiscovery/gologger@v1.1.59...v1.1.60) Updates `github.com/projectdiscovery/networkpolicy` from 0.1.27 to 0.1.28 - [Release notes](https://github.com/projectdiscovery/networkpolicy/releases) - 
[Commits](projectdiscovery/networkpolicy@v0.1.27...v0.1.28) Updates `github.com/projectdiscovery/utils` from 0.6.1-0.20251030144701-ce5c4b44e1e6 to 0.6.1 - [Release notes](https://github.com/projectdiscovery/utils/releases) - [Changelog](https://github.com/projectdiscovery/utils/blob/main/CHANGELOG.md) - [Commits](https://github.com/projectdiscovery/utils/commits/v0.6.1) Updates `github.com/projectdiscovery/wappalyzergo` from 0.2.54 to 0.2.55 - [Release notes](https://github.com/projectdiscovery/wappalyzergo/releases) - [Commits](projectdiscovery/wappalyzergo@v0.2.54...v0.2.55) --- updated-dependencies: - dependency-name: github.com/projectdiscovery/cdncheck dependency-version: 1.2.10 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules - dependency-name: github.com/projectdiscovery/gologger dependency-version: 1.1.60 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules - dependency-name: github.com/projectdiscovery/networkpolicy dependency-version: 0.1.28 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules - dependency-name: github.com/projectdiscovery/utils dependency-version: 0.6.1 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules - dependency-name: github.com/projectdiscovery/wappalyzergo dependency-version: 0.2.55 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules ... Signed-off-by: dependabot[bot] <[email protected]> * better error handling * chore(deps): bump golang.org/x/crypto from 0.44.0 to 0.45.0 Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.44.0 to 0.45.0. - [Commits](golang/crypto@v0.44.0...v0.45.0) --- updated-dependencies: - dependency-name: golang.org/x/crypto dependency-version: 0.45.0 dependency-type: indirect ... 
Signed-off-by: dependabot[bot] <[email protected]> * adding panic guard + tests * lint * chore(deps): bump github.com/weppos/publicsuffix-go Bumps [github.com/weppos/publicsuffix-go](https://github.com/weppos/publicsuffix-go) from 0.50.0 to 0.50.1. - [Changelog](https://github.com/weppos/publicsuffix-go/blob/main/CHANGELOG.md) - [Commits](weppos/publicsuffix-go@v0.50.0...v0.50.1) --- updated-dependencies: - dependency-name: github.com/weppos/publicsuffix-go dependency-version: 0.50.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> * chore(deps): bump the modules group with 11 updates Bumps the modules group with 11 updates: | Package | From | To | | --- | --- | --- | | [github.com/projectdiscovery/cdncheck](https://github.com/projectdiscovery/cdncheck) | `1.2.10` | `1.2.11` | | [github.com/projectdiscovery/dsl](https://github.com/projectdiscovery/dsl) | `0.8.4` | `0.8.5` | | [github.com/projectdiscovery/fastdialer](https://github.com/projectdiscovery/fastdialer) | `0.4.15` | `0.4.17` | | [github.com/projectdiscovery/gologger](https://github.com/projectdiscovery/gologger) | `1.1.60` | `1.1.61` | | [github.com/projectdiscovery/hmap](https://github.com/projectdiscovery/hmap) | `0.0.95` | `0.0.96` | | [github.com/projectdiscovery/networkpolicy](https://github.com/projectdiscovery/networkpolicy) | `0.1.28` | `0.1.29` | | [github.com/projectdiscovery/retryablehttp-go](https://github.com/projectdiscovery/retryablehttp-go) | `1.0.131` | `1.0.132` | | [github.com/projectdiscovery/tlsx](https://github.com/projectdiscovery/tlsx) | `1.2.1` | `1.2.2` | | [github.com/projectdiscovery/useragent](https://github.com/projectdiscovery/useragent) | `0.0.102` | `0.0.103` | | [github.com/projectdiscovery/utils](https://github.com/projectdiscovery/utils) | `0.6.1` | `0.7.1` | | [github.com/projectdiscovery/wappalyzergo](https://github.com/projectdiscovery/wappalyzergo) | `0.2.55` | `0.2.56` | Updates 
`github.com/projectdiscovery/cdncheck` from 1.2.10 to 1.2.11 - [Release notes](https://github.com/projectdiscovery/cdncheck/releases) - [Changelog](https://github.com/projectdiscovery/cdncheck/blob/main/.goreleaser.yaml) - [Commits](projectdiscovery/cdncheck@v1.2.10...v1.2.11) Updates `github.com/projectdiscovery/dsl` from 0.8.4 to 0.8.5 - [Release notes](https://github.com/projectdiscovery/dsl/releases) - [Commits](projectdiscovery/dsl@v0.8.4...v0.8.5) Updates `github.com/projectdiscovery/fastdialer` from 0.4.15 to 0.4.17 - [Release notes](https://github.com/projectdiscovery/fastdialer/releases) - [Commits](projectdiscovery/fastdialer@v0.4.15...v0.4.17) Updates `github.com/projectdiscovery/gologger` from 1.1.60 to 1.1.61 - [Release notes](https://github.com/projectdiscovery/gologger/releases) - [Commits](projectdiscovery/gologger@v1.1.60...v1.1.61) Updates `github.com/projectdiscovery/hmap` from 0.0.95 to 0.0.96 - [Release notes](https://github.com/projectdiscovery/hmap/releases) - [Commits](projectdiscovery/hmap@v0.0.95...v0.0.96) Updates `github.com/projectdiscovery/networkpolicy` from 0.1.28 to 0.1.29 - [Release notes](https://github.com/projectdiscovery/networkpolicy/releases) - [Commits](projectdiscovery/networkpolicy@v0.1.28...v0.1.29) Updates `github.com/projectdiscovery/retryablehttp-go` from 1.0.131 to 1.0.132 - [Release notes](https://github.com/projectdiscovery/retryablehttp-go/releases) - [Commits](projectdiscovery/retryablehttp-go@v1.0.131...v1.0.132) Updates `github.com/projectdiscovery/tlsx` from 1.2.1 to 1.2.2 - [Release notes](https://github.com/projectdiscovery/tlsx/releases) - [Changelog](https://github.com/projectdiscovery/tlsx/blob/main/.goreleaser.yml) - [Commits](projectdiscovery/tlsx@v1.2.1...v1.2.2) Updates `github.com/projectdiscovery/useragent` from 0.0.102 to 0.0.103 - [Release notes](https://github.com/projectdiscovery/useragent/releases) - [Commits](projectdiscovery/useragent@v0.0.102...v0.0.103) Updates 
`github.com/projectdiscovery/utils` from 0.6.1 to 0.7.1 - [Release notes](https://github.com/projectdiscovery/utils/releases) - [Changelog](https://github.com/projectdiscovery/utils/blob/main/CHANGELOG.md) - [Commits](projectdiscovery/utils@v0.6.1...v0.7.1) Updates `github.com/projectdiscovery/wappalyzergo` from 0.2.55 to 0.2.56 - [Release notes](https://github.com/projectdiscovery/wappalyzergo/releases) - [Commits](projectdiscovery/wappalyzergo@v0.2.55...v0.2.56) --- updated-dependencies: - dependency-name: github.com/projectdiscovery/cdncheck dependency-version: 1.2.11 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules - dependency-name: github.com/projectdiscovery/dsl dependency-version: 0.8.5 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules - dependency-name: github.com/projectdiscovery/fastdialer dependency-version: 0.4.17 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules - dependency-name: github.com/projectdiscovery/gologger dependency-version: 1.1.61 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules - dependency-name: github.com/projectdiscovery/hmap dependency-version: 0.0.96 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules - dependency-name: github.com/projectdiscovery/networkpolicy dependency-version: 0.1.29 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules - dependency-name: github.com/projectdiscovery/retryablehttp-go dependency-version: 1.0.132 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules - dependency-name: github.com/projectdiscovery/tlsx dependency-version: 1.2.2 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules - dependency-name: 
github.com/projectdiscovery/useragent dependency-version: 0.0.103 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules - dependency-name: github.com/projectdiscovery/utils dependency-version: 0.7.1 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: modules - dependency-name: github.com/projectdiscovery/wappalyzergo dependency-version: 0.2.56 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules ... Signed-off-by: dependabot[bot] <[email protected]> * fix test * chore(deps): bump github.com/JohannesKaufmann/html-to-markdown/v2 Bumps [github.com/JohannesKaufmann/html-to-markdown/v2](https://github.com/JohannesKaufmann/html-to-markdown) from 2.4.0 to 2.5.0. - [Release notes](https://github.com/JohannesKaufmann/html-to-markdown/releases) - [Commits](JohannesKaufmann/html-to-markdown@v2.4.0...v2.5.0) --- updated-dependencies: - dependency-name: github.com/JohannesKaufmann/html-to-markdown/v2 dependency-version: 2.5.0 dependency-type: direct:production update-type: version-update:semver-minor ... 
Signed-off-by: dependabot[bot] <[email protected]> * chore(deps): bump the modules group with 10 updates Bumps the modules group with 10 updates: | Package | From | To | | --- | --- | --- | | [github.com/projectdiscovery/cdncheck](https://github.com/projectdiscovery/cdncheck) | `1.2.11` | `1.2.12` | | [github.com/projectdiscovery/dsl](https://github.com/projectdiscovery/dsl) | `0.8.5` | `0.8.6` | | [github.com/projectdiscovery/fastdialer](https://github.com/projectdiscovery/fastdialer) | `0.4.17` | `0.4.18` | | [github.com/projectdiscovery/gologger](https://github.com/projectdiscovery/gologger) | `1.1.61` | `1.1.62` | | [github.com/projectdiscovery/hmap](https://github.com/projectdiscovery/hmap) | `0.0.96` | `0.0.97` | | [github.com/projectdiscovery/networkpolicy](https://github.com/projectdiscovery/networkpolicy) | `0.1.29` | `0.1.30` | | [github.com/projectdiscovery/retryablehttp-go](https://github.com/projectdiscovery/retryablehttp-go) | `1.0.132` | `1.0.133` | | [github.com/projectdiscovery/useragent](https://github.com/projectdiscovery/useragent) | `0.0.103` | `0.0.104` | | [github.com/projectdiscovery/utils](https://github.com/projectdiscovery/utils) | `0.7.1` | `0.7.3` | | [github.com/projectdiscovery/wappalyzergo](https://github.com/projectdiscovery/wappalyzergo) | `0.2.56` | `0.2.57` | Updates `github.com/projectdiscovery/cdncheck` from 1.2.11 to 1.2.12 - [Release notes](https://github.com/projectdiscovery/cdncheck/releases) - [Commits](projectdiscovery/cdncheck@v1.2.11...v1.2.12) Updates `github.com/projectdiscovery/dsl` from 0.8.5 to 0.8.6 - [Release notes](https://github.com/projectdiscovery/dsl/releases) - [Commits](projectdiscovery/dsl@v0.8.5...v0.8.6) Updates `github.com/projectdiscovery/fastdialer` from 0.4.17 to 0.4.18 - [Release notes](https://github.com/projectdiscovery/fastdialer/releases) - [Commits](projectdiscovery/fastdialer@v0.4.17...v0.4.18) Updates `github.com/projectdiscovery/gologger` from 1.1.61 to 1.1.62 - [Release 
notes](https://github.com/projectdiscovery/gologger/releases) - [Commits](projectdiscovery/gologger@v1.1.61...v1.1.62) Updates `github.com/projectdiscovery/hmap` from 0.0.96 to 0.0.97 - [Release notes](https://github.com/projectdiscovery/hmap/releases) - [Commits](projectdiscovery/hmap@v0.0.96...v0.0.97) Updates `github.com/projectdiscovery/networkpolicy` from 0.1.29 to 0.1.30 - [Release notes](https://github.com/projectdiscovery/networkpolicy/releases) - [Commits](projectdiscovery/networkpolicy@v0.1.29...v0.1.30) Updates `github.com/projectdiscovery/retryablehttp-go` from 1.0.132 to 1.0.133 - [Release notes](https://github.com/projectdiscovery/retryablehttp-go/releases) - [Commits](projectdiscovery/retryablehttp-go@v1.0.132...v1.0.133) Updates `github.com/projectdiscovery/useragent` from 0.0.103 to 0.0.104 - [Release notes](https://github.com/projectdiscovery/useragent/releases) - [Commits](projectdiscovery/useragent@v0.0.103...v0.0.104) Updates `github.com/projectdiscovery/utils` from 0.7.1 to 0.7.3 - [Release notes](https://github.com/projectdiscovery/utils/releases) - [Changelog](https://github.com/projectdiscovery/utils/blob/main/CHANGELOG.md) - [Commits](projectdiscovery/utils@v0.7.1...v0.7.3) Updates `github.com/projectdiscovery/wappalyzergo` from 0.2.56 to 0.2.57 - [Release notes](https://github.com/projectdiscovery/wappalyzergo/releases) - [Commits](projectdiscovery/wappalyzergo@v0.2.56...v0.2.57) --- updated-dependencies: - dependency-name: github.com/projectdiscovery/cdncheck dependency-version: 1.2.12 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules - dependency-name: github.com/projectdiscovery/dsl dependency-version: 0.8.6 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: modules - dependency-name: github.com/projectdiscovery/fastdialer dependency-version: 0.4.18 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: 
modules
  - dependency-name: github.com/projectdiscovery/gologger
    dependency-version: 1.1.62
    dependency-type: direct:production
    update-type: version-update:semver-patch
    dependency-group: modules
  - dependency-name: github.com/projectdiscovery/hmap
    dependency-version: 0.0.97
    dependency-type: direct:production
    update-type: version-update:semver-patch
    dependency-group: modules
  - dependency-name: github.com/projectdiscovery/networkpolicy
    dependency-version: 0.1.30
    dependency-type: direct:production
    update-type: version-update:semver-patch
    dependency-group: modules
  - dependency-name: github.com/projectdiscovery/retryablehttp-go
    dependency-version: 1.0.133
    dependency-type: direct:production
    update-type: version-update:semver-patch
    dependency-group: modules
  - dependency-name: github.com/projectdiscovery/useragent
    dependency-version: 0.0.104
    dependency-type: direct:production
    update-type: version-update:semver-patch
    dependency-group: modules
  - dependency-name: github.com/projectdiscovery/utils
    dependency-version: 0.7.3
    dependency-type: direct:production
    update-type: version-update:semver-patch
    dependency-group: modules
  - dependency-name: github.com/projectdiscovery/wappalyzergo
    dependency-version: 0.2.57
    dependency-type: direct:production
    update-type: version-update:semver-patch
    dependency-group: modules
  ...

  Signed-off-by: dependabot[bot] <[email protected]>

* feat: update `-ldp` option to show default ports in CLI output (#2331)

  feat: update -ldp option to show default ports in CLI output

  - Modified URL formatting in runner.go to respect LeaveDefaultPorts option
  - Fixed AddURLDefaultPort function to actually add default ports (80/443)
  - When -ldp is used, CLI output now shows https://example.com:443 instead of https://example.com
  - Maintains backward compatibility - default behavior unchanged

  Fixes CLI output inconsistency where -ldp flag only affected Host headers but not the displayed URL in console output.
* fix: HTML parser panic protection with multiple fallback (#2330)

  fix: enhance HTML parser panic protection with multiple fallback strategies

  - Add ultra-aggressive HTML sanitization to reduce nesting depth
  - Implement size limiting (1MB) to prevent processing huge documents
  - Add plain text extraction fallback for complex HTML structures
  - Enhance panic recovery with comprehensive error handling
  - Remove deeply nestable elements (div, span, ul, ol, li) from sanitizer
  - Add comprehensive test coverage for edge cases

  Resolves HTML parser panic: 'html: open stack of elements exceeds 512 nodes' that occurred after switching to html-to-markdown/v2 library in PR #2255

* fix: host JSON field now returns hostname instead of IP (#2333)

  - Changed 'host' field to return actual hostname (e.g., example.com)
  - Added new 'host_ip' field for the resolved/dialed IP address
  - Fixes semantic issue where 'host' was incorrectly returning IP

* version update

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: @GDATTACKER <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mzack9999 <[email protected]>
2 parents 0840905 + f7712f7 commit 4d50f4e
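The commit message describes turning a parser panic into an ordinary error. The following is a minimal, self-contained sketch of that general Go pattern (a deferred `recover` writing to named results). The names `riskyParse` and `safeParse` are hypothetical and are not part of httpx; the real change is in the `pagetypeclassifier.go` diff further down.

```go
package main

import (
	"errors"
	"fmt"
)

// riskyParse is a hypothetical stand-in for a third-party parser that
// panics on pathological input instead of returning an error.
func riskyParse(input string) (string, error) {
	if len(input) > 8 {
		panic("parser: input too complex")
	}
	if input == "" {
		return "", errors.New("parser: empty input")
	}
	return "parsed:" + input, nil
}

// safeParse converts a panic inside riskyParse into an ordinary error.
// Named results let the deferred function overwrite what the caller sees.
func safeParse(input string) (out string, err error) {
	defer func() {
		if r := recover(); r != nil {
			out = ""
			err = fmt.Errorf("recovered from parser panic: %v", r)
		}
	}()
	return riskyParse(input)
}

func main() {
	for _, in := range []string{"ok", "", "far too long"} {
		out, err := safeParse(in)
		fmt.Printf("input=%q out=%q err=%v\n", in, out, err)
	}
}
```

The recover only protects the goroutine it runs in, so a guard like this has to sit in the same call chain as the code that may panic.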

10 files changed

+325
-100
lines changed

common/httpx/httpx.go

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,7 @@
 package httpx
 
 import (
+	"context"
 	"crypto/tls"
 	"fmt"
 	"io"
@@ -25,7 +26,6 @@ import (
 	pdhttputil "github.com/projectdiscovery/utils/http"
 	stringsutil "github.com/projectdiscovery/utils/strings"
 	urlutil "github.com/projectdiscovery/utils/url"
-	"golang.org/x/net/context"
 	"golang.org/x/net/http2"
 )

common/pagetypeclassifier/pagetypeclassifier.go

Lines changed: 126 additions & 9 deletions
@@ -2,8 +2,12 @@ package pagetypeclassifier
 
 import (
 	_ "embed"
+	"fmt"
+	"strings"
+	"sync"
 
 	htmltomarkdown "github.com/JohannesKaufmann/html-to-markdown/v2"
+	"github.com/microcosm-cc/bluemonday"
 	"github.com/projectdiscovery/utils/ml/naive_bayes"
 )

@@ -14,26 +18,139 @@ type PageTypeClassifier struct {
 	classifier *naive_bayes.NaiveBayesClassifier
 }
 
-func New() *PageTypeClassifier {
+func New() (*PageTypeClassifier, error) {
 	classifier, err := naive_bayes.NewClassifierFromFileData(classifierData)
 	if err != nil {
-		panic(err)
+		return nil, err
 	}
-	return &PageTypeClassifier{classifier: classifier}
+	return &PageTypeClassifier{classifier: classifier}, nil
 }
 
 func (n *PageTypeClassifier) Classify(html string) string {
-	text := htmlToText(html)
-	if text == "" {
+	text, err := htmlToText(html)
+	if err != nil || text == "" {
 		return "other"
 	}
 	return n.classifier.Classify(text)
 }
 
-func htmlToText(html string) string {
-	text, err := htmltomarkdown.ConvertString(html)
+var (
+	// sanitizerPolicy is an aggressive bluemonday policy that strips most HTML
+	// to reduce nesting depth and prevent parser stack overflow
+	sanitizerPolicy     *bluemonday.Policy
+	sanitizerPolicyOnce sync.Once
+)
+
+// getSanitizerPolicy returns an ultra-aggressive HTML sanitizer policy that strips
+// almost all elements to minimize nesting depth and prevent parser stack overflow.
+func getSanitizerPolicy() *bluemonday.Policy {
+	sanitizerPolicyOnce.Do(func() {
+		p := bluemonday.NewPolicy()
+		// Ultra-aggressive policy: Allow only the most basic text elements
+		// to minimize nesting and reduce parser stack depth
+		p.AllowElements("p", "br", "h1", "h2", "h3", "h4", "h5", "h6")
+		p.AllowElements("strong", "em", "b", "i")
+		// Remove div, span, ul, ol, li as they can create deep nesting
+		// No attributes allowed to prevent style-based nesting issues
+		sanitizerPolicy = p
+	})
+	return sanitizerPolicy
+}
+
+// htmlToText safely converts HTML to text with multiple fallback strategies.
+// The 512 node limit in golang.org/x/net/html is hardcoded and cannot be increased.
+// Strategy:
+// 1. Length limit the input HTML to prevent massive documents
+// 2. Sanitize HTML aggressively with bluemonday to reduce nesting
+// 3. Convert sanitized HTML to markdown with panic recovery
+// 4. If conversion fails, fallback to plain text extraction
+func htmlToText(html string) (text string, err error) {
+	defer func() {
+		if r := recover(); r != nil {
+			err = fmt.Errorf("html parser panic: %v", r)
+			text = ""
+		}
+	}()
+
+	// Limit input size to prevent processing extremely large HTML documents
+	const maxHTMLSize = 1024 * 1024 // 1MB limit
+	if len(html) > maxHTMLSize {
+		html = html[:maxHTMLSize]
+	}
+
+	// First, sanitize HTML with ultra-aggressive bluemonday policy
+	sanitizedHTML := getSanitizerPolicy().Sanitize(html)
+
+	// If sanitization failed or produced empty result, try plain text fallback
+	if sanitizedHTML == "" {
+		return extractPlainText(html), nil
+	}
+
+	// Convert sanitized HTML to markdown
+	text, err = htmltomarkdown.ConvertString(sanitizedHTML)
 	if err != nil {
-		panic(err)
+		// If markdown conversion fails, fallback to plain text extraction
+		return extractPlainText(sanitizedHTML), nil
+	}
+
+	if text == "" {
+		// If result is empty, try plain text fallback
+		return extractPlainText(sanitizedHTML), nil
+	}
+
+	return text, nil
+}
+
+// extractPlainText is a simple fallback that extracts text content without HTML parsing
+// This is used when the HTML parser fails due to complexity or nesting depth
+func extractPlainText(html string) string {
+	// Simple regex-based text extraction as fallback
+	// Remove script and style tags first
+	text := html
+
+	// Remove script tags and content
+	for {
+		start := strings.Index(text, "<script")
+		if start == -1 {
+			break
+		}
+		end := strings.Index(text[start:], "</script>")
+		if end == -1 {
+			text = text[:start]
+			break
+		}
+		text = text[:start] + text[start+end+9:]
+	}
+
+	// Remove style tags and content
+	for {
+		start := strings.Index(text, "<style")
+		if start == -1 {
+			break
+		}
+		end := strings.Index(text[start:], "</style>")
+		if end == -1 {
+			text = text[:start]
+			break
+		}
+		text = text[:start] + text[start+end+8:]
+	}
+
+	// Simple HTML tag removal (not perfect but safe)
+	result := ""
+	inTag := false
+	for _, char := range text {
+		if char == '<' {
+			inTag = true
+		} else if char == '>' {
+			inTag = false
+			result += " " // Replace tags with spaces
+		} else if !inTag {
+			result += string(char)
+		}
 	}
-	return text
+
+	// Clean up multiple spaces
+	words := strings.Fields(result)
+	return strings.Join(words, " ")
 }
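The tag-removal loop in `extractPlainText` appends to a string one rune at a time. As an illustration only, the same scan can be written with `strings.Builder`, which avoids reallocating the result on each append. `stripTags` below is a hypothetical variant, not the committed code, and unlike `extractPlainText` it does not remove `<script>` or `<style>` bodies.

```go
package main

import (
	"fmt"
	"strings"
)

// stripTags is a hypothetical variant of the fallback scan: it drops
// everything between '<' and '>' and collapses runs of whitespace.
// It is not an HTML parser and keeps script/style bodies as text.
func stripTags(input string) string {
	var b strings.Builder
	b.Grow(len(input))
	inTag := false
	for _, r := range input {
		switch {
		case r == '<':
			inTag = true
		case r == '>':
			inTag = false
			b.WriteByte(' ') // keep words on either side of a tag separated
		case !inTag:
			b.WriteRune(r)
		}
	}
	return strings.Join(strings.Fields(b.String()), " ")
}

func main() {
	fmt.Println(stripTags("<h1>Title</h1><p>Some <strong>important</strong> content</p>"))
	// prints "Title Some important content"
}
```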

common/pagetypeclassifier/pagetypeclassifier_test.go

Lines changed: 97 additions & 8 deletions
@@ -1,21 +1,24 @@
 package pagetypeclassifier
 
 import (
+	"strings"
 	"testing"
 
-	"github.com/stretchr/testify/assert"
+	"github.com/stretchr/testify/require"
 )
 
 func TestPageTypeClassifier(t *testing.T) {
-
 	t.Run("test creation of new PageTypeClassifier", func(t *testing.T) {
-		epc := New()
-		assert.NotNil(t, epc)
+		epc, err := New()
+		require.NoError(t, err)
+		require.NotNil(t, epc)
 	})
 
 	t.Run("test classification non error page text", func(t *testing.T) {
-		epc := New()
-		assert.Equal(t, "nonerror", epc.Classify(`<!DOCTYPE html>
+		epc, err := New()
+		require.NoError(t, err)
+		require.NotNil(t, epc)
+		require.Equal(t, "nonerror", epc.Classify(`<!DOCTYPE html>
 <html lang="en">
 <head>
 <meta charset="UTF-8">
@@ -30,8 +33,10 @@ func TestPageTypeClassifier(t *testing.T) {
 	})
 
 	t.Run("test classification on error page text", func(t *testing.T) {
-		epc := New()
-		assert.Equal(t, "error", epc.Classify(`<!DOCTYPE html>
+		epc, err := New()
+		require.NoError(t, err)
+		require.NotNil(t, epc)
+		require.Equal(t, "error", epc.Classify(`<!DOCTYPE html>
 <html>
 <head>
 <title>Error 403: Forbidden</title>
@@ -51,4 +56,88 @@ func TestPageTypeClassifier(t *testing.T) {
 </html>
 `))
 	})
+
+	t.Run("test resilience with deeply nested HTML", func(t *testing.T) {
+		epc, err := New()
+		require.NoError(t, err)
+		require.NotNil(t, epc)
+
+		// Generate deeply nested HTML that would have exceeded the 512 node stack limit
+		// With our enhanced sanitization and fallback mechanisms, this should now work
+		deeplyNestedHTML := "<div>"
+		for i := 0; i < 600; i++ {
+			deeplyNestedHTML += "<div><span>"
+		}
+		deeplyNestedHTML += "Some text content"
+		for i := 0; i < 600; i++ {
+			deeplyNestedHTML += "</span></div>"
+		}
+		deeplyNestedHTML += "</div>"
+
+		// Should not panic and should successfully classify the content
+		result := epc.Classify(deeplyNestedHTML)
+		require.NotEmpty(t, result)
+		// Should be able to extract and classify the text content
+		require.NotEqual(t, "", result)
+	})
+
+	t.Run("test htmlToText with deeply nested HTML", func(t *testing.T) {
+		// Generate deeply nested HTML that would have exceeded the 512 node stack limit
+		deeplyNestedHTML := "<div>"
+		for i := 0; i < 600; i++ {
+			deeplyNestedHTML += "<div><span>"
+		}
+		deeplyNestedHTML += "Some text content"
+		for i := 0; i < 600; i++ {
+			deeplyNestedHTML += "</span></div>"
+		}
+		deeplyNestedHTML += "</div>"
+
+		// Should not panic and should successfully extract text with enhanced sanitization
+		result, err := htmlToText(deeplyNestedHTML)
+		require.NoError(t, err)
+		require.NotEmpty(t, result)
+		require.Contains(t, result, "Some text content")
+	})
+
+	t.Run("test htmlToText with normal HTML", func(t *testing.T) {
+		normalHTML := `<html><body><h1>Title</h1><p>Some content here</p></body></html>`
+		result, err := htmlToText(normalHTML)
+		require.NoError(t, err)
+		require.NotEmpty(t, result)
+	})
+
+	t.Run("test htmlToText with extremely large HTML", func(t *testing.T) {
+		// Create a very large HTML document (over 1MB)
+		largeContent := strings.Repeat("<p>This is a test paragraph with some content. ", 50000)
+		largeHTML := "<html><body>" + largeContent + "</body></html>"
+
+		// Should handle large documents without panic
+		result, err := htmlToText(largeHTML)
+		require.NoError(t, err)
+		require.NotEmpty(t, result)
+	})
+
+	t.Run("test extractPlainText fallback", func(t *testing.T) {
+		htmlWithScriptAndStyle := `<html>
+<head>
+<style>body { color: red; }</style>
+<script>alert('test');</script>
+</head>
+<body>
+<h1>Title</h1>
+<p>Some <strong>important</strong> content here</p>
+<div><span>Nested content</span></div>
+</body>
+</html>`
+
+		result := extractPlainText(htmlWithScriptAndStyle)
+		require.NotEmpty(t, result)
+		require.Contains(t, result, "Title")
+		require.Contains(t, result, "important")
+		require.Contains(t, result, "content")
+		// Should not contain script or style content
+		require.NotContains(t, result, "alert")
+		require.NotContains(t, result, "color: red")
+	})
 }

common/stringz/stringz.go

Lines changed: 8 additions & 0 deletions
@@ -83,6 +83,14 @@ func AddURLDefaultPort(rawURL string) string {
 	if err != nil {
 		return rawURL
 	}
+	// Force default port to be added if not present
+	if u.Port() == "" {
+		if u.Scheme == urlutil.HTTP {
+			u.UpdatePort("80")
+		} else if u.Scheme == urlutil.HTTPS {
+			u.UpdatePort("443")
+		}
+	}
 	return u.String()
 }
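To show the intended behaviour of the hunk above in isolation, here is a sketch that adds the default port using only the standard library `net/url` package. It is an assumption-laden analogue: `addDefaultPort` is a hypothetical name, and it does not use the projectdiscovery `urlutil` types that `AddURLDefaultPort` relies on.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// addDefaultPort is a hypothetical stdlib-only analogue of the change
// above: it appends :80 or :443 when an http/https URL has no port.
func addDefaultPort(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil || u.Port() != "" {
		return rawURL // unparsable, or a port is already present
	}
	switch u.Scheme {
	case "http":
		u.Host = net.JoinHostPort(u.Hostname(), "80")
	case "https":
		u.Host = net.JoinHostPort(u.Hostname(), "443")
	}
	return u.String()
}

func main() {
	fmt.Println(addDefaultPort("https://example.com"))
	// prints "https://example.com:443"
	fmt.Println(addDefaultPort("http://example.com:8080/x"))
	// prints "http://example.com:8080/x"
}
```

This matches the example in the commit message, where `https://example.com` is shown as `https://example.com:443` when `-ldp` is set.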
