[pull] master from scrapinghub:master by pull[bot] · Pull Request #10 · zanachka/article-extraction-benchmark

pull · 2025-09-23T18:59:27Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

-beautifulsoup4==4.9.3
+beautifulsoup4==4.13.5
-goose3==3.1.8
+goose3==3.1.20
-html-text==0.5.1
+html-text==0.7.0
-html2text==2020.1.16
+html2text==2025.4.15
-inscriptis==1.1.2
+inscriptis==2.6.0
-justext==2.2.0
+justext==3.0.2
-news-please==1.5.17
+news-please==1.6.16
-newspaper3k==0.2.8
+newspaper4k==0.9.3.1
-readability-lxml==0.7.1
+readability-lxml==0.8.4.1
-trafilatura==0.5.1
+trafilatura==2.0.0

Dragnet does not seem to be installable, likely because it has been unmaintained since 2019.

All Go libraries would encounter an error parsing this HTML file:

html/0ec95c7261d122f304728e90c983450ef1ce1e0b423546835c397d50aaf0d0f2.html.gz

The error was:

`transform: short internal buffer`

This was because these Go libraries used `dom.Parse()` from https://github.com/go-shiori/dom instead of the standard `html.Parse()`. The `dom.Parse()` function wraps HTML parsing with potential transcoding to UTF-8 and Unicode normalization of the input text. The Go error would be triggered in this transformer chain:

`transform.Chain(norm.NFD, ..., norm.NFC)`

I'm not exactly sure why, but chaining these transformers together isn't safe due to internal buffer issues.

Furthermore, using `dom.Parse()` to parse HTML documents in this project isn't necessary, because all HTML files in the dataset are already UTF-8 and the article extractor programs themselves should not be responsible for Unicode normalization. The only Unicode normalization we now do is strip the soft hyphen character (U+00AD) in the Python scripts that store the results of article parsing.
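As a rough illustration of the soft hyphen handling described above, here is a minimal Python sketch. It is not the code from this PR: the names `strip_soft_hyphens` and `store_result`, and the `{"articleBody": ...}` record shape, are assumptions made up for the example. The point is only that U+00AD is removed when a result is stored, with no NFC/NFD normalization applied to the rest of the text.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, an invisible hyphenation hint


def strip_soft_hyphens(text: str) -> str:
    """Remove every soft hyphen; all other characters are left untouched."""
    return text.replace(SOFT_HYPHEN, "")


def store_result(results: dict, item_id: str, article_body: str) -> None:
    """Record one extractor output (hypothetical helper and record shape).

    The soft hyphen is stripped here, at storage time, so the extractor
    itself does not have to do any Unicode normalization.
    """
    results[item_id] = {"articleBody": strip_soft_hyphens(article_body)}


if __name__ == "__main__":
    results = {}
    store_result(results, "example-id", "hy\u00adphen\u00adation")
    print(results["example-id"]["articleBody"])
```

Because only one code point is removed, text that arrives in decomposed form (for example `e` followed by U+0301) stays decomposed, which keeps the storage step independent of any NFC/NFD choice.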

Tool updates

mislav and others added 10 commits September 23, 2025 12:58

Update go-domdistiller, go-readability

ebaf390

Add Readeck's fork of go-readability

e7d655e

Update @mozilla/readability

d3a3fb4

Add go-trafilatura

f35c22a

Makefile

1dfdd02

Table formatting of evaluate script output

27a6d7c

Update evaluation results

b260995

Merge pull request #26 from mislav/tool-updates

4f9c9ed

Tool updates

pull bot locked and limited conversation to collaborators Sep 23, 2025

pull bot added the ⤵️ pull label Sep 23, 2025

pull bot merged commit 4f9c9ed into zanachka:master Sep 23, 2025

pull[bot] merged 10 commits into zanachka:master from scrapinghub:master
