Releases · fmacpro/horseman-article-parser

27 Sep 01:05

fmacpro

1.2.5

9563bac

1.2.5 Image/Captions Cleanup Latest

Image/Captions Cleanup

Added stripImagesForRawText in controllers/textProcessing.js:425 to remove <figure>, <picture>, standalone <img>, and associated captions before the HTML-to-text pass. Raw text now omits image alts/captions while sentence boundary handling stays intact.
Reused the shared URL helpers (containsUrlLike, stripUrlsFromText, stripDataUrlsFromText) so both raw and formatted outputs stay URL‑clean without repeating regex logic. getFormattedText also clears data URIs but keeps real HTML links.
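
The image-stripping step described above could be sketched roughly as follows. This is an illustration only, not the code in controllers/textProcessing.js: the function name stripImagesForRawTextSketch is hypothetical, and regular expressions are used here for brevity (they do not handle nested figures, and the real helper may well operate on a parsed DOM instead).

```javascript
// Hypothetical sketch: remove image markup and captions before an HTML-to-text pass.
// Assumption: regex-based stripping is "good enough" for the illustration.
function stripImagesForRawTextSketch(html) {
  return html
    // Drop whole <figure> blocks, which also removes any <figcaption> inside them.
    .replace(/<figure\b[^>]*>[\s\S]*?<\/figure>/gi, ' ')
    // Drop whole <picture> blocks, including their <source> and <img> children.
    .replace(/<picture\b[^>]*>[\s\S]*?<\/picture>/gi, ' ')
    // Drop any standalone <img> tags that remain (this also discards alt text).
    .replace(/<img\b[^>]*>/gi, ' ')
}

const sampleHtml =
  '<p>Before.</p>' +
  '<figure><img src="a.jpg" alt="A cat"><figcaption>A cat sitting.</figcaption></figure>' +
  '<p>After.</p>'

const cleanedHtml = stripImagesForRawTextSketch(sampleHtml)
console.log(cleanedHtml)
```

Replacing each removed block with a single space, rather than an empty string, keeps the surrounding paragraphs from running together, which is one way to leave sentence boundaries intact.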

Spell Check Alignment

controllers/spellCheck.js:1 now imports maskUrlsInText/isUrlLikeToken, masking URL tokens with spaces so offsets and line numbers remain accurate while filtering URL-like misspellings (including data URIs).
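
To see why masking with spaces keeps offsets and line numbers accurate, here is a minimal sketch. The function name maskUrlsSketch and the URL_PATTERN regex are assumptions made for illustration; the library's own maskUrlsInText and isUrlLikeToken may match URLs differently.

```javascript
// Assumed pattern for illustration: http(s)/www links and data URIs.
const URL_PATTERN =
  /\b(?:https?:\/\/|www\.)[^\s<>"']+|data:[a-z]+\/[a-z0-9.+-]+[;,][^\s<>"']*/gi

// Replace every URL-like match with the same number of spaces, so each
// remaining character keeps its original index and line.
function maskUrlsSketch(text) {
  return text.replace(URL_PATTERN, (match) => ' '.repeat(match.length))
}

const input = 'See https://example.com/a now.\nTeh end.'
const masked = maskUrlsSketch(input)

// The misspelling "Teh" sits at the same offset before and after masking.
console.log(input.indexOf('Teh') === masked.indexOf('Teh'))
```

Deleting the URL outright would shift every later character to the left, so a spell checker would report the wrong column for anything after it. Same-length masking avoids that without a separate offset map.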

26 Sep 19:21

fmacpro

1.2.4

622d396

1.2.4 Article Summary & Readability Metrics

controllers/readability.js now builds stats from a retext parse so sentence and paragraph counts follow real linguistic boundaries, with Intl.Segmenter/regex fallbacks to keep numbers sensible when parsing fails.
tests/readability.test.js keeps the legacy expectations and adds an abbreviation-focused case to confirm that sentences containing abbreviations like “U.K.” aren’t double-counted.
Built a scoring-driven summariser that segments the text, scores each sentence for factual content, relevance to title/meta hints, and paragraph position, keeps per-paragraph variety, and then applies coverage balancing across the article
Added paragraph extraction with single/double newline fallback plus keyword token support so meta keywords/OG/Twitter hints boost matching sentences
Fed the new summariser title/meta keyword hints pulled from page metadata before summarising
Added a focused unit test covering multi-paragraph selection to guard the new heuristics
Tightened place heuristics so country names stay as locations and trailing person names drop off. Added a normalized catalogue of multi-word countries plus known place phrases and new helpers (collectTermTags, wordLooksLikePerson, trimPlaceTailWords, cleanPlaceSegment, etc.) in controllers/entityParser.js:92 and controllers/entityParser.js:1190-1330 to split “UK France Canada Australia”, trim verbs/prepositions, and strip endings like “Netanyahu” from place strings.
Built a buildPersonTokenSet helper in controllers/entityParser.js:1330 and now feed it through expandAndCleanPlaces and the final people filter (controllers/entityParser.js:1608-1619) so any place we keep automatically removes the matching person entry (e.g. “United States”).
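
The fallback chain mentioned above for sentence and paragraph counts might look something like the sketch below. It covers only the Intl.Segmenter and regex fallbacks, not the retext parse that the release notes say runs first, and all function names are hypothetical rather than taken from controllers/readability.js.

```javascript
// Last-resort fallback: split after sentence-ending punctuation followed by whitespace.
function countSentencesRegex(text) {
  return text
    .split(/(?<=[.!?])\s+/)
    .filter((part) => part.trim().length > 0).length
}

// Prefer Intl.Segmenter when the runtime provides it, otherwise use the regex.
function countSentencesSketch(text, locale = 'en') {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' })
    let count = 0
    for (const { segment } of segmenter.segment(text)) {
      if (segment.trim().length > 0) count += 1
    }
    return count
  }
  return countSentencesRegex(text)
}

// Paragraphs: try blank-line separation first, then fall back to single newlines.
function countParagraphsSketch(text) {
  const nonEmpty = (part) => part.trim().length > 0
  const byBlankLine = text.split(/\n\s*\n/).filter(nonEmpty)
  if (byBlankLine.length > 1) return byBlankLine.length
  return text.split(/\n/).filter(nonEmpty).length
}

console.log(countSentencesSketch('Hello world. How are you? Fine!'))
console.log(countParagraphsSketch('First paragraph.\n\nSecond paragraph.'))
```

The plain regex splits after every full stop, so it is the weakest option for abbreviations, which is presumably why it sits last in the chain.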

16 Sep 22:03

fmacpro

1.2.3

a397cc7

1.2.3 Improved entity parsing

Rewrite the entity parser to strip trailing job titles, split dense acknowledgement sequences, merge optional secondary NER results, and refine deduping, exposing the async pipeline through the article parser.
Extend NLP plugin hints to cover middle names, suffixes, and secondary NER configuration, update the default hint shape, and document the new options for integrators.
Preserve spacing when unwrapping inline HTML wrappers and add regression coverage for the helper to prevent merged tokens in cleaned articles.
Expand entity parser regression tests for the new heuristics and raise language suite timeouts to keep the browser-driven checks stable.
Overhauled the entity parser with heuristics that trim job titles, split dense acknowledgement blocks, incorporate secondary NER responses, and dedupe results asynchronously within the article pipeline.
Broadened NLP plugin hint handling—including default options—to accept middle/suffix buckets and external NER sources, and updated the README to describe the configuration.
Ensured inline wrapper removal leaves readable spacing and added targeted tests for the helper alongside broader entity parser and language timeout coverage.

15 Sep 22:15

fmacpro

1.2.2

034023f

1.2.2 Improved container detection, content sanitization, and readability metric

Refines article container identification and promotion logic for fragmented content, and adds caption triggers.
Sanitizes article content by removing images, CTAs, and other non-text elements before analysis
Introduces readability metrics and fixes paragraph-counting logic to improve accuracy.
Removes deprecated API documentation tooling

14 Sep 00:08

fmacpro

1.2.1

efdc7dd

Release 1.2.1: readability metrics, multilingual dictionaries, and enhanced summarization

Added comprehensive readability analysis with sample run output and paragraph-count fixes
Introduced French and Spanish dictionaries for multi-language processing
Implemented article summarization features
Preserved hyphenated words and line breaks for accurate titles and spell-check line numbers
Refactored entity parser to improve recognition logic

13 Sep 17:59

fmacpro

1.2.0

654804d

1.2.0 Improved entity and keyword parsing

Added a dedicated entity parser that normalizes strings, strips possessive suffixes, and deduplicates people, places, organizations, and topics before returning them
Introduced reusable helpers for capitalizing text, removing trailing possessives, and stripping punctuation
Updated the keyword parser to apply these helpers so extracted keywords and keyphrases are capitalized and free of trailing possessives
Loaded NLP plugin hints into the main parse workflow to enrich entity detection
Expanded test coverage to verify entity capitalization, possessive stripping, and keyword/phrase normalization
Removed domain-specific tweaks from scripts, as they are no longer required
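
As an illustration of the normalization helpers listed above, the following sketch strips edge punctuation and trailing possessives, capitalizes each word, and deduplicates case-insensitively. The function names and the exact rules are assumptions; the project's own helpers may behave differently (for example, around acronyms or non-English names).

```javascript
// Remove leading/trailing characters that are neither letters nor digits.
function stripEdgePunctuation(value) {
  return value.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, '')
}

// Remove a trailing possessive: "Obama's" -> "Obama", "Jones'" -> "Jones".
function stripTrailingPossessive(value) {
  return value.replace(/(?:['’]s|['’])$/i, '')
}

// Upper-case the first letter of every whitespace-separated word.
function capitalizeWords(value) {
  return value.replace(/(^|\s)(\p{L})/gu, (_, lead, ch) => lead + ch.toUpperCase())
}

// Normalize each value, then keep the first occurrence of each lower-cased key.
function normalizeEntitiesSketch(values) {
  const seen = new Map()
  for (const raw of values) {
    const clean = capitalizeWords(
      stripTrailingPossessive(stripEdgePunctuation(raw.trim()))
    )
    const key = clean.toLowerCase()
    if (clean && !seen.has(key)) seen.set(key, clean)
  }
  return [...seen.values()]
}

const people = normalizeEntitiesSketch(["obama's", 'Obama', ' new york, ', 'New York’s'])
console.log(people)
```

Deduplicating after normalization, rather than before, is what lets “obama's” and “Obama” collapse into a single entry.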

12 Sep 23:43

fmacpro

1.1.3

6087625

1.1.3 Improve consent banner handling and raise default timeout

Overhauled consent dismissal system with a configurable observer timeout, broader overlay detection, and polling to catch late-loading consent frames
Added a Guardian-specific consent debugging script leveraging Puppeteer and the updated consent logic
Raised the default parse timeout to 40 s and updated the documentation to match
Strengthened the parser with new consent utilities and more robust navigation helpers for stable frame handling and fallback strategies
Improved screenshot timing and ensured a clean view of the article
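
The idea of polling for late-loading consent frames under a configurable timeout can be sketched with a generic helper. This is not the project's consent code: pollUntil, its option names, and the simulated check are all hypothetical, and a real check would query the page through Puppeteer rather than increment a counter.

```javascript
// Repeatedly run an async check until it returns a truthy value or the timeout elapses.
// Resolves to true when the check succeeded and false when the deadline passed.
async function pollUntil(check, { timeoutMs = 5000, intervalMs = 250 } = {}) {
  const deadline = Date.now() + timeoutMs
  while (true) {
    if (await check()) return true
    if (Date.now() >= deadline) return false
    await new Promise((resolve) => setTimeout(resolve, intervalMs))
  }
}

// Simulated late-loading consent frame: it only "appears" on the third look.
let attempts = 0
const consentFrameAppeared = async () => {
  attempts += 1
  return attempts >= 3
}

const demo = pollUntil(consentFrameAppeared, { timeoutMs: 1000, intervalMs: 10 })
demo.then((found) => console.log('consent frame found:', found, 'after', attempts, 'checks'))
```

Returning false on timeout instead of throwing lets the caller continue with the parse when no banner ever shows up, which suits pages that have no consent dialog at all.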

12 Sep 14:32

fmacpro

1.1.2

dc9a291

1.1.2 Minor patch to screenshot functionality

Clarify content detection options
Document timeoutMs and increase default from 10s to 20s
Capture screenshot after consent dismissal

09 Sep 22:13

fmacpro

1.1.1

4c64103

1.1.1 Re-enable screenshot capture

Allow parseArticle to honor the screenshot feature rather than filtering it out
Add regression test validating that a mobile screenshot is captured when the feature is enabled
Enable screenshot capture in the single sample script and include the image data in the JSON output

09 Sep 20:36

fmacpro

1.1.0

4345ff3

1.1.0 Modular refactor with expanded scripts, docs, and tests

Introduces a revamped article parser that applies refined heuristics and optionally leverages AI‑trained weights to improve accuracy and adaptability.
Split monolithic parsing logic into dedicated controllers for navigation, content detection, structured data, logging, consent handling, live‑blog support, and text processing.
Introduced new utility modules for async control, NLP plugin loading, and reusable navigation helpers, replacing duplicated code in helpers.js.
Added script suite for batch crawling, sample runs, curated URL fetching, CSV merging, and reranker training—each supporting named CLI arguments and improved Windows compatibility.
Expanded test coverage across controllers and scripts with new fixtures, ensuring core parsing features, batch operations, and helper utilities are exercised.
Reworked README with clearer installation, usage, and options documentation, removed outdated APIDOC, and enabled stricter ESLint rules for consistency.
