Releases: fmacpro/horseman-article-parser
1.2.5 Image/Captions Cleanup
Image/Captions Cleanup
- Added stripImagesForRawText in controllers/textProcessing.js:425 to remove <figure>, <picture>, standalone <img>, and associated captions before the HTML-to-text pass. Raw text now omits image alts/captions while sentence boundary handling stays intact.
- Reused the shared URL helpers (containsUrlLike, stripUrlsFromText, stripDataUrlsFromText) so both raw and formatted outputs stay URL-clean without repeating regex logic. getFormattedText also clears data URIs but keeps real HTML links.
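A minimal sketch of the image-stripping idea, assuming a regex pass over an HTML string. The name stripImagesSketch and its patterns are illustrative only; this is not the library's stripImagesForRawText, which may work on a parsed DOM instead, and the regexes do not handle nested <figure> elements.

```javascript
// Hypothetical sketch: remove image markup and captions before HTML-to-text.
// Not the library's stripImagesForRawText; the regexes are a simplification.
function stripImagesSketch(html) {
  return html
    // Drop whole <figure> blocks, including any <figcaption> inside them.
    .replace(/<figure\b[^>]*>[\s\S]*?<\/figure>/gi, ' ')
    // Drop <picture> blocks together with their <source> and <img> children.
    .replace(/<picture\b[^>]*>[\s\S]*?<\/picture>/gi, ' ')
    // Drop remaining standalone <img> tags so alt text never reaches the output.
    .replace(/<img\b[^>]*>/gi, ' ')
    // Collapse the runs of spaces left behind by the removals.
    .replace(/[ \t]{2,}/g, ' ');
}

const sample =
  '<p>First sentence.</p>' +
  '<figure><img src="a.jpg" alt="A cat"><figcaption>Cat photo</figcaption></figure>' +
  '<p>Second sentence.</p>';

console.log(stripImagesSketch(sample));
```

Replacing each removed block with a space rather than an empty string keeps the neighbouring paragraphs from fusing, which is one way sentence boundaries could stay intact.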
Spell Check Alignment
- controllers/spellCheck.js:1 now imports maskUrlsInText/isUrlLikeToken, masking URL tokens with spaces so offsets and line numbers remain accurate while filtering URL-like misspellings (including data URIs).
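The offset-preserving masking described above can be sketched as follows. This is an assumption-laden illustration: maskUrlsSketch and the URL_LIKE pattern are hypothetical stand-ins, not the library's maskUrlsInText or isUrlLikeToken.

```javascript
// Hypothetical sketch of offset-preserving URL masking.
// Assumption: a simple pattern for http(s) URLs and data URIs is enough here.
const URL_LIKE = /(?:https?:\/\/|data:)[^\s]+/gi;

function maskUrlsSketch(text) {
  // Replace each match with the same number of spaces, so every other
  // character keeps its original index, line, and column.
  return text.replace(URL_LIKE, (match) => ' '.repeat(match.length));
}

const input = 'See https://example.com/a for detials.\nNext line.';
const masked = maskUrlsSketch(input);

console.log(masked.length === input.length); // true
console.log(masked.indexOf('detials') === input.indexOf('detials')); // true
```

Because the masked string has the same length and the same newline positions as the input, any misspelling reported against it can be mapped straight back to the original text.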
1.2.4 Article Summary & Readability Metrics
- controllers/readability.js now builds stats from a retext parse so sentence and paragraph counts follow real linguistic boundaries, with Intl.Segmenter/regex fallbacks to keep numbers sensible when parsing fails.
- tests/readability.test.js keeps the legacy expectations and adds an abbreviation-focused case to confirm that abbreviations like “U.K.” aren’t counted as extra sentences.
- Built a scoring-driven summariser that segments text, scores sentences for factual content, relevance to title/meta hints, and paragraph position, and keeps per-paragraph variety before applying coverage balancing across the article
- Added paragraph extraction with single/double newline fallback plus keyword token support so meta keywords/OG/Twitter hints boost matching sentences
- Feed the new summariser with title/meta keyword hints pulled from page metadata before summarising
- Added a focused unit test covering multi-paragraph selection to guard the new heuristics
- Tightened place heuristics so country names stay as locations and trailing person names drop off. Added a normalized catalogue of multi-word countries plus known place phrases and new helpers (collectTermTags, wordLooksLikePerson, trimPlaceTailWords, cleanPlaceSegment, etc.) in controllers/entityParser.js:92 and controllers/entityParser.js:1190-1330 to split “UK France Canada Australia”, trim verbs/prepositions, and strip endings like “Netanyahu” from place strings.
- Built a buildPersonTokenSet helper (controllers/entityParser.js:1330) and now feed it through expandAndCleanPlaces and the final people filter (controllers/entityParser.js:1608-1619) so any place we keep automatically removes the matching person entry (e.g. “United States”).
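The readability fallback chain mentioned in this release can be illustrated with a short sketch. It assumes the retext stage has already failed and shows only the Intl.Segmenter and regex stages; countSentences and countSentencesRegex are hypothetical names, not functions from controllers/readability.js.

```javascript
// Hypothetical sketch of sentence counting with a fallback chain.
// Only the Intl.Segmenter and regex stages are shown; names are illustrative.
function countSentencesRegex(text) {
  // Crude last resort: a sentence is a run of text ending in . ! or ?
  // (or the end of the input). Abbreviations such as "U.K." are over-counted.
  const matches = text.match(/[^.!?]+(?:[.!?]+|$)/g) || [];
  return matches.filter((s) => s.trim().length > 0).length;
}

function countSentences(text, locale = 'en') {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' });
    let count = 0;
    for (const { segment } of segmenter.segment(text)) {
      // Ignore segments that contain only whitespace.
      if (segment.trim().length > 0) count += 1;
    }
    return count;
  }
  return countSentencesRegex(text);
}

console.log(countSentences('The parser ran. It finished quickly! Did it work?'));
console.log(countSentencesRegex('U.K.')); // 2, which is why the regex stage is only a fallback
```

The regex stage splits on every terminator, so it is kept as the final option and a linguistic parse is preferred whenever one is available.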
1.2.3 Improved entity parsing
- Rewrite the entity parser to strip trailing job titles, split dense acknowledgement sequences, merge optional secondary NER results, and refine deduping, exposing the async pipeline through the article parser.
- Extend NLP plugin hints to cover middle names, suffixes, and secondary NER configuration, update the default hint shape, and document the new options for integrators.
- Preserve spacing when unwrapping inline HTML wrappers and add regression coverage for the helper to prevent merged tokens in cleaned articles.
- Expand entity parser regression tests for the new heuristics and raise language suite timeouts to keep the browser-driven checks stable.
- Overhauled the entity parser with heuristics that trim job titles, split dense acknowledgement blocks, incorporate secondary NER responses, and dedupe results asynchronously within the article pipeline.
- Broadened NLP plugin hint handling—including default options—to accept middle/suffix buckets and external NER sources, and updated the README to describe the configuration.
- Ensured inline wrapper removal leaves readable spacing and added targeted tests for the helper alongside broader entity parser and language timeout coverage.
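As a rough illustration of trimming trailing job titles and deduping people, here is a minimal sketch. TITLE_WORDS is an invented sample list and both function names are hypothetical; the real parser's heuristics in this release are more involved and run asynchronously.

```javascript
// Hypothetical sketch: strip trailing job titles from person strings, then dedupe.
// TITLE_WORDS is an invented sample list, not the library's data.
const TITLE_WORDS = new Set([
  'president', 'prime', 'minister', 'chancellor', 'ceo', 'secretary', 'editor',
]);

function stripTrailingTitle(name) {
  const words = name.trim().split(/\s+/);
  // Pop title words off the end, but always keep at least one word.
  while (words.length > 1 && TITLE_WORDS.has(words[words.length - 1].toLowerCase())) {
    words.pop();
  }
  return words.join(' ');
}

function cleanPeople(names) {
  const seen = new Set();
  const result = [];
  for (const raw of names) {
    const cleaned = stripTrailingTitle(raw);
    const key = cleaned.toLowerCase(); // case-insensitive dedupe key
    if (cleaned && !seen.has(key)) {
      seen.add(key);
      result.push(cleaned);
    }
  }
  return result;
}

console.log(cleanPeople(['Jane Doe Prime Minister', 'jane doe', 'John Roe Chancellor']));
// → [ 'Jane Doe', 'John Roe' ]
```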
1.2.2 Improved container detection, content sanitization, and readability metric
- Refines article container identification and promotion logic for fragmented content, and adds caption triggers.
- Sanitizes article content by removing images, CTAs, and other non-text elements before analysis
- Introduces readability metrics and fixes paragraph-counting logic to improve accuracy.
- Removes deprecated API documentation tooling
Release 1.2.1: readability metrics, multilingual dictionaries, and enhanced summarization
- Added comprehensive readability analysis with sample run output and paragraph-count fixes
- Introduced French and Spanish dictionaries for multi-language processing
- Implemented article summarization features
- Preserved hyphenated words and line breaks for accurate titles and spell-check line numbers
- Refactored entity parser to improve recognition logic
1.2.0 Improved entity and keyword parsing
- Added a dedicated entity parser that normalizes strings, strips possessive suffixes, and deduplicates people, places, organizations, and topics before returning them
- Introduced reusable helpers for capitalizing text, removing trailing possessives, and stripping punctuation
- Updated the keyword parser to apply these helpers so extracted keywords and keyphrases are capitalized and free of trailing possessives
- Loaded NLP plugin hints into the main parse workflow to enrich entity detection
- Expanded test coverage to verify entity capitalization, possessive stripping, and keyword/phrase normalization
- Removed domain specific tweaks from scripts as they are no longer required
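The three normalization helpers named in this release can be sketched as below. The function names and regexes are assumptions made for illustration and are not copied from the library.

```javascript
// Hypothetical helpers mirroring the ideas in this release; names are illustrative.
function stripPunctuation(text) {
  // Trim non-letter, non-digit characters from both ends only,
  // so inner hyphens and apostrophes survive.
  return text.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, '');
}

function stripPossessive(text) {
  // Remove a trailing 's or ’s, or a bare trailing apostrophe.
  return text.replace(/(?:['’]s|['’])$/i, '');
}

function capitalize(text) {
  // Upper-case the first letter of each whitespace-separated word.
  return text.replace(/(^|\s)(\p{L})/gu, (_, lead, ch) => lead + ch.toUpperCase());
}

function normalizeTerm(text) {
  return capitalize(stripPossessive(stripPunctuation(text.trim())));
}

console.log(normalizeTerm(' london’s, ')); // London
console.log(normalizeTerm('“new york’s”')); // New York
```

Punctuation is stripped before the possessive so that a trailing comma or quote does not hide the 's from the possessive check.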
1.1.3 Improve consent banner handling and raise default timeout
- Overhauled consent dismissal system with a configurable observer timeout, broader overlay detection, and polling to catch late-loading consent frames
- Added a Guardian-specific consent debugging script leveraging Puppeteer and the updated consent logic
- Raised the default parse timeout to 40 s and updated the documentation to match
- Strengthened the parser with new consent utilities and more robust navigation helpers for stable frame handling and fallback strategies
- Improved screenshot timing and ensured a clean view of the article
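Polling for a late-loading consent frame can be sketched with a generic helper. Everything here is hypothetical: pollUntil, its option names, and the simulated findConsentFrame check are illustrative, and in a Puppeteer setting the check would presumably inspect something like page.frames() instead.

```javascript
// Hypothetical polling helper: retry a check until it succeeds or a timeout elapses.
async function pollUntil(check, { timeoutMs = 2000, intervalMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const result = await check();
    if (result) return result;
    // Give up if the next attempt would start after the deadline.
    if (Date.now() + intervalMs > deadline) return null;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage: simulate a consent frame that only appears on the third look.
let looks = 0;
const findConsentFrame = async () => (++looks >= 3 ? { name: 'consent-frame' } : null);

pollUntil(findConsentFrame, { timeoutMs: 1000, intervalMs: 50 }).then((frame) => {
  console.log(frame ? `found ${frame.name} after ${looks} checks` : 'no consent frame appeared');
});
```

Returning null on timeout, rather than throwing, lets the caller fall through to other dismissal strategies when no consent frame ever shows up.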
1.1.2 Minor patch to screenshot functionality
- Clarify content detection options
- Document timeoutMs and increase default from 10s to 20s
- Capture screenshot after consent dismissal
1.1.1 Re-enable screenshot capture
- allow parseArticle to honor the screenshot feature rather than filtering it out
- add regression test validating that a mobile screenshot is captured when the feature is enabled
- enable screenshot capture in the single sample script and include the image data in the JSON output
1.1.0 Modular refactor with expanded scripts, docs, and tests
- Introduces a revamped article parser that applies refined heuristics and optionally leverages AI‑trained weights to improve accuracy and adaptability.
- Split monolithic parsing logic into dedicated controllers for navigation, content detection, structured data, logging, consent handling, live‑blog support, and text processing.
- Introduced new utility modules for async control, NLP plugin loading, and reusable navigation helpers, replacing duplicated code in helpers.js.
- Added script suite for batch crawling, sample runs, curated URL fetching, CSV merging, and reranker training—each supporting named CLI arguments and improved Windows compatibility.
- Expanded test coverage across controllers and scripts with new fixtures, ensuring core parsing features, batch operations, and helper utilities are exercised.
- Reworked README with clearer installation, usage, and options documentation, removed outdated APIDOC, and enabled stricter ESLint rules for consistency.