**Returns**: <code>Object</code> - article parser results object. Includes `text.summary` and
+`text.sentences` when `options.enabled` contains `'summary'`. Also exposes
+`language` with ISO-639-1 and ISO-639-3 codes when detection succeeds. Includes `readability` with estimated reading time and basic text statistics when `options.enabled` contains 'readability'.
README.md (+27 -1: 27 additions & 1 deletion)
@@ -1,6 +1,6 @@
 # Horseman Article Parser

-Horseman is a focused article scraping module for the open web. It loads pages (dynamic or AMP), detects the main story body, and returns clean, structured content ready for downstream use. Alongside text and title, it includes in-article links, metadata, sentiment, keywords/keyphrases, named entities, optional spelling suggestions, site icon, and Lighthouse signals. It also copes with live blogs, applies simple per-domain tweaks (headers/cookies/goto), and uses Puppeteer + stealth to reduce blocking.
+Horseman is a focused article scraping module for the open web. It loads pages (dynamic or AMP), detects the main story body, and returns clean, structured content ready for downstream use. Alongside text and title, it includes in-article links, metadata, sentiment, keywords/keyphrases, named entities, optional summaries, optional spelling suggestions, readability metrics and basic counts (characters, words, sentences, paragraphs), site icon, and Lighthouse signals. It also copes with live blogs, applies simple per-domain tweaks (headers/cookies/goto), and uses Puppeteer + stealth to reduce blocking. The parser now detects the article language and exposes ISO codes, with best-effort support for non-English content (features may fall back to English dictionaries when specific resources are missing).

 ## Table of Contents
@@ -51,6 +51,8 @@ const options = {
     "entities",
     "spelling",
     "keywords",
+    "summary",
+    "readability",
   ],
 };
@@ -72,10 +74,20 @@ const options = {
   people: article.people,
   orgs: article.orgs,
   places: article.places,
+  language: article.language,
+  readability: {
+    readingTime: article.readability.readingTime,
+    characters: article.readability.characters,
+    words: article.readability.words,
+    sentences: article.readability.sentences,
+    paragraphs: article.readability.paragraphs,
+  },
   text: {
     raw: article.processed.text.raw,
     formatted: article.processed.text.formatted,
     html: article.processed.text.html,
+    summary: article.processed.text.summary,
+    sentences: article.processed.text.sentences,
   },
   spelling: article.spelling,
   meta: article.meta,
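
Note on the hunk above: `article.readability` and `article.processed.text.summary` are only populated when `"readability"` and `"summary"` are enabled, and `readingTime` is documented in seconds. A consumer may therefore want to guard against missing fields and format the time for display. The sketch below is one way to do that; `formatReadingTime` and `describeArticle` are hypothetical helper names, not part of the module, and the input is a hand-written stub rather than real parser output.

```javascript
// Hypothetical helper: turn a reading time in seconds into a short label.
function formatReadingTime(seconds) {
  if (!Number.isFinite(seconds) || seconds < 0) return "unknown";
  const total = Math.round(seconds);
  const minutes = Math.floor(total / 60);
  const rest = total % 60;
  return minutes > 0 ? `${minutes} min ${rest} s` : `${rest} s`;
}

// Hypothetical helper: read the optional fields without throwing when
// "summary" or "readability" were not enabled for this parse.
function describeArticle(article) {
  const summary = article.processed?.text?.summary ?? null;
  const readability = article.readability;
  return {
    summary,
    readingTime: readability ? formatReadingTime(readability.readingTime) : null,
    words: readability?.words ?? null,
  };
}

// Stub standing in for a parser result (field names follow the hunk above).
const stubArticle = {
  processed: { text: { summary: "A short summary." } },
  readability: { readingTime: 95, words: 300 },
};
console.log(describeArticle(stubArticle));
```

Returning `null` for absent fields keeps the output shape stable whether or not the optional features were enabled.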
@@ -196,9 +208,15 @@ var options = {
     "entities",
     "spelling",
     "keywords",
+    "summary",
+    "readability",
   ],
 };
 ```
+Add "summary" to `options.enabled` to generate a short summary of the article text. The result
+includes `text.summary` and a `text.sentences` array containing the first five sentences.
+
+Add "readability" to `options.enabled` to evaluate readability, estimate reading time, and gather basic text statistics. The result is available as `article.readability` with `readingTime` (seconds), `characters`, `words`, `sentences`, and `paragraphs`.

 You may pass rules for returning an articles title & contents. This is useful in a case
 where the parser is unable to return the desired title or content e.g.
@@ -317,6 +335,8 @@ const options = {
     "entities",
     "spelling",
     "keywords",
+    "summary",
+    "readability",
   ],
   // Optional: tweak spelling output/filters
   retextspell: {
@@ -366,6 +386,10 @@ contentDetection: {
 }
 ```

+### Language Detection
+
+Horseman automatically detects the article language and exposes ISO codes via `article.language` in the result. Downstream steps such as keyword extraction or spelling use these codes to select language-specific resources when available. Dictionaries for English, French, and Spanish are bundled; other languages fall back to English if a matching dictionary or NLP plugin is not found.
+
 ## Development

 Please feel free to fork the repo or open pull requests to the development branch. I've used [eslint](https://eslint.org/) for linting.
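
Note on the Language Detection text added in this hunk: the fallback it describes (bundled English, French, and Spanish dictionaries, English otherwise) can be sketched as below. This is an illustration of the described behaviour, not the module's actual code. The property name `iso6391` on `article.language` is an assumption, since the diff only says ISO-639-1 and ISO-639-3 codes are exposed, and `dictionaryLanguage` is a hypothetical helper.

```javascript
// Languages the README says ship with bundled dictionaries (ISO-639-1 codes).
const BUNDLED_DICTIONARIES = new Set(["en", "fr", "es"]);

// Hypothetical helper: pick the dictionary language for a detected language,
// falling back to English when nothing matching is bundled.
// ASSUMPTION: the detected language object carries its ISO-639-1 code
// under `iso6391`; the real property name may differ.
function dictionaryLanguage(language) {
  const code = language?.iso6391;
  return BUNDLED_DICTIONARIES.has(code) ? code : "en";
}

console.log(dictionaryLanguage({ iso6391: "fr" }));
console.log(dictionaryLanguage({ iso6391: "de" }));
```

Under these assumptions a French article keeps its own dictionary, while a German one, or a result where detection failed, uses the English dictionary.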
@@ -558,6 +582,8 @@ npm run docs
 - [retext-pos](https://github.com/retextjs/retext-pos): Plugin to add part-of-speech (POS) tags
 - [retext-keywords](https://ghub.io/retext-keywords): Keyword extraction with Retext
 - [retext-spell](https://ghub.io/retext-spell): Spelling checker for retext
+- [retext-language](https://ghub.io/retext-language): Language detection for retext
+- [franc](https://ghub.io/franc): Fast language detection from text
 - [sentiment](https://ghub.io/sentiment): AFINN-based sentiment analysis for Node.js
 - [jquery](https://ghub.io/jquery): JavaScript library for DOM operations
 - [jsdom](https://ghub.io/jsdom): A JavaScript implementation of many web standards