
Commit efdc7dd

Merge pull request #162 from fmacpro/develop
Release 1.2.1: readability metrics, multilingual dictionaries, and enhanced summarization

2 parents: 654804d + 7af33fd

23 files changed (+629, −34 lines)

APIDOC.md

Lines changed: 3 additions & 1 deletion

@@ -27,7 +27,9 @@
 main article parser module export function
 
 **Kind**: global function
-**Returns**: <code>Object</code> - article parser results object
+**Returns**: <code>Object</code> - article parser results object. Includes `text.summary` and
+`text.sentences` when `options.enabled` contains `'summary'`. Also exposes
+`language` with ISO-639-1 and ISO-639-3 codes when detection succeeds. Includes `readability` with estimated reading time and basic text statistics when `options.enabled` contains 'readability'.
 
 | Param | Type | Description |
 | --- | --- | --- |

README.md

Lines changed: 27 additions & 1 deletion

@@ -1,6 +1,6 @@
 # Horseman Article Parser
 
-Horseman is a focused article scraping module for the open web. It loads pages (dynamic or AMP), detects the main story body, and returns clean, structured content ready for downstream use. Alongside text and title, it includes in-article links, metadata, sentiment, keywords/keyphrases, named entities, optional spelling suggestions, site icon, and Lighthouse signals. It also copes with live blogs, applies simple per-domain tweaks (headers/cookies/goto), and uses Puppeteer + stealth to reduce blocking.
+Horseman is a focused article scraping module for the open web. It loads pages (dynamic or AMP), detects the main story body, and returns clean, structured content ready for downstream use. Alongside text and title, it includes in-article links, metadata, sentiment, keywords/keyphrases, named entities, optional summaries, optional spelling suggestions, readability metrics and basic counts (characters, words, sentences, paragraphs), site icon, and Lighthouse signals. It also copes with live blogs, applies simple per-domain tweaks (headers/cookies/goto), and uses Puppeteer + stealth to reduce blocking. The parser now detects the article language and exposes ISO codes, with best-effort support for non-English content (features may fall back to English dictionaries when specific resources are missing).
 
 ## Table of Contents
 

@@ -51,6 +51,8 @@ const options = {
     "entities",
     "spelling",
     "keywords",
+    "summary",
+    "readability",
   ],
 };
 

@@ -72,10 +74,20 @@
   people: article.people,
   orgs: article.orgs,
   places: article.places,
+  language: article.language,
+  readability: {
+    readingTime: article.readability.readingTime,
+    characters: article.readability.characters,
+    words: article.readability.words,
+    sentences: article.readability.sentences,
+    paragraphs: article.readability.paragraphs,
+  },
   text: {
     raw: article.processed.text.raw,
     formatted: article.processed.text.formatted,
     html: article.processed.text.html,
+    summary: article.processed.text.summary,
+    sentences: article.processed.text.sentences,
   },
   spelling: article.spelling,
   meta: article.meta,

@@ -196,9 +208,15 @@ var options = {
     "entities",
     "spelling",
     "keywords",
+    "summary",
+    "readability",
   ],
 };
 ```
+Add "summary" to `options.enabled` to generate a short summary of the article text. The result
+includes `text.summary` and a `text.sentences` array containing the first five sentences.
+
+Add "readability" to `options.enabled` to evaluate readability, estimate reading time, and gather basic text statistics. The result is available as `article.readability` with `readingTime` (seconds), `characters`, `words`, `sentences`, and `paragraphs`.
 
 You may pass rules for returning an article's title & contents. This is useful in cases
 where the parser is unable to return the desired title or content, e.g.

@@ -317,6 +335,8 @@ const options = {
     "entities",
     "spelling",
     "keywords",
+    "summary",
+    "readability",
   ],
   // Optional: tweak spelling output/filters
   retextspell: {

@@ -366,6 +386,10 @@ contentDetection: {
 }
 ```
 
+### Language Detection
+
+Horseman automatically detects the article language and exposes ISO codes via `article.language` in the result. Downstream steps such as keyword extraction or spelling use these codes to select language-specific resources when available. Dictionaries for English, French, and Spanish are bundled; other languages fall back to English if a matching dictionary or NLP plugin is not found.
+
 ## Development
 
 Please feel free to fork the repo or open pull requests to the development branch. I've used [eslint](https://eslint.org/) for linting.

@@ -558,6 +582,8 @@ npm run docs
 - [retext-pos](https://github.com/retextjs/retext-pos): Plugin to add part-of-speech (POS) tags
 - [retext-keywords](https://ghub.io/retext-keywords): Keyword extraction with Retext
 - [retext-spell](https://ghub.io/retext-spell): Spelling checker for retext
+- [retext-language](https://ghub.io/retext-language): Language detection for retext
+- [franc](https://ghub.io/franc): Fast language detection from text
 - [sentiment](https://ghub.io/sentiment): AFINN-based sentiment analysis for Node.js
 - [jquery](https://ghub.io/jquery): JavaScript library for DOM operations
 - [jsdom](https://ghub.io/jsdom): A JavaScript implementation of many web standards
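The feature switches documented above can be reasoned about in isolation. A minimal sketch, assuming the `options.enabled` shape from the README; the `featureEnabled` helper is hypothetical, not part of the module:

```javascript
// Sketch: a trimmed version of the README options object with the
// new 'summary' and 'readability' switches enabled.
const options = {
  enabled: [
    'entities',
    'spelling',
    'keywords',
    'summary',
    'readability'
    // ...plus any other features listed in the README
  ]
}

// Hypothetical helper: decide whether a feature should run.
function featureEnabled (options, name) {
  return Array.isArray(options.enabled) && options.enabled.includes(name)
}

const summaryOn = featureEnabled(options, 'summary')
const readabilityOn = featureEnabled(options, 'readability')
```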

controllers/entityParser.js

Lines changed: 27 additions & 8 deletions

@@ -5,25 +5,28 @@ export function normalizeEntity (w) {
   if (typeof w !== 'string') return ''
   return w
     .replace(/[']/g, '')
-    .replace(/[^A-Za-z0-9]+/g, ' ')
+    .replace(/[^A-Za-z0-9-]+/g, ' ')
     .trim()
     .toLowerCase()
 }
 
 export default function entityParser (nlpInput, pluginHints = { first: [], last: [] }, timeLeft = () => Infinity) {
+  const doc = nlp(nlpInput)
   const entityToString = (e) => {
     if (Array.isArray(e?.terms) && e.terms.length) {
       const parts = []
-      for (const term of e.terms) {
+      for (let i = 0; i < e.terms.length; i++) {
+        const term = e.terms[i]
         let text = String(term.text || '').trim()
         if (!text) continue
         if (/^[']s$/i.test(text) && parts.length) {
           parts[parts.length - 1] += "'s"
         } else {
-          parts.push(text)
+          const isHyphen = typeof term.post === 'string' && term.post.trim() === '-' && i < e.terms.length - 1
+          parts.push(isHyphen ? text + '-' : text)
         }
       }
-      return parts.join(' ').trim()
+      return parts.join(' ').replace(/- /g, '-').trim()
     }
     if (typeof e?.text === 'string') return e.text.trim()
     return null

@@ -45,7 +48,23 @@
   }
 
   const result = {}
-  result.people = dedupeEntities(nlp(nlpInput).people().json().map(entityToString), true)
+  // use compromise's richer person parsing to split name parts
+  doc.people().parse()
+  result.people = dedupeEntities(
+    doc.people().json().map(p => {
+      const text = entityToString(p)
+      if (p.person && (p.person.honorific || p.person.firstName || p.person.middleName || p.person.lastName)) {
+        const parts = [p.person.honorific, p.person.firstName, p.person.middleName, p.person.lastName]
+          .filter(Boolean)
+          .map(capitalizeFirstLetter)
+        const joined = parts.join(' ')
+        // preserve hyphenated names using original text
+        return /-/.test(p.text) ? text : joined
+      }
+      return text
+    }),
+    true
+  )
   const seen = new Set(result.people.map(p => normalizeEntity(p)))
   if (pluginHints.first.length && pluginHints.last.length) {
     const haystack = normalizeEntity(nlpInput)

@@ -61,8 +80,8 @@
     }
   }
   result.people = dedupeEntities(result.people, true)
-  if (timeLeft() >= 1000) result.places = dedupeEntities(nlp(nlpInput).places().json().map(entityToString))
-  if (timeLeft() >= 900) result.orgs = dedupeEntities(nlp(nlpInput).organizations().json().map(entityToString))
-  if (timeLeft() >= 800) result.topics = dedupeEntities(nlp(nlpInput).topics().json().map(entityToString))
+  if (timeLeft() >= 1000) result.places = dedupeEntities(doc.places().json().map(entityToString))
+  if (timeLeft() >= 900) result.orgs = dedupeEntities(doc.organizations().json().map(entityToString))
+  if (timeLeft() >= 800) result.topics = dedupeEntities(doc.topics().json().map(entityToString))
   return result
 }
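The hyphen-aware term join introduced in this diff can be exercised on its own. This is a standalone re-implementation of that loop, applied to mock compromise-style term objects (`{ text, post }`), not the module itself:

```javascript
// Sketch: join term texts, marking a hyphen join when a term's trailing
// separator (post) is '-', then collapsing 'x- y' into 'x-y'.
function joinTerms (terms) {
  const parts = []
  for (let i = 0; i < terms.length; i++) {
    const text = String(terms[i].text || '').trim()
    if (!text) continue
    const post = terms[i].post
    // hyphen join only applies when another term follows
    const isHyphen = typeof post === 'string' && post.trim() === '-' && i < terms.length - 1
    parts.push(isHyphen ? text + '-' : text)
  }
  return parts.join(' ').replace(/- /g, '-').trim()
}

const name = joinTerms([
  { text: 'Jean', post: '-' },
  { text: 'Luc', post: ' ' },
  { text: 'Picard', post: '' }
])
// name === 'Jean-Luc Picard'
```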

controllers/keywordParser.js

Lines changed: 6 additions & 1 deletion

@@ -2,11 +2,16 @@ import { retext } from 'retext'
 import { toString as nlcstToString } from 'nlcst-to-string'
 import pos from 'retext-pos'
 import keywords from 'retext-keywords'
+import language from 'retext-language'
 import _ from 'lodash'
 import { capitalizeFirstLetter, stripPossessive } from '../helpers.js'
 
 export default async function keywordParser (html, options = { maximum: 10 }) {
-  const file = await retext().use(pos).use(keywords, options).process(html)
+  const { lang, ...rest } = options || {}
+  const processor = retext()
+  if (lang) processor.use(language, { language: lang })
+  processor.use(pos).use(keywords, rest)
+  const file = await processor.process(html)
 
   const keywordsArr = file.data.keywords.map(keyword => ({
     keyword: capitalizeFirstLetter(stripPossessive(nlcstToString(keyword.matches[0].node))),

controllers/language.js

Lines changed: 46 additions & 0 deletions

@@ -0,0 +1,46 @@
+import { retext } from 'retext'
+import retextLanguage from 'retext-language'
+import { franc } from 'franc'
+
+// Minimal ISO-639-3 to ISO-639-1 mapping for common languages
+const ISO3_TO_1 = {
+  afr: 'af', ara: 'ar', ben: 'bn', bul: 'bg', cat: 'ca', ces: 'cs', dan: 'da',
+  deu: 'de', ell: 'el', eng: 'en', est: 'et', eus: 'eu', fin: 'fi', fra: 'fr',
+  heb: 'he', hin: 'hi', hrv: 'hr', hun: 'hu', ind: 'id', ita: 'it', jpn: 'ja',
+  kor: 'ko', lit: 'lt', lav: 'lv', nld: 'nl', pol: 'pl', por: 'pt', ron: 'ro',
+  rus: 'ru', slk: 'sk', slv: 'sl', spa: 'es', srp: 'sr', swe: 'sv', tam: 'ta',
+  tel: 'te', tha: 'th', tur: 'tr', ukr: 'uk', urd: 'ur', vie: 'vi', zho: 'zh'
+}
+
+function iso3to1(code) {
+  return ISO3_TO_1[code] || null
+}
+
+/**
+ * Detect language of provided text.
+ * Returns ISO-639-1 and ISO-639-3 codes.
+ * Defaults to English if detection fails.
+ * @param {string} text raw text input
+ * @returns {{iso6391: string, iso6393: string}}
+ */
+export default async function detectLanguage(text) {
+  let iso6393 = 'eng'
+  if (typeof text === 'string' && text.trim()) {
+    try {
+      const file = await retext().use(retextLanguage).process(text)
+      if (file.data && file.data.language && file.data.language !== 'und') {
+        iso6393 = file.data.language
+      } else {
+        const f = franc(text)
+        if (f && f !== 'und') iso6393 = f
+      }
+    } catch {
+      try {
+        const f = franc(text)
+        if (f && f !== 'und') iso6393 = f
+      } catch {}
+    }
+  }
+  const iso6391 = iso3to1(iso6393) || 'en'
+  return { iso6391, iso6393 }
+}
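The final fallback step of `detectLanguage` is pure and easy to check standalone. A sketch with a subset of the ISO table above (the full mapping lives in controllers/language.js):

```javascript
// Subset of the ISO-639-3 → ISO-639-1 table for illustration.
const ISO3_TO_1 = { eng: 'en', fra: 'fr', spa: 'es', deu: 'de' }

function iso3to1 (code) {
  return ISO3_TO_1[code] || null
}

// Mirrors the last line of detectLanguage: any code without a
// two-letter mapping falls back to 'en'.
function toIso1 (iso6393) {
  return iso3to1(iso6393) || 'en'
}
```

This keeps `article.language.iso6391` always populated even for languages the table does not cover, matching the "defaults to English" behaviour documented in the JSDoc.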

controllers/readability.js

Lines changed: 18 additions & 0 deletions

@@ -0,0 +1,18 @@
+/**
+ * Evaluate basic readability statistics and estimate reading time.
+ * Returns an estimated reading time in seconds (assuming ~200 wpm) and
+ * basic document statistics (characters, words, sentences, paragraphs).
+ *
+ * @param {string} text raw text input
+ * @returns {{readingTime: number, characters: number, words: number, sentences: number, paragraphs: number}}
+ */
+export default async function checkReadability (text) {
+  if (!text || typeof text !== 'string') return { readingTime: 0, characters: 0, words: 0, sentences: 0, paragraphs: 0 }
+  const trimmed = text.trim()
+  const characters = trimmed.length
+  const words = trimmed.split(/\s+/).filter(Boolean).length
+  const sentences = trimmed.split(/[.!?]+/).filter(s => s.trim().length > 0).length
+  const paragraphs = trimmed.split(/\r?\n+/).filter(p => p.trim().length > 0).length
+  const readingTime = Math.round((words / 200) * 60)
+  return { readingTime, characters, words, sentences, paragraphs }
+}
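Since the statistics above are computed with plain string operations, they can be re-implemented and run without the module. A synchronous sketch of the same logic:

```javascript
// Standalone re-implementation of the counting in controllers/readability.js.
function readabilityStats (text) {
  if (!text || typeof text !== 'string') {
    return { readingTime: 0, characters: 0, words: 0, sentences: 0, paragraphs: 0 }
  }
  const trimmed = text.trim()
  const characters = trimmed.length
  // words: runs of non-whitespace
  const words = trimmed.split(/\s+/).filter(Boolean).length
  // sentences: segments between terminal punctuation
  const sentences = trimmed.split(/[.!?]+/).filter(s => s.trim().length > 0).length
  // paragraphs: non-empty lines separated by newlines
  const paragraphs = trimmed.split(/\r?\n+/).filter(p => p.trim().length > 0).length
  // ~200 words per minute, expressed in seconds
  const readingTime = Math.round((words / 200) * 60)
  return { readingTime, characters, words, sentences, paragraphs }
}

const stats = readabilityStats('One two three. Four five!\nSix seven.')
// stats → { readingTime: 2, characters: 36, words: 7, sentences: 3, paragraphs: 2 }
```

Note these are heuristics: abbreviations like "e.g." inflate the sentence count, which is acceptable for a rough reading-time estimate.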

controllers/spellCheck.js

Lines changed: 2 additions & 2 deletions

@@ -11,8 +11,8 @@ export default async function spellCheck (text, options) {
   input = input.replace(/\b[\w-]+(?:\.[\w-]+)+(?:\/\S*)?/gi, ' ')
   // remove alphanumeric tokens like 123abc
   input = input.replace(/[0-9]{1,}[a-zA-Z]{1,}/gi, ' ')
-  // collapse whitespace
-  input = input.replace(/\s+/g, ' ').trim()
+  // collapse spaces but preserve line breaks for accurate line numbers
+  input = input.replace(/\r\n/g, '\n').replace(/[ \t]+/g, ' ')
 
   if (typeof options === 'undefined') {
     options = { dictionary }
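The whitespace change above is subtle: collapsing with `/\s+/` destroyed newlines, which shifted the line numbers the spell checker reports. A minimal sketch of the new behaviour in isolation:

```javascript
// Collapse runs of spaces/tabs but keep line breaks intact,
// so positions reported per line remain accurate.
function collapsePreservingNewlines (input) {
  return input.replace(/\r\n/g, '\n').replace(/[ \t]+/g, ' ')
}

const cleaned = collapsePreservingNewlines('foo   bar\r\nbaz\tqux')
// cleaned === 'foo bar\nbaz qux'
```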

controllers/summary.js

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
export function buildSummary (text) {
2+
if (!text || typeof text !== 'string') return { text: '', sentences: [] }
3+
const sentences = text.match(/[^.!?]+[.!?]/g) || [text]
4+
const top = sentences.slice(0, 5).map(s => s.trim())
5+
return { text: top.join(' ').trim(), sentences: top }
6+
}
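`buildSummary` is a pure function, so it can be copied out of the module and run directly. The same body, exercised on a six-sentence input:

```javascript
// Copy of controllers/summary.js so it runs standalone.
function buildSummary (text) {
  if (!text || typeof text !== 'string') return { text: '', sentences: [] }
  // split on terminal punctuation, keeping the punctuation with each sentence
  const sentences = text.match(/[^.!?]+[.!?]/g) || [text]
  // keep only the first five sentences, trimmed
  const top = sentences.slice(0, 5).map(s => s.trim())
  return { text: top.join(' ').trim(), sentences: top }
}

const summary = buildSummary('First. Second! Third? Fourth. Fifth. Sixth.')
// summary.sentences → ['First.', 'Second!', 'Third?', 'Fourth.', 'Fifth.']
```

Note the fallback `|| [text]`: input without terminal punctuation becomes a single-element sentence list rather than an empty summary.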

controllers/textProcessing.js

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,7 @@ export function getRawText (html) {
1313
unorderedListItemPrefix: ''
1414
}
1515
let rawText = htmlToText(html, options)
16-
rawText = nlp(rawText).normalize().out('text')
16+
rawText = nlp(rawText).out('text')
1717
const containsUrlLike = (s) => {
1818
if (!s) return false
1919
const str = String(s)

controllers/titleDetector.js

Lines changed: 3 additions & 1 deletion

@@ -12,7 +12,9 @@ function normalizeTitle(title) {
   if (!title) return null
   let t = String(title).replace(/(\r\n|\n|\r)/gm, ' ').replace(/\s+/g, ' ').trim()
   // remove common site suffixes after delimiters
-  t = t.replace(/\s*[|\-:·»]\s*[^|\-:·»]{2,}\s*$/u, () => '')
+  t = t
+    .replace(/\s*[|:·»]\s*[^|:·»-]{2,}\s*$/u, '')
+    .replace(/\s+-\s+[^|:·»-]{2,}\s*$/u, '')
   return t.trim() || null
 }
 
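The point of splitting the old single regex into two is that hyphens now only count as a site-suffix delimiter when padded by spaces, so hyphenated words inside titles survive. A sketch of just that stripping step:

```javascript
// Strip trailing '| Site', ': Site', '· Site', '» Site' suffixes, then
// ' - Site' suffixes, without touching in-word hyphens.
function stripSiteSuffix (title) {
  return title
    .replace(/\s*[|:·»]\s*[^|:·»-]{2,}\s*$/u, '')
    .replace(/\s+-\s+[^|:·»-]{2,}\s*$/u, '')
    .trim()
}

const a = stripSiteSuffix('Big Story | Example News')
// a === 'Big Story'
const b = stripSiteSuffix('Well-Known Name Speaks')
// b is unchanged: the in-word hyphen is not space-padded
```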
