Releases: fmacpro/horseman-article-parser
1.0.0 Resurrection update
Summary
- Convert the entire codebase from CommonJS to native ES modules, add type: "module" in package.json, and switch lint config to eslint.config.mjs.
- Extract keyword parsing, spell checking, and lighthouse analysis into dedicated controller modules. Lighthouse audits now run against the already-launched browser instance instead of spawning a second session.
- Update Puppeteer, Lighthouse, jsdom, ESLint, and related packages to their latest major versions, add a custom puppeteer-extra-plugin-user-data-dir override, and document these changes in README.md and APIDOC.md.
- Rewrite test.js using modern async/await, replace fs.writeFile with promise-based I/O, and remove process.exit usage.
Breaking Changes
- Package now publishes as an ES module only; consumers must use import syntax (or dynamic import() from CommonJS) and run on Node.js versions that support ES modules.
- Major dependency upgrades (e.g., Puppeteer 24, Lighthouse 12, jsdom 26) likely raise the minimum supported Node.js version (≈18+) and may introduce API or behavior changes in downstream tooling.
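For CommonJS consumers, the dynamic import() route mentioned above can be sketched as follows. `loadEsModule` is a hypothetical helper, the demo loads a Node.js built-in so that it runs without the package installed, and the `parseArticle` export named in the final comment is an assumption that should be checked against the README.

```javascript
// import() is available in both CommonJS and ES modules; it returns a promise
// that resolves to the module's namespace object.
async function loadEsModule (specifier) {
  return import(specifier)
}

// Stand-in demo using a Node.js built-in module.
loadEsModule('node:path').then((path) => {
  console.log(typeof path.join) // prints "function"
})

// With the package installed, the same pattern would look like this
// (assumption: the package exposes a named `parseArticle` export):
// loadEsModule('horseman-article-parser').then(({ parseArticle }) => parseArticle(options))
```

Code that is already an ES module can use a static import statement instead and does not need this wrapper.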
0.9.0
- Allows passing rules for returning an article's title and content. This is useful when the parser is unable to return the desired title or content, e.g.:
rules: [
  {
    host: 'www.bbc.co.uk',
    content: () => {
      var j = window.$
      j('article section, article figure, article header').remove()
      return j('article').html()
    }
  },
  {
    host: 'www.youtube.com',
    title: () => {
      return window.ytInitialData.contents.twoColumnWatchNextResults.results.results.contents[0].videoPrimaryInfoRenderer.title.runs[0].text
    },
    content: () => {
      return window.ytInitialData.contents.twoColumnWatchNextResults.results.results.contents[1].videoSecondaryInfoRenderer.description.runs[0].text
    }
  }
]
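These notes do not spell out how a rule is chosen for a given page. The sketch below assumes a rule applies when its host equals the hostname of the requested URL. `findRuleForUrl` is a hypothetical helper written for illustration, not part of the parser's API, and the rule bodies are simplified placeholders.

```javascript
// Hypothetical helper: pick the first rule whose `host` matches the URL's hostname.
// Assumption: rules are matched by exact hostname; the parser's real logic may differ.
function findRuleForUrl (rules, url) {
  const { hostname } = new URL(url)
  return rules.find((rule) => rule.host === hostname) || null
}

const rules = [
  { host: 'www.bbc.co.uk', content: () => '<p>custom content</p>' },
  { host: 'www.youtube.com', title: () => 'custom title' }
]

const rule = findRuleForUrl(rules, 'https://www.bbc.co.uk/news/some-article')
console.log(rule ? rule.host : 'no rule') // prints "www.bbc.co.uk"
```

Under this assumption a rule for 'www.bbc.co.uk' would not apply to 'bbc.co.uk' or 'news.bbc.co.uk', so each host variant would need its own entry.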
0.8.54
0.8.53
0.8.52
- sidebar keyword removed from unlikely candidates regex & handled unexpected redirects ( fixes #47 )
- article body identification rules (regexes) moved to options
- exposed original html of document on response object ( #48 )
- dependency security updates
- amended the default puppeteer.goto waitUntil option to be networkidle2 rather than domcontentloaded
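As a sketch of what the new default means for callers, the snippet below assumes the option is nested as puppeteer.goto.waitUntil on the options object, as the option name above suggests. `resolveGotoOptions` and `DEFAULT_GOTO` are hypothetical names used only to show caller-supplied values overriding the default.

```javascript
// Assumed default after this release: wait until the network is nearly idle.
const DEFAULT_GOTO = { waitUntil: 'networkidle2' }

// Hypothetical helper: caller-supplied goto options override the default.
function resolveGotoOptions (options = {}) {
  const userGoto = (options.puppeteer && options.puppeteer.goto) || {}
  return { ...DEFAULT_GOTO, ...userGoto }
}

// Opt back into the previous behaviour, e.g. for pages that never go network-idle.
const options = {
  url: 'https://example.com/article',
  puppeteer: { goto: { waitUntil: 'domcontentloaded' } }
}

console.log(resolveGotoOptions(options).waitUntil) // prints "domcontentloaded"
console.log(resolveGotoOptions({}).waitUntil) // prints "networkidle2"
```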
0.8.51
0.8.5
- Allow compromise plugins to be passed in
- Update docs
Compromise is the natural language processor that allows horseman-article-parser to return
topics e.g. people, places & organisations. You can now pass custom plugins to compromise to modify or add to the word lists like so:
// add some names
let testPlugin = function (Doc, world) {
  world.addWords({
    'rishi': 'FirstName',
    'sunak': 'LastName'
  })
}
const options = {
  url: 'https://www.theguardian.com/commentisfree/2020/jul/08/the-guardian-view-on-rishi-sunak-right-words-right-focus-wrong-policies',
  enabled: ['lighthouse', 'screenshot', 'links', 'sentiment', 'entities', 'spelling', 'keywords'],
  // custom compromise plugins are passed under nlp.plugins
  nlp: {
    plugins: [testPlugin]
  }
}
This allows us to match - for example - names which are not in the base compromise word lists.
0.8.4
0.8.3
- Refactor title processing
Title processing is off by default and can now be turned on. It can also be configured, as below:
var options = {
  title: {
    useBestTitlePart: true, // true turns on the title processing
    commonSeparatingCharacters: [' | ', ' _ ', ' - ', '«', '»', ' — ', ' — ', ' – '],
    minimumTitlePartLength: 10
  }
}
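The release notes do not describe the heuristic behind these settings. One plausible reading, sketched below purely as an assumption, is that the title is split on the separators and the longest part that meets the minimum length is kept. `bestTitlePart` is a hypothetical function, not the parser's implementation.

```javascript
// Hypothetical sketch of "best title part" selection; the parser's real heuristic may differ.
function bestTitlePart (title, config) {
  if (!config.useBestTitlePart) return title

  // Split the title on every configured separator in turn.
  let parts = [title]
  for (const separator of config.commonSeparatingCharacters) {
    parts = parts.flatMap((part) => part.split(separator))
  }

  // Keep only parts that are long enough, then prefer the longest one.
  const candidates = parts
    .map((part) => part.trim())
    .filter((part) => part.length >= config.minimumTitlePartLength)
  if (candidates.length === 0) return title
  return candidates.reduce((best, part) => (part.length > best.length ? part : best))
}

const config = {
  useBestTitlePart: true,
  commonSeparatingCharacters: [' | ', ' - '],
  minimumTitlePartLength: 10
}

console.log(bestTitlePart('Budget 2020: what it means for you | Example News', config))
// prints "Budget 2020: what it means for you"
```

In this sketch a title with no sufficiently long part is returned unchanged, so a short title is never reduced to a fragment.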