Releases: fmacpro/horseman-article-parser

1.0.0 Resurrection update

03 Sep 14:04
3261fe8

Summary

  • Convert the entire codebase from CommonJS to native ES modules, add type: "module" in package.json, and switch lint config to eslint.config.mjs.
  • Extract keyword parsing, spell checking, and Lighthouse analysis into dedicated controller modules. Lighthouse audits now run against the already-launched browser instance instead of spawning a second session.
  • Update Puppeteer, Lighthouse, jsdom, ESLint, and related packages to their latest major versions, add a custom puppeteer-extra-plugin-user-data-dir override, and document these changes in README.md and APIDOC.md.
  • Rewrite test.js using modern async/await, replace fs.writeFile with promise-based I/O, and remove process.exit usage.

Breaking Changes

  1. The package is now published as an ES module only; consumers must use import syntax (or dynamic import() from CommonJS) and run on a Node.js version that supports ES modules.
  2. Major dependency upgrades (e.g., Puppeteer 24, Lighthouse 12, jsdom 26) likely raise the minimum supported Node.js version (≈18+) and may introduce API or behavior changes in downstream tooling.
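
For CommonJS consumers, the dynamic import() route mentioned in the first breaking change can be sketched as follows. This is a minimal illustration rather than something taken from the project's documentation: `loadEsModule` is a hypothetical helper name, the URL is a placeholder, and the named `parseArticle` export in the usage comment is an assumption to check against README.md.

```javascript
// CommonJS file, e.g. consumer.cjs (hypothetical filename).
// require() of an ES-module-only package throws ERR_REQUIRE_ESM on Node.js
// versions without require(esm) support, but dynamic import() is available
// in CommonJS and returns a promise for the module namespace.
async function loadEsModule (specifier) {
  return import(specifier)
}

// Usage sketch (the `parseArticle` export name and the URL are assumptions):
// loadEsModule('horseman-article-parser')
//   .then(({ parseArticle }) => parseArticle({ url: 'https://www.example.com/article' }))
//   .then((article) => console.log(article))
```

Wrapping the call in a helper keeps the asynchronous load in one place, so the rest of a CommonJS codebase can await it once at startup.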

0.9.0

14 Nov 18:53
e53e187

  • Allows passing rules for returning an article's title and content. This is useful when
    the parser is unable to return the desired title or content, e.g.:
rules: [
  {
    host: 'www.bbc.co.uk',
    content: () => {
      var j = window.$
      j('article section, article figure, article header').remove()
      return j('article').html()
    }
  },
  {
    host: 'www.youtube.com',
    title: () => {
      return window.ytInitialData.contents.twoColumnWatchNextResults.results.results.contents[0].videoPrimaryInfoRenderer.title.runs[0].text
    },
    content: () => {
      return window.ytInitialData.contents.twoColumnWatchNextResults.results.results.contents[1].videoSecondaryInfoRenderer.description.runs[0].text
    }
  }
]
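
To illustrate how per-host rules like these could be resolved for a given URL, here is a minimal sketch. It is not the library's actual implementation: `findRuleForUrl` is a hypothetical helper, and exact hostname matching is an assumption.

```javascript
// Hypothetical helper: return the first rule whose `host` exactly matches
// the hostname of the given URL, or undefined when no rule applies.
function findRuleForUrl (rules, url) {
  const { hostname } = new URL(url)
  return rules.find((rule) => rule.host === hostname)
}

// Simplified stand-ins for the rules in the release note
const exampleRules = [
  { host: 'www.bbc.co.uk', content: () => 'custom content' },
  { host: 'www.youtube.com', title: () => 'custom title' }
]

const rule = findRuleForUrl(exampleRules, 'https://www.bbc.co.uk/news')
console.log(rule ? rule.host : 'no rule') // logs "www.bbc.co.uk"
```

Because the real `title` and `content` functions reference `window`, they appear to be evaluated in the page context, so they should only rely on globals that exist in the loaded page.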

0.8.54

31 Aug 19:37
b28a125

  • Get the site icon URL

0.8.53

06 Aug 19:30
697034c

  • BBC article scraping fixed
  • Dependencies updated

0.8.52

07 Jan 19:21
4bde87f

  • removed the sidebar keyword from the unlikely candidates regex & handled unexpected redirects (fixes #47)
  • article body identification rules (regexes) moved to options
  • exposed the original HTML of the document on the response object (#48)
  • dependency security updates
  • changed the default puppeteer.goto waitUntil option from domcontentloaded to networkidle2
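
If the new default is too slow for a given site, the previous behaviour can presumably be restored through the options. The sketch below assumes the setting lives at `puppeteer.goto.waitUntil`, as the wording of the note suggests; check the README for the exact shape. The URL is a placeholder. The `waitUntil` values themselves are standard Puppeteer navigation lifecycle events.

```javascript
// Assumed options shape: `puppeteer.goto` is forwarded to Puppeteer's page.goto().
const options = {
  url: 'https://www.example.com/article', // hypothetical URL
  puppeteer: {
    goto: {
      // 'networkidle2': navigation is considered finished once there have been
      // no more than 2 network connections for at least 500 ms (the new default).
      // 'domcontentloaded': resolves earlier, when the DOMContentLoaded event
      // fires (the previous default).
      waitUntil: 'domcontentloaded'
    }
  }
}

console.log(options.puppeteer.goto.waitUntil) // logs "domcontentloaded"
```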

0.8.51

07 Aug 23:54
28d204f

  • dependencies updated

0.8.5

09 Jul 19:32

  • Allow compromise plugins to be passed in
  • Update docs

Compromise is the natural language processor that allows horseman-article-parser to return
topics, e.g. people, places & organisations. You can now pass custom plugins to compromise to modify or add to the word lists, like so:

// add some names
let testPlugin = function(Doc, world) {
  world.addWords({
    'rishi': 'FirstName',
    'sunak': 'LastName',
  })
}

const options = {
  url: 'https://www.theguardian.com/commentisfree/2020/jul/08/the-guardian-view-on-rishi-sunak-right-words-right-focus-wrong-policies',
  enabled: ['lighthouse', 'screenshot', 'links', 'sentiment', 'entities', 'spelling', 'keywords'],
  nlp: {
    plugins: [testPlugin]
  }
}

This allows us to match - for example - names which are not in the base compromise word lists.

0.8.4

08 Jul 17:12
8a8fc35

  • Removed title manipulation logic

The title manipulation isn't good enough. I think this is better done, if required, in the application using the package, where logic specific to the site being crawled can be applied.

0.8.3

07 Jul 23:41
209934c

  • Refactor title processing

Title processing is now off by default and can be turned on. It is also now possible to configure the title processing functionality, as below:

var options = {
  title: {
    useBestTitlePart: true, // true turns on the title processing
    commonSeparatingCharacters: [' | ', ' _ ', ' - ', '«', '»', ' — ', ' — ', ' – '],
    minimumTitlePartLength: 10
  }
}
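
The release note does not spell out how these options interact. One plausible reading, sketched below purely as an illustration, is that the title is split on the separators and the longest part that meets the minimum length is kept. `pickBestTitlePart` is a hypothetical function, the sample title is made up, and the selection rule is an assumption, not the package's code.

```javascript
// Hypothetical sketch: split a page title on known separators and keep the
// longest part that is at least `minimumTitlePartLength` characters long.
// Falls back to the original title when no part qualifies.
function pickBestTitlePart (title, { commonSeparatingCharacters, minimumTitlePartLength }) {
  let parts = [title]
  for (const separator of commonSeparatingCharacters) {
    parts = parts.flatMap((part) => part.split(separator))
  }
  const candidates = parts
    .map((part) => part.trim())
    .filter((part) => part.length >= minimumTitlePartLength)
  if (candidates.length === 0) return title
  return candidates.reduce((best, part) => (part.length > best.length ? part : best))
}

const best = pickBestTitlePart('Budget 2020: what it means for you | The Example Times', {
  commonSeparatingCharacters: [' | ', ' - '],
  minimumTitlePartLength: 10
})
console.log(best) // logs "Budget 2020: what it means for you"
```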

0.8.2

07 Jul 18:53
0d7cf5a

  • Improve title handling