diff --git a/.github/workflows/ci-test.yml b/.github/workflows/ci-test.yml
index e61fc73b..60c2913d 100644
--- a/.github/workflows/ci-test.yml
+++ b/.github/workflows/ci-test.yml
@@ -8,40 +8,33 @@ on: [push, pull_request]
jobs:
test:
- runs-on: ubuntu-20.04
+ runs-on: ubuntu-latest
strategy:
matrix:
- node_version: [14.x, 15.x, 16.x, 17.x, 18.x]
+ node_version: [20.x, 22.x, 24.x]
steps:
- - uses: actions/checkout@v2
+ - uses: actions/checkout@v4
- name: setup Node.js v${{ matrix.node_version }}
- uses: actions/setup-node@v2
+ uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node_version }}
- name: run npm scripts
+ env:
+ PROXY_SERVER: ${{ secrets.PROXY_SERVER }}
run: |
- npm i -g standard
npm install
npm run lint
npm run build --if-present
npm run test
- - name: sync to coveralls
- uses: coverallsapp/github-action@v1.1.2
- with:
- github-token: ${{ secrets.GITHUB_TOKEN }}
-
- name: cache node modules
- uses: actions/cache@v2
+ uses: actions/cache@v4
with:
path: ~/.npm
key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
restore-keys: |
${{ runner.os }}-node-
-
-
-
diff --git a/.github/workflows/codeql-analysis.yml b/.github/workflows/codeql-analysis.yml
index 2124bd64..a77d776a 100644
--- a/.github/workflows/codeql-analysis.yml
+++ b/.github/workflows/codeql-analysis.yml
@@ -38,7 +38,7 @@ jobs:
steps:
- name: Checkout repository
- uses: actions/checkout@v3
+ uses: actions/checkout@v4
# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL
diff --git a/.gitignore b/.gitignore
index 33f33bbe..5b416c34 100644
--- a/.gitignore
+++ b/.gitignore
@@ -15,5 +15,8 @@ coverage
yarn.lock
coverage.lcov
pnpm-lock.yaml
+lcov.info
-dist/
+deno.lock
+
+evaluation
diff --git a/.npmignore b/.npmignore
index 68aa0872..f2f3c65a 100644
--- a/.npmignore
+++ b/.npmignore
@@ -1,18 +1,7 @@
-node_modules/
-src/
-test-data/
-.idea/
-coverage/
-.vscode/
-
-.DS_Store
-yarn.lock
-coverage.lcov
+node_modules
+coverage
+.github
pnpm-lock.yaml
-
-*.js
-*.cjs
-*.js.map
-
-!dist/**/*.js
-!index.js
+examples
+test-data
+lcov.info
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 00000000..8cfca770
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,71 @@
+# Contributing to `@extractus/article-extractor`
+
+Glad to see you here.
+
+Collaborations and pull requests are always welcome, though larger proposals should be discussed first.
+
+As an open source project, it's best to follow the Unix philosophy: "do one thing and do it well".
+
+## Third-party libraries
+
+Please avoid using libraries other than those available in the standard library, unless necessary.
+
+This library needs to stay simple and flexible so it can run on multiple platforms such as Deno, Bun, or even the browser.
+
+
+## Coding convention
+
+Make sure your code lints before opening a pull request.
+
+
+```bash
+cd article-extractor
+
+# check coding convention issue
+npm run lint
+
+# auto fix coding convention issue
+npm run lint:fix
+```
+
+*When you run `npm test`, linting runs first.*
+
+
+## Testing
+
+Be sure to run the unit test suite before opening a pull request. An example test run is shown below.
+
+```bash
+cd article-extractor
+npm test
+```
+
+
+
+If test coverage decreases, please check the test scripts and try to improve coverage.
+
+
+## Documentation
+
+If you've changed APIs, please update README and [the examples](examples).
+
+
+## Clean commit histories
+
+When you open a pull request, please ensure the commit history is clean.
+Squash the commits into logical blocks, perhaps a single commit if that makes sense.
+
+What you want to avoid is commits such as "WIP" and "fix test" in the history.
+This keeps the history on master clean and straightforward.
+
+For people new to git, please refer to the following guides:
+
+- [Writing good commit messages](https://github.com/erlang/otp/wiki/writing-good-commit-messages)
+- [Commit Message Guidelines](https://gist.github.com/robertpainsi/b632364184e70900af4ab688decf6f53)
+
+
+## License
+
+By contributing to `@extractus/article-extractor`, you agree that your contributions will be licensed under its [MIT license](LICENSE).
+
+---
diff --git a/LICENSE b/LICENSE
index 487bbe05..6c13cab6 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,6 +1,6 @@
The MIT License (MIT)
-Copyright (c) 2016 Dong Nguyen
+Copyright (c) 2016 Extractus
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
diff --git a/README.md b/README.md
index eb56ac3d..0f077110 100644
--- a/README.md
+++ b/README.md
@@ -1,271 +1,505 @@
-# article-parser
+# @extractus/article-extractor
Extract main article, main image and meta data from URL.
-[](https://badge.fury.io/js/article-parser)
-
-[](https://coveralls.io/github/ndaidong/article-parser)
-
-[](https://standardjs.com)
+[](https://badge.fury.io/js/@extractus%2Farticle-extractor)
+
+
+(This library is derived from [article-parser](https://www.npmjs.com/package/article-parser), which has been renamed.)
## Demo
-- [Give it a try!](https://demos.pwshub.com/article-parser)
-- [Example FaaS](https://extractor.pwshub.com/article/parse?url=https://www.binance.com/en/blog/markets/15-new-years-resolutions-that-will-make-2022-your-best-year-yet-421499824684903249&apikey=demo-orePhhidnWKWPvF8EYKap7z55cN)
+- [Give it a try!](https://extractus-demo.vercel.app/article)
-## Setup
+## Install
-- Node.js
+```bash
+# npm, pnpm, yarn
+npm i @extractus/article-extractor
+
+# bun
+bun add @extractus/article-extractor
+```
- ```bash
- npm i article-parser
+## Usage
- # pnpm
- pnpm i article-parser
+```ts
+import { extract } from '@extractus/article-extractor'
- # yarn
- yarn add article-parser
- ```
+const data = await extract(ARTICLE_URL)
+console.log(data)
+```
-### Usage
+## APIs
-```js
-import { extract } from 'article-parser'
+- [extract()](#extract)
+- [extractFromHtml()](#extractfromhtml)
+- [Transformations](#transformations)
+ - [`transformation` object](#transformation-object)
+ - [.addTransformations](#addtransformationsobject-transformation--array-transformations)
+ - [.removeTransformations](#removetransformationsarray-patterns)
+ - [Priority order](#priority-order)
+- [`sanitize-html`'s options](#sanitize-htmls-options)
+
+---
+
+### `extract()`
+
+Load and extract article data. Returns a Promise object.
-// with CommonJS environments
-// const { extract } = require('article-parser/dist/cjs/article-parser.js')
+#### Syntax
-const url = 'https://www.binance.com/en/blog/markets/15-new-years-resolutions-that-will-make-2022-your-best-year-yet-421499824684903249'
+```ts
+extract(String input)
+extract(String input, Object parserOptions)
+extract(String input, Object parserOptions, Object fetchOptions)
+```
+
+Example:
+
+```js
+import { extract } from '@extractus/article-extractor'
+
+const input = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'
-extract(url).then((article) => {
+// here we use top-level await, assume current platform supports it
+try {
+ const article = await extract(input)
console.log(article)
-}).catch((err) => {
- console.trace(err)
+} catch (err) {
+ console.error(err)
+}
+```
+
+The result - `article` - can be `null` or an object with the following structure:
+
+```ts
+{
+ url: String,
+ title: String,
+ description: String,
+ image: String,
+ author: String,
+ favicon: String,
+ content: String,
+ published: Date String,
+ type: String, // page type
+ source: String, // original publisher
+ links: Array, // list of alternative links
+  ttr: Number, // time to read in seconds, 0 = unknown
+}
+```
+
+Read [string-comparison](https://www.npmjs.com/package/string-comparison) docs for more info about `urlsCompareAlgorithm`.
+
+#### Parameters
+
+##### `input` *required*
+
+A URL string that links to the article, or the HTML content of that web page.
+
+##### `parserOptions` *optional*
+
+Object with all or several of the following properties:
+
+ - `wordsPerMinute`: Number, used to estimate time to read. Default `300`.
+ - `descriptionTruncateLen`: Number, max number of characters for the generated description. Default `210`.
+ - `descriptionLengthThreshold`: Number, min number of characters required for the description. Default `180`.
+ - `contentLengthThreshold`: Number, min number of characters required for the content. Default `200`.
+
+For example:
+
+```js
+import { extract } from '@extractus/article-extractor'
+
+const article = await extract('https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html', {
+ descriptionLengthThreshold: 120,
+ contentLengthThreshold: 500
})
+
+console.log(article)
```
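
The `wordsPerMinute` option feeds the `ttr` field of the result. As a rough sketch of how such an estimate could work (the formula here is an assumption for illustration, not the library's actual implementation):

```javascript
// Hypothetical sketch: estimate time-to-read in seconds from a word count
// and a words-per-minute rate (assumed formula, not the library's code).
const estimateTtr = (text, wordsPerMinute = 300) => {
  const words = text.trim().split(/\s+/).length
  return Math.round((words / wordsPerMinute) * 60)
}

console.log(estimateTtr('word '.repeat(600).trim())) // 120
```

Lowering `wordsPerMinute` yields a larger `ttr` for the same content.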
-##### Note:
+##### `fetchOptions` *optional*
-> Since Node.js v14, ECMAScript modules [have became the official standard format](https://nodejs.org/docs/latest-v14.x/api/esm.html#esm_modules_ecmascript_modules).
-> Just ensure that you are [using module system](https://nodejs.org/api/packages.html#determining-module-system) and enjoy with ES6 import/export syntax.
+`fetchOptions` is an object that can have the following properties:
+- `headers`: to set request headers
+- `proxy`: another endpoint to forward the request to
+- `agent`: an HTTP proxy agent
+- `signal`: AbortController signal or AbortSignal timeout to terminate the request
-## APIs
+For example, you can use this parameter to set request headers when fetching, as below:
-- [.extract(String url | String html)](#extractstring-url--string-html)
-- [.addQueryRules(Array queryRules)](#addqueryrulesarray-queryrules)
-- [Configuration methods](#configuration-methods)
+```js
+import { extract } from '@extractus/article-extractor'
+const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'
+const article = await extract(url, {}, {
+ headers: {
+ 'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'
+ }
+})
-#### extract(String url | String html)
+console.log(article)
+```
-Load and extract article data. Return a Promise object.
+You can also specify a proxy endpoint to load remote content, instead of fetching directly.
-Example:
+For example:
```js
-import { extract } from 'article-parser'
-
-const getArticle = async (url) => {
- try {
- const article = await extract(url)
- return article
- } catch (err) {
- console.trace(err)
- return null
+import { extract } from '@extractus/article-extractor'
+
+const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'
+
+await extract(url, {}, {
+ headers: {
+ 'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'
+ },
+ proxy: {
+ target: 'https://your-secret-proxy.io/loadXml?url=',
+ headers: {
+ 'Proxy-Authorization': 'Bearer YWxhZGRpbjpvcGVuc2VzYW1l...'
+ },
}
-}
+})
+```
+
+Passing requests through a proxy is useful when running `@extractus/article-extractor` in the browser. See [examples/browser-article-parser](examples/browser-article-parser) for a reference example.
+
+For more info about proxy authentication, please refer to [HTTP authentication](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication).
+
+For deeper customization, you can consider using a [Proxy](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Proxy) to replace `fetch` behavior with your own handlers.
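
For illustration, here is one plausible way a `proxy.target` endpoint could be combined with the article URL (the exact concatenation scheme is an assumption; check the library source for the real behavior):

```javascript
// Hypothetical sketch: build the proxied request URL by appending the
// percent-encoded article URL to the proxy's `target` prefix.
const target = 'https://your-secret-proxy.io/loadXml?url='
const articleUrl = 'https://domain.tld/path/to/article?id=42'
const proxiedUrl = target + encodeURIComponent(articleUrl)
console.log(proxiedUrl)
// https://your-secret-proxy.io/loadXml?url=https%3A%2F%2Fdomain.tld%2Fpath%2Fto%2Farticle%3Fid%3D42
```

Encoding the URL is important so that its own query string is not mistaken for parameters of the proxy endpoint.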
-getArticle('https://domain.com/path/to/article')
+Another way to work with a proxy is to use the `agent` option instead of `proxy`, as below:
+
+```js
+import { extract } from '@extractus/article-extractor'
+
+import { HttpsProxyAgent } from 'https-proxy-agent'
+
+const proxy = 'http://abc:RaNdoMpasswORd_country-France@proxy.packetstream.io:31113'
+
+const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'
+
+const article = await extract(url, {}, {
+ agent: new HttpsProxyAgent(proxy),
+})
+console.log('Run article-extractor with proxy:', proxy)
+console.log(article)
```
-If the extraction works well, you should get an `article` object with the structure as below:
+For more info about [https-proxy-agent](https://www.npmjs.com/package/https-proxy-agent), check [its repo](https://github.com/TooTallNate/proxy-agents).
-```json
-{
- "url": URI String,
- "title": String,
- "description": String,
- "image": URI String,
- "author": Person[], // https://schema.org/Person
- "publisher": Organization, // https://schema.org/Organization
- "content": HTML String,
- "published": Date String,
- "source": String, // original publisher
- "links": Array, // list of alternative links
- "ttr": Number, // time to read in second, 0 = unknown
-}
+By default, there is no request timeout. You can use the option `signal` to cancel the request when needed.
+
+The common way is to use an AbortController:
+
+```js
+const controller = new AbortController()
+
+// stop after 5 seconds
+setTimeout(() => {
+ controller.abort()
+}, 5000)
+
+const data = await extract(url, null, {
+ signal: controller.signal,
+})
```
-[Click here](https://extractor.pwshub.com/article/parse?url=https://www.binance.com/en/blog/markets/15-new-years-resolutions-that-will-make-2022-your-best-year-yet-421499824684903249&apikey=demo-orePhhidnWKWPvF8EYKap7z55cN) for seeing an actual result.
+A newer solution is AbortSignal's `timeout()` static method:
+
+```js
+// stop after 5 seconds
+const data = await extract(url, null, {
+ signal: AbortSignal.timeout(5000),
+})
+```
+
+For more info:
+
+- [AbortController constructor](https://developer.mozilla.org/en-US/docs/Web/API/AbortController)
+- [AbortSignal: timeout() static method](https://developer.mozilla.org/en-US/docs/Web/API/AbortSignal/timeout_static)
+
+### `extractFromHtml()`
-#### addQueryRules(Array queryRules)
+Extract article data from an HTML string. Returns a Promise object, the same as the `extract()` method above.
-Add custom rules to get main article from the specific domains.
+#### Syntax
-This can be useful when the default extraction algorithm fails, or when you want to remove some parts of main article content.
+```ts
+extractFromHtml(String html)
+extractFromHtml(String html, String url)
+extractFromHtml(String html, String url, Object parserOptions)
+```
Example:
```js
-import { addQueryRules, extract } from 'article-parser'
+import { extractFromHtml } from '@extractus/article-extractor'
-// extractor doesn't work for you!
-extract('https://bad-website.domain/page/article')
+const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'
-// add some rules for bad-website.domain
-addQueryRules([
- {
- patterns: [
- { hostname: 'bad-website.domain' }
- ],
- selector: '#noop_article_locates_here',
- unwanted: [
- '.advertise-area',
- '.stupid-banner'
- ]
+const res = await fetch(url)
+const html = await res.text()
+
+// you can do whatever you want with this raw HTML here: clean it up, remove ad banners, etc.
+// just ensure an HTML string is returned
+
+const article = await extractFromHtml(html, url)
+console.log(article)
+```
+
+#### Parameters
+
+##### `html` *required*
+
+HTML string which contains the article you want to extract.
+
+##### `url` *optional*
+
+URL string that indicates the source of that HTML content.
+`article-extractor` may use this info to handle internal/relative links.
+
+##### `parserOptions` *optional*
+
+See [parserOptions](#parseroptions-optional) above.
+
+
+---
+
+### Transformations
+
+Sometimes the default extraction algorithm may not work well. That is when we need transformations.
+
+By adding functions before and after the main extraction step, we can often achieve a better result.
+
+There are two methods for working with transformations:
+
+- `addTransformations(Object transformation | Array transformations)`
+- `removeTransformations(Array patterns)`
+
+First, let's look at the `transformation` object.
+
+#### `transformation` object
+
+In `@extractus/article-extractor`, `transformation` is an object with the following properties:
+
+- `patterns`: required, a list of regexps to match the URLs
+- `pre`: optional, a function to process raw HTML
+- `post`: optional, a function to process extracted article
+
+Basically, the meaning of `transformation` can be interpreted like this:
+
+> for the URLs which match these `patterns`
+> run the `pre` function to normalize the HTML content
+> then extract the main article content from the normalized HTML, and if successful
+> run the `post` function to normalize the extracted article content
+
+
+
+Here is an example transformation:
+
+```ts
+{
+ patterns: [
+    /([\w]+\.)?domain\.tld\/*/,
+    /domain\.tld\/articles\/*/
+ ],
+ pre: (document) => {
+ // remove all .advertise-area and its siblings from raw HTML content
+ document.querySelectorAll('.advertise-area').forEach((element) => {
+ if (element.nodeName === 'DIV') {
+ while (element.nextSibling) {
+ element.parentNode.removeChild(element.nextSibling)
+ }
+ element.parentNode.removeChild(element)
+ }
+ })
+ return document
+ },
+ post: (document) => {
+ // with extracted article, replace all h4 tags with h2
+ document.querySelectorAll('h4').forEach((element) => {
+ const h2Element = document.createElement('h2')
+ h2Element.innerHTML = element.innerHTML
+ element.parentNode.replaceChild(h2Element, element)
+ })
+ // change small sized images to original version
+ document.querySelectorAll('img').forEach((element) => {
+ const src = element.getAttribute('src')
+ if (src.includes('domain.tld/pics/150x120/')) {
+ const fullSrc = src.replace('/pics/150x120/', '/pics/original/')
+ element.setAttribute('src', fullSrc)
+ }
+ })
+ return document
}
-])
+}
+```
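
To see which transformation applies to a given URL, the `patterns` regexps are matched against it. A minimal sketch (assuming a plain `RegExp.test()` check, which may differ from the library's internals):

```javascript
// Hypothetical sketch: a transformation applies when any of its
// `patterns` regexps matches the article URL.
const patterns = [
  /([\w]+\.)?domain\.tld\/*/,
  /domain\.tld\/articles\/*/
]

const url = 'https://blog.domain.tld/articles/hello-world'
const matched = patterns.some((re) => re.test(url))
console.log(matched) // true
```

Note that the dots in the hostnames are escaped (`\.`), so `domain.tld` does not accidentally match strings like `domainXtld`.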
-// extractor will try to find article at `#noop_article_locates_here`
+- To write better transformation logic, please refer to [linkedom](https://github.com/WebReflection/linkedom) and [Document Object](https://developer.mozilla.org/en-US/docs/Web/API/Document).
-// call it again, hopefully it works for you now :)
-extract('https://bad-website.domain/page/article')
-````
+#### `addTransformations(Object transformation | Array transformations)`
-While adding rules, you can specify a `transform()` function to fine-tune article content more thoroughly.
+Add a single transformation or a list of transformations. For example:
-Example rule with transformation:
+```ts
+import { addTransformations } from '@extractus/article-extractor'
-```js
-import { addQueryRules } from 'article-parser'
+addTransformations({
+ patterns: [
+    /([\w]+\.)?abc\.tld\/*/
+ ],
+ pre: (document) => {
+ // do something with document
+ return document
+ },
+ post: (document) => {
+ // do something with document
+ return document
+ }
+})
-addQueryRules([
+addTransformations([
+ {
+ patterns: [
+      /([\w]+\.)?def\.tld\/*/
+ ],
+ pre: (document) => {
+ // do something with document
+ return document
+ },
+ post: (document) => {
+ // do something with document
+ return document
+ }
+ },
{
patterns: [
- { hostname: 'bad-website.domain' }
+      /([\w]+\.)?xyz\.tld\/*/
],
- selector: '#article_id_here',
- transform: (document) => {
- // document is parsed by https://github.com/WebReflection/linkedom which is almost identical to the browser Document object.
- // for example, here we replace all
diff --git a/test-data/html-article-no-source.html b/test-data/html-article-no-source.html
index c040fa2e..12237080 100644
--- a/test-data/html-article-no-source.html
+++ b/test-data/html-article-no-source.html
@@ -17,6 +17,6 @@
To be more specific, those turtles are nothing more than fishes. A grape can hardly be considered a shrewd goldfish without also being an owl. Some unbiased goats are thought of simply as tangerines.
Shouting with happiness, a courageous elephant is a duck of the mind? Some posit the upbeat hippopotamus to be less than enchanting. It's an undeniable fact, really; authors often misinterpret the grape as an endurable rabbit, when in actuality it feels more like a tough dolphin. We know that a cherry can hardly be considered a responsible apricot without also being a nectarine.
- s
+