diff --git a/.github/workflows/ci-test.yml b/.github/workflows/ci-test.yml index e61fc73b..60c2913d 100644 --- a/.github/workflows/ci-test.yml +++ b/.github/workflows/ci-test.yml @@ -8,40 +8,33 @@ on: [push, pull_request] jobs: test: - runs-on: ubuntu-20.04 + runs-on: ubuntu-latest strategy: matrix: - node_version: [14.x, 15.x, 16.x, 17.x, 18.x] + node_version: [20.x, 22.x, 24.x] steps: - - uses: actions/checkout@v2 + - uses: actions/checkout@v4 - name: setup Node.js v${{ matrix.node_version }} - uses: actions/setup-node@v2 + uses: actions/setup-node@v4 with: node-version: ${{ matrix.node_version }} - name: run npm scripts + env: + PROXY_SERVER: ${{ secrets.PROXY_SERVER }} run: | - npm i -g standard npm install npm run lint npm run build --if-present npm run test - - name: sync to coveralls - uses: coverallsapp/github-action@v1.1.2 - with: - github-token: ${{ secrets.GITHUB_TOKEN }} - - name: cache node modules - uses: actions/cache@v2 + uses: actions/cache@v4 with: path: ~/.npm key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }} restore-keys: | ${{ runner.os }}-node- - - - diff --git a/.github/workflows/codeql-analysis.yml b/.github/workflows/codeql-analysis.yml index 2124bd64..a77d776a 100644 --- a/.github/workflows/codeql-analysis.yml +++ b/.github/workflows/codeql-analysis.yml @@ -38,7 +38,7 @@ jobs: steps: - name: Checkout repository - uses: actions/checkout@v3 + uses: actions/checkout@v4 # Initializes the CodeQL tools for scanning. 
- name: Initialize CodeQL diff --git a/.gitignore b/.gitignore index 33f33bbe..5b416c34 100644 --- a/.gitignore +++ b/.gitignore @@ -15,5 +15,8 @@ coverage yarn.lock coverage.lcov pnpm-lock.yaml +lcov.info -dist/ +deno.lock + +evaluation diff --git a/.npmignore b/.npmignore index 68aa0872..f2f3c65a 100644 --- a/.npmignore +++ b/.npmignore @@ -1,18 +1,7 @@ -node_modules/ -src/ -test-data/ -.idea/ -coverage/ -.vscode/ - -.DS_Store -yarn.lock -coverage.lcov +node_modules +coverage +.github pnpm-lock.yaml - -*.js -*.cjs -*.js.map - -!dist/**/*.js -!index.js +examples +test-data +lcov.info diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 00000000..8cfca770 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,71 @@ +# Contributing to `@extractus/article-extractor` + +Glad to see you here. + +Collaborations and pull requests are always welcome, though larger proposals should be discussed first. + +As an OSS project, it's best to follow the Unix philosophy: "do one thing and do it well". + +## Third-party libraries + +Please avoid using libraries other than those available in the standard library, unless necessary. + +This library needs to be simple and flexible to run on multiple platforms such as Deno, Bun, or even the browser. + + +## Coding convention + +Make sure your code lints before opening a pull request. + + +```bash +cd article-extractor + +# check coding convention issues +npm run lint + +# auto fix coding convention issues +npm run lint:fix +``` + +*When you run `npm test`, the linting process will be triggered first.* + + +## Testing + +Be sure to run the unit test suite before opening a pull request. An example test run is shown below. + +```bash +cd article-extractor +npm test +``` + +![article-extractor unit test](https://i.imgur.com/TbRCUSS.png?110222) + +If test coverage has decreased, please check the test scripts and try to improve this number. + + +## Documentation + +If you've changed APIs, please update the README and [the examples](examples).
+ + +## Clean commit histories + +When you open a pull request, please ensure the commit history is clean. +Squash the commits into logical blocks, perhaps a single commit if that makes sense. + +What you want to avoid is commits such as "WIP" and "fix test" in the history. +This is so we keep the history on master clean and straightforward. + +For people new to git, please refer to the following guides: + +- [Writing good commit messages](https://github.com/erlang/otp/wiki/writing-good-commit-messages) +- [Commit Message Guidelines](https://gist.github.com/robertpainsi/b632364184e70900af4ab688decf6f53) + + +## License + +By contributing to `@extractus/article-extractor`, you agree that your contributions will be licensed under its [MIT license](LICENSE). + +--- diff --git a/LICENSE b/LICENSE index 487bbe05..6c13cab6 100644 --- a/LICENSE +++ b/LICENSE @@ -1,6 +1,6 @@ The MIT License (MIT) -Copyright (c) 2016 Dong Nguyen +Copyright (c) 2016 Extractus Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal diff --git a/README.md b/README.md index eb56ac3d..0f077110 100644 --- a/README.md +++ b/README.md @@ -1,271 +1,505 @@ -# article-parser +# @extractus/article-extractor Extract main article, main image and meta data from URL.
-[![NPM](https://badge.fury.io/js/article-parser.svg)](https://badge.fury.io/js/article-parser) -![CI test](https://github.com/ndaidong/article-parser/workflows/ci-test/badge.svg) -[![Coverage Status](https://coveralls.io/repos/github/ndaidong/article-parser/badge.svg)](https://coveralls.io/github/ndaidong/article-parser) -![CodeQL](https://github.com/ndaidong/article-parser/workflows/CodeQL/badge.svg) -[![JavaScript Style Guide](https://img.shields.io/badge/code_style-standard-brightgreen.svg)](https://standardjs.com) +[![npm version](https://badge.fury.io/js/@extractus%2Farticle-extractor.svg)](https://badge.fury.io/js/@extractus%2Farticle-extractor) +![CodeQL](https://github.com/extractus/article-extractor/workflows/CodeQL/badge.svg) +![CI test](https://github.com/extractus/article-extractor/workflows/ci-test/badge.svg) +(This library is derived from [article-parser](https://www.npmjs.com/package/article-parser), which has been renamed.) ## Demo -- [Give it a try!](https://demos.pwshub.com/article-parser) -- [Example FaaS](https://extractor.pwshub.com/article/parse?url=https://www.binance.com/en/blog/markets/15-new-years-resolutions-that-will-make-2022-your-best-year-yet-421499824684903249&apikey=demo-orePhhidnWKWPvF8EYKap7z55cN) +- [Give it a try!](https://extractus-demo.vercel.app/article) -## Setup +## Install -- Node.js +```bash +# npm, pnpm, yarn +npm i @extractus/article-extractor + +# bun +bun add @extractus/article-extractor +``` - ```bash - npm i article-parser +## Usage - # pnpm - pnpm i article-parser +```ts +import { extract } from '@extractus/article-extractor' - # yarn - yarn add article-parser - ``` +const data = await extract(ARTICLE_URL) +console.log(data) +``` -### Usage +## APIs -```js -import { extract } from 'article-parser' +- [extract()](#extract) +- [extractFromHtml()](#extractfromhtml) +- [Transformations](#transformations) + - [`transformation` object](#transformation-object) + - 
[.addTransformations](#addtransformationsobject-transformation--array-transformations) + - [.removeTransformations](#removetransformationsarray-patterns) + - [Priority order](#priority-order) +- [`sanitize-html`'s options](#sanitize-htmls-options) + +--- + +### `extract()` + +Loads and extracts article data. Returns a Promise object. -// with CommonJS environments -// const { extract } = require('article-parser/dist/cjs/article-parser.js') +#### Syntax -const url = 'https://www.binance.com/en/blog/markets/15-new-years-resolutions-that-will-make-2022-your-best-year-yet-421499824684903249' +```ts +extract(String input) +extract(String input, Object parserOptions) +extract(String input, Object parserOptions, Object fetchOptions) +``` + +Example: + +```js +import { extract } from '@extractus/article-extractor' + +const input = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html' -extract(url).then((article) => { +// here we use top-level await, assuming the current platform supports it +try { + const article = await extract(input) console.log(article) -}).catch((err) => { - console.trace(err) +} catch (err) { + console.error(err) +} +``` + +The result - `article` - can be `null` or an object with the following structure: + +```ts +{ + url: String, + title: String, + description: String, + image: String, + author: String, + favicon: String, + content: String, + published: Date String, + type: String, // page type + source: String, // original publisher + links: Array, // list of alternative links + ttr: Number, // time to read in seconds, 0 = unknown +} +``` + +#### Parameters + +##### `input` *required* + +URL string that links to the article, or the HTML content of that web page.
+ +##### `parserOptions` *optional* + +Object with all or several of the following properties: + + - `wordsPerMinute`: Number, to estimate time to read. Default `300`. + - `descriptionTruncateLen`: Number, max num of chars generated for description. Default `210`. + - `descriptionLengthThreshold`: Number, min num of chars required for description. Default `180`. + - `contentLengthThreshold`: Number, min num of chars required for content. Default `200`. + +For example: + +```js +import { extract } from '@extractus/article-extractor' + +const article = await extract('https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html', { + descriptionLengthThreshold: 120, + contentLengthThreshold: 500 }) + +console.log(article) ``` -##### Note: +##### `fetchOptions` *optional* -> Since Node.js v14, ECMAScript modules [have became the official standard format](https://nodejs.org/docs/latest-v14.x/api/esm.html#esm_modules_ecmascript_modules). -> Just ensure that you are [using module system](https://nodejs.org/api/packages.html#determining-module-system) and enjoy with ES6 import/export syntax. 
+`fetchOptions` is an object that can have the following properties: +- `headers`: to set request headers +- `proxy`: another endpoint to forward the request to +- `agent`: an HTTP proxy agent +- `signal`: AbortController signal or AbortSignal timeout to terminate the request -## APIs For example, you can use this parameter to set request headers for fetching, as below: -- [.extract(String url | String html)](#extractstring-url--string-html) -- [.addQueryRules(Array queryRules)](#addqueryrulesarray-queryrules) -- [Configuration methods](#configuration-methods) +```js +import { extract } from '@extractus/article-extractor' +const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html' +const article = await extract(url, {}, { + headers: { + 'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1' + } +}) -#### extract(String url | String html) +console.log(article) ``` -Load and extract article data. Return a Promise object. +You can also specify a proxy endpoint to load remote content, instead of fetching directly. -Example: +For example: ```js -import { extract } from 'article-parser' - -const getArticle = async (url) => { - try { - const article = await extract(url) - return article - } catch (err) { - console.trace(err) - return null +import { extract } from '@extractus/article-extractor' + +const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html' + +await extract(url, {}, { + headers: { + 'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1' + }, + proxy: { + target: 'https://your-secret-proxy.io/loadXml?url=', + headers: { + 'Proxy-Authorization': 'Bearer YWxhZGRpbjpvcGVuc2VzYW1l...' + }, + } -} +}) +``` + +Passing requests through a proxy is useful when running `@extractus/article-extractor` in the browser. View [examples/browser-article-parser](examples/browser-article-parser) as a reference example.
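The `proxy.target` value above ends with `url=`, which suggests the article URL is appended to the endpoint as a query-string value. A hypothetical sketch of that forwarding step (the helper name and the exact encoding behavior are assumptions for illustration, not the library's documented internals):

```javascript
// Hypothetical helper showing how a proxy endpoint such as
// 'https://your-secret-proxy.io/loadXml?url=' could receive the article URL:
// the URL is percent-encoded and appended to the target.
const buildProxyUrl = (target, articleUrl) => target + encodeURIComponent(articleUrl)

const proxied = buildProxyUrl(
  'https://your-secret-proxy.io/loadXml?url=',
  'https://example.com/page?id=1'
)
// proxied: 'https://your-secret-proxy.io/loadXml?url=https%3A%2F%2Fexample.com%2Fpage%3Fid%3D1'
```

The proxy would then fetch the decoded URL itself and return the HTML, which also sidesteps CORS restrictions in browsers.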
+ +For more info about proxy authentication, please refer to [HTTP authentication](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication) + +For deeper customization, you can consider using [Proxy](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Proxy) to replace `fetch` behaviors with your own handlers. -getArticle('https://domain.com/path/to/article') +Another way to work with a proxy is to use the `agent` option instead of `proxy`, as below: + +```js +import { extract } from '@extractus/article-extractor' + +import { HttpsProxyAgent } from 'https-proxy-agent' + +const proxy = 'http://abc:RaNdoMpasswORd_country-France@proxy.packetstream.io:31113' + +const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html' + +const article = await extract(url, {}, { + agent: new HttpsProxyAgent(proxy), +}) +console.log('Run article-extractor with proxy:', proxy) +console.log(article) ``` -If the extraction works well, you should get an `article` object with the structure as below: +For more info about [https-proxy-agent](https://www.npmjs.com/package/https-proxy-agent), check [its repo](https://github.com/TooTallNate/proxy-agents). -```json -{ - "url": URI String, - "title": String, - "description": String, - "image": URI String, - "author": Person[], // https://schema.org/Person - "publisher": Organization, // https://schema.org/Organization - "content": HTML String, - "published": Date String, - "source": String, // original publisher - "links": Array, // list of alternative links - "ttr": Number, // time to read in second, 0 = unknown -} +By default, there is no request timeout. You can use the option `signal` to cancel the request at the right time.
+ +The common way is to use AbortController: + +```js +const controller = new AbortController() + +// stop after 5 seconds +setTimeout(() => { + controller.abort() +}, 5000) + +const data = await extract(url, null, { + signal: controller.signal, +}) ``` -[Click here](https://extractor.pwshub.com/article/parse?url=https://www.binance.com/en/blog/markets/15-new-years-resolutions-that-will-make-2022-your-best-year-yet-421499824684903249&apikey=demo-orePhhidnWKWPvF8EYKap7z55cN) for seeing an actual result. +A newer solution is AbortSignal's `timeout()` static method: + +```js +// stop after 5 seconds +const data = await extract(url, null, { + signal: AbortSignal.timeout(5000), +}) +``` + +For more info: + +- [AbortController constructor](https://developer.mozilla.org/en-US/docs/Web/API/AbortController) +- [AbortSignal: timeout() static method](https://developer.mozilla.org/en-US/docs/Web/API/AbortSignal/timeout_static) + +### `extractFromHtml()` -#### addQueryRules(Array queryRules) +Extracts article data from an HTML string. Returns a Promise object, the same as the `extract()` method above. -Add custom rules to get main article from the specific domains. +#### Syntax -This can be useful when the default extraction algorithm fails, or when you want to remove some parts of main article content. +```ts +extractFromHtml(String html) +extractFromHtml(String html, String url) +extractFromHtml(String html, String url, Object parserOptions) +``` Example: ```js -import { addQueryRules, extract } from 'article-parser' +import { extractFromHtml } from '@extractus/article-extractor' -// extractor doesn't work for you!
-extract('https://bad-website.domain/page/article') +const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html' -// add some rules for bad-website.domain -addQueryRules([ - { - patterns: [ - { hostname: 'bad-website.domain' } - ], - selector: '#noop_article_locates_here', - unwanted: [ - '.advertise-area', - '.stupid-banner' - ] +const res = await fetch(url) +const html = await res.text() + +// you can do whatever you want with this raw html here: clean it up, remove ad banners, etc. +// just ensure an HTML string is returned + +const article = await extractFromHtml(html, url) +console.log(article) +``` + +#### Parameters + +##### `html` *required* + +HTML string which contains the article you want to extract. + +##### `url` *optional* + +URL string that indicates the source of that HTML content. +`article-extractor` may use this info to handle internal/relative links. + +##### `parserOptions` *optional* + +See [parserOptions](#parseroptions-optional) above. + + +--- + +### Transformations + +Sometimes the default extraction algorithm may not work well. That is when we need transformations. + +By adding some functions before and after the main extraction step, we aim to produce as good a result as possible. + +There are two methods for working with transformations: + +- `addTransformations(Object transformation | Array transformations)` +- `removeTransformations(Array patterns)` + +First, let's talk about the `transformation` object. + +#### `transformation` object + +In `@extractus/article-extractor`, `transformation` is an object with the following properties: + +- `patterns`: required, a list of regexps to match the URLs +- `pre`: optional, a function to process raw HTML +- `post`: optional, a function to process extracted article + +Basically, the meaning of `transformation` can be interpreted like this: + +> with the URLs which match these `patterns`
+> let's run the `pre` function to normalize the HTML content
+> then extract the main article content from the normalized HTML, and if successful
+> let's run the `post` function to normalize the extracted article content + +![article-extractor extraction process](https://res.cloudinary.com/pwshub/image/upload/v1657336822/documentation/article-parser_extraction_process.png) + +Here is an example transformation: + +```ts +{ + patterns: [ + /([\w]+.)?domain.tld\/*/, + /domain.tld\/articles\/*/ + ], + pre: (document) => { + // remove all .advertise-area and its siblings from raw HTML content + document.querySelectorAll('.advertise-area').forEach((element) => { + if (element.nodeName === 'DIV') { + while (element.nextSibling) { + element.parentNode.removeChild(element.nextSibling) + } + element.parentNode.removeChild(element) + } + }) + return document + }, + post: (document) => { + // with extracted article, replace all h4 tags with h2 + document.querySelectorAll('h4').forEach((element) => { + const h2Element = document.createElement('h2') + h2Element.innerHTML = element.innerHTML + element.parentNode.replaceChild(h2Element, element) + }) + // change small sized images to original version + document.querySelectorAll('img').forEach((element) => { + const src = element.getAttribute('src') + if (src.includes('domain.tld/pics/150x120/')) { + const fullSrc = src.replace('/pics/150x120/', '/pics/original/') + element.setAttribute('src', fullSrc) + } + }) + return document } -]) +} +``` -// extractor will try to find article at `#noop_article_locates_here` +- To write better transformation logic, please refer to [linkedom](https://github.com/WebReflection/linkedom) and the [Document Object](https://developer.mozilla.org/en-US/docs/Web/API/Document). -// call it again, hopefully it works for you now :) -extract('https://bad-website.domain/page/article') -```` +#### `addTransformations(Object transformation | Array transformations)` -While adding rules, you can specify a `transform()` function to fine-tune article content more thoroughly. +Add a single transformation or a list of transformations.
For example: -Example rule with transformation: +```ts +import { addTransformations } from '@extractus/article-extractor' -```js -import { addQueryRules } from 'article-parser' +addTransformations({ + patterns: [ + /([\w]+.)?abc.tld\/*/ + ], + pre: (document) => { + // do something with document + return document + }, + post: (document) => { + // do something with document + return document + } +}) -addQueryRules([ +addTransformations([ + { + patterns: [ + /([\w]+.)?def.tld\/*/ + ], + pre: (document) => { + // do something with document + return document + }, + post: (document) => { + // do something with document + return document + } + }, { patterns: [ - { hostname: 'bad-website.domain' } + /([\w]+.)?xyz.tld\/*/ ], - selector: '#article_id_here', - transform: (document) => { - // document is parsed by https://github.com/WebReflection/linkedom which is almost identical to the browser Document object. - // for example, here we replace all
`h1` with `b` - document.querySelectorAll('h1').forEach(node => { - const newNode = document.createElement('b') - newNode.innerHTML = node.innerHTML - node.parentNode.replaceChild(newNode, node) - }) - // at the end, you mush return document + pre: (document) => { + // do something with document + return document + }, + post: (document) => { + // do something with document + return document + } + } ]) -``` +``` -Please refer [MDN](https://developer.mozilla.org/zh-CN/docs/Web/API/Document) for more info. +Transformations without `patterns` will be ignored. +#### `removeTransformations(Array patterns)` -#### Configuration methods +Removes transformations that match the specified patterns. -In addition, this lib provides some methods to customize default settings. Don't touch them unless you have reason to do that. +For example, we can remove all added transformations above: -- getParserOptions() -- setParserOptions(Object parserOptions) -- getRequestOptions() -- setRequestOptions(Object requestOptions) -- getSanitizeHtmlOptions() -- setSanitizeHtmlOptions(Object sanitizeHtmlOptions) +```js +import { removeTransformations } from '@extractus/article-extractor' -Here are default properties/values: +removeTransformations([ + /([\w]+.)?abc.tld\/*/, + /([\w]+.)?def.tld\/*/, + /([\w]+.)?xyz.tld\/*/ +]) +``` +Calling `removeTransformations()` without parameters will remove all current transformations. -#### Object `parserOptions`: +#### Priority order -```js -{ - wordsPerMinute: 300, // to estimate "time to read" - urlsCompareAlgorithm: 'levenshtein', // to find the best url from list - descriptionLengthThreshold: 40, // min num of chars required for description - descriptionTruncateLen: 156, // max num of chars generated for description - contentLengthThreshold: 200 // content must have at least 200 chars -} +While processing an article, more than one transformation can be applied.
+ +Suppose that we have the following transformations: + +```ts +[ + { + patterns: [ + /http(s?):\/\/google.com\/*/, + /http(s?):\/\/goo.gl\/*/ + ], + pre: function_one, + post: function_two + }, + { + patterns: [ + /http(s?):\/\/goo.gl\/*/, + /http(s?):\/\/google.inc\/*/ + ], + pre: function_three, + post: function_four + } +] ``` -Read [string-comparison](https://www.npmjs.com/package/string-comparison) docs for more info about `urlsCompareAlgorithm`. +As you can see, an article from `goo.gl` certainly matches both of them. +In this scenario, `@extractus/article-extractor` will execute both transformations, one by one: -#### Object `requestOptions`: +`function_one` -> `function_three` -> extraction -> `function_two` -> `function_four` -```js -{ - headers: { - 'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0', - accept: 'text/html; charset=utf-8' - }, - responseType: 'text', - responseEncoding: 'utf8', - timeout: 6e4, - maxRedirects: 3 -} -``` Read [axios' request config](https://axios-http.com/docs/req_config) for more info.
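The priority order above can be modeled with plain functions. This is a hypothetical sketch of the pipeline, not the library's internal code, and it uses strings as stand-ins for Document objects: collect every transformation whose pattern matches the URL, apply the `pre` hooks in registration order, extract, then apply the `post` hooks in the same order:

```javascript
// Hypothetical model of how matching transformations are chained around extraction.
const transformations = [
  { patterns: [/goo\.gl\//], pre: (s) => s + ' +pre1', post: (s) => s + ' +post1' },
  { patterns: [/google\.inc\//], pre: (s) => s + ' +pre2', post: (s) => s + ' +post2' },
]

const runPipeline = (url, input) => {
  // keep only the transformations whose patterns match this URL
  const matched = transformations.filter((t) => t.patterns.some((p) => p.test(url)))
  // run every matching `pre` hook, in registration order
  const preprocessed = matched.reduce((doc, t) => (t.pre ? t.pre(doc) : doc), input)
  // the main extraction step would happen here
  const extracted = '[extracted]' + preprocessed
  // then every matching `post` hook, in registration order
  return matched.reduce((doc, t) => (t.post ? t.post(doc) : doc), extracted)
}

// only the first transformation matches this URL, so only its hooks run
console.log(runPipeline('https://goo.gl/abc', 'doc'))
```

A URL matching both patterns would go through `pre1`, `pre2`, extraction, `post1`, `post2`, mirroring the `function_one` -> `function_three` -> extraction -> `function_two` -> `function_four` order described above.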
+--- -#### Object `sanitizeHtmlOptions`: +### `sanitize-html`'s options -```js -{ - allowedTags: [ - 'h1', 'h2', 'h3', 'h4', 'h5', - 'u', 'b', 'i', 'em', 'strong', 'small', 'sup', 'sub', - 'div', 'span', 'p', 'article', 'blockquote', 'section', - 'details', 'summary', - 'pre', 'code', - 'ul', 'ol', 'li', 'dd', 'dl', - 'table', 'th', 'tr', 'td', 'thead', 'tbody', 'tfood', - 'fieldset', 'legend', - 'figure', 'figcaption', 'img', 'picture', - 'video', 'audio', 'source', - 'iframe', - 'progress', - 'br', 'p', 'hr', - 'label', - 'abbr', - 'a', - 'svg' - ], - allowedAttributes: { - a: ['href', 'target', 'title'], - abbr: ['title'], - progress: ['value', 'max'], - img: ['src', 'srcset', 'alt', 'width', 'height', 'style', 'title'], - picture: ['media', 'srcset'], - video: ['controls', 'width', 'height', 'autoplay', 'muted'], - audio: ['controls'], - source: ['src', 'srcset', 'data-srcset', 'type', 'media', 'sizes'], - iframe: ['src', 'frameborder', 'height', 'width', 'scrolling'], - svg: ['width', 'height'] - }, - allowedIframeDomains: ['youtube.com', 'vimeo.com'] -} -``` +`@extractus/article-extractor` uses [sanitize-html](https://github.com/apostrophecms/sanitize-html) to make a clean sweep of HTML content. -Read [sanitize-html](https://www.npmjs.com/package/sanitize-html#what-are-the-default-options) docs for more info. +Here is the [default options](src/config.js#L5) +Depending on the needs of your content system, you might want to gather some HTML tags/attributes, while ignoring others. + +There are 2 methods to access and modify these options in `@extractus/article-extractor`. + +- `getSanitizeHtmlOptions()` +- `setSanitizeHtmlOptions(Object sanitizeHtmlOptions)` + +Read [sanitize-html](https://github.com/apostrophecms/sanitize-html#default-options) docs for more info. 
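The two methods are typically used together, read-modify-write style, so that you extend the current options instead of replacing them wholesale. A self-contained sketch of that pattern with stand-in implementations (the option values below are illustrative only, not the library's actual defaults):

```javascript
// Stand-ins for the real getSanitizeHtmlOptions/setSanitizeHtmlOptions pair,
// just to illustrate the read-modify-write pattern described above.
let sanitizeHtmlOptions = {
  allowedTags: ['p', 'a', 'img'], // illustrative values, not the real defaults
  allowedAttributes: { a: ['href'] },
}
const getSanitizeHtmlOptions = () => sanitizeHtmlOptions
const setSanitizeHtmlOptions = (options) => { sanitizeHtmlOptions = options }

// read the current options, extend them, then write them back
const options = getSanitizeHtmlOptions()
setSanitizeHtmlOptions({
  ...options,
  allowedTags: [...options.allowedTags, 'mark'], // additionally allow <mark>
})
```

Spreading the previous options first keeps the rest of the sanitize-html configuration intact while adding just the tag you need.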
+ +--- ## Test ```bash -git clone https://github.com/ndaidong/article-parser.git -cd article-parser -npm install -npm test +git clone https://github.com/extractus/article-extractor.git +cd article-extractor +pnpm i +pnpm test +``` + +![article-extractor-test.png](https://i.imgur.com/TbRCUSS.png?110222) + + -# quick evaluation -npm run eval {URL_TO_PARSE_ARTICLE} +## Quick evaluation + +```bash +git clone https://github.com/extractus/article-extractor.git +cd article-extractor +pnpm i +pnpm eval {URL_TO_PARSE_ARTICLE} ``` ## License + The MIT License (MIT) +## Support the project + +If you find value in this open source project, you can support it in the following ways: + +- Give it a star ⭐ +- Buy me a coffee: https://paypal.me/ndaidong 🍵 +- Subscribe to the [Article Extractor service](https://rapidapi.com/pwshub-pwshub-default/api/article-extractor2) on RapidAPI 😉 + +Thank you. + +--- diff --git a/SECURITY.md b/SECURITY.md new file mode 100644 index 00000000..921d17aa --- /dev/null +++ b/SECURITY.md @@ -0,0 +1,17 @@ +# Security Policy + +## Supported Versions + +Due to resource limitations, only the latest stable minor release receives bugfixes (including security ones). + +So e.g. if the latest stable version is 7.2.5, then the 7.2.x line will still get security fixes, but older versions (like 7.1.x) won't get any fixes. + +The description above is a general rule and may be altered on a case-by-case basis. + +## Reporting a Vulnerability + +You can report low severity vulnerabilities as GitHub issues. + +More severe vulnerabilities should be reported by email to extractus@pwshub.com.
+ +--- diff --git a/build.js b/build.js index d68a7af3..ae29d587 100644 --- a/build.js +++ b/build.js @@ -13,7 +13,7 @@ const pkgNameFlattened = pkgName.replace('/', '-').replace(/[^a-zA-Z-]/g, '') rmSync('dist', { force: true, - recursive: true + recursive: true, }) mkdirSync('dist') @@ -21,7 +21,7 @@ const buildTime = (new Date()).toISOString() const comment = [ `// ${pkgNameFlattened}@${pkg.version}, by ${pkg.author}`, `built with esbuild at ${buildTime}`, - `published under ${pkg.license} license` + `published under ${pkg.license} license`, ].join(' - ') /** @@ -35,7 +35,7 @@ const baseOpt = { minify: true, write: true, sourcemap: 'external', - external: ['canvas'] + external: ['canvas'], } /** @@ -48,15 +48,15 @@ const cjsVersion = { mainFields: ['main'], outfile: `dist/cjs/${pkgNameFlattened}.js`, banner: { - js: comment - } + js: comment, + }, } buildSync(cjsVersion) const cjspkg = { name: pkgName + '-cjs', version: pkg.version, - main: `./${pkgNameFlattened}.js` + main: `./${pkgNameFlattened}.js`, } writeFileSync( 'dist/cjs/package.json', @@ -73,7 +73,7 @@ const browserVersion = { format: 'esm', outfile: `dist/${pkgNameFlattened}.browser.js`, banner: { - js: comment - } + js: comment, + }, } buildSync(browserVersion) diff --git a/eslint.config.js b/eslint.config.js new file mode 100644 index 00000000..6701eb1e --- /dev/null +++ b/eslint.config.js @@ -0,0 +1,127 @@ +// eslint.config.js + +import eslintjs from '@eslint/js' +import globals from 'globals' + +export default [ + eslintjs.configs.recommended, + { + languageOptions: { + ecmaVersion: 'latest', + sourceType: 'module', + globals: { + ...globals.node, + ...globals.browser, + Intl: 'readonly', + }, + }, + ignores: [ + 'node_modules', + 'storage', + ], + rules: { + 'arrow-spacing': ['error', { 'before': true, 'after': true }], + 'block-spacing': ['error', 'always'], + 'brace-style': ['error', '1tbs', { 'allowSingleLine': true }], + 'camelcase': ['error', { + 'allow': ['^UNSAFE_'], + 'properties': 
'never', + 'ignoreGlobals': true, + }], + 'comma-dangle': ['error', { + 'arrays': 'always-multiline', + 'objects': 'always-multiline', + 'imports': 'never', + 'exports': 'never', + 'functions': 'never', + }], + 'comma-spacing': ['error', { 'before': false, 'after': true }], + 'eol-last': 'error', + 'eqeqeq': ['error', 'always', { 'null': 'ignore' }], + 'func-call-spacing': ['error', 'never'], + 'indent': [ + 'error', + 2, + { + 'MemberExpression': 1, + 'FunctionDeclaration': { + 'body': 1, + 'parameters': 2, + }, + 'SwitchCase': 1, + 'ignoredNodes': ['TemplateLiteral > *'], + }, + ], + 'key-spacing': ['error', { 'beforeColon': false, 'afterColon': true }], + 'keyword-spacing': ['error', { 'before': true, 'after': true }], + 'lines-between-class-members': ['error', 'always', { 'exceptAfterSingleLine': true }], + 'max-len': [ + 'error', + { + 'code': 120, + 'ignoreTrailingComments': true, + 'ignoreComments': true, + 'ignoreUrls': true, + }, + ], + 'max-lines': [ + 'error', + { + 'max': 360, + 'skipBlankLines': true, + 'skipComments': false, + }, + ], + 'max-lines-per-function': [ + 'error', + { + 'max': 180, + 'skipBlankLines': true, + }, + ], + 'max-params': ['error', 3], + 'no-array-constructor': 'error', + 'no-mixed-spaces-and-tabs': 'error', + 'no-multi-spaces': 'error', + 'no-multi-str': 'error', + 'no-multiple-empty-lines': [ + 'error', + { + 'max': 1, + 'maxEOF': 0, + }, + ], + 'no-restricted-syntax': [ + 'error', + 'WithStatement', + 'BinaryExpression[operator=\'in\']', + ], + 'no-trailing-spaces': 'error', + 'no-use-before-define': [ + 'error', + { + 'functions': true, + 'classes': true, + 'variables': false, + }, + ], + 'no-var': 'warn', + 'object-curly-spacing': ['error', 'always'], + 'padded-blocks': [ + 'error', + { + 'blocks': 'never', + 'switches': 'never', + 'classes': 'never', + }, + ], + 'quotes': ['error', 'single'], + 'space-before-blocks': ['error', 'always'], + 'space-before-function-paren': ['error', 'always'], + 'space-infix-ops': 'error', + 
'space-unary-ops': ['error', { 'words': true, 'nonwords': false }], + 'space-in-parens': ['error', 'never'], + 'semi': ['error', 'never'], + }, + }, +] diff --git a/eval.js b/eval.js index 2c18daf0..993869b1 100644 --- a/eval.js +++ b/eval.js @@ -1,14 +1,27 @@ // eval.js -import { readFileSync, existsSync } from 'fs' +import { execSync } from 'node:child_process' +import { readFileSync, writeFileSync, existsSync } from 'node:fs' -import isValidUrl from './src/utils/isValidUrl.js' -import { extract } from './src/main.js' +import { slugify } from '@ndaidong/bellajs' + +import { isValid as isValidUrl } from './src/utils/linker.js' +import { extract, extractFromHtml } from './src/main.js' + +if (!existsSync('evaluation')) { + execSync('mkdir evaluation') +} const extractFromUrl = async (url) => { try { + console.time('extraction') const art = await extract(url) console.log(art) + if (art) { + const slug = slugify(art.title) + writeFileSync(`evaluation/${slug}.html`, art.content, 'utf8') + } + console.timeEnd('extraction') } catch (err) { console.trace(err) } @@ -17,8 +30,12 @@ const extractFromUrl = async (url) => { const extractFromFile = async (fpath) => { try { const html = readFileSync(fpath, 'utf8') - const art = await extract(html) + const art = await extractFromHtml(html) console.log(art) + if (art) { + const slug = slugify(art.title) + writeFileSync(`evaluation/${slug}.html`, art.content, 'utf8') + } } catch (err) { console.trace(err) } diff --git a/index.d.ts b/index.d.ts index 848b219f..3a495780 100644 --- a/index.d.ts +++ b/index.d.ts @@ -1,84 +1,73 @@ // Type definitions -import {AxiosRequestConfig} from "axios"; -import {IOptions as SanitizeOptions} from "sanitize-html"; -import {defaults} from "html-crush"; -import {URLPatternInit} from "urlpattern-polyfill"; +import { IOptions as SanitizeOptions } from "sanitize-html"; -type HtmlCrushOptions = Partial - -/** - * @example - * { - * patterns: [ - * '*://example.com/books/:id', { - * hostname: 
'example.com', - * pathname: '/books/:id', - * } - * ], - * selector: '.article-body', - * unwanted: ['.removing-box'] - * } - */ -export interface QueryRule { - patterns: Array<string | URLPatternInit>, - unwanted?: Array<string>, - selector?: String, - transform?: (document: Document) => Document +export interface Transformation { + patterns: Array<RegExp>, + pre?: (document: Document) => Document + post?: (document: Document) => Document } -/** - * @param input url or html - */ -export function extract(input: string): Promise<ArticleData>; - -export function setParserOptions(options: ParserOptions): void; - -export function setRequestOptions(options: AxiosRequestConfig): void; - -export function setSanitizeHtmlOptions(options: SanitizeOptions): void; - -export function setHtmlCrushOptions(options: HtmlCrushOptions): void; - -export function addQueryRules(...rules: Array<QueryRule>): Number; - -export function getQueryRules(): Array<QueryRule>; - -export function setQueryRules(rules: Array<QueryRule>): void; - -export function getParserOptions(): ParserOptions; - -export function getRequestOptions(): AxiosRequestConfig; +export function addTransformations(transformations: Array<Transformation>): Number; +export function removeTransformations(options: Array<RegExp>): Number; export function getSanitizeHtmlOptions(): SanitizeOptions; +export function setSanitizeHtmlOptions(options: SanitizeOptions): void; -export function getHtmlCrushOptions(): HtmlCrushOptions; +/** + * @param input url or html + */ export interface ParserOptions { /** - * For estimating "time to read". + * to estimate time to read. 
* Default: 300 */ - wordsPerMinute: number + wordsPerMinute?: number /** - * To find the best url from list + * max num of chars generated for description + * Default: 210 */ - urlsCompareAlgorithm: 'levenshtein' | 'cosine' | 'diceCoefficient' | 'jaccardIndex' | 'lcs' | 'mlcs' + descriptionTruncateLen?: number /** - * Min num of chars required for description - * Default: 40 + * min num of chars required for description + * Default: 180 */ - descriptionLengthThreshold: number + descriptionLengthThreshold?: number /** - * Max num of chars generated for description - * Default: 156 + * min num of chars required for content + * Default: 200 */ - descriptionTruncateLen: number + contentLengthThreshold?: number +} + +export interface ProxyConfig { + target?: string; + headers?: Record<string, string>; +} + +export interface FetchOptions { /** - * Min num of chars required for content - * Default: 200 + * list of request headers + * default: null */ - contentLengthThreshold: number + headers?: Record<string, string>; + /** + * the values to configure proxy + * default: null + */ + proxy?: ProxyConfig; + + /** + * http proxy agent + * default: null + */ + agent?: object; + /** + * signal to terminate request + * default: null + */ + signal?: object; } export interface ArticleData { @@ -87,10 +76,15 @@ export interface ArticleData { title?: string; description?: string; image?: string; - author?: any[]; - publisher?: any; + favicon?: string; + author?: string; content?: string; source?: string; published?: string; ttr?: number; + type?: string; } + +export function extract(input: string, parserOptions?: ParserOptions, fetchOptions?: FetchOptions): Promise<ArticleData>; + +export function extractFromHtml(html: string, url?: string, parserOptions?: ParserOptions): Promise<ArticleData>; diff --git a/index.js b/index.js deleted file mode 100644 index 4500fb05..00000000 --- a/index.js +++ /dev/null @@ -1,9 +0,0 @@ -/** - * Starting app - * @ndaidong - **/ - -import metadata from './package.json' - -export * from './src/main.js' 
-export const version = metadata.version diff --git a/jest.config.js b/jest.config.js index f58e590d..5aeb87d7 100644 --- a/jest.config.js +++ b/jest.config.js @@ -8,7 +8,7 @@ const config = { transform: {}, // TODO https://github.com/makotoshimazu/jest-module-field-resolver/issues/2 moduleNameMapper: { - 'urlpattern-polyfill': '/node_modules/urlpattern-polyfill/index.js' - } + 'urlpattern-polyfill': '/node_modules/urlpattern-polyfill/index.js', + }, } export default config diff --git a/package-lock.json b/package-lock.json new file mode 100644 index 00000000..135f4374 --- /dev/null +++ b/package-lock.json @@ -0,0 +1,1604 @@ +{ + "name": "@arbitral/article-parser", + "version": "8.0.20", + "lockfileVersion": 3, + "requires": true, + "packages": { + "": { + "name": "@arbitral/article-parser", + "version": "8.0.20", + "license": "MIT", + "dependencies": { + "@mozilla/readability": "^0.6.0", + "@ndaidong/bellajs": "^12.0.1", + "cross-fetch": "^4.1.0", + "linkedom": "^0.18.12", + "sanitize-html": "2.17.0" + }, + "devDependencies": { + "@eslint/js": "^9.34.0", + "@types/sanitize-html": "^2.16.0", + "eslint": "^9.34.0", + "globals": "^16.3.0", + "https-proxy-agent": "^7.0.6", + "nock": "^14.0.10" + }, + "engines": { + "node": ">= 20" + } + }, + "node_modules/@eslint-community/eslint-utils": { + "version": "4.9.0", + "resolved": "https://registry.npmjs.org/@eslint-community/eslint-utils/-/eslint-utils-4.9.0.tgz", + "integrity": "sha512-ayVFHdtZ+hsq1t2Dy24wCmGXGe4q9Gu3smhLYALJrr473ZH27MsnSL+LKUlimp4BWJqMDMLmPpx/Q9R3OAlL4g==", + "dev": true, + "license": "MIT", + "dependencies": { + "eslint-visitor-keys": "^3.4.3" + }, + "engines": { + "node": "^12.22.0 || ^14.17.0 || >=16.0.0" + }, + "funding": { + "url": "https://opencollective.com/eslint" + }, + "peerDependencies": { + "eslint": "^6.0.0 || ^7.0.0 || >=8.0.0" + } + }, + "node_modules/@eslint-community/eslint-utils/node_modules/eslint-visitor-keys": { + "version": "3.4.3", + "resolved": 
"https://registry.npmjs.org/eslint-visitor-keys/-/eslint-visitor-keys-3.4.3.tgz", + "integrity": "sha512-wpc+LXeiyiisxPlEkUzU6svyS1frIO3Mgxj1fdy7Pm8Ygzguax2N3Fa/D/ag1WqbOprdI+uY6wMUl8/a2G+iag==", + "dev": true, + "license": "Apache-2.0", + "engines": { + "node": "^12.22.0 || ^14.17.0 || >=16.0.0" + }, + "funding": { + "url": "https://opencollective.com/eslint" + } + }, + "node_modules/@eslint-community/regexpp": { + "version": "4.12.2", + "resolved": "https://registry.npmjs.org/@eslint-community/regexpp/-/regexpp-4.12.2.tgz", + "integrity": "sha512-EriSTlt5OC9/7SXkRSCAhfSxxoSUgBm33OH+IkwbdpgoqsSsUg7y3uh+IICI/Qg4BBWr3U2i39RpmycbxMq4ew==", + "dev": true, + "license": "MIT", + "engines": { + "node": "^12.0.0 || ^14.0.0 || >=16.0.0" + } + }, + "node_modules/@eslint/config-array": { + "version": "0.21.1", + "resolved": "https://registry.npmjs.org/@eslint/config-array/-/config-array-0.21.1.tgz", + "integrity": "sha512-aw1gNayWpdI/jSYVgzN5pL0cfzU02GT3NBpeT/DXbx1/1x7ZKxFPd9bwrzygx/qiwIQiJ1sw/zD8qY/kRvlGHA==", + "dev": true, + "license": "Apache-2.0", + "dependencies": { + "@eslint/object-schema": "^2.1.7", + "debug": "^4.3.1", + "minimatch": "^3.1.2" + }, + "engines": { + "node": "^18.18.0 || ^20.9.0 || >=21.1.0" + } + }, + "node_modules/@eslint/config-helpers": { + "version": "0.4.2", + "resolved": "https://registry.npmjs.org/@eslint/config-helpers/-/config-helpers-0.4.2.tgz", + "integrity": "sha512-gBrxN88gOIf3R7ja5K9slwNayVcZgK6SOUORm2uBzTeIEfeVaIhOpCtTox3P6R7o2jLFwLFTLnC7kU/RGcYEgw==", + "dev": true, + "license": "Apache-2.0", + "dependencies": { + "@eslint/core": "^0.17.0" + }, + "engines": { + "node": "^18.18.0 || ^20.9.0 || >=21.1.0" + } + }, + "node_modules/@eslint/core": { + "version": "0.17.0", + "resolved": "https://registry.npmjs.org/@eslint/core/-/core-0.17.0.tgz", + "integrity": "sha512-yL/sLrpmtDaFEiUj1osRP4TI2MDz1AddJL+jZ7KSqvBuliN4xqYY54IfdN8qD8Toa6g1iloph1fxQNkjOxrrpQ==", + "dev": true, + "license": "Apache-2.0", + "dependencies": { + 
"@types/json-schema": "^7.0.15" + }, + "engines": { + "node": "^18.18.0 || ^20.9.0 || >=21.1.0" + } + }, + "node_modules/@eslint/eslintrc": { + "version": "3.3.1", + "resolved": "https://registry.npmjs.org/@eslint/eslintrc/-/eslintrc-3.3.1.tgz", + "integrity": "sha512-gtF186CXhIl1p4pJNGZw8Yc6RlshoePRvE0X91oPGb3vZ8pM3qOS9W9NGPat9LziaBV7XrJWGylNQXkGcnM3IQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "ajv": "^6.12.4", + "debug": "^4.3.2", + "espree": "^10.0.1", + "globals": "^14.0.0", + "ignore": "^5.2.0", + "import-fresh": "^3.2.1", + "js-yaml": "^4.1.0", + "minimatch": "^3.1.2", + "strip-json-comments": "^3.1.1" + }, + "engines": { + "node": "^18.18.0 || ^20.9.0 || >=21.1.0" + }, + "funding": { + "url": "https://opencollective.com/eslint" + } + }, + "node_modules/@eslint/eslintrc/node_modules/globals": { + "version": "14.0.0", + "resolved": "https://registry.npmjs.org/globals/-/globals-14.0.0.tgz", + "integrity": "sha512-oahGvuMGQlPw/ivIYBjVSrWAfWLBeku5tpPE2fOPLi+WHffIWbuh2tCjhyQhTBPMf5E9jDEH4FOmTYgYwbKwtQ==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=18" + }, + "funding": { + "url": "https://github.com/sponsors/sindresorhus" + } + }, + "node_modules/@eslint/js": { + "version": "9.39.1", + "resolved": "https://registry.npmjs.org/@eslint/js/-/js-9.39.1.tgz", + "integrity": "sha512-S26Stp4zCy88tH94QbBv3XCuzRQiZ9yXofEILmglYTh/Ug/a9/umqvgFtYBAo3Lp0nsI/5/qH1CCrbdK3AP1Tw==", + "dev": true, + "license": "MIT", + "engines": { + "node": "^18.18.0 || ^20.9.0 || >=21.1.0" + }, + "funding": { + "url": "https://eslint.org/donate" + } + }, + "node_modules/@eslint/object-schema": { + "version": "2.1.7", + "resolved": "https://registry.npmjs.org/@eslint/object-schema/-/object-schema-2.1.7.tgz", + "integrity": "sha512-VtAOaymWVfZcmZbp6E2mympDIHvyjXs/12LqWYjVw6qjrfF+VK+fyG33kChz3nnK+SU5/NeHOqrTEHS8sXO3OA==", + "dev": true, + "license": "Apache-2.0", + "engines": { + "node": "^18.18.0 || ^20.9.0 || >=21.1.0" + } + }, + 
"node_modules/@eslint/plugin-kit": { + "version": "0.4.1", + "resolved": "https://registry.npmjs.org/@eslint/plugin-kit/-/plugin-kit-0.4.1.tgz", + "integrity": "sha512-43/qtrDUokr7LJqoF2c3+RInu/t4zfrpYdoSDfYyhg52rwLV6TnOvdG4fXm7IkSB3wErkcmJS9iEhjVtOSEjjA==", + "dev": true, + "license": "Apache-2.0", + "dependencies": { + "@eslint/core": "^0.17.0", + "levn": "^0.4.1" + }, + "engines": { + "node": "^18.18.0 || ^20.9.0 || >=21.1.0" + } + }, + "node_modules/@humanfs/core": { + "version": "0.19.1", + "resolved": "https://registry.npmjs.org/@humanfs/core/-/core-0.19.1.tgz", + "integrity": "sha512-5DyQ4+1JEUzejeK1JGICcideyfUbGixgS9jNgex5nqkW+cY7WZhxBigmieN5Qnw9ZosSNVC9KQKyb+GUaGyKUA==", + "dev": true, + "license": "Apache-2.0", + "engines": { + "node": ">=18.18.0" + } + }, + "node_modules/@humanfs/node": { + "version": "0.16.7", + "resolved": "https://registry.npmjs.org/@humanfs/node/-/node-0.16.7.tgz", + "integrity": "sha512-/zUx+yOsIrG4Y43Eh2peDeKCxlRt/gET6aHfaKpuq267qXdYDFViVHfMaLyygZOnl0kGWxFIgsBy8QFuTLUXEQ==", + "dev": true, + "license": "Apache-2.0", + "dependencies": { + "@humanfs/core": "^0.19.1", + "@humanwhocodes/retry": "^0.4.0" + }, + "engines": { + "node": ">=18.18.0" + } + }, + "node_modules/@humanwhocodes/module-importer": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/@humanwhocodes/module-importer/-/module-importer-1.0.1.tgz", + "integrity": "sha512-bxveV4V8v5Yb4ncFTT3rPSgZBOpCkjfK0y4oVVVJwIuDVBRMDXrPyXRL988i5ap9m9bnyEEjWfm5WkBmtffLfA==", + "dev": true, + "license": "Apache-2.0", + "engines": { + "node": ">=12.22" + }, + "funding": { + "type": "github", + "url": "https://github.com/sponsors/nzakas" + } + }, + "node_modules/@humanwhocodes/retry": { + "version": "0.4.3", + "resolved": "https://registry.npmjs.org/@humanwhocodes/retry/-/retry-0.4.3.tgz", + "integrity": "sha512-bV0Tgo9K4hfPCek+aMAn81RppFKv2ySDQeMoSZuvTASywNTnVJCArCZE2FWqpvIatKu7VMRLWlR1EazvVhDyhQ==", + "dev": true, + "license": "Apache-2.0", + "engines": { + "node": 
">=18.18" + }, + "funding": { + "type": "github", + "url": "https://github.com/sponsors/nzakas" + } + }, + "node_modules/@mozilla/readability": { + "version": "0.6.0", + "resolved": "https://registry.npmjs.org/@mozilla/readability/-/readability-0.6.0.tgz", + "integrity": "sha512-juG5VWh4qAivzTAeMzvY9xs9HY5rAcr2E4I7tiSSCokRFi7XIZCAu92ZkSTsIj1OPceCifL3cpfteP3pDT9/QQ==", + "license": "Apache-2.0", + "engines": { + "node": ">=14.0.0" + } + }, + "node_modules/@mswjs/interceptors": { + "version": "0.39.8", + "resolved": "https://registry.npmjs.org/@mswjs/interceptors/-/interceptors-0.39.8.tgz", + "integrity": "sha512-2+BzZbjRO7Ct61k8fMNHEtoKjeWI9pIlHFTqBwZ5icHpqszIgEZbjb1MW5Z0+bITTCTl3gk4PDBxs9tA/csXvA==", + "dev": true, + "license": "MIT", + "dependencies": { + "@open-draft/deferred-promise": "^2.2.0", + "@open-draft/logger": "^0.3.0", + "@open-draft/until": "^2.0.0", + "is-node-process": "^1.2.0", + "outvariant": "^1.4.3", + "strict-event-emitter": "^0.5.1" + }, + "engines": { + "node": ">=18" + } + }, + "node_modules/@ndaidong/bellajs": { + "version": "12.0.1", + "resolved": "https://registry.npmjs.org/@ndaidong/bellajs/-/bellajs-12.0.1.tgz", + "integrity": "sha512-1iY42uiHz0cxNMbde7O3zVN+ZX1viOOUOBRt6ht6lkRZbSjwOnFV34Zv4URp3hGzEe6L9Byk7BOq/41H0PzAOQ==", + "license": "MIT" + }, + "node_modules/@open-draft/deferred-promise": { + "version": "2.2.0", + "resolved": "https://registry.npmjs.org/@open-draft/deferred-promise/-/deferred-promise-2.2.0.tgz", + "integrity": "sha512-CecwLWx3rhxVQF6V4bAgPS5t+So2sTbPgAzafKkVizyi7tlwpcFpdFqq+wqF2OwNBmqFuu6tOyouTuxgpMfzmA==", + "dev": true, + "license": "MIT" + }, + "node_modules/@open-draft/logger": { + "version": "0.3.0", + "resolved": "https://registry.npmjs.org/@open-draft/logger/-/logger-0.3.0.tgz", + "integrity": "sha512-X2g45fzhxH238HKO4xbSr7+wBS8Fvw6ixhTDuvLd5mqh6bJJCFAPwU9mPDxbcrRtfxv4u5IHCEH77BmxvXmmxQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "is-node-process": "^1.2.0", + "outvariant": "^1.4.0" + } + }, 
+ "node_modules/@open-draft/until": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@open-draft/until/-/until-2.1.0.tgz", + "integrity": "sha512-U69T3ItWHvLwGg5eJ0n3I62nWuE6ilHlmz7zM0npLBRvPRd7e6NYmg54vvRtP5mZG7kZqZCFVdsTWo7BPtBujg==", + "dev": true, + "license": "MIT" + }, + "node_modules/@types/estree": { + "version": "1.0.8", + "resolved": "https://registry.npmjs.org/@types/estree/-/estree-1.0.8.tgz", + "integrity": "sha512-dWHzHa2WqEXI/O1E9OjrocMTKJl2mSrEolh1Iomrv6U+JuNwaHXsXx9bLu5gG7BUWFIN0skIQJQ/L1rIex4X6w==", + "dev": true, + "license": "MIT" + }, + "node_modules/@types/json-schema": { + "version": "7.0.15", + "resolved": "https://registry.npmjs.org/@types/json-schema/-/json-schema-7.0.15.tgz", + "integrity": "sha512-5+fP8P8MFNC+AyZCDxrB2pkZFPGzqQWUzpSeuuVLvm8VMcorNYavBqoFcxK8bQz4Qsbn4oUEEem4wDLfcysGHA==", + "dev": true, + "license": "MIT" + }, + "node_modules/@types/sanitize-html": { + "version": "2.16.0", + "resolved": "https://registry.npmjs.org/@types/sanitize-html/-/sanitize-html-2.16.0.tgz", + "integrity": "sha512-l6rX1MUXje5ztPT0cAFtUayXF06DqPhRyfVXareEN5gGCFaP/iwsxIyKODr9XDhfxPpN6vXUFNfo5kZMXCxBtw==", + "dev": true, + "license": "MIT", + "dependencies": { + "htmlparser2": "^8.0.0" + } + }, + "node_modules/acorn": { + "version": "8.15.0", + "resolved": "https://registry.npmjs.org/acorn/-/acorn-8.15.0.tgz", + "integrity": "sha512-NZyJarBfL7nWwIq+FDL6Zp/yHEhePMNnnJ0y3qfieCrmNvYct8uvtiV41UvlSe6apAfk0fY1FbWx+NwfmpvtTg==", + "dev": true, + "license": "MIT", + "peer": true, + "bin": { + "acorn": "bin/acorn" + }, + "engines": { + "node": ">=0.4.0" + } + }, + "node_modules/acorn-jsx": { + "version": "5.3.2", + "resolved": "https://registry.npmjs.org/acorn-jsx/-/acorn-jsx-5.3.2.tgz", + "integrity": "sha512-rq9s+JNhf0IChjtDXxllJ7g41oZk5SlXtp0LHwyA5cejwn7vKmKp4pPri6YEePv2PU65sAsegbXtIinmDFDXgQ==", + "dev": true, + "license": "MIT", + "peerDependencies": { + "acorn": "^6.0.0 || ^7.0.0 || ^8.0.0" + } + }, + "node_modules/agent-base": { + 
"version": "7.1.4", + "resolved": "https://registry.npmjs.org/agent-base/-/agent-base-7.1.4.tgz", + "integrity": "sha512-MnA+YT8fwfJPgBx3m60MNqakm30XOkyIoH1y6huTQvC0PwZG7ki8NacLBcrPbNoo8vEZy7Jpuk7+jMO+CUovTQ==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 14" + } + }, + "node_modules/ajv": { + "version": "6.12.6", + "resolved": "https://registry.npmjs.org/ajv/-/ajv-6.12.6.tgz", + "integrity": "sha512-j3fVLgvTo527anyYyJOGTYJbG+vnnQYvE0m5mmkc1TK+nxAppkCLMIL0aZ4dblVCNoGShhm+kzE4ZUykBoMg4g==", + "dev": true, + "license": "MIT", + "dependencies": { + "fast-deep-equal": "^3.1.1", + "fast-json-stable-stringify": "^2.0.0", + "json-schema-traverse": "^0.4.1", + "uri-js": "^4.2.2" + }, + "funding": { + "type": "github", + "url": "https://github.com/sponsors/epoberezkin" + } + }, + "node_modules/ansi-styles": { + "version": "4.3.0", + "resolved": "https://registry.npmjs.org/ansi-styles/-/ansi-styles-4.3.0.tgz", + "integrity": "sha512-zbB9rCJAT1rbjiVDb2hqKFHNYLxgtk8NURxZ3IZwD3F6NtxbXZQCnnSi1Lkx+IDohdPlFp222wVALIheZJQSEg==", + "dev": true, + "license": "MIT", + "dependencies": { + "color-convert": "^2.0.1" + }, + "engines": { + "node": ">=8" + }, + "funding": { + "url": "https://github.com/chalk/ansi-styles?sponsor=1" + } + }, + "node_modules/argparse": { + "version": "2.0.1", + "resolved": "https://registry.npmjs.org/argparse/-/argparse-2.0.1.tgz", + "integrity": "sha512-8+9WqebbFzpX9OR+Wa6O29asIogeRMzcGtAINdpMHHyAg10f05aSFVBbcEqGf/PXw1EjAZ+q2/bEBg3DvurK3Q==", + "dev": true, + "license": "Python-2.0" + }, + "node_modules/balanced-match": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/balanced-match/-/balanced-match-1.0.2.tgz", + "integrity": "sha512-3oSeUO0TMV67hN1AmbXsK4yaqU7tjiHlbxRDZOpH0KW9+CeX4bRAaX0Anxt0tx2MrpRpWwQaPwIlISEJhYU5Pw==", + "dev": true, + "license": "MIT" + }, + "node_modules/boolbase": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/boolbase/-/boolbase-1.0.0.tgz", + "integrity": 
"sha512-JZOSA7Mo9sNGB8+UjSgzdLtokWAky1zbztM3WRLCbZ70/3cTANmQmOdR7y2g+J0e2WXywy1yS468tY+IruqEww==", + "license": "ISC" + }, + "node_modules/brace-expansion": { + "version": "1.1.12", + "resolved": "https://registry.npmjs.org/brace-expansion/-/brace-expansion-1.1.12.tgz", + "integrity": "sha512-9T9UjW3r0UW5c1Q7GTwllptXwhvYmEzFhzMfZ9H7FQWt+uZePjZPjBP/W1ZEyZ1twGWom5/56TF4lPcqjnDHcg==", + "dev": true, + "license": "MIT", + "dependencies": { + "balanced-match": "^1.0.0", + "concat-map": "0.0.1" + } + }, + "node_modules/callsites": { + "version": "3.1.0", + "resolved": "https://registry.npmjs.org/callsites/-/callsites-3.1.0.tgz", + "integrity": "sha512-P8BjAsXvZS+VIDUI11hHCQEv74YT67YUi5JJFNWIqL235sBmjX4+qx9Muvls5ivyNENctx46xQLQ3aTuE7ssaQ==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=6" + } + }, + "node_modules/chalk": { + "version": "4.1.2", + "resolved": "https://registry.npmjs.org/chalk/-/chalk-4.1.2.tgz", + "integrity": "sha512-oKnbhFyRIXpUuez8iBMmyEa4nbj4IOQyuhc/wy9kY7/WVPcwIO9VA668Pu8RkO7+0G76SLROeyw9CpQ061i4mA==", + "dev": true, + "license": "MIT", + "dependencies": { + "ansi-styles": "^4.1.0", + "supports-color": "^7.1.0" + }, + "engines": { + "node": ">=10" + }, + "funding": { + "url": "https://github.com/chalk/chalk?sponsor=1" + } + }, + "node_modules/color-convert": { + "version": "2.0.1", + "resolved": "https://registry.npmjs.org/color-convert/-/color-convert-2.0.1.tgz", + "integrity": "sha512-RRECPsj7iu/xb5oKYcsFHSppFNnsj/52OVTRKb4zP5onXwVF3zVmmToNcOfGC+CRDpfK/U584fMg38ZHCaElKQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "color-name": "~1.1.4" + }, + "engines": { + "node": ">=7.0.0" + } + }, + "node_modules/color-name": { + "version": "1.1.4", + "resolved": "https://registry.npmjs.org/color-name/-/color-name-1.1.4.tgz", + "integrity": "sha512-dOy+3AuW3a2wNbZHIuMZpTcgjGuLU/uBL/ubcZF9OXbDo8ff4O8yVp5Bf0efS8uEoYo5q4Fx7dY9OgQGXgAsQA==", + "dev": true, + "license": "MIT" + }, + "node_modules/concat-map": { + "version": "0.0.1", 
+ "resolved": "https://registry.npmjs.org/concat-map/-/concat-map-0.0.1.tgz", + "integrity": "sha512-/Srv4dswyQNBfohGpz9o6Yb3Gz3SrUDqBH5rTuhGR7ahtlbYKnVxw2bCFMRljaA7EXHaXZ8wsHdodFvbkhKmqg==", + "dev": true, + "license": "MIT" + }, + "node_modules/cross-fetch": { + "version": "4.1.0", + "resolved": "https://registry.npmjs.org/cross-fetch/-/cross-fetch-4.1.0.tgz", + "integrity": "sha512-uKm5PU+MHTootlWEY+mZ4vvXoCn4fLQxT9dSc1sXVMSFkINTJVN8cAQROpwcKm8bJ/c7rgZVIBWzH5T78sNZZw==", + "license": "MIT", + "dependencies": { + "node-fetch": "^2.7.0" + } + }, + "node_modules/cross-spawn": { + "version": "7.0.6", + "resolved": "https://registry.npmjs.org/cross-spawn/-/cross-spawn-7.0.6.tgz", + "integrity": "sha512-uV2QOWP2nWzsy2aMp8aRibhi9dlzF5Hgh5SHaB9OiTGEyDTiJJyx0uy51QXdyWbtAHNua4XJzUKca3OzKUd3vA==", + "dev": true, + "license": "MIT", + "dependencies": { + "path-key": "^3.1.0", + "shebang-command": "^2.0.0", + "which": "^2.0.1" + }, + "engines": { + "node": ">= 8" + } + }, + "node_modules/css-select": { + "version": "5.2.2", + "resolved": "https://registry.npmjs.org/css-select/-/css-select-5.2.2.tgz", + "integrity": "sha512-TizTzUddG/xYLA3NXodFM0fSbNizXjOKhqiQQwvhlspadZokn1KDy0NZFS0wuEubIYAV5/c1/lAr0TaaFXEXzw==", + "license": "BSD-2-Clause", + "dependencies": { + "boolbase": "^1.0.0", + "css-what": "^6.1.0", + "domhandler": "^5.0.2", + "domutils": "^3.0.1", + "nth-check": "^2.0.1" + }, + "funding": { + "url": "https://github.com/sponsors/fb55" + } + }, + "node_modules/css-what": { + "version": "6.2.2", + "resolved": "https://registry.npmjs.org/css-what/-/css-what-6.2.2.tgz", + "integrity": "sha512-u/O3vwbptzhMs3L1fQE82ZSLHQQfto5gyZzwteVIEyeaY5Fc7R4dapF/BvRoSYFeqfBk4m0V1Vafq5Pjv25wvA==", + "license": "BSD-2-Clause", + "engines": { + "node": ">= 6" + }, + "funding": { + "url": "https://github.com/sponsors/fb55" + } + }, + "node_modules/cssom": { + "version": "0.5.0", + "resolved": "https://registry.npmjs.org/cssom/-/cssom-0.5.0.tgz", + "integrity": 
"sha512-iKuQcq+NdHqlAcwUY0o/HL69XQrUaQdMjmStJ8JFmUaiiQErlhrmuigkg/CU4E2J0IyUKUrMAgl36TvN67MqTw==", + "license": "MIT" + }, + "node_modules/debug": { + "version": "4.4.3", + "resolved": "https://registry.npmjs.org/debug/-/debug-4.4.3.tgz", + "integrity": "sha512-RGwwWnwQvkVfavKVt22FGLw+xYSdzARwm0ru6DhTVA3umU5hZc28V3kO4stgYryrTlLpuvgI9GiijltAjNbcqA==", + "dev": true, + "license": "MIT", + "dependencies": { + "ms": "^2.1.3" + }, + "engines": { + "node": ">=6.0" + }, + "peerDependenciesMeta": { + "supports-color": { + "optional": true + } + } + }, + "node_modules/deep-is": { + "version": "0.1.4", + "resolved": "https://registry.npmjs.org/deep-is/-/deep-is-0.1.4.tgz", + "integrity": "sha512-oIPzksmTg4/MriiaYGO+okXDT7ztn/w3Eptv/+gSIdMdKsJo0u4CfYNFJPy+4SKMuCqGw2wxnA+URMg3t8a/bQ==", + "dev": true, + "license": "MIT" + }, + "node_modules/deepmerge": { + "version": "4.3.1", + "resolved": "https://registry.npmjs.org/deepmerge/-/deepmerge-4.3.1.tgz", + "integrity": "sha512-3sUqbMEc77XqpdNO7FRyRog+eW3ph+GYCbj+rK+uYyRMuwsVy0rMiVtPn+QJlKFvWP/1PYpapqYn0Me2knFn+A==", + "license": "MIT", + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/dom-serializer": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/dom-serializer/-/dom-serializer-2.0.0.tgz", + "integrity": "sha512-wIkAryiqt/nV5EQKqQpo3SToSOV9J0DnbJqwK7Wv/Trc92zIAYZ4FlMu+JPFW1DfGFt81ZTCGgDEabffXeLyJg==", + "license": "MIT", + "dependencies": { + "domelementtype": "^2.3.0", + "domhandler": "^5.0.2", + "entities": "^4.2.0" + }, + "funding": { + "url": "https://github.com/cheeriojs/dom-serializer?sponsor=1" + } + }, + "node_modules/domelementtype": { + "version": "2.3.0", + "resolved": "https://registry.npmjs.org/domelementtype/-/domelementtype-2.3.0.tgz", + "integrity": "sha512-OLETBj6w0OsagBwdXnPdN0cnMfF9opN69co+7ZrbfPGrdpPVNBUj02spi6B1N7wChLQiPn4CSH/zJvXw56gmHw==", + "funding": [ + { + "type": "github", + "url": "https://github.com/sponsors/fb55" + } + ], + "license": "BSD-2-Clause" + }, + 
"node_modules/domhandler": { + "version": "5.0.3", + "resolved": "https://registry.npmjs.org/domhandler/-/domhandler-5.0.3.tgz", + "integrity": "sha512-cgwlv/1iFQiFnU96XXgROh8xTeetsnJiDsTc7TYCLFd9+/WNkIqPTxiM/8pSd8VIrhXGTf1Ny1q1hquVqDJB5w==", + "license": "BSD-2-Clause", + "dependencies": { + "domelementtype": "^2.3.0" + }, + "engines": { + "node": ">= 4" + }, + "funding": { + "url": "https://github.com/fb55/domhandler?sponsor=1" + } + }, + "node_modules/domutils": { + "version": "3.2.2", + "resolved": "https://registry.npmjs.org/domutils/-/domutils-3.2.2.tgz", + "integrity": "sha512-6kZKyUajlDuqlHKVX1w7gyslj9MPIXzIFiz/rGu35uC1wMi+kMhQwGhl4lt9unC9Vb9INnY9Z3/ZA3+FhASLaw==", + "license": "BSD-2-Clause", + "dependencies": { + "dom-serializer": "^2.0.0", + "domelementtype": "^2.3.0", + "domhandler": "^5.0.3" + }, + "funding": { + "url": "https://github.com/fb55/domutils?sponsor=1" + } + }, + "node_modules/entities": { + "version": "4.5.0", + "resolved": "https://registry.npmjs.org/entities/-/entities-4.5.0.tgz", + "integrity": "sha512-V0hjH4dGPh9Ao5p0MoRY6BVqtwCjhz6vI5LT8AJ55H+4g9/4vbHx1I54fS0XuclLhDHArPQCiMjDxjaL8fPxhw==", + "license": "BSD-2-Clause", + "engines": { + "node": ">=0.12" + }, + "funding": { + "url": "https://github.com/fb55/entities?sponsor=1" + } + }, + "node_modules/escape-string-regexp": { + "version": "4.0.0", + "resolved": "https://registry.npmjs.org/escape-string-regexp/-/escape-string-regexp-4.0.0.tgz", + "integrity": "sha512-TtpcNJ3XAzx3Gq8sWRzJaVajRs0uVxA2YAkdb1jm2YkPz4G6egUFAyA3n5vtEIZefPk5Wa4UXbKuS5fKkJWdgA==", + "license": "MIT", + "engines": { + "node": ">=10" + }, + "funding": { + "url": "https://github.com/sponsors/sindresorhus" + } + }, + "node_modules/eslint": { + "version": "9.39.1", + "resolved": "https://registry.npmjs.org/eslint/-/eslint-9.39.1.tgz", + "integrity": "sha512-BhHmn2yNOFA9H9JmmIVKJmd288g9hrVRDkdoIgRCRuSySRUHH7r/DI6aAXW9T1WwUuY3DFgrcaqB+deURBLR5g==", + "dev": true, + "license": "MIT", + "peer": true, + "dependencies": { + 
"@eslint-community/eslint-utils": "^4.8.0", + "@eslint-community/regexpp": "^4.12.1", + "@eslint/config-array": "^0.21.1", + "@eslint/config-helpers": "^0.4.2", + "@eslint/core": "^0.17.0", + "@eslint/eslintrc": "^3.3.1", + "@eslint/js": "9.39.1", + "@eslint/plugin-kit": "^0.4.1", + "@humanfs/node": "^0.16.6", + "@humanwhocodes/module-importer": "^1.0.1", + "@humanwhocodes/retry": "^0.4.2", + "@types/estree": "^1.0.6", + "ajv": "^6.12.4", + "chalk": "^4.0.0", + "cross-spawn": "^7.0.6", + "debug": "^4.3.2", + "escape-string-regexp": "^4.0.0", + "eslint-scope": "^8.4.0", + "eslint-visitor-keys": "^4.2.1", + "espree": "^10.4.0", + "esquery": "^1.5.0", + "esutils": "^2.0.2", + "fast-deep-equal": "^3.1.3", + "file-entry-cache": "^8.0.0", + "find-up": "^5.0.0", + "glob-parent": "^6.0.2", + "ignore": "^5.2.0", + "imurmurhash": "^0.1.4", + "is-glob": "^4.0.0", + "json-stable-stringify-without-jsonify": "^1.0.1", + "lodash.merge": "^4.6.2", + "minimatch": "^3.1.2", + "natural-compare": "^1.4.0", + "optionator": "^0.9.3" + }, + "bin": { + "eslint": "bin/eslint.js" + }, + "engines": { + "node": "^18.18.0 || ^20.9.0 || >=21.1.0" + }, + "funding": { + "url": "https://eslint.org/donate" + }, + "peerDependencies": { + "jiti": "*" + }, + "peerDependenciesMeta": { + "jiti": { + "optional": true + } + } + }, + "node_modules/eslint-scope": { + "version": "8.4.0", + "resolved": "https://registry.npmjs.org/eslint-scope/-/eslint-scope-8.4.0.tgz", + "integrity": "sha512-sNXOfKCn74rt8RICKMvJS7XKV/Xk9kA7DyJr8mJik3S7Cwgy3qlkkmyS2uQB3jiJg6VNdZd/pDBJu0nvG2NlTg==", + "dev": true, + "license": "BSD-2-Clause", + "dependencies": { + "esrecurse": "^4.3.0", + "estraverse": "^5.2.0" + }, + "engines": { + "node": "^18.18.0 || ^20.9.0 || >=21.1.0" + }, + "funding": { + "url": "https://opencollective.com/eslint" + } + }, + "node_modules/eslint-visitor-keys": { + "version": "4.2.1", + "resolved": "https://registry.npmjs.org/eslint-visitor-keys/-/eslint-visitor-keys-4.2.1.tgz", + "integrity": 
"sha512-Uhdk5sfqcee/9H/rCOJikYz67o0a2Tw2hGRPOG2Y1R2dg7brRe1uG0yaNQDHu+TO/uQPF/5eCapvYSmHUjt7JQ==", + "dev": true, + "license": "Apache-2.0", + "engines": { + "node": "^18.18.0 || ^20.9.0 || >=21.1.0" + }, + "funding": { + "url": "https://opencollective.com/eslint" + } + }, + "node_modules/espree": { + "version": "10.4.0", + "resolved": "https://registry.npmjs.org/espree/-/espree-10.4.0.tgz", + "integrity": "sha512-j6PAQ2uUr79PZhBjP5C5fhl8e39FmRnOjsD5lGnWrFU8i2G776tBK7+nP8KuQUTTyAZUwfQqXAgrVH5MbH9CYQ==", + "dev": true, + "license": "BSD-2-Clause", + "dependencies": { + "acorn": "^8.15.0", + "acorn-jsx": "^5.3.2", + "eslint-visitor-keys": "^4.2.1" + }, + "engines": { + "node": "^18.18.0 || ^20.9.0 || >=21.1.0" + }, + "funding": { + "url": "https://opencollective.com/eslint" + } + }, + "node_modules/esquery": { + "version": "1.6.0", + "resolved": "https://registry.npmjs.org/esquery/-/esquery-1.6.0.tgz", + "integrity": "sha512-ca9pw9fomFcKPvFLXhBKUK90ZvGibiGOvRJNbjljY7s7uq/5YO4BOzcYtJqExdx99rF6aAcnRxHmcUHcz6sQsg==", + "dev": true, + "license": "BSD-3-Clause", + "dependencies": { + "estraverse": "^5.1.0" + }, + "engines": { + "node": ">=0.10" + } + }, + "node_modules/esrecurse": { + "version": "4.3.0", + "resolved": "https://registry.npmjs.org/esrecurse/-/esrecurse-4.3.0.tgz", + "integrity": "sha512-KmfKL3b6G+RXvP8N1vr3Tq1kL/oCFgn2NYXEtqP8/L3pKapUA4G8cFVaoF3SU323CD4XypR/ffioHmkti6/Tag==", + "dev": true, + "license": "BSD-2-Clause", + "dependencies": { + "estraverse": "^5.2.0" + }, + "engines": { + "node": ">=4.0" + } + }, + "node_modules/estraverse": { + "version": "5.3.0", + "resolved": "https://registry.npmjs.org/estraverse/-/estraverse-5.3.0.tgz", + "integrity": "sha512-MMdARuVEQziNTeJD8DgMqmhwR11BRQ/cBP+pLtYdSTnf3MIO8fFeiINEbX36ZdNlfU/7A9f3gUw49B3oQsvwBA==", + "dev": true, + "license": "BSD-2-Clause", + "engines": { + "node": ">=4.0" + } + }, + "node_modules/esutils": { + "version": "2.0.3", + "resolved": "https://registry.npmjs.org/esutils/-/esutils-2.0.3.tgz", + 
"integrity": "sha512-kVscqXk4OCp68SZ0dkgEKVi6/8ij300KBWTJq32P/dYeWTSwK41WyTxalN1eRmA5Z9UU/LX9D7FWSmV9SAYx6g==", + "dev": true, + "license": "BSD-2-Clause", + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/fast-deep-equal": { + "version": "3.1.3", + "resolved": "https://registry.npmjs.org/fast-deep-equal/-/fast-deep-equal-3.1.3.tgz", + "integrity": "sha512-f3qQ9oQy9j2AhBe/H9VC91wLmKBCCU/gDOnKNAYG5hswO7BLKj09Hc5HYNz9cGI++xlpDCIgDaitVs03ATR84Q==", + "dev": true, + "license": "MIT" + }, + "node_modules/fast-json-stable-stringify": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/fast-json-stable-stringify/-/fast-json-stable-stringify-2.1.0.tgz", + "integrity": "sha512-lhd/wF+Lk98HZoTCtlVraHtfh5XYijIjalXck7saUtuanSDyLMxnHhSXEDJqHxD7msR8D0uCmqlkwjCV8xvwHw==", + "dev": true, + "license": "MIT" + }, + "node_modules/fast-levenshtein": { + "version": "2.0.6", + "resolved": "https://registry.npmjs.org/fast-levenshtein/-/fast-levenshtein-2.0.6.tgz", + "integrity": "sha512-DCXu6Ifhqcks7TZKY3Hxp3y6qphY5SJZmrWMDrKcERSOXWQdMhU9Ig/PYrzyw/ul9jOIyh0N4M0tbC5hodg8dw==", + "dev": true, + "license": "MIT" + }, + "node_modules/file-entry-cache": { + "version": "8.0.0", + "resolved": "https://registry.npmjs.org/file-entry-cache/-/file-entry-cache-8.0.0.tgz", + "integrity": "sha512-XXTUwCvisa5oacNGRP9SfNtYBNAMi+RPwBFmblZEF7N7swHYQS6/Zfk7SRwx4D5j3CH211YNRco1DEMNVfZCnQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "flat-cache": "^4.0.0" + }, + "engines": { + "node": ">=16.0.0" + } + }, + "node_modules/find-up": { + "version": "5.0.0", + "resolved": "https://registry.npmjs.org/find-up/-/find-up-5.0.0.tgz", + "integrity": "sha512-78/PXT1wlLLDgTzDs7sjq9hzz0vXD+zn+7wypEe4fXQxCmdmqfGsEPQxmiCSQI3ajFV91bVSsvNtrJRiW6nGng==", + "dev": true, + "license": "MIT", + "dependencies": { + "locate-path": "^6.0.0", + "path-exists": "^4.0.0" + }, + "engines": { + "node": ">=10" + }, + "funding": { + "url": "https://github.com/sponsors/sindresorhus" + } + }, + 
"node_modules/flat-cache": { + "version": "4.0.1", + "resolved": "https://registry.npmjs.org/flat-cache/-/flat-cache-4.0.1.tgz", + "integrity": "sha512-f7ccFPK3SXFHpx15UIGyRJ/FJQctuKZ0zVuN3frBo4HnK3cay9VEW0R6yPYFHC0AgqhukPzKjq22t5DmAyqGyw==", + "dev": true, + "license": "MIT", + "dependencies": { + "flatted": "^3.2.9", + "keyv": "^4.5.4" + }, + "engines": { + "node": ">=16" + } + }, + "node_modules/flatted": { + "version": "3.3.3", + "resolved": "https://registry.npmjs.org/flatted/-/flatted-3.3.3.tgz", + "integrity": "sha512-GX+ysw4PBCz0PzosHDepZGANEuFCMLrnRTiEy9McGjmkCQYwRq4A/X786G/fjM/+OjsWSU1ZrY5qyARZmO/uwg==", + "dev": true, + "license": "ISC" + }, + "node_modules/glob-parent": { + "version": "6.0.2", + "resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-6.0.2.tgz", + "integrity": "sha512-XxwI8EOhVQgWp6iDL+3b0r86f4d6AX6zSU55HfB4ydCEuXLXc5FcYeOu+nnGftS4TEju/11rt4KJPTMgbfmv4A==", + "dev": true, + "license": "ISC", + "dependencies": { + "is-glob": "^4.0.3" + }, + "engines": { + "node": ">=10.13.0" + } + }, + "node_modules/globals": { + "version": "16.5.0", + "resolved": "https://registry.npmjs.org/globals/-/globals-16.5.0.tgz", + "integrity": "sha512-c/c15i26VrJ4IRt5Z89DnIzCGDn9EcebibhAOjw5ibqEHsE1wLUgkPn9RDmNcUKyU87GeaL633nyJ+pplFR2ZQ==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=18" + }, + "funding": { + "url": "https://github.com/sponsors/sindresorhus" + } + }, + "node_modules/has-flag": { + "version": "4.0.0", + "resolved": "https://registry.npmjs.org/has-flag/-/has-flag-4.0.0.tgz", + "integrity": "sha512-EykJT/Q1KjTWctppgIAgfSO0tKVuZUjhgMr17kqTumMl6Afv3EISleU7qZUzoXDFTAHTDC4NOoG/ZxU3EvlMPQ==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=8" + } + }, + "node_modules/html-escaper": { + "version": "3.0.3", + "resolved": "https://registry.npmjs.org/html-escaper/-/html-escaper-3.0.3.tgz", + "integrity": "sha512-RuMffC89BOWQoY0WKGpIhn5gX3iI54O6nRA0yC124NYVtzjmFWBIiFd8M0x+ZdX0P9R4lADg1mgP8C7PxGOWuQ==", + 
"license": "MIT" + }, + "node_modules/htmlparser2": { + "version": "8.0.2", + "resolved": "https://registry.npmjs.org/htmlparser2/-/htmlparser2-8.0.2.tgz", + "integrity": "sha512-GYdjWKDkbRLkZ5geuHs5NY1puJ+PXwP7+fHPRz06Eirsb9ugf6d8kkXav6ADhcODhFFPMIXyxkxSuMf3D6NCFA==", + "funding": [ + "https://github.com/fb55/htmlparser2?sponsor=1", + { + "type": "github", + "url": "https://github.com/sponsors/fb55" + } + ], + "license": "MIT", + "dependencies": { + "domelementtype": "^2.3.0", + "domhandler": "^5.0.3", + "domutils": "^3.0.1", + "entities": "^4.4.0" + } + }, + "node_modules/https-proxy-agent": { + "version": "7.0.6", + "resolved": "https://registry.npmjs.org/https-proxy-agent/-/https-proxy-agent-7.0.6.tgz", + "integrity": "sha512-vK9P5/iUfdl95AI+JVyUuIcVtd4ofvtrOr3HNtM2yxC9bnMbEdp3x01OhQNnjb8IJYi38VlTE3mBXwcfvywuSw==", + "dev": true, + "license": "MIT", + "dependencies": { + "agent-base": "^7.1.2", + "debug": "4" + }, + "engines": { + "node": ">= 14" + } + }, + "node_modules/ignore": { + "version": "5.3.2", + "resolved": "https://registry.npmjs.org/ignore/-/ignore-5.3.2.tgz", + "integrity": "sha512-hsBTNUqQTDwkWtcdYI2i06Y/nUBEsNEDJKjWdigLvegy8kDuJAS8uRlpkkcQpyEXL0Z/pjDy5HBmMjRCJ2gq+g==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 4" + } + }, + "node_modules/import-fresh": { + "version": "3.3.1", + "resolved": "https://registry.npmjs.org/import-fresh/-/import-fresh-3.3.1.tgz", + "integrity": "sha512-TR3KfrTZTYLPB6jUjfx6MF9WcWrHL9su5TObK4ZkYgBdWKPOFoSoQIdEuTuR82pmtxH2spWG9h6etwfr1pLBqQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "parent-module": "^1.0.0", + "resolve-from": "^4.0.0" + }, + "engines": { + "node": ">=6" + }, + "funding": { + "url": "https://github.com/sponsors/sindresorhus" + } + }, + "node_modules/imurmurhash": { + "version": "0.1.4", + "resolved": "https://registry.npmjs.org/imurmurhash/-/imurmurhash-0.1.4.tgz", + "integrity": 
"sha512-JmXMZ6wuvDmLiHEml9ykzqO6lwFbof0GG4IkcGaENdCRDDmMVnny7s5HsIgHCbaq0w2MyPhDqkhTUgS2LU2PHA==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=0.8.19" + } + }, + "node_modules/is-extglob": { + "version": "2.1.1", + "resolved": "https://registry.npmjs.org/is-extglob/-/is-extglob-2.1.1.tgz", + "integrity": "sha512-SbKbANkN603Vi4jEZv49LeVJMn4yGwsbzZworEoyEiutsN3nJYdbO36zfhGJ6QEDpOZIFkDtnq5JRxmvl3jsoQ==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/is-glob": { + "version": "4.0.3", + "resolved": "https://registry.npmjs.org/is-glob/-/is-glob-4.0.3.tgz", + "integrity": "sha512-xelSayHH36ZgE7ZWhli7pW34hNbNl8Ojv5KVmkJD4hBdD3th8Tfk9vYasLM+mXWOZhFkgZfxhLSnrwRr4elSSg==", + "dev": true, + "license": "MIT", + "dependencies": { + "is-extglob": "^2.1.1" + }, + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/is-node-process": { + "version": "1.2.0", + "resolved": "https://registry.npmjs.org/is-node-process/-/is-node-process-1.2.0.tgz", + "integrity": "sha512-Vg4o6/fqPxIjtxgUH5QLJhwZ7gW5diGCVlXpuUfELC62CuxM1iHcRe51f2W1FDy04Ai4KJkagKjx3XaqyfRKXw==", + "dev": true, + "license": "MIT" + }, + "node_modules/is-plain-object": { + "version": "5.0.0", + "resolved": "https://registry.npmjs.org/is-plain-object/-/is-plain-object-5.0.0.tgz", + "integrity": "sha512-VRSzKkbMm5jMDoKLbltAkFQ5Qr7VDiTFGXxYFXXowVj387GeGNOCsOH6Msy00SGZ3Fp84b1Naa1psqgcCIEP5Q==", + "license": "MIT", + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/isexe": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/isexe/-/isexe-2.0.0.tgz", + "integrity": "sha512-RHxMLp9lnKHGHRng9QFhRCMbYAcVpn69smSGcq3f36xjgVVWThj4qqLbTLlq7Ssj8B+fIQ1EuCEGI2lKsyQeIw==", + "dev": true, + "license": "ISC" + }, + "node_modules/js-yaml": { + "version": "4.1.1", + "resolved": "https://registry.npmjs.org/js-yaml/-/js-yaml-4.1.1.tgz", + "integrity": "sha512-qQKT4zQxXl8lLwBtHMWwaTcGfFOZviOJet3Oy/xmGk2gZH677CJM9EvtfdSkgWcATZhj/55JZ0rmy3myCT5lsA==", 
+ "dev": true, + "license": "MIT", + "dependencies": { + "argparse": "^2.0.1" + }, + "bin": { + "js-yaml": "bin/js-yaml.js" + } + }, + "node_modules/json-buffer": { + "version": "3.0.1", + "resolved": "https://registry.npmjs.org/json-buffer/-/json-buffer-3.0.1.tgz", + "integrity": "sha512-4bV5BfR2mqfQTJm+V5tPPdf+ZpuhiIvTuAB5g8kcrXOZpTT/QwwVRWBywX1ozr6lEuPdbHxwaJlm9G6mI2sfSQ==", + "dev": true, + "license": "MIT" + }, + "node_modules/json-schema-traverse": { + "version": "0.4.1", + "resolved": "https://registry.npmjs.org/json-schema-traverse/-/json-schema-traverse-0.4.1.tgz", + "integrity": "sha512-xbbCH5dCYU5T8LcEhhuh7HJ88HXuW3qsI3Y0zOZFKfZEHcpWiHU/Jxzk629Brsab/mMiHQti9wMP+845RPe3Vg==", + "dev": true, + "license": "MIT" + }, + "node_modules/json-stable-stringify-without-jsonify": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/json-stable-stringify-without-jsonify/-/json-stable-stringify-without-jsonify-1.0.1.tgz", + "integrity": "sha512-Bdboy+l7tA3OGW6FjyFHWkP5LuByj1Tk33Ljyq0axyzdk9//JSi2u3fP1QSmd1KNwq6VOKYGlAu87CisVir6Pw==", + "dev": true, + "license": "MIT" + }, + "node_modules/json-stringify-safe": { + "version": "5.0.1", + "resolved": "https://registry.npmjs.org/json-stringify-safe/-/json-stringify-safe-5.0.1.tgz", + "integrity": "sha512-ZClg6AaYvamvYEE82d3Iyd3vSSIjQ+odgjaTzRuO3s7toCdFKczob2i0zCh7JE8kWn17yvAWhUVxvqGwUalsRA==", + "dev": true, + "license": "ISC" + }, + "node_modules/keyv": { + "version": "4.5.4", + "resolved": "https://registry.npmjs.org/keyv/-/keyv-4.5.4.tgz", + "integrity": "sha512-oxVHkHR/EJf2CNXnWxRLW6mg7JyCCUcG0DtEGmL2ctUo1PNTin1PUil+r/+4r5MpVgC/fn1kjsx7mjSujKqIpw==", + "dev": true, + "license": "MIT", + "dependencies": { + "json-buffer": "3.0.1" + } + }, + "node_modules/levn": { + "version": "0.4.1", + "resolved": "https://registry.npmjs.org/levn/-/levn-0.4.1.tgz", + "integrity": "sha512-+bT2uH4E5LGE7h/n3evcS/sQlJXCpIp6ym8OWJ5eV6+67Dsql/LaaT7qJBAt2rzfoa/5QBGBhxDix1dMt2kQKQ==", + "dev": true, + "license": "MIT", + 
"dependencies": { + "prelude-ls": "^1.2.1", + "type-check": "~0.4.0" + }, + "engines": { + "node": ">= 0.8.0" + } + }, + "node_modules/linkedom": { + "version": "0.18.12", + "resolved": "https://registry.npmjs.org/linkedom/-/linkedom-0.18.12.tgz", + "integrity": "sha512-jalJsOwIKuQJSeTvsgzPe9iJzyfVaEJiEXl+25EkKevsULHvMJzpNqwvj1jOESWdmgKDiXObyjOYwlUqG7wo1Q==", + "license": "ISC", + "dependencies": { + "css-select": "^5.1.0", + "cssom": "^0.5.0", + "html-escaper": "^3.0.3", + "htmlparser2": "^10.0.0", + "uhyphen": "^0.2.0" + }, + "engines": { + "node": ">=16" + }, + "peerDependencies": { + "canvas": ">= 2" + }, + "peerDependenciesMeta": { + "canvas": { + "optional": true + } + } + }, + "node_modules/linkedom/node_modules/entities": { + "version": "6.0.1", + "resolved": "https://registry.npmjs.org/entities/-/entities-6.0.1.tgz", + "integrity": "sha512-aN97NXWF6AWBTahfVOIrB/NShkzi5H7F9r1s9mD3cDj4Ko5f2qhhVoYMibXF7GlLveb/D2ioWay8lxI97Ven3g==", + "license": "BSD-2-Clause", + "engines": { + "node": ">=0.12" + }, + "funding": { + "url": "https://github.com/fb55/entities?sponsor=1" + } + }, + "node_modules/linkedom/node_modules/htmlparser2": { + "version": "10.0.0", + "resolved": "https://registry.npmjs.org/htmlparser2/-/htmlparser2-10.0.0.tgz", + "integrity": "sha512-TwAZM+zE5Tq3lrEHvOlvwgj1XLWQCtaaibSN11Q+gGBAS7Y1uZSWwXXRe4iF6OXnaq1riyQAPFOBtYc77Mxq0g==", + "funding": [ + "https://github.com/fb55/htmlparser2?sponsor=1", + { + "type": "github", + "url": "https://github.com/sponsors/fb55" + } + ], + "license": "MIT", + "dependencies": { + "domelementtype": "^2.3.0", + "domhandler": "^5.0.3", + "domutils": "^3.2.1", + "entities": "^6.0.0" + } + }, + "node_modules/locate-path": { + "version": "6.0.0", + "resolved": "https://registry.npmjs.org/locate-path/-/locate-path-6.0.0.tgz", + "integrity": "sha512-iPZK6eYjbxRu3uB4/WZ3EsEIMJFMqAoopl3R+zuq0UjcAm/MO6KCweDgPfP3elTztoKP3KtnVHxTn2NHBSDVUw==", + "dev": true, + "license": "MIT", + "dependencies": { + "p-locate": "^5.0.0" + }, + 
"engines": { + "node": ">=10" + }, + "funding": { + "url": "https://github.com/sponsors/sindresorhus" + } + }, + "node_modules/lodash.merge": { + "version": "4.6.2", + "resolved": "https://registry.npmjs.org/lodash.merge/-/lodash.merge-4.6.2.tgz", + "integrity": "sha512-0KpjqXRVvrYyCsX1swR/XTK0va6VQkQM6MNo7PqW77ByjAhoARA8EfrP1N4+KlKj8YS0ZUCtRT/YUuhyYDujIQ==", + "dev": true, + "license": "MIT" + }, + "node_modules/minimatch": { + "version": "3.1.2", + "resolved": "https://registry.npmjs.org/minimatch/-/minimatch-3.1.2.tgz", + "integrity": "sha512-J7p63hRiAjw1NDEww1W7i37+ByIrOWO5XQQAzZ3VOcL0PNybwpfmV/N05zFAzwQ9USyEcX6t3UO+K5aqBQOIHw==", + "dev": true, + "license": "ISC", + "dependencies": { + "brace-expansion": "^1.1.7" + }, + "engines": { + "node": "*" + } + }, + "node_modules/ms": { + "version": "2.1.3", + "resolved": "https://registry.npmjs.org/ms/-/ms-2.1.3.tgz", + "integrity": "sha512-6FlzubTLZG3J2a/NVCAleEhjzq5oxgHyaCU9yYXvcLsvoVaHJq/s5xXI6/XXP6tz7R9xAOtHnSO/tXtF3WRTlA==", + "dev": true, + "license": "MIT" + }, + "node_modules/nanoid": { + "version": "3.3.11", + "resolved": "https://registry.npmjs.org/nanoid/-/nanoid-3.3.11.tgz", + "integrity": "sha512-N8SpfPUnUp1bK+PMYW8qSWdl9U+wwNWI4QKxOYDy9JAro3WMX7p2OeVRF9v+347pnakNevPmiHhNmZ2HbFA76w==", + "funding": [ + { + "type": "github", + "url": "https://github.com/sponsors/ai" + } + ], + "license": "MIT", + "bin": { + "nanoid": "bin/nanoid.cjs" + }, + "engines": { + "node": "^10 || ^12 || ^13.7 || ^14 || >=15.0.1" + } + }, + "node_modules/natural-compare": { + "version": "1.4.0", + "resolved": "https://registry.npmjs.org/natural-compare/-/natural-compare-1.4.0.tgz", + "integrity": "sha512-OWND8ei3VtNC9h7V60qff3SVobHr996CTwgxubgyQYEpg290h9J0buyECNNJexkFm5sOajh5G116RYA1c8ZMSw==", + "dev": true, + "license": "MIT" + }, + "node_modules/nock": { + "version": "14.0.10", + "resolved": "https://registry.npmjs.org/nock/-/nock-14.0.10.tgz", + "integrity": 
"sha512-Q7HjkpyPeLa0ZVZC5qpxBt5EyLczFJ91MEewQiIi9taWuA0KB/MDJlUWtON+7dGouVdADTQsf9RA7TZk6D8VMw==", + "dev": true, + "license": "MIT", + "dependencies": { + "@mswjs/interceptors": "^0.39.5", + "json-stringify-safe": "^5.0.1", + "propagate": "^2.0.0" + }, + "engines": { + "node": ">=18.20.0 <20 || >=20.12.1" + } + }, + "node_modules/node-fetch": { + "version": "2.7.0", + "resolved": "https://registry.npmjs.org/node-fetch/-/node-fetch-2.7.0.tgz", + "integrity": "sha512-c4FRfUm/dbcWZ7U+1Wq0AwCyFL+3nt2bEw05wfxSz+DWpWsitgmSgYmy2dQdWyKC1694ELPqMs/YzUSNozLt8A==", + "license": "MIT", + "dependencies": { + "whatwg-url": "^5.0.0" + }, + "engines": { + "node": "4.x || >=6.0.0" + }, + "peerDependencies": { + "encoding": "^0.1.0" + }, + "peerDependenciesMeta": { + "encoding": { + "optional": true + } + } + }, + "node_modules/nth-check": { + "version": "2.1.1", + "resolved": "https://registry.npmjs.org/nth-check/-/nth-check-2.1.1.tgz", + "integrity": "sha512-lqjrjmaOoAnWfMmBPL+XNnynZh2+swxiX3WUE0s4yEHI6m+AwrK2UZOimIRl3X/4QctVqS8AiZjFqyOGrMXb/w==", + "license": "BSD-2-Clause", + "dependencies": { + "boolbase": "^1.0.0" + }, + "funding": { + "url": "https://github.com/fb55/nth-check?sponsor=1" + } + }, + "node_modules/optionator": { + "version": "0.9.4", + "resolved": "https://registry.npmjs.org/optionator/-/optionator-0.9.4.tgz", + "integrity": "sha512-6IpQ7mKUxRcZNLIObR0hz7lxsapSSIYNZJwXPGeF0mTVqGKFIXj1DQcMoT22S3ROcLyY/rz0PWaWZ9ayWmad9g==", + "dev": true, + "license": "MIT", + "dependencies": { + "deep-is": "^0.1.3", + "fast-levenshtein": "^2.0.6", + "levn": "^0.4.1", + "prelude-ls": "^1.2.1", + "type-check": "^0.4.0", + "word-wrap": "^1.2.5" + }, + "engines": { + "node": ">= 0.8.0" + } + }, + "node_modules/outvariant": { + "version": "1.4.3", + "resolved": "https://registry.npmjs.org/outvariant/-/outvariant-1.4.3.tgz", + "integrity": "sha512-+Sl2UErvtsoajRDKCE5/dBz4DIvHXQQnAxtQTF04OJxY0+DyZXSo5P5Bb7XYWOh81syohlYL24hbDwxedPUJCA==", + "dev": true, + "license": "MIT" + }, + 
"node_modules/p-limit": { + "version": "3.1.0", + "resolved": "https://registry.npmjs.org/p-limit/-/p-limit-3.1.0.tgz", + "integrity": "sha512-TYOanM3wGwNGsZN2cVTYPArw454xnXj5qmWF1bEoAc4+cU/ol7GVh7odevjp1FNHduHc3KZMcFduxU5Xc6uJRQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "yocto-queue": "^0.1.0" + }, + "engines": { + "node": ">=10" + }, + "funding": { + "url": "https://github.com/sponsors/sindresorhus" + } + }, + "node_modules/p-locate": { + "version": "5.0.0", + "resolved": "https://registry.npmjs.org/p-locate/-/p-locate-5.0.0.tgz", + "integrity": "sha512-LaNjtRWUBY++zB5nE/NwcaoMylSPk+S+ZHNB1TzdbMJMny6dynpAGt7X/tl/QYq3TIeE6nxHppbo2LGymrG5Pw==", + "dev": true, + "license": "MIT", + "dependencies": { + "p-limit": "^3.0.2" + }, + "engines": { + "node": ">=10" + }, + "funding": { + "url": "https://github.com/sponsors/sindresorhus" + } + }, + "node_modules/parent-module": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/parent-module/-/parent-module-1.0.1.tgz", + "integrity": "sha512-GQ2EWRpQV8/o+Aw8YqtfZZPfNRWZYkbidE9k5rpl/hC3vtHHBfGm2Ifi6qWV+coDGkrUKZAxE3Lot5kcsRlh+g==", + "dev": true, + "license": "MIT", + "dependencies": { + "callsites": "^3.0.0" + }, + "engines": { + "node": ">=6" + } + }, + "node_modules/parse-srcset": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/parse-srcset/-/parse-srcset-1.0.2.tgz", + "integrity": "sha512-/2qh0lav6CmI15FzA3i/2Bzk2zCgQhGMkvhOhKNcBVQ1ldgpbfiNTVslmooUmWJcADi1f1kIeynbDRVzNlfR6Q==", + "license": "MIT" + }, + "node_modules/path-exists": { + "version": "4.0.0", + "resolved": "https://registry.npmjs.org/path-exists/-/path-exists-4.0.0.tgz", + "integrity": "sha512-ak9Qy5Q7jYb2Wwcey5Fpvg2KoAc/ZIhLSLOSBmRmygPsGwkVVt0fZa0qrtMz+m6tJTAHfZQ8FnmB4MG4LWy7/w==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=8" + } + }, + "node_modules/path-key": { + "version": "3.1.1", + "resolved": "https://registry.npmjs.org/path-key/-/path-key-3.1.1.tgz", + "integrity": 
"sha512-ojmeN0qd+y0jszEtoY48r0Peq5dwMEkIlCOu6Q5f41lfkswXuKtYrhgoTpLnyIcHm24Uhqx+5Tqm2InSwLhE6Q==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=8" + } + }, + "node_modules/picocolors": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/picocolors/-/picocolors-1.1.1.tgz", + "integrity": "sha512-xceH2snhtb5M9liqDsmEw56le376mTZkEX/jEb/RxNFyegNul7eNslCXP9FDj/Lcu0X8KEyMceP2ntpaHrDEVA==", + "license": "ISC" + }, + "node_modules/postcss": { + "version": "8.5.6", + "resolved": "https://registry.npmjs.org/postcss/-/postcss-8.5.6.tgz", + "integrity": "sha512-3Ybi1tAuwAP9s0r1UQ2J4n5Y0G05bJkpUIO0/bI9MhwmD70S5aTWbXGBwxHrelT+XM1k6dM0pk+SwNkpTRN7Pg==", + "funding": [ + { + "type": "opencollective", + "url": "https://opencollective.com/postcss/" + }, + { + "type": "tidelift", + "url": "https://tidelift.com/funding/github/npm/postcss" + }, + { + "type": "github", + "url": "https://github.com/sponsors/ai" + } + ], + "license": "MIT", + "dependencies": { + "nanoid": "^3.3.11", + "picocolors": "^1.1.1", + "source-map-js": "^1.2.1" + }, + "engines": { + "node": "^10 || ^12 || >=14" + } + }, + "node_modules/prelude-ls": { + "version": "1.2.1", + "resolved": "https://registry.npmjs.org/prelude-ls/-/prelude-ls-1.2.1.tgz", + "integrity": "sha512-vkcDPrRZo1QZLbn5RLGPpg/WmIQ65qoWWhcGKf/b5eplkkarX0m9z8ppCat4mlOqUsWpyNuYgO3VRyrYHSzX5g==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 0.8.0" + } + }, + "node_modules/propagate": { + "version": "2.0.1", + "resolved": "https://registry.npmjs.org/propagate/-/propagate-2.0.1.tgz", + "integrity": "sha512-vGrhOavPSTz4QVNuBNdcNXePNdNMaO1xj9yBeH1ScQPjk/rhg9sSlCXPhMkFuaNNW/syTvYqsnbIJxMBfRbbag==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 8" + } + }, + "node_modules/punycode": { + "version": "2.3.1", + "resolved": "https://registry.npmjs.org/punycode/-/punycode-2.3.1.tgz", + "integrity": 
"sha512-vYt7UD1U9Wg6138shLtLOvdAu+8DsC/ilFtEVHcH+wydcSpNE20AfSOduf6MkRFahL5FY7X1oU7nKVZFtfq8Fg==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=6" + } + }, + "node_modules/resolve-from": { + "version": "4.0.0", + "resolved": "https://registry.npmjs.org/resolve-from/-/resolve-from-4.0.0.tgz", + "integrity": "sha512-pb/MYmXstAkysRFx8piNI1tGFNQIFA3vkE3Gq4EuA1dF6gHp/+vgZqsCGJapvy8N3Q+4o7FwvquPJcnZ7RYy4g==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=4" + } + }, + "node_modules/sanitize-html": { + "version": "2.17.0", + "resolved": "https://registry.npmjs.org/sanitize-html/-/sanitize-html-2.17.0.tgz", + "integrity": "sha512-dLAADUSS8rBwhaevT12yCezvioCA+bmUTPH/u57xKPT8d++voeYE6HeluA/bPbQ15TwDBG2ii+QZIEmYx8VdxA==", + "license": "MIT", + "dependencies": { + "deepmerge": "^4.2.2", + "escape-string-regexp": "^4.0.0", + "htmlparser2": "^8.0.0", + "is-plain-object": "^5.0.0", + "parse-srcset": "^1.0.2", + "postcss": "^8.3.11" + } + }, + "node_modules/shebang-command": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/shebang-command/-/shebang-command-2.0.0.tgz", + "integrity": "sha512-kHxr2zZpYtdmrN1qDjrrX/Z1rR1kG8Dx+gkpK1G4eXmvXswmcE1hTWBWYUzlraYw1/yZp6YuDY77YtvbN0dmDA==", + "dev": true, + "license": "MIT", + "dependencies": { + "shebang-regex": "^3.0.0" + }, + "engines": { + "node": ">=8" + } + }, + "node_modules/shebang-regex": { + "version": "3.0.0", + "resolved": "https://registry.npmjs.org/shebang-regex/-/shebang-regex-3.0.0.tgz", + "integrity": "sha512-7++dFhtcx3353uBaq8DDR4NuxBetBzC7ZQOhmTQInHEd6bSrXdiEyzCvG07Z44UYdLShWUyXt5M/yhz8ekcb1A==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=8" + } + }, + "node_modules/source-map-js": { + "version": "1.2.1", + "resolved": "https://registry.npmjs.org/source-map-js/-/source-map-js-1.2.1.tgz", + "integrity": "sha512-UXWMKhLOwVKb728IUtQPXxfYU+usdybtUrK/8uGE8CQMvrhOpwvzDBwj0QhSL7MQc7vIsISBG8VQ8+IDQxpfQA==", + "license": "BSD-3-Clause", + "engines": { + "node": 
">=0.10.0" + } + }, + "node_modules/strict-event-emitter": { + "version": "0.5.1", + "resolved": "https://registry.npmjs.org/strict-event-emitter/-/strict-event-emitter-0.5.1.tgz", + "integrity": "sha512-vMgjE/GGEPEFnhFub6pa4FmJBRBVOLpIII2hvCZ8Kzb7K0hlHo7mQv6xYrBvCL2LtAIBwFUK8wvuJgTVSQ5MFQ==", + "dev": true, + "license": "MIT" + }, + "node_modules/strip-json-comments": { + "version": "3.1.1", + "resolved": "https://registry.npmjs.org/strip-json-comments/-/strip-json-comments-3.1.1.tgz", + "integrity": "sha512-6fPc+R4ihwqP6N/aIv2f1gMH8lOVtWQHoqC4yK6oSDVVocumAsfCqjkXnqiYMhmMwS/mEHLp7Vehlt3ql6lEig==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=8" + }, + "funding": { + "url": "https://github.com/sponsors/sindresorhus" + } + }, + "node_modules/supports-color": { + "version": "7.2.0", + "resolved": "https://registry.npmjs.org/supports-color/-/supports-color-7.2.0.tgz", + "integrity": "sha512-qpCAvRl9stuOHveKsn7HncJRvv501qIacKzQlO/+Lwxc9+0q2wLyv4Dfvt80/DPn2pqOBsJdDiogXGR9+OvwRw==", + "dev": true, + "license": "MIT", + "dependencies": { + "has-flag": "^4.0.0" + }, + "engines": { + "node": ">=8" + } + }, + "node_modules/tr46": { + "version": "0.0.3", + "resolved": "https://registry.npmjs.org/tr46/-/tr46-0.0.3.tgz", + "integrity": "sha512-N3WMsuqV66lT30CrXNbEjx4GEwlow3v6rr4mCcv6prnfwhS01rkgyFdjPNBYd9br7LpXV1+Emh01fHnq2Gdgrw==", + "license": "MIT" + }, + "node_modules/type-check": { + "version": "0.4.0", + "resolved": "https://registry.npmjs.org/type-check/-/type-check-0.4.0.tgz", + "integrity": "sha512-XleUoc9uwGXqjWwXaUTZAmzMcFZ5858QA2vvx1Ur5xIcixXIP+8LnFDgRplU30us6teqdlskFfu+ae4K79Ooew==", + "dev": true, + "license": "MIT", + "dependencies": { + "prelude-ls": "^1.2.1" + }, + "engines": { + "node": ">= 0.8.0" + } + }, + "node_modules/uhyphen": { + "version": "0.2.0", + "resolved": "https://registry.npmjs.org/uhyphen/-/uhyphen-0.2.0.tgz", + "integrity": "sha512-qz3o9CHXmJJPGBdqzab7qAYuW8kQGKNEuoHFYrBwV6hWIMcpAmxDLXojcHfFr9US1Pe6zUswEIJIbLI610fuqA==", + 
"license": "ISC" + }, + "node_modules/uri-js": { + "version": "4.4.1", + "resolved": "https://registry.npmjs.org/uri-js/-/uri-js-4.4.1.tgz", + "integrity": "sha512-7rKUyy33Q1yc98pQ1DAmLtwX109F7TIfWlW1Ydo8Wl1ii1SeHieeh0HHfPeL2fMXK6z0s8ecKs9frCuLJvndBg==", + "dev": true, + "license": "BSD-2-Clause", + "dependencies": { + "punycode": "^2.1.0" + } + }, + "node_modules/webidl-conversions": { + "version": "3.0.1", + "resolved": "https://registry.npmjs.org/webidl-conversions/-/webidl-conversions-3.0.1.tgz", + "integrity": "sha512-2JAn3z8AR6rjK8Sm8orRC0h/bcl/DqL7tRPdGZ4I1CjdF+EaMLmYxBHyXuKL849eucPFhvBoxMsflfOb8kxaeQ==", + "license": "BSD-2-Clause" + }, + "node_modules/whatwg-url": { + "version": "5.0.0", + "resolved": "https://registry.npmjs.org/whatwg-url/-/whatwg-url-5.0.0.tgz", + "integrity": "sha512-saE57nupxk6v3HY35+jzBwYa0rKSy0XR8JSxZPwgLr7ys0IBzhGviA1/TUGJLmSVqs8pb9AnvICXEuOHLprYTw==", + "license": "MIT", + "dependencies": { + "tr46": "~0.0.3", + "webidl-conversions": "^3.0.0" + } + }, + "node_modules/which": { + "version": "2.0.2", + "resolved": "https://registry.npmjs.org/which/-/which-2.0.2.tgz", + "integrity": "sha512-BLI3Tl1TW3Pvl70l3yq3Y64i+awpwXqsGBYWkkqMtnbXgrMD+yj7rhW0kuEDxzJaYXGjEW5ogapKNMEKNMjibA==", + "dev": true, + "license": "ISC", + "dependencies": { + "isexe": "^2.0.0" + }, + "bin": { + "node-which": "bin/node-which" + }, + "engines": { + "node": ">= 8" + } + }, + "node_modules/word-wrap": { + "version": "1.2.5", + "resolved": "https://registry.npmjs.org/word-wrap/-/word-wrap-1.2.5.tgz", + "integrity": "sha512-BN22B5eaMMI9UMtjrGd5g5eCYPpCPDUy0FJXbYsaT5zYxjFOckS53SQDE3pWkVoWpHXVb3BrYcEN4Twa55B5cA==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/yocto-queue": { + "version": "0.1.0", + "resolved": "https://registry.npmjs.org/yocto-queue/-/yocto-queue-0.1.0.tgz", + "integrity": "sha512-rVksvsnNCdJ/ohGc6xgPwyN8eheCxsiLM8mxuE/t/mOVqJewPuO1miLpTHQiRgTKCLexL4MeAFVagts7HmNZ2Q==", + "dev": true, + "license": 
"MIT", + "engines": { + "node": ">=10" + }, + "funding": { + "url": "https://github.com/sponsors/sindresorhus" + } + } + } +} diff --git a/package.json b/package.json index 33a1ca0f..239d57de 100644 --- a/package.json +++ b/package.json @@ -1,56 +1,48 @@ { - "version": "6.0.7", + "version": "8.0.20", "name": "@arbitral/article-parser", "description": "To extract main article from given URL", - "homepage": "https://ndaidong.github.io/article-parser-demo/", + "homepage": "https://github.com/extractus/article-extractor", "repository": { "type": "git", - "url": "git@github.com:ndaidong/article-parser.git" + "url": "git@github.com:extractus/article-extractor.git" + }, + "author": "@extractus", + "main": "./src/main.js", + "type": "module", + "imports": { + "cross-fetch": "./src/deno/cross-fetch.js" }, - "author": "@ndaidong", - "main": "./dist/cjs/arbitral-article-parser.js", - "module": "./src/main.js", "browser": { - "linkedom": "./src/browser/linkedom.js", - "./main.js": "./dist/arbitral-article-parser.browser.js" + "cross-fetch": "./src/deno/cross-fetch.js", + "linkedom": "./src/browser/linkedom.js" }, - "type": "module", "types": "./index.d.ts", "engines": { - "node": ">= 14" + "node": ">= 20" }, "scripts": { - "lint": "standard .", + "lint": "eslint .", + "lint:fix": "eslint --fix .", "pretest": "npm run lint", - "test": "cross-env NODE_ENV=test NODE_OPTIONS=--experimental-vm-modules jest --unhandled-rejections=strict", - "build": "node build", - "eval": "cross-env DEBUG=*:* node eval", - "eval:cjs": "cross-env DEBUG=*:* node eval.cjs", + "test": "node --test --experimental-test-coverage", + "eval": "node eval", "reset": "node reset" }, "dependencies": { - "@mozilla/readability": "^0.4.2", - "axios": "^0.27.2", - "bellajs": "^11.0.2", - "debug": "^4.3.4", - "html-crush": "^5.0.18", - "linkedom": "^0.14.9", - "sanitize-html": "^2.7.0", - "string-comparison": "^1.1.0", - "urlpattern-polyfill": "^5.0.3" - }, - "standard": { - "ignore": [ - "/dist" - ] + 
"@mozilla/readability": "^0.6.0", + "@ndaidong/bellajs": "^12.0.1", + "cross-fetch": "^4.1.0", + "linkedom": "^0.18.12", + "sanitize-html": "2.17.0" }, "devDependencies": { - "@types/sanitize-html": "^2.6.2", - "cross-env": "^7.0.3", - "esbuild": "^0.14.41", - "jest": "^28.1.0", - "nock": "^13.2.4", - "standard": "^17.0.0" + "@eslint/js": "^9.34.0", + "@types/sanitize-html": "^2.16.0", + "eslint": "^9.34.0", + "globals": "^16.3.0", + "https-proxy-agent": "^7.0.6", + "nock": "^14.0.10" }, "keywords": [ "article", diff --git a/prettier.config.cjs b/prettier.config.cjs index a6d43d72..2a998112 100644 --- a/prettier.config.cjs +++ b/prettier.config.cjs @@ -4,5 +4,5 @@ module.exports = { singleQuote: true, tabWidth: 2, trailingComma: 'none', - useTabs: false + useTabs: false, } diff --git a/reset.js b/reset.js index c1e50323..6afa5a09 100644 --- a/reset.js +++ b/reset.js @@ -1,29 +1,26 @@ -/** - * reset.js - * @ndaidong -**/ +// reset.js import { existsSync, unlinkSync -} from 'fs' +} from 'node:fs' -import { execSync } from 'child_process' +import { execSync } from 'node:child_process' const dirs = [ - 'dist', + 'evaluation', 'docs', '.nyc_output', 'coverage', 'node_modules', - '.nuxt' + '.nuxt', ] const files = [ 'yarn.lock', 'pnpm-lock.yaml', 'package-lock.json', - 'coverage.lcov' + 'coverage.lcov', ] dirs.forEach((d) => { diff --git a/src/browser/linkedom.js b/src/browser/linkedom.js index 11f7bdac..6d5be046 100644 --- a/src/browser/linkedom.js +++ b/src/browser/linkedom.js @@ -1 +1 @@ -export const DOMParser = global.DOMParser +export const DOMParser = window.DOMParser diff --git a/src/config.js b/src/config.js index 6aecb82a..50d5d75a 100644 --- a/src/config.js +++ b/src/config.js @@ -1,74 +1,55 @@ -// configs +// config.js -import { clone, copies, isArray } from 'bellajs' - -import { rules as defaultRules } from './rules.js' - -let rules = clone(defaultRules) - -const requestOptions = { - headers: { - 'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:95.0) 
Gecko/20100101 Firefox/95.0', - accept: 'text/html; charset=utf-8' - }, - responseType: 'text', - responseEncoding: 'utf8', - timeout: 6e4, // 1 minute - maxRedirects: 3 -} +import { clone } from '@ndaidong/bellajs' const sanitizeHtmlOptions = { - allowedTags: ['h1', 'h2', 'h3', 'h4', 'h5', 'u', 'b', 'i', 'em', 'strong', 'small', 'sup', 'sub', 'div', 'span', 'p', 'article', 'blockquote', 'section', 'details', 'summary', 'pre', 'code', 'ul', 'ol', 'li', 'dd', 'dl', 'table', 'th', 'tr', 'td', 'thead', 'tbody', 'tfood', 'fieldset', 'legend', 'figure', 'figcaption', 'img', 'picture', 'video', 'audio', 'source', 'iframe', 'progress', 'br', 'p', 'hr', 'label', 'abbr', 'a', 'svg'], + allowedTags: [ + 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', + 'u', 'b', 'i', 'em', 'strong', 'small', 'sup', 'sub', + 'div', 'span', 'p', 'article', 'blockquote', 'section', + 'details', 'summary', + 'pre', 'code', + 'ul', 'ol', 'li', 'dd', 'dl', + 'table', 'th', 'tr', 'td', 'thead', 'tbody', 'tfood', + 'fieldset', 'legend', + 'figure', 'figcaption', 'img', 'picture', + 'video', 'audio', 'source', + 'iframe', + 'progress', + 'br', 'p', 'hr', + 'label', + 'abbr', + 'a', + 'svg', + ], allowedAttributes: { + h1: ['id'], + h2: ['id'], + h3: ['id'], + h4: ['id'], + h5: ['id'], + h6: ['id'], a: ['href', 'target', 'title'], abbr: ['title'], progress: ['value', 'max'], - img: ['src', 'srcset', 'alt', 'width', 'height', 'style', 'title'], + img: ['src', 'srcset', 'alt', 'title'], picture: ['media', 'srcset'], - video: ['controls', 'width', 'height', 'autoplay', 'muted'], - audio: ['controls'], + video: ['controls', 'width', 'height', 'autoplay', 'muted', 'loop', 'src'], + audio: ['controls', 'width', 'height', 'autoplay', 'muted', 'loop', 'src'], source: ['src', 'srcset', 'data-srcset', 'type', 'media', 'sizes'], - iframe: ['src', 'frameborder', 'height', 'width', 'scrolling'], - svg: ['width', 'height'] // sanitize-html does not support svg fully yet + iframe: ['src', 'frameborder', 'height', 'width', 
'scrolling', 'allow'], + svg: ['width', 'height'], // sanitize-html does not support svg fully yet }, - allowedIframeDomains: ['youtube.com', 'vimeo.com'] -} - -const parserOptions = { - wordsPerMinute: 300, // to estimate "time to read" - urlsCompareAlgorithm: 'levenshtein', // to find the best url from list - descriptionLengthThreshold: 40, // min num of chars required for description - descriptionTruncateLen: 156, // max num of chars generated for description - contentLengthThreshold: 200 // content must have at least 200 chars -} - -/** - * @type {HtmlCrushOptions} - */ -const htmlCrushOptions = { - removeHTMLComments: 2, - removeLineBreaks: true -} - -/** - * @returns {ParserOptions} - */ -export const getParserOptions = () => { - return clone(parserOptions) -} - -/** - * @returns {RequestOptions} - */ -export const getRequestOptions = () => { - return clone(requestOptions) -} - -/** - * @returns {HtmlCrushOptions} - */ -export const getHtmlCrushOptions = () => { - return clone(htmlCrushOptions) + allowedIframeDomains: [ + 'youtube.com', 'vimeo.com', 'odysee.com', + 'soundcloud.com', 'audius.co', + 'github.com', 'codepen.com', + 'twitter.com', 'facebook.com', 'instagram.com', + ], + disallowedTagsMode: 'discard', + allowVulnerableTags: false, + parseStyleAttributes: false, + enforceHtmlBoundary: false, } /** @@ -78,40 +59,8 @@ export const getSanitizeHtmlOptions = () => { return clone(sanitizeHtmlOptions) } -export const setParserOptions = (opts) => { - Object.keys(parserOptions).forEach((key) => { - if (key in opts) { - parserOptions[key] = opts[key] - } - }) -} - -export const setRequestOptions = (opts) => { - copies(opts, requestOptions) -} - -export const setHtmlCrushOptions = (opts) => { - copies(opts, htmlCrushOptions) -} - -export const setSanitizeHtmlOptions = (opts) => { +export const setSanitizeHtmlOptions = (opts = {}) => { Object.keys(opts).forEach((key) => { sanitizeHtmlOptions[key] = clone(opts[key]) }) } - -/** - * @returns {QueryRule[]} - */ 
-export const getQueryRules = () => clone(rules) - -/** - * @param value {QueryRule[]} - */ -export const setQueryRules = (value) => { rules = value } - -/** - * @param entries {QueryRule} - * @returns {number} - */ -export const addQueryRules = (...entries) => rules.unshift(...entries.filter((item) => isArray(item?.patterns))) diff --git a/src/config.test.js b/src/config.test.js index cc4c6758..cf3ba632 100644 --- a/src/config.test.js +++ b/src/config.test.js @@ -1,114 +1,37 @@ // config.test -/* eslint-env jest */ +import { describe, it } from 'node:test' +import assert from 'node:assert' import { - setRequestOptions, - getRequestOptions, - setParserOptions, - getParserOptions, setSanitizeHtmlOptions, - getSanitizeHtmlOptions, - getQueryRules, - addQueryRules + getSanitizeHtmlOptions } from './config.js' -import { rules as defaultRules } from './rules.js' - -test('Testing setRequestOptions/getRequestOptions methods', () => { - setRequestOptions({ - headers: { - authorization: 'bearer ' - }, - timeout: 20, - somethingElse: 1000 - }) - - const actual = getRequestOptions() - const expectedHeader = { - authorization: 'bearer ', - 'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0', - accept: 'text/html; charset=utf-8' - } - - expect(actual.headers).toEqual(expectedHeader) - expect(actual.timeout).toEqual(20) -}) - -test('Testing setParserOptions/getParserOptions methods', () => { - const expectedWPM = 400 - const expectedAlgorithm = 'levenshtein' - - setParserOptions({ - wordsPerMinute: expectedWPM - }) - - const actual = getParserOptions() - - expect(actual.wordsPerMinute).toEqual(expectedWPM) - expect(actual.urlsCompareAlgorithm).toEqual(expectedAlgorithm) -}) - -test('Testing setSanitizeHtmlOptions/getSanitizeHtmlOptions methods', () => { - setSanitizeHtmlOptions({ - allowedTags: ['div', 'span'], - allowedAttributes: { - a: ['href', 'title'] +describe('check config methods', () => { + it('Testing 
setSanitizeHtmlOptions/getSanitizeHtmlOptions methods', () => { + setSanitizeHtmlOptions({ + allowedTags: ['div', 'span'], + allowedAttributes: { + a: ['href', 'title'], + }, + }) + + const actual = getSanitizeHtmlOptions() + const actualAllowedAttributes = actual.allowedAttributes + const expectedAllowedAttributes = { + a: ['href', 'title'], } - }) - const actual = getSanitizeHtmlOptions() - const actualAllowedAttributes = actual.allowedAttributes - const expectedAllowedAttributes = { - a: ['href', 'title'] - } + assert.deepEqual(actualAllowedAttributes, expectedAllowedAttributes) - expect(actualAllowedAttributes).toEqual(expectedAllowedAttributes) + const actualAllowedTags = actual.allowedTags + const expectedAllowedTags = ['div', 'span'] + assert.deepEqual(actualAllowedTags, expectedAllowedTags) - const actualAllowedTags = actual.allowedTags - const expectedAllowedTags = ['div', 'span'] - expect(actualAllowedTags).toEqual(expectedAllowedTags) + setSanitizeHtmlOptions({ + allowedTags: [], + }) - setSanitizeHtmlOptions({ - allowedTags: [] + assert.deepEqual(getSanitizeHtmlOptions().allowedTags, []) }) - - expect(getSanitizeHtmlOptions().allowedTags).toEqual([]) -}) - -test('Testing addQueryRules/getQueryRules methods', () => { - const currentRules = getQueryRules() - - expect(currentRules).toEqual(defaultRules) - - addQueryRules() - addQueryRules(...[]) - expect(getQueryRules()).toHaveLength(defaultRules.length) - - const newRules = [ - { - patterns: [ - /somewhere.com\/*/ - ], - selector: '.article-body', - unwanted: [ - '.removing-box', - '.ads-section' - ] - }, - { - patterns: [ - /elsewhere.net\/*/ - ], - selector: '.main-content', - unwanted: [ - '.related-posts' - ] - } - ] - addQueryRules(...newRules) - - const updatedRules = getQueryRules() - expect(updatedRules).toHaveLength(defaultRules.length + newRules.length) - expect(updatedRules[0]).toEqual(newRules[0]) - expect(updatedRules[updatedRules.length - 1]).toEqual(defaultRules[defaultRules.length - 1]) }) 
diff --git a/src/deno/cross-fetch.js b/src/deno/cross-fetch.js new file mode 100644 index 00000000..d084f98d --- /dev/null +++ b/src/deno/cross-fetch.js @@ -0,0 +1,2 @@ +// cross-fetch.js +export default fetch diff --git a/src/main.js b/src/main.js index badd0797..7b65fba2 100644 --- a/src/main.js +++ b/src/main.js @@ -1,36 +1,36 @@ -/** - * Article parser - * @ndaidong - **/ +// main.js import { isString -} from 'bellajs' - -import isValidUrl from './utils/isValidUrl.js' -import isHTMLString from './utils/isHTMLString.js' +} from '@ndaidong/bellajs' import retrieve from './utils/retrieve.js' - import parseFromHtml from './utils/parseFromHtml.js' +import { getCharset } from './utils/html.js' +import { isValid as isValidUrl } from './utils/linker.js' -export const extract = async (input) => { +export const extract = async (input, parserOptions = {}, fetchOptions = {}) => { if (!isString(input)) { throw new Error('Input must be a string') } - if (isHTMLString(input)) { - return parseFromHtml(input) - } if (!isValidUrl(input)) { - throw new Error('Input must be a valid URL') + return parseFromHtml(input, null, parserOptions || {}) } - const html = await retrieve(input) - if (!html) { + const buffer = await retrieve(input, fetchOptions) + const text = buffer ? 
Buffer.from(buffer).toString().trim() : '' + if (!text) { return null } + const charset = getCharset(text) + const decoder = new TextDecoder(charset) + const html = decoder.decode(buffer) + return parseFromHtml(html, input, parserOptions || {}) +} - return parseFromHtml(html, input) +export const extractFromHtml = async (html, url, parserOptions = {}) => { + return parseFromHtml(html, url, parserOptions) } -export * from './config.js' +export { addTransformations, removeTransformations } from './utils/transformation.js' +export { setSanitizeHtmlOptions, getSanitizeHtmlOptions } from './config.js' diff --git a/src/main.test.js b/src/main.test.js index b3b7e15d..b52794f2 100644 --- a/src/main.test.js +++ b/src/main.test.js @@ -1,24 +1,49 @@ // main.test -/* eslint-env jest */ -import { - readFileSync -} from 'fs' +import { describe, it } from 'node:test' +import assert from 'node:assert' + +import { readFileSync } from 'fs' + +import { HttpsProxyAgent } from 'https-proxy-agent' import nock from 'nock' import { - extract -} from './main' + extract, + getSanitizeHtmlOptions, + setSanitizeHtmlOptions, + addTransformations, + removeTransformations +} from './main.js' + +const env = process.env || {} +const PROXY_SERVER = env.PROXY_SERVER || '' const parseUrl = (url) => { const re = new URL(url) return { baseUrl: `${re.protocol}//${re.host}`, - path: re.pathname + path: re.pathname, } } +describe('check all exported methods', () => { + const fns = [ + extract, + getSanitizeHtmlOptions, + setSanitizeHtmlOptions, + addTransformations, + removeTransformations, + ] + + fns.forEach((fn) => { + it(` check ${fn.name}`, () => { + assert.ok(fn) + }) + }) +}) + describe('test extract(bad url)', () => { const badSamples = [ '', @@ -29,51 +54,56 @@ describe('test extract(bad url)', () => { 'fpt://abc.com/failed-none-sense', 'ttp://badcom/146753785', 'https://674458092126388225', - 'https://soundcloud^(*%%$%^$$%$$*&(&)())' + 'https://soundcloud^(*%%$%^$$%$$*&(&)())', ] 
badSamples.forEach((url) => { - test(`testing extract bad url "${url}"`, async () => { + it(`testing extract bad url "${url}"`, async () => { try { await extract(url) } catch (err) { - expect(err).toBeTruthy() + assert.ok(err) } }) }) }) describe('test extract(regular article url)', () => { + const expDesc = [ + 'Navigation here Few can name a rational peach that isn\'t a conscientious goldfish!', + 'One cannot separate snakes from plucky pomegranates?', + 'Draped neatly on a hanger, the melons could be said to resemble knowledgeable pigs.', + ].join(' ') const cases = [ { input: { url: 'https://somewhere.com/path/to/no/article', - html: readFileSync('./test-data/html-no-article.html', 'utf8') + html: readFileSync('./test-data/html-no-article.html', 'utf8'), + }, + validate: (result) => { + assert.equal(result, null) }, - validate: (result, expect) => { - expect(result).toBeFalsy() - } }, { input: { url: 'https://somewhere.com/path/to/no/content', - html: '' + html: '', + }, + validate: (result) => { + assert.equal(result, null) }, - validate: (result, expect) => { - expect(result).toBeFalsy() - } }, { input: { url: 'https://somewhere.com/path/to/article', - html: readFileSync('./test-data/regular-article.html', 'utf8') + html: readFileSync('./test-data/regular-article.html', 'utf8'), }, - validate: (result, expect) => { - expect(result).toBeTruthy() - expect(result.title).toEqual('Article title here') - expect(result.description).toEqual('Few words to summarize this article content') - } - } + validate: (result) => { + assert.ok(result) + assert.equal(result.title, 'Article title here') + assert.equal(result.description, expDesc) + }, + }, ] cases.forEach(({ input, validate }) => { const { url, html, statusCode = 200 } = input @@ -81,19 +111,54 @@ describe('test extract(regular article url)', () => { const scope = nock(baseUrl) scope.get(path) .reply(statusCode, html, { - 'Content-Type': 'text/html' + 'Content-Type': 'text/html', }) - test(`check 
extract("${url}")`, async () => { + it(`check extract("${url}")`, async () => { const result = await extract(url) - validate(result, expect) + validate(result) }) }) - test('check extract(html string)', async () => { + it('check extract(html string)', async () => { const html = readFileSync('./test-data/regular-article.html', 'utf8') const result = await extract(html) - expect(result).toBeTruthy() - expect(result.title).toEqual('Article title here') - expect(result.description).toEqual('Few words to summarize this article content') + assert.ok(result) + assert.equal(result.title, 'Article title here') + assert.equal(result.description, expDesc) }) }) + +describe('test extract with modified sanitize-html options', () => { + const currentSanitizeOptions = getSanitizeHtmlOptions() + + setSanitizeHtmlOptions({ + ...currentSanitizeOptions, + allowedAttributes: { + ...currentSanitizeOptions.allowedAttributes, + code: ['class'], + div: ['class'], + }, + allowedClasses: { + code: ['language-*', 'lang-*'], + }, + }) + + it('check if output contain class attribute', async () => { + const html = readFileSync('./test-data/article-with-classes-attributes.html', 'utf8') + const result = await extract(html) + assert.ok(result.content.includes('code class="lang-js"')) + }) +}) + +if (PROXY_SERVER !== '') { + describe('test extract live article API via proxy server', () => { + it('check if extract method works with proxy server', async () => { + const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html' + const result = await extract(url, {}, { + agent: new HttpsProxyAgent(PROXY_SERVER), + }) + assert.ok(result.title.includes('Federal Reserve')) + assert.equal(result.source, 'cnbc.com') + }, 10000) + }) +} diff --git a/src/rules.js b/src/rules.js index cd889da7..a5a9b930 100644 --- a/src/rules.js +++ b/src/rules.js @@ -13,48 +13,48 @@ export const rules = [ unwanted: [ '.morenews', '.zone--media', - '.zone--timeline' - ] + 
'.zone--timeline', + ], }, { patterns: ['*://zingnews.vn/*'], unwanted: [ '.the-article-category', '.the-article-meta', - '.the-article-tags' - ] + '.the-article-tags', + ], }, { patterns: ['*://{*.}?vnexpress.net/*'], unwanted: [ - '.header-content' - ] + '.header-content', + ], }, { patterns: ['*://{*.}?vietnamnet.vn/*', '*://{*.}?vnn.vn/*'], selector: '#ArticleContent', unwanted: [ '.inner-article', - '.article-relate' - ] + '.article-relate', + ], }, { patterns: ['*://thehill.com/*'], unwanted: [ - '.rollover-people-block' - ] + '.rollover-people-block', + ], }, { patterns: ['*://{*.}?digitaltrends.com/*'], unwanted: [ '.h-editors-recs-title', - 'ul.h-editors-recs' - ] + 'ul.h-editors-recs', + ], }, { patterns: ['*://{*.}?techradar.com/*'], unwanted: [ - 'nav.breadcrumb' - ] - } + 'nav.breadcrumb', + ], + }, ] diff --git a/src/utils/absolutifyUrl.test.js b/src/utils/absolutifyUrl.test.js deleted file mode 100644 index b851b029..00000000 --- a/src/utils/absolutifyUrl.test.js +++ /dev/null @@ -1,47 +0,0 @@ -// absolutifyUrl.test -/* eslint-env jest */ - -import absolutifyUrl from './absolutifyUrl.js' - -describe('test absolutifyUrl()', () => { - const entries = [ - { - full: '', - expected: '' - }, - { - relative: {}, - expected: '' - }, - { - full: 'https://some.where/article/abc-xyz', - relative: 'category/page.html', - expected: 'https://some.where/article/category/page.html' - }, - { - full: 'https://some.where/article/abc-xyz', - relative: '../category/page.html', - expected: 'https://some.where/category/page.html' - }, - { - full: 'https://some.where/blog/authors/article/abc-xyz', - relative: '/category/page.html', - expected: 'https://some.where/category/page.html' - }, - { - full: 'https://some.where/article/abc-xyz', - expected: 'https://some.where/article/abc-xyz' - } - ] - entries.forEach((entry) => { - const { - full, - relative, - expected - } = entry - test(`absolutifyUrl("${full}", "${relative}") must become "${expected}"`, () => { - const result = 
absolutifyUrl(full, relative) - expect(result).toEqual(expected) - }) - }) -}) diff --git a/src/utils/chooseBestUrl.js b/src/utils/chooseBestUrl.js index 8833d69d..c27ff97f 100644 --- a/src/utils/chooseBestUrl.js +++ b/src/utils/chooseBestUrl.js @@ -23,6 +23,6 @@ export default (candidates = [], title = '') => { return better ? { similarity, value: curr } : prev }, { similarity: comparer.similarity(shortestUrl, titleHashed), - value: shortestUrl + value: shortestUrl, }).value } diff --git a/src/utils/chooseBestUrl.test.js b/src/utils/chooseBestUrl.test.js deleted file mode 100644 index 3a6885ba..00000000 --- a/src/utils/chooseBestUrl.test.js +++ /dev/null @@ -1,17 +0,0 @@ -// chooseBestUrl.test -/* eslint-env jest */ - -import chooseBestUrl from './chooseBestUrl.js' - -test('test chooseBestUrl an actual case', () => { - const title = 'Google đã ra giá mua Fitbit' - const urls = [ - 'https://alpha.xyz/tin-tuc-kinh-doanh/-/view_content/content/2965950/google-da-ra-gia-mua-fitbit', - 'https://alpha.xyz/tin-tuc-kinh-doanh/view/2965950/907893219797', - 'https://alpha.xyz/tin-tuc-kinh-doanh/google-da-ra-gia-mua-fitbit', - 'https://a.xyz/read/google-da-ra-gia-mua-fitbit', - 'https://a.xyz/read/2965950/907893219797' - ] - const result = chooseBestUrl(urls, title) - expect(result).toBe(urls[3]) -}) diff --git a/src/utils/cleanAndMinifyHtml.test.js b/src/utils/cleanAndMinifyHtml.test.js deleted file mode 100644 index 3c237934..00000000 --- a/src/utils/cleanAndMinifyHtml.test.js +++ /dev/null @@ -1,37 +0,0 @@ -// cleanAndMinifyHtml.test -/* eslint-env jest */ - -import { readFileSync } from 'fs' - -import { isString } from 'bellajs' - -import cleanAndMinifyHtml from './cleanAndMinifyHtml.js' - -describe('test cleanAndMinifyHtml()', () => { - test('test stripping attributes from elements', () => { - const html = readFileSync('./test-data/regular-article.html', 'utf8') - const result = cleanAndMinifyHtml(html) - expect(isString(result)).toBe(true) - expect(result).toEqual( - 
expect.not.stringContaining('

') - ) - expect(result).toEqual( - expect.stringContaining('

Those cheetahs are nothing more than dogs') - ) - }) - test('test minifying html elements', () => { - const html = readFileSync('./test-data/regular-article.html', 'utf8') - expect(html).toEqual( - expect.not.stringContaining( - '

The first fair dog is, in its own way, a lemon.

' - ) - ) - const result = cleanAndMinifyHtml(html) - expect(isString(result)).toBe(true) - expect(result).toEqual( - expect.stringContaining( - '

The first fair dog is, in its own way, a lemon.

' - ) - ) - }) -}) diff --git a/src/utils/extractJsonLd.js b/src/utils/extractJsonLd.js index 4f31dd2c..d1963838 100644 --- a/src/utils/extractJsonLd.js +++ b/src/utils/extractJsonLd.js @@ -9,13 +9,13 @@ import getHostname from './getHostname.js' */ export default (html, baseUrl) => { const articleAttrs = [ - 'NewsArticle' + 'NewsArticle', ] const organizationAttrs = [ 'Organization', 'NewsMediaOrganization', - 'WebSite' + 'WebSite', ] const buildAuthor = (context) => { @@ -28,16 +28,16 @@ export default (html, baseUrl) => { : context['@graph']).map(({ name, image, url }) => ({ name: name || '', image: image?.url ?? '', - url: url || '' + url: url || '', })) } - const buildPublisher = ({ name, url, logo, sameAs, ...context }) => { + const buildPublisher = ({ name, url, logo, sameAs }) => { return { name: name ?? '', url: url ?? getHostname(baseUrl), logo: logo?.url ?? '', - sameAs: sameAs ?? [] + sameAs: sameAs ?? [], } } @@ -55,11 +55,10 @@ export default (html, baseUrl) => { if (!jsonData.length) { return { author: [], - publisher: null + publisher: null, } } - // eslint-disable-next-line const jsonObj = jsonData.reduce((o, i) => (o[i['@type']] = i, o), {}) const articleAttr = articleAttrs.filter(a => !!jsonObj[a])[0] const article = jsonObj[articleAttr] ?? 
{} @@ -69,6 +68,6 @@ export default (html, baseUrl) => { return { author: buildAuthor(article.author), - publisher: buildPublisher({ ...organization, ...article.publisher }) + publisher: buildPublisher({ ...organization, ...article.publisher }), } } diff --git a/src/utils/extractJsonLd.test.js b/src/utils/extractJsonLd.test.js deleted file mode 100644 index 9e27c357..00000000 --- a/src/utils/extractJsonLd.test.js +++ /dev/null @@ -1,11 +0,0 @@ -// extractJsonLd.test -/* eslint-env jest */ - -import { readFileSync } from 'fs' -import extractJsonLd from './extractJsonLd.js' - -test('test extractJsonLd an actual case', () => { - const html = readFileSync('./test-data/html-article-with-json-ld.html', 'utf8') - const result = extractJsonLd(html, 'example.com') - expect(result.author.length === 2).toBe(true) -}) diff --git a/src/utils/extractLdSchema.js b/src/utils/extractLdSchema.js new file mode 100644 index 00000000..0c082045 --- /dev/null +++ b/src/utils/extractLdSchema.js @@ -0,0 +1,85 @@ +// utils -> extractLdSchema.js + +import { isArray, isObject, isString } from '@ndaidong/bellajs' + +const typeSchemas = [ + 'aboutpage', + 'checkoutpage', + 'collectionpage', + 'contactpage', + 'faqpage', + 'itempage', + 'medicalwebpage', + 'profilepage', + 'qapage', + 'realestatelisting', + 'searchresultspage', + 'webpage', + 'website', + 'article', + 'advertisercontentarticle', + 'newsarticle', + 'analysisnewsarticle', + 'askpublicnewsarticle', + 'backgroundnewsarticle', + 'opinionnewsarticle', + 'reportagenewsarticle', + 'reviewnewsarticle', + 'report', + 'satiricalarticle', + 'scholarlyarticle', + 'medicalscholarlyarticle', +] + +const attributeLists = { + description: 'description', + image: 'image', + author: 'author', + published: 'datePublished', + type: '@type', +} + +const parseJson = (text) => { + try { + return JSON.parse(text) + } catch { + return {} + } +} + +const isAllowedLdJsonType = (ldJson) => { + const rootLdJsonType = ldJson['@type'] || '' + const arr = 
isArray(rootLdJsonType) ? rootLdJsonType : [rootLdJsonType] + const ldJsonTypes = arr.filter(x => !!x) + return ldJsonTypes.length > 0 && ldJsonTypes.some(x => typeSchemas.includes(x.toLowerCase())) +} + +/** + * Parses JSON-LD data from a document and populates an entry object. + * Only populates if the original entry object is empty or undefined. + * + * @param {Document} document - The HTML Document + * @param {Object} entry - The entry object to merge/populate with JSON-LD. + * @returns {Object} The entry object after being merged/populated with data. + */ +export default (document, entry) => { + const ldSchemas = document.querySelectorAll('script[type="application/ld+json"]') + ldSchemas.forEach(ldSchema => { + const ldJson = parseJson(ldSchema.textContent.replace(/[\n\r\t]/g, '')) + if (ldJson && isAllowedLdJsonType(ldJson)) { + Object.entries(attributeLists).forEach(([key, attr]) => { + // skip keys the entry already has, or that the JSON-LD lacks + if (entry[key] || !ldJson[attr]) { + return + } + + const keyValue = ldJson[attr] + const val = isArray(keyValue) ? keyValue[0] : isObject(keyValue) ? keyValue?.name || '' : keyValue + if (isString(val) && val !== '') { + entry[key] = val.trim() + } + }) + } + }) + + return entry +} diff --git a/src/utils/extractMetaData.js b/src/utils/extractMetaData.js index 4f123cc7..315e7fe7 100644 --- a/src/utils/extractMetaData.js +++ b/src/utils/extractMetaData.js @@ -1,9 +1,36 @@ // utils -> extractMetaData + import { DOMParser } from 'linkedom' +import extractLdSchema from './extractLdSchema.js' +import findDate from './findDate.js' + +/** + * @param {Element} node + * @param {Object} attributeLists + * @returns {?{key: string, content: string}} + */ +function getMetaContentByNameOrProperty (node, attributeLists) { + const content = node.getAttribute('content') + if (!content) return null + + const property = node + .getAttribute('property')?.toLowerCase() ??
+ node.getAttribute('itemprop')?.toLowerCase() + + const name = node.getAttribute('name')?.toLowerCase() + + for (const [key, attrs] of Object.entries(attributeLists)) { + if (attrs.includes(property) || attrs.includes(name)) { + return { key, content } + } + } + + return null +} /** * @param html {string} - * @returns {{image: string, author: string, amphtml: string, description: string, canonical: string, source: string, published: string, title: string, url: string, shortlink: string}} + * @returns {{image: string, author: string, amphtml: string, description: string, canonical: string, source: string, published: string, title: string, url: string, shortlink: string, favicon: string, type: string}} */ export default (html) => { const entry = { @@ -16,33 +43,42 @@ export default (html) => { image: '', author: '', source: '', - published: '' + published: '', + favicon: '', + type: '', } const sourceAttrs = [ 'application-name', 'og:site_name', 'twitter:site', - 'dc.title' + 'dc.title', ] const urlAttrs = [ 'og:url', - 'twitter:url' + 'twitter:url', + 'parsely-link', ] const titleAttrs = [ 'title', 'og:title', - 'twitter:title' + 'twitter:title', + 'parsely-title', ] const descriptionAttrs = [ 'description', 'og:description', - 'twitter:description' + 'twitter:description', + 'parsely-description', ] const imageAttrs = [ + 'image', 'og:image', + 'og:image:url', + 'og:image:secure_url', 'twitter:image', - 'twitter:image:src' + 'twitter:image:src', + 'parsely-image-url', ] const authorAttrs = [ 'author', @@ -50,51 +86,69 @@ export default (html) => { 'og:creator', 'article:author', 'twitter:creator', - 'dc.creator' + 'dc.creator', + 'parsely-author', ] const publishedTimeAttrs = [ 'article:published_time', 'article:modified_time', 'og:updated_time', - 'datepublished' + 'dc.date', + 'dc.date.issued', + 'dc.date.created', + 'dc:created', + 'dcterms.date', + 'datepublished', + 'datemodified', + 'updated_time', + 'modified_time', + 'published_time', + 'release_date', + 
'date', + 'parsely-pub-date', + ] + const typeAttrs = [ + 'og:type', ] - const document = new DOMParser().parseFromString(html, 'text/html') - entry.title = document.querySelector('head > title')?.innerText + const attributeLists = { + source: sourceAttrs, + url: urlAttrs, + title: titleAttrs, + description: descriptionAttrs, + image: imageAttrs, + author: authorAttrs, + published: publishedTimeAttrs, + type: typeAttrs, + } - Array.from(document.getElementsByTagName('link')).forEach(node => { + const doc = new DOMParser().parseFromString(html, 'text/html') + entry.title = doc.querySelector('head > title')?.innerText + + Array.from(doc.getElementsByTagName('link')).forEach(node => { const rel = node.getAttribute('rel') const href = node.getAttribute('href') - if (rel && href) entry[rel] = href + if (rel && href) { + entry[rel] = href + if (rel === 'icon' || rel === 'shortcut icon') { + entry.favicon = href + } + } }) - Array.from(document.getElementsByTagName('meta')).forEach(node => { - const content = node.getAttribute('content') - const property = node.getAttribute('property')?.toLowerCase() ?? 
node.getAttribute('itemprop')?.toLowerCase() - const name = node.getAttribute('name')?.toLowerCase() - - if (sourceAttrs.includes(property) || sourceAttrs.includes(name)) { - entry.source = content - } - if (urlAttrs.includes(property) || urlAttrs.includes(name)) { - entry.url = content - } - if (titleAttrs.includes(property) || titleAttrs.includes(name)) { - entry.title = content - } - if (descriptionAttrs.includes(property) || descriptionAttrs.includes(name)) { - entry.description = content - } - if (imageAttrs.includes(property) || imageAttrs.includes(name)) { - entry.image = content - } - if (authorAttrs.includes(property) || authorAttrs.includes(name)) { - entry.author = content - } - if (publishedTimeAttrs.includes(property) || publishedTimeAttrs.includes(name)) { - entry.published = content + Array.from(doc.getElementsByTagName('meta')).forEach(node => { + const result = getMetaContentByNameOrProperty(node, attributeLists) + const val = result?.content || '' + if (val !== '') { + entry[result.key] = val } }) - return entry + const metadata = extractLdSchema(doc, entry) + + if (!metadata.published) { + metadata.published = findDate(doc) || '' + } + + return metadata } diff --git a/src/utils/extractMetaData.test.js b/src/utils/extractMetaData.test.js index 9944adc9..1e2ec63f 100644 --- a/src/utils/extractMetaData.test.js +++ b/src/utils/extractMetaData.test.js @@ -1,19 +1,57 @@ // extractMetaData.test -/* eslint-env jest */ +import { describe, it } from 'node:test' +import assert from 'node:assert' -import { readFileSync } from 'fs' +import { readFileSync } from 'node:fs' -import { isObject, hasProperty } from 'bellajs' +import { isObject, hasProperty } from '@ndaidong/bellajs' import extractMetaData from './extractMetaData.js' -const keys = 'url shortlink amphtml canonical title description image author source published'.split(' ') +const keys = 'url shortlink amphtml canonical title description image author source published favicon type'.split(' ') 
-test('test extractMetaData(good content)', async () => { - const html = readFileSync('./test-data/regular-article.html', 'utf8') - const result = extractMetaData(html) - expect(isObject(result)).toBe(true) - keys.forEach((k) => { - expect(hasProperty(result, k)).toBe(true) +function isDateString (date) { + if (typeof date !== 'string') return false + const d = new Date(date) + return !isNaN(d.getTime()) +} + +describe('test extractMetaData', () => { + it('test extractMetaData(good content)', async () => { + const html = readFileSync('./test-data/regular-article.html', 'utf8') + const result = extractMetaData(html) + assert.ok(isObject(result)) + keys.forEach((k) => { + assert.ok(hasProperty(result, k)) + }) + }) + + it('test extractMetaData(json ld schema content)', async () => { + const html = readFileSync('./test-data/regular-article-json-ld.html', 'utf8') + const result = extractMetaData(html) + assert.ok(isObject(result)) + keys.forEach((k) => { + assert.ok(hasProperty(result, k)) + }) + }) + + it('test extractMetaData(find date)', async () => { + const html1 = readFileSync('./test-data/regular-article-date-time.html', 'utf8') + const html2 = readFileSync('./test-data/regular-article-date-itemprop.html', 'utf8') + const html3 = readFileSync('./test-data/regular-article-date-span.html', 'utf8') + const result1 = extractMetaData(html1) + const result2 = extractMetaData(html2) + const result3 = extractMetaData(html3) + assert.ok(isObject(result1)) + assert.ok(isObject(result2)) + assert.ok(isObject(result3)) + keys.forEach((k) => { + assert.ok(hasProperty(result1, k)) + assert.ok(hasProperty(result2, k)) + assert.ok(hasProperty(result3, k)) + }) + assert.ok(isDateString(result1.published)) + assert.ok(isDateString(result2.published)) + assert.ok(isDateString(result3.published)) }) }) diff --git a/src/utils/extractWithReadability.js b/src/utils/extractWithReadability.js index 79f68483..0e6582e6 100644 --- a/src/utils/extractWithReadability.js +++
b/src/utils/extractWithReadability.js @@ -2,28 +2,28 @@ import { Readability } from '@mozilla/readability' import { DOMParser } from 'linkedom' -import isHTMLString from './isHTMLString.js' +import { isString } from '@ndaidong/bellajs' -/** - * @param html {string} - * @param inputUrl {string} - * @returns {string|null} - */ -export default (html, inputUrl = '') => { - if (!isHTMLString(html)) return null +export default (html, url = '') => { + if (!isString(html)) { + return null + } const doc = new DOMParser().parseFromString(html, 'text/html') const base = doc.createElement('base') - base.setAttribute('href', inputUrl) + base.setAttribute('href', url) doc.head.appendChild(base) - const reader = new Readability(doc) - const result = reader.parse() || {} + const reader = new Readability(doc, { + keepClasses: true, + }) + const result = reader.parse() ?? {} return result.textContent ? result.content : null } export function extractTitleWithReadability (html) { - if (!isHTMLString(html)) return null + if (!isString(html)) { + return null + } const doc = new DOMParser().parseFromString(html, 'text/html') const reader = new Readability(doc) - // noinspection JSUnresolvedFunction - return reader._getArticleTitle() + return reader._getArticleTitle() || null } diff --git a/src/utils/extractWithReadability.test.js b/src/utils/extractWithReadability.test.js index db70322a..bdcc64cb 100644 --- a/src/utils/extractWithReadability.test.js +++ b/src/utils/extractWithReadability.test.js @@ -1,28 +1,43 @@ // extractWithReadability.test -/* eslint-env jest */ -import { readFileSync } from 'fs' +import { describe, it } from 'node:test' +import assert from 'node:assert' -import { isString } from 'bellajs' +import { readFileSync } from 'node:fs' + +import { isString } from '@ndaidong/bellajs' import extractWithReadability, { extractTitleWithReadability } from './extractWithReadability.js' -test('test extractWithReadability from good html content', async () => { - const html = 
readFileSync('./test-data/regular-article.html', 'utf8') - const result = extractWithReadability(html, 'https://foo.bar') - expect(isString(result)).toBe(true) - expect(result.length > 200).toBe(true) - expect(result).toEqual(expect.stringContaining('')) -}) +describe('test extractWithReadability()', () => { + it('extract from good html content', async () => { + const html = readFileSync('./test-data/regular-article.html', 'utf8') + const result = extractWithReadability(html, 'https://foo.bar') + assert.ok(isString(result)) + assert.ok(result.length > 200) + assert.ok(result.includes('')) + }) -test('test extractWithReadability from bad html content', async () => { - expect(extractWithReadability(null)).toBe(null) - expect(extractWithReadability({})).toBe(null) - expect(extractWithReadability('
')).toBe(null) -}) + it('extract from bad html content', async () => { + assert.equal(extractWithReadability(null), null) + assert.equal(extractWithReadability({}), null) + assert.equal(extractWithReadability('
'), null) + }) + + it('extract title only', async () => { + const html = readFileSync('./test-data/regular-article.html', 'utf8') + const result = extractTitleWithReadability(html) + assert.equal(result, 'Article title here - ArticleParser') + }) + + it('extract title from page without title', async () => { + const html = readFileSync('./test-data/html-no-title.html', 'utf8') + const result = extractTitleWithReadability(html) + assert.equal(result, null) + }) -test('test extractTitleWithReadability', async () => { - const html = readFileSync('./test-data/regular-article.html', 'utf8') - const result = extractTitleWithReadability(html) - expect(result).toBe('Article title here - ArticleParser') + it('extract title from non-string', async () => { + const result = extractTitleWithReadability({}) + assert.equal(result, null) + }) }) diff --git a/src/utils/extractWithSelector.test.js b/src/utils/extractWithSelector.test.js deleted file mode 100644 index da46051b..00000000 --- a/src/utils/extractWithSelector.test.js +++ /dev/null @@ -1,20 +0,0 @@ -// extractWithSelector.test -/* eslint-env jest */ - -import { readFileSync } from 'fs' - -import { isString } from 'bellajs' - -import extractWithSelector from './extractWithSelector.js' - -test('test extractWithSelector a bad input', () => { - const result = extractWithSelector(null) - expect(result).toBe(null) -}) - -test('test extractWithSelector from good html content', async () => { - const html = readFileSync('./test-data/regular-article.html', 'utf8') - const result = extractWithSelector(html, 'article', ['.ads-section']) - expect(isString(result)).toBe(true) - expect(result.length > 200).toBe(true) -}) diff --git a/src/utils/findDate.js b/src/utils/findDate.js new file mode 100644 index 00000000..3a666e02 --- /dev/null +++ b/src/utils/findDate.js @@ -0,0 +1,57 @@ + +/** + * Convert date format to YYYY-MM-DD + * + * @param {string} dateString + * @returns {string} YYYY-MM-DD + */ +function convertDateFormat (dateString) 
{ + const parts = dateString.split('/') + if (parts.length !== 3) return dateString + + let year, month, day + + if (parseInt(parts[0]) > 12) { + [day, month, year] = parts + } else { + [month, day, year] = parts + } + + year = year.length === 2 ? '20' + year : year + return `${year}-${month.padStart(2, '0')}-${day.padStart(2, '0')}T00:00:00` +} + +/** + * Look for the publication date in the body of the content. + * + * @param {Document} document - The HTML Document + * @returns {string} The date string + */ +export default function (doc) { + const datePatterns = [ + /\d{4}-\d{2}-\d{2}/, + /\d{1,2}\/\d{1,2}\/\d{2,4}/, + ] + + const findDate = (element) => { + for (const pattern of datePatterns) { + const match = element.textContent.match(pattern) + if (match) return convertDateFormat(match[0]) + } + return '' + } + + const priorityElements = doc.querySelectorAll('time, [datetime], [itemprop~=datePublished], [itemprop~=dateCreated]') + for (const el of priorityElements) { + const date = el.getAttribute('datetime') || el.getAttribute('content') || findDate(el) + if (date) return date + } + + const secondaryElements = doc.querySelectorAll('p, span, div') + for (const el of secondaryElements) { + const date = findDate(el) + if (date) return date + } + + return '' +} diff --git a/src/utils/findRulesByUrl.test.js b/src/utils/findRulesByUrl.test.js deleted file mode 100644 index 4c5ccb66..00000000 --- a/src/utils/findRulesByUrl.test.js +++ /dev/null @@ -1,45 +0,0 @@ -// findRulesByUrl.test -/* eslint-env jest */ - -import { isFunction } from 'bellajs' - -import findRulesByUrl from './findRulesByUrl.js' - -describe('test findRulesByUrl()', () => { - const entries = [ - { - urls: [{}, ''], - expectation: {} - }, - { - urls: [1209, 'https://vietnamnet.vn/path/to/article'], - expectation: (result, expect) => { - expect(result).toBeTruthy() - expect(result).toEqual(expect.objectContaining({ selector: '#ArticleContent' })) - expect(result.selector).toEqual('#ArticleContent') - 
} - }, - { - urls: ['https://vnn.vn/path/to/article'], - expectation: (result, expect) => { - expect(result).toBeTruthy() - expect(result).toEqual(expect.objectContaining({ selector: '#ArticleContent' })) - expect(result.selector).toEqual('#ArticleContent') - } - } - ] - entries.forEach((entry) => { - const { - urls, - expectation - } = entry - test('check if findRulesByUrl() works correctly', () => { - const result = findRulesByUrl(urls) - if (isFunction(expectation)) { - expectation(result, expect) - } else { - expect(result).toEqual(expectation) - } - }) - }) -}) diff --git a/src/utils/getHostname.test.js b/src/utils/getHostname.test.js deleted file mode 100644 index fb58d47a..00000000 --- a/src/utils/getHostname.test.js +++ /dev/null @@ -1,39 +0,0 @@ -// getHostname.test -/* eslint-env jest */ - -import getHostname from './getHostname.js' - -describe('test getHostname()', () => { - const entries = [ - { - url: '', - expected: '' - }, - { - url: {}, - expected: '' - }, - { - url: 'https://www.some.where/article/abc-xyz', - expected: 'some.where' - }, - { - url: 'https://www.alpha.some.where/blog/authors/article/abc-xyz', - expected: 'alpha.some.where' - }, - { - url: 'https://10.1.1.5:1888/article/abc-xyz', - expected: '10.1.1.5' - } - ] - entries.forEach((entry) => { - const { - url, - expected - } = entry - test(`absolutifyUrl("${url}") must become "${expected}"`, () => { - const result = getHostname(url) - expect(result).toEqual(expected) - }) - }) -}) diff --git a/src/utils/getTimeToRead.js b/src/utils/getTimeToRead.js index ccd448cd..7d8cef37 100644 --- a/src/utils/getTimeToRead.js +++ b/src/utils/getTimeToRead.js @@ -1,12 +1,7 @@ // utils -> getTimeToRead -import { - getParserOptions -} from '../config.js' - -export default (text) => { +export default (text, wordsPerMinute) => { const words = text.trim().split(/\s+/g).length - const { wordsPerMinute } = getParserOptions() const minToRead = words / wordsPerMinute const secToRead = Math.ceil(minToRead * 60) 
return secToRead diff --git a/src/utils/html.js b/src/utils/html.js new file mode 100644 index 00000000..25b23ba4 --- /dev/null +++ b/src/utils/html.js @@ -0,0 +1,50 @@ +// utils -> html + +import { DOMParser } from 'linkedom' +import sanitize from 'sanitize-html' +import { pipe } from '@ndaidong/bellajs' + +import { getSanitizeHtmlOptions } from '../config.js' + +export const purify = (html) => { + return sanitize(html, { + allowedTags: false, + allowedAttributes: false, + allowVulnerableTags: true, + }) +} + +const WS_REGEXP = /^[\s\f\n\r\t\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000\ufeff\x09\x0a\x0b\x0c\x0d\x20\xa0]+$/ // eslint-disable-line + +const stripMultiLinebreaks = (str) => { + return str.replace(/(\r\n|\n|\u2424){2,}/g, '\n').split('\n').map((line) => { + return WS_REGEXP.test(line) ? line.trim() : line + }).filter((line) => { + return line.length > 0 + }).join('\n') +} + +const stripMultispaces = (str) => { + return str.replace(WS_REGEXP, ' ').trim() +} + +export const getCharset = (html) => { + const doc = new DOMParser().parseFromString(html, 'text/html') + const m = doc.querySelector('meta[charset]') || null + let charset = m ? m.getAttribute('charset') : '' + if (!charset) { + const h = doc.querySelector('meta[http-equiv="content-type"]') || null + charset = h ? 
h.getAttribute('content')?.split(';')[1]?.replace('charset=', '')?.trim() : '' + } + return charset?.toLowerCase() || 'utf8' +} + +export const cleanify = (inputHtml) => { + const doc = new DOMParser().parseFromString(inputHtml, 'text/html') + const html = doc.documentElement.innerHTML + return pipe( + input => sanitize(input, getSanitizeHtmlOptions()), + input => stripMultiLinebreaks(input), + input => stripMultispaces(input) + )(html) +} diff --git a/src/utils/html.test.js b/src/utils/html.test.js new file mode 100644 index 00000000..00fc2263 --- /dev/null +++ b/src/utils/html.test.js @@ -0,0 +1,23 @@ +// html.test +import { describe, it } from 'node:test' +import assert from 'node:assert' + +import { readFileSync } from 'node:fs' + +import { isString } from '@ndaidong/bellajs' + +import { + cleanify +} from './html.js' + +describe('test cleanify() method', () => { + it('check if unwanted elements/attributes removed', () => { + const html = readFileSync('./test-data/regular-article.html', 'utf8') + assert.ok(html.includes('
<address>4746 Kelly Drive, West Virginia</address>
')) + assert.ok(html.includes('')) + const result = cleanify(html) + assert.ok(isString(result)) + assert.equal(result.includes('
<address>4746 Kelly Drive, West Virginia</address>
'), false) + assert.equal(result.includes(''), false) + }) +}) diff --git a/src/utils/isHTMLString.test.js b/src/utils/isHTMLString.test.js deleted file mode 100644 index 74a74838..00000000 --- a/src/utils/isHTMLString.test.js +++ /dev/null @@ -1,42 +0,0 @@ -// isHTMLString.test -/* eslint-env jest */ - -import { - readFileSync -} from 'fs' - -import isHTMLString from './isHTMLString.js' - -test('test isHTMLString(bad input)', () => { - const result = isHTMLString({}) - expect(result).toBe(false) -}) - -test('test isHTMLString(regular string)', () => { - const result = isHTMLString('This is just a string, not HTML') - expect(result).toBe(false) -}) - -test('test isHTMLString(bad-format HTML)', () => { - const result = isHTMLString('
<div>Hello world') - expect(result).toBe(false) -}) - -test('test isHTMLString(well-format HTML)', () => { - const result = isHTMLString('
<div>Hello world</div>
') - expect(result).toBe(true) -}) - -test('test isHTMLString(example HTML page)', () => { - const files = [ - 'regular-article.html', - 'html-no-title.html', - 'html-article-no-source.html', - 'html-too-short-article.html' - ] - files.forEach((file) => { - const html = readFileSync(`./test-data/${file}`, 'utf8') - const result = isHTMLString(html) - expect(result).toBe(true) - }) -}) diff --git a/src/utils/isValidUrl.js b/src/utils/isValidUrl.js index 5e22a45e..d1e642c2 100644 --- a/src/utils/isValidUrl.js +++ b/src/utils/isValidUrl.js @@ -4,7 +4,7 @@ export default (url = '') => { try { const ourl = new URL(url) return ourl !== null && ourl.protocol.startsWith('http') - } catch (err) { + } catch { return false } } diff --git a/src/utils/isValidUrl.test.js b/src/utils/isValidUrl.test.js deleted file mode 100644 index 67e30b5e..00000000 --- a/src/utils/isValidUrl.test.js +++ /dev/null @@ -1,47 +0,0 @@ -// isValidUrl.test -/* eslint-env jest */ - -import isValidUrl from './isValidUrl.js' - -describe('test isValidUrl()', () => { - const cases = [ - { - url: 'https://www.23hq.com', - expected: true - }, - { - url: 'https://secure.actblue.com', - expected: true - }, - { - url: 'https://docs.microsoft.com/en-us/azure/iot-edge/quickstart?view=iotedge-2018-06', - expected: true - }, - { - url: 'http://192.168.1.199:8081/example/page', - expected: true - }, - { - url: 'ftp://192.168.1.199:8081/example/page', - expected: false - }, - { - url: '', - expected: false - }, - { - url: null, - expected: false - }, - { - url: { a: 'x' }, - expected: false - } - ] - cases.forEach(({ url, expected }) => { - test(`isValidUrl("${url}") must return "${expected}"`, () => { - const result = isValidUrl(url) - expect(result).toEqual(expected) - }) - }) -}) diff --git a/src/utils/linker.js b/src/utils/linker.js new file mode 100644 index 00000000..3c1a70f0 --- /dev/null +++ b/src/utils/linker.js @@ -0,0 +1,133 @@ +// utils -> linker + +import { DOMParser } from 'linkedom' + +import { 
findBestMatch } from './similarity.js' + +export const isValid = (url = '') => { + try { + const ourl = new URL(url) + return ourl !== null && ourl.protocol.startsWith('http') + } catch { + return false + } +} + +export const chooseBestUrl = (candidates = [], title = '') => { + const ranking = findBestMatch(title, candidates) + return ranking.bestMatch.target +} + +export const absolutify = (fullUrl = '', relativeUrl = '') => { + try { + const result = new URL(relativeUrl, fullUrl) + return result.toString() + } catch { + return '' + } +} + +const blacklistKeys = [ + 'CNDID', + '__twitter_impression', + '_hsenc', + '_openstat', + 'action_object_map', + 'action_ref_map', + 'action_type_map', + 'amp', + 'fb_action_ids', + 'fb_action_types', + 'fb_ref', + 'fb_source', + 'fbclid', + 'ga_campaign', + 'ga_content', + 'ga_medium', + 'ga_place', + 'ga_source', + 'ga_term', + 'gs_l', + 'hmb_campaign', + 'hmb_medium', + 'hmb_source', + 'mbid', + 'mc_cid', + 'mc_eid', + 'mkt_tok', + 'referrer', + 'spJobID', + 'spMailingID', + 'spReportId', + 'spUserID', + 'utm_brand', + 'utm_campaign', + 'utm_cid', + 'utm_content', + 'utm_int', + 'utm_mailing', + 'utm_medium', + 'utm_name', + 'utm_place', + 'utm_pubreferrer', + 'utm_reader', + 'utm_social', + 'utm_source', + 'utm_swu', + 'utm_term', + 'utm_userid', + 'utm_viz_id', + 'wt_mc_o', + 'yclid', + 'WT.mc_id', + 'WT.mc_ev', + 'WT.srch', + 'pk_source', + 'pk_medium', + 'pk_campaign', +] + +export const purify = (url) => { + try { + const pureUrl = new URL(url) + + blacklistKeys.forEach((key) => { + pureUrl.searchParams.delete(key) + }) + + return pureUrl.toString().replace(pureUrl.hash, '') + } catch { + return null + } +} + +/** + * @param inputHtml {string} + * @param url {string} + * @returns article {string} + */ +export const normalize = (html, url) => { + const doc = new DOMParser().parseFromString(html, 'text/html') + + Array.from(doc.getElementsByTagName('a')).forEach((element) => { + const href = element.getAttribute('href') + 
if (href) { + element.setAttribute('href', absolutify(url, href)) + element.setAttribute('target', '_blank') + } + }) + + Array.from(doc.getElementsByTagName('img')).forEach((element) => { + const src = element.getAttribute('data-src') ?? element.getAttribute('src') + if (src) { + element.setAttribute('src', absolutify(url, src)) + } + }) + + return Array.from(doc.childNodes).map(element => element.outerHTML).join('') +} + +export const getDomain = (url) => { + const host = (new URL(url)).host + return host.replace('www.', '') +} diff --git a/src/utils/linker.test.js b/src/utils/linker.test.js new file mode 100644 index 00000000..6ed89d2e --- /dev/null +++ b/src/utils/linker.test.js @@ -0,0 +1,182 @@ +// linker.test +import { describe, it } from 'node:test' +import assert from 'node:assert' + +import { readFileSync } from 'node:fs' + +import { isString } from '@ndaidong/bellajs' + +import { + chooseBestUrl, + isValid as isValidUrl, + purify as purifyUrl, + normalize as normalizeUrls, + absolutify as absolutifyUrl +} from './linker.js' + +describe('test isValidUrl()', () => { + const cases = [ + { + url: 'https://www.23hq.com', + expected: true, + }, + { + url: 'https://secure.actblue.com', + expected: true, + }, + { + url: 'https://docs.microsoft.com/en-us/azure/iot-edge/quickstart?view=iotedge-2018-06', + expected: true, + }, + { + url: 'http://192.168.1.199:8081/example/page', + expected: true, + }, + { + url: 'ftp://192.168.1.199:8081/example/page', + expected: false, + }, + { + url: '', + expected: false, + }, + { + url: null, + expected: false, + }, + { + url: { a: 'x' }, + expected: false, + }, + ] + cases.forEach(({ url, expected }) => { + it(`isValidUrl("${url}") must return "${expected}"`, () => { + const result = isValidUrl(url) + assert.equal(result, expected) + }) + }) +}) + +describe('test normalizeUrls()', () => { + it('test adding absolute URLs to all links', () => { + const bestUrl = 'https://test-url.com/burritos-for-life' + const html = 
readFileSync('./test-data/regular-article.html', 'utf8') + const result = normalizeUrls(html, bestUrl) + assert.ok(isString(result)) + assert.equal(result.includes('watermelon'), false) + assert.equal(result.includes('watermelon'), true) + }) + it('test adding target=_blank to all links', () => { + const bestUrl = 'https://test-url.com/burritos-for-life' + const html = readFileSync('./test-data/regular-article.html', 'utf8') + const result = normalizeUrls(html, bestUrl) + assert.ok(isString(result)) + assert.equal(result.includes('rational peach'), false) + assert.equal(result.includes('rational peach'), true) + }) +}) + +describe('test purifyUrl()', () => { + const entries = [ + { + url: '', + expected: null, + }, + { + url: {}, + expected: null, + }, + { + url: 'https://some.where/article/abc-xyz', + expected: 'https://some.where/article/abc-xyz', + }, + { + url: 'https://some.where/article/abc-xyz#name,bob', + expected: 'https://some.where/article/abc-xyz', + }, + { + url: 'https://some.where/article/abc-xyz?utm_source=news4&utm_medium=email&utm_campaign=spring-summer', + expected: 'https://some.where/article/abc-xyz', + }, + { + url: 'https://some.where/article/abc-xyz?q=3&utm_source=news4&utm_medium=email&utm_campaign=spring-summer', + expected: 'https://some.where/article/abc-xyz?q=3', + }, + { + url: 'https://some.where/article/abc-xyz?pk_source=news4&pk_medium=email&pk_campaign=spring-summer', + expected: 'https://some.where/article/abc-xyz', + }, + { + url: 'https://some.where/article/abc-xyz?q=3&pk_source=news4&pk_medium=email&pk_campaign=spring-summer', + expected: 'https://some.where/article/abc-xyz?q=3', + }, + ] + entries.forEach((entry) => { + const { + url, + expected, + } = entry + it(`purifyUrl("${url}") must become "${expected}"`, () => { + const result = purifyUrl(url) + assert.equal(result, expected) + }) + }) +}) + +describe('test absolutifyUrl()', () => { + const entries = [ + { + full: '', + expected: '', + }, + { + relative: {}, + expected: 
'', + }, + { + full: 'https://some.where/article/abc-xyz', + relative: 'category/page.html', + expected: 'https://some.where/article/category/page.html', + }, + { + full: 'https://some.where/article/abc-xyz', + relative: '../category/page.html', + expected: 'https://some.where/category/page.html', + }, + { + full: 'https://some.where/blog/authors/article/abc-xyz', + relative: '/category/page.html', + expected: 'https://some.where/category/page.html', + }, + { + full: 'https://some.where/article/abc-xyz', + expected: 'https://some.where/article/abc-xyz', + }, + ] + entries.forEach((entry) => { + const { + full, + relative, + expected, + } = entry + it(`absolutifyUrl("${full}", "${relative}") must become "${expected}"`, () => { + const result = absolutifyUrl(full, relative) + assert.equal(result, expected) + }) + }) +}) + +describe('test chooseBestUrl()', () => { + it('test chooseBestUrl an actual case', () => { + const title = 'Google đã ra giá mua Fitbit' + const urls = [ + 'https://alpha.xyz/tin-tuc-kinh-doanh/-/view_content/content/2965950/google-da-ra-gia-mua-fitbit', + 'https://alpha.xyz/tin-tuc-kinh-doanh/view/2965950/907893219797', + 'https://alpha.xyz/tin-tuc-kinh-doanh/google-da-ra-gia-mua-fitbit', + 'https://a.xyz/read/google-da-ra-gia-mua-fitbit', + 'https://a.xyz/read/2965950/907893219797', + ] + const result = chooseBestUrl(urls, title) + assert.equal(result, urls[3]) + }) +}) diff --git a/src/utils/logger.js b/src/utils/logger.js deleted file mode 100644 index d87f5b68..00000000 --- a/src/utils/logger.js +++ /dev/null @@ -1,15 +0,0 @@ -// utils / logger - -import debug from 'debug' - -const name = 'article-parser' - -export const info = debug(`${name}:info`) -export const error = debug(`${name}:error`) -export const warning = debug(`${name}:warning`) - -export default { - info: debug(`${name}:info`), - error: debug(`${name}:error`), - warning: debug(`${name}:warning`) -} diff --git a/src/utils/normalizeUrls.test.js b/src/utils/normalizeUrls.test.js 
deleted file mode 100644 index c587b4e8..00000000 --- a/src/utils/normalizeUrls.test.js +++ /dev/null @@ -1,41 +0,0 @@ -// normalizeUrls.test -/* eslint-env jest */ - -import { readFileSync } from 'fs' - -import { isString } from 'bellajs' - -import normalizeUrls from './normalizeUrls.js' - -describe('test normalizeUrls()', () => { - test('test adding absolute URLs to all links', () => { - const bestUrl = 'https://test-url.com/burritos-for-life' - const html = readFileSync('./test-data/regular-article.html', 'utf8') - const result = normalizeUrls(html, bestUrl) - expect(isString(result)).toBe(true) - expect(result).toEqual( - expect.not.stringContaining('watermelon') - ) - expect(result).toEqual( - expect.stringContaining( - 'watermelon' - ) - ) - }) - test('test adding target=_blank to all links', () => { - const bestUrl = 'https://test-url.com/burritos-for-life' - const html = readFileSync('./test-data/regular-article.html', 'utf8') - const result = normalizeUrls(html, bestUrl) - expect(isString(result)).toBe(true) - expect(result).toEqual( - expect.not.stringContaining( - 'rational peach' - ) - ) - expect(result).toEqual( - expect.stringContaining( - 'rational peach' - ) - ) - }) -}) diff --git a/src/utils/parseFromHtml.js b/src/utils/parseFromHtml.js index 9f153b82..406d0777 100644 --- a/src/utils/parseFromHtml.js +++ b/src/utils/parseFromHtml.js @@ -1,48 +1,38 @@ // utils -> parseFromHtml -import { stripTags, truncate, unique } from 'bellajs' +import { stripTags, truncate, unique, pipe } from '@ndaidong/bellajs' -import sanitize from 'sanitize-html' +import { purify, cleanify } from './html.js' -import isValidUrl from './isValidUrl.js' -import purifyUrl from './purifyUrl.js' -import absolutifyUrl from './absolutifyUrl.js' -import chooseBestUrl from './chooseBestUrl.js' -import getHostname from './getHostname.js' +import { + isValid as isValidUrl, + purify as purifyUrl, + absolutify as absolutifyUrl, + normalize as normalizeUrls, + chooseBestUrl, + getDomain +} 
from './linker.js' -import findRulesByUrl from './findRulesByUrl.js' -import cleanAndMinifyHtml from './cleanAndMinifyHtml.js' import extractMetaData from './extractMetaData.js' -import extractJsonLd from './extractJsonLd.js' + import extractWithReadability, { extractTitleWithReadability } from './extractWithReadability.js' -import extractWithSelector from './extractWithSelector.js' -import getTimeToRead from './getTimeToRead.js' -import normalizeUrls from './normalizeUrls.js' -import stripUnwantedTags from './stripUnwantedTags.js' -import transformHtml from './transformHtml.js' -import logger from './logger.js' +import { execPreParser, execPostParser } from './transformation.js' -import { getParserOptions } from '../config.js' +import getTimeToRead from './getTimeToRead.js' -const cleanify = html => { - return sanitize(html, { - allowedTags: false, - allowedAttributes: false - }) +const summarize = (desc, txt, threshold, maxlen) => { // eslint-disable-line + return desc.length > threshold + ? desc + : truncate(txt, maxlen).replace(/\n/g, ' ') } -const summarize = (desc, txt, threshold, maxlen) => { - return desc.length < threshold - ? 
truncate(txt, maxlen).replace(/\n/g, ' ') - : desc -} +export default async (inputHtml, inputUrl = '', parserOptions = {}) => { + const pureHtml = purify(inputHtml) + const meta = extractMetaData(pureHtml) -export default async (inputHtml, inputUrl = '') => { - const html = cleanify(inputHtml) - const meta = extractMetaData(html) let title = meta.title const { @@ -53,23 +43,23 @@ export default async (inputHtml, inputUrl = '') => { description: metaDesc, image: metaImg, author, - source, - published + published, + favicon: metaFav, + type, } = meta const { - descriptionLengthThreshold, - descriptionTruncateLen, - contentLengthThreshold - } = getParserOptions() + wordsPerMinute = 300, + descriptionTruncateLen = 210, + descriptionLengthThreshold = 180, + contentLengthThreshold = 200, + } = parserOptions // gather title if (!title) { - logger.info('Could not detect article title from meta!') - title = extractTitleWithReadability(html, inputUrl) + title = extractTitleWithReadability(pureHtml, inputUrl) } if (!title) { - logger.info('Could not detect article title!') return null } @@ -81,41 +71,38 @@ export default async (inputHtml, inputUrl = '') => { ) if (!links.length) { - logger.info('Could not detect article link!') return null } - // choose the best url + // choose the best url, which one looks like title the most const bestUrl = chooseBestUrl(links, title) - // get defined selector - const { - selector = null, - unwanted = [], - transform = null - } = findRulesByUrl(links) - - // find article content - const mainContentSelected = extractWithSelector(html, selector) - - const mainContent = stripUnwantedTags(mainContentSelected ?? 
html, unwanted) - - const mainContentAbsoluteUrls = normalizeUrls(mainContent, bestUrl) - - const transformedContent = transformHtml(mainContentAbsoluteUrls, transform) + const fns = pipe( + (input) => { + return normalizeUrls(input, bestUrl) + }, + (input) => { + return execPreParser(input, links) + }, + (input) => { + return extractWithReadability(input, bestUrl) + }, + (input) => { + return input ? execPostParser(input, links) : null + }, + (input) => { + return input ? cleanify(input) : null + } + ) - const content = extractWithReadability(transformedContent, bestUrl) + const content = fns(inputHtml) if (!content) { - logger.info('Could not detect article content!') return null } - const normalizedContent = cleanAndMinifyHtml(content) - - const textContent = stripTags(normalizedContent) + const textContent = stripTags(content) if (textContent.length < contentLengthThreshold) { - logger.info('Main article is too short!') return null } @@ -127,22 +114,7 @@ export default async (inputHtml, inputUrl = '') => { ) const image = metaImg ? absolutifyUrl(bestUrl, metaImg) : '' - - let { publisher, ...jsonLData } = extractJsonLd(html, inputUrl) - let authors = jsonLData.author - if (!authors.length && author) { - const trimStr = (str) => str.trimStart().trimEnd() - authors = author.split(/,|and|& /).filter((a) => !!trimStr(a)).map((a) => ({ name: trimStr(a) })) - } - - const hostName = getHostname(bestUrl) - - if (!publisher) { - publisher = { - name: source || hostName, - url: hostName - } - } + const favicon = metaFav ? 
absolutifyUrl(bestUrl, metaFav) : '' return { url: bestUrl, @@ -150,11 +122,12 @@ export default async (inputHtml, inputUrl = '') => { description, links, image, - content: normalizedContent, - author: authors, - source: source || hostName, - publisher, + content, + author, + favicon, + source: getDomain(bestUrl), published, - ttr: getTimeToRead(textContent) + ttr: getTimeToRead(textContent, wordsPerMinute), + type, } } diff --git a/src/utils/parseFromHtml.test.js b/src/utils/parseFromHtml.test.js index c6eac666..e9965038 100644 --- a/src/utils/parseFromHtml.test.js +++ b/src/utils/parseFromHtml.test.js @@ -1,127 +1,133 @@ // parseFromHtml.test -/* eslint-env jest */ +import { describe, it } from 'node:test' +import assert from 'node:assert' -import { readFileSync } from 'fs' +import { readFileSync } from 'node:fs' -import { isFunction } from 'bellajs' +import { isFunction } from '@ndaidong/bellajs' -import { - addQueryRules -} from '../config.js' +import { extractFromHtml as parseFromHtml } from '../main.js' +import { addTransformations } from './transformation.js' -import parseFromHtml from './parseFromHtml.js' +const expDesc = [ + 'Navigation here Few can name a rational peach that isn\'t a conscientious goldfish!', + 'One cannot separate snakes from plucky pomegranates?', + 'Draped neatly on a hanger, the melons could be said to resemble knowledgeable pigs.', +].join(' ') -describe('test parseFromHtml()', () => { - const cases = [ - { - input: { - desc: 'a bad input', - html: {} - }, - expectation: null +const cases = [ + { + input: { + desc: 'a webpage with no title', + html: readFileSync('./test-data/html-no-title.html', 'utf8'), }, - { - input: { - desc: 'a webpage with no title', - html: readFileSync('./test-data/html-no-title.html', 'utf8') - }, - expectation: null + expectation: null, + }, + { + input: { + desc: 'a webpage without link', + html: readFileSync('./test-data/html-no-link.html', 'utf8'), }, - { - input: { - desc: 'a webpage with no main 
article', - html: readFileSync('./test-data/html-no-article.html', 'utf8') - }, - expectation: null + expectation: null, + }, + { + input: { + desc: 'a webpage with no main article', + html: readFileSync('./test-data/html-no-article.html', 'utf8'), }, - { - input: { - desc: 'a webpage with a very short article', - html: readFileSync('./test-data/html-too-short-article.html', 'utf8'), - url: 'abcd' - }, - expectation: null + expectation: null, + }, + { + input: { + desc: 'a webpage with a very short article', + html: readFileSync('./test-data/html-too-short-article.html', 'utf8'), + url: 'abcd', }, - { - input: { - desc: 'a webpage with article but no source', - html: readFileSync('./test-data/html-article-no-source.html', 'utf8') - }, - expectation: (result, expect) => { - expect(result.source).toEqual('somewhere.any') - } + expectation: null, + }, + { + input: { + desc: 'a webpage with article but no source', + html: readFileSync('./test-data/html-article-no-source.html', 'utf8'), }, - { - input: { - desc: 'a webpage with data-src in img tag', - html: readFileSync('./test-data/html-article-with-data-src.html', 'utf8') - }, - expectation: (result, expect) => { - expect(result.content).toEqual(expect.stringContaining('')) - expect(result.content).toEqual(expect.stringContaining('')) - } + expectation: (result) => { + assert.equal(result.source, 'somewhere.any') }, - { - input: { - desc: 'a webpage with regular article', - html: readFileSync('./test-data/regular-article.html', 'utf8'), - url: 'https://somewhere.com/path/to/article' - }, - expectation: (result, expect) => { - expect(result.title).toEqual('Article title here') - expect(result.description).toEqual('Few words to summarize this article content') - expect(result.content).toEqual(expect.stringContaining('')) - expect(result.content).toEqual(expect.stringContaining('')) - } + }, + { + input: { + desc: 'a webpage with data-src in img tag', + html: readFileSync('./test-data/html-article-with-data-src.html', 
'utf8'), }, - { - input: { - desc: 'a webpage with unwanted elements', - html: readFileSync('./test-data/vnn-article.html', 'utf8'), - url: 'https://vnn.vn/path/to/article' - }, - expectation: (result, expect) => { - expect(result.title).toEqual('Article title here') - expect(result.description).toEqual('Few words to summarize this article content') - expect(result.content).toEqual(expect.stringContaining('')) - expect(result.content).toEqual(expect.stringContaining('')) - expect(result.content).toEqual(expect.not.stringContaining('Related articles')) - } - } - ] + expectation: (result) => { + assert.equal(result.content.includes(''), true) + assert.equal(result.content.includes(''), true) + }, + }, + { + input: { + desc: 'a webpage with regular article', + html: readFileSync('./test-data/regular-article.html', 'utf8'), + url: 'https://somewhere.com/path/to/article', + }, + expectation: (result) => { + assert.equal(result.title, 'Article title here') + assert.equal(result.description, expDesc) + assert.equal(result.content.includes(''), true) + assert.equal(result.content.includes(''), true) + }, + }, +] +describe('test parseFromHtml', () => { cases.forEach((acase) => { const { input, expectation } = acase const { desc, html, url = '' } = input - test(`check if parseFromHtml() works with ${desc}`, async () => { + it(`check if parseFromHtml() works with ${desc}`, async () => { const result = await parseFromHtml(html, url) if (isFunction(expectation)) { - expectation(result, expect) + expectation(result) } else { - expect(result).toEqual(expectation) + assert.equal(result, expectation) } }) }) -}) -test('check if parseFromHtml() works with transform rule', async () => { - addQueryRules({ - patterns: [ - /http(s?):\/\/([\w]+.)?need-transform.tld\/*/ - ], - transform: ($) => { - $.querySelectorAll('a').forEach(node => { - const sHtml = node.innerHTML - const link = node.getAttribute('href') - node.parentNode.replaceChild($.createTextNode(`[link 
url="${link}"]${sHtml}[/link]`), node) - }) - return $ - } + it('check if parseFromHtml() works with multi transforms', async () => { + addTransformations([ + { + patterns: [ + /http(s?):\/\/need-transform.tld\/*/, + ], + post: (document) => { + document.querySelectorAll('a').forEach((node) => { + const sHtml = node.innerHTML + const link = node.getAttribute('href') + node.parentNode.replaceChild(document.createTextNode(`[link url="${link}"]${sHtml}[/link]`), node) + }) + return document + }, + }, + { + patterns: [ + /http(s?):\/\/sw.re\/*/, + ], + post: (document) => { + document.querySelectorAll('strong').forEach((node) => { + const b = document.createElement('B') + b.innerHTML = node.innerHTML + node.parentNode.replaceChild(b, node) + }) + return document + }, + }, + ]) + const html = readFileSync('./test-data/vnn-article.html', 'utf8') + const url = 'https://need-transform.tld/path/to/article' + const result = await parseFromHtml(html, url) + assert.equal(result.title, 'Article title here') + assert.equal(result.content.includes(''), false) + assert.equal(result.content.includes('[link url="https://vnn.vn/dict/watermelon"]watermelon[/link]'), true) + assert.equal(result.content.includes('in its own way'), true) }) - const html = readFileSync('./test-data/vnn-article.html', 'utf8') - const url = 'https://need-transform.tld/path/to/article' - const result = await parseFromHtml(html, url) - expect(result.title).toEqual('Article title here') - expect(result.content).toEqual(expect.not.stringContaining('')) - expect(result.content).toEqual(expect.stringContaining('[link url="https://vnn.vn/dict/watermelon"]watermelon[/link]')) }) diff --git a/src/utils/purifyUrl.js b/src/utils/purifyUrl.js index 3c4866c4..ac0774c5 100644 --- a/src/utils/purifyUrl.js +++ b/src/utils/purifyUrl.js @@ -59,7 +59,7 @@ const blacklistKeys = [ 'WT.srch', 'pk_source', 'pk_medium', - 'pk_campaign' + 'pk_campaign', ] export default (url) => { diff --git a/src/utils/purifyUrl.test.js 
b/src/utils/purifyUrl.test.js deleted file mode 100644 index fc87db11..00000000 --- a/src/utils/purifyUrl.test.js +++ /dev/null @@ -1,51 +0,0 @@ -// purifyUrl.test -/* eslint-env jest */ - -import purifyUrl from './purifyUrl.js' - -describe('test purifyUrl()', () => { - const entries = [ - { - url: '', - expected: null - }, - { - url: {}, - expected: null - }, - { - url: 'https://some.where/article/abc-xyz', - expected: 'https://some.where/article/abc-xyz' - }, - { - url: 'https://some.where/article/abc-xyz#name,bob', - expected: 'https://some.where/article/abc-xyz' - }, - { - url: 'https://some.where/article/abc-xyz?utm_source=news4&utm_medium=email&utm_campaign=spring-summer', - expected: 'https://some.where/article/abc-xyz' - }, - { - url: 'https://some.where/article/abc-xyz?q=3&utm_source=news4&utm_medium=email&utm_campaign=spring-summer', - expected: 'https://some.where/article/abc-xyz?q=3' - }, - { - url: 'https://some.where/article/abc-xyz?pk_source=news4&pk_medium=email&pk_campaign=spring-summer', - expected: 'https://some.where/article/abc-xyz' - }, - { - url: 'https://some.where/article/abc-xyz?q=3&pk_source=news4&pk_medium=email&pk_campaign=spring-summer', - expected: 'https://some.where/article/abc-xyz?q=3' - } - ] - entries.forEach((entry) => { - const { - url, - expected - } = entry - test(`purifyUrl("${url}") must become "${expected}"`, () => { - const result = purifyUrl(url) - expect(result).toEqual(expected) - }) - }) -}) diff --git a/src/utils/retrieve.js b/src/utils/retrieve.js index 555bfd21..6922a752 100644 --- a/src/utils/retrieve.js +++ b/src/utils/retrieve.js @@ -1,24 +1,36 @@ // utils -> retrieve -import axios from 'axios' +import fetch from 'cross-fetch' -import logger from './logger.js' - -import { getRequestOptions } from '../config.js' +const profetch = async (url, options = {}) => { + const { proxy = {}, signal = null } = options + const { + target, + headers = {}, + } = proxy + const res = await fetch(target + encodeURIComponent(url), 
{ + headers, + signal, + }) + return res +} -export default async (url) => { - try { - const res = await axios.get(url, getRequestOptions()) +export default async (url, options = {}) => { + const { + headers = { + 'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0', + }, + proxy = null, + agent = null, + signal = null, + } = options - const contentType = res.headers['content-type'] || '' - if (!contentType || !contentType.includes('text/html')) { - logger.error(`Content type must be "text/html", not "${contentType}"`) - return null - } + const res = proxy ? await profetch(url, { proxy, signal }) : await fetch(url, { headers, agent, signal }) - return res.data - } catch (err) { - logger.error(err.message || err) - return null + const status = res.status + if (status >= 400) { + throw new Error(`Request failed with error code ${status}`) } + const buffer = await res.arrayBuffer() + return buffer } diff --git a/src/utils/retrieve.test.js b/src/utils/retrieve.test.js index f8efc314..a410c004 100644 --- a/src/utils/retrieve.test.js +++ b/src/utils/retrieve.test.js @@ -1,5 +1,6 @@ // retrieve.test -/* eslint-env jest */ +import { describe, it } from 'node:test' +import assert from 'node:assert' import nock from 'nock' @@ -9,39 +10,59 @@ const parseUrl = (url) => { const re = new URL(url) return { baseUrl: `${re.protocol}//${re.host}`, - path: re.pathname + path: re.pathname, } } -test('test retrieve from good source', async () => { - const url = 'https://some.where/good/page' - const { baseUrl, path } = parseUrl(url) - const scope = nock(baseUrl) - scope.get(path).reply(200, '
<div>this is content</div>
', { - 'Content-Type': 'text/html' +describe('test retrieve() method', () => { + it('test retrieve with bad status code', async () => { + const url = 'https://some.where/bad/page' + const { baseUrl, path } = parseUrl(url) + nock(baseUrl).get(path).reply(500, 'Error 500') + await assert.rejects(retrieve(url), new Error('Request failed with error code 500')) }) - const result = await retrieve(url) - expect(result).toBe('
<div>this is content</div>
') -}) -test('test retrieve with unsupported content type', async () => { - const url = 'https://some.where/bad/page' - const { baseUrl, path } = parseUrl(url) - const scope = nock(baseUrl) - scope.get(path).reply(200, '', { - 'Content-Type': 'something/strange' + it('test retrieve from good source', async () => { + const url = 'https://some.where/good/page' + const { baseUrl, path } = parseUrl(url) + nock(baseUrl).get(path).reply(200, '
<div>this is content</div>
', { + 'Content-Type': 'text/html', + }) + const buffer = await retrieve(url) + const html = Buffer.from(buffer).toString() + assert.equal(html, '
<div>this is content</div>
') }) - const result = await retrieve(url) - expect(result).toBe(null) -}) -test('test retrieve from bad source', async () => { - const url = 'https://some.where/bad/page' - const { baseUrl, path } = parseUrl(url) - const scope = nock(baseUrl) - scope.get(path).reply(500, '
<div>this is content</div>
', { - 'Content-Type': 'text/html' + it('test retrieve from good source with \\r\\n', async () => { + const url = 'https://some.where/good/page' + const { baseUrl, path } = parseUrl(url) + nock(baseUrl).get(path).reply(200, '\n\r\r\n\n
<div>this is content</div>
\n\r\r\n\n', { + 'Content-Type': 'text/html', + }) + const buffer = await retrieve(url) + const html = Buffer.from(buffer).toString().trim() + assert.equal(html, '
<div>this is content</div>
') + }) + + it('test retrieve using proxy', async () => { + const url = 'https://some.where/good/source-with-proxy' + const { baseUrl, path } = parseUrl(url) + nock(baseUrl).get(path).reply(200, 'something bad', { + 'Content-Type': 'bad/thing', + }) + nock('https://proxy-server.com') + .get('/api/proxy?url=https%3A%2F%2Fsome.where%2Fgood%2Fsource-with-proxy') + .reply(200, '
<div>this is content</div>
', { + 'Content-Type': 'text/html', + }) + + const buffer = await retrieve(url, { + proxy: { + target: 'https://proxy-server.com/api/proxy?url=', + }, + }) + const html = Buffer.from(buffer).toString() + assert.equal(html, '
<div>this is content</div>
') + nock.cleanAll() }) - const result = await retrieve(url) - expect(result).toBe(null) }) diff --git a/src/utils/similarity.js b/src/utils/similarity.js new file mode 100644 index 00000000..7e291cbc --- /dev/null +++ b/src/utils/similarity.js @@ -0,0 +1,64 @@ +// similarity.js +// https://github.com/aceakash/string-similarity + +import { isArray, isString } from '@ndaidong/bellajs' + +const areArgsValid = (mainString, targetStrings) => { + return isString(mainString) && isArray(targetStrings) + && targetStrings.length > 0 && targetStrings.every(s => isString(s)) +} + +export const compareTwoStrings = (first, second) => { + first = first.replace(/\s+/g, '') + second = second.replace(/\s+/g, '') + + if (first === second) return 1 // identical or empty + if (first.length < 2 || second.length < 2) return 0 // if either is a 0-letter or 1-letter string + + let firstBigrams = new Map() + for (let i = 0; i < first.length - 1; i++) { + const bigram = first.substring(i, i + 2) + const count = firstBigrams.has(bigram) + ? firstBigrams.get(bigram) + 1 + : 1 + + firstBigrams.set(bigram, count) + } + + let intersectionSize = 0 + for (let i = 0; i < second.length - 1; i++) { + const bigram = second.substring(i, i + 2) + const count = firstBigrams.has(bigram) + ? 
firstBigrams.get(bigram) + : 0 + + if (count > 0) { + firstBigrams.set(bigram, count - 1) + intersectionSize++ + } + } + + return (2.0 * intersectionSize) / (first.length + second.length - 2) +} + +export const findBestMatch = (mainString, targetStrings) => { + if (!areArgsValid(mainString, targetStrings)) { + throw new Error('Bad arguments: First argument should be a string, second should be an array of strings') + } + + const ratings = [] + let bestMatchIndex = 0 + + for (let i = 0; i < targetStrings.length; i++) { + const currentTargetString = targetStrings[i] + const currentRating = compareTwoStrings(mainString, currentTargetString) + ratings.push({ target: currentTargetString, rating: currentRating }) + if (currentRating > ratings[bestMatchIndex].rating) { + bestMatchIndex = i + } + } + + const bestMatch = ratings[bestMatchIndex] + + return { ratings: ratings, bestMatch: bestMatch, bestMatchIndex: bestMatchIndex } +} diff --git a/src/utils/transformHtml.test.js b/src/utils/transformHtml.test.js deleted file mode 100644 index ea9cc593..00000000 --- a/src/utils/transformHtml.test.js +++ /dev/null @@ -1,31 +0,0 @@ -// transformHtml.test -/* eslint-env jest */ - -import { readFileSync } from 'fs' - -import { isString } from 'bellajs' - -import transformHtml from './transformHtml.js' - -describe('test transformHtml()', () => { - test('test transform html elements from good html content', async () => { - const transform = (document) => { - document.querySelectorAll('h1').forEach((node) => { - const newNode = document.createElement('h2') - newNode.innerHTML = node.innerHTML - node.parentNode.replaceChild(newNode, node) - }) - return document - } - - const html = readFileSync('./test-data/regular-article.html', 'utf8') - const result = transformHtml(html, transform) - expect(isString(result)).toBe(true) - expect(result).toEqual( - expect.not.stringContaining('

<h1>Article title here</h1>

') - ) - expect(result).toEqual( - expect.stringContaining('

<h2>Article title here</h2>

') - ) - }) -}) diff --git a/src/utils/transformation.js b/src/utils/transformation.js new file mode 100644 index 00000000..5bb872b1 --- /dev/null +++ b/src/utils/transformation.js @@ -0,0 +1,69 @@ +// utils --> transformation.js + +import { isArray, isFunction, clone } from '@ndaidong/bellajs' +import { DOMParser } from 'linkedom' + +const transformations = [] + +const add = (tn) => { + const { patterns } = tn + if (!patterns || !isArray(patterns) || !patterns.length) { + return 0 + } + transformations.push(tn) + return 1 +} + +export const addTransformations = (tfms) => { + if (isArray(tfms)) { + return tfms.map(tfm => add(tfm)).filter(result => result === 1).length + } + return add(tfms) +} + +export const removeTransformations = (patterns) => { + if (!patterns) { + const removed = transformations.length + transformations.length = 0 + return removed + } + let removing = 0 + for (let i = transformations.length - 1; i > 0; i--) { + const { patterns: ipatterns } = transformations[i] + const matched = ipatterns.some((ptn) => patterns.some((pattern) => String(pattern) === String(ptn))) + if (matched) { + transformations.splice(i, 1) + removing += 1 + } + } + return removing +} + +export const getTransformations = () => { + return clone(transformations) +} + +export const findTransformations = (links) => { + const urls = !isArray(links) ? 
[links] : links + const tfms = [] + for (const transformation of transformations) { + const { patterns } = transformation + const matched = urls.some((url) => patterns.some((pattern) => pattern.test(url))) + if (matched) { + tfms.push(clone(transformation)) + } + } + return tfms +} + +export const execPreParser = (html, links) => { + const doc = new DOMParser().parseFromString(html, 'text/html') + findTransformations(links).map(tfm => tfm.pre).filter(fn => isFunction(fn)).map(fn => fn(doc)) + return Array.from(doc.childNodes).map(it => it.outerHTML).join('') +} + +export const execPostParser = (html, links) => { + const doc = new DOMParser().parseFromString(html, 'text/html') + findTransformations(links).map(tfm => tfm.post).filter(fn => isFunction(fn)).map(fn => fn(doc)) + return Array.from(doc.childNodes).map(it => it.outerHTML).join('') +} diff --git a/src/utils/transformation.test.js b/src/utils/transformation.test.js new file mode 100644 index 00000000..18a54d4a --- /dev/null +++ b/src/utils/transformation.test.js @@ -0,0 +1,194 @@ +// transformation.test +import { describe, it as test } from 'node:test' +import assert from 'node:assert' + +import { + addTransformations, + removeTransformations, + getTransformations, + findTransformations, + execPreParser, + execPostParser +} from './transformation.js' + +describe('test transformation', () => { + test(' add one transformation object', () => { + const result = addTransformations({ + patterns: [ + /http(s?):\/\/([\w]+.)?def.tld\/*/, + ], + pre: (document) => { + return document + }, + post: (document) => { + return document + }, + }) + assert.equal(result, 1) + }) + + test(' add multi transformation object', () => { + const result = addTransformations([ + { + patterns: [ + /http(s?):\/\/google.com\/*/, + /http(s?):\/\/goo.gl\/*/, + ], + }, + { + patterns: [ + /http(s?):\/\/goo.gl\/*/, + /http(s?):\/\/google.inc\/*/, + ], + }, + ]) + assert.equal(result, 2) + }) + + test(' add transformation object without 
patterns', () => { + const result = addTransformations({ + pre: (document) => { + return document + }, + post: (document) => { + return document + }, + }) + assert.equal(result, 0) + }) + + test(' add transformation object without valid patterns', () => { + const result = addTransformations({ + patterns: 123, + pre: (document) => { + return document + }, + post: (document) => { + return document + }, + }) + assert.equal(result, 0) + }) + + test(' get all transformations', () => { + const result = getTransformations() + assert.equal(result.length, 3) + assert.deepEqual(result[0].patterns[0], /http(s?):\/\/([\w]+.)?def.tld\/*/) + }) + + test(' remove one transformation', () => { + addTransformations([ + { + patterns: [ + /http(s?):\/\/abc.com\/*/, + /http(s?):\/\/def.gl\/*/, + ], + }, + { + patterns: [ + /http(s?):\/\/hik.gl\/*/, + /http(s?):\/\/lmn.inc\/*/, + ], + }, + { + patterns: [ + /http(s?):\/\/opq.gl\/*/, + /http(s?):\/\/rst.inc\/*/, + ], + }, + ]) + const result = removeTransformations([ + /http(s?):\/\/goo.gl\/*/, + ]) + assert.equal(result, 2) + }) + + test(' get all transformations again', () => { + const result = getTransformations() + assert.equal(result.length, 4) + assert.deepEqual(result[3].patterns[1], /http(s?):\/\/rst.inc\/*/) + }) + + test(' find transformations', () => { + addTransformations([ + { + patterns: [ + /http(s?):\/\/def.gl\/*/, + /http(s?):\/\/uvw.inc\/*/, + ], + }, + ]) + const notFound = findTransformations([ + 'https://goo.gl/docs/article.html', + ]) + assert.deepEqual(notFound, []) + + const foundOne = findTransformations([ + 'https://lmn.inc/docs/article.html', + ]) + assert.equal(foundOne.length, 1) + + const foundTwo = findTransformations([ + 'https://def.gl/docs/article.html', + ]) + assert.equal(foundTwo.length, 2) + }) + + test(' run execPreParser', () => { + addTransformations([ + { + patterns: [ + /http(s?):\/\/xyz.com\/*/, + ], + pre: (doc) => { + doc.querySelectorAll('.adv').forEach((element) => { + 
element.parentNode.removeChild(element) + }) + return doc + }, + }, + ]) + const html = ` +
+      <div>
+        hi user, this is an advertisement element
+        <div class="adv">free product now!</div>
+      </div>
+ ` + const result = execPreParser(html, 'https://xyz.com/article') + assert.equal(result.includes('hi user, this is an advertisement element'), true) + assert.equal(result.includes('
<div class="adv">free product now!</div>
'), false) + }) + + test(' run execPostParser', () => { + addTransformations([ + { + patterns: [ + /http(s?):\/\/xyz.com\/*/, + ], + post: (doc) => { + doc.querySelectorAll('b').forEach((element) => { + const itag = doc.createElement('i') + itag.innerHTML = element.innerHTML + element.parentNode.replaceChild(itag, element) + }) + return doc + }, + }, + ]) + const html = ` +
+      <div>
+        hi user,
+        <b>Thank you for your feedback!</b>
+      </div>
+ ` + const result = execPostParser(html, 'https://xyz.com/article') + assert.equal(result.includes('user'), true) + assert.equal(result.includes('user'), false) + }) + + test(' remove all transformations', () => { + const result = removeTransformations() + assert.equal(result, 7) + assert.deepEqual(getTransformations(), []) + }) +}) diff --git a/test-data/article-with-classes-attributes.html b/test-data/article-with-classes-attributes.html new file mode 100644 index 00000000..155dc66e --- /dev/null +++ b/test-data/article-with-classes-attributes.html @@ -0,0 +1,64 @@ + + + + + + Article title here - ArticleParser + + + + + + + + + + + + + + + + + + + + + + + + + + +
Page header here
+
+
+ +
+
+

<h1>Article title here</h1>

+
+
Few can name a rational peach that isn't a conscientious goldfish! One cannot separate snakes from plucky pomegranates? Draped neatly on a hanger, the melons could be said to resemble knowledgeable pigs. Some posit the enchanting tiger to be less than confident. The literature would have us believe that an impartial turtle is not but a hippopotamus. Unfortunately, that is wrong; on the contrary, those cows are nothing more than pandas! The chicken is a shark; A turtle can hardly be considered a kind horse without also being a pomegranate. Zebras are witty persimmons.
+

+ Those cheetahs are nothing more than dogs. A watermelon is an exuberant kangaroo. An octopus is the tangerine of a grapes? The cherry is a shark. Recent controversy aside, they were lost without the cheerful plum that composed their fox. As far as we can estimate, one cannot separate camels from dynamic hamsters. Those tigers are nothing more than cows! A cow is a squirrel from the right perspective. Their banana was, in this moment, a helpful bear.

+

The first fair dog is, in its own way, a lemon.

+
4746 Kelly Drive, West Virginia
+ +
+            <pre><code>
+              const add = (a, b) => {
+                return a + b
+              }
+            </code></pre>
+

OK, that is good

+
+
+ +
+
Page footer here
+ + diff --git a/test-data/html-article-no-source.html b/test-data/html-article-no-source.html index c040fa2e..12237080 100644 --- a/test-data/html-article-no-source.html +++ b/test-data/html-article-no-source.html @@ -17,6 +17,6 @@ To be more specific, those turtles are nothing more than fishes. A grape can hardly be considered a shrewd goldfish without also being an owl. Some unbiased goats are thought of simply as tangerines. Shouting with happiness, a courageous elephant is a duck of the mind? Some posit the upbeat hippopotamus to be less than enchanting. It's an undeniable fact, really; authors often misinterpret the grape as an endurable rabbit, when in actuality it feels more like a tough dolphin. We know that a cherry can hardly be considered a responsible apricot without also being a nectarine. - s + diff --git a/test-data/html-no-link.html b/test-data/html-no-link.html new file mode 100644 index 00000000..0d5f46d7 --- /dev/null +++ b/test-data/html-no-link.html @@ -0,0 +1,37 @@ + + + + + + Article title here - ArticleParser + + + + + + + +
Page header here
+
+
+ +
+
+

<h1>Article title here</h1>

+
+
Few can name a rational peach that isn't a conscientious goldfish! One cannot separate snakes from plucky pomegranates? Draped neatly on a hanger, the melons could be said to resemble knowledgeable pigs. Some posit the enchanting tiger to be less than confident. The literature would have us believe that an impartial turtle is not but a hippopotamus. Unfortunately, that is wrong; on the contrary, those cows are nothing more than pandas! The chicken is a shark; A turtle can hardly be considered a kind horse without also being a pomegranate. Zebras are witty persimmons.
+

+ Those cheetahs are nothing more than dogs. A watermelon is an exuberant kangaroo. An octopus is the tangerine of a grapes? The cherry is a shark. Recent controversy aside, they were lost without the cheerful plum that composed their fox. As far as we can estimate, one cannot separate camels from dynamic hamsters. Those tigers are nothing more than cows! A cow is a squirrel from the right perspective. Their banana was, in this moment, a helpful bear.

+

The first fair dog is, in its own way, a lemon.

+
4746 Kelly Drive, West Virginia
+ +
+
+ +
+
Page footer here
+ + diff --git a/test-data/html-no-title.html b/test-data/html-no-title.html index 4072704f..f2f79321 100644 --- a/test-data/html-no-title.html +++ b/test-data/html-no-title.html @@ -3,8 +3,29 @@ - TechNews - - + +
Page header here
+
+
+ +
+
+

<h1>Article title here</h1>

+
+
Few can name a rational peach that isn't a conscientious goldfish! One cannot separate snakes from plucky pomegranates? Draped neatly on a hanger, the melons could be said to resemble knowledgeable pigs. Some posit the enchanting tiger to be less than confident. The literature would have us believe that an impartial turtle is not but a hippopotamus. Unfortunately, that is wrong; on the contrary, those cows are nothing more than pandas! The chicken is a shark; A turtle can hardly be considered a kind horse without also being a pomegranate. Zebras are witty persimmons.
+

+ Those cheetahs are nothing more than dogs. A watermelon is an exuberant kangaroo. An octopus is the tangerine of a grapes? The cherry is a shark. Recent controversy aside, they were lost without the cheerful plum that composed their fox. As far as we can estimate, one cannot separate camels from dynamic hamsters. Those tigers are nothing more than cows! A cow is a squirrel from the right perspective. Their banana was, in this moment, a helpful bear.

+

The first fair dog is, in its own way, a lemon.

+
4746 Kelly Drive, West Virginia
+ +
+
+ +
+
Page footer here
+ diff --git a/test-data/regular-article-date-itemprop.html b/test-data/regular-article-date-itemprop.html new file mode 100644 index 00000000..c23a4e9d --- /dev/null +++ b/test-data/regular-article-date-itemprop.html @@ -0,0 +1,57 @@ + + + + + + Article title here - ArticleParser + + + + + + + + + + + + + + + + + + + + + + + + + +
Page header here
+
+
+ +
+
+

<h1>Article title here</h1>

+
+ + +
Few can name a rational peach that isn't a conscientious goldfish! One cannot separate snakes from plucky pomegranates? Draped neatly on a hanger, the melons could be said to resemble knowledgeable pigs. Some posit the enchanting tiger to be less than confident. The literature would have us believe that an impartial turtle is not but a hippopotamus. Unfortunately, that is wrong; on the contrary, those cows are nothing more than pandas! The chicken is a shark; A turtle can hardly be considered a kind horse without also being a pomegranate. Zebras are witty persimmons.
+

+ Those cheetahs are nothing more than dogs. A watermelon is an exuberant kangaroo. An octopus is the tangerine of a grapes? The cherry is a shark. Recent controversy aside, they were lost without the cheerful plum that composed their fox. As far as we can estimate, one cannot separate camels from dynamic hamsters. Those tigers are nothing more than cows! A cow is a squirrel from the right perspective. Their banana was, in this moment, a helpful bear.

+

The first fair dog is, in its own way, a lemon.

+
4746 Kelly Drive, West Virginia
+ +
+
+ +
+
Page footer here
+ + diff --git a/test-data/regular-article-date-span.html b/test-data/regular-article-date-span.html new file mode 100644 index 00000000..e6c13fc8 --- /dev/null +++ b/test-data/regular-article-date-span.html @@ -0,0 +1,57 @@ + + + + + + Article title here - ArticleParser + + + + + + + + + + + + + + + + + + + + + + + + + +
Page header here
+
+
+ +
+
+

<h1>Article title here</h1>

+
+ Published at 11/09/2024 07h33 am + +
Few can name a rational peach that isn't a conscientious goldfish! One cannot separate snakes from plucky pomegranates? Draped neatly on a hanger, the melons could be said to resemble knowledgeable pigs. Some posit the enchanting tiger to be less than confident. The literature would have us believe that an impartial turtle is not but a hippopotamus. Unfortunately, that is wrong; on the contrary, those cows are nothing more than pandas! The chicken is a shark; A turtle can hardly be considered a kind horse without also being a pomegranate. Zebras are witty persimmons.
+

+ Those cheetahs are nothing more than dogs. A watermelon is an exuberant kangaroo. An octopus is the tangerine of a grapes? The cherry is a shark. Recent controversy aside, they were lost without the cheerful plum that composed their fox. As far as we can estimate, one cannot separate camels from dynamic hamsters. Those tigers are nothing more than cows! A cow is a squirrel from the right perspective. Their banana was, in this moment, a helpful bear.

+

The first fair dog is, in its own way, a lemon.

+
4746 Kelly Drive, West Virginia
+ +
+
+ +
+
Page footer here
+ + diff --git a/test-data/regular-article-date-time.html b/test-data/regular-article-date-time.html new file mode 100644 index 00000000..d1e1638f --- /dev/null +++ b/test-data/regular-article-date-time.html @@ -0,0 +1,57 @@ + + + + + + Article title here - ArticleParser + + + + + + + + + + + + + + + + + + + + + + + + + +
Page header here
+
+
+ +
+
+

<h1>Article title here</h1>

+
+ + +
Few can name a rational peach that isn't a conscientious goldfish! One cannot separate snakes from plucky pomegranates? Draped neatly on a hanger, the melons could be said to resemble knowledgeable pigs. Some posit the enchanting tiger to be less than confident. The literature would have us believe that an impartial turtle is not but a hippopotamus. Unfortunately, that is wrong; on the contrary, those cows are nothing more than pandas! The chicken is a shark; A turtle can hardly be considered a kind horse without also being a pomegranate. Zebras are witty persimmons.
+

+ Those cheetahs are nothing more than dogs. A watermelon is an exuberant kangaroo. An octopus is the tangerine of a grapes? The cherry is a shark. Recent controversy aside, they were lost without the cheerful plum that composed their fox. As far as we can estimate, one cannot separate camels from dynamic hamsters. Those tigers are nothing more than cows! A cow is a squirrel from the right perspective. Their banana was, in this moment, a helpful bear.

+

The first fair dog is, in its own way, a lemon.

+
4746 Kelly Drive, West Virginia
+ +
+
+ +
+
Page footer here
+ + diff --git a/test-data/regular-article-json-ld.html b/test-data/regular-article-json-ld.html new file mode 100644 index 00000000..c48ad9db --- /dev/null +++ b/test-data/regular-article-json-ld.html @@ -0,0 +1,88 @@ + + + + + + Article title here - ArticleParser + + + + + + + + + + + + + + + + + + + + + + + +
Page header here
+
+
+ +
+
+

<h1>Article title here</h1>

+
+
Few can name a rational peach that isn't a conscientious goldfish! One cannot separate snakes from plucky pomegranates? Draped neatly on a hanger, the melons could be said to resemble knowledgeable pigs. Some posit the enchanting tiger to be less than confident. The literature would have us believe that an impartial turtle is not but a hippopotamus. Unfortunately, that is wrong; on the contrary, those cows are nothing more than pandas! The chicken is a shark; A turtle can hardly be considered a kind horse without also being a pomegranate. Zebras are witty persimmons.
+

+ Those cheetahs are nothing more than dogs. A watermelon is an exuberant kangaroo. An octopus is the tangerine of a grapes? The cherry is a shark. Recent controversy aside, they were lost without the cheerful plum that composed their fox. As far as we can estimate, one cannot separate camels from dynamic hamsters. Those tigers are nothing more than cows! A cow is a squirrel from the right perspective. Their banana was, in this moment, a helpful bear.

+

The first fair dog is, in its own way, a lemon.

+
4746 Kelly Drive, West Virginia
+ +
+
+ +
+
Page footer here
+ + diff --git a/test-data/regular-article.html b/test-data/regular-article.html index a9cb196f..ab68045c 100644 --- a/test-data/regular-article.html +++ b/test-data/regular-article.html @@ -15,7 +15,7 @@ - + @@ -24,6 +24,7 @@ + @@ -42,7 +43,8 @@

<h1>Article title here</h1>

Those cheetahs are nothing more than dogs. A watermelon is an exuberant kangaroo. An octopus is the tangerine of a grapes? The cherry is a shark. Recent controversy aside, they were lost without the cheerful plum that composed their fox. As far as we can estimate, one cannot separate camels from dynamic hamsters. Those tigers are nothing more than cows! A cow is a squirrel from the right perspective. Their banana was, in this moment, a helpful bear.

The first fair dog is, in its own way, a lemon.

- +
4746 Kelly Drive, West Virginia
+

-

The first fair dog is, in its own way, a lemon.

+

The first fair dog is, in its own way, a lemon.