Commit c1d42bc

coturiv, daveschumaker, SettingDust, ndaidong, and willwashburn authored
Merge extractus/main
* chore: update dist with latest build
* chore: update param documentation
* chore: update dependencies (extractus#257)
* v6.0.0
  - Change to ES6 Module format
  - Change and update dependencies
  - Also update core logic
  Related pr: extractus#219, extractus#220, extractus#222, extractus#224, extractus#227, extractus#228, extractus#232, extractus#238, extractus#240, extractus#241, extractus#243, extractus#245
* v6.0.1
  - Change code analysis to GitHub CodeQuality
  - Update dependencies
* fix: can't fetch html from document on browser
  waiting for WebReflection/linkedom#146 for better solution
* v6.0.2
  - Merge pr extractus#265 by @SettingDust (related issue extractus#264)
  - Update dependencies
* v6.0.2
  - Rebuild
* chore: update `urlpattern-polyfill` fix extractus#266
* v6.0.3
  - Merge pr extractus#269 by @SettingDust (issue extractus#266)
  - Fix coding style
* v6.0.3
  - Rebuild
* v6.0.4
  - Update more parser config
  - Improve README & fix expired API key for example
* v6.0.4
  - Improve README
* v6.0.4
  - Add more test
  - Improve README
* v6.0.4
  - Improve README
* Update README
  - Fix link to default rules
* Update README
* v6.0.5
  - Use `test` to match url with patterns (instead `exec`)
  - Add more test
  - Update README
* v6.0.5
  - Update README
* v6.0.6
  - Fix potential problem with query rules
  - Apply multi transformation from all matched query rules
  - Add more guide about query rules
* v7.0.0rc1
  - Update processing logic
  - Replace `queryRule` with `transformation`
  - Re-organize source code structure
* Update README.md
* v7.0.0rc2
* v7.0.0rc3
  - Add default `Accept-Encoding` to request options
  - Update default sanitizeHtml options
  - Update dependencies
* v7.0.0rc3
  - rebuild
* Change method to deal with `source` and `description`
  - Use `tldts` to get domain, used this value as `source`
  - Increase `description` length, tend to take summary from content, remove unneccessary parts
* v7.0.0
  - Official release v7 with new concept `transformation`
  - Simplify error throwing from axios
* v7.0.1
  - Fix function to get description
  - Update dependencies
* Update README
* v7.0.2
  - Update dependencies
  - Add button "Deploy to Deta"
* v7.0.2
  - Update dependencies
  - Add button "Deploy to Deta"
  - Use Deta service for example faas
  - Copy types definition to cjs dist (extractus#287)
* v7.0.3
  - Update dependencies
  - Remove depending on `tldts`
  - Use [conditional exports](https://nodejs.org/api/packages.html#conditional-exports)
  - Improve pre-defined options
* v7.1.0
  - To work with `bun` and `deno`
  - Replace `axios` with `cross-fetch`
  - Remove 4 API methods relating to axios and htmlcrush
* Update types definition
* v7.1.1
  - Fix problem with cross-fetch on deno
* v7.1.1
  - Conditional urlpattern
* v7.2.0rc1
  - Stop depending on `urlpattern-polyfill` for running on deno/bun
  - Replace URLPattern syntax with regular RegExp
* Update README refer links
* v7.2.0rc2
  - Rebuild
* Update README
* v7.2.0rc3
  - Update type definition
* v7.2.0rc4
  - Replace `string-comparison` with `string-similarity` to fix `bun` error
* v7.2.0rc5
  - Use internal string-similarity file to by pass bun.js resolve error
* Add examples with node, deno, bun, tsnode
* Remove bun.lockb
* Rebuild
* v7.2.0
  - Refactor some parts to run on deno, bun and tsnode
  - Add some examples for each platform
  - Remove some rarely used configuration methods
* Update examples
* v7.2.1-rc1
  - Try to use external `string-similarity` again
  - Update build script
  - Improve fetch control
  - Fix typo error on naming example packages
* v7.2.1
  - Rebuild
* v7.2.2-rc1
  - Replace global config with on-request `parserOptions`
  - Add new param `fetchOptions` to extract()
  - Allow to pass request to proxy
  - Fix problem while building esm version for browser
  - Add example for browser usage
* Update dependencies
* Update README
* v7.2.2-rc2
  - Remove dependency `html-crush`
* v7.2.2
  - Add options to extract method
  - Remove unnecessary dependencies for reduce bundle size
  - Add more examples
* v7.2.3
  - Optimize performance by removing html validation
* Update README
* Add option to keep/remove line breaks
  - Update README
* v7.2.4
  - Improve space/newline processing
  - no longer remove all linebreaks but multi empty lines are stripped
  - Add folder for evaluation
  - Update README
* v7.2.5
  - Update dependencies
* Update README
* Add more specs for meta data extraction
  Related issues: extractus#311
* Add security policy
* Add ci test with node 19.x
* Update security policy.
* Update security contact
* Add contributing guide
  - Update ci settings
* Update README
  - Move Deta block to Usage section
* Update SECURITY.md
* v7.2.6
  - Migrate to extractus org
  - Update links and docs (extractus#322)
* Update README
  - Fix badge link
* Update coveralls github action
* v7.2.7
  - Update dependencies
  - Update docs
  - Update CI settings
* Update CI settings
* Update CI config
* Fix CI settings
* Update CI settings
* Update README
* Add image to docs
* Update README
  - Change badges link
* v7.2.8
  - Expose new API method `extractFromHtml()`
  - Update dependencies
  - Change coding style (remove standardjs)
  Related issues: extractus#321, extractus#326
* Update README
* v7.2.9
  - Fix issue extractus#329
  - Update dependencies
  - Improve unit test
* v7.2.10
  - Fix issue extractus#331
  - Update dependencies
  - Remove unnecessary watermark
* Add null to response types
* v7.2.11
  - Merge pr extractus#333
  - Update dependencies
* v7.2.12
  - Set default user-agent
  - Avoid error if parserOptions is null
  - Update dependencies
* Update ci config
* v7.2.13rc1
  - Fix issue on Deno platform
* v7.2.13
  - Fix some issue while fetching data on Deno platform
* Rebuild v7.2.13
* v7.2.14
  - Add support parsely meta tags
  - Update dependencies
* Change string array to dictionary
* v7.2.15
  - Fix unsupported package `string-similarity`
  - Update deps
* v7.2.15
  - Merge with changes from pr extractus#341
* v7.2.16
  - Fix issue extractus#347
  - Update dependencies
* Add favicon to meta data
* GNU nano 6.4 /workspace/node/article-extractor/.git/COMMIT_EDITMSG Modified v7.2.17
  - Merge pr extractus#350 by @LarchLiu
  - Add `agent` to `fetchOptions`
  - Update CI to test with Node 20
  - Update dependencies
  - Update README
* v7.2.17
* v7.2.17
* v7.2.17
* v7.2.18
  - Add test for proxy `agent`
  - Update dependencies
* v7.3.0
  - Add support to `signal`
  - Stop support Node < 15
  - Stop support commonjs version
  - Remove build script
  - Update examples code
  - Update dependencies
* Update README
* v8.0.0
  - Bump version
  - Add deno.json & import sections
  - Update deps
  - Improve README
* Update README
* Update README
* v8.0.1
  - Update dependencies
  - Update imports section
* Update dependencies
* Use `childNodes` instead of `children`
  To get it work as same as Deno DOM
* Update README
* Fix ParserOptions typing
* v8.0.3
  - Update deno example (extractus#368)
* Stop ci test with node < 16 because EOL
* Feat: extract pagetype from og:type or ld+json
* v8.0.8
  - Merge pr extractus#374 by @andremacola (issue extractus#373)
  - Update dependencies
  - Update CI config
  - Fix function call in eval.js
* Update examples
* v8.0.5
  - Fix error while parsing ldjson
  - Update dependencies
  Related issues: extractus#378, extractus#374, extractus#373
* Fix CI issue with coverall
* Fix CI issue
* Fix CI problem
* Change ci event
* Update CI event
* Fix CI problem
* Fix CI issue
* Fix CI coverall
* v8.0.6
  - Update dependencies
  - Update security email
* v8.0.7
  - Update dependencies
  Related issue: extractus#382
* v8.0.8
  - Decode content using detected charset
  - Update dependencies
  - Update eslint config
  Related issues: extractus#386, extractus#320
* Add node 22 to ci
* Update examples & test with pupperteer
* v8.0.9
  - Stop using purified HTML to extract content (extractus#388)
* v8.0.10
  - Fix importing issue
* chore: Improvements in handling LD+JSON data
* v8.0.11
  - Merge pr extractus#400 by @andremacola
  - Replace jest with native node test runner
  - Update dependencies
* Add test coverage
* fix: Cannot read properties of undefined in ld+json
* fix: more tests on ld+json
* v8.0.12
  - Merge pr extractus#403 by @andremacola
* Improvements to find dates
* v8.0.13
  - Merge pr extractus#405 by @andremacola
* v8.0.14
  - Fix inconsistent output (extractus#407)
  - Modify some stuff at LdJson extraction (extractus#405)
  - Only use value from LdJson if missed from meta tags
  - Only accept string value from LdJson
  - Stop converting LdJson value to lowercase
* fix: adjustment of poorly formatted ldjson error
* v8.0.15
  - Merge pr extractus#410 by @andremacola
* v8.0.16
  - Fix issue extractus#412
  - Update dependencies
* v8.0.17
  - Update dependencies
* Update eval script
* 8.0.18
  - Update dependencies
  - Update CI config
  - Update README
* Update README
* Update README
* v8.0.19
  - Fix image lossing while ldjson overwrite meta data
  - Update dependencies
* Add test with node 24
* v8.0.20
  - Update dependencies
* Remove examples
  - To stop dependencies outdated warning
* v8.0.20
  - Update packages
* chore: package rename to @arbitral/article-parser and metadata update
* chore: package.json only change name to @arbitral/article-parser (keep upstream author/homepage/repo)
* chore: regenerate package-lock.json after package name change
* fix: satisfy eslint comma-dangle in build.js and configs

---------

Co-authored-by: Dave Schumaker
Co-authored-by: SettingDust
Co-authored-by: Dong Nguyen
Co-authored-by: Will Washburn
Co-authored-by: mphill
Co-authored-by: Alex.Liu
Co-authored-by: Ranmocy
Co-authored-by: andremacola
1 parent 4aca0a7 commit c1d42bc

70 files changed, with 4261 additions and 1383 deletions

.github/workflows/ci-test.yml

Lines changed: 7 additions & 14 deletions
@@ -8,40 +8,33 @@ on: [push, pull_request]
 jobs:
   test:
 
-    runs-on: ubuntu-20.04
+    runs-on: ubuntu-latest
 
     strategy:
       matrix:
-        node_version: [14.x, 15.x, 16.x, 17.x, 18.x]
+        node_version: [20.x, 22.x, 24.x]
 
     steps:
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v4
 
       - name: setup Node.js v${{ matrix.node_version }}
-        uses: actions/setup-node@v2
+        uses: actions/setup-node@v4
         with:
           node-version: ${{ matrix.node_version }}
 
       - name: run npm scripts
+        env:
+          PROXY_SERVER: ${{ secrets.PROXY_SERVER }}
         run: |
-          npm i -g standard
           npm install
           npm run lint
           npm run build --if-present
           npm run test
 
-      - name: sync to coveralls
-        uses: coverallsapp/[email protected]
-        with:
-          github-token: ${{ secrets.GITHUB_TOKEN }}
-
       - name: cache node modules
-        uses: actions/cache@v2
+        uses: actions/cache@v4
         with:
           path: ~/.npm
           key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
           restore-keys: |
             ${{ runner.os }}-node-
-
-
-
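The `key` / `restore-keys` pair in the cache step above can be read as a two-stage lookup: an exact match on the key derived from `package-lock.json`, then a fallback to the newest entry sharing the `${{ runner.os }}-node-` prefix. The sketch below is only a simplified model of that documented behaviour, not the actual `actions/cache` implementation; the function name `resolveCacheKey` and the sample key strings are hypothetical.

```javascript
// Simplified, hypothetical model of how a cache key is resolved:
// 1. an exact match on `key` wins;
// 2. otherwise each restore-key is tried in order as a prefix;
// 3. among prefix matches, the most recently stored entry is chosen.
// `storedKeys` is assumed to be ordered from oldest to newest.
function resolveCacheKey (storedKeys, key, restoreKeys = []) {
  if (storedKeys.includes(key)) {
    return { hit: key, exact: true }
  }
  for (const prefix of restoreKeys) {
    const matches = storedKeys.filter((stored) => stored.startsWith(prefix))
    if (matches.length > 0) {
      return { hit: matches[matches.length - 1], exact: false }
    }
  }
  return { hit: null, exact: false }
}

// Usage: the lockfile hash changed, so only the prefix fallback can match.
const exampleKeys = ['Linux-node-aaa', 'Linux-node-bbb']
console.log(resolveCacheKey(exampleKeys, 'Linux-node-ccc', ['Linux-node-']))
```

A partial hit restores an older `~/.npm` directory, so `npm install` still has to fetch whatever changed in the lockfile.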

.github/workflows/codeql-analysis.yml

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ jobs:
 
     steps:
     - name: Checkout repository
-      uses: actions/checkout@v3
+      uses: actions/checkout@v4
 
     # Initializes the CodeQL tools for scanning.
     - name: Initialize CodeQL

.gitignore

Lines changed: 4 additions & 1 deletion
@@ -15,5 +15,8 @@ coverage
 yarn.lock
 coverage.lcov
 pnpm-lock.yaml
+lcov.info
 
-dist/
+deno.lock
+
+evaluation

.npmignore

Lines changed: 6 additions & 17 deletions
@@ -1,18 +1,7 @@
-node_modules/
-src/
-test-data/
-.idea/
-coverage/
-.vscode/
-
-.DS_Store
-yarn.lock
-coverage.lcov
+node_modules
+coverage
+.github
 pnpm-lock.yaml
-
-*.js
-*.cjs
-*.js.map
-
-!dist/**/*.js
-!index.js
+examples
+test-data
+lcov.info

CONTRIBUTING.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+# Contributing to `@extractus/article-extractor`
+
+Glad to see you here.
+
+Collaborations and pull requests are always welcomed, though larger proposals should be discussed first.
+
+As an OSS, it's better to follow the Unix philosophy: "do one thing and do it well".
+
+## Third-party libraries
+
+Please avoid using libaries other than those available in the standard library, unless necessary.
+
+This library needs to be simple and flexible to run on multiple platforms such as Deno, Bun, or even browser.
+
+
+## Coding convention
+
+Make sure your code lints before opening a pull request.
+
+
+```bash
+cd article-extractor
+
+# check coding convention issue
+npm run lint
+
+# auto fix coding convention issue
+npm run lint:fix
+```
+
+*When you run `npm test`, the linting process will be triggered at first.*
+
+
+## Testing
+
+Be sure to run the unit test suite before opening a pull request. An example test run is shown below.
+
+```bash
+cd article-extractor
+npm test
+```
+
+![article-extractor unit test](https://i.imgur.com/TbRCUSS.png?110222)
+
+If test coverage decreased, please check test scripts and try to improve this number.
+
+
+## Documentation
+
+If you've changed APIs, please update README and [the examples](examples).
+
+
+## Clean commit histories
+
+When you open a pull request, please ensure the commit history is clean.
+Squash the commits into logical blocks, perhaps a single commit if that makes sense.
+
+What you want to avoid is commits such as "WIP" and "fix test" in the history.
+This is so we keep history on master clean and straightforward.
+
+For people new to git, please refer the following guides:
+
+- [Writing good commit messages](https://github.com/erlang/otp/wiki/writing-good-commit-messages)
+- [Commit Message Guidelines](https://gist.github.com/robertpainsi/b632364184e70900af4ab688decf6f53)
+
+
+## License
+
+By contributing to `@extractus/article-extractor`, you agree that your contributions will be licensed under its [MIT license](LICENSE).
+
+---

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 The MIT License (MIT)
 
-Copyright (c) 2016 Dong Nguyen
+Copyright (c) 2016 Extractus
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
