Commit 67e65c0

Rewrite package to emit simplified structure
1 parent ab5bf1a commit 67e65c0

603 files changed: +187688 −71172 lines changed


.eslintrc.js

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+module.exports = {
+  parser: "@typescript-eslint/parser",
+  extends: [
+    // "eslint:recommended",
+    // "plugin:@typescript-eslint/eslint-recommended",
+    "plugin:@typescript-eslint/recommended-requiring-type-checking",
+    "prettier/@typescript-eslint",
+    "plugin:prettier/recommended"
+  ],
+  parserOptions: {
+    ecmaVersion: 2018,
+    sourceType: "module",
+    project: "tsconfig.json"
+  }
+};
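The notable choice in this config is `plugin:@typescript-eslint/recommended-requiring-type-checking`: with `parserOptions.project` pointing at `tsconfig.json`, ESLint gains full type information and can enforce type-aware rules. A minimal sketch (not part of the commit; the function names are invented) of the kind of bug such rules catch, e.g. `@typescript-eslint/no-floating-promises`:

```javascript
// Hypothetical illustration of a type-aware rule: no-floating-promises
// flags async calls whose returned promise is silently dropped.
async function publish(name) {
  return `published ${name}`;
}

// publish("scrappy");              // would be flagged: promise is dropped
const pending = publish("scrappy"); // handled below, so the rule is satisfied

pending.then((message) => console.log(message)); // logs the resolved value
```

Rules like this only work when the linter can see the declared return type, which is why the config wires `project` into `parserOptions`.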

.gitignore

Lines changed: 0 additions & 1 deletion
@@ -4,4 +4,3 @@ coverage/
 node_modules/
 npm-debug.log
 dist/
-typings/

.travis.yml

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ notifications:
   on_failure: change
 
 node_js:
+  - "10"
   - "stable"
 
 after_script: "npm install coveralls@2 && cat ./coverage/lcov.info | coveralls"

README.md

Lines changed: 20 additions & 37 deletions
@@ -4,66 +4,49 @@
 [![NPM downloads](https://img.shields.io/npm/dm/scrappy.svg?style=flat)](https://npmjs.org/package/scrappy)
 [![Build status](https://img.shields.io/travis/blakeembrey/node-scrappy.svg?style=flat)](https://travis-ci.org/blakeembrey/node-scrappy)
 [![Test coverage](https://img.shields.io/coveralls/blakeembrey/node-scrappy.svg?style=flat)](https://coveralls.io/r/blakeembrey/node-scrappy?branch=master)
-[![Greenkeeper badge](https://badges.greenkeeper.io/blakeembrey/node-scrappy.svg)](https://greenkeeper.io/)
 
 > Extract rich metadata from URLs.
 
 [Try it using Runkit!](https://runkit.com/blakeembrey/scrappy)
 
 ## Installation
 
-```sh
+```
 npm install scrappy --save
 ```
 
 ## Usage
 
-**Scrappy** uses a simple two step process to extract the metadata from any URL or file. First, it runs through plugin-able `scrapeStream` middleware to extract metadata about the file itself. With the result in hand, it gets passed on to a plugin-able `extract` pipeline to format the metadata for presentation and extract additional metadata about related entities.
-
-### Scraping
-
-#### `scrapeUrl`
+**Scrappy** attempts to parse and extract rich structured metadata from URLs.
 
-```ts
-function scrapeUrl(url: string, plugin?: Plugin): Promise<ScrapeResult>
+```js
+import { scraper, urlScraper } from "scrappy";
 ```
 
-Makes the HTTP request and passes the response into `scrapeResponse`.
-
-#### `scrapeResponse`
-
-```ts
-function scrapeResponse (res: Response, plugin?: Plugin): Promise<ScrapeResult>
-```
+### Scraper
 
-Accepts a HTTP response object and transforms it into `scrapeStream`.
+Accepts a `request` function and optional `plugins` array. The request is expected to return a "page" object, which is the same shape as the input to `scrape(page)`.
 
-#### `scrapeStream`
+```js
+const scrape = scraper({ request });
+const res = await fetch("http://example.com"); // E.g. `popsicle`.
 
-```ts
-function scrapeStream (stream: Readable, input: ScrapeResult, abort?: () => void, plugin = DEFAULT_SCRAPER): Promise<ScrapeResult>
+await scrape({
+  url: res.url,
+  status: res.status,
+  headers: res.headers.asObject(),
+  body: res.stream() // Must stream the request instead of buffering to support large responses.
+});
 ```
 
-Accepts a readable stream and input scrape result (at a minimum should have `url`, but could add other known metadata - e.g. from HTTP headers), and returns the scrape result after running through the plugin function. It also accepts an `abort` function, which can be used to close the stream early.
-
-The default plugins are in the [`plugins/` directory](src/scrape/plugins) and combined into a single pipeline using `compose` (based on `throwback`, but calls `next(stream)` to pass a stream forward).
-
-### Extraction
-
-Extraction is based on a single function, `extract`. It accepts the scrape result, and an optional array of helpers. The default extraction maps the scrape result into a proprietary format useful for applications to visualize. After the extraction is done, it iterates over each of the helper functions to transform the extracted snippet.
-
-Some built-in extraction helpers are available in the [`helpers/` directory](src/extract/helpers), including a default favicon selector and image dimension extraction.
-
-### Example
-
-This example uses [`scrapeAndExtract`](src/index.ts) (a simple wrapper around `scrapeUrl` and `extract`) to retrieve metadata from a webpage. In your own application, you may want to write your own `makeRequest` function or override other parts of the pipeline (e.g. to enable caching or customize the user-agent, etc).
+### URL Scraper
 
-```ts
-import { scrapeAndExtract } from 'scrappy'
+Simpler wrapper around `scraper` that automatically makes a `request(url)` for the page.
 
-const url = 'https://medium.com/slack-developer-blog/everything-you-ever-wanted-to-know-about-unfurling-but-were-afraid-to-ask-or-how-to-make-your-e64b4bb9254#.a0wjf4ltt'
+```js
+const scrape = urlScraper({ request });
 
-scrapeAndExtract(url).then(console.log.bind(console))
+await scrape("http://example.com");
 ```
 
 ## License

fixtures/http!cloudinary.com!pricing/body

Lines changed: 10 additions & 0 deletions
Large diffs are not rendered by default.

fixtures/http!cloudinary.com!pricing/meta.json

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+{
+  "url": "https://cloudinary.com/pricing",
+  "headers": {
+    ":status": "200",
+    "date": "Sun, 02 Feb 2020 00:36:52 GMT",
+    "content-type": "text/html",
+    "etag": "W/\"5e361202-693c\"",
+    "last-modified": "Sun, 02 Feb 2020 00:04:18 GMT",
+    "strict-transport-security": "max-age=86400",
+    "cf-cache-status": "DYNAMIC",
+    "expect-ct": "max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"",
+    "alt-svc": "h3-24=\":443\"; ma=86400, h3-23=\":443\"; ma=86400",
+    "server": "cloudflare",
+    "cf-ray": "55e817e66f3ded87-SJC",
+    "content-encoding": "br"
+  },
+  "status": 200
+}

fixtures/http!cnn.com/body

Lines changed: 189 additions & 0 deletions
Large diffs are not rendered by default.

fixtures/http!cnn.com/meta.json

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+{
+  "url": "https://www.cnn.com/",
+  "headers": {
+    ":status": "200",
+    "content-type": "text/html; charset=utf-8",
+    "x-servedbyhost": "::ffff:127.0.0.1",
+    "access-control-allow-origin": "*",
+    "cache-control": "max-age=60",
+    "content-security-policy": "default-src 'self' blob: https://*.cnn.com:* http://*.cnn.com:* *.cnn.io:* *.cnn.net:* *.turner.com:* *.turner.io:* *.ugdturner.com:* courageousstudio.com *.vgtf.net:*; script-src 'unsafe-eval' 'unsafe-inline' 'self' *; style-src 'unsafe-inline' 'self' blob: *; child-src 'self' blob: *; frame-src 'self' *; object-src 'self' *; img-src 'self' data: blob: *; media-src 'self' data: blob: *; font-src 'self' data: *; connect-src 'self' *; frame-ancestors 'self' https://*.cnn.com:* http://*.cnn.com https://*.cnn.io:* http://*.cnn.io:* *.turner.com:* courageousstudio.com;",
+    "x-content-type-options": "nosniff",
+    "x-xss-protection": "1; mode=block",
+    "content-encoding": "gzip",
+    "via": "1.1 varnish, 1.1 varnish",
+    "accept-ranges": "bytes",
+    "date": "Sun, 02 Feb 2020 00:38:12 GMT",
+    "age": "135",
+    "set-cookie": [
+      "countryCode=US; Domain=.cnn.com; Path=/; SameSite=Lax",
+      "geoData=san francisco|CA|94103|US|NA|-800|broadband; Domain=.cnn.com; Path=/; SameSite=Lax",
+      "FastAB=0=5517,1=3181,2=7341,3=4965,4=4299,5=0796,6=9752,7=1669,8=6378,9=3267; Domain=.cnn.com; Path=/; Expires=Thu Jul 01 2021 00:00:00 GMT; SameSite=Lax",
+      "tryThing01=9516; Domain=.cnn.com; Path=/; Expires=Sun Mar 01 2020 00:00:00 GMT; SameSite=Lax",
+      "tryThing02=5782; Domain=.cnn.com; Path=/; Expires=Wed Jan 01 2020 00:00:00 GMT; SameSite=Lax"
+    ],
+    "x-served-by": "cache-iad2127-IAD, cache-sea4437-SEA",
+    "x-cache": "HIT, HIT",
+    "x-cache-hits": "5, 37",
+    "x-timer": "S1580603892.073164,VS0,VE0",
+    "vary": "Accept-Encoding",
+    "content-length": "154194"
+  },
+  "status": 200
+}

fixtures/http!d.pr!a!q3z9/body

Lines changed: 54 additions & 0 deletions
Large diffs are not rendered by default.

fixtures/http!d.pr!a!q3z9/meta.json

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+{
+  "url": "https://d.pr/a/q3z9",
+  "headers": {
+    ":status": "200",
+    "date": "Sun, 02 Feb 2020 00:36:36 GMT",
+    "content-type": "text/html; charset=utf-8",
+    "content-length": "26392",
+    "set-cookie": [
+      "AWSALB=flDMeldhwHQSmDOX9UeHGfURtsUc3F8JPwOUWgc4ijHNPMOxCrHpSJEoJpW/2fFTjEOVvSZwRkHuH/Sk0fPhAjoka9tKsqr289S99UNUPljkQRhaqt2iPK7GDIQL; Expires=Sun, 09 Feb 2020 00:36:36 GMT; Path=/",
+      "AWSALBCORS=flDMeldhwHQSmDOX9UeHGfURtsUc3F8JPwOUWgc4ijHNPMOxCrHpSJEoJpW/2fFTjEOVvSZwRkHuH/Sk0fPhAjoka9tKsqr289S99UNUPljkQRhaqt2iPK7GDIQL; Expires=Sun, 09 Feb 2020 00:36:36 GMT; Path=/; SameSite=None; Secure"
+    ],
+    "server": "nginx/1.15.7",
+    "content-security-policy": "frame-ancestors d.pr http://d.pr https://d.pr",
+    "etag": "W/\"6718-Wb1t0BmfkgcqUNKOJLrxMXkkH+M\""
+  },
+  "status": 200
+}
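Each fixture pairs a raw `body` file with a `meta.json` recording the captured `url`, `headers`, and `status`. The recorded headers keep the HTTP/2 `:status` pseudo-header, which is not a real header field. A hedged sketch (the helper name is invented, not from the repository) of turning such a record into the non-body part of a page object:

```javascript
// Hypothetical helper: build the { url, status, headers } part of a page
// from a fixture's meta.json, dropping HTTP/2 pseudo-headers (":status").
function pageFromMeta(meta) {
  const headers = {};
  for (const [name, value] of Object.entries(meta.headers)) {
    if (!name.startsWith(":")) headers[name] = value; // skip pseudo-headers
  }
  return { url: meta.url, status: meta.status, headers };
}

const page = pageFromMeta({
  url: "https://d.pr/a/q3z9",
  status: 200,
  headers: { ":status": "200", "content-type": "text/html; charset=utf-8" }
});
// page.headers now contains only real header fields.
```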
