Commit f0b8146

Update documentation in README to latest
1 parent e1383e3 commit f0b8146
1 file changed: README.md (+41 −33 lines)
## Usage

**Scrappy** uses a simple two-step process to extract metadata from any URL or file. First, it runs the input through pluggable `scrapeStream` middleware to extract metadata about the file itself. The result is then passed to a pluggable `extract` pipeline, which formats the metadata for presentation and extracts additional metadata about related entities.
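The two-step shape can be sketched with placeholder types (the names and signatures here are illustrative, not scrappy's actual API):

```ts
// Illustrative only: placeholder types standing in for scrappy's real ones.
interface ScrapeResult { url: string }
interface Snippet { contentUrl: string }

// Step 1 produces metadata about the file; step 2 formats it for presentation.
async function twoStepSketch (
  url: string,
  scrape: (url: string) => Promise<ScrapeResult>,
  extract: (result: ScrapeResult) => Snippet
): Promise<Snippet> {
  const result = await scrape(url) // scrape middleware runs here
  return extract(result)           // extract pipeline runs here
}
```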
### Scraping

#### `scrapeUrl`

```ts
function scrapeUrl (url: string, plugin?: Plugin): Promise<ScrapeResult>
```

Makes the HTTP request and passes the response into `scrapeResponse`.
#### `scrapeResponse`

```ts
function scrapeResponse (res: Response, plugin?: Plugin): Promise<ScrapeResult>
```

Accepts an HTTP response object and passes it on to `scrapeStream`.
#### `scrapeStream`

```ts
function scrapeStream (stream: Readable, input: ScrapeResult, abort?: () => void, plugin = DEFAULT_SCRAPER): Promise<ScrapeResult>
```

Accepts a readable stream and an input scrape result (at a minimum it should have `url`, but other known metadata, e.g. from HTTP headers, can be included), and returns the scrape result after running it through the plugin function. It also accepts an `abort` function, which can be used to close the stream early.
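As a rough sketch of how a plugin might use `abort` (the types and the `sniffFirstChunk` plugin below are hypothetical stand-ins, not part of scrappy):

```ts
import { Readable } from 'stream'

// Hypothetical stand-ins for scrappy's types, for illustration only.
interface ScrapeResult { url: string; encodingFormat?: string }
type Plugin = (stream: Readable, result: ScrapeResult, abort: () => void) => Promise<ScrapeResult>

// Mirrors the documented parameter order: the caller supplies `abort`
// (e.g. to tear down the underlying HTTP request) and the plugin calls
// it once it has read enough of the stream.
function scrapeStreamSketch (
  stream: Readable,
  input: ScrapeResult,
  abort: () => void,
  plugin: Plugin
): Promise<ScrapeResult> {
  return plugin(stream, input, abort)
}

// Hypothetical plugin: sniff only the first chunk, then close the stream early.
const sniffFirstChunk: Plugin = (stream, result, abort) =>
  new Promise((resolve) => {
    stream.once('data', (chunk) => {
      if (String(chunk).trimStart().startsWith('<')) result.encodingFormat = 'html'
      abort() // we have what we need; stop reading
      resolve(result)
    })
  })
```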
The default plugins are in the [`plugins/` directory](src/scrape/plugins) and are combined into a single pipeline using `compose` (based on `throwback`, but calling `next(stream)` to pass a stream forward).
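A minimal sketch of a `throwback`-style `compose`, assuming middleware of the shape `(value, next) => result`; the names are illustrative and this is not scrappy's actual implementation:

```ts
// Illustrative only: each middleware may transform the value before
// calling next(), so a transformed stream can be passed forward.
type Middleware<T, R> = (value: T, next: (value: T) => Promise<R>) => Promise<R>

function composeSketch<T, R> (middleware: Middleware<T, R>[]): Middleware<T, R> {
  return (value, done) => {
    const dispatch = (i: number, v: T): Promise<R> =>
      i < middleware.length
        ? middleware[i](v, (next) => dispatch(i + 1, next))
        : done(v)
    return dispatch(0, value)
  }
}
```

For example, composing two string middleware `(v, next) => next(v + '-a')` and `(v, next) => next(v + '-b')` and running the pipeline on `'x'` yields `'x-a-b'`.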
### Extraction

Extraction is based on a single function, `extract`. It accepts the scrape result and an optional array of helpers. The default extraction maps the scrape result into a proprietary format that applications can use for display. After extraction, each helper function is applied in turn to transform the extracted snippet.
Some built-in extraction helpers are available in the [`helpers/` directory](src/extract/helpers), including a default favicon selector and image dimension extraction.
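One way to picture the extract-then-helpers flow (the `Snippet` shape and helper signature below are assumptions for illustration; scrappy's real types live in the package):

```ts
// Illustrative only: a sketch of how `extract` might fold helpers over
// the snippet built from the scrape result.
interface Snippet { contentUrl: string; headline?: string }
type Helper = (snippet: Snippet) => Snippet

function extractSketch (result: { url: string; title?: string }, helpers: Helper[] = []): Snippet {
  // Map the scrape result into the presentation format...
  let snippet: Snippet = { contentUrl: result.url, headline: result.title }
  // ...then let each helper transform the snippet in turn.
  for (const helper of helpers) snippet = helper(snippet)
  return snippet
}
```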
### Example

This example uses [`scrapeAndExtract`](src/index.ts) (a simple wrapper around `scrapeUrl` and `extract`) to retrieve metadata from a webpage. In your own application, you may want to write your own `makeRequest` function or override other parts of the pipeline (e.g. to enable caching or customize the user agent).
```ts
import { scrapeAndExtract } from 'scrappy'

const url = 'https://medium.com/slack-developer-blog/everything-you-ever-wanted-to-know-about-unfurling-but-were-afraid-to-ask-or-how-to-make-your-e64b4bb9254#.a0wjf4ltt'

scrapeAndExtract(url).then(console.log.bind(console))
```

## Development
