Commit 423792d

Update Readme.md (#189)
1 parent 52fc13e commit 423792d

File tree: 1 file changed (+106, -97 lines)


README.md

Lines changed: 106 additions & 97 deletions
@@ -31,129 +31,117 @@ var options = {
  directory: '/path/to/save/',
};

-// with callback
-scrape(options, function (error, result) {
+// with promise
+scrape(options).then((result) => {
+  /* some code here */
+}).catch((err) => {
  /* some code here */
});

-// or with promise
-scrape(options).then(function (result) {
+// or with callback
+scrape(options, (error, result) => {
  /* some code here */
});
```

-## API
-### scrape(options, callback)
-Makes requests to `urls` and saves all files found with `sources` to `directory`.
-
-**options** - object containing next options:
-
-- `urls`: array of urls to load and filenames for them *(required, see example below)*
-- `directory`: path to save loaded files *(required)*
-- `sources`: array of objects to load, specifies selectors and attribute values to select files for loading *(optional, see example below)*
-- `recursive`: boolean, if `true` scraper will follow anchors in html files. Don't forget to set `maxDepth` to avoid infinite downloading *(optional, see example below)*
-- `maxDepth`: positive number, maximum allowed depth for dependencies *(optional, see example below)*
-- `request`: object, custom options for [request](https://github.com/request/request#requestoptions-callback) *(optional, see example below)*
-- `subdirectories`: array of objects, specifies subdirectories for file extensions. If `null` all files will be saved to `directory` *(optional, see example below)*
-- `defaultFilename`: filename for index page *(optional, default: 'index.html')*
-- `prettifyUrls`: whether urls should be 'prettified', by having the `defaultFilename` removed *(optional, default: false)*
-- `ignoreErrors`: boolean, if `true` scraper will continue downloading resources after error occured, if `false` - scraper will finish process and return error *(optional, default: true)*
-- `urlFilter`: function which is called for each url to check whether it should be scraped. *(optional, see example below)*
-- `filenameGenerator`: name of one of the bundled filenameGenerators, or a custom filenameGenerator function *(optional, default: 'byType')*
-- `httpResponseHandler`: function which is called on each response, allows to customize resource or reject its downloading *(optional, see example below)*
+## options
+* [urls](#urls) - urls to download, *required*
+* [directory](#directory) - path to save files, *required*
+* [sources](#sources) - selects which resources should be downloaded
+* [recursive](#recursive) - follow anchors in html files
+* [maxDepth](#maxdepth) - maximum depth for dependencies
+* [request](#request) - custom options for [request](https://github.com/request/request)
+* [subdirectories](#subdirectories) - subdirectories for file extensions
+* [defaultFilename](#defaultfilename) - filename for index page
+* [prettifyUrls](#prettifyurls) - prettify urls
+* [ignoreErrors](#ignoreerrors) - whether to ignore errors on resource downloading
+* [urlFilter](#urlfilter) - skip some urls
+* [filenameGenerator](#filenamegenerator) - generate filename for downloaded resource
+* [httpResponseHandler](#httpresponsehandler) - customize http response handling

Default options can be found in [lib/config/defaults.js](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/config/defaults.js).

-
-**callback** - callback function *(optional)*, includes following parameters:
-
-- `error`: if error - `Error` object, if success - `null`
-- `result`: if error - `null`, if success - array of [Resource](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/resource.js) objects containing:
-- `url`: url of loaded page
-- `filename`: filename where page was saved (relative to `directory`)
-- `children`: array of children Resources
-
-### Filename Generators
-The filename generator determines where the scraped files are saved.
-
-#### byType (default)
-When the `byType` filenameGenerator is used the downloaded files are saved by type (as defined by the `subdirectories` setting)
-or directly in the `directory` folder, if no subdirectory is specified for the specific type.
-
-#### bySiteStructure
-When the `bySiteStructure` filenameGenerator is used the downloaded files are saved in `directory` using same structure as on the website:
-- `/` => `DIRECTORY/index.html`
-- `/about` => `DIRECTORY/about/index.html`
-- `/resources/javascript/libraries/jquery.min.js` => `DIRECTORY/resources/javascript/libraries/jquery.min.js`
-
-### Http Response Handlers
-HttpResponseHandler is used to reject resource downloading or customize resource text based on response data (for example, status code, content type, etc.)
-Function takes `response` argument - response object of [request](https://github.com/request/request) module and should return resolved `Promise` if resource should be downloaded or rejected with Error `Promise` if it should be skipped.
-Promise should be resolved with:
-* `string` which contains response body
-* or object with properies `body` (response body, string) and `metadata` - everything you want to save for this resource (like headers, original text, timestamps, etc.), scraper will not use this field at all, it is only for result.
-
-See [example of using httpResponseHandler](#example-5-rejecting-resources-with-404-status-and-adding-metadata).
-
-## Examples
-#### Example 1
-Let's scrape some pages from [http://nodejs.org/](http://nodejs.org/) with images, css, js files and save them to `/path/to/save/`.
-Imagine we want to load:
-- [Home page](http://nodejs.org/) to `index.html`
-- [About page](http://nodejs.org/about/) to `about.html`
-- [Blog](http://blog.nodejs.org/) to `blog.html`
-
-and separate files into directories:
-
-- `img` for .jpg, .png, .svg (full path `/path/to/save/img`)
-- `js` for .js (full path `/path/to/save/js`)
-- `css` for .css (full path `/path/to/save/css`)
-
+#### urls
+Array of objects which contain urls to download and filenames for them. **_Required_**.
```javascript
-var scrape = require('website-scraper');
scrape({
  urls: [
    'http://nodejs.org/', // Will be saved with default filename 'index.html'
    {url: 'http://nodejs.org/about', filename: 'about.html'},
    {url: 'http://blog.nodejs.org/', filename: 'blog.html'}
  ],
+  directory: '/path/to/save'
+}).then(console.log).catch(console.log);
+```
+
+#### directory
+String, absolute path to directory where downloaded files will be saved. Directory should not exist. It will be created by scraper. **_Required_**.
+
+#### sources
+Array of objects to download, specifies selectors and attribute values to select files for downloading. By default scraper tries to download all possible resources.
+```javascript
+// Downloading images, css files and scripts
+scrape({
+  urls: ['http://nodejs.org/'],
  directory: '/path/to/save',
-  subdirectories: [
-    {directory: 'img', extensions: ['.jpg', '.png', '.svg']},
-    {directory: 'js', extensions: ['.js']},
-    {directory: 'css', extensions: ['.css']}
-  ],
  sources: [
    {selector: 'img', attr: 'src'},
    {selector: 'link[rel="stylesheet"]', attr: 'href'},
    {selector: 'script', attr: 'src'}
-  ],
+  ]
+}).then(console.log).catch(console.log);
+```
+
+#### recursive
+Boolean, if `true` scraper will follow anchors in html files. Don't forget to set `maxDepth` to avoid infinite downloading. Defaults to `false`.
+
+#### maxDepth
+Positive number, maximum allowed depth for dependencies. Defaults to `null` - no maximum depth set.
+
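As a quick sketch of how these two options combine (the url and depth value below are only illustrative):

```javascript
var scrape = require('website-scraper');

// Follow links found on the start page, but stop one level deep:
// pages linked from http://example.com are downloaded, while links found on
// those pages are not followed because their depth exceeds maxDepth.
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  recursive: true,
  maxDepth: 1
}).then(console.log).catch(console.log);
```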
+#### request
+Object, custom options for [request](https://github.com/request/request#requestoptions-callback). Allows setting cookies, userAgent, etc.
+```javascript
+scrape({
+  urls: ['http://example.com/'],
+  directory: '/path/to/save',
  request: {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19'
    }
  }
-}).then(function (result) {
-  console.log(result);
-}).catch(function(err){
-  console.log(err);
-});
+}).then(console.log).catch(console.log);
```

-#### Example 2. Recursive downloading
+#### subdirectories
+Array of objects, specifies subdirectories for file extensions. If `null` all files will be saved to `directory`.
```javascript
-// Links from example.com will be followed
-// Links from links will be ignored because theirs depth = 2 is greater than maxDepth
-var scrape = require('website-scraper');
+/* Separate files into directories:
+  - `img` for .jpg, .png, .svg (full path `/path/to/save/img`)
+  - `js` for .js (full path `/path/to/save/js`)
+  - `css` for .css (full path `/path/to/save/css`)
+*/
scrape({
-  urls: ['http://example.com/'],
+  urls: ['http://example.com'],
  directory: '/path/to/save',
-  recursive: true,
-  maxDepth: 1
+  subdirectories: [
+    {directory: 'img', extensions: ['.jpg', '.png', '.svg']},
+    {directory: 'js', extensions: ['.js']},
+    {directory: 'css', extensions: ['.css']}
+  ]
}).then(console.log).catch(console.log);
```

-#### Example 3. Filtering out external resources
+#### defaultFilename
+String, filename for index page. Defaults to `index.html`.
+
+#### prettifyUrls
+Boolean, whether urls should be 'prettified' by having the `defaultFilename` removed. Defaults to `false`.
+
+#### ignoreErrors
+Boolean, if `true` scraper will continue downloading resources after an error occurs; if `false`, scraper will finish the process and return the error. Defaults to `true`.
+
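A minimal sketch with these three options set to non-default values (the values shown are illustrative, not recommendations):

```javascript
var scrape = require('website-scraper');

scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  defaultFilename: 'page.html', // index pages are saved as page.html instead of index.html
  prettifyUrls: true,           // generated links have the defaultFilename stripped
  ignoreErrors: false           // stop and return the error when a resource fails to download
}).then(console.log).catch(console.log);
```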
+#### urlFilter
+Function which is called for each url to check whether it should be scraped. Defaults to `null` - no url filter will be applied.
```javascript
// Links to other websites are filtered out by the urlFilter
var scrape = require('website-scraper');
@@ -166,28 +154,40 @@ scrape({
}).then(console.log).catch(console.log);
```

-#### Example 4. Downloading an entire website
+#### filenameGenerator
+String, name of one of the bundled filenameGenerators, or a custom filenameGenerator function. Filename generator determines where the scraped files are saved.
+
+###### byType (default)
+When the `byType` filenameGenerator is used the downloaded files are saved by type (as defined by the `subdirectories` setting) or directly in the `directory` folder, if no subdirectory is specified for the specific type.
+
+###### bySiteStructure
+When the `bySiteStructure` filenameGenerator is used the downloaded files are saved in `directory` using same structure as on the website:
+- `/` => `DIRECTORY/index.html`
+- `/about` => `DIRECTORY/about/index.html`
+- `/resources/javascript/libraries/jquery.min.js` => `DIRECTORY/resources/javascript/libraries/jquery.min.js`
+
```javascript
-// Downloads all the crawlable files of example.com.
-// The files are saved in the same structure as the structure of the website, by using the `bySiteStructure` filenameGenerator.
+// Downloads all the crawlable files. The files are saved in the same structure as the structure of the website
// Links to other websites are filtered out by the urlFilter
var scrape = require('website-scraper');
scrape({
  urls: ['http://example.com/'],
-  urlFilter: function(url){
-    return url.indexOf('http://example.com') === 0;
-  },
+  urlFilter: function(url){ return url.indexOf('http://example.com') === 0; },
  recursive: true,
  maxDepth: 100,
-  prettifyUrls: true,
  filenameGenerator: 'bySiteStructure',
  directory: '/path/to/save'
}).then(console.log).catch(console.log);
```

-#### Example 5. Rejecting resources with 404 status and adding metadata
+#### httpResponseHandler
+Function which is called on each response, allows customizing a resource or rejecting its download.
+It takes 1 argument - response object of [request](https://github.com/request/request) module and should return resolved `Promise` if resource should be downloaded or rejected with Error `Promise` if it should be skipped.
+Promise should be resolved with:
+* `string` which contains response body
+* or object with properties `body` (response body, string) and `metadata` - everything you want to save for this resource (like headers, original text, timestamps, etc.), scraper will not use this field at all, it is only for result.
```javascript
-var scrape = require('website-scraper');
+// Rejecting resources with 404 status and adding metadata to other resources
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
@@ -207,6 +207,15 @@ scrape({
  }
}).then(console.log).catch(console.log);
```
+The scrape function resolves with an array of [Resource](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/resource.js) objects which contain the `metadata` property from `httpResponseHandler`.
+
+## callback
+Callback function, optional, includes the following parameters:
+- `error`: if error - `Error` object, if success - `null`
+- `result`: if error - `null`, if success - array of [Resource](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/resource.js) objects containing:
+  - `url`: url of loaded page
+  - `filename`: filename where page was saved (relative to `directory`)
+  - `children`: array of children Resources
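A minimal sketch of the callback form using the parameters listed above (the urls and directory are illustrative):

```javascript
var scrape = require('website-scraper');

scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save'
}, function (error, result) {
  if (error) {
    return console.error(error); // Error object; result is null
  }
  result.forEach(function (resource) {
    // each Resource has url, filename (relative to directory) and children
    console.log(resource.url, '->', resource.filename);
  });
});
```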

## Log and debug
This module uses [debug](https://github.com/visionmedia/debug) to log events. To enable logs you should use environment variable `DEBUG`.
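One way to enable it from code, assuming the module logs under a `website-scraper` namespace (the namespace pattern is an assumption, not confirmed here); setting the `DEBUG` environment variable in the shell before starting node works the same way:

```javascript
// Turn on debug output before the module is loaded.
// 'website-scraper*' is an assumed namespace pattern; adjust it to the
// namespaces the module actually logs under.
process.env.DEBUG = 'website-scraper*';
var scrape = require('website-scraper');
```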
