README.md: 38 additions & 6 deletions

Makes requests to `urls` and saves all files found with `sources` to `directory`.

**options** - object containing the following options:

- `urls`: array of urls to load and filenames for them *(required, see example below)*
- `directory`: path to save loaded files *(required)*
- `sources`: array of objects to load; specifies selectors and attribute values to select files for loading *(optional, see example below)*
- `recursive`: boolean; if `true`, scraper will follow anchors in html files. Don't forget to set `maxDepth` to avoid infinite downloading *(optional, see example below)*
- `maxDepth`: positive number, maximum allowed depth for dependencies *(optional, see example below)*
- `request`: object, custom options for [request](https://github.com/request/request#requestoptions-callback) *(optional, see example below)*
- `subdirectories`: array of objects; specifies subdirectories for file extensions. If `null`, all files will be saved to `directory` *(optional, see example below)*
- `defaultFilename`: filename for the index page *(optional, default: 'index.html')*
- `prettifyUrls`: whether urls should be 'prettified' by having the `defaultFilename` removed *(optional, default: false)*
- `ignoreErrors`: boolean; if `true`, scraper will continue downloading resources after an error occurs; if `false`, scraper will finish the process and return the error *(optional, default: true)*
- `urlFilter`: function which is called for each url to check whether it should be scraped *(optional, see example below)*
- `filenameGenerator`: name of one of the bundled filenameGenerators, or a custom filenameGenerator function *(optional, default: 'byType')*
- `httpResponseHandler`: function which is called on each response; allows customizing a resource or rejecting its download *(optional, see example below)*

Default options can be found in [lib/config/defaults.js](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/config/defaults.js).

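As a quick orientation, here is a minimal sketch of how a few of these options fit together (the url and directory values are illustrative placeholders, not taken from this README):

```javascript
// Sketch only: illustrative urls/paths, not real defaults.
var scrape = require('website-scraper');

scrape({
  urls: ['http://example.com/'],  // required: pages to load
  directory: '/path/to/save',     // required: where files are written
  recursive: true,                // follow anchors in downloaded html
  maxDepth: 2,                    // cap recursion to avoid infinite downloading
  ignoreErrors: true              // keep going when a single resource fails
}).then(function (result) {
  console.log(result);            // resolved when scraping finishes
}).catch(function (err) {
  console.log(err);
});
```
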
When the `bySiteStructure` filenameGenerator is used, the downloaded files are saved in `directory` using the same structure as on the website.

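For example, selecting the bundled generator by name (a sketch; the url and path are placeholders):

```javascript
// Sketch: use the bundled 'bySiteStructure' generator instead of the default 'byType'.
var scrape = require('website-scraper');

scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  filenameGenerator: 'bySiteStructure'  // saved files mirror the site's own structure
}).then(console.log).catch(console.log);
```
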
#### httpResponseHandler

HttpResponseHandler is used to reject resource downloading or to customize resource text based on response data (for example, status code, content type, etc.).

The function takes a `response` argument - the response object of the [request](https://github.com/request/request) module - and should return a resolved `Promise` if the resource should be downloaded, or a `Promise` rejected with an `Error` if it should be skipped.

The promise should be resolved with:

* a `string` which contains the response body,
* or an object with properties `body` (response body, string) and `metadata` - everything you want to save for this resource (like headers, original text, timestamps, etc.). Scraper will not use this field at all; it is kept only for the result.

See [example of using httpResponseHandler](#example-5-rejecting-resources-with-404-status-and-adding-metadata).

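Beyond status codes, a handler can key off anything in the response - for instance the content type. A sketch (the `content-type` header check is illustrative, not from this README):

```javascript
// Sketch: an httpResponseHandler that skips non-html resources.
// `response` is the response object of the request module.
var htmlOnlyHandler = function (response) {
  var contentType = response.headers['content-type'] || '';
  if (contentType.indexOf('text/html') === -1) {
    return Promise.reject(new Error('skipping non-html resource'));
  }
  return Promise.resolve(response.body);  // plain string body, no metadata
};
```
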
## Examples
#### Example 1

#### Example 5. Rejecting resources with 404 status and adding metadata

```javascript
var scrape = require('website-scraper');

scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  httpResponseHandler: (response) => {
    if (response.statusCode === 404) {
      return Promise.reject(new Error('status is 404'));
    } else {
      // if you don't need metadata - you can just return Promise.resolve(response.body)
      return Promise.resolve({
        body: response.body,
        metadata: {
          headers: response.headers,
          someOtherData: [ 1, 2, 3 ]
        }
      });
    }
  }
}).then(console.log).catch(console.log);
```

## Log and debug
This module uses [debug](https://github.com/visionmedia/debug) to log events. To enable logs, you should use the environment variable `DEBUG`.

The following command will log everything from website-scraper:

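A typical invocation, following the debug module's convention (the exact namespace pattern and script name are assumptions, not verbatim from this README):

```bash
# Assumed command per the debug module's convention; adjust the script name.
DEBUG=website-scraper* node your-app.js
```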