Makes requests to `urls` and saves all files found with `sources` to `directory`.
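A minimal call might look like this (a sketch; the url and directory are placeholders):

```javascript
var scrape = require('website-scraper');

// Minimal sketch: download http://example.com/ and its resources into /path/to/save
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save'
}).then(console.log).catch(console.log);
```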
## options

* [urls](#urls) - urls to download, *required*
* [directory](#directory) - path to save files, *required*
* [sources](#sources) - selects which resources should be downloaded
* [recursive](#recursive) - follow anchors in html files
* [maxDepth](#maxdepth) - maximum depth for dependencies
* [request](#request) - custom options for [request](https://github.com/request/request)
* [subdirectories](#subdirectories) - subdirectories for file extensions
* [defaultFilename](#defaultfilename) - filename for index page
* [prettifyUrls](#prettifyurls) - prettify urls
* [ignoreErrors](#ignoreerrors) - whether to ignore errors during resource downloading
* [urlFilter](#urlfilter) - skip some urls
* [filenameGenerator](#filenamegenerator) - generate filename for downloaded resource
* [httpResponseHandler](#httpresponsehandler) - customize resources or reject their downloading based on response

You can find the default options in [lib/config/defaults.js](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/config/defaults.js).
#### urls

Array of objects which contain urls to download and filenames for them. **_Required_**.

```javascript
var scrape = require('website-scraper');
scrape({
  urls: [
    'http://nodejs.org/', // Will be saved with default filename 'index.html'
    { url: 'http://nodejs.org/about/', filename: 'about.html' },
    { url: 'http://blog.nodejs.org/', filename: 'blog.html' }
  ],
  directory: '/path/to/save'
}).then(console.log).catch(console.log);
```
#### directory

String, absolute path to the directory where downloaded files will be saved. The directory should not exist; it will be created by the scraper. **_Required_**.
#### sources

Array of objects which specify selectors and attribute values used to select files for downloading. By default the scraper tries to download all possible resources.
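For example, a configuration like the following (an illustrative sketch; the selector/attribute pairs are examples, not the module's full default list) downloads only images and stylesheets:

```javascript
var scrape = require('website-scraper');
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  sources: [
    { selector: 'img', attr: 'src' },                      // image files referenced by <img>
    { selector: 'link[rel="stylesheet"]', attr: 'href' }   // css files referenced by <link>
  ]
}).then(console.log).catch(console.log);
```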
#### recursive

Boolean, if `true` the scraper will follow anchors in html files. Don't forget to set `maxDepth` to avoid infinite downloading.

#### maxDepth

Positive number, maximum allowed depth for dependencies.

#### request

Object, custom options for [request](https://github.com/request/request#requestoptions-callback).

#### subdirectories

Array of objects, specifies subdirectories for file extensions. If `null`, all files will be saved to `directory`.

#### defaultFilename

String, filename for the index page. Defaults to `index.html`.
#### prettifyUrls
Boolean, whether urls should be 'prettified', by having the `defaultFilename` removed. Defaults to `false`.
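As a sketch, enabling it looks like this (the rewritten link shown in the comment is illustrative):

```javascript
var scrape = require('website-scraper');

// With prettifyUrls: true, links to pages saved as <path>/index.html are written
// without the trailing defaultFilename, e.g. 'about/' instead of 'about/index.html'.
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  prettifyUrls: true
}).then(console.log).catch(console.log);
```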
#### ignoreErrors

Boolean, if `true` the scraper will continue downloading resources after an error occurs; if `false`, the scraper will stop the process and return the error. Defaults to `true`.
#### urlFilter
Function which is called for each url to check whether it should be scraped. Defaults to `null` - no url filter will be applied.

```javascript
// Links to other websites are filtered out by the urlFilter
var scrape = require('website-scraper');
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  // Illustrative filter: scrape only urls that stay on example.com
  urlFilter: function (url) {
    return url.indexOf('http://example.com') === 0;
  }
}).then(console.log).catch(console.log);
```
#### filenameGenerator

String, name of one of the bundled filenameGenerators, or a custom filenameGenerator function. The filename generator determines where the scraped files are saved.
###### byType (default)
When the `byType` filenameGenerator is used the downloaded files are saved by type (as defined by the `subdirectories` setting) or directly in the `directory` folder, if no subdirectory is specified for the specific type.
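For example, a configuration like the following (a sketch; the subdirectory names and extension lists are illustrative) saves images, scripts and stylesheets into `img`, `js` and `css` subdirectories and everything else directly into `directory`:

```javascript
var scrape = require('website-scraper');

// Sketch: with the default 'byType' generator, files are placed into the subdirectory
// whose extension list matches them, e.g. .png files go to /path/to/save/img.
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  subdirectories: [
    { directory: 'img', extensions: ['.jpg', '.png', '.svg'] },
    { directory: 'js', extensions: ['.js'] },
    { directory: 'css', extensions: ['.css'] }
  ],
  filenameGenerator: 'byType'
}).then(console.log).catch(console.log);
```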
###### bySiteStructure
When the `bySiteStructure` filenameGenerator is used, the downloaded files are saved in `directory` using the same structure as on the website.
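For example, a recursive download with this generator could look like this (a sketch; the resulting paths in the comment are illustrative):

```javascript
var scrape = require('website-scraper');

// Sketch: with 'bySiteStructure', a page like http://example.com/about would typically
// end up under /path/to/save/example.com/about/ (illustrative path), with its resources
// saved alongside it following the site's own structure.
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  recursive: true,
  maxDepth: 2,
  filenameGenerator: 'bySiteStructure'
}).then(console.log).catch(console.log);
```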
#### httpResponseHandler

Function which is called on each response; allows you to customize a resource or reject its downloading.

It takes one argument - the response object of the [request](https://github.com/request/request) module - and should return a resolved `Promise` if the resource should be downloaded, or a `Promise` rejected with an `Error` if it should be skipped.

The Promise should be resolved with:

* a `string` which contains the response body,
* or an object with properties `body` (the response body, string) and `metadata` - everything you want to save for this resource (like headers, original text, timestamps, etc.). The scraper will not use this field at all; it is only returned in the result.

```javascript
var scrape = require('website-scraper');

// Rejecting resources with 404 status and adding metadata to other resources
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  // Illustrative handler: reject 404 responses, otherwise keep the body and attach headers as metadata
  httpResponseHandler: function (response) {
    if (response.statusCode === 404) {
      return Promise.reject(new Error('status is 404'));
    }
    return Promise.resolve({ body: response.body, metadata: { headers: response.headers } });
  }
}).then(console.log).catch(console.log);
```

The scrape function resolves with an array of [Resource](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/resource.js) objects which contain the `metadata` property from `httpResponseHandler`.
## callback

The callback function is optional and includes the following parameters:

- `error`: if error - `Error` object, if success - `null`
- `result`: if error - `null`, if success - array of [Resource](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/resource.js) objects containing:
  - `url`: url of loaded page
  - `filename`: filename where page was saved (relative to `directory`)
  - `children`: array of children Resources
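A minimal sketch of the callback form (instead of using the returned promise; url and directory are placeholders):

```javascript
var scrape = require('website-scraper');

// Sketch: same options as in the examples above, but with a node-style callback
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save'
}, function (error, result) {
  if (error) {
    return console.log(error);
  }
  result.forEach(function (resource) {
    console.log(resource.url + ' => ' + resource.filename);
  });
});
```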
## Log and debug

This module uses [debug](https://github.com/visionmedia/debug) to log events. To enable logs, set the `DEBUG` environment variable.