## Introduction
Node.js module for scraping websites with their images, css, js, etc.

[Build Status](https://travis-ci.org/s0ph1e/node-website-scraper)
[Code Climate](https://codeclimate.com/github/s0ph1e/node-website-scraper)
[Version](https://www.npmjs.org/package/website-scraper)
[Downloads](https://www.npmjs.org/package/website-scraper)
[Dependency Status](https://david-dm.org/s0ph1e/node-website-scraper)

[website-scraper on npm](https://www.npmjs.org/package/website-scraper)

## Installation
`npm install website-scraper`

## Usage
```javascript
var scraper = require('website-scraper');
var options = {
  url: 'http://nodejs.org/',
  directory: '/path/to/save/'
};

// with callback
scraper.scrape(options, function (error, result) {
  /* some code here */
});

// or with promise
scraper.scrape(options).then(function (result) {
  /* some code here */
});
```
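
The promise form also lends itself to explicit error handling. A minimal sketch (the log messages are illustrative, and `.catch` is assumed to behave as in standard promise implementations):

```javascript
scraper.scrape(options).then(function (result) {
  // result is an array of objects describing the saved pages
  console.log('Saved ' + result.length + ' page(s)');
}).catch(function (error) {
  // any request or filesystem error ends up here
  console.error('Scraping failed: ' + error);
});
```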

## API
### scrape(options, callback)
Makes a request to `url` and saves all files found with `srcToLoad` to `directory`.

**options** - object containing the following options:

 - `url:` url to load *(required)*
 - `directory:` path to save loaded files *(required)*
 - `paths:` array of objects containing urls or relative paths to load and filenames for them (if not set, only `url` will be loaded) *(optional, see example below)*
 - `log:` boolean indicating whether to write the log to the console *(optional, default: false)*
 - `indexFile:` filename for the index page *(optional, default: 'index.html')*
 - `srcToLoad:` array of objects specifying selectors and attribute values to select files for loading *(optional, see default value in `lib/defaults.js`)*
 - `subdirectories:` array of objects specifying subdirectories for extensions; if `null`, all files will be saved to `directory` *(optional, see example below)*

**callback** - callback function *(optional)*, includes the following parameters:

 - `error:` if error - `Error` object, if success - `null`
 - `result:` if error - `null`, if success - array of objects, each containing:
   - `url:` url of loaded page
   - `filename:` absolute filename where page was saved
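
For instance, a successful result for two loaded pages might look roughly like this (the values are illustrative, assuming `directory` was set to `/path/to/save`):

```javascript
[
  { url: 'http://nodejs.org/', filename: '/path/to/save/index.html' },
  { url: 'http://nodejs.org/about', filename: '/path/to/save/about.html' }
]
```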

## Examples
Let's scrape some pages from [http://nodejs.org/](http://nodejs.org/) with images, css and js files and save them to `/path/to/save/`.
Imagine we want to load:
 - [Home page](http://nodejs.org/) to `index.html`
 - [About page](http://nodejs.org/about/) to `about.html`
 - [Blog](http://blog.nodejs.org/) to `blog.html`

and separate files into directories:

 - `img` for .jpg, .png, .svg (full path `/path/to/save/img`)
 - `js` for .js (full path `/path/to/save/js`)
 - `css` for .css (full path `/path/to/save/css`)

```javascript
scraper.scrape({
  url: 'http://nodejs.org/',
  directory: '/path/to/save',
  paths: [
    {path: '/', filename: 'index.html'},
    {path: '/about', filename: 'about.html'},
    {url: 'http://blog.nodejs.org/', filename: 'blog.html'}
  ],
  subdirectories: [
    {directory: 'img', extensions: ['.jpg', '.png', '.svg']},
    {directory: 'js', extensions: ['.js']},
    {directory: 'css', extensions: ['.css']}
  ],
  srcToLoad: [
    {selector: 'img', attr: 'src'},
    {selector: 'link[rel="stylesheet"]', attr: 'href'},
    {selector: 'script', attr: 'src'}
  ]
}).then(function (result) {
  console.log(result);
});
```
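
The remaining documented options combine with the callback form the same way. A minimal sketch (the index filename here is hypothetical):

```javascript
scraper.scrape({
  url: 'http://nodejs.org/',
  directory: '/path/to/save',
  indexFile: 'main.html',  // save the index page as main.html instead of index.html
  log: true                // write the log to the console
}, function (error, result) {
  if (error) {
    return console.error(error);
  }
  console.log(result);
});
```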