Commit 423792d

Update Readme.md (#189)
1 parent 52fc13e commit 423792d

File tree: 1 file changed (+106, -97 lines)


README.md

Lines changed: 106 additions & 97 deletions
@@ -31,129 +31,117 @@ var options = {
  directory: '/path/to/save/',
};

-// with callback
-scrape(options, function (error, result) {
+// with promise
+scrape(options).then((result) => {
+  /* some code here */
+}).catch((err) => {
  /* some code here */
});

-// or with promise
-scrape(options).then(function (result) {
+// or with callback
+scrape(options, (error, result) => {
  /* some code here */
});
```

-## API
-### scrape(options, callback)
-Makes requests to `urls` and saves all files found with `sources` to `directory`.
-
-**options** - object containing next options:
-
-- `urls`: array of urls to load and filenames for them *(required, see example below)*
-- `directory`: path to save loaded files *(required)*
-- `sources`: array of objects to load, specifies selectors and attribute values to select files for loading *(optional, see example below)*
-- `recursive`: boolean, if `true` scraper will follow anchors in html files. Don't forget to set `maxDepth` to avoid infinite downloading *(optional, see example below)*
-- `maxDepth`: positive number, maximum allowed depth for dependencies *(optional, see example below)*
-- `request`: object, custom options for [request](https://github.com/request/request#requestoptions-callback) *(optional, see example below)*
-- `subdirectories`: array of objects, specifies subdirectories for file extensions. If `null` all files will be saved to `directory` *(optional, see example below)*
-- `defaultFilename`: filename for index page *(optional, default: 'index.html')*
-- `prettifyUrls`: whether urls should be 'prettified', by having the `defaultFilename` removed *(optional, default: false)*
-- `ignoreErrors`: boolean, if `true` scraper will continue downloading resources after error occured, if `false` - scraper will finish process and return error *(optional, default: true)*
-- `urlFilter`: function which is called for each url to check whether it should be scraped. *(optional, see example below)*
-- `filenameGenerator`: name of one of the bundled filenameGenerators, or a custom filenameGenerator function *(optional, default: 'byType')*
-- `httpResponseHandler`: function which is called on each response, allows to customize resource or reject its downloading *(optional, see example below)*
+## options
+* [urls](#urls) - urls to download, *required*
+* [directory](#directory) - path to save files, *required*
+* [sources](#sources) - selects which resources should be downloaded
+* [recursive](#recursive) - follow anchors in html files
+* [maxDepth](#maxdepth) - maximum depth for dependencies
+* [request](#request) - custom options for [request](https://github.com/request/request)
+* [subdirectories](#subdirectories) - subdirectories for file extensions
+* [defaultFilename](#defaultfilename) - filename for index page
+* [prettifyUrls](#prettifyurls) - prettify urls
+* [ignoreErrors](#ignoreerrors) - whether to ignore errors on resource downloading
+* [urlFilter](#urlfilter) - skip some urls
+* [filenameGenerator](#filenamegenerator) - generate filename for downloaded resource
+* [httpResponseHandler](#httpresponsehandler) - customize http response handling

Default options can be found in [lib/config/defaults.js](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/config/defaults.js).

-
-**callback** - callback function *(optional)*, includes following parameters:
-
-- `error`: if error - `Error` object, if success - `null`
-- `result`: if error - `null`, if success - array of [Resource](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/resource.js) objects containing:
-- `url`: url of loaded page
-- `filename`: filename where page was saved (relative to `directory`)
-- `children`: array of children Resources
-
-### Filename Generators
-The filename generator determines where the scraped files are saved.
-
-#### byType (default)
-When the `byType` filenameGenerator is used the downloaded files are saved by type (as defined by the `subdirectories` setting)
-or directly in the `directory` folder, if no subdirectory is specified for the specific type.
-
-#### bySiteStructure
-When the `bySiteStructure` filenameGenerator is used the downloaded files are saved in `directory` using same structure as on the website:
-- `/` => `DIRECTORY/index.html`
-- `/about` => `DIRECTORY/about/index.html`
-- `/resources/javascript/libraries/jquery.min.js` => `DIRECTORY/resources/javascript/libraries/jquery.min.js`
-
-### Http Response Handlers
-HttpResponseHandler is used to reject resource downloading or customize resource text based on response data (for example, status code, content type, etc.)
-Function takes `response` argument - response object of [request](https://github.com/request/request) module and should return resolved `Promise` if resource should be downloaded or rejected with Error `Promise` if it should be skipped.
-Promise should be resolved with:
-* `string` which contains response body
-* or object with properies `body` (response body, string) and `metadata` - everything you want to save for this resource (like headers, original text, timestamps, etc.), scraper will not use this field at all, it is only for result.
-
-See [example of using httpResponseHandler](#example-5-rejecting-resources-with-404-status-and-adding-metadata).
-
-## Examples
-#### Example 1
-Let's scrape some pages from [http://nodejs.org/](http://nodejs.org/) with images, css, js files and save them to `/path/to/save/`.
-Imagine we want to load:
-- [Home page](http://nodejs.org/) to `index.html`
-- [About page](http://nodejs.org/about/) to `about.html`
-- [Blog](http://blog.nodejs.org/) to `blog.html`
-
-and separate files into directories:
-
-- `img` for .jpg, .png, .svg (full path `/path/to/save/img`)
-- `js` for .js (full path `/path/to/save/js`)
-- `css` for .css (full path `/path/to/save/css`)
-
+#### urls
+Array of objects which contain urls to download and filenames for them. **_Required_**.
```javascript
-var scrape = require('website-scraper');
scrape({
  urls: [
    'http://nodejs.org/', // Will be saved with default filename 'index.html'
    {url: 'http://nodejs.org/about', filename: 'about.html'},
    {url: 'http://blog.nodejs.org/', filename: 'blog.html'}
  ],
+  directory: '/path/to/save'
+}).then(console.log).catch(console.log);
+```
+
+#### directory
+String, absolute path to directory where downloaded files will be saved. Directory should not exist. It will be created by scraper. **_Required_**.
+
+#### sources
+Array of objects to download, specifies selectors and attribute values to select files for downloading. By default scraper tries to download all possible resources.
+```javascript
+// Downloading images, css files and scripts
+scrape({
+  urls: ['http://nodejs.org/'],
  directory: '/path/to/save',
-  subdirectories: [
-    {directory: 'img', extensions: ['.jpg', '.png', '.svg']},
-    {directory: 'js', extensions: ['.js']},
-    {directory: 'css', extensions: ['.css']}
-  ],
  sources: [
    {selector: 'img', attr: 'src'},
    {selector: 'link[rel="stylesheet"]', attr: 'href'},
    {selector: 'script', attr: 'src'}
-  ],
+  ]
+}).then(console.log).catch(console.log);
+```
+
+#### recursive
+Boolean, if `true` scraper will follow anchors in html files. Don't forget to set `maxDepth` to avoid infinite downloading. Defaults to `false`.
+
+#### maxDepth
+Positive number, maximum allowed depth for dependencies. Defaults to `null` - no maximum depth set.
+
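As a quick sketch of how these two options combine (the url and depth value below are only illustrative):

```javascript
var scrape = require('website-scraper');

// Follow links found on the start page, but stop one level deep:
// pages linked from http://example.com are downloaded, while links found on
// those pages are not followed because their depth exceeds maxDepth.
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  recursive: true,
  maxDepth: 1
}).then(console.log).catch(console.log);
```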
+#### request
+Object, custom options for [request](https://github.com/request/request#requestoptions-callback). Allows setting cookies, userAgent, etc.
+```javascript
+scrape({
+  urls: ['http://example.com/'],
+  directory: '/path/to/save',
  request: {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19'
    }
  }
-}).then(function (result) {
-  console.log(result);
-}).catch(function(err){
-  console.log(err);
-});
+}).then(console.log).catch(console.log);
```

-#### Example 2. Recursive downloading
+#### subdirectories
+Array of objects, specifies subdirectories for file extensions. If `null` all files will be saved to `directory`.
```javascript
-// Links from example.com will be followed
-// Links from links will be ignored because theirs depth = 2 is greater than maxDepth
-var scrape = require('website-scraper');
+/* Separate files into directories:
+  - `img` for .jpg, .png, .svg (full path `/path/to/save/img`)
+  - `js` for .js (full path `/path/to/save/js`)
+  - `css` for .css (full path `/path/to/save/css`)
+*/
scrape({
-  urls: ['http://example.com/'],
+  urls: ['http://example.com'],
  directory: '/path/to/save',
-  recursive: true,
-  maxDepth: 1
+  subdirectories: [
+    {directory: 'img', extensions: ['.jpg', '.png', '.svg']},
+    {directory: 'js', extensions: ['.js']},
+    {directory: 'css', extensions: ['.css']}
+  ]
}).then(console.log).catch(console.log);
```

-#### Example 3. Filtering out external resources
+#### defaultFilename
+String, filename for index page. Defaults to `index.html`.
+
+#### prettifyUrls
+Boolean, whether urls should be 'prettified' by having the `defaultFilename` removed. Defaults to `false`.
+
+#### ignoreErrors
+Boolean, if `true` scraper will continue downloading resources after an error occurs; if `false`, scraper will finish the process and return the error. Defaults to `true`.
+
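A minimal sketch with these three options set to non-default values (the values shown are illustrative, not recommendations):

```javascript
var scrape = require('website-scraper');

scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  defaultFilename: 'page.html', // index pages are saved as page.html instead of index.html
  prettifyUrls: true,           // generated links have the defaultFilename stripped
  ignoreErrors: false           // stop and return the error when a resource fails to download
}).then(console.log).catch(console.log);
```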
+#### urlFilter
+Function which is called for each url to check whether it should be scraped. Defaults to `null` - no url filter will be applied.
```javascript
// Links to other websites are filtered out by the urlFilter
var scrape = require('website-scraper');
@@ -166,28 +154,40 @@ scrape({
}).then(console.log).catch(console.log);
```

-#### Example 4. Downloading an entire website
+#### filenameGenerator
+String, name of one of the bundled filenameGenerators, or a custom filenameGenerator function. Filename generator determines where the scraped files are saved.
+
+###### byType (default)
+When the `byType` filenameGenerator is used the downloaded files are saved by type (as defined by the `subdirectories` setting) or directly in the `directory` folder, if no subdirectory is specified for the specific type.
+
+###### bySiteStructure
+When the `bySiteStructure` filenameGenerator is used the downloaded files are saved in `directory` using same structure as on the website:
+- `/` => `DIRECTORY/index.html`
+- `/about` => `DIRECTORY/about/index.html`
+- `/resources/javascript/libraries/jquery.min.js` => `DIRECTORY/resources/javascript/libraries/jquery.min.js`
+
```javascript
-// Downloads all the crawlable files of example.com.
-// The files are saved in the same structure as the structure of the website, by using the `bySiteStructure` filenameGenerator.
+// Downloads all the crawlable files. The files are saved in the same structure as the structure of the website
// Links to other websites are filtered out by the urlFilter
var scrape = require('website-scraper');
scrape({
  urls: ['http://example.com/'],
-  urlFilter: function(url){
-    return url.indexOf('http://example.com') === 0;
-  },
+  urlFilter: function(url){ return url.indexOf('http://example.com') === 0; },
  recursive: true,
  maxDepth: 100,
-  prettifyUrls: true,
  filenameGenerator: 'bySiteStructure',
  directory: '/path/to/save'
}).then(console.log).catch(console.log);
```

-#### Example 5. Rejecting resources with 404 status and adding metadata
+#### httpResponseHandler
+Function which is called on each response, allows customizing a resource or rejecting its download.
+It takes 1 argument - response object of [request](https://github.com/request/request) module and should return resolved `Promise` if resource should be downloaded or rejected with Error `Promise` if it should be skipped.
+Promise should be resolved with:
+* `string` which contains response body
+* or object with properties `body` (response body, string) and `metadata` - everything you want to save for this resource (like headers, original text, timestamps, etc.), scraper will not use this field at all, it is only for result.
```javascript
-var scrape = require('website-scraper');
+// Rejecting resources with 404 status and adding metadata to other resources
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
@@ -207,6 +207,15 @@ scrape({
  }
}).then(console.log).catch(console.log);
```
+The scrape function resolves with an array of [Resource](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/resource.js) objects which contain the `metadata` property from `httpResponseHandler`.
+
+## callback
+Callback function, optional, includes the following parameters:
+- `error`: if error - `Error` object, if success - `null`
+- `result`: if error - `null`, if success - array of [Resource](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/resource.js) objects containing:
+  - `url`: url of loaded page
+  - `filename`: filename where page was saved (relative to `directory`)
+  - `children`: array of children Resources
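A minimal sketch of the callback form using the parameters listed above (the urls and directory are illustrative):

```javascript
var scrape = require('website-scraper');

scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save'
}, function (error, result) {
  if (error) {
    return console.error(error); // Error object; result is null
  }
  result.forEach(function (resource) {
    // each Resource has url, filename (relative to directory) and children
    console.log(resource.url, '->', resource.filename);
  });
});
```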

## Log and debug
This module uses [debug](https://github.com/visionmedia/debug) to log events. To enable logs you should use environment variable `DEBUG`.
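One way to enable it from code, assuming the module logs under a `website-scraper` namespace (the namespace pattern is an assumption, not confirmed here); setting the `DEBUG` environment variable in the shell before starting node works the same way:

```javascript
// Turn on debug output before the module is loaded.
// 'website-scraper*' is an assumed namespace pattern; adjust it to the
// namespaces the module actually logs under.
process.env.DEBUG = 'website-scraper*';
var scrape = require('website-scraper');
```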
