73 commits
9b43338
add random delay function between requests
Euphorbium Jan 25, 2015
f48ac58
allow Start URL to be a local address
3flex Apr 2, 2015
3fd1383
Recursive getStartUrls
jackburridge Jun 26, 2015
ff66107
Update Sitemap.js
jackburridge Jun 26, 2015
e938a28
adding options to allow altering CSV output
mohamnag Feb 9, 2016
c6daaa1
adding necessary inputs for getting CSV output data from user
mohamnag Feb 9, 2016
4b71997
controller also to support CSV output options
mohamnag Feb 9, 2016
c5a965f
minor visual change on CSV generation UI
mohamnag Feb 9, 2016
4e41a0a
Select regex group
RuneHL Mar 11, 2016
316bc5f
Added Limit for clicking
panna-ahmed Dec 7, 2016
6b5018e
Do JSON export instead of CSV
Feb 24, 2017
2f9decb
Support date pattern in URL's
Feb 28, 2017
498e390
Limit in pagination selectors (click and scroll)
Feb 28, 2017
7b1f721
README.md updated
Feb 28, 2017
de0c7b9
Create valid CSV export
Feb 28, 2017
5bc0b91
Remove old CSV code and remove logs
Feb 28, 2017
789e0be
Refactor
Feb 28, 2017
71690dd
Added support vor vertical tables
jwillmer May 21, 2017
88825ad
Removed commented code block
jwillmer May 22, 2017
b58aa03
Merge of date pattern and pagination implementation
jwillmer May 22, 2017
b90bdc7
Merge branch 'master' of https://github.com/Euphorbium/web-scraper-ch…
jwillmer May 22, 2017
2ac1229
Merge click limit
jwillmer May 22, 2017
a321147
Merge regex groups for selection in Text
jwillmer May 22, 2017
c344b66
Merge branch 'patch-1' of https://github.com/3flex/web-scraper-chrome…
jwillmer May 22, 2017
c506039
Merge of valid CSV export
jwillmer May 22, 2017
114e3f2
Merged CSV options
jwillmer May 22, 2017
7fe1f63
Version bump and readme update
jwillmer May 22, 2017
4790ad9
Added lookup of image in element style
jwillmer May 23, 2017
9b2b6c5
Updated readme with installation instructions
jwillmer May 23, 2017
52e0cfe
Clone computed styles on clone
jwillmer May 23, 2017
8899d19
Added comment in code
jwillmer May 23, 2017
ce79af2
Added missing file ref in manifest
jwillmer May 23, 2017
04a64d2
Refactored edit view to fit in a smaller area.
jwillmer May 23, 2017
cc045ad
Implemented string manipulation features. close #2
jwillmer May 23, 2017
3348baa
Moved regex into string manipulation area
jwillmer May 23, 2017
5f1f369
Update readme feature list
jwillmer May 23, 2017
53f9e73
Fixed exceptions in regex and string validation
jwillmer May 24, 2017
21848a8
Allowing regex in content replacement field
jwillmer May 24, 2017
abc1792
Added text manipulation tools to group selector
jwillmer May 24, 2017
3a1a080
Some unit tests for DateUtils modules
May 24, 2017
9b7c5a2
Moved data manipulation function in Selector scope
jwillmer May 24, 2017
e894fc2
Enable string manipulation in group
jwillmer May 24, 2017
4641893
Implemented style extraction type
jwillmer May 24, 2017
333e747
Update readme feature list
jwillmer May 24, 2017
1bd10b6
Added and fixed unit tests. Refactored selector code.
jwillmer May 24, 2017
fd3a63f
Merge branch 'master' of https://github.com/codoff/web-scraper-chrome…
jwillmer May 24, 2017
dd2d76c
Refactored unit tests
jwillmer May 24, 2017
17b4eca
Merged recursive getStartUrl used when several identical patterns are…
jwillmer May 24, 2017
c97b0a7
Summarized Readme
jwillmer May 24, 2017
36060ae
Fixed validation in selector.js
jwillmer May 25, 2017
be7ce67
Merge branch 'master' of https://github.com/jwillmer/web-scraper-chro…
jwillmer May 25, 2017
4f332e7
Refactored text manipulation in selector.js
jwillmer May 25, 2017
e0ed642
Fixed broken tests
jwillmer May 25, 2017
cb38664
Removed click limit since pagination option does the same (duplicate …
jwillmer May 25, 2017
bf1e59d
Improved documentation
jwillmer May 25, 2017
7455e3b
Implemented workaround in unit test for certain systems.
jwillmer May 25, 2017
dbc210d
Implemented refresh button for table header row
jwillmer May 26, 2017
bea5f21
Fixed Navbar from disapearing
jwillmer May 26, 2017
efd612c
Added string replacement to image and link selector
jwillmer May 26, 2017
f965b5d
Fixed replacement string in image and link selector
jwillmer May 26, 2017
8ac99cd
Added playground for tables
jwillmer May 26, 2017
30c2703
Fixed text manipulation order of applied actions.
jwillmer May 27, 2017
58d10b7
- fixed randomness interval
jwillmer Jun 2, 2017
e0ad5fd
Code refactoring
jwillmer Jun 3, 2017
cb3d802
Fixed display of scraped results
jwillmer Jun 3, 2017
1e85c12
Refactored sitemap start url GUI
jwillmer Jun 3, 2017
d9b3ad7
Updated version features
jwillmer Jun 3, 2017
698e056
Fixed url bug #13
jwillmer Jul 15, 2017
74d8918
Update Readme
jwillmer Jul 15, 2017
2932f99
Updated Readme with release info
jwillmer Jul 15, 2017
2cd9a5b
Added value input (selector)
jwillmer Feb 19, 2018
cba6351
Added test to value input element
jwillmer Feb 19, 2018
3732955
Update README.md
jwillmer Mar 9, 2018
6 changes: 6 additions & 0 deletions .gitignore
@@ -2,3 +2,9 @@
projectFilesBackup
extension.zip

/.vs/web-scraper-chrome-extension/v15/.suo
/.vs/web-scraper-chrome-extension/v15
/.vs/VSWorkspaceState.json
/.vs/slnx.sqlite
/.vs/ProjectSettings.json
/.vs/config/applicationhost.config
57 changes: 31 additions & 26 deletions README.md
@@ -5,35 +5,28 @@ should be traversed and what should be extracted. Using these sitemaps the
Web Scraper will navigate the site accordingly and extract all data. Scraped
data later can be exported as CSV.

Install the extension from [Chrome store] [chrome-store]

### Features

1. Scrape multiple pages
2. Sitemaps and scraped data are stored in the browser's local storage or in CouchDB
3. Multiple data selection types
4. Extract data from dynamic pages (JavaScript+AJAX)
5. Browse scraped data
6. Export scraped data as CSV
7. Import, Export sitemaps
8. Depends only on Chrome browser

### Help

Documentation and tutorials are available on [webscraper.io] [webscraper.io]

Ask for help, submit bugs, suggest features on [google groups] [google-groups]
#### Latest Version
To run the latest version, [download the project][latest-releases] to your system and [follow the instructions from Google][get-started-chrome] (select the `extension` folder).

Submit bugs and suggest features on [bug tracker] [github-issues]

#### Bugs
When submitting a bug please attach an exported sitemap if possible.

## License
LGPLv3

## Changelog

### v0.3
* Enabled pasting of multiple start URLs (by [@jwillmer](https://github.com/jwillmer))
* Added scraping of dynamic table columns (by [@jwillmer](https://github.com/jwillmer))
* Added style extraction type (by [@jwillmer](https://github.com/jwillmer))
* Added text manipulation (trim, replace, prefix, suffix, remove HTML) (by [@jwillmer](https://github.com/jwillmer))
* Added image improvements to find images in div background (by [@jwillmer](https://github.com/jwillmer))
* Added support for vertical tables (by [@jwillmer](https://github.com/jwillmer))
* Added random delay function between requests (by [@Euphorbium](https://github.com/Euphorbium))
* Start URL can now also be a local URL (by [@3flex](https://github.com/3flex))
* Added CSV export options (by [@mohamnag](https://github.com/mohamnag))
* Added Regex group for select (by [@RuneHL](https://github.com/RuneHL))
* JSON export/import of settings (by [@haisi](https://github.com/haisi))
* Added date and number pattern in URL (by [@codoff](https://github.com/codoff))
* Added pagination selector limit (by [@codoff](https://github.com/codoff))
* Improved CSV export (by [@haisi](https://github.com/haisi))
* Added click limit option (by [@panna-ahmed](https://github.com/panna-ahmed))

### v0.2
* Added Element click selector
* Added Element scroll down selector
@@ -55,7 +48,19 @@ LGPLv3
* Added ranged start urls
* Fixed bug which made selector tree not to show on some operating systems

#### Bugs
When submitting a bug please attach an exported sitemap if possible.

#### Development
Read the [Development Instructions](/docs/Development.md) before you start.

## License
LGPLv3

[chrome-store]: https://chrome.google.com/webstore/detail/web-scraper/jnhgnonknehpejjnehehllkliplmbmhn
[webscraper.io]: http://webscraper.io/
[google-groups]: https://groups.google.com/forum/#!forum/web-scraper
[github-issues]: https://github.com/martinsbalodis/web-scraper-chrome-extension/issues
[get-started-chrome]: https://developer.chrome.com/extensions/getstarted#unpacked
[issue-14]: https://github.com/jwillmer/web-scraper-chrome-extension/issues/14
[latest-releases]: https://github.com/jwillmer/web-scraper-chrome-extension/releases
56 changes: 56 additions & 0 deletions docs/Development.md
@@ -0,0 +1,56 @@
# Development Instructions

## Selector Development

This section demonstrates all steps that are needed in order to create or extend a selector for the web scraper. In this example we are creating a "Select All" selector.

### Create Selector Logic
You can skip the file creation steps if you only intend to extend existing selectors with new functionality.

- Duplicate the file `SelectorElementStyle.js` in `scripts/Selector/`
- Rename the duplicated file to `SelectorAll.js`
- Modify the `getData` method to return all content
- Specify which features you would like to have enabled in the `getFeatures` function
- Implement the logic for the enabled features (Feature `textmanipulation` will work out of the box)
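
As a rough illustration of the steps above, a `SelectorAll.js` could look something like the following. This is only a sketch: the object shape, the `id` and `selector` properties, the feature names other than `textmanipulation`, and the way `getData` receives its parent element are all assumptions made for illustration. Copy the real structure from `SelectorElementStyle.js` rather than from here.

```javascript
// Hypothetical sketch of scripts/Selector/SelectorAll.js.
// The real selectors may use a different shape (for example asynchronous
// results); mirror SelectorElementStyle.js when in doubt.
var SelectorAll = {
  // Features listed here decide which controls the edit view shows.
  // 'multiple' and 'delay' are assumed names; 'textmanipulation' is from the docs.
  getFeatures: function () {
    return ['multiple', 'delay', 'textmanipulation'];
  },

  // Return one record per matched element, keyed by the selector id.
  // `parentElement` only needs querySelectorAll(), as DOM elements have.
  getData: function (parentElement) {
    var id = this.id;
    var elements = Array.prototype.slice.call(
      parentElement.querySelectorAll(this.selector)
    );
    return elements.map(function (element) {
      var record = {};
      record[id] = element.textContent;
      return record;
    });
  }
};
```

Because `getData` only relies on `querySelectorAll` and `textContent`, it can be exercised in a unit test with a plain stub object instead of a real page.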

### Create Selector Controls

- Add a section into the `SelectorEdit.html` file in `devtools/views/`
- Add section class `form-group feature feature-AllSelector`
- You can wrap content in `{{#selectorName}}` and `{{/selectorName}}` to display it only when that value is set (used for checkbox controls)
- Use `{{selector.selectorAll}}` to define a variable


### Set references to your selector

#### Controller

- Open the `Controler.js` in `scripts/`
- Add a variable in the function `getCurrentlyEditedSelector` to select your HTML section value
- Add the variable to the `newSelector` object (every selector in `scripts/Selector/` that references this feature can access the value)
- Add validation rules to your variable in the function `initSelectorValidation`
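
The controller steps above might be pictured roughly as follows. This is a simplified sketch, not the extension's code: the `form` argument stands in for the values the real controller reads from the edit view, `selectorAll` is the example variable from this guide, and `validateSelector` is a hypothetical stand-in for the rules registered in `initSelectorValidation`.

```javascript
// Hypothetical sketch: `form` is a plain object standing in for the values
// read from SelectorEdit.html; the real controller reads them from the DOM.
function getCurrentlyEditedSelector(form) {
  // New variable holding the value of the added HTML section.
  var selectorAll = form.selectorAll === true;

  var newSelector = {
    id: form.id,
    selector: form.selector,
    // Exposed to every selector in scripts/Selector/ that uses the feature.
    selectorAll: selectorAll
  };
  return newSelector;
}

// Hypothetical validation; the real rules belong in initSelectorValidation.
function validateSelector(selector) {
  var errors = [];
  if (!selector.id) {
    errors.push('id is required');
  }
  if (typeof selector.selectorAll !== 'boolean') {
    errors.push('selectorAll must be true or false');
  }
  return errors;
}
```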


#### File reference

- Add a reference in `extension/manifest.json` in the sections `content_scripts` and `scripts`
- Add a reference to `extension\devtools\devtools_scraper_panel.html`
- Add a reference to `playgrounds\extension\index.html`
- Add a reference to `tests\SpecRunner.html`


### Testing

For testing you need to run a web server. Personally I use [Web Server for Chrome](https://chrome.google.com/webstore/detail/web-server-for-chrome/ofhbbkphhbklhfoeikjpcbhemlocgigb) and reference the working directory of the project.

- Duplicate a test file in `tests/Selector` and rename it
- Write your tests for your selector
- Run the tests by opening `tests/SpecRunner.html`
- Try your implementation by opening `playgrounds/extension/index.html`
- Extend the playground if it does not cover your scenario

### Documentation

- Create a `md` file in `docs/selectors`
- Describe the usage, options, etc

5 changes: 5 additions & 0 deletions docs/Selectors/Element attribute selector.md
@@ -8,6 +8,11 @@ this link: `<a href="#" title="my title">link</a>`.
* multiple - multiple records are being extracted.
* attribute name - the attribute that is going to be extracted. For example
`title`, `data-id`.
* remove HTML
* trim text
* replace text - regular expression in the replace field possible
* text prefix/suffix
* delay - delay the extraction

## Use cases
See [Text selector] [text-selector] use cases.
1 change: 1 addition & 0 deletions docs/Selectors/Element click selector.md
@@ -18,6 +18,7 @@ events triggered by the button.
be clicked to load more elements.
* click type - type of how the selector knows when there will be no new
elements and clicking should stop.
* pagination limit - the number of clicks you want the selector to perform.
* click element uniqueness - type of how selector knows which buttons are
already clicked.
* multiple - multiple records are being extracted (almost always should be
1 change: 1 addition & 0 deletions docs/Selectors/Element scroll down selector.md
@@ -16,6 +16,7 @@ infinitely then this selector will be stuck in an infinite loop.
should usually be specified because the data won't be loaded immediately from
the server after scrolling down. More than 2000 ms might be a good choice if
you don't want to lose data because the server didn't respond fast enough.
* pagination limit - the number of times you want the selector to scroll down.
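
To show why a pagination limit helps with pages that keep loading forever, here is a minimal sketch of a bounded loop. The function name `paginateWithLimit` and the `loadMore` callback are hypothetical; the extension's own scroll and click handling is more involved than this.

```javascript
// Hypothetical helper: repeat a "load more" action (a scroll or a click)
// until it stops producing new elements or the pagination limit is reached.
// `loadMore` must return true when new elements appeared, false otherwise.
function paginateWithLimit(loadMore, paginationLimit) {
  var performed = 0;
  // Without a limit the loop only ends when loadMore() reports nothing new,
  // which never happens on a page that scrolls infinitely.
  while (paginationLimit === undefined || performed < paginationLimit) {
    if (!loadMore()) {
      break;
    }
    performed++;
  }
  return performed;
}
```

With a limit of 3, a page that always loads more content is scrolled three times and then left alone; with no limit, the same page would keep the loop running indefinitely.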

## Use cases
See [Element selector] [element-selector] use cases.
1 change: 1 addition & 0 deletions docs/Selectors/Element selector.md
@@ -17,6 +17,7 @@ on a button then you should try these selectors:
be used as parent elements for child selectors.
* multiple - multiple records are being extracted (almost always should be
checked). Multiple option for child selectors usually should not be checked.
* delay - delay the extraction

## Use cases

21 changes: 21 additions & 0 deletions docs/Selectors/Element style selector.md
@@ -0,0 +1,21 @@
# Element style selector
Element style selector can extract a style value of an HTML element.
For example you could use this selector to extract the `width` property from
this div: `<div style="width: 20px;"></div>`.

## Configuration options
* selector - [CSS selector] [css-selector] for the element.
* multiple - multiple records are being extracted.
* style name - the style property that is going to be extracted. For example
`width`, `background-image`.
* remove HTML
* trim text
* replace text - regular expression in the replace field possible
* text prefix/suffix
* delay - delay the extraction
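
To make the idea of extracting a single style value concrete, the sketch below pulls one property out of an inline `style` attribute string. It is an illustration only: `extractStyleValue` is a hypothetical helper, and the extension itself presumably reads styles from the live DOM element rather than parsing a string.

```javascript
// Hypothetical helper: return the value of `styleName` from an inline
// style attribute such as "width: 20px;", or null if it is not present.
// Limitation: values that themselves contain ';' (e.g. data URIs) are
// not handled by this simple split.
function extractStyleValue(styleAttribute, styleName) {
  var declarations = styleAttribute.split(';');
  for (var i = 0; i < declarations.length; i++) {
    // Split on the first ':' only, so values like url(http://...) survive.
    var separatorIndex = declarations[i].indexOf(':');
    if (separatorIndex === -1) {
      continue;
    }
    var name = declarations[i].slice(0, separatorIndex).trim().toLowerCase();
    if (name === styleName.toLowerCase()) {
      return declarations[i].slice(separatorIndex + 1).trim();
    }
  }
  return null;
}

extractStyleValue('width: 20px;', 'width'); // returns '20px'
```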

## Use cases
See [Text selector] [text-selector] use cases.

[text-selector]: Text%20selector.md
[css-selector]: ../CSS%20selector.md
5 changes: 5 additions & 0 deletions docs/Selectors/Grouped selector.md
@@ -9,6 +9,11 @@ The extracted data will be stored as JSON.
* attribute name - optionally this selector can extract an attribute of the
selected element. If specified the extractor will also add this attribute to
the resulting JSON.
* remove HTML
* trim text
* replace text - regular expression in the replace field possible
* text prefix/suffix
* delay - delay the extraction

## Use cases

5 changes: 5 additions & 0 deletions docs/Selectors/HTML selector.md
@@ -6,6 +6,11 @@ inner HTML of the element will be extracted.
* selector - [CSS selector] [css-selector] for the element whose inner HTML
will be extracted.
* multiple - multiple records are being extracted.
* remove HTML
* trim text
* replace text - regular expression in the replace field possible
* text prefix/suffix
* delay - delay the extraction

## Use cases
See [Text selector] [text-selector] use cases.
1 change: 1 addition & 0 deletions docs/Selectors/Image selector.md
@@ -15,6 +15,7 @@ report it as a bug.
checked for Image selector.
* download image - downloads and store images on local drive. When CouchDB
storage back end is used the image is also stored locally.
* delay - delay the extraction

## Use cases
See [Text selector] [text-selector] use cases.
1 change: 1 addition & 0 deletions docs/Selectors/Link selector.md
@@ -23,6 +23,7 @@ link selector is not working for you then you can try these workarounds:
* selector - [CSS selector] [css-selector] for the link element from which the
link for navigation will be extracted.
* multiple - multiple records are being extracted. Usually should be checked.
* delay - delay the extraction

## Use cases

1 change: 1 addition & 0 deletions docs/Selectors/Table selector.md
@@ -17,6 +17,7 @@ shows what you should select when extracting data from a table.
* data rows selector - [CSS selector] [css-selector] for table data rows.
* multiple - multiple records are being extracted. Usually should be
checked for Table selector because you are extracting multiple rows.
* delay - delay the extraction

## Use cases
See [Text selector] [text-selector] use cases.
5 changes: 5 additions & 0 deletions docs/Selectors/Text selector.md
@@ -16,6 +16,11 @@ resulting data.
multiple checked then you might actually need
[Element selector] [element-selector].
* regex - regular expression to extract a substring from the result.
* remove HTML
* trim text
* replace text - regular expression in the replace field possible
* text prefix/suffix
* delay - delay the extraction
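
The text manipulation options above can be pictured as a small pipeline applied to the extracted text. The sketch below is an assumption-laden illustration: the function name, the option names, and the order of the steps (remove HTML, trim, replace, then prefix/suffix) are all hypothetical and may differ from what the extension actually does.

```javascript
// Hypothetical helper; option names and step order are assumptions.
function manipulateText(text, options) {
  var result = text;
  if (options.removeHtml) {
    // Naive tag stripping, good enough for an illustration only.
    result = result.replace(/<[^>]*>/g, '');
  }
  if (options.trimText) {
    result = result.trim();
  }
  if (options.replaceText) {
    // The replace field is treated as a regular expression.
    var pattern = new RegExp(options.replaceText, 'g');
    result = result.replace(pattern, options.replacementText || '');
  }
  return (options.textPrefix || '') + result + (options.textSuffix || '');
}

manipulateText('  <b>Price: 20 EUR</b> ', {
  removeHtml: true,
  trimText: true,
  replaceText: '\\s*EUR$',
  replacementText: '',
  textPrefix: '[',
  textSuffix: ']'
}); // returns '[Price: 20]'
```

Since the steps feed into each other, the order matters: trimming before replacing lets an end-anchored pattern such as `\s*EUR$` match, whereas trailing whitespace would otherwise sit between the match and the end of the string.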

### Regex
