Open
50 commits
9b43338
add random delay function between requests
Euphorbium Jan 25, 2015
5aa090d
Update background_script.js
hejiheji001 Jun 25, 2018
d43e657
Support Restful MySQL
hejiheji001 Aug 2, 2018
6088dd0
Support Restful MySQL
hejiheji001 Aug 2, 2018
059f34d
Support Restful MySQL
hejiheji001 Aug 2, 2018
1de8869
Data filter, Distinct json
hejiheji001 Aug 14, 2018
8426e94
bug fix for 'Distinct'
hejiheji001 Aug 14, 2018
fa0d1f8
Create README.md
hejiheji001 Aug 16, 2018
99f980c
Update README.md
hejiheji001 Aug 16, 2018
c3fd7b8
Update README.md
hejiheji001 Aug 16, 2018
de2b326
bug fix & custom columns
hejiheji001 Aug 17, 2018
5928010
Merge branch 'master' of https://github.com/hejiheji001/web-scraper-c…
hejiheji001 Aug 20, 2018
20e369f
Update README.md
hejiheji001 Aug 20, 2018
abb4c1d
Update README.md
hejiheji001 Aug 20, 2018
9016f78
Update README.md
hejiheji001 Aug 20, 2018
559ebd2
bug fix for Sitemap
hejiheji001 Aug 21, 2018
a06a907
Update README.md
hejiheji001 Aug 21, 2018
af3a6de
Create Wiki
hejiheji001 Aug 21, 2018
68da68f
Add files via upload
hejiheji001 Aug 21, 2018
1a2a17c
Add files via upload
hejiheji001 Aug 21, 2018
5e24214
Add files via upload
hejiheji001 Aug 21, 2018
d073489
Add files via upload
hejiheji001 Aug 21, 2018
022bdcd
Update README.md
hejiheji001 Aug 21, 2018
6452409
Update README.md
hejiheji001 Aug 21, 2018
861e481
Update README.md
hejiheji001 Aug 21, 2018
b727ce6
Add files via upload
hejiheji001 Aug 22, 2018
7a80991
Add files via upload
hejiheji001 Aug 22, 2018
4ad7b08
Delete Selector.png
hejiheji001 Aug 22, 2018
fa412fb
Add files via upload
hejiheji001 Aug 22, 2018
e8d1d52
wiki
hejiheji001 Aug 22, 2018
b48d5cd
Update README.md
hejiheji001 Aug 22, 2018
e7cbf6b
create sitemap from pages
hejiheji001 Aug 22, 2018
408795c
Update README.md
hejiheji001 Aug 22, 2018
1023638
Update README.md
hejiheji001 Aug 22, 2018
87a6d8f
Merge pull request #1 from Euphorbium/master
hejiheji001 Aug 22, 2018
0d6056c
create sitemap from pages
hejiheji001 Aug 23, 2018
e28d357
Revert "add random delay function between requests"
hejiheji001 Aug 23, 2018
5d9a6f6
Merge pull request #2 from hejiheji001/revert-1-master
hejiheji001 Aug 23, 2018
c4fa38b
Revert "Revert "add random delay function between requests""
hejiheji001 Aug 23, 2018
04d1506
random delay
hejiheji001 Aug 23, 2018
5874a81
Merge branch 'master' of https://github.com/hejiheji001/web-scraper-c…
hejiheji001 Aug 23, 2018
444983f
Update README.md
hejiheji001 Aug 23, 2018
cf1fb33
Update README.md
hejiheji001 Aug 23, 2018
50cadc1
add neww feature
hejiheji001 Aug 23, 2018
9eefe78
bug fix
hejiheji001 Aug 27, 2018
62410c7
bug fix
hejiheji001 Oct 22, 2018
db14ad8
Merge pull request #3 from hejiheji001/revert-2-revert-1-master
hejiheji001 Oct 24, 2018
66ad280
improve performance
hejiheji001 Oct 24, 2018
6567e4a
bug fix
hejiheji001 Oct 29, 2018
9c05fc4
Merge branch 'master' of https://github.com/hejiheji001/web-scraper-c…
hejiheji001 Oct 29, 2018
4 changes: 0 additions & 4 deletions .gitignore

This file was deleted.

3 changes: 0 additions & 3 deletions .gitmodules

This file was deleted.

165 changes: 0 additions & 165 deletions LICENSE

This file was deleted.

71 changes: 37 additions & 34 deletions README.md
@@ -1,13 +1,27 @@
# Web Scraper
Web Scraper is a chrome browser extension built for data extraction from web
# Web Scraper Plus
Web Scraper Plus is a chrome browser extension built for data extraction from web
pages. Using this extension you can create a plan (sitemap) how a web site
should be traversed and what should be extracted. Using these sitemaps the
Web Scraper will navigate the site accordingly and extract all data. Scraped
data later can be exported as CSV.

Install the extension from [Chrome store] [chrome-store]
Install the extension from [chrome-store]

### Features
Documentation for new features: [wiki]

#### This tool is forked from [Web-Scraper] with many more features

### New Features
1. [CLI Support]: Start scraping from CMD/Terminal
2. [MySQL Support]: Support MySQL database (v5.7+)
3. [Anti Lazy-Loading]: Anti Lazy-Loading feature on pages
4. [Data Filter]: Support user-defined JS code for data preprocessing and more
5. [Distinct]: Remove duplicate data before the end of every task.
6. [Custom Columns]: Define the columns you want to display. Use this feature together with [Data Filter]
7. [Easy Scrape]: Create and scrape sitemaps more easily. (Based on https://github.com/aagiss)
8. Random Interval: Add a random delay between requests. (Provided by https://github.com/Euphorbium)

### Features (forked from original work)

1. Scrape multiple pages
2. Sitemaps and scraped data are stored in browsers local storage or in CouchDB
@@ -20,42 +34,31 @@ Install the extension from [Chrome store] [chrome-store]

### Help

Documentation and tutorials are available on [webscraper.io] [webscraper.io]
Basic documentation and tutorials are available on [webscraper.io]

Ask for help, submit bugs, suggest features on [google groups] [google-groups]

Submit bugs and suggest features on [bug tracker] [github-issues]
Submit bugs and suggest features on [github-issues]

#### Bugs
When submitting a bug please attach an exported sitemap if possible.

## License
LGPLv3

## Changelog

### v0.2
* Added Element click selector
* Added Element scroll down selector
* Added Link popup selector
* Improved table selector to work with any html markup
* Added Image download
* Added keyboard shortcuts when selecting elements
* Added configurable delay before using selector
* Added configurable delay between page visiting
* Added multiple start url configuration
* Added form field validation
* Fixed a lot of bugs

### v0.1.3
* Added Table selector
* Added HTML selector
* Added HTML attribute selector
* Added data preview
* Added ranged start urls
* Fixed bug which made selector tree not to show on some operating systems

[chrome-store]: https://chrome.google.com/webstore/detail/web-scraper/jnhgnonknehpejjnehehllkliplmbmhn
[Web-Scraper]: https://github.com/martinsbalodis/web-scraper-chrome-extension
[chrome-store]: https://chrome.google.com/webstore/detail/pbbfbmlnpackgeofecdfncmmdbodkhma
[webscraper.io]: http://webscraper.io/
[google-groups]: https://groups.google.com/forum/#!forum/web-scraper
[github-issues]: https://github.com/martinsbalodis/web-scraper-chrome-extension/issues
[github-issues]: https://github.com/hejiheji001/web-scraper-chrome-extension/issues
[wiki]: https://github.com/hejiheji001/web-scraper-chrome-extension/wiki
[MySQL Support]: https://github.com/hejiheji001/web-scraper-chrome-extension/wiki/MySQL-Support

[CLI Support]: https://github.com/hejiheji001/web-scraper-chrome-extension/wiki/CLI-Support

[Anti Lazy-Loading]: https://github.com/hejiheji001/web-scraper-chrome-extension/wiki/Anti-Lazy-Loading

[Data Filter]: https://github.com/hejiheji001/web-scraper-chrome-extension/wiki/Data-Filter

[Distinct]: https://github.com/hejiheji001/web-scraper-chrome-extension/wiki/Distinct

[Custom Columns]: https://github.com/hejiheji001/web-scraper-chrome-extension/wiki/Custom-Columns

[Easy Scrape]: https://github.com/hejiheji001/web-scraper-chrome-extension/wiki/Easy-Scrape
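
The "Random Interval" item in the README above describes adding a random delay between requests, but the diff shown here does not include that code. The following is only a minimal sketch of the idea, not the extension's implementation: `randomDelay`, `visitWithRandomInterval`, the `visit` callback, and the uniform distribution are all assumptions made for illustration.

```javascript
// Hypothetical helper (not from the extension): returns a whole number of
// milliseconds drawn uniformly from the inclusive range [minMs, maxMs].
// `random` is injectable so the function can be tested deterministically.
function randomDelay(minMs, maxMs, random = Math.random) {
  if (minMs > maxMs) {
    throw new RangeError("minMs must not exceed maxMs");
  }
  return Math.floor(minMs + random() * (maxMs - minMs + 1));
}

// Hypothetical driver: visits each URL in order and waits a random interval
// between consecutive requests. `visit` is a caller-supplied async function.
async function visitWithRandomInterval(urls, visit, minMs, maxMs) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    results.push(await visit(urls[i]));
    if (i < urls.length - 1) {
      // No delay is needed after the last URL.
      await new Promise((resolve) =>
        setTimeout(resolve, randomDelay(minMs, maxMs))
      );
    }
  }
  return results;
}
```

Injecting `random` keeps the delay calculation testable. A real scraper would probably read the bounds from the sitemap or request options instead of taking them as arguments.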
Binary file added docs/wiki/Anti LazyLoading.png
Binary file added docs/wiki/CustomColumns1.png
Binary file added docs/wiki/CustomColumns2.png
Binary file added docs/wiki/DataFilter1.png
Binary file added docs/wiki/DataFilter2.png
Binary file added docs/wiki/DataFilter3.png
Binary file added docs/wiki/DataFilter4.png
Binary file added docs/wiki/Distinct.png
Binary file added docs/wiki/MySQL Support.png
Binary file added docs/wiki/Selector.png
Binary file added extension/assets/images/1400.png
Binary file added extension/assets/images/440.png
Binary file added extension/assets/images/920g.png
Binary file modified extension/assets/images/icon128.png
Binary file modified extension/assets/images/icon16.png
Binary file modified extension/assets/images/icon19.png
Binary file modified extension/assets/images/icon38.png
Binary file modified extension/assets/images/icon48.png
35 changes: 29 additions & 6 deletions extension/background_page/background_script.js
@@ -27,12 +27,22 @@ var sendToActiveTab = function(request, callback) {
});
};

var currentChildURLs=null;

chrome.runtime.onMessage.addListener(
function (request, sender, sendResponse) {

console.log("chrome.runtime.onMessage", request);

if (request.createSitemap) {
if (request.setCurrentChildURLs) {
currentChildURLs=request.urls;
sendResponse(request);
return true;
}else if (request.getCurrentChildURLs) {
request.urls=currentChildURLs;
sendResponse(request);
return true;
}else if (request.createSitemap) {
store.createSitemap(request.sitemap, sendResponse);
return true;
}
@@ -51,16 +61,22 @@ chrome.runtime.onMessage.addListener(
else if (request.sitemapExists) {
store.sitemapExists(request.sitemapId, sendResponse);
return true;
}else if (request.findSitemap) {
store.findSitemap(request.sitemapId, sendResponse);
return true;
}
else if (request.getSitemapData) {
store.getSitemapData(new Sitemap(request.sitemap), sendResponse);
store.getSitemapData(new Sitemap(request.sitemap), sendResponse);
return true;
}
else if (request.scrapeSitemap) {
else if (request.scrapeSitemap) {//TODO
var sitemap = new Sitemap(request.sitemap);
var queue = new Queue();
var browser = new ChromePopupBrowser({
pageLoadDelay: request.pageLoadDelay
pageLoadDelay: request.pageLoadDelay,
scrollToBottom: request.scrollToBottom,
urls: sitemap.getStartUrls()
});

var scraper = new Scraper({
@@ -72,8 +88,8 @@ chrome.runtime.onMessage.addListener(
});

try {
scraper.run(function () {
browser.close();
const callback = function(){
//browser.close();
var notification = chrome.notifications.create("scraping-finished", {
type: 'basic',
iconUrl: 'assets/images/icon128.png',
@@ -83,7 +99,14 @@ chrome.runtime.onMessage.addListener(
// notification showed
});
sendResponse();
});
}

const finalize = function(finalResult){
store.removeDuplicate(sitemap._id, request.distinct, finalResult);
callback();
}

scraper.run(finalize);
}
catch (e) {
console.log("Scraper execution cancelled", e);
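
The `finalize` callback above passes the scrape result to `store.removeDuplicate(sitemap._id, request.distinct, finalResult)`, whose body is not part of the diff shown here. As a rough illustration of what a "distinct" step over JSON records could look like, here is a hedged sketch; `distinctRecords` and its `keys` parameter are hypothetical and are not claimed to match the store's behaviour.

```javascript
// Hypothetical stand-in for a "distinct" step (not the extension's
// store.removeDuplicate): keeps the first occurrence of each record.
// If `keys` is given, only those columns are compared; otherwise every
// column of the record is compared, independent of property order.
function distinctRecords(records, keys) {
  const seen = new Set();
  const result = [];
  for (const record of records) {
    const columns =
      keys && keys.length > 0 ? keys : Object.keys(record).sort();
    // Serialize the compared columns into a stable fingerprint string.
    const fingerprint = JSON.stringify(columns.map((k) => [k, record[k]]));
    if (!seen.has(fingerprint)) {
      seen.add(fingerprint);
      result.push(record);
    }
  }
  return result;
}

// Usage: the second record repeats the first with a different key order.
const rows = [
  { title: "A", url: "/a" },
  { url: "/a", title: "A" },
  { title: "B", url: "/b" },
];
const unique = distinctRecords(rows);
const uniqueByTitle = distinctRecords(rows, ["title"]);
```

Fingerprinting with `JSON.stringify` is simple but treats values such as `undefined` and `null` as equal. A production version might need a stricter comparison.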
4 changes: 4 additions & 0 deletions extension/content_script/content_script.js
@@ -17,6 +17,7 @@ chrome.runtime.onMessage.addListener(
console.log("received data-preview extraction request", request);
var extractor = new DataExtractor(request);
var deferredData = extractor.getSingleSelectorData(request.parentSelectorIds, request.selectorId);

deferredData.done(function(data){
console.log("dataextractor data", data);
sendResponse(data);
@@ -36,6 +37,9 @@ chrome.runtime.onMessage.addListener(
});

return true;
}else if(request.notice){
console.log("Notification: ", request.message);

}
}
);
13 changes: 13 additions & 0 deletions extension/content_script/scrollToBottom.js
@@ -0,0 +1,13 @@
chrome.runtime.onMessage.addListener(function(message, sender, sendResponse) {
    if(message.run){
        var i = 0;
        var x = setInterval(function(){
            window.scrollTo(0, i * 100);
            i++;
            if(i * 100 >= document.body.scrollHeight){
                clearInterval(x);
                chrome.runtime.sendMessage({antilazyloading: true});
            }
        }, 100);
    }
});
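
The stepping logic in `scrollToBottom.js` above can be looked at separately from the browser. The sketch below models it as a pure function that returns the offsets the interval callback would scroll to; `scrollOffsets`, `getScrollHeight`, and `maxSteps` are hypothetical names, and the step cap is an added safeguard that the original script does not have.

```javascript
// Hypothetical model of the step-wise scroll in scrollToBottom.js.
// `getScrollHeight` is called after every step, so a page that grows as
// lazy-loaded content arrives keeps the loop going, as in the original.
function scrollOffsets(getScrollHeight, stepPx = 100, maxSteps = 1000) {
  if (stepPx <= 0) {
    throw new RangeError("stepPx must be positive");
  }
  const offsets = [];
  let i = 0;
  do {
    offsets.push(i * stepPx); // corresponds to window.scrollTo(0, i * 100)
    i++;
  } while (i * stepPx < getScrollHeight() && i < maxSteps);
  return offsets;
}

// Usage: a fixed 250px page, then a page that grows once after the first step.
const fixedPage = scrollOffsets(() => 250);
let height = 150;
const growingPage = scrollOffsets(() => {
  const current = height;
  height = 350; // simulate lazy-loaded content extending the page
  return current;
});
```

Re-reading the height on every step is what lets this approach trigger lazy loading: each step can extend the page, so the end of the loop moves further out.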
2 changes: 1 addition & 1 deletion extension/devtools/devtools_init_page.js

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 2 additions & 0 deletions extension/devtools/devtools_scraper_panel.html
@@ -34,6 +34,8 @@
<script src="../scripts/Controller.js"></script>

<script src="../scripts/App.js"></script>

<script src="../scripts/RunTask.js"></script>
</head>
<body></body>
</html>