Merged
177 commits
1ecac9e
Update README.md
blahah May 26, 2014
e05e2a3
update example
blahah May 27, 2014
a35fecd
fix README formatting and typo
blahah May 27, 2014
acb9e7c
add html and text special attributes to README
blahah May 29, 2014
a2a6d26
Update README.md
blahah Jun 1, 2014
ebea30a
Create science_direct.json
ianthe Jun 19, 2014
f6380f9
Merge pull request #5 from ianthe/master
blahah Jun 19, 2014
40b80a4
travis setup
Jun 22, 2014
13ef95b
auto test generation script
Jun 22, 2014
073a4da
move scrapers to subdir
Jun 22, 2014
4962bab
test generator script fixes
Jun 22, 2014
cdb542b
self-populating tests and peerj example
Jun 22, 2014
3118404
Merge branch 'master' of github.com:ContentMine/journal-scrapers
Jun 22, 2014
0ee30be
move sciencedirect to scrapers
Jun 22, 2014
c1cafdd
fix tmpdir use
Jun 22, 2014
1cdbd3b
debug test generator tmpdir error
Jun 22, 2014
9c82fa0
test set for peerj scraper
Jun 22, 2014
4e94869
fix test generator - now working
Jun 22, 2014
ad734f3
fix test runner - now working
Jun 22, 2014
86c3735
attempted fix for travis dependency install
Jun 22, 2014
1ce66d6
remove unneeded prints from tests
Jun 22, 2014
bf2c926
tests for plos scraper
Jun 22, 2014
9b1f20d
another attempted travis install fix
Jun 22, 2014
aa0cfc6
delete wayward results file
Jun 22, 2014
998b0de
add .gitignore
Jun 22, 2014
fb463a8
add travis badge and explanation to README
Jun 22, 2014
ab88faf
tidy formatting in README
blahah Jun 22, 2014
8e9d287
add science direct tests
Jun 22, 2014
04902e3
Merge branch 'master' of https://github.com/ContentMine/journal-scrapers
Jun 22, 2014
da2a0f2
add CC0 license
blahah Jun 23, 2014
5ce546e
matching badges
Jun 23, 2014
d1f3ddf
fix badge address
Jun 23, 2014
d5f7b08
coverage reporting for scrapers
Jun 23, 2014
ac46bc2
fix coverage reporting
Jun 23, 2014
29b9cce
another coveralls fix
Jun 23, 2014
ed42173
another coveralls fix
Jun 23, 2014
622191f
mend broken curl command
Jun 23, 2014
1a82584
remove empty file
Jun 23, 2014
396b658
fix coveralls CURL command
Jun 23, 2014
641c260
add coveralls to README
Jun 23, 2014
a443c18
make travis badges consistent
Jun 23, 2014
09c6410
another CURL cmd fix
Jun 23, 2014
04d5300
add contribution instructions
Jun 23, 2014
3d6e3a7
fix typo; finalise self-testing (fixes #4)
Jun 23, 2014
2340fa3
add TOC to README
Jun 23, 2014
222a00a
prettify TOC
Jun 23, 2014
9d189bc
tidy TOC
Jun 23, 2014
9ffe438
coveralls submission recognises travis environment
Jun 24, 2014
34c4606
typo
Jun 24, 2014
e60fe56
peerj scraper now implements all ContentMine fields
Jul 2, 2014
7317635
fix contributing doc links
blahah Jul 2, 2014
905c04c
run tests in debug mode
Jul 3, 2014
4370307
Merge branch 'master' of github.com:ContentMine/journal-scrapers
Jul 3, 2014
7198106
install libfontconfig before running travis tests
Jul 4, 2014
a5746a3
fix broken peerj tests
blahah Jul 13, 2014
2901ab4
link out to scraperJSON
Jul 13, 2014
3746fcc
handle mac MD5hash
Jul 13, 2014
187846b
MDPI full
Jul 13, 2014
4f239fa
Extract abstract from PLOS pages
CristianCantoro Jul 16, 2014
62752cd
Merge pull request #9 from CristianCantoro/master
blahah Jul 17, 2014
f64dd8d
add renaming to all scrapers
Jul 17, 2014
cec7cf3
make test line counting more accurate
Jul 17, 2014
a736045
Merge branch 'master' of github.com:ContentMine/journal-scrapers
Jul 17, 2014
dd8f6e0
update PLOS with fulltext xml and new tests
Jul 17, 2014
f50850e
add fulltext_xml to MDPI
Jul 17, 2014
edb7635
add fulltext xml to compatible scrapers
Jul 17, 2014
0ecc002
generate tests for MDPI
blahah Jul 17, 2014
fac1fdc
elife scraper
Jul 17, 2014
ab4c14b
Merge branch 'master' of github.com:ContentMine/journal-scrapers
Jul 17, 2014
1c3648c
elife tests
blahah Jul 17, 2014
0f25cdc
update test coverage calculation with element names only
Jul 17, 2014
d5f611d
Merge branch 'master' of github.com:ContentMine/journal-scrapers
Jul 17, 2014
34dafb0
Use Ruby Digest::MD5 instead of md5sum
mcs07 Jul 22, 2014
d0a3652
added Acta Cryst E scraper; probably works with other IUCr journals b…
petermr Jul 22, 2014
aa5a6dd
added ACS(JACS) scraper; fails on resolving relative/absolute links
petermr Jul 23, 2014
1cf4191
remove broken scrapers to PRs
Jul 24, 2014
0803636
ACS Scraper - requires relative URL parsing from quickscrape
pbulsink Sep 1, 2014
db82379
AAAS (Science) Scraper - requires relative URL parsing from quickscrape
pbulsink Sep 1, 2014
a9e20aa
Nature Scraper - requires relative URL parsing from quickscrape
pbulsink Sep 1, 2014
6c721ff
Additional use of meta tags
pbulsink Sep 1, 2014
79bbf52
Wiley Scraper
pbulsink Sep 1, 2014
6194072
removed renames from download - doesn't properly work in urllist cases
pbulsink Sep 2, 2014
16da68f
Added Springer Scraper
pbulsink Sep 2, 2014
957baa2
Taylor & Francis Scraper - Requires relative paths from quickscrape
pbulsink Sep 2, 2014
f124a0a
URLLists for written scrapers
pbulsink Sep 2, 2014
c9ef714
Built test .json files
pbulsink Sep 2, 2014
53700b0
Nature updates & Test
pbulsink Sep 2, 2014
d99cece
Removed Nature, ready for pull
pbulsink Sep 2, 2014
e6d6636
Removed all new scrapers except for Nature
pbulsink Sep 2, 2014
b6ba671
Added PNAS scraper & test
pbulsink Sep 3, 2014
159ec31
RSC Scrapers
pbulsink Sep 3, 2014
e6c738d
JAMA and related sub-journal scraper
pbulsink Sep 3, 2014
6ab5d56
Merge pull request #23 from pbulsink/ready_to_pull
Sep 9, 2014
709796c
Merge pull request #24 from pbulsink/nature
Oct 2, 2014
062c419
acs must be headless
blahah Oct 2, 2014
f1b7463
Merge pull request #17 from mcs07/filehash
Oct 2, 2014
97efc91
Merge branch 'master' of github.com:ContentMine/journal-scrapers
blahah Oct 2, 2014
b211d99
Merge branch 'master' of github.com:ContentMine/journal-scrapers
blahah Oct 2, 2014
38d63e0
temporarily remove broken jama scraper
blahah Oct 2, 2014
15bd207
remove broken science direct scraper
blahah Oct 2, 2014
03dfee1
update tests for latest quickscrape
blahah Oct 2, 2014
fbfc8c5
specify quickscrape version
blahah Oct 2, 2014
f0f8c10
bump quickscrape dependency
Oct 6, 2014
28c7415
don't include file hash results in coverage
blahah Oct 6, 2014
f5c22d0
Merge branch 'master' of github.com:ContentMine/journal-scrapers
blahah Oct 6, 2014
83d7ee0
migrate to nvm
Oct 7, 2014
a59d3f3
restart bash after nvm install
Oct 7, 2014
d0bf3f1
travis comes with nvm!
Oct 7, 2014
5a0ede7
need latest 0.10.32
Oct 7, 2014
e32730b
use latest npm
Oct 7, 2014
15e7f35
peerj scraper is complete - showcase of latest scraperJSON
blahah Jan 11, 2015
b5992bb
Merge branch 'master' of github.com:ContentMine/journal-scrapers
blahah Jan 11, 2015
6dba241
bump quickscrape dependency version
blahah Jan 11, 2015
120cd2a
fix license path, improve copyright
blahah Jan 11, 2015
8bb4071
tests for license and copyright improvements
blahah Jan 11, 2015
23cbdd4
added bmc and trialsjournal scrapers
petermr Jan 12, 2015
6ed03c4
Adding the first version of my Acta Cryst. E scrapers. Four scrapers
Jan 23, 2015
20e4c77
Adding 5 test URLs from Acta Cryst. E to test scrapers for the
Jan 23, 2015
6543844
Adding 5 more test URLs for IUCr Acta Cryst. E papers (this time
Jan 23, 2015
1d189b8
Updating the Acta Cryst. E body scraper (it now correctly picks out
Jan 23, 2015
1c8db30
Fixing the last IUCr Acta Cryst. E scraper, acta-e-scripts.json, which
Jan 23, 2015
520f518
Adding most of the metadata extraction "rules" to the
Jan 24, 2015
5d861c1
Changing "author_emails" to "corresponding_author_email" in
Jan 24, 2015
2b3967d
Adding a rule to extract HTML text with citations from the Acta
Jan 24, 2015
d98f6dd
Adding rule to downoad figures (aka schemes) from Acta Cryst. E
Jan 24, 2015
b2239d3
Now longer downloading the frameset HTML in
Jan 24, 2015
db3f043
Reformatting the "scrapers/acta-e-doi.json" code using Emacs.
sauliusg Jan 24, 2015
4154c9a
Transfering all DC based metadata tag extraction from
sauliusg Jan 24, 2015
52e0dde
Completing "scrapers/acta-e-index.json" to extract DC.* and citation_*
sauliusg Jan 24, 2015
62774ee
Adding all DC.* extraction rules to "scrapers/acta-e-scripts.json".
sauliusg Jan 24, 2015
90e99f8
Adidng comment on the "scrapers/acta-e-scripts.json" scraper input.
sauliusg Jan 24, 2015
6919497
Adding test URLs from other (non-open-access) Acta Cryst. journals.
sauliusg Jan 24, 2015
19935c8
Deleting the test file with non-OA IUCr journal URLs, since tests on
sauliusg Jan 24, 2015
62cc846
trunk/ (saulius@koala.ibt.lt)
sauliusg Jan 26, 2015
1f519fa
trunk/ (saulius@koala.ibt.lt)
sauliusg Jan 26, 2015
16275a0
trunk/ (saulius@koala.ibt.lt)
sauliusg Jan 26, 2015
bd636a5
updated plos scraper
blahah Feb 4, 2015
2b62bb4
Merge branch 'master' of github.com:ContentMine/journal-scrapers
blahah Feb 4, 2015
2081cb6
Update PLoS scraper to fix fulltext xml capture
blahah Apr 11, 2015
f2d4d6f
Fixed typo in MDPI scraper
robintw Apr 18, 2015
59ed0f2
Merge pull request #32 from robintw/robintw-mdpi-typo
Apr 18, 2015
1e2c214
Merge pull request #26 from sauliusg/master
petermr Jun 20, 2015
a21a5d3
Adding a template for new IUCr journal layout (http://journals.new.iu…
merkys Jun 20, 2015
bceca75
Implementing the extraction of abstract for new IUCr layout.
merkys Jun 20, 2015
91f3782
Extracting references.
merkys Jun 21, 2015
5502981
Fixing 'fulltext_html' and 'description' fields, adding 'abstract_html'.
merkys Jun 21, 2015
6d44bb8
Bringing back some of the elements from acta-e-body.json.
merkys Jun 21, 2015
e8cbafe
Adding 'htmlBodyAuthors' and 'htmlBodyAuthorUrls'.
merkys Jun 21, 2015
03a1898
Ceasing to download fulltext HTML, since the download link points to …
merkys Jun 21, 2015
ccaad3c
Downloading supplementary CIF files.
merkys Jun 21, 2015
4369f16
Adjusting name of the downloaded CIF file to one used in acta-e-doi.j…
merkys Jun 21, 2015
de25fd2
first pass at Nature scraper
cnjr2 Jun 21, 2015
62a892d
Merge pull request #36 from merkys/master
petermr Jun 21, 2015
038e950
Merge pull request #39 from cnjr2/master
petermr Jun 21, 2015
4782e3d
Rename PNAS fulltext downloads (ContentMine/quickscrape#54)
blahah Aug 8, 2015
e674803
Fix fulltext HTML capture for PNAS
blahah Aug 8, 2015
f79917c
IJSEM scraper
blahah Aug 10, 2015
659573b
First run at Psychological Science scraper (added json file, test lin…
Jul 2, 2015
e91dc98
Modified psychologicalscience.json with some dynamic selectors
Jul 2, 2015
acaf40a
Ran tests for Psych Science scraper
Jul 25, 2015
1be754a
Upgrade to new container-based Travis CI
blahah Aug 16, 2015
c92010e
Remove unnecessary setup
blahah Aug 16, 2015
252e17f
Flushing test changes
blahah Aug 16, 2015
83c3cfd
Merge branch 'master' of https://github.com/contentmine/journal-scrapers
Aug 30, 2015
83c9d92
Rename psychscience scraper to sage scraper
Sep 22, 2015
2cc51db
Minor update sage
Oct 1, 2015
6349134
Update test links sage
Oct 2, 2015
bc83997
Add springer test urls
Oct 2, 2015
93f5287
Add taylorfrancis test urls
Oct 2, 2015
066890b
Add wiley test urls
Oct 2, 2015
4f36179
Add APA test urls
Oct 2, 2015
b186403
First run at wiley scraper (fails figure download and pdf download)
Oct 2, 2015
a215a42
Add elsevier, wiley, springer, sage; init apa
Oct 26, 2015
9d36fa4
minor changes to elsevier and springer, while testing
Oct 30, 2015
7e861d1
Add TaylorFrancis scraper
chartgerink Apr 14, 2016
187b323
Minor updates to springer and taylorfrancis definitions
Apr 14, 2016
a8ebe3e
Check and update taylorfrancis, sage, elsevier, springer, wiley scrapers
chartgerink Aug 18, 2016
48 changes: 0 additions & 48 deletions scrapers/apa.json

This file was deleted.

42 changes: 42 additions & 0 deletions scrapers/bmc.json
@@ -1,5 +1,9 @@
{
<<<<<<< HEAD
"url": "www\\.biomedcentral\\.com",
=======
"url": "biomedcentral\\.com",
>>>>>>> 5ba53bf6d08fc9d102cccc623c5511502f7711c1
"elements": {
"publisher": {
"selector": "//meta[@name='citation_publisher']",
@@ -38,11 +42,19 @@
"attribute": "content"
},
"description": {
<<<<<<< HEAD
"selector": "//meta[@name='description']",
"attribute": "content"
},
"abstract": {
"selector": "//meta[@name='description']",
=======
"selector": "//meta[@name='dc.description']",
"attribute": "content"
},
"abstract": {
"selector": "//meta[@name='dc.description']",
>>>>>>> 5ba53bf6d08fc9d102cccc623c5511502f7711c1
"attribute": "content"
},
"fulltext_html": {
@@ -59,17 +71,46 @@
"rename": "fulltext.pdf"
}
},
<<<<<<< HEAD
"fulltext_xml": {
"selector": "//a[.='Download XML']",
"attribute": "href",
"download": {
"rename": "fulltext.xml"
}
},
"supplementary_material": {
"selector": "//link[starts-with(@title,'Additional file')]",
=======
"supplementary_material": {
"selector": "//a[@class='filename']",
>>>>>>> 5ba53bf6d08fc9d102cccc623c5511502f7711c1
"attribute": "href",
"download": true
},
"figure": {
<<<<<<< HEAD
"selector": "//div[@class='fig']/p/a/img",
=======
"selector": "//figure[@class='Figure']/div/img",
>>>>>>> 5ba53bf6d08fc9d102cccc623c5511502f7711c1
"attribute": "src",
"download": true
},
"figure_caption": {
<<<<<<< HEAD
"selector": "//div[@class='fig']//strong"
},
"license": {
"selector": "//p[a/@href='http://creativecommons.org/licenses/by/4.0']"
},
"copyright": {
"selector": "//p[contains(.,'licensee')]"
}
}
}

=======
"selector": "//figure[@class='Figure']/figcaption"
},
"license": {
@@ -81,3 +122,4 @@
}
}
}
>>>>>>> 5ba53bf6d08fc9d102cccc623c5511502f7711c1
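The bmc.json hunks above still contain unresolved Git conflict markers (`<<<<<<<`, `=======`, `>>>>>>>`), so the file as shown would not parse as JSON. A minimal sketch of a check that could catch this before merging, in Python. The names `find_conflict_markers` and `is_valid_json`, and the trimmed `sample` text, are illustrative assumptions and not part of this repository:

```python
import json
import re

# Git writes these at the start of a line when a merge is left unresolved.
CONFLICT_MARKER = re.compile(r"^(<{7}|={7}|>{7})( |$)")


def find_conflict_markers(text):
    """Return 1-based line numbers that start with a Git conflict marker."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if CONFLICT_MARKER.match(line)
    ]


def is_valid_json(text):
    """Return True if the text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except ValueError:
        return False
    return True


# Trimmed, hypothetical excerpt modelled on the bmc.json diff above.
sample = r'''{
<<<<<<< HEAD
"url": "www\\.biomedcentral\\.com",
=======
"url": "biomedcentral\\.com",
>>>>>>> 5ba53bf6d08fc9d102cccc623c5511502f7711c1
"elements": {}
}'''

# The same excerpt with one side of the conflict kept.
resolved = r'{"url": "biomedcentral\\.com", "elements": {}}'

print(find_conflict_markers(sample))
print(is_valid_json(sample), is_valid_json(resolved))
```

Running a check like this over `scrapers/*.json` in CI would flag the file before the test generator tries to load it.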
25 changes: 25 additions & 0 deletions scrapers/elsevier.json
@@ -0,0 +1,25 @@
{
"url": "sciencedirect\\.com",
"elements": {
"title": {
"selector": "/html/head/title",
"attribute": "text"
},
"fulltext_html": {
"selector": "//link[contains(@rel, 'canonical')]",
"attribute": "href",
"download": {
"rename": "fulltext.html"
}
},

"fulltext_pdf": {
"selector": "//*[contains(@id, 'pdfLink')]",
"attribute": "href",
"download": {
"rename": "fulltext.pdf"
}
}
}
}
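The definitions in this diff share one shape: a top-level `url` pattern, an `elements` object, and per-element `selector`, optional `attribute`, and optional `download` keys. A rough structural check can be sketched in Python as below. The shape is inferred only from the files shown here, not from the scraperJSON specification, and `check_definition` is a hypothetical helper. Python's `re` is used as a stand-in for whatever regex engine the scraper tool applies to `url`:

```python
import json
import re


def check_definition(definition):
    """Return a list of problems found in a scraper definition dict."""
    problems = []

    url = definition.get("url")
    if not isinstance(url, str):
        problems.append("missing 'url' pattern")
    else:
        try:
            re.compile(url)
        except re.error as exc:
            problems.append(f"'url' is not a valid regex: {exc}")

    elements = definition.get("elements")
    if not isinstance(elements, dict) or not elements:
        problems.append("missing or empty 'elements'")
        return problems

    for name, rule in elements.items():
        if not isinstance(rule, dict) or not isinstance(rule.get("selector"), str):
            problems.append(f"element '{name}' has no 'selector'")
            continue
        download = rule.get("download")
        # In the files above, 'download' is either true or an object with 'rename'.
        if download is not None and not isinstance(download, (bool, dict)):
            problems.append(f"element '{name}' has an invalid 'download' value")

    return problems


# Excerpt of the elsevier.json definition above, as a JSON string.
example = json.loads(r'''{
  "url": "sciencedirect\\.com",
  "elements": {
    "fulltext_pdf": {
      "selector": "//*[contains(@id, 'pdfLink')]",
      "attribute": "href",
      "download": {"rename": "fulltext.pdf"}
    }
  }
}''')

print(check_definition(example))
```

A check like this only covers structure; whether a selector still matches the live publisher page needs the URL-based tests.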

10 changes: 8 additions & 2 deletions scrapers/sage.json
@@ -1,11 +1,13 @@
{

"url": ".*sagepub.*\\.com",
"headless": true,
"elements": {
"publisher": {
"selector": "//meta[@name='DC.Publisher']",
"attribute": "content"
},

"title": {
"selector": "//meta[@name='DC.Title']",
"attribute": "content"
@@ -15,24 +17,28 @@
"attribute": "content"
},
"date": {
- "selector": "//meta[@name='citation_online_date']",
+ "selector": "//meta[@name='DC.Date']",
"attribute": "content"
},

"doi": {
"selector": "//meta[@name='citation_doi']",
"attribute": "content"
},

"issn": {
"selector": "//meta[@name='citation_issn']",
"attribute": "content"
},

"fulltext_pdf": {
"selector": "//meta[@name='citation_pdf_url']",
"attribute": "content",
"download": {
"rename": "fulltext.pdf"
}
},

"fulltext_html": {
"selector": "//meta[@name='citation_fulltext_html_url']",
"attribute": "content",
@@ -41,4 +47,4 @@
}
}
}
- }
+ }
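The `url` value is a regular expression stored inside a JSON string, which is why the dot before `com` is written `\\.`: JSON decoding turns it into `\.`, a literal dot. A small Python sketch of how the sage pattern behaves once decoded. It assumes an unanchored search, and the sample URLs are hypothetical; whether the scraper tool anchors the pattern is not stated in this diff:

```python
import json
import re

# The "url" line from sage.json, wrapped in a minimal JSON object.
raw = r'{"url": ".*sagepub.*\\.com"}'
pattern = json.loads(raw)["url"]


def matches(pattern, url):
    """Return True if the decoded pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


print(pattern)
print(matches(pattern, "http://pss.sagepub.com/content/1/2/3"))
print(matches(pattern, "http://link.springer.com/article/x"))
```

Because the pattern starts with `.*` and has no end anchor, it accepts any URL that contains `sagepub` followed later by `.com`, including subdomains.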
6 changes: 5 additions & 1 deletion scrapers/springer.json
@@ -1,4 +1,4 @@
- {
+ {
"url": ".*springer.*\\.com",
"headless": true,
"elements": {
@@ -10,6 +10,7 @@
"selector": "//meta[@name='citation_title']",
"attribute": "content"
},

"authors": {
"selector": "//meta[@name='citation_author']",
"attribute": "content"
@@ -18,10 +19,12 @@
"selector": "//meta[@name='citation_online_date']",
"attribute": "content"
},

"doi": {
"selector": "//meta[@name='citation_doi']",
"attribute": "content"
},

"issn": {
"selector": "//meta[@name='citation_issn']",
"attribute": "content"
@@ -33,6 +36,7 @@
"rename": "fulltext.pdf"
}
},

"fulltext_html": {
"selector": "//meta[@name='citation_fulltext_html_url']",
"attribute": "content",
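Most elements in the sage and springer definitions pair an XPath `selector` on a `<meta>` tag with `"attribute": "content"`. The sketch below shows what such a pair selects, using Python's standard-library ElementTree on a small well-formed page. The page content, the DOI, and the `extract` helper are made up for illustration. ElementTree supports only a subset of XPath and needs well-formed markup, so this approximates the behaviour and is not the scraper tool's own implementation:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page head with citation_* meta tags.
PAGE = """<html><head>
<meta name="citation_title" content="An example article"/>
<meta name="citation_author" content="A. Author"/>
<meta name="citation_author" content="B. Author"/>
<meta name="citation_doi" content="10.1000/example"/>
</head><body/></html>"""


def extract(root, selector, attribute):
    """Return the given attribute of every node matched by the selector."""
    # ElementTree only accepts relative paths, so "//meta[...]" becomes ".//meta[...]".
    return [node.get(attribute) for node in root.findall("." + selector)]


root = ET.fromstring(PAGE)
authors = extract(root, "//meta[@name='citation_author']", "content")
doi = extract(root, "//meta[@name='citation_doi']", "content")

print(authors)
print(doi)
```

A selector that matches several nodes, such as `citation_author`, yields one value per node, which is how a single element definition can capture a full author list.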