
Added Psychological Science scraper #40

Merged
tarrow merged 177 commits into ContentMine:master from chartgerink:master
Aug 18, 2016

Conversation

@chartgerink
Contributor

Hi,

I attempted to write my first scraper, according to your scraperJSON template, and succeeded for the most part. I have also included test links. I tried to scrape as much information as possible, and include some of my problems below, FYI.

Kind regards,
Chris

  1. Introduction is not a defined section but just includes paragraph numbers (SAGE thing..)
  2. Supplementary materials are included at a separate location AND include all files of one issue. Have not discovered an easy way to download these (also a SAGE thing..)
  3. I have not yet succeeded in downloading figures and tables.
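
For readers unfamiliar with the format, here is a minimal sketch of what a scraperJSON-style definition looks like and how it can be sanity-checked. It assumes the usual top-level `url` regex and an `elements` map whose entries carry an XPath `selector` plus optional `attribute` and `download` keys. The URL pattern, the element names, and the `citation_*` meta-tag selectors are illustrative assumptions, not the contents of the scraper in this PR, and `check_scraper` and `applies_to` are hypothetical helpers rather than part of any ContentMine tool.

```python
import json
import re

# Hypothetical, minimal scraperJSON-style definition. The URL pattern and the
# XPath selectors are illustrative assumptions, not the scraper from this PR.
SCRAPER = {
    "url": r"pss\.sagepub\.com",
    "elements": {
        "title": {
            "selector": "//meta[@name='citation_title']",
            "attribute": "content",
        },
        "fulltext_pdf": {
            "selector": "//meta[@name='citation_pdf_url']",
            "attribute": "content",
            "download": True,
        },
    },
}


def check_scraper(definition):
    """Return a list of problems found in a scraperJSON-style dict."""
    problems = []
    url = definition.get("url")
    if not isinstance(url, str):
        problems.append("missing 'url' pattern")
    else:
        try:
            re.compile(url)  # the 'url' value is treated as a regex
        except re.error as exc:
            problems.append(f"invalid 'url' regex: {exc}")
    elements = definition.get("elements")
    if not isinstance(elements, dict) or not elements:
        problems.append("missing or empty 'elements'")
        return problems
    for name, element in elements.items():
        # Every element needs at least an XPath selector.
        if not isinstance(element, dict) or "selector" not in element:
            problems.append(f"element '{name}' has no 'selector'")
    return problems


def applies_to(definition, page_url):
    """True if the definition's 'url' regex matches the given page URL."""
    return re.search(definition["url"], page_url) is not None


if __name__ == "__main__":
    print(json.dumps(SCRAPER, indent=2))  # the JSON that would go in the file
    print(check_scraper(SCRAPER))
```

A check like this only catches structural slips. Whether the selectors actually match a publisher's pages still has to be verified against the test links.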

@tarrow
Contributor

tarrow commented Jun 6, 2016

I rewrote this as #44 with a rebase so that it could be merged into master (it needed a rebase) before merging the 176 commits.

@chartgerink
Contributor Author

chartgerink commented Aug 18, 2016

I did some additional checking of the scrapers. I removed tf.json because it conflicted with taylorfrancis.json (which is a clearer filename, I think) and because taylorfrancis.json performed better. I incorporated the code from tf.json for the tables. TaylorFrancis really acts oddly, so we need to check that at some point.

I also checked and updated the wiley, sage, springer, and elsevier scrapers. Elsevier pages contain almost no metadata, so the scraper only uses HTML and PDF extraction. I also incorporated some changes, but they eliminated a lot of metadata scraping and renamed elements. Are we still adhering to the scraperJSON standard, or has that become a thing of the past?

Sorry for the number of commits; I forgot about this. I can also create a new fork to make things easier and open a new PR. Let me know.

@tarrow
Contributor

tarrow commented Aug 18, 2016

We're still adhering to the scraperJSON standard, not least because QS and thresher are the reference implementations, so basically if it works, it's scraperJSON :).

If you could create a new branch from the current origin/master and cherry-pick the changes you've just made onto it, that would be awesome! Otherwise I can do that and make another PR. Let me know if you have problems.
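
The branch-and-cherry-pick workflow suggested here can be sketched roughly as follows. This is only an illustration: the branch name `scraper-updates` and the commit hash `abc1234` are hypothetical placeholders, and `cherry_pick_plan` and `run_plan` are made-up helpers that assemble standard `git fetch`, `git checkout -b`, and `git cherry-pick` invocations. By default the commands are printed rather than executed.

```python
import subprocess


def cherry_pick_plan(new_branch, commits, base="origin/master"):
    """Build the git commands for moving selected commits onto a fresh branch."""
    return [
        ["git", "fetch", "origin"],                   # update origin/master
        ["git", "checkout", "-b", new_branch, base],  # new branch from base
        ["git", "cherry-pick", *commits],             # replay chosen commits
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failure


if __name__ == "__main__":
    # Placeholder branch name and commit hash, for illustration only.
    run_plan(cherry_pick_plan("scraper-updates", ["abc1234"]))
```

The resulting branch carries only the selected commits on top of the current master, which keeps the follow-up PR small.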

@tarrow
Contributor

tarrow commented Aug 18, 2016

Great! I'll merge now! :)

@tarrow tarrow merged commit 2cf1206 into ContentMine:master Aug 18, 2016