GitHub - cognitive-metascience/review_crawler: Crawler / file parser for open peer reviews in open access journals (MDPI, PLoS, eLife)

Installation & usage

Clone this repository:

git clone https://github.com/cognitive-metascience/review_crawler.git

and change your working directory to this folder:

cd review_crawler

Then install the required Python packages:

pip install -r requirements.txt

PLOS crawler

This crawler uses the allofplos library to parse and extract metadata from articles in the PLOS corpus. The corpus is distributed as a ZIP file, which is downloaded to your local storage before the crawler runs for the first time. Be warned: as of 15 August 2023, this archive contains nearly 8 GB of data, consisting of articles in JATS-standard XML format.

First, make sure your current working directory is set to review_crawler, as above.

Run the following command to download the fork of allofplos that plos_crawler requires:

git submodule update --init allofplos

The first time you run the crawler, you will need to use the --download flag to download the PLOS corpus to your device, as in the example below:

python -m plos_crawler --download

Alternatively, you can manually download the PLOS corpus from this link. Place the downloaded file into the folder review_crawler/input, without changing its filename.

Again, keep in mind that the downloaded zip file is very large. Make sure you have enough free disk space before pressing Enter.
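Since the download is large, it can help to check free disk space first. The following is a minimal sketch using only the Python standard library; the function name `has_free_space`, the 8 GB default (taken from the figure quoted above), and the 1.5x safety margin are assumptions for illustration and are not part of review_crawler.

```python
import shutil


def has_free_space(path=".", required_gb=8.0, margin=1.5):
    """Return True if `path` has at least required_gb * margin GiB free.

    The default of 8 GB mirrors the corpus size quoted in this README;
    the 1.5x margin is an arbitrary safety factor (an assumption).
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * margin * 1024**3


if __name__ == "__main__":
    if has_free_space("."):
        print("Enough free space for the PLOS corpus download.")
    else:
        print("Not enough free space; free up some disk before using --download.")
```

Run it from the review_crawler folder so that the check applies to the drive the corpus will be stored on.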

The crawler will take some time to process all this data. When it finishes, you should find the results in the output/plos folder:

in the folder all_articles: metadata for all articles in JSON format,
in the folder reviewed_articles: one subfolder for each reviewed article, containing its metadata in JSON, the article itself in XML, and a subfolder sub-articles with the metadata and XMLs of reviews, decision letters, and author responses, as well as any supplementary materials (usually DOCX and PDF files).
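To get a quick overview of what the crawler produced, the output layout described above can be walked with a few lines of Python. This is a sketch under assumptions: it takes the folder names all_articles, reviewed_articles, and sub-articles from the description above, and it assumes the JSON files in all_articles sit directly in that folder. The function names are hypothetical and not part of review_crawler.

```python
import json
from pathlib import Path


def summarize_plos_output(output_dir="output/plos"):
    """Count metadata files and reviewed-article folders in the output tree."""
    root = Path(output_dir)
    all_dir = root / "all_articles"
    reviewed_dir = root / "reviewed_articles"

    # Assumption: one JSON metadata file per article, directly in all_articles.
    metadata_files = list(all_dir.glob("*.json")) if all_dir.is_dir() else []
    # One subfolder per reviewed article, as described in the README.
    reviewed = [p for p in reviewed_dir.iterdir() if p.is_dir()] if reviewed_dir.is_dir() else []
    with_subs = [p for p in reviewed if (p / "sub-articles").is_dir()]

    return {
        "all_articles": len(metadata_files),
        "reviewed_articles": len(reviewed),
        "with_sub_articles": len(with_subs),
    }


def load_metadata(json_path):
    """Load a single metadata file as a Python object."""
    return json.loads(Path(json_path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    print(summarize_plos_output())
```

The same functions should work for the eLife results if you pass the corresponding output folder, provided its layout matches.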

eLife crawler

This crawler is very similar to the one for PLOS, as both parse articles in JATS format. It also uses some parts of the allofplos library, so you need to initialize that submodule first, as described above.
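In JATS, material attached to an article, such as decision letters and author replies, is typically encoded as `sub-article` elements carrying an `article-type` attribute. The sketch below shows how such elements can be listed with the standard library. It is not the crawler's own code: the sample XML is heavily simplified, and the `article-type` values in it are illustrative assumptions that may differ from what a given publisher uses.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical JATS fragment (not taken from a real article).
SAMPLE = """<article article-type="research-article">
  <front>
    <article-meta>
      <title-group><article-title>Example</article-title></title-group>
    </article-meta>
  </front>
  <sub-article article-type="decision-letter" id="sa1"><front-stub/></sub-article>
  <sub-article article-type="reply" id="sa2"><front-stub/></sub-article>
</article>"""


def list_sub_articles(xml_text):
    """Return (id, article-type) pairs for every sub-article in a JATS document."""
    root = ET.fromstring(xml_text)
    return [(sa.get("id"), sa.get("article-type")) for sa in root.iter("sub-article")]


if __name__ == "__main__":
    for sub_id, sub_type in list_sub_articles(SAMPLE):
        print(sub_id, sub_type)
```

An article with no `sub-article` elements yields an empty list, which is one simple way to tell reviewed articles apart from the rest.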

Download a zip file containing the eLife corpus directly from their GitHub repository (click on 'Code' -> 'download ZIP') and place it in the review_crawler/input folder without changing its filename.

Alternatively, you can clone the entire elife-article-xml submodule which contains uncompressed articles in XML (the corpus is updated daily and as of 27th of July 2022, it contains nearly 3 GB of data). In this case use the following command:

git submodule update --init elife-article-xml
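If you took the ZIP route instead, a quick sanity check of the downloaded archive can save a failed run. The sketch below counts the XML members of a zip file using the standard library; the function name is hypothetical, and the path to the archive is passed on the command line because its filename is not stated here.

```python
import sys
import zipfile


def count_xml_members(zip_path):
    """Return the number of entries in the archive whose name ends in .xml."""
    with zipfile.ZipFile(zip_path) as zf:
        return sum(1 for name in zf.namelist() if name.lower().endswith(".xml"))


if __name__ == "__main__":
    # Usage (file name is up to you): python check_zip.py input/<downloaded file>.zip
    if len(sys.argv) > 1 and zipfile.is_zipfile(sys.argv[1]):
        print(f"{count_xml_members(sys.argv[1])} XML files found in {sys.argv[1]}")
```

A count of zero, or an error when opening the file, suggests an incomplete download.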

Run the crawler from the command line like this:

python -m elife_crawler

In the output/elife folder you should find the results, in the same format as the ones for PLOS.

MDPI crawler

The MDPI crawler consists of two dedicated Scrapy spiders. For usage instructions, consult the README file in the crawling directory, which contains the Scrapy project.

License

BSD-2-Clause. See the LICENSE.txt file.

