MedSpider is a collection of scripts for scraping online conversations from health forums. It targets the online forums listed below, categorized by the type of interaction between Patients (P) and Medics (M).
Prerequisites
- Python 2.7
- lxml, installed via `pip install lxml==4.1.0`
- Pandas is also needed for some of the scrapers
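Since the scrapers assume this exact stack, a quick sanity check along the following lines can confirm the environment before running anything; the snippet is purely illustrative.

```python
# Verify the prerequisites above: Python 2.7 plus the pinned lxml and Pandas.
import sys
import lxml.etree
import pandas

print(sys.version)                # expect 2.7.x
print(lxml.etree.LXML_VERSION)    # expect (4, 1, 0, 0) with the pinned install
print(pandas.__version__)
```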
BMJ Doc2Doc Forum [M2M]
- Please note that the BMJ's Doc2Doc forum is discontinued; the scraper therefore uses cached web pages from the Wayback Machine/Internet Archive
- Specify the output directory to write results to by editing the `doc2doc.py` file's main entry point, e.g. `Spidey().crawl('doc2doc')` (default is `doc2doc` if not specified); a sketch of that entry point follows these steps
- Run the script via command line or terminal: `python doc2doc.py`, which will create tab-separated output files in the output directory you specified
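For orientation, the main entry point at the bottom of `doc2doc.py` presumably has roughly the shape below, and editing the argument to `crawl` is all that is needed. The `Spidey` class and its `crawl` method come from the step above; the directory name is an arbitrary example.

```python
# Hypothetical shape of the entry point at the end of doc2doc.py, where the
# Spidey class is defined; omitting the argument falls back to 'doc2doc'.
if __name__ == '__main__':
    Spidey().crawl('my_doc2doc_output')   # TSV files land in ./my_doc2doc_output
```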
DocCheck Blogs [M2M]
- This scraper requires registering a medic-related account on DocCheck
- Specify the output directory to write results to by editing the `doccheck.py` file's main entry point, e.g. `Spidey().crawl('doccheck')` (default is `doccheck` if not specified)
- Run the script via command line or terminal: `python doccheck.py`, which will create tab-separated output files in the specified directory: `blogs.tsv`, `comments.tsv`, and `topics.tsv` (see the loading example below)
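Since Pandas is already a dependency, the three output files can be inspected directly; the paths below assume the default `doccheck` output directory.

```python
# Load the three DocCheck output files for a quick look.
import pandas as pd

blogs    = pd.read_csv('doccheck/blogs.tsv', sep='\t')
comments = pd.read_csv('doccheck/comments.tsv', sep='\t')
topics   = pd.read_csv('doccheck/topics.tsv', sep='\t')
print(blogs.shape, comments.shape, topics.shape)
```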
eHealth Forum Questions [P2M]
- Specify the output directory to write results to by editing the `ehealthforum.py` file's main entry point, e.g. `Spidey().crawl('ehealthforum')` (default is `ehealthforum` if not specified)
- Run the script via command line or terminal: `python ehealthforum.py`, which will create a tab-separated output file called `chats.tsv` in the specified directory (the general fetch-and-parse pattern is sketched after these steps)
- To run the unit tests, use `pytest -q ehealthforum.py`
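All of these scrapers boil down to fetching forum pages and extracting text with lxml. A minimal Python 2.7 sketch of that pattern is below; the URL and the XPath expression are placeholders, not the selectors `ehealthforum.py` actually uses.

```python
# Generic fetch-and-parse pattern behind these scrapers (Python 2.7).
import urllib2
import lxml.html

html = urllib2.urlopen('http://example.com/forum/thread-1.html').read()
tree = lxml.html.fromstring(html)
# Placeholder XPath; the real selectors live in the individual scrapers.
posts = [t.strip() for t in tree.xpath('//div[@class="post"]//text()') if t.strip()]
```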
Scrape the Doctors Lounge Forum in 3 Steps [P2M]
- Specify the output directory to write results to by editing the `doctorslounge.py` file's main entry point, e.g. `Spidey().crawl('doctorslounge')` (default is `doctorslounge` if not specified)
- Run the script via command line or terminal: `python doctorslounge.py`, which will create a tab-separated output file called `discussions.tsv` in the specified directory (the TSV format is sketched below)
- To run the unit tests, use `pytest -q doctorslounge.py`
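For reference, the tab-separated output is plain TSV; writing it from Python 2.7 looks roughly like this. The column names are made up for illustration and are not the actual schema of `discussions.tsv`.

```python
# Sketch of writing tab-separated output; 'wb' mode matches Python 2's csv.
import csv

with open('discussions.tsv', 'wb') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(['thread_id', 'author', 'role', 'text'])
    writer.writerow(['123', 'dr_example', 'M', 'An example reply'])
```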
Scrape the Optimal Health Network (OHN) Live Chat Archives in 3 Steps [P2M]
- Specify the output directory to write results to by editing the `ohn.py` file's main entry point, e.g. `Spidey().crawl('ohn')` (default is `ohn` if not specified)
- Run the script via command line or terminal: `python ohn.py`, which will create a tab-separated output file called `chats.tsv` in the specified directory
- To run the unit tests, use `pytest -q ohn.py` (an example of the test style follows)
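The unit tests live in the same file as each scraper, which is why pytest is pointed at the script itself. A hypothetical example of the kind of test it would collect; the helper below is invented, not a real function in `ohn.py`.

```python
# Invented helper and test, shown only to illustrate the pytest style.
def parse_chat_line(line):
    speaker, _, text = line.partition(':')
    return speaker.strip(), text.strip()

def test_parse_chat_line():
    assert parse_chat_line('Moderator: welcome all') == ('Moderator', 'welcome all')
```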
Johns Hopkins Breast Center Expert Answers in 3 Steps [P2M]
- Specify the output directory to write results to by editing the `hopkins.py` file's main entry point, e.g. `Spidey().crawl('hopkins')` (default is `hopkins` if not specified)
- Run the script via command line or terminal: `python hopkins.py`, which will create a tab-separated output file called `discussions.tsv` in the specified directory (see the reading example below)
- To run the unit tests, use `pytest -q hopkins.py`
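Once the run finishes, `discussions.tsv` can also be iterated without Pandas using the standard csv module; the path assumes the default `hopkins` output directory.

```python
# Stream the scraped Q&A rows one dict at a time ('rb' for Python 2's csv).
import csv

with open('hopkins/discussions.tsv', 'rb') as f:
    for row in csv.DictReader(f, delimiter='\t'):
        print(row)   # one dict per scraped expert answer
```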
Scrape the Health Stack Exchange Q&A Forums in 3 Steps [P2P]
- Specify the output directory (must exist) to write results to by editing the `healthse.py` file's main entry point, e.g. `Spidey().crawl('healthse')` (default is `healthse` if not specified)
- Run the script via command line or terminal: `python healthse.py`, which will create a collection of tab-separated output files: `questions.tsv`, `answers.tsv`, `question_comments.tsv`, and `answer_comments.tsv`. Please note that Stack Exchange has rate limits (a throttling sketch follows these steps)
- To run the unit tests, use `pytest -q healthse.py`
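Because of the rate limits noted above, requests have to be paced; the simplest approach is to sleep between fetches, as in this illustrative helper (not code from `healthse.py`, and the delay value is a guess).

```python
# Naive throttling: yield each page, then pause before the next request.
import time

def fetch_politely(urls, fetch, delay=1.0):
    for url in urls:
        yield fetch(url)
        time.sleep(delay)   # stay under the Stack Exchange rate limit
```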
Parse the Health Stack Exchange Q&A Archives in 3 Steps [P2P]
- Download the `health.stackexchange.com.7z` archive file and extract it using 7-Zip (available for both Ubuntu and Windows)
- Note the dataset folder where the extracted XML files are located
- The `SEParse.py` script can create question pairs from the XML files via `python SEParse.py dataset-folder`, for example `python SEParse.py SEparse`. It will save the results to a CSV file within the dataset folder (in the example, the file will be called `SEparse.csv`). The script can be modified to perform other extraction and parsing tasks on the XML files; a rough sketch of the pairing logic follows.
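For orientation, pairing questions with answers from the dump's `Posts.xml` looks roughly like the sketch below. The attribute names (`PostTypeId`, `ParentId`, `Body`) follow the public Stack Exchange data-dump schema; this is not the actual `SEParse.py` implementation.

```python
# Rough sketch: build (question, answer) pairs from a data-dump folder.
# Usage: python pairs.py dataset-folder
import sys
import lxml.etree

folder = sys.argv[1]
questions, answers = {}, []
for _, row in lxml.etree.iterparse(folder + '/Posts.xml', tag='row'):
    if row.get('PostTypeId') == '1':      # question post
        questions[row.get('Id')] = row.get('Body')
    elif row.get('PostTypeId') == '2':    # answer post
        answers.append((row.get('ParentId'), row.get('Body')))
    row.clear()                           # keep memory bounded on large dumps

pairs = [(questions[pid], body) for pid, body in answers if pid in questions]
print(len(pairs))
```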