We are working on allowing database-only corpora in I-analyzer (CentreForDigitalHumanities/Textcavator#981). The intention is that all pre-processing happens outside of I-analyzer, so you provide source files that can be imported without further data wrangling.
We want to use the Goodreads corpus as a pilot for this. The CSV data produced by the scraper in this repository were made specifically for use in I-analyzer, so it should be a good test case.
Current corpus definition: https://github.com/UUDigitalHumanitieslab/I-analyzer/blob/develop/backend/corpora/goodreads/goodreads.py
However, the word_count, date, and year fields of that corpus still use custom functions to transform values, which we will not support. These transforms should be applied to the CSV data before we hand them to I-analyzer.
Request: Write a script that polishes the scraped CSV data from Goodreads so all the transform arguments on the corpus definition can be left out. This will require adding extra columns for year and word_count.
Options:
- Write a custom script using `csv`
- Use a `CSVReader` based on the I-analyzer corpus definition:
  - implement Add CSV export (ianalyzer-readers#4)
  - add `ianalyzer_readers` as a dependency in the script
  - create a stripped-down version of the corpus definition as a `CSVReader`
  - use the `export_csv()` method you just created
The latter method will take a bit longer, but it describes the "proper" way to convert a Python corpus to the new system: splitting it into a Reader that does the pre-processing, and a JSON file that defines the I-analyzer corpus.
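For the first option, a minimal sketch of such a script using only the standard library's `csv` module might look like the following. The column names `text` and `date`, and the ISO date format, are assumptions; they should be adjusted to match the actual scraper output, and the word count logic should replicate whatever the current corpus definition's transform does.

```python
import csv


def add_derived_columns(in_path, out_path, text_column="text", date_column="date"):
    """Copy the scraped CSV, adding `word_count` and `year` columns so the
    corpus definition no longer needs transform functions.

    `text_column` and `date_column` are hypothetical names; adjust them to
    the actual scraper output.
    """
    with open(in_path, newline="", encoding="utf-8") as infile, \
         open(out_path, "w", newline="", encoding="utf-8") as outfile:
        reader = csv.DictReader(infile)
        fieldnames = list(reader.fieldnames) + ["word_count", "year"]
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            # naive whitespace tokenisation; replace with the corpus
            # definition's own word_count logic if it differs
            row["word_count"] = len(row[text_column].split())
            # assumes ISO-formatted dates (YYYY-MM-DD)
            row["year"] = row[date_column][:4]
            writer.writerow(row)
```

This keeps the original columns intact and only appends the derived ones, so the rest of the corpus definition can keep reading the same fields.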