Additional processing for goodreads CSV data #4

@lukavdplas

Description

We are working on allowing database-only corpora in I-analyzer (CentreForDigitalHumanities/Textcavator#981). Our intention is that all pre-processing happens outside of I-analyzer: you provide source files that can be imported without further data wrangling.

We want to use the Goodreads corpus as a pilot for this. The CSV data produced by the scraper in this repository were made specifically for use in I-analyzer, so they should be a good test case.

Current corpus definition: https://github.com/UUDigitalHumanitieslab/I-analyzer/blob/develop/backend/corpora/goodreads/goodreads.py

However, the word_count, date, and year fields of that corpus still use custom functions to transform values, which we will not support. These transforms should be applied to the CSV data before we hand them to I-analyzer.

Request: Write a script that polishes the scraped CSV data from Goodreads so all the transform arguments on the corpus definition can be left out. This will require adding extra columns for year and word_count.

Options:

  • Write a custom script using csv
  • Use a CSVReader based on the I-analyzer corpus definition:
    • Implement Add CSV export ianalyzer-readers#4
    • add ianalyzer_readers as a dependency in the script
    • create a stripped-down version of the corpus definition as a CSVReader
    • use the export_csv() method you just created

The latter method will take a bit longer, but it describes the "proper" way to convert a Python corpus to the new system: splitting it into a Reader that does pre-processing, and a JSON file that defines the I-analyzer corpus.
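For the first option, a minimal sketch of what such a polishing script could look like with the standard csv module. Note the column names ("date", "text") and the assumption that dates start with a four-digit year are placeholders; they would need to be checked against the actual scraper output:

```python
import csv

def add_derived_columns(row):
    """Return a copy of one CSV row (a dict) with 'year' and 'word_count' added.

    Assumes a 'date' column whose value starts with the year (e.g. ISO dates)
    and a 'text' column holding the review text; adjust to the real headers.
    """
    row = dict(row)
    row["year"] = row["date"][:4] if row.get("date") else ""
    row["word_count"] = str(len(row["text"].split())) if row.get("text") else "0"
    return row

def polish_csv(in_path, out_path):
    """Read the scraped CSV and write a copy with the extra columns appended."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(
            dst, fieldnames=list(reader.fieldnames) + ["year", "word_count"]
        )
        writer.writeheader()
        for row in reader:
            writer.writerow(add_derived_columns(row))
```

Keeping the per-row logic in its own function makes it easy to unit-test the year and word-count derivation without touching the filesystem.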
