We are working on allowing database-only corpora in I-analyzer (CentreForDigitalHumanities/Textcavator#981). The intention is that all pre-processing happens outside of I-analyzer, so you provide source files that can be imported without further data wrangling.
We want to use the Goodreads corpus as a pilot for this. The CSV data produced by the scraper in this repository were made specifically for use in I-analyzer, so it should be a good test case.
Current corpus definition: https://github.com/UUDigitalHumanitieslab/I-analyzer/blob/develop/backend/corpora/goodreads/goodreads.py
However, the word_count, date, and year fields of that corpus still use custom functions to transform values, which we will not support. These transforms should be applied to the CSV data before we hand them to I-analyzer.
Request: Write a script that polishes the scraped CSV data from Goodreads so all the transform arguments on the corpus definition can be left out. This will require adding extra columns for year and word_count.
Options:
- Write a custom script using `csv`
- Use a `CSVReader` based on the I-analyzer corpus definition:
  - implement Add CSV export (ianalyzer-readers#4)
  - add `ianalyzer_readers` as a dependency in the script
  - create a stripped-down version of the corpus definition as a `CSVReader`
  - use the `export_csv()` method you just created
The latter method will take a bit longer, but it describes the "proper" way to convert a Python corpus to the new system: splitting it into a Reader that does the pre-processing, and a JSON file that defines the I-analyzer corpus.
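For the first option, a minimal sketch of such a script using only the standard library's `csv` module might look like the following. The column names `text` and `date`, and the ISO date format, are assumptions; they should be adjusted to match the actual scraper output, and the word count logic should replicate whatever the current corpus definition's transform does.

```python
import csv


def add_derived_columns(in_path, out_path, text_column="text", date_column="date"):
    """Copy the scraped CSV, adding `word_count` and `year` columns so the
    corpus definition no longer needs transform functions.

    `text_column` and `date_column` are hypothetical names; adjust them to
    the actual scraper output.
    """
    with open(in_path, newline="", encoding="utf-8") as infile, \
         open(out_path, "w", newline="", encoding="utf-8") as outfile:
        reader = csv.DictReader(infile)
        fieldnames = list(reader.fieldnames) + ["word_count", "year"]
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            # naive whitespace tokenisation; replace with the corpus
            # definition's own word_count logic if it differs
            row["word_count"] = len(row[text_column].split())
            # assumes ISO-formatted dates (YYYY-MM-DD)
            row["year"] = row[date_column][:4]
            writer.writerow(row)
```

This keeps the original columns intact and only appends the derived ones, so the rest of the corpus definition can keep reading the same fields.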