[discussion]: Which quality checks should we perform on the converted data? #232

@signekb

To ensure that the conversion has gone as expected, we should perform some quality checks to compare the data in the original SAS files with the converted Parquet files. The question is: Which checks?

I suggest, per register (not per SAS file):

  • nrows
  • colnames (lowercased in the Parquet files, where source_file and year are also added)
  • col types (using mapping from create_arrow_schema())
  • values:
    • categorical cols: how many of each unique value + na count
    • numeric cols: min, max, mean, median, quantiles, na count
    • datetime cols: min, max, na count
  • other checks?
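
A minimal sketch of what the checks listed above could look like, to make the discussion concrete. It assumes each column has already been loaded into a plain Python list with None for missing values; in practice a dataframe library would probably do this work. All function names here (expected_parquet_columns, summarise_column, compare_summaries) are hypothetical and not part of the existing code, and appending source_file and year to the lowercased names is an assumption based on the note in the list.

```python
import math
from collections import Counter
from statistics import mean, median, quantiles


def expected_parquet_columns(sas_columns, added=("source_file", "year")):
    """Column names we expect in the Parquet data: lowercased SAS names plus added ones."""
    return {name.lower() for name in sas_columns} | set(added)


def summarise_column(values, kind):
    """Summarise one column; `kind` is 'categorical', 'numeric' or 'datetime'."""
    present = [v for v in values if v is not None]
    summary = {"n": len(values), "na_count": len(values) - len(present)}
    if kind == "categorical":
        summary["counts"] = dict(Counter(present))
    elif kind == "numeric":
        if present:
            summary.update(
                min=min(present),
                max=max(present),
                mean=mean(present),
                median=median(present),
            )
        if len(present) >= 2:  # quantiles() needs at least two data points
            summary["quartiles"] = quantiles(present, n=4)
    elif kind == "datetime":
        if present:
            summary.update(min=min(present), max=max(present))
    else:
        raise ValueError(f"unknown column kind: {kind!r}")
    return summary


def _same(a, b, rel_tol=1e-9):
    """Equality with a small tolerance for numbers and lists of numbers."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol)
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(_same(x, y, rel_tol) for x, y in zip(a, b))
    return a == b


def compare_summaries(original, converted):
    """Return the sorted summary keys whose values differ between the two sides."""
    keys = set(original) | set(converted)
    return sorted(k for k in keys if not _same(original.get(k), converted.get(k)))
```

Comparing summaries instead of raw values keeps the check cheap for large registers, and the tolerance in _same is there because means computed on the SAS side and the Parquet side may differ in the last floating-point digits. An empty list from compare_summaries would mean the column passed.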

Questions:

  • What is the easiest way to do this? Add a generic function that performs these checks?
  • How should we determine categorical cols (n_unique_values <= x?)
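
One possible answer to the second question, sketched as a simple threshold on the number of distinct non-missing values. The function name is_categorical and the default of 20 are assumptions for illustration only; the right value of x would need to be agreed on per register.

```python
def is_categorical(values, max_unique=20):
    """Treat a column as categorical if it has at most `max_unique` distinct values.

    Missing values (None) are ignored, so an all-missing column also counts
    as categorical under this rule.
    """
    present = [v for v in values if v is not None]
    return len(set(present)) <= max_unique
```

A pure count threshold would also classify low-cardinality numeric columns (for example a year column) as categorical, which may or may not be what we want.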
