To confirm that the conversion went as expected, we should run some quality checks that compare the data in the original SAS files with the converted Parquet files. The question is: which checks?
I suggest, per register (not per SAS file):
- nrows
- colnames (these will be lowercased in the Parquets, and source_file and year are added in the Parquets)
- col types (using the mapping from create_arrow_schema())
- values:
  - categorical cols: how many of each unique value, na count
  - numeric cols: min, max, mean, median, quantiles, na count
  - datetime cols: min, max, na count
- other checks?
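
To make the value checks above concrete, here is a minimal sketch of a per-column summary in plain Python. It assumes a column has already been read into a list with `None` standing in for NA; the function name `summarise_column` and the `kind` labels are hypothetical and not part of the existing code. The idea would be to compute the same summary on the SAS side and the Parquet side and compare the results.

```python
from collections import Counter
from datetime import date
from statistics import mean, median, quantiles


def summarise_column(values, kind):
    """Summarise one column; `kind` is 'categorical', 'numeric' or 'datetime'.

    Hypothetical helper: `values` is a plain list, with None standing in for NA.
    """
    present = [v for v in values if v is not None]
    summary = {"na_count": len(values) - len(present)}
    if kind == "categorical":
        # How many of each unique value.
        summary["counts"] = dict(Counter(present))
    elif kind in ("numeric", "datetime"):
        summary["min"] = min(present) if present else None
        summary["max"] = max(present) if present else None
        if kind == "numeric" and present:
            summary["mean"] = mean(present)
            summary["median"] = median(present)
            # Only compute quartiles when there are at least two data points.
            if len(present) >= 2:
                summary["quartiles"] = quantiles(present, n=4)
    else:
        raise ValueError(f"unknown column kind: {kind}")
    return summary


# Example: a datetime column only gets min, max and na_count.
example = summarise_column([date(2020, 1, 1), None, date(2019, 5, 1)], "datetime")
```

In practice the columns would probably come from a dataframe or Arrow table rather than lists, but the set of statistics per column type would stay the same.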
Questions:
- How to do this in the easiest way? Add a generic function to do these checks?
- How to determine categorical cols (with n_unique_values <= x?)
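
One possible answer to both questions, sketched in plain Python under assumptions: a threshold-based heuristic for categorical columns, and a generic function that reports every check where the SAS and Parquet summaries disagree. The names `is_categorical` and `compare_summaries` and the default threshold of 20 are hypothetical placeholders, not existing functions or agreed values.

```python
import math


def is_categorical(values, max_unique=20):
    """Heuristic: treat a column as categorical if it has few distinct values.

    `max_unique` is a placeholder for the threshold x; None stands in for NA.
    """
    distinct = {v for v in values if v is not None}
    return len(distinct) <= max_unique


def _same(a, b):
    """Compare two summary values, tolerating tiny floating-point differences."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=1e-9)
    return a == b


def compare_summaries(sas_summary, parquet_summary):
    """Return {check_name: (sas_value, parquet_value)} for every mismatch.

    A check missing on one side is reported with None for that side.
    """
    keys = sas_summary.keys() | parquet_summary.keys()
    return {
        key: (sas_summary.get(key), parquet_summary.get(key))
        for key in sorted(keys)
        if not _same(sas_summary.get(key), parquet_summary.get(key))
    }
```

An empty result from `compare_summaries` would mean all checks agree for that register. A fixed threshold on unique values will misclassify some columns (for example, a numeric column with only a few distinct values), so deriving the column kind from the schema types may be more reliable where that information is available.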