@@ -15,12 +15,12 @@ show_tree <- function(dir) {
}
```

-One of the main purposes of fastreg is to ease the conversion of SAS
-register files (`.sas7bdat`) into
-[Parquet](https://parquet.apache.org/). A *register* in this context
-refers to a collection of related data files that belong to the same
-dataset, typically with yearly snapshots (e.g., `bef2020.sas7bdat`,
-`bef2021.sas7bdat`).
+fastreg aims to make working with Danish registers simpler and faster by
+providing functionality to convert the SAS register files (`.sas7bdat`)
+into [Parquet](https://parquet.apache.org/) and read the resulting
+Parquet files. A *register* in this context refers to a collection of
+related data files that belong to the same dataset, typically with
+yearly snapshots (e.g., `bef2020.sas7bdat`, `bef2021.sas7bdat`).

## Why Parquet?

@@ -53,7 +53,7 @@ fs::dir_create(sas_dir)

bef_list <- simulate_register(
  "bef",
-  c("", "1999_1", "1999_2", "2020"),
+  c("", "1999", "1999_1", "2020"),
  n = 1000
)

@@ -98,6 +98,15 @@ larger-than-memory data) with a default of reading 1 million rows,
extracts 4-digit years from filenames for partitioning, and lowercases
column names. See `?convert_file` for more details.

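The year-from-filename behaviour described above can be sketched as follows. This is a hypothetical helper in Python, not fastreg's actual R code, and fastreg's exact matching rules may differ:

```python
import re


def year_from_filename(name: str):
    """Pull the first 4-digit run out of a register file name, if any."""
    match = re.search(r"\d{4}", name)
    return int(match.group()) if match else None


assert year_from_filename("bef2020.sas7bdat") == 2020
assert year_from_filename("bef1999_1.sas7bdat") == 1999
assert year_from_filename("bef.sas7bdat") is None  # no year in the name
```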
+::: callout-note
+When a SAS file contains more rows than the `chunk_size`, multiple
+Parquet files will be created from it. This doesn't affect how the data
+is loaded with `read_register()` (see
+[Reading a Parquet register](#reading-a-parquet-register) below); it
+only means you may see more Parquet files in the output than input SAS
+files.
+:::
+
Even though this only converts a single file, the output is partitioned
by the year extracted from the file name as seen below:

@@ -128,62 +137,26 @@ convert_register(
)
```

-As with `convert_file()`, the output is partitioned by year, extracted
-from file names.
+`convert_register()` uses `convert_file()` internally, so the same
+chunking and partitioning behaviour applies. See `?convert_file` and
+`?convert_register` for more details. As a result, the output from
+`convert_register()` is also partitioned by year, extracted from file
+names:

```{r output-tree}
#| echo: false
show_tree(output_register_dir)
```

-`convert_register()` uses `convert_file()` internally to reads files in
-chunks (to be able to handle larger-than-memory data), extracts 4-digit
-years from filenames for partitioning, and lowercases column names. See
-`?convert_file` and `?convert_register` for more details.
-
-There are four different ways that the SAS files can be converted into
-Parquet. Using the `bef` register as an example, we have a set of input
-SAS files in the `Grunddata/` folder and a set of converted
-Hive-partitioned Parquet files in the output folder
-`parquet-registers/bef/` that are both listed below:
-
-```text
-# Original SAS files (input)
-Grunddata/
-├── bef2020.sas7bdat        # 1
-├── bef2021.sas7bdat        # 2
-├── December_2023/
-│   └── bef2021.sas7bdat    # 2
-├── bef2022.sas7bdat        # 3
-└── bef.sas7bdat            # 4
-
-# Converted Parquet files (output)
-parquet-registers/bef/
-├── year=2020/              # 1
-│   └── part-c28221.parquet # 1
-├── year=2021/              # 2
-│   ├── part-bf73dc.parquet # 2
-│   └── part-546bed.parquet # 2
-├── year=2022/              # 3
-│   ├── part-7c041e.parquet # 3
-│   └── part-8869b7.parquet # 3
-└── year=__HIVE_DEFAULT_PARTITION__/ # 4
-    └── part-a8d52c.parquet # 4
-```
-
-1. A single SAS file is converted to a single Parquet file, partitioned
-   by year from filename.
-2. Multiple SAS files with the same register and year are converted into
-   separate Parquet files in the same partition (shown below). Rows
-   between these several SAS files are not deduplicated, so you'll have
-   to check for duplicates after conversion.
-3. A large SAS file is split into multiple Parquet files that are only
-   as many rows as is determined by the `chunk_size` option.
-4. A SAS file without a year in file name is placed in the
-   `year=__HIVE_DEFAULT_PARTITION__/` folder (the
-   [Apache Hive](https://hive.apache.org/docs/latest/user/configuration-properties/)
-   default for missing partitions) when it is converted to Parquet.
+The output is organised into a "bef" folder (register name extracted from file names) with year-based subdirectories:

+- The data from the two SAS files with "1999" in their file names are
+  located in the subfolder "year=1999"
+- The data from the SAS file from 2020 is located in the subfolder
+  "year=2020"
+- One SAS file didn't have a year in its file name, `bef.sas7bdat`. The
+  data from this file is placed in the "year=__HIVE_DEFAULT_PARTITION__"
+  folder, the default for files without a year in their name.

## Converting multiple registers in parallel
