Commit 9243cf6

signekb, pre-commit-ci[bot], and lwjohnst86 authored
docs: 📝 minor edits to get started (#221)
# Description

Needs a quick review.

## Checklist

- [X] Ran `just run-all`

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Luke W. Johnston <lwjohnst86@users.noreply.github.com>
1 parent c0029d9 commit 9243cf6

2 files changed: 34 additions and 56 deletions

inst/WORDLIST

Lines changed: 5 additions & 0 deletions
@@ -7,6 +7,8 @@ Tidyverse
77
UUID
88
UUIDs
99
bdat
10+
bef
11+
behaviour
1012
callout
1113
ci
1214
deduplicated
@@ -21,11 +23,14 @@ lowercases
2123
msg
2224
optimised
2325
organise
26+
organised
2427
parallelises
2528
param
2629
pre
2730
sas
2831
schemas
32+
simdata
33+
subfolder
2934
svg
3035
tada
3136
tibbles

vignettes/fastreg.qmd

Lines changed: 29 additions & 56 deletions
@@ -15,12 +15,12 @@ show_tree <- function(dir) {
1515
}
1616
```
1717

18-
One of the main purposes of fastreg is to ease the conversion of SAS
19-
register files (`.sas7bdat`) into
20-
[Parquet](https://parquet.apache.org/). A *register* in this context
21-
refers to a collection of related data files that belong to the same
22-
dataset, typically with yearly snapshots (e.g., `bef2020.sas7bdat`,
23-
`bef2021.sas7bdat`).
18+
fastreg aims to make working with Danish registers simpler and faster by
19+
providing functionality to convert the SAS register files (`.sas7bdat`)
20+
into [Parquet](https://parquet.apache.org/) and read the resulting
21+
Parquet files. A *register* in this context refers to a collection of
22+
related data files that belong to the same dataset, typically with
23+
yearly snapshots (e.g., `bef2020.sas7bdat`, `bef2021.sas7bdat`).
2424

2525
## Why Parquet?
2626

@@ -53,7 +53,7 @@ fs::dir_create(sas_dir)
5353
5454
bef_list <- simulate_register(
5555
"bef",
56-
c("", "1999_1", "1999_2", "2020"),
56+
c("", "1999", "1999_1", "2020"),
5757
n = 1000
5858
)
5959
@@ -98,6 +98,15 @@ larger-than-memory data) with a default of reading 1 million rows,
9898
extracts 4-digit years from filenames for partitioning, and lowercases
9999
column names. See `?convert_file` for more details.
100100

101+
::: callout-note
102+
When a SAS file contains more rows than the `chunk_size`, multiple
103+
Parquet files will be created from it. This doesn't affect how the data
104+
is loaded with `read_register()` (see
105+
[Reading a Parquet register](#reading-a-parquet-register) below); it
106+
only means you may see more Parquet files in the output than input SAS
107+
files.
108+
:::
109+
101110
Even though this only converts a single file, the output is partitioned
102111
by the year extracted from the file name as seen below:
103112

@@ -128,62 +137,26 @@ convert_register(
128137
)
129138
```
130139

131-
As with `convert_file()`, the output is partitioned by year, extracted
132-
from file names.
140+
`convert_register()` uses `convert_file()` internally so the same
141+
chunking and partitioning behaviour applies. See `?convert_file` and
142+
`?convert_register` for more details. As a result, the output from
143+
`convert_register()` is also partitioned by year, extracted from file
144+
names:
133145

134146
```{r output-tree}
135147
#| echo: false
136148
show_tree(output_register_dir)
137149
```
138150

139-
`convert_register()` uses `convert_file()` internally to reads files in
140-
chunks (to be able to handle larger-than-memory data), extracts 4-digit
141-
years from filenames for partitioning, and lowercases column names. See
142-
`?convert_file` and `?convert_register` for more details.
143-
144-
There are four different ways that the SAS files can be converted into
145-
Parquet. Using the `bef` register as an example, we have a set of input
146-
SAS files in the `Grunddata/` folder and a set of converted
147-
Hive-partitioned Parquet files in the output folder
148-
`parquet-registers/bef/` that are both listed below:
149-
150-
``` text
151-
# Original SAS files (input)
152-
Grunddata/
153-
├── bef2020.sas7bdat # 1
154-
├── bef2021.sas7bdat # 2
155-
├── December_2023/
156-
│ └── bef2021.sas7bdat # 2
157-
├── bef2022.sas7bdat # 3
158-
└── bef.sas7bdat # 4
159-
160-
# Converted Parquet files (output)
161-
parquet-registers/bef/
162-
├── year=2020/ # 1
163-
│ └── part-c28221.parquet # 1
164-
├── year=2021/ # 2
165-
│ ├── part-bf73dc.parquet # 2
166-
│ └── part-546bed.parquet # 2
167-
├── year=2022/ # 3
168-
│ ├── part-7c041e.parquet # 3
169-
│ └── part-8869b7.parquet # 3
170-
└── year=__HIVE_DEFAULT_PARTITION__/ # 4
171-
└── part-a8d52c.parquet # 4
172-
```
173-
174-
1. A single SAS file is converted to a single Parquet file, partitioned
175-
by year from filename.
176-
2. Multiple SAS files with the same register and year are converted into
177-
separate Parquet files in the same partition (shown below). Rows
178-
between these several SAS files are not deduplicated, so you'll have
179-
to check for duplicates after conversion.
180-
3. A large SAS file is split into multiple Parquet files that are only
181-
as many rows as is determined by the `chunk_size` option.
182-
4. A SAS file without a year in file name is placed in the
183-
`year=__HIVE_DEFAULT_PARTITION__/` folder (the
184-
[Apache Hive](https://hive.apache.org/docs/latest/user/configuration-properties/)
185-
default for missing partitions) when it is converted to Parquet.
151+
The output is organised into a "bef" folder (register name extracted from file names) with year-based subdirectories:
186152

153+
- The data from the two SAS files with "1999" in their file names are
154+
located in the subfolder "year=1999"
155+
- The data from the SAS file from 2020 are located in the subfolder
156+
"year=2020"
157+
- One SAS file didn't have a year in its file name, `bef.sas7bdat`. The
158+
data from this file is placed in the "year=__HIVE_DEFAULT_PARTITION__"
159+
folder, the default for files without a year in their name.
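
Reviewer note: the layout described in the list above can be illustrated with a minimal base R sketch. It is not fastreg's implementation. `partition_dir()` and `n_parts()` are hypothetical helpers, and the sketch assumes that the year is the first run of four digits in the file name and that each chunk of at most `chunk_size` rows becomes one Parquet file. The file names are the ones `simulate_register()` is assumed to produce for the suffixes used earlier in the vignette.

```r
# Hypothetical helper (not part of fastreg): derive the Hive partition
# folder for a SAS file from the first 4-digit run in its file name.
partition_dir <- function(path) {
  file_name <- basename(path)
  # regmatches() returns character(0) when the pattern is not found.
  year <- regmatches(file_name, regexpr("[0-9]{4}", file_name))
  if (length(year) == 0) {
    # Hive's default folder name for a missing partition value.
    year <- "__HIVE_DEFAULT_PARTITION__"
  }
  paste0("year=", year)
}

# Hypothetical helper: number of Parquet files one SAS file would yield,
# assuming one output file per chunk of at most `chunk_size` rows.
n_parts <- function(n_rows, chunk_size = 1e6) {
  ceiling(n_rows / chunk_size)
}

# Assumed file names for the simulated "bef" register.
sas_files <- c(
  "bef.sas7bdat",
  "bef1999.sas7bdat",
  "bef1999_1.sas7bdat",
  "bef2020.sas7bdat"
)

# Map every input file to the partition folder it would land in.
layout <- data.frame(
  sas_file = sas_files,
  partition = vapply(sas_files, partition_dir, character(1)),
  row.names = NULL
)
print(layout)
```

Under these assumptions both `1999` files map to the same `year=1999` folder, and a file with more rows than `chunk_size` (for example `n_parts(2500000)`) would contribute several Parquet files to its partition, which is why the output can contain more files than the input.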
187160

188161
## Converting multiple registers in parallel
189162
