Commit c11dc33

Vignette fread fwrite FR translated (#7253)
1 parent 55bbc3d commit c11dc33

2 files changed: +338 -21 lines changed

vignettes/datatable-fread-and-fwrite.Rmd

Lines changed: 32 additions & 21 deletions
@@ -4,7 +4,7 @@ date: "`r Sys.Date()`"
 output:
 markdown::html_format
 vignette: >
-%\VignetteIndexEntry{Importing data.table}
+%\VignetteIndexEntry{Fast Read and Fast Write}
 %\VignetteEngine{knitr::knitr}
 \usepackage[utf8]{inputenc}
 ---
@@ -61,7 +61,7 @@ On Windows, command line tools like `grep` are available through various environ
 
 #### 1.1.1 Reading directly from a text string
 
-`fread()` can read data directly from a character string in R using the text argument. This is particularly handy for creating reproducible examples, testing code snippets, or working with data generated programmatically within your R session. Each line in the string should be separated by a newline character `\n`.
+`fread()` can read data directly from a character string in R using the `text` argument. This is particularly handy for creating reproducible examples, testing code snippets, or working with data generated programmatically within your R session. Each line in the string should be separated by a newline character `\n`.
 
 ```{r}
 my_data_string = "colA,colB,colC\n1,apple,TRUE\n2,banana,FALSE\n3,orange,TRUE"
@@ -71,7 +71,8 @@ print(dt_from_text)
 
 #### 1.1.2 Reading from URLs
 
-`fread()` can read data directly from web URLs by passing the URL as a character string to its `file` argument. This allows you to download and read data from the internet in one step.
+`fread()` can read data directly from web URLs by passing the URL as a character string to its `file` argument.
+This allows you to download and read data from the internet in one step.
 
 ```{r}
 # dt = fread("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv")
@@ -86,7 +87,7 @@ In many cases, `fread()` can automatically detect and decompress files with comm
 - `.gz` / `.bz2` (gzip / bzip2): Supported and works out of the box.
 - `.zip` / `.tar` (ZIP / tar archives, single file): Supported—`fread()` will read the first file in the archive if only one file is present.
 
-> Note: If there are multiple files in the archive, `fread()` will fail with an error.
+**Note**: If there are multiple files in the archive, `fread()` will fail with an error.
 
 ### 1.2 Automatic separator and skip detection
 
@@ -108,11 +109,15 @@ You can explicitly tell fread whether a header exists using `header = TRUE` or `
 
 **Skip Detection**
 
-By default (`skip="auto"`), `fread` will automatically skip blank lines and comment lines (e.g., starting with `#`) before the data header. To manually skip a specific number of lines, use `skip=n`.
+By default (`skip="auto"`), `fread` will automatically skip blank lines and comment lines (e.g., starting with `#`) before the data header.
+To manually specify a different number of lines to skip, use
+
+* `skip=n` to skip the first `n` lines.
+* `skip="string"` to search for a line containing a substring (typically from the column names, like `skip="Date"`). Reading begins at the first matching line. This is useful for skipping metadata, or selecting sub-tables in multi-table files. This feature is inspired by the `read.xls` function in the `gdata` package.
 
 ### 1.3 High-Quality Automatic Column Type Detection
 
-Many real-world datasets contain columns that are initially blank, zero-filled, or appear numeric but later contain characters. To handle such inconsistencies, `fread()` in `data.table` employs a robust column type detection strategy.
+Many real-world datasets contain columns that are initially blank, zero-filled, or appear numeric but later contain characters. To handle such inconsistencies, `fread()` employs a robust column type detection strategy.
 
 Since v1.10.5, `fread()` samples rows by reading blocks of contiguous rows from multiple equally spaced points across the file, including the start, middle, and end. The total number of rows sampled is chosen dynamically based on the file size and structure, and is typically around 10,000, but can be smaller or slightly larger. This wide sampling helps detect type changes that occur later in the data (e.g., `001` to `0A0` or blanks becoming populated).
 
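Comment on the `skip` change in the hunk above: the two forms it lists can be compared with a minimal sketch. The input string, its metadata lines, and the `Weight` column are made up for illustration; only `fread()`, `text`, `skip=n`, and `skip="Date"` come from the vignette text.

```r
library(data.table)

# Hypothetical input: two metadata lines, then the real table.
raw_text = "Report generated by an example tool\nunits: kg\nDate,Weight\n2024-01-01,10\n2024-01-02,12"

# skip = n: discard the first two lines, then read the header and the data.
dt_n = fread(text = raw_text, skip = 2)

# skip = "string": start reading at the first line containing "Date".
dt_s = fread(text = raw_text, skip = "Date")

print(dt_s)
```

Both calls should start at the `Date,Weight` line here, so `skip="Date"` is the more robust choice when the number of metadata lines is not known in advance.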
@@ -131,7 +136,7 @@ The type for each column is inferred based on the lowest required type from the
 This ensures:
 
 - Single up-front allocation of memory using the correct type
-- Avoidance of rereading the file or manually setting colClasses
+- Avoidance of rereading the file or manually setting `colClasses`
 - Improved speed and memory efficiency
 
 **Out-of-Sample Type Exceptions**
@@ -142,7 +147,9 @@ All detection logic and any rereads are detailed when `verbose=TRUE` is enabled.
 
 ### 1.4 Early Error Detection at End-of-File
 
-Because the large sample explicitly includes the very end of the file, critical issues—such as an inconsistent number of columns, a malformed footer, or an opening quote without a matching closing quote—can be detected and reported almost instantly. This early error detection avoids the unnecessary overhead of processing the entire file or allocating excessive memory, only to encounter a failure at the final step. It ensures faster feedback and more efficient resource usage, especially when working with large datasets.
+Because the large sample explicitly includes the very end of the file, critical issues—such as an inconsistent number of columns, a malformed footer, or an opening quote without a matching closing quote—can be detected and reported almost instantly.
+This early error detection avoids the unnecessary overhead of processing the entire file or allocating excessive memory, only to encounter a failure at the final step.
+It ensures faster feedback and more efficient resource usage, especially when working with large datasets.
 
 ### 1.5 `integer64` Support
 
@@ -185,21 +192,25 @@ Key points:
 
 For details, see the manual page by running `?fread` in R.
 
-### 1.7 Skip to a Sub-Table’s Header Row Using a Column Name Substring
-
-Use `skip="string"` in `fread` to search for a line containing a substring (typically from the column names, e.g., `skip="Date"`). Reading begins at the first matching line. This is useful for skipping metadata or selecting sub-tables in multi-table files. This feature is inspired by the `read.xls` function in the gdata package.
-
-### 1.8 Automatic Quote Escape Detection (Including No-Escape)
+### 1.7 Automatic Quote Escape Detection (Including No-Escape)
 
 `fread` automatically detects how quotes are escaped—including doubled ("") or backslash-escaped (\") quotes—without requiring user input. This is determined using a large sample of the data (see point 3), and validated against the entire file.
 
 Supported Scenarios:
 - Unescaped quotes inside quoted fields
-e.g., `"This "quote" is invalid"` — supported as long as column count remains consistent.
+e.g., `"This "quote" is invalid, but fread works anyway"` — supported as long as column count remains consistent :
+
+```{r}
+data.table::fread(text='x,y\n"This "quote" is invalid, but fread works anyway",1')
+```
 
 - Unquoted fields that begin with quotes
 e.g., `Invalid"Field,10,20` — recognized correctly as not a quoted field.
 
+```{r}
+data.table::fread(text='x,y\nNot"Valid,1')
+```
+
 Requirements & Limitations:
 - Escaping rules and column counts must be consistent throughout the file.
 
@@ -210,7 +221,8 @@ From v1.10.6, `fread` resolves ambiguities more reliably across the entire file
 
 ## 2. fwrite()
 
-`fwrite()` is the fast file writer companion to `fread()`. It’s designed for speed, sensible defaults, and ease of use, mirroring many of the conveniences found in fread`.
+`fwrite()` is the fast file writer companion to `fread()`.
+It’s designed for speed, sensible defaults, and ease of use, mirroring many of the conveniences found in `fread`.
 
 ### 2.1 Intelligent and Minimalist Quoting (quote="auto")
 
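Comment on the section 2.1 heading shown as context above: a minimal sketch of what `quote="auto"` means in practice. It relies on the documented default that a field is quoted only when it needs to be, here because it contains the separator. The table contents and the `tmp` file name are hypothetical.

```r
library(data.table)

# Hypothetical data: one plain field and one field containing the separator.
dt = data.table(id = 1:2, txt = c("plain", "has,comma"))

tmp = tempfile(fileext = ".csv")
fwrite(dt, tmp)              # quote = "auto" is the default
out = readLines(tmp)
cat(out, sep = "\n")         # only the second txt value is wrapped in quotes
unlink(tmp)
```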
@@ -232,7 +244,7 @@ fwrite(dt_quoting_scenario, temp_quote_adv)
 cat(readLines(temp_quote_adv), sep = "\n")
 ```
 
-### 2.2 Fine-Grained Date/Time Serialization (dateTimeAs)
+### 2.2 Fine-Grained Date/Time Serialization (`dateTimeAs` argument)
 
 Offers precise control for POSIXct/Date types:
 
@@ -251,7 +263,7 @@ cat(readLines(temp_dt_iso), sep = "\n")
 unlink(temp_dt_iso)
 ```
 
-### 2.3 Handling of bit64::integer64
+### 2.3 Handling of `bit64::integer64`
 
 **Full Precision for Large Integers**: `fwrite` writes `bit64::integer64` columns by converting them to strings with full precision. This prevents data loss or silent conversion to double that might occur with less specialized writers. This is crucial for IDs or measurements requiring more than R's standard `32-bit` integer range or `53-bit` double precision.
 
@@ -263,14 +275,13 @@ if (requireNamespace("bit64", quietly = TRUE)) {
 temp_i64_out = tempfile(fileext = ".csv")
 fwrite(dt_i64, temp_i64_out)
 cat(readLines(temp_i64_out), sep = "\n")
-
 unlink(temp_i64_out)
 }
 ```
 
 ### 2.4 Column Order and Subset Control
 
-To control the order and subset of columns written to file, subset the data.table before calling `fwrite()`. The `col.names` argument in `fwrite()` is a logical (TRUE/FALSE) that controls whether the header row is written, not which columns are written.
+To control the order and subset of columns written to file, subset the `data.table` before calling `fwrite()`. The `col.names` argument in `fwrite()` is a logical (TRUE/FALSE) that controls whether the header row is written, not which columns are written.
 
 ```{r}
 dt = data.table(A = 1:3, B = 4:6, C = 7:9)
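Comment on the section 2.4 paragraph in the hunk above: the distinction between subsetting and `col.names` can be sketched as follows. This is an illustration only, not the vignette's own continuation of the chunk (which the diff elides); the temporary file and the choice of columns `C` and `A` are assumptions.

```r
library(data.table)

dt = data.table(A = 1:3, B = 4:6, C = 7:9)
tmp = tempfile(fileext = ".csv")

# Choose and reorder columns by subsetting before writing.
fwrite(dt[, .(C, A)], tmp)
with_header = readLines(tmp)

# col.names only toggles the header row; the columns written are unchanged.
fwrite(dt[, .(C, A)], tmp, col.names = FALSE)
no_header = readLines(tmp)

unlink(tmp)
print(with_header)
print(no_header)
```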
@@ -283,7 +294,7 @@ file.remove("out.csv")
 
 ## 3. A Note on Performance
 
-While this vignette focuses on features and usability, the primary motivation for `fread` and `fwrite` is speed. The performance of `data.table`'s I/O is a topic of continuous benchmarking.
+While this vignette focuses on features and usability, the primary motivation for `fread` and `fwrite` is speed.
 
 For users interested in detailed, up-to-date performance comparisons, we recommend these external blog posts which use the `atime` package for rigorous analysis:
 
@@ -292,4 +303,4 @@ For users interested in detailed, up-to-date performance comparisons, we recomme
 
 These benchmarks consistently show that `fread` and `fwrite` are highly competitive and often state-of-the-art for performance in the R ecosystem.
 
-***
+***
