Commit 70a8992

committed
updated
1 parent 8fb7ed6 commit 70a8992

1 file changed

+32
-37
lines changed

vignettes/datatable-fread-and-fwrite.Rmd

Lines changed: 32 additions & 37 deletions
@@ -1,11 +1,12 @@
11
---
22
title: "Fast Read and Fast Write"
33
date: "`r Sys.Date()`"
4-
output: rmarkdown::html_vignette # <--- Changed
4+
output:
5+
markdown::html_format
56
vignette: >
6-
%\VignetteIndexEntry{Fast Read and Fast Write}
7-
%\VignetteEngine{knitr::rmarkdown} # <--- Changed
8-
%\VignetteEncoding{UTF-8} # <--- Ensure this is present
7+
%\VignetteIndexEntry{Importing data.table}
8+
%\VignetteEngine{knitr::knitr}
9+
\usepackage[utf8]{inputenc}
910
---
1011

1112
```{r echo=FALSE, file='_translation_links.R'}
@@ -27,7 +28,7 @@ The `fread()` and `fwrite()` functions in the `data.table` R package are not onl
2728

2829
***
2930

30-
## 1. **fread()**
31+
## 1. fread()
3132

3233
### **1.1 Using command line tools directly**
3334
The `fread()` function from `data.table` can read data piped from shell commands, letting you filter or preprocess data before it even enters R.
@@ -45,21 +46,16 @@ HEADER: Yet more
4546
"example_data.txt")
4647
4748
library(data.table)
48-
49-
all_lines = readLines("example_data.txt")
50-
data_lines = grep("HEADER", all_lines, value = TRUE, invert = TRUE)
51-
fread(text = data_lines)
52-
53-
file.remove("example_data.txt")
49+
fread("grep -v HEADER example_data.txt")
5450
```
5551

5652
The `-v` option makes `grep` return all lines except those containing the string 'HEADER'. Because many skilled engineers have worked on `grep` over the years, it is likely to be about as fast as this kind of filtering can get, as well as being correct, convenient, well documented online, and easy to learn and to search for solutions to specific tasks. If you need more complex string filtering (e.g., matching strings at the beginning or end of lines), the `grep` syntax is very powerful, and learning it is a transferable skill for other languages and environments.
57-
53+
5854
Look at this [example](https://stackoverflow.com/questions/36256706/fread-together-with-grepl/36270543#36270543) for more detail.
5955

60-
On Windows we recommend [Cygwin](https://www.cygwin.com/) (run one .exe to install) which includes the command line tools such as grep. In March 2016, Microsoft [announced](https://www.hanselman.com/blog/developers-can-run-bash-shell-and-usermode-ubuntu-linux-binaries-on-windows-10) they will include these tools in Windows 10 natively. On Linux and macOS, these tools have always been included in the operating system. You can find many examples and tutorials about command line tools online.
56+
On Windows we recommend [Cygwin](https://www.cygwin.com/) (run one .exe to install) which includes the command line tools such as grep. In March 2016, Microsoft [announced](https://www.hanselman.com/blog/developers-can-run-bash-shell-and-usermode-ubuntu-linux-binaries-on-windows-10) they will include these tools in Windows 10 natively. On Linux and macOS, these tools have always been included in the operating system. You can find many examples and tutorials about command line tools online. We recommend [Data Science at the Command Line](https://www.oreilly.com/library/view/data-science-at/9781491947845/).
6157

62-
#### 1.1.1 **Reading directly from a text string**
58+
#### 1.1.1 Reading directly from a text string
6359

6460
`fread()` can read data directly from a character string in R using the `text` argument. This is particularly handy for creating reproducible examples, testing code snippets, or working with data generated programmatically within your R session. Each line in the string should be separated by a newline character (`\n`).
@@ -70,7 +66,7 @@ dt_from_text = fread(text = my_data_string)
7066
print(dt_from_text)
7167
```
7268

73-
#### 1.1.2 **Reading from URLs**
69+
#### 1.1.2 Reading from URLs
7470

7571
`fread()` can read data directly from web URLs by passing the URL as a character string to its `file` argument. This allows you to download and read data from the internet in one step.
7672

@@ -79,7 +75,7 @@ print(dt_from_text)
7975
# print(dt)
8076
```
8177

82-
#### 1.1.3 **Automatic decompression of compressed files**
78+
#### 1.1.3 Automatic decompression of compressed files
8379

8480
In many cases, `fread()` can automatically detect and decompress files with common compression extensions directly, without needing an explicit connection object or shell commands. This works by checking the file extension.
8581

@@ -88,7 +84,7 @@ In many cases, `fread()` can automatically detect and decompress files with comm
8884
- `.xz` (xz): Supported and works out of the box.
8985
- `.zip` (ZIP archives, single file): Supported—`fread()` will read the first file in the archive if only one file is present.
9086

91-
### 1.2 **Automatic separator and skip detection**
87+
### 1.2 Automatic separator and skip detection
9288

9389
`fread` automates delimiter and header detection, eliminating the need for manual specification in most cases. You simply provide the filename—`fread` intelligently detects the structure:
9490

@@ -110,11 +106,11 @@ You can explicitly tell fread whether a header exists using `header = TRUE` or `
110106

111107
By default (`skip="__auto__"`), `fread` will automatically skip blank lines and comment lines (e.g., starting with `#`) before the data header. To manually skip a specific number of lines, use `skip=n`.
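As a minimal sketch of the manual options described above, the snippet below writes a small hypothetical file (the preamble lines, column names, and values are invented for illustration) and reads it with `skip`, `sep`, and `header` stated explicitly.

```r
library(data.table)

# Hypothetical input: two lines of free-text preamble, then a
# semicolon-separated table with a header row.
tmp = tempfile(fileext = ".txt")
writeLines(c(
  "Exported by some tool",
  "Run 42",
  "id;name;score",
  "1;alice;9.5",
  "2;bob;7.25"
), tmp)

# Skip the two preamble lines and treat the next line as the header.
dt_manual = fread(tmp, skip = 2, sep = ";", header = TRUE)
print(dt_manual)

# Skip the header line as well and let fread assign default names V1, V2, V3.
dt_nohdr = fread(tmp, skip = 3, sep = ";", header = FALSE)
print(dt_nohdr)

unlink(tmp)
```

Stating these arguments explicitly is mainly useful when the automatic detection picks a different starting line or separator than you intended.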
112108

113-
### 1.3 **High-Quality Automatic Column Type Detection**
109+
### 1.3 High-Quality Automatic Column Type Detection
114110

115111
Many real-world datasets contain columns that are initially blank, zero-filled, or appear numeric but later contain characters. To handle such inconsistencies, `fread()` in `data.table` employs a robust column type detection strategy.
116112

117-
Since v1.10.5, `fread()` samples 10,000 rows by reading 100 contiguous rows from 100 equally spaced points across the file, including the start, middle, and end. This wide sampling helps detect type changes that occur later in the data (e.g., `001` to `0A0` or blanks becoming populated).
113+
Since v1.10.5, `fread()` samples rows by reading blocks of contiguous rows from multiple equally spaced points across the file, including the start, middle, and end. The total number of rows sampled is chosen dynamically based on the file size and structure, and is typically around 10,000, but can be smaller or slightly larger. This wide sampling helps detect type changes that occur later in the data (e.g., `001` to `0A0` or blanks becoming populated).
118114

119115
**Efficient File Access with mmap**
120116

@@ -140,19 +136,18 @@ If a type change occurs outside the sampled rows, `fread()` automatically detect
140136

141137
All detection logic and any rereads are detailed when `verbose=TRUE` is enabled.
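The following sketch illustrates the behaviour described in this section on hypothetical data: a column that looks like an integer everywhere except for one value near the end of the file. The file contents and column names are assumptions made for the example; the exact `verbose` output depends on the installed `data.table` version.

```r
library(data.table)

# Hypothetical data: 20,000 rows where `code` looks like an integer
# everywhere except for a single value near the end of the file.
n = 20000L
code = as.character(seq_len(n))
code[n - 5L] = "A12"
tmp = tempfile(fileext = ".csv")
writeLines(c("id,code", paste(seq_len(n), code, sep = ",")), tmp)

# verbose = TRUE prints the sampling and type detection steps to the console.
dt = fread(tmp, verbose = TRUE)

# `code` is read as character so that "A12" is preserved; `id` stays integer.
print(sapply(dt, class))
print(dt[n - 5L])

unlink(tmp)
```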
142138

143-
### 1.4 **Early Error Detection at End-of-File**
139+
### 1.4 Early Error Detection at End-of-File
144140

145141
Because the large sample explicitly includes the very end of the file, critical issues—such as an inconsistent number of columns, a malformed footer, or an opening quote without a matching closing quote—can be detected and reported almost instantly. This early error detection avoids the unnecessary overhead of processing the entire file or allocating excessive memory, only to encounter a failure at the final step. It ensures faster feedback and more efficient resource usage, especially when working with large datasets.
146142

147-
### 1.5 **Reading SQL Insert Scripts**
143+
### 1.5 Reading SQL Insert Scripts
148144

149145
`fread()` doesn't directly support SQL `INSERT` scripts, but they can be processed via command-line tools. For example, given `insert_script.sql`:
150146

151-
```{r, eval=FALSE}
152-
# Example SQL insert statements:
153-
# INSERT INTO tbl VALUES (1, 'asd', 923123123, 'zx');
154-
# INSERT INTO tbl VALUES (1, NULL, 923123123, 'zxz');
155-
# INSERT INTO tbl VALUES (3, 'asd3', 923123123, NULL);
147+
```
148+
INSERT INTO tbl VALUES (1, 'asd', 923123123, 'zx');
149+
INSERT INTO tbl VALUES (1, NULL, 923123123, 'zxz');
150+
INSERT INTO tbl VALUES (3, 'asd3', 923123123, NULL);
156151
```
157152

158153
Use the following R code to convert the script (see [this Stack Overflow question](https://stackoverflow.com/questions/32026398/transform-sql-insert-script-into-csv-format)):
@@ -177,15 +172,15 @@ print(dt_sql)
177172
file.remove("insert_script.sql")
178173
```
179174

180-
- The `awk` command transforms each INSERT line into a comma-separated list of its values.
175+
- The `gsub()` function in R transforms each INSERT line into a comma-separated list of its values.
181176

182177
- `na.strings = "NULL"` in fread is crucial: it tells fread to interpret the literal string `"NULL"` (left in place for SQL NULLs by the conversion) as R's NA value.
183178

184-
- Quoted strings (e.g., 'asd') are preserved by awk and read as character by `fread`.
179+
- Quoted strings (e.g., 'asd') are preserved and read as character by `fread`.
185180

186-
### 1.6 **`integer64` Support**
181+
### 1.6 `integer64` Support
187182

188-
By default, `fread` detects integers larger than 2³¹ and reads them as `bit64::integer64` to preserve full precision. This behavior can be overridden in three ways:
183+
By default, `fread` detects integers larger than 2<sup>31</sup> and reads them as `bit64::integer64` to preserve full precision. This behavior can be overridden in three ways:
189184

190185
- Per-column: Use the `colClasses` argument to specify the type for individual columns.
191186

@@ -209,7 +204,7 @@ options(datatable.integer64 = "double") # Example: set globally to "double"
209204
getOption("datatable.integer64")
210205
```
211206

212-
### 1.7 **Drop or Select Columns by Name or Position**
207+
### 1.7 Drop or Select Columns by Name or Position
213208

214209
To save memory and improve performance, use `fread()`'s `select` or `drop` arguments to read only the columns you need.
215210

@@ -224,11 +219,11 @@ Key points:
224219

225220
For details, see the manual page by running `?fread` in R.
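A short sketch of `select` and `drop` on a small inline table; the column names and values are made up for illustration.

```r
library(data.table)

# Hypothetical four-column input supplied through the text argument.
csv_text = "a,b,c,d\n1,x,2.5,TRUE\n2,y,3.5,FALSE\n"

# Keep columns by name.
by_name = fread(text = csv_text, select = c("a", "c"))

# Keep the same columns by position.
by_position = fread(text = csv_text, select = c(1L, 3L))

# Reach the same result by dropping the columns that are not needed.
dropped = fread(text = csv_text, drop = c("b", "d"))

print(by_name)
```

All three calls return only columns `a` and `c`; the columns that are not selected are never materialised in R.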
226221

227-
### 1.8 **Skip to a Sub-Table’s Header Row Using a Column Name Substring**
222+
### 1.8 Skip to a Sub-Table’s Header Row Using a Column Name Substring
228223

229224
Use `skip="string"` in `fread` to search for a line containing a substring (typically from the column names, e.g., `skip="Date"`). Reading begins at the first matching line. This is useful for skipping metadata or selecting sub-tables in multi-table files. This feature is inspired by the `read.xls` function in the gdata package.
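For illustration, the sketch below builds a hypothetical file with a few metadata lines ahead of the table and uses `skip="Date"` to start reading at the header row. The file contents are invented for the example.

```r
library(data.table)

# Hypothetical file: metadata lines, a blank line, then the actual table.
tmp = tempfile(fileext = ".csv")
writeLines(c(
  "Report produced by a hypothetical logger",
  "Station: 17",
  "",
  "Date,Value",
  "2024-01-01,10",
  "2024-01-02,12"
), tmp)

# Start reading at the first line that contains the substring "Date".
dt = fread(tmp, skip = "Date")
print(dt)

unlink(tmp)
```

Note that the search is for a plain substring, so it helps to pick a string that does not also occur in the metadata lines.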
230225

231-
### 1.9 **Automatic Quote Escape Detection (Including No-Escape)**
226+
### 1.9 Automatic Quote Escape Detection (Including No-Escape)
232227

233228
`fread` automatically detects how quotes are escaped, whether doubled ("") or backslash-escaped (\"), without requiring user input. This is determined using a large sample of the data (see section 1.3), and validated against the entire file.
234229

@@ -251,7 +246,7 @@ From v1.10.6, `fread` resolves ambiguities more reliably across the entire file
251246

252247
`fwrite()` is the fast file writer companion to `fread()`. It’s designed for speed, sensible defaults, and ease of use, mirroring many of the conveniences found in `fread`.
253248

254-
### 2.1 **Intelligent and Minimalist Quoting (quote="auto")**
249+
### 2.1 Intelligent and Minimalist Quoting (quote="auto")
255250

256251
When data is written as strings (either inherently, like character columns, or by choice, like `dateTimeAs="ISO"`), `quote="auto"` (default) intelligently quotes fields:
257252

@@ -272,7 +267,7 @@ fwrite(dt_quoting_scenario, temp_quote_adv)
272267
cat(readLines(temp_quote_adv), sep = "\n")
273268
```
274269

275-
### 2.2 **Fine-Grained Date/Time Serialization (dateTimeAs)**
270+
### 2.2 Fine-Grained Date/Time Serialization (dateTimeAs)
276271

277272
Offers precise control for POSIXct/Date types:
278273

@@ -291,7 +286,7 @@ cat(readLines(temp_dt_iso), sep = "\n")
291286
unlink(temp_dt_iso)
292287
```
293288

294-
### 2.3 **Handling of bit64::integer64**
289+
### 2.3 Handling of bit64::integer64
295290

296291
**Full Precision for Large Integers**: `fwrite` writes `bit64::integer64` columns by converting them to strings with full precision. This prevents data loss or silent conversion to double that might occur with less specialized writers. This is crucial for IDs or measurements requiring more than R's standard `32-bit` integer range or `53-bit` double precision.
297292

@@ -308,7 +303,7 @@ if (requireNamespace("bit64", quietly = TRUE)) {
308303
}
309304
```
310305

311-
### 2.4 **Column Order and Subset Control**
306+
### 2.4 Column Order and Subset Control
312307

313308
To control the order and subset of columns written to file, subset the data.table before calling `fwrite()`. The `col.names` argument in `fwrite()` is a logical (TRUE/FALSE) that controls whether the header row is written, not which columns are written.
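A minimal sketch of this approach, using a hypothetical three-column table: the subset and order are chosen in `j` before the result is passed to `fwrite()`, and `col.names` only toggles the header row.

```r
library(data.table)

# Hypothetical table with three columns.
dt = data.table(a = 1:2, b = c("x", "y"), c = c(2.5, 3.5))
tmp = tempfile(fileext = ".csv")

# Write only columns c and a, in that order.
fwrite(dt[, .(c, a)], tmp)
with_header = readLines(tmp)
print(with_header)  # the first line is the header "c,a"

# Same subset, but without the header row.
fwrite(dt[, .(c, a)], tmp, col.names = FALSE)
without_header = readLines(tmp)
print(without_header)

unlink(tmp)
```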
314309
