# chunked

[![version](https://www.r-pkg.org/badges/version/chunked)](https://cran.r-project.org/package=chunked)
[![Downloads](https://cranlogs.r-pkg.org/badges/chunked)](https://cran.r-project.org/package=chunked)
[![R-CMD-check](https://github.com/edwindj/chunked/workflows/R-CMD-check/badge.svg)](https://github.com/edwindj/chunked/actions)
[![Coverage Status](https://coveralls.io/repos/edwindj/chunked/badge.svg?branch=master&service=github)](https://coveralls.io/github/edwindj/chunked?branch=master)

R is a great tool, but processing data in large text files is cumbersome. `chunked` helps you process large text files with *dplyr* while loading only a part of the data in memory. It builds on the excellent R package [*LaF*](https://github.com/djvanderlaan/LaF).

Processing commands are written in dplyr syntax, and `chunked` (using `LaF`) takes care of processing the file chunk by chunk, using far less memory than reading it in whole. `chunked` is useful for **select**-ing columns, **mutate**-ing columns and **filter**-ing rows. It is less helpful for **group**-ing and **summarize**-ation of large text files. It can be used in data pre-processing.

## Install

`chunked` can be installed with

```r
install.packages('chunked')
```

the beta version with:

```r
install.packages('chunked', repos = c('https://cran.rstudio.com', 'https://edwindj.github.io/drat'))
```

and the development version with:

```r
devtools::install_github('edwindj/chunked')
```

Enjoy! Feedback is welcome...

# Usage

## Text file -> process -> text file

The most common use case is processing a large text file: select or add columns, filter rows, and write the result back to a text file.

```r
read_chunkwise("./large_file_in.csv", chunk_size = 5000) %>%
  select(col1, col2, col5) %>%
  filter(col1 > 10) %>%
  mutate(col6 = col1 + col2) %>%
  write_chunkwise("./large_file_out.csv")
```

`chunked` processes the above statement in chunks of 5000 records. This differs from, for example, `read.csv`, which reads all data into memory before processing it.

## Text file -> process -> database

Another option is to use `chunked` as a preprocessing step before adding data to a database.

```r
con <- DBI::dbConnect(RSQLite::SQLite(), 'test.db', create = TRUE)
db <- dbplyr::src_dbi(con)

tbl <-
  read_chunkwise("./large_file_in.csv", chunk_size = 5000) %>%
  write_chunkwise(db, 'test')
# tbl now points to the table in sqlite.
```

## Db -> process -> Text file

`chunked` can be used to export a database table chunkwise to a text file. Note however that in that case processing takes place in the database and the chunkwise restrictions apply only to the writing.

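A minimal sketch of such an export, reusing the `test` table from the previous section; the column name `col1`, the output path and the `chunk_size` argument are illustrative assumptions, not part of the original example:

```r
library(chunked)
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), 'test.db')
test_tbl <- tbl(dbplyr::src_dbi(con), 'test')

# the filter is executed by the database;
# only the writing happens chunk by chunk
test_tbl %>%
  filter(col1 > 10) %>%
  write_chunkwise('./db_export.csv', chunk_size = 5000)
```
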
## Lazy processing

`chunked` will not start processing until `collect` or `write_chunkwise` is called.

```r
data_chunks <-
  read_chunkwise("./large_file_in.csv", chunk_size = 5000) %>%
  select(col1, col3)

# no processing done until
collect(data_chunks)
# or
write_chunkwise(data_chunks, "test.csv")
# or
write_chunkwise(data_chunks, db, "test")
```

Syntax completion of variables of a chunkwise file in RStudio works like a charm...

# Dplyr verbs

`chunked` implements the following dplyr verbs:

- `filter`
- `select`
- `rename`
- `mutate`
- `mutate_each`
- `transmute`
- `do`
- `tbl_vars`
- `inner_join`
- `left_join`
- `semi_join`
- `anti_join`

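For example, the join verbs can be used to enrich each chunk with a small in-memory table. A sketch, assuming the join target fits in memory; `lookup`, its columns and the file paths are made up for illustration:

```r
library(chunked)
library(dplyr)

# small in-memory lookup table (hypothetical)
lookup <- data.frame(col1 = 1:100, label = paste0("item_", 1:100))

read_chunkwise("./large_file_in.csv", chunk_size = 5000) %>%
  left_join(lookup, by = "col1") %>%   # joined chunk by chunk
  write_chunkwise("./large_file_labeled.csv")
```
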
Since data is processed in chunks, some dplyr verbs are not implemented:

- `arrange`
- `right_join`
- `full_join`

`summarize` and `group_by` are implemented but generate a warning: they operate on each chunk and **not** on the whole data set. However, this makes it easier to process a large file by repeatedly aggregating the resulting data.

- `summarize`
- `group_by`

```r
tmp <- tempfile()
write.csv(iris, tmp, row.names = FALSE, quote = FALSE)
iris_cw <- read_chunkwise(tmp, chunk_size = 30) # read in chunks of 30 rows for this example

iris_cw %>%
  group_by(Species) %>%
  summarize(m = mean(Sepal.Width)) # gives the mean per group for each chunk
```
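
Because the aggregation happens per chunk, an exact overall result needs a second, in-memory aggregation of the collected chunk summaries. A sketch of that pattern, assuming the per-chunk summaries fit in memory:

```r
# first pass: per-chunk sums and counts (chunkwise, with a warning)
chunk_sums <- iris_cw %>%
  group_by(Species) %>%
  summarize(s = sum(Sepal.Width), n = n()) %>%
  collect()

# second pass: aggregate the chunk summaries to the exact overall mean
chunk_sums %>%
  group_by(Species) %>%
  summarize(m = sum(s) / sum(n))
```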