improve efficiency of read.csv and write.csv #7

tdhock · 2023-07-01T01:51:58Z

tdhock
Jul 1, 2023

Hi! I will not be attending the sprint, but I had a couple of ideas related to improving efficiency of read.csv and write.csv.

Probably the more important issue to address would be read.csv, which had time complexity quadratic in the number of columns; see this issue for some empirical analysis:
tdhock/atime#8
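
For anyone who wants a quick look at the scaling without reading the linked issue, here is a minimal sketch (not the atime analysis itself). It assumes a one-row CSV is enough to expose the per-column cost; `time_read_csv` and the column counts are hypothetical choices made for illustration. Timings depend on the R version and machine, so no expected numbers are given.

```r
# Hypothetical helper: time read.csv on a one-row CSV file with p columns.
time_read_csv <- function(p) {
  csv_file <- tempfile(fileext = ".csv")
  on.exit(unlink(csv_file))
  # One row, p integer columns named V1..Vp.
  one_row <- as.data.frame(matrix(seq_len(p), nrow = 1))
  write.csv(one_row, csv_file, row.names = FALSE)
  elapsed <- system.time(df <- read.csv(csv_file))[["elapsed"]]
  stopifnot(ncol(df) == p)  # sanity check: every column was read back
  elapsed
}

# Assumed column counts; doubling p makes the growth rate easy to judge.
n_cols <- c(100, 200, 400, 800)
seconds <- sapply(n_cols, time_read_csv)
print(data.frame(n_cols, seconds))
```

If the time roughly quadruples each time the number of columns doubles, that suggests quadratic growth; if it roughly doubles, linear. Larger column counts may be needed before the trend stands out from timer noise.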

Another issue was that write.csv uses linear memory, whereas other CSV writers use only constant memory (this is not that big of an issue though, because you need linear memory to store the data in R before writing to CSV anyway):
tdhock/atime#10

@gmbecker @bastistician may be able to help mentor? They worked on fixing a similar efficiency issue: tdhock/atime#9

hturner · 2023-07-01T07:45:17Z

hturner
Jul 1, 2023
Maintainer

I guess this may not be high priority as there are many options in contributed packages. However, the vroom package uses ALTREP to do a quick initial read and I wonder if this might be something that makes sense to bring into base R. This might widen the pool of potentially interested mentors to include e.g. @ltierney and @kalibera.

It’s difficult to include in benchmarks as the full read is delayed, but maybe someone could do an initial investigation to see how vroom scales with the number of columns.

1 reply

tdhock Jul 1, 2023
Author

Thanks for the feedback, Heather. I did that comparison: https://tdhock.github.io/blog/2023/compare-read-write/ (vroom is used for the readr LAZY=TRUE methods in my figures). There is some constant factor speed-up for vroom if you don't actually do any computations, but there is no difference in timing if you do some computations using all the data. Overall, I would think that implementing vroom-like functionality in base R would be complicated for a small, constant factor speed-up, whereas the quadratic time fix I proposed may actually be much easier (not sure) and would definitely result in much bigger speed-ups for a large number of columns.

mmaechler · 2023-07-08T08:35:37Z

mmaechler
Jul 8, 2023
Collaborator

I'm almost sure that we (R-core) got suggestions previously... about improving read.table() speed for such cases {read.csv() is a very thin wrapper around read.table()}. Maybe on the R-devel mailing list even?
I vaguely recall that a preliminary conclusion at the time had been that it was hard to fix this without a big rewrite, and such a rewrite would almost necessarily entail loss of back compatibility {which is not easy to maintain with the many optional arguments of read.table() et al.}.

Still, I think everyone would agree that it would be desirable to get to a linear O(p) instead of quadratic O(p^2) time complexity for read.*().

0 replies
