Commit f31c74f

author
Dan Brown
committed
update doc with processing speeds
1 parent 90e7f63 commit f31c74f

1 file changed

+12
-39
lines changed

README.md

Lines changed: 12 additions & 39 deletions
@@ -146,47 +146,20 @@ data will be misinterpreted.
146146

147147
Another issue is that the **character encoding** of CSV files is not specified.
148148

149-
Digression on programming languages
150-
===================================
151-
152-
I'm a novice programmer dabbling in various programming languages. When
153-
starting to learn a new language, I try to port this program into it. This
154-
gives me a real-world exercise and a basis for comparison. Here are a few
155-
remarks based on this limited experience.
156-
157-
* **python** code is quick to write, but slow to run. The python version relies
158-
on the built-in CSV library, but the other versions of this program do not use
159-
any special libraries.
160-
161-
* **go** has some convenient improvements relative to C. In cases where runtime
162-
speed is important, and either concurrency or modern libraries would be
163-
helpful, this is a good choice. For this program, neither of these applied.
164-
When thinking about dependencies, go has the benefit of static linking into an
165-
independent binary; but it's not a widely used language so may not be suitable
166-
for broad distribution.
167-
168-
* **awk** is a good fit for reading and writing text. Relatively fast to write
169-
and run. It's available on all UNIX systems, but can be slightly different.
170-
171-
* **C** is fastest to run, and requires some machine-level understanding of
172-
memory management so takes a bit more care to write the code. It's easy to do
173-
dangerous things in this language, so a bit of guidance can be very helpful.
174-
I believe C is more likely to be useful to other people because most systems
175-
will be able to compile it without installing additional requirements.
149+
Run-time Speed comparison
150+
=========================
176151

177-
Based on these reasons, this project will use the C version in the main branch
178-
and put the other versions in a different branch for reference.
152+
How fast is csvquote? Here are some data processing rates measured when running csvquote on an Intel i7 CPU model from 2013. Due to recently introduced optimizations in csvquote, the processing speed depends on how common the quote characters are in the source data.
179153

180-
Run-time Speed comparison
181-
-------------------------
154+
* 1.9 GB/sec : csvquote reading random csv data with 10% of fields quoted
155+
* 0.5 GB/sec : csvquote reading random csv data with 100% of fields quoted
156+
* 3.7 GB/sec : csvquote reading random csv data with no quoted fields (nothing for csvquote to do!)
182157

183-
Time spent processing a 100 MB CSV file on my laptop.
158+
A common use of csvquote is as one (or two) steps in a pipeline sequence of commands. When each command can run on a separate processor, the time to complete the overall pipeline is determined by the slowest step in the chain of dependencies. As long as csvquote is not the slowest step in the sequence, its relative speed will not affect the overall run time. This seems likely if some of these commands are involved:
184159

185-
* python ~ 100 seconds
186-
* lua ~ 60
187-
* awk (mawk) ~ 14
188-
* luajit ~ 3.5
189-
* go ~ 1.2
190-
* C ~ 1.0
160+
* 3.1 GB/sec : wc -l
161+
* 1.3 GB/sec : grep 'ZZZ'
162+
* 1.0 GB/sec : tr 'a' 'b'
163+
* 0.3 GB/sec : cut -f1
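
The bottleneck reasoning above can be sketched as a small calculation. This is only an idealized model, assuming every stage runs on its own core and that pipe overhead is negligible. The function names `pipeline_rate` and `csvquote_is_bottleneck` are hypothetical helpers for illustration, not part of csvquote. The rates are the figures listed in this section.

```python
# Measured throughputs in GB/sec, copied from the lists in this section.
CSVQUOTE_RATES = {
    "no quoted fields": 3.7,
    "10% of fields quoted": 1.9,
    "100% of fields quoted": 0.5,
}
COMMAND_RATES = {
    "wc -l": 3.1,
    "grep 'ZZZ'": 1.3,
    "tr 'a' 'b'": 1.0,
    "cut -f1": 0.3,
}


def pipeline_rate(*stage_rates):
    """Idealized pipeline throughput: the rate of the slowest stage.

    Assumes each stage runs concurrently on its own processor.
    """
    return min(stage_rates)


def csvquote_is_bottleneck(csvquote_rate, *other_rates):
    """True when csvquote is strictly slower than every other stage."""
    return csvquote_rate < min(other_rates)


if __name__ == "__main__":
    # Compare each csvquote scenario against each single downstream command.
    for scenario, cq_rate in CSVQUOTE_RATES.items():
        for command, cmd_rate in COMMAND_RATES.items():
            rate = pipeline_rate(cq_rate, cmd_rate)
            limited_by = "csvquote" if csvquote_is_bottleneck(cq_rate, cmd_rate) else command
            print(f"{scenario:24} | {command:12} -> {rate} GB/sec (limited by {limited_by})")
```

Under this model, csvquote limits the pipeline only when its rate falls below that of every other stage. With the figures above, that happens for fully quoted data (0.5 GB/sec) paired with any of the listed commands except `cut -f1`.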
191164

192-
These numbers above were observed in 2013, and as of 2022 the current performance of the C version is faster by a factor of 5.
165+
In January 2022, csvquote was rewritten to be approximately 10x faster than before by optimizing for source data in which at least half of the fields are not quoted. The inspiration to revisit this old code came from a rewrite by [skeeto@](https://github.com/skeeto/scratch/tree/master/csvquote). His version is especially aimed at modern CPUs (Intel and AMD from about the year 2015) and runs approximately 50% faster using [AVX2 SIMD](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) instructions. When running skeeto's version on my CPU, it processes data at a consistent pace of 0.7 GB/sec, which does not vary depending on how many quote characters are in the source data.
