README.md: 12 additions & 39 deletions
@@ -146,47 +146,20 @@ data will be misinterpreted.
 
 Another issue is that the **character encoding** of CSV files is not specified.
 
-Digression on programming languages
-===================================
-
-I'm a novice programmer dabbling in various programming languages. When
-starting to learn a new language, I try to port this program into it. This
-gives me a real-world exercise and a basis for comparison. Here are a few
-remarks based on this limited experience.
-
-* **python** code is quick to write, but slow to run. The python version relies
-on the built-in CSV library, but the other versions of this program do not use
-any special libraries.
-
-* **go** has some convenient improvements relative to C. In cases where runtime
-speed is important, and either concurrency or modern libraries would be
-helpful, this is a good choice. For this program, neither of these applied.
-When thinking about dependencies, go has the benefit of static linking into an
-independent binary; but it's not a widely used language so may not be suitable
-for broad distribution.
-
-* **awk** is a good fit for reading and writing text. Relatively fast to write
-and run. It's available on all UNIX systems, but can be slightly different.
-
-* **C** is fastest to run, and requires some machine-level understanding of
-memory management so takes a bit more care to write the code. It's easy to do
-dangerous things in this language, so a bit of guidance can be very helpful.
-I believe C is more likely to be useful to other people because most systems
-will be able to compile it without installing additional requirements.
+Run-time Speed comparison
+=========================
 
-Based on these reasons, this project will use the C version in the main branch
-and put the other versions in a different branch for reference.
+How fast is csvquote? Here are some data processing rates measured when running csvquote on an Intel i7 CPU model from 2013. Due to recently introduced optimizations in csvquote, the processing speed depends on how common the quote characters are in the source data.
 
-Run-time Speed comparison
--------------------------
+* 1.9 GB/sec : csvquote reading random csv data with 10% of fields quoted
+* 0.5 GB/sec : csvquote reading random csv data with 100% of fields quoted
+* 3.7 GB/sec : csvquote reading random csv data with no quoted fields (nothing for csvquote to do!)
 
-Time spent processing a 100 MB CSV file on my laptop.
+A common use of csvquote is as one (or two) steps in a pipeline sequence of commands. When each command can run on a separate processor, the time to complete the overall pipeline sequence will be determined by the slowest step in the chain of dependencies. So as long as csvquote is not the slowest step in the sequence, then its relative speed will not affect the overall run time. This seems likely if some of these commands are involved:
 
-* python ~ 100 seconds
-* lua ~ 60
-* awk (mawk) ~ 14
-* luajit ~ 3.5
-* go ~ 1.2
-* C ~ 1.0
+* 3.1 GB/sec : wc -l
+* 1.3 GB/sec : grep 'ZZZ'
+* 1.0 GB/sec : tr 'a' 'b'
+* 0.3 GB/sec : cut -f1
 
-These numbers above were observed in 2013, and as of 2022 the current performance of the C version is faster by a factor of 5.
+In January 2022, csvquote was rewritten to be approximately 10x faster than before, from optimizing for source data with at least half of its fields not quoted. The inspiration to revisit this old code came from a rewrite by [skeeto@](https://github.com/skeeto/scratch/tree/master/csvquote). His version is especially aimed at modern CPUs (Intel and AMD from about the year 2015) and runs approximately 50% faster using [AVX2 SIMD](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) instructions. When running skeeto's version on my CPU it processes data at a consistent pace of 0.7 GB/sec, and does not vary depending on how many quote characters are in the source data.
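
Note on the pipeline argument in the added text: the claim that the slowest step sets the overall run time can be checked with a small back-of-the-envelope model. The sketch below is only an illustration, not part of csvquote. It assumes an idealized pipeline in which every stage runs concurrently on its own core, reuses the GB/sec figures quoted in the diff, and uses hypothetical helper names (`bottleneck`, `estimated_seconds`) and a hypothetical 3 GB input.

```python
# Throughput figures (GB/sec) copied from the README text in the diff above.
CSVQUOTE_RATES = {
    "no quoted fields": 3.7,
    "10% of fields quoted": 1.9,
    "100% of fields quoted": 0.5,
}
COMMAND_RATES = {
    "wc -l": 3.1,
    "grep 'ZZZ'": 1.3,
    "tr 'a' 'b'": 1.0,
    "cut -f1": 0.3,
}


def bottleneck(stage_rates):
    """Return (name, rate) of the slowest stage in a pipeline."""
    return min(stage_rates.items(), key=lambda item: item[1])


def estimated_seconds(size_gb, stage_rates):
    """Estimate run time under the idealized assumption that all stages
    overlap perfectly, so the slowest stage alone sets the pace."""
    _, rate = bottleneck(stage_rates)
    return size_gb / rate


if __name__ == "__main__":
    # Hypothetical two-stage pipeline: csvquote feeding cut.
    light_quoting = {
        "csvquote": CSVQUOTE_RATES["10% of fields quoted"],
        "cut -f1": COMMAND_RATES["cut -f1"],
    }
    # Hypothetical worst case for csvquote: every field quoted, fast consumer.
    heavy_quoting = {
        "csvquote": CSVQUOTE_RATES["100% of fields quoted"],
        "wc -l": COMMAND_RATES["wc -l"],
    }
    for label, stages in (("light", light_quoting), ("heavy", heavy_quoting)):
        name, rate = bottleneck(stages)
        secs = estimated_seconds(3.0, stages)
        print(f"{label}: bottleneck is {name} at {rate} GB/sec, ~{secs:.1f} s for 3 GB")
```

Under this model csvquote only matters in the second case, where every field is quoted and the downstream command is fast. Real pipelines also pay for pipe buffering and scheduling, so these numbers are at best a lower bound on run time.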