Commit cde550c
Overhaul how column pooling works while parsing (#962)
* Overhaul how column pooling works while parsing
Fixes #950. Ok, so there's a lot going on here, so let me try to summarize
the motivation and what these changes are doing:
* In 0.8 releases, `CSV.File` would parse the first 100 rows of a file,
then use the number of unique values, along with the provided `pool`
argument, to decide whether a column should be pooled or not
* In 0.9, this was changed, since the first 100 rows tend not to be
representative of large files in terms of the number of unique values.
Instead, we parsed the entire file and did the pooling calculation
afterwards.
* Additionally, in multithreaded parsing, each thread parsed its own
chunk of the file, assuming a column would be pooled. After parsing,
we needed to "synchronize" the ref codes used by a pooled column.
Recoding each chunk of a column across the whole file, and processing
and reprocessing the unique values of each chunk, turned out to be
expensive. This was notably costly when the number of unique values was
large (tens of thousands of values), as reported in #950.
That provides the context for why we needed to change; so what is
changing here:
* We previously tried to assume a column would be pooled before
parsing and build up the "pool dict" while parsing. This ended up
complicating a lot of code: additional code to build the dict,
complications because the element type doesn't match the column type
(`UInt32` for the ref array, `T` for the pool dict key type), and
complications for all the promotion machinery (needing to
promote/convert pool values instead of ref values). That's in addition
to the very complex `syncrefs!` routine, which has been the source of
more than a few bugs.
* Proposed in this PR is to ignore pooling until post-parsing, where
we'll start checking unique values and do the calculation for whether
a column should be pooled or not.
* Also introduced in this PR: the `pool` keyword argument may now be a
number greater than `1.0`, which is interpreted as an absolute upper
limit on the number of unique values allowed before a column switches
from pooled to non-pooled. This makes sense for larger files for
several reasons: using a percentage can allow a really large number of
unique values, which can be a performance bottleneck (though
percentages are still allowed in this PR), and an absolute limit gives
us a way to "short-circuit" the pooling check post-parsing. If the pool
of unique values grows past the upper limit, we can immediately abort
building an expensive pool with potentially lots of unique values. The
default of the `pool` keyword argument is updated to `500`. Note also
that a file must have more rows than the `pool` upper limit in order to
pool; otherwise, the column won't be pooled.
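The decision rule described above can be sketched roughly as follows, in Python for illustration (the actual implementation is Julia code inside CSV.jl; the function name `should_pool` and the exact boundary conditions are assumptions, not the package's API):

```python
def should_pool(nuniques, nrows, pool):
    """Illustrative sketch of the post-parsing pooling decision.

    `pool` may be a fraction in [0, 1] (pool when the fraction of unique
    values is at or below it) or an absolute limit > 1 (pool only when
    the unique count stays within the limit AND the file has more rows
    than the limit). Boundary conditions here are assumed.
    """
    if pool <= 1.0:
        # percentage threshold: pool when unique values are a small
        # enough fraction of total rows
        return nuniques / nrows <= pool
    # absolute upper limit (default 500 in this PR): the file must have
    # more rows than the limit, and the unique count must not exceed it
    return nrows > pool and nuniques <= pool
```

The absolute-limit branch is what enables the short-circuit: while collecting unique values post-parsing, the check can abort as soon as the count exceeds the limit, without scanning the rest of the column.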
One consideration, and potential optimization: if `stringtype=String`,
or a column otherwise has type `String`, we'll allocate each row's
`String` individually during parsing, instead of getting the natural
"interning" benefit of the pooling strategies of 0.8 or of master prior
to this PR. During various rounds of benchmarking ever since before
0.8, one thing that has continued to surface is how efficient
allocating individual strings actually is. So while we could
_potentially_ gain a little efficiency by interning string columns
while parsing, the benefit is smaller than people might guess. The
memory saved by interning is partly used up by keeping the intern pool
around, and the extra cost of hashing to check the intern pool can be
comparable to just allocating a smallish string and storing it in our
parsed array. The other consideration is multithreaded parsing: the
interning benefit isn't as strong when each thread has to maintain its
own intern pool (not as many values to intern against). We could
perhaps explore a lock-free or spin-lock based dict/set for interning
globally across a column, which may end up providing enough of a
performance benefit. If we do manage to figure out efficient interning,
we could leverage the intern pool post-parsing when deciding whether a
column should be pooled or not.
* Allow `pool` keyword arg to be `Tuple{Float64, Int}`

1 parent: 482a187
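A hedged sketch of the tuple form, again in Python for illustration: the commit message doesn't spell out the semantics, so this assumes the `Float64` is the fraction threshold and the `Int` is the absolute upper limit, with both checks required to hold for a column to be pooled.

```python
def should_pool_tuple(nuniques, nrows, pool):
    """Assumed semantics of pool::Tuple{Float64, Int}: the fraction of
    unique values must be at or below the Float64 threshold AND the
    unique count must not exceed the Int limit. Both the function name
    and the combination rule are illustrative assumptions."""
    frac, limit = pool
    return nuniques / nrows <= frac and nuniques <= limit
```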
12 files changed: +134 −280 lines (docs/src, src, test)