Commit cde550c

Overhaul how column pooling works while parsing (#962)
* Overhaul how column pooling works while parsing

  Fixes #950. There's a lot going on here, so let me summarize the motivation and what the changes are doing.

  Background:

  * In 0.8 releases, `CSV.File` would parse the first 100 rows of a file, then use the # of unique values found and the provided `pool` argument to decide whether a column should be pooled.
  * In 0.9, this changed, since the first 100 rows tend not to be representative of large files in terms of # of unique values. Instead, we parsed the entire file and did the pooling calculation afterwards.
  * Additionally, in multithreaded parsing, each thread parsed its own chunk of the file, assuming a column would be pooled. After parsing, we needed to "synchronize" the ref codes used by a pooled column. Recoding each chunk of a column across the whole file, and processing and reprocessing the unique values of each chunk, ended up being expensive, notably so when the # of unique values was large (tens of thousands), as reported in #950.

  That's the context for why we needed a change; here's what is changing:

  * We previously assumed a column would be pooled before parsing and built up the "pool dict" while parsing. This complicated a lot of code: additional code to build the dict, complications because the element type doesn't match the column type (`UInt32` for the ref array, `T` for the pool dict key type), and complications for all the promotion machinery (needing to promote/convert pool values instead of ref values). That's on top of the very complex `syncrefs!` routine, which has been the source of more than a few bugs.
  * Proposed in this PR: ignore pooling until post-parsing, where we check unique values and compute whether a column should be pooled or not.
  * Also introduced in this PR: the `pool` keyword argument may be a number greater than `1.0`, interpreted as an absolute upper limit on the # of unique values allowed before a column switches from pooled to non-pooled. This makes sense for larger files for a few reasons: using a % can result in a really large # of allowed unique values, which can be a performance bottleneck (though still allowed in this PR), and it gives us a way to "short-circuit" the pooling check post-parsing. If the pool of unique values grows past the upper limit threshold, we can immediately abort building an expensive pool with potentially lots of unique values; a sketch of this check follows this list. The default `pool` value is updated accordingly (with the tuple form from the follow-up commit below, it becomes `(0.2, 500)`). Note also that a file must have more rows than the `pool` upper limit in order to pool; otherwise, the column won't be pooled.
  * One consideration, and potential optimization: if `stringtype=String`, or a column otherwise has type `String`, we individually allocate each row's `String` during parsing, instead of getting the natural "interning" benefit of the pooling strategies of 0.8 or of the code prior to this PR. During various rounds of benchmarking ever since before 0.8, one thing that has continued to surface is how efficient allocating individual strings actually is. So while we could _potentially_ gain a little efficiency by interning string columns while parsing, the benefit is smaller than people might guess: the memory saved by interning is partly used up by keeping the intern pool around, and the extra cost of hashing to check the intern pool can be comparable to just allocating a smallish string and setting it in our parsed array. The other consideration is multithreaded parsing: the interning benefit isn't as strong when each thread has to maintain its own intern pool (not as many values to intern against). We could perhaps explore a lock-free or spin-lock based dict/set for interning globally per column, which may end up providing enough performance benefit. If we do manage to figure out efficient interning, we could leverage the intern pool post-parsing when considering whether a column should be pooled.

* Allow pool keyword arg to be Tuple{Float64, Int}
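To make the short-circuit concrete, here is a minimal sketch of the post-parsing decision described above; `shouldpool` is an illustrative name, not CSV.jl's actual internal API:

```julia
# Minimal sketch of the post-parsing pooling decision (illustrative, not
# CSV.jl's internals): `percent` is the cardinality threshold, `upperlimit`
# the absolute cap on unique values.
function shouldpool(column::AbstractVector, pool::Union{Float64, Tuple{Float64, Int}})
    percent, upperlimit = pool isa Tuple ? pool : (pool, typemax(Int))
    seen = Set{eltype(column)}()
    for x in column
        push!(seen, x)
        # short-circuit: abort building the (potentially expensive) pool as
        # soon as the # of unique values exceeds the absolute upper limit
        length(seen) > upperlimit && return false
    end
    return length(seen) <= percent * length(column)
end
```

With the `(0.2, 500)` default, a high-cardinality column bails out as soon as a 501st unique value appears, rather than materializing the full pool first.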
1 parent 482a187 commit cde550c

File tree

12 files changed: +134, -280 lines changed


docs/src/examples.md

Lines changed: 23 additions & 2 deletions

````diff
@@ -630,8 +630,8 @@ file = CSV.File(IOBuffer(data); types=Dict(:zipcode => String))
 using CSV
 
 # In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations
-# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt64`. By default,
-# `pool=0.1`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
+# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default,
+# `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
 # greater control: `pool=0.4` means that if 40% or less of a column's values are unique, then it will be pooled.
 data = """
 id,code
@@ -666,4 +666,25 @@ category,amount
 
 file = CSV.File(IOBuffer(data); pool=Dict(1 => true))
 file = CSV.File(IOBuffer(data); pool=[true, false])
+```
+
+## [Pool with absolute threshold](@id pool_absolute_threshold)
+
+```julia
+using CSV
+
+# In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations
+# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default,
+# `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
+# greater control: `pool=(0.5, 2)` means that if a column has 2 or fewer unique values _and_ the total number of unique values is less than 50% of all values, then it will be pooled.
+data = """
+id,code
+A18E9,AT
+BF392,GC
+93EBC,AT
+54EE1,AT
+8CD2E,GC
+"""
+
+file = CSV.File(IOBuffer(data); pool=(0.5, 2))
 ```
````
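As a hedged follow-up to the new docs example: running it should pool only the low-cardinality column (exact array types may vary across CSV.jl versions):

```julia
using CSV

data = """
id,code
A18E9,AT
BF392,GC
93EBC,AT
54EE1,AT
8CD2E,GC
"""

file = CSV.File(IOBuffer(data); pool=(0.5, 2))
file.code  # 2 unique values out of 5 (40% <= 50% and 2 <= 2): pooled
file.id    # 5 unique values out of 5: stays a plain string vector
```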

docs/src/reading.md

Lines changed: 2 additions & 1 deletion

```diff
@@ -192,11 +192,12 @@ A `Dict{Type, Type}` argument that allows replacing a non-`String` standard type
 
 ## [`pool`](@id pool)
 
-Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, vector of `Bool` or `Float64`, dict mapping column number/name to `Bool` or `Float64`, or a function of the form `(i, name) -> Union{Bool, Float64, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` indicating the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. As mentioned, when the `pool` argument is a single `Bool` or `Float64`, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool` or `Float64` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
+Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, `Tuple{Float64, Int}`, vector, dict, or a function of the form `(i, name) -> Union{Bool, Real, Tuple{Float64, Int}, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` to indicate the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. If `pool` is provided as a tuple, like `(0.2, 500)`, the first tuple element is the same as a single `Float64` value, which represents the % cardinality allowed. The second tuple element is an upper limit on the # of unique values allowed to pool the column. So the example, `pool=(0.2, 500)` means if a String column has less than or equal to 500 unique values _and_ the # of unique values is less than 20% of total # of values, it will be pooled, otherwise, it won't. As mentioned, when the `pool` argument is a single `Bool`, `Real`, or `Tuple{Float64, Int}`, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool`, `Float64`, or `Tuple{Float64, Int}`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool`, `Float64`, or `Tuple{Float64, Int}` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
 
 ### Examples
 * [Pooled values](@ref pool_example)
 * [Non-string column pooling](@ref nonstring_pool_example)
+* [Pool with absolute threshold](@ref pool_absolute_threshold)
 
 ## [`downcast`](@id downcast)
 
```
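To summarize the documented forms side by side, a hedged sketch; `"data.csv"` is a placeholder input file:

```julia
using CSV

CSV.File("data.csv"; pool=true)          # pool every string column, regardless of cardinality
CSV.File("data.csv"; pool=0.1)           # pool when unique values are < 10% of all values
CSV.File("data.csv"; pool=(0.2, 500))    # ...and there are at most 500 unique values
CSV.File("data.csv"; pool=[true, 0.5])   # one entry per column in the file
CSV.File("data.csv"; pool=Dict(:code => (0.5, 2)))  # by column name; unspecified columns aren't pooled
CSV.File("data.csv"; pool=(i, name) -> name == :code ? 0.5 : nothing)  # function form
```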

src/CSV.jl

Lines changed: 1 addition & 1 deletion

```diff
@@ -56,7 +56,7 @@ Base.showerror(io::IO, e::Error) = println(io, e.msg)
 
 # constants
 const DEFAULT_STRINGTYPE = InlineString
-const DEFAULT_POOL = 0.25
+const DEFAULT_POOL = (0.2, 500)
 const DEFAULT_ROWS_TO_CHECK = 30
 const DEFAULT_MAX_WARNINGS = 100
 const DEFAULT_MAX_INLINE_STRING_LENGTH = 32
```

src/README.md

Lines changed: 6 additions & 7 deletions

```diff
@@ -22,14 +22,13 @@ By providing the `pool` keyword argument, users can control how this optimization
 
 Valid inputs for `pool` include:
 * A `Bool`, `true` or `false`, which will apply to all string columns parsed; string columns either will _all_ be pooled, or _all_ not pooled
-* A `Real`, which will be converted to `Float64`, which should be a value between `0.0` and `1.0`, to indicate the % cardinality threshold _under which_ a column will be pooled. e.g. by passing `pool=0.1`, if a column has less than 10% unique values, it will end up as a `PooledArray`, otherwise a normal array. Like the `Bool` argument, this will apply the same % threshold to only/all string columns
-* An `AbstractVector`, where the # of elements should/needs to match the # of columns in the dataset. Each element of the `pool` argument should be a `Bool` or `Real` indicating the pooling behavior for each specific column.
-* An `AbstractDict`, with keys as `String`s, `Symbol`s, or `Int`s referring to column names or indices, and values in the `AbstractDict` being `Bool` or `Real` to again signal how specific columns should be pooled
+* A `Real`, which will be converted to `Float64`, which should be a value between `0.0` and `1.0`, to indicate the % cardinality threshold _under which_ a column will be pooled. e.g. by passing `pool=0.1`, if a column has less than 10% unique values, it will end up as a `PooledArray`, otherwise a normal array. Like the `Bool` argument, this will apply the same % threshold to only/all string columns.
+* a `Tuple{Float64, Int}`, where the 1st argument is the same as the above percent threshold on cardinality, while the 2nd argument is an absolute upper limit on the # of unique values. This is useful for large datasets where 0.2 may grow to allow pooled columns with thousands of values; it's helpful performance-wise to put an upper limit like `pool=(0.2, 500)` to ensure no pooled column will have more than 500 unique values.
+* An `AbstractVector`, where the # of elements should/needs to match the # of columns in the dataset. Each element of the `pool` argument should be a `Bool`, `Real`, or `Tuple{Float64, Int}` indicating the pooling behavior for each specific column.
+* An `AbstractDict`, with keys as `String`s, `Symbol`s, or `Int`s referring to column names or indices, and values in the `AbstractDict` being `Bool`, `Real`, or `Tuple{Float64, Int}` to again signal how specific columns should be pooled
+* A function of the form `(i, nm) -> Union{Bool, Real, Tuple{Float64, Int}}` where it takes the column index and name as two arguments, and returns one of the first 3 possible pool values from the above list.
 
 For the implementation of pooling:
 * We normalize however the keyword argument was provided to have a `pool` value per column while parsing
 * We also have a `pool` field on the `Context` structure in case columns are widened while parsing, they will take on this value
-* For multithreaded parsing, we decide if a column will be pooled or not from the type sampling stage; if a column has a `pool` value of `1.0`, it will _always_ be pooled (as requested), if it has `0.0` it will _not_ be pooled, if `0.0 < pool < 1.0` then we'll calculate whether or not it will be pooled from the sampled values. As noted aboved, a `pool` value of `NaN` will also be considered if a column had `String` values sampled and meets the default threshold. Currently, we'll sample `rows_to_check * ntasks` values per column, which are both configurable via keyword arguments, with defaults `rows_to_check=30` and `ntasks=Threads.nthreads()`. From those samples, we'll calculate the # of unique values divided by the total # of values sampled and compare it with the `pool` value to determine whether the column will be ultimately pooled. The ramification here is that we may "get it wrong" in two different ways: 1) based on the values sampled, we may determine that a column _will_ be pooled even though the total # of uniques values we'll parse will be over the `pool` threshold and 2) we may determine a column _shouldn't_ be pooled because of seemingly high cardinality, even though the total # of unique values for the column is ultimately _lower_ than the `pool` threshold. The only way to do things perfectly is to check _all_ the values of the entire column, but in multithreaded parsing, that would be expensive relative to the simpler sampling method. The other unique piece of multithreaded parsing is that we form a column's initial `refpool` field from the sampled values, which individual task-parsing columns will then start with when parsing their local chunks. Two tricky implementation details involved with sampling and pooling are 1) if a column's values end up being promoted to a different type _while_ sampling or post-sampling while parsing, and 2) if the whole columnset parsed is widened. For the first case, we take all sampled values and promote to the "widest" type, then when building a potential refpool, only consider values that "are" (i.e. `val isa type`) of the promoted type. That may seem like the obvious course of action, but consider that we may detect a value like `2021-01-01` as a `Date` object, but the column type is promoted to `String`; in that case, the `Date(2021, 1, 1)` object parsed _will not_ be in the initial refpool, since `!(Date(2021, 1, 1) isa String)`. For the 2nd tricky case, the columnset isn't widened while type sampling, so "extra" columns are just ignored. The columnset _will_ be widened by each local parsing task that detects the extra columns, and those extra columns will be synchronized/promoted post-parsing as needed. These "widened" columns will only ever be pooled if the user passed `pool=true`, meaning _every_ column for the whole file should be pooled.
-* In the single-threaded case, we take a simpler approach by pooling any column with `pool` value `0.0 < pool <= 1.0`, meaning even if we're not totally sure the column will be pooled, we'll pool it while parsing, then decide post-parsing whether the column should actually be pooled or unpooled.
-* One of the effects of these approaches to pooling in single vs. multithreading code is that we'll never change _whether a column is pooled_ while parsing; it's either decided post-sampling (multithreaded) or post-parsing (single-threaded).
-* Couple other notes on things we have to account for in the multithreading case that makes things a little more complicated. We have to synchronize ref values between local parsing tasks. This is because each task is encountering unique values in different orders, so different tasks may have the same values, but mapped to different ref values. We also have to account for the fact that different parsing tasks may also have promoted to different types, so we may need to promote the pooled values.
+* Once column parsing is done, the cardinality is checked against the individual column pool value and whether the column should be pooled or not is computed.
```
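For the first implementation bullet ("we normalize however the keyword argument was provided"), here is a hypothetical sketch of such a normalization; `getpool` and its dict/function fallbacks are illustrative, not the package's actual internals:

```julia
# Illustrative only: normalize a user-provided `pool` argument to a
# per-column Float64 or (Float64, Int) value.
function getpool(pool, i::Int, name::Symbol)
    pool === true && return 1.0
    pool === false && return 0.0
    pool isa Real && return Float64(pool)
    pool isa Tuple{Float64, Int} && return pool
    pool isa AbstractVector && return getpool(pool[i], i, name)
    # dicts may key columns by name or index; unspecified columns aren't pooled
    pool isa AbstractDict && return getpool(get(pool, name, get(pool, i, false)), i, name)
    # function form may return `nothing`, meaning "use no pooling"
    pool isa Base.Callable && return getpool(something(pool(i, name), false), i, name)
    throw(ArgumentError("unsupported pool argument: $pool"))
end
```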

src/chunks.jl

Lines changed: 1 addition & 1 deletion

```diff
@@ -64,7 +64,7 @@ function Chunks(source::ValidSources;
     type=nothing,
     types=nothing,
     typemap::Dict=Dict{Type, Type}(),
-    pool::Union{Bool, Real, AbstractVector, AbstractDict}=DEFAULT_POOL,
+    pool::Union{Bool, Real, AbstractVector, AbstractDict, Base.Callable, Tuple}=DEFAULT_POOL,
     downcast::Bool=false,
     lazystrings::Bool=false,
     stringtype::StringTypes=DEFAULT_STRINGTYPE,
```

src/context.jl

Lines changed: 4 additions & 26 deletions

```diff
@@ -1,19 +1,3 @@
-# a RefPool holds our refs as a Dict, along with a lastref field which is incremented when a new ref is found while parsing pooled columns
-mutable struct RefPool
-    # what? why ::Any here? well, we want flexibility in what kind of refs we stick in here
-    # it might be Dict{Union{String, Missing}, UInt32}, but it might be some other string type
-    # or it might not allow `missing`; in short, there are too many options to try and type
-    # the field concretely; luckily, working with the `refs` field here is limited to
-    # a very few specific methods, and we'll always have the column type, so we just need
-    # to make sure we assert the concrete type before using refs
-    refs::Any
-    lastref::UInt32
-end
-
-# start lastref at 1, since it's reserved for `missing`, so first ref value will be 2
-const Refs{T} = Dict{Union{T, Missing}, UInt32}
-RefPool(::Type{T}=String) where {T} = RefPool(Refs{T}(), 1)
-
 """
 Internal structure used to track information for a single column in a delimited file.
 
@@ -25,7 +9,6 @@ Fields:
 * `pool`: computed from `pool` keyword argument; `true` is `1.0`, `false` is `0.0`, everything else is `Float64(pool)`; once computed, this field isn't mutated at all while parsing; it's used in type detection to determine whether a column will be pooled or not once a type is detected;
 * `columnspecificpool`: if `pool` was provided via Vector or Dict by user, then `true`, other `false`; if `false`, then only string column types will attempt pooling
 * `column`: the actual column vector to hold parsed values; field is typed as `AbstractVector` and while parsing, we do switches on `col.type` to assert the column type to make code concretely typed
-* `refpool`: if the column is pooled (or might be pooled in single-threaded case), this is the column-specific `RefPool` used to track unique parsed values and their `UInt32` ref codes
 * `lock`: in multithreaded parsing, we have a top-level set of `Vector{Column}`, then each threaded parsing task makes its own copy to parse its own chunk; when synchronizing column types/pooled refs, the task-local `Column` will `lock(col.lock)` to make changes to the parent `Column`; each task-local `Column` shares the same `lock` of the top-level `Column`
 * `position`: for transposed reading, the current column position
 * `endposition`: for transposed reading, the expected ending position for this column
@@ -36,18 +19,17 @@ mutable struct Column
     anymissing::Bool
     userprovidedtype::Bool
     willdrop::Bool
-    pool::Float64
+    pool::Union{Float64, Tuple{Float64, Int}}
     columnspecificpool::Bool
     # lazily/manually initialized fields
     column::AbstractVector
-    refpool::RefPool
     # per top-level column fields (don't need to copy per task when parsing)
     lock::ReentrantLock
     position::Int
     endposition::Int
     options::Parsers.Options
 
-    Column(type::Type, anymissing::Bool, userprovidedtype::Bool, willdrop::Bool, pool::Float64, columnspecificpool::Bool) =
+    Column(type::Type, anymissing::Bool, userprovidedtype::Bool, willdrop::Bool, pool::Union{Float64, Tuple{Float64, Int}}, columnspecificpool::Bool) =
         new(type, anymissing, userprovidedtype, willdrop, pool, columnspecificpool)
 end
 
@@ -71,10 +53,6 @@ function Column(x::Column)
     if isdefined(x, :options)
         y.options = x.options
     end
-    if isdefined(x, :refpool)
-        # if parent has refpool from sampling, make a copy
-        y.refpool = RefPool(copy(x.refpool.refs), x.refpool.lastref)
-    end
     # specifically _don't_ copy/re-use x.column; that needs to be allocated fresh per parsing task
     return y
 end
@@ -126,7 +104,7 @@ struct Context
     datarow::Int
     options::Parsers.Options
     columns::Vector{Column}
-    pool::Float64
+    pool::Union{Float64, Tuple{Float64, Int}}
     downcast::Bool
     customtypes::Type
     typemap::Dict{Type, Type}
@@ -239,7 +217,7 @@ end
     type::Union{Nothing, Type},
     types::Union{Nothing, Type, AbstractVector, AbstractDict, Function},
    typemap::Dict,
-    pool::Union{Bool, Real, AbstractVector, AbstractDict, Base.Callable},
+    pool::Union{Bool, Real, AbstractVector, AbstractDict, Base.Callable, Tuple},
    downcast::Bool,
    lazystrings::Bool,
    stringtype::StringTypes,
```
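With `RefPool` removed, ref-encoding can happen entirely after parsing by handing the finished vector to PooledArrays (already a CSV.jl dependency); a minimal sketch of that conversion for a column that passed the pooling check:

```julia
using PooledArrays

col = ["AT", "GC", "AT", "AT", "GC"]  # a fully parsed column vector
pooled = PooledArray(col, UInt32)     # builds the UInt32 ref codes and pool in one pass
```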
