Commit cde550c

Overhaul how column pooling works while parsing (#962)
* Overhaul how column pooling works while parsing

  Fixes #950. There's a lot going on here, so let me summarize the motivation and what the changes are doing.

  Background:

  * In 0.8 releases, `CSV.File` would parse the first 100 rows of a file, then use the # of unique values found and the provided `pool` argument to decide whether a column should be pooled.
  * In 0.9, this changed, since the first 100 rows tend not to be representative of large files in terms of # of unique values. Instead, we parsed the entire file and did the pooling calculation afterwards.
  * Additionally, in multithreaded parsing, each thread parsed its own chunk of the file, assuming a column would be pooled. After parsing, we needed to "synchronize" the ref codes used by a pooled column. Recoding each chunk of a column across the whole file, and processing and reprocessing the unique values of each chunk, ended up being expensive, notably so when the # of unique values was large (tens of thousands), as reported in #950.

  That's the context for why we needed a change; here's what is changing:

  * We previously assumed a column would be pooled before parsing and built up the "pool dict" while parsing. This complicated a lot of code: additional code to build the dict, complications because the element type doesn't match the column type (`UInt32` for the ref array, `T` for the pool dict key type), and complications for all the promotion machinery (needing to promote/convert pool values instead of ref values). That's on top of the very complex `syncrefs!` routine, which has been the source of more than a few bugs.
  * Proposed in this PR: ignore pooling until post-parsing, where we check unique values and compute whether a column should be pooled or not.
  * Also introduced in this PR: the `pool` keyword argument may be a number greater than `1.0`, interpreted as an absolute upper limit on the # of unique values allowed before a column switches from pooled to non-pooled. This makes sense for larger files for a few reasons: using a % can result in a really large # of allowed unique values, which can be a performance bottleneck (though still allowed in this PR), and it gives us a way to "short-circuit" the pooling check post-parsing. If the pool of unique values grows past the upper limit threshold, we can immediately abort building an expensive pool with potentially lots of unique values; a sketch of this check follows this list. The default `pool` value is updated accordingly (with the tuple form from the follow-up commit below, it becomes `(0.2, 500)`). Note also that a file must have more rows than the `pool` upper limit in order to pool; otherwise, the column won't be pooled.
  * One consideration, and potential optimization: if `stringtype=String`, or a column otherwise has type `String`, we individually allocate each row's `String` during parsing, instead of getting the natural "interning" benefit of the pooling strategies of 0.8 or of the code prior to this PR. During various rounds of benchmarking ever since before 0.8, one thing that has continued to surface is how efficient allocating individual strings actually is. So while we could _potentially_ gain a little efficiency by interning string columns while parsing, the benefit is smaller than people might guess: the memory saved by interning is partly used up by keeping the intern pool around, and the extra cost of hashing to check the intern pool can be comparable to just allocating a smallish string and setting it in our parsed array. The other consideration is multithreaded parsing: the interning benefit isn't as strong when each thread has to maintain its own intern pool (not as many values to intern against). We could perhaps explore a lock-free or spin-lock based dict/set for interning globally per column, which may end up providing enough performance benefit. If we do manage to figure out efficient interning, we could leverage the intern pool post-parsing when considering whether a column should be pooled.

* Allow pool keyword arg to be Tuple{Float64, Int}
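To make the short-circuit concrete, here is a minimal sketch of the post-parsing decision described above; `shouldpool` is an illustrative name, not CSV.jl's actual internal API:

```julia
# Minimal sketch of the post-parsing pooling decision (illustrative, not
# CSV.jl's internals): `percent` is the cardinality threshold, `upperlimit`
# the absolute cap on unique values.
function shouldpool(column::AbstractVector, pool::Union{Float64, Tuple{Float64, Int}})
    percent, upperlimit = pool isa Tuple ? pool : (pool, typemax(Int))
    seen = Set{eltype(column)}()
    for x in column
        push!(seen, x)
        # short-circuit: abort building the (potentially expensive) pool as
        # soon as the # of unique values exceeds the absolute upper limit
        length(seen) > upperlimit && return false
    end
    return length(seen) <= percent * length(column)
end
```

With the `(0.2, 500)` default, a high-cardinality column bails out as soon as a 501st unique value appears, rather than materializing the full pool first.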
1 parent 482a187 commit cde550c

File tree

12 files changed: +134, -280 lines changed


docs/src/examples.md

Lines changed: 23 additions & 2 deletions

````diff
@@ -630,8 +630,8 @@ file = CSV.File(IOBuffer(data); types=Dict(:zipcode => String))
 using CSV
 
 # In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations
-# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt64`. By default,
-# `pool=0.1`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
+# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default,
+# `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
 # greater control: `pool=0.4` means that if 40% or less of a column's values are unique, then it will be pooled.
 data = """
 id,code
@@ -666,4 +666,25 @@ category,amount
 
 file = CSV.File(IOBuffer(data); pool=Dict(1 => true))
 file = CSV.File(IOBuffer(data); pool=[true, false])
+```
+
+## [Pool with absolute threshold](@id pool_absolute_threshold)
+
+```julia
+using CSV
+
+# In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations
+# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default,
+# `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
+# greater control: `pool=(0.5, 2)` means that if a column has 2 or fewer unique values _and_ the total number of unique values is less than 50% of all values, then it will be pooled.
+data = """
+id,code
+A18E9,AT
+BF392,GC
+93EBC,AT
+54EE1,AT
+8CD2E,GC
+"""
+
+file = CSV.File(IOBuffer(data); pool=(0.5, 2))
 ```
````
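As a hedged follow-up to the new docs example: running it should pool only the low-cardinality column (exact array types may vary across CSV.jl versions):

```julia
using CSV

data = """
id,code
A18E9,AT
BF392,GC
93EBC,AT
54EE1,AT
8CD2E,GC
"""

file = CSV.File(IOBuffer(data); pool=(0.5, 2))
file.code  # 2 unique values out of 5 (40% <= 50% and 2 <= 2): pooled
file.id    # 5 unique values out of 5: stays a plain string vector
```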

docs/src/reading.md

Lines changed: 2 additions & 1 deletion

```diff
@@ -192,11 +192,12 @@ A `Dict{Type, Type}` argument that allows replacing a non-`String` standard type
 
 ## [`pool`](@id pool)
 
-Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, vector of `Bool` or `Float64`, dict mapping column number/name to `Bool` or `Float64`, or a function of the form `(i, name) -> Union{Bool, Float64, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` indicating the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. As mentioned, when the `pool` argument is a single `Bool` or `Float64`, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool` or `Float64` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
+Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, `Tuple{Float64, Int}`, vector, dict, or a function of the form `(i, name) -> Union{Bool, Real, Tuple{Float64, Int}, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` to indicate the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. If `pool` is provided as a tuple, like `(0.2, 500)`, the first tuple element is the same as a single `Float64` value, which represents the % cardinality allowed. The second tuple element is an upper limit on the # of unique values allowed to pool the column. So the example, `pool=(0.2, 500)` means if a String column has less than or equal to 500 unique values _and_ the # of unique values is less than 20% of total # of values, it will be pooled, otherwise, it won't. As mentioned, when the `pool` argument is a single `Bool`, `Real`, or `Tuple{Float64, Int}`, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool`, `Float64`, or `Tuple{Float64, Int}`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool`, `Float64`, or `Tuple{Float64, Int}` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
 
 ### Examples
 * [Pooled values](@ref pool_example)
 * [Non-string column pooling](@ref nonstring_pool_example)
+* [Pool with absolute threshold](@ref pool_absolute_threshold)
 
 ## [`downcast`](@id downcast)
 
```
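To summarize the documented forms side by side, a hedged sketch; `"data.csv"` is a placeholder input file:

```julia
using CSV

CSV.File("data.csv"; pool=true)          # pool every string column, regardless of cardinality
CSV.File("data.csv"; pool=0.1)           # pool when unique values are < 10% of all values
CSV.File("data.csv"; pool=(0.2, 500))    # ...and there are at most 500 unique values
CSV.File("data.csv"; pool=[true, 0.5])   # one entry per column in the file
CSV.File("data.csv"; pool=Dict(:code => (0.5, 2)))  # by column name; unspecified columns aren't pooled
CSV.File("data.csv"; pool=(i, name) -> name == :code ? 0.5 : nothing)  # function form
```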

src/CSV.jl

Lines changed: 1 addition & 1 deletion

```diff
@@ -56,7 +56,7 @@ Base.showerror(io::IO, e::Error) = println(io, e.msg)
 
 # constants
 const DEFAULT_STRINGTYPE = InlineString
-const DEFAULT_POOL = 0.25
+const DEFAULT_POOL = (0.2, 500)
 const DEFAULT_ROWS_TO_CHECK = 30
 const DEFAULT_MAX_WARNINGS = 100
 const DEFAULT_MAX_INLINE_STRING_LENGTH = 32
```

src/README.md

Lines changed: 6 additions & 7 deletions

```diff
@@ -22,14 +22,13 @@ By providing the `pool` keyword argument, users can control how this optimization
 
 Valid inputs for `pool` include:
 * A `Bool`, `true` or `false`, which will apply to all string columns parsed; string columns either will _all_ be pooled, or _all_ not pooled
-* A `Real`, which will be converted to `Float64`, which should be a value between `0.0` and `1.0`, to indicate the % cardinality threshold _under which_ a column will be pooled. e.g. by passing `pool=0.1`, if a column has less than 10% unique values, it will end up as a `PooledArray`, otherwise a normal array. Like the `Bool` argument, this will apply the same % threshold to only/all string columns
-* An `AbstractVector`, where the # of elements should/needs to match the # of columns in the dataset. Each element of the `pool` argument should be a `Bool` or `Real` indicating the pooling behavior for each specific column.
-* An `AbstractDict`, with keys as `String`s, `Symbol`s, or `Int`s referring to column names or indices, and values in the `AbstractDict` being `Bool` or `Real` to again signal how specific columns should be pooled
+* A `Real`, which will be converted to `Float64`, which should be a value between `0.0` and `1.0`, to indicate the % cardinality threshold _under which_ a column will be pooled. e.g. by passing `pool=0.1`, if a column has less than 10% unique values, it will end up as a `PooledArray`, otherwise a normal array. Like the `Bool` argument, this will apply the same % threshold to only/all string columns.
+* a `Tuple{Float64, Int}`, where the 1st argument is the same as the above percent threshold on cardinality, while the 2nd argument is an absolute upper limit on the # of unique values. This is useful for large datasets where 0.2 may grow to allow pooled columns with thousands of values; it's helpful performance-wise to put an upper limit like `pool=(0.2, 500)` to ensure no pooled column will have more than 500 unique values.
+* An `AbstractVector`, where the # of elements should/needs to match the # of columns in the dataset. Each element of the `pool` argument should be a `Bool`, `Real`, or `Tuple{Float64, Int}` indicating the pooling behavior for each specific column.
+* An `AbstractDict`, with keys as `String`s, `Symbol`s, or `Int`s referring to column names or indices, and values in the `AbstractDict` being `Bool`, `Real`, or `Tuple{Float64, Int}` to again signal how specific columns should be pooled
+* A function of the form `(i, nm) -> Union{Bool, Real, Tuple{Float64, Int}}` where it takes the column index and name as two arguments, and returns one of the first 3 possible pool values from the above list.
 
 For the implementation of pooling:
 * We normalize however the keyword argument was provided to have a `pool` value per column while parsing
 * We also have a `pool` field on the `Context` structure in case columns are widened while parsing, they will take on this value
-* For multithreaded parsing, we decide if a column will be pooled or not from the type sampling stage; if a column has a `pool` value of `1.0`, it will _always_ be pooled (as requested), if it has `0.0` it will _not_ be pooled, if `0.0 < pool < 1.0` then we'll calculate whether or not it will be pooled from the sampled values. As noted aboved, a `pool` value of `NaN` will also be considered if a column had `String` values sampled and meets the default threshold. Currently, we'll sample `rows_to_check * ntasks` values per column, which are both configurable via keyword arguments, with defaults `rows_to_check=30` and `ntasks=Threads.nthreads()`. From those samples, we'll calculate the # of unique values divided by the total # of values sampled and compare it with the `pool` value to determine whether the column will be ultimately pooled. The ramification here is that we may "get it wrong" in two different ways: 1) based on the values sampled, we may determine that a column _will_ be pooled even though the total # of uniques values we'll parse will be over the `pool` threshold and 2) we may determine a column _shouldn't_ be pooled because of seemingly high cardinality, even though the total # of unique values for the column is ultimately _lower_ than the `pool` threshold. The only way to do things perfectly is to check _all_ the values of the entire column, but in multithreaded parsing, that would be expensive relative to the simpler sampling method. The other unique piece of multithreaded parsing is that we form a column's initial `refpool` field from the sampled values, which individual task-parsing columns will then start with when parsing their local chunks. Two tricky implementation details involved with sampling and pooling are 1) if a column's values end up being promoted to a different type _while_ sampling or post-sampling while parsing, and 2) if the whole columnset parsed is widened. For the first case, we take all sampled values and promote to the "widest" type, then when building a potential refpool, only consider values that "are" (i.e. `val isa type`) of the promoted type. That may seem like the obvious course of action, but consider that we may detect a value like `2021-01-01` as a `Date` object, but the column type is promoted to `String`; in that case, the `Date(2021, 1, 1)` object parsed _will not_ be in the initial refpool, since `!(Date(2021, 1, 1) isa String)`. For the 2nd tricky case, the columnset isn't widened while type sampling, so "extra" columns are just ignored. The columnset _will_ be widened by each local parsing task that detects the extra columns, and those extra columns will be synchronized/promoted post-parsing as needed. These "widened" columns will only ever be pooled if the user passed `pool=true`, meaning _every_ column for the whole file should be pooled.
-* In the single-threaded case, we take a simpler approach by pooling any column with `pool` value `0.0 < pool <= 1.0`, meaning even if we're not totally sure the column will be pooled, we'll pool it while parsing, then decide post-parsing whether the column should actually be pooled or unpooled.
-* One of the effects of these approaches to pooling in single vs. multithreading code is that we'll never change _whether a column is pooled_ while parsing; it's either decided post-sampling (multithreaded) or post-parsing (single-threaded).
-* Couple other notes on things we have to account for in the multithreading case that makes things a little more complicated. We have to synchronize ref values between local parsing tasks. This is because each task is encountering unique values in different orders, so different tasks may have the same values, but mapped to different ref values. We also have to account for the fact that different parsing tasks may also have promoted to different types, so we may need to promote the pooled values.
+* Once column parsing is done, the cardinality is checked against the individual column pool value and whether the column should be pooled or not is computed.
```
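For the first implementation bullet ("we normalize however the keyword argument was provided"), here is a hypothetical sketch of such a normalization; `getpool` and its dict/function fallbacks are illustrative, not the package's actual internals:

```julia
# Illustrative only: normalize a user-provided `pool` argument to a
# per-column Float64 or (Float64, Int) value.
function getpool(pool, i::Int, name::Symbol)
    pool === true && return 1.0
    pool === false && return 0.0
    pool isa Real && return Float64(pool)
    pool isa Tuple{Float64, Int} && return pool
    pool isa AbstractVector && return getpool(pool[i], i, name)
    # dicts may key columns by name or index; unspecified columns aren't pooled
    pool isa AbstractDict && return getpool(get(pool, name, get(pool, i, false)), i, name)
    # function form may return `nothing`, meaning "use no pooling"
    pool isa Base.Callable && return getpool(something(pool(i, name), false), i, name)
    throw(ArgumentError("unsupported pool argument: $pool"))
end
```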

src/chunks.jl

Lines changed: 1 addition & 1 deletion

```diff
@@ -64,7 +64,7 @@ function Chunks(source::ValidSources;
     type=nothing,
     types=nothing,
     typemap::Dict=Dict{Type, Type}(),
-    pool::Union{Bool, Real, AbstractVector, AbstractDict}=DEFAULT_POOL,
+    pool::Union{Bool, Real, AbstractVector, AbstractDict, Base.Callable, Tuple}=DEFAULT_POOL,
     downcast::Bool=false,
     lazystrings::Bool=false,
     stringtype::StringTypes=DEFAULT_STRINGTYPE,
```

src/context.jl

Lines changed: 4 additions & 26 deletions

```diff
@@ -1,19 +1,3 @@
-# a RefPool holds our refs as a Dict, along with a lastref field which is incremented when a new ref is found while parsing pooled columns
-mutable struct RefPool
-    # what? why ::Any here? well, we want flexibility in what kind of refs we stick in here
-    # it might be Dict{Union{String, Missing}, UInt32}, but it might be some other string type
-    # or it might not allow `missing`; in short, there are too many options to try and type
-    # the field concretely; luckily, working with the `refs` field here is limited to
-    # a very few specific methods, and we'll always have the column type, so we just need
-    # to make sure we assert the concrete type before using refs
-    refs::Any
-    lastref::UInt32
-end
-
-# start lastref at 1, since it's reserved for `missing`, so first ref value will be 2
-const Refs{T} = Dict{Union{T, Missing}, UInt32}
-RefPool(::Type{T}=String) where {T} = RefPool(Refs{T}(), 1)
-
 """
 Internal structure used to track information for a single column in a delimited file.
 
@@ -25,7 +9,6 @@ Fields:
 * `pool`: computed from `pool` keyword argument; `true` is `1.0`, `false` is `0.0`, everything else is `Float64(pool)`; once computed, this field isn't mutated at all while parsing; it's used in type detection to determine whether a column will be pooled or not once a type is detected;
 * `columnspecificpool`: if `pool` was provided via Vector or Dict by user, then `true`, other `false`; if `false`, then only string column types will attempt pooling
 * `column`: the actual column vector to hold parsed values; field is typed as `AbstractVector` and while parsing, we do switches on `col.type` to assert the column type to make code concretely typed
-* `refpool`: if the column is pooled (or might be pooled in single-threaded case), this is the column-specific `RefPool` used to track unique parsed values and their `UInt32` ref codes
 * `lock`: in multithreaded parsing, we have a top-level set of `Vector{Column}`, then each threaded parsing task makes its own copy to parse its own chunk; when synchronizing column types/pooled refs, the task-local `Column` will `lock(col.lock)` to make changes to the parent `Column`; each task-local `Column` shares the same `lock` of the top-level `Column`
 * `position`: for transposed reading, the current column position
 * `endposition`: for transposed reading, the expected ending position for this column
@@ -36,18 +19,17 @@ mutable struct Column
     anymissing::Bool
     userprovidedtype::Bool
     willdrop::Bool
-    pool::Float64
+    pool::Union{Float64, Tuple{Float64, Int}}
     columnspecificpool::Bool
     # lazily/manually initialized fields
     column::AbstractVector
-    refpool::RefPool
     # per top-level column fields (don't need to copy per task when parsing)
     lock::ReentrantLock
     position::Int
     endposition::Int
     options::Parsers.Options
 
-    Column(type::Type, anymissing::Bool, userprovidedtype::Bool, willdrop::Bool, pool::Float64, columnspecificpool::Bool) =
+    Column(type::Type, anymissing::Bool, userprovidedtype::Bool, willdrop::Bool, pool::Union{Float64, Tuple{Float64, Int}}, columnspecificpool::Bool) =
         new(type, anymissing, userprovidedtype, willdrop, pool, columnspecificpool)
 end
 
@@ -71,10 +53,6 @@ function Column(x::Column)
     if isdefined(x, :options)
         y.options = x.options
     end
-    if isdefined(x, :refpool)
-        # if parent has refpool from sampling, make a copy
-        y.refpool = RefPool(copy(x.refpool.refs), x.refpool.lastref)
-    end
     # specifically _don't_ copy/re-use x.column; that needs to be allocated fresh per parsing task
     return y
 end
@@ -126,7 +104,7 @@ struct Context
     datarow::Int
     options::Parsers.Options
     columns::Vector{Column}
-    pool::Float64
+    pool::Union{Float64, Tuple{Float64, Int}}
     downcast::Bool
     customtypes::Type
     typemap::Dict{Type, Type}
@@ -239,7 +217,7 @@ end
     type::Union{Nothing, Type},
     types::Union{Nothing, Type, AbstractVector, AbstractDict, Function},
    typemap::Dict,
-    pool::Union{Bool, Real, AbstractVector, AbstractDict, Base.Callable},
+    pool::Union{Bool, Real, AbstractVector, AbstractDict, Base.Callable, Tuple},
    downcast::Bool,
    lazystrings::Bool,
    stringtype::StringTypes,
```
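With `RefPool` removed, ref-encoding can happen entirely after parsing by handing the finished vector to PooledArrays (already a CSV.jl dependency); a minimal sketch of that conversion for a column that passed the pooling check:

```julia
using PooledArrays

col = ["AT", "GC", "AT", "AT", "GC"]  # a fully parsed column vector
pooled = PooledArray(col, UInt32)     # builds the UInt32 ref codes and pool in one pass
```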
