
Commit ea78bbb

Allow Regex keys for selecting columns in types/dateformat/pool keyword (#1016)
* Support `types` being `Dict{Regex}`
* Validate `types` getting `Regex`
* Support `types` being `Dict{Any}` with `Regex` key(s)
* Test `dateformat` now also understands Regex
* Fix typo in test
* Add test for when `Regex` and exact name both match
* Document that Regex can be used to identify columns by name
1 parent b4360cc commit ea78bbb

5 files changed: +60 -6 lines changed


docs/src/examples.md

Lines changed: 3 additions & 0 deletions
@@ -616,6 +616,9 @@ col1,col2,col3,col4,col5,col6,col7
"""
file = CSV.File(IOBuffer(data); types=(i, name) -> i == 1 ? Bool : Int8)
file = CSV.File(IOBuffer(data); types=(i, name) -> name == :col1 ? Bool : Int8)
+ # Alternatively, provide the exact name for the first column and a Regex to match the rest.
+ # Note that an exact column name always takes precedence over a regular expression.
+ file = CSV.File(IOBuffer(data); types=Dict(:col1 => Bool, r"^col\d" => Int8))
```

## [Typemap](@id typemap_example)
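
For reference, a minimal self-contained sketch of the new keyword usage, mirroring the tests added by this commit (column names and data are illustrative):

```julia
using CSV

# The Regex key types every column whose name ends in "_col" as Int16; the
# exact Symbol key :a_col takes precedence over the Regex for that one column.
data = "a_col,b_col,c\n1,2,3.14\n4,5,6.5\n"
f = CSV.File(IOBuffer(data); types=Dict(r"_col$" => Int16, :a_col => Int8))
eltype(f.a_col)  # Int8  -- exact name beats the Regex
eltype(f.b_col)  # Int16 -- matched only by the Regex
eltype(f.c)      # auto-detected (Float64 here)
```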

docs/src/reading.md

Lines changed: 7 additions & 3 deletions
@@ -150,7 +150,7 @@ An ASCII `Char` argument that parsing uses when parsing quoted cells and the `qu

## [`dateformat`](@id dateformat)

- A `String` or `AbstractDict` argument that controls how parsing detects datetime values in the data input. As a single `String` (or `DateFormat`) argument, the same format will be applied to _all_ columns in the file. For columns without type information provided otherwise, parsing will use the provided format string to check if the cell is parseable and if so, will attempt to parse the entire column as the datetime type (`Time`, `Date`, or `DateTime`). By default, if no `dateformat` argument is explicitly provided, parsing will try to detect any of `Time`, `Date`, or `DateTime` types following the standard `Dates.ISOTimeFormat`, `Dates.ISODateFormat`, or `Dates.ISODateTimeFormat` formats, respectively. If a datetime type is provided for a column, (see the [types](@ref types) argument), then the `dateformat` format string needs to match the format of values in that column, otherwise, a warning will be emitted and the value will be replaced with a `missing` value (this behavior is also configurable via the [strict](@ref) and [silencewarnings](@ref strict) arguments). If an `AbstractDict` is provided, different `dateformat` strings can be provided for specific columns; the provided dict can map either an `Integer` for column number, or a `String` or `Symbol` for column name to the dateformat string that should be used for that column. Columns not mapped in the dict argument will use the default format strings mentioned above.
+ A `String` or `AbstractDict` argument that controls how parsing detects datetime values in the data input. As a single `String` (or `DateFormat`) argument, the same format will be applied to _all_ columns in the file. For columns without type information provided otherwise, parsing will use the provided format string to check if the cell is parseable and if so, will attempt to parse the entire column as the datetime type (`Time`, `Date`, or `DateTime`). By default, if no `dateformat` argument is explicitly provided, parsing will try to detect any of `Time`, `Date`, or `DateTime` types following the standard `Dates.ISOTimeFormat`, `Dates.ISODateFormat`, or `Dates.ISODateTimeFormat` formats, respectively. If a datetime type is provided for a column (see the [types](@ref types) argument), then the `dateformat` format string needs to match the format of values in that column, otherwise, a warning will be emitted and the value will be replaced with a `missing` value (this behavior is also configurable via the [strict](@ref) and [silencewarnings](@ref strict) arguments). If an `AbstractDict` is provided, different `dateformat` strings can be provided for specific columns; the provided dict can map an `Integer` for column number, or a `String`, `Symbol`, or `Regex` for column name, to the dateformat string that should be used for that column. Columns not mapped in the dict argument will use the default format strings mentioned above.

### Examples
* [DateFormat](@ref dateformat_example)
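
A minimal sketch of the per-column `dateformat` dict with a `Regex` key, mirroring the test added by this commit (data is illustrative):

```julia
using CSV, Dates

# Both date1 and date2 match r"^date" and are parsed with the mm/dd/yyyy
# format; the time column falls back to the default ISO format detection.
csv = "time,date1,date2\n10:00:00.0,04/16/2020,04/17/2022\n"
f = CSV.File(IOBuffer(csv); dateformat=Dict(r"^date" => "mm/dd/yyyy"))
f[1].date1 == Date(2020, 4, 16)  # true
f[1].date2 == Date(2022, 4, 17)  # true
```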
@@ -171,9 +171,13 @@ These arguments can be provided as `Vector{String}` to specify custom values tha

## [`types`](@id types)

- Argument to control the types of columns that get parsed in the data input. Can be provided as a single `Type`, an `AbstractVector` of types, an `AbstractDict`, or a function. If a single type is provided, like `types=Float64`, then _all_ columns in the data input will be parsed as `Float64`. If a column's value isn't a valid `Float64` value, then a warning will be emitted, unless `silencewarnings=false` is passed, then no warning will be printed. However, if `strict=true` is passed, then an error will be thrown instead, regarldess of the `silencewarnings` argument. The `types` argument can also be provided as an `AbstractVector{Type}`, wherein the length of the vector should match the number of columns in the data input, and each element gives the type of the corresponding column in order. If provided as an `AbstractDict`, then specific columns can have their column type specified, with the key of the dict being an `Integer` for column number, or `String` or `Symbol` for column name, and the dict value being the column type. Unspecified columns will have their column type auto-detected while parsing. A function can also be provided, and should be of the form `(i, name) -> Union{T, Nothing}`, and will be applied to each detected column during initial parsing. Returning `nothing` from the function will result in the column's type being automatically detected during parsing.
+ Argument to control the types of columns that get parsed in the data input. Can be provided as a single `Type`, an `AbstractVector` of types, an `AbstractDict`, or a function.
+ - If a single type is provided, like `types=Float64`, then _all_ columns in the data input will be parsed as `Float64`. If a column's value isn't a valid `Float64` value, then a warning will be emitted, unless `silencewarnings=true` is passed, in which case no warning will be printed. However, if `strict=true` is passed, then an error will be thrown instead, regardless of the `silencewarnings` argument.
+ - If an `AbstractVector{Type}` is provided, then the length of the vector should match the number of columns in the data input, and each element gives the type of the corresponding column in order.
+ - If an `AbstractDict` is provided, then specific columns can have their column type specified, with the key of the dict being an `Integer` for column number, a `String` or `Symbol` for column name, or a `Regex` matching column names, and the dict value being the column type. Unspecified columns will have their column type auto-detected while parsing.
+ - If a function is provided, it should be of the form `(i, name) -> Union{T, Nothing}`, and will be applied to each detected column during initial parsing. Returning `nothing` from the function will result in the column's type being automatically detected during parsing.

- By default, `types=nothing`, which means all column types in the data input will be detected while parsing. Note that it isn't necessary to pass `types=Union{Float64, Missing}` if the data input contains `missing` values. Parsing will detect `missing` values if present, and promote any manually provided column types from the singular (`Float64`) to the missing equivalent (`Union{Float64, Missing}`) automatically. Standard types will be auto-detected in the following order when not otherwise specified: `Int64`, `Float64`, `Date`, `DateTime`, `Time`, `Bool`, `String`.
+ By default, `types=nothing`, which means all column types in the data input will be detected while parsing. Note that it isn't necessary to pass `types=Union{Float64, Missing}` if the data input contains `missing` values. Parsing will detect `missing` values if present, and promote any manually provided column types from the singular (`Float64`) to the missing equivalent (`Union{Float64, Missing}`) automatically. Standard types will be auto-detected in the following order when not otherwise specified: `Int64`, `Float64`, `Date`, `DateTime`, `Time`, `Bool`, `String`.

Non-standard types can be provided, like `Dec64` from the DecFP.jl package, but must support the `Base.tryparse(T, str)` function for parsing a value from a string. This allows, for example, easily defining a custom type, like `struct Float64Array; values::Vector{Float64}; end`, as long as a corresponding `Base.tryparse` definition is defined, like `Base.tryparse(::Type{Float64Array}, str) = Float64Array(map(x -> parse(Float64, x), split(str, ';')))`, where a single cell in the data input is like `1.23;4.56;7.89`.
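
A runnable sketch of the non-standard type hook described in the paragraph above (the struct and `Base.tryparse` method come from that paragraph; the column names, data, and explicit `delim=','` are illustrative, and exact behavior may vary by CSV.jl version):

```julia
using CSV

# A cell like "1.23;4.56;7.89" is parsed into a custom Float64Array via the
# Base.tryparse definition; delim=',' is passed explicitly so the ';' inside
# the cell isn't mistaken for the delimiter.
struct Float64Array
    values::Vector{Float64}
end
Base.tryparse(::Type{Float64Array}, str) =
    Float64Array(map(x -> parse(Float64, x), split(str, ';')))

f = CSV.File(IOBuffer("a,b\n1,1.23;4.56;7.89\n"); delim=',', types=Dict(:b => Float64Array))
f.b[1].values  # [1.23, 4.56, 7.89]
```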

src/context.jl

Lines changed: 3 additions & 2 deletions
@@ -6,7 +6,7 @@ Fields:
* `anymissing`: whether any missing values have been encountered while parsing; if a user provided a type like `Union{Int, Missing}`, we'll set this to `true`, or when `missing` values are encountered while parsing
* `userprovidedtype`: whether the column type was provided by the user or not; this affects whether we'll promote a column's type while parsing, or emit a warning/error depending on `strict` keyword arg
* `willdrop`: whether we'll drop this column from the final columnset; computed from select/drop keyword arguments; this will result in a column type of `HardMissing` while parsing, where an efficient parser is used to "skip" a field w/o allocating any parsed value
- * `pool`: computed from `pool` keyword argument; `true` is `1.0`, `false` is `0.0`, everything else is `Float64(pool)`; once computed, this field isn't mutated at all while parsing; it's used in type detection to determine whether a column will be pooled or not once a type is detected;
+ * `pool`: computed from `pool` keyword argument; `true` is `1.0`, `false` is `0.0`, everything else is `Float64(pool)`; once computed, this field isn't mutated at all while parsing; it's used in type detection to determine whether a column will be pooled or not once a type is detected;
* `columnspecificpool`: if `pool` was provided via Vector or Dict by user, then `true`, other `false`; if `false`, then only string column types will attempt pooling
* `column`: the actual column vector to hold parsed values; field is typed as `AbstractVector` and while parsing, we do switches on `col.type` to assert the column type to make code concretely typed
* `lock`: in multithreaded parsing, we have a top-level set of `Vector{Column}`, then each threaded parsing task makes its own copy to parse its own chunk; when synchronizing column types/pooled refs, the task-local `Column` will `lock(col.lock)` to make changes to the parent `Column`; each task-local `Column` shares the same `lock` of the top-level `Column`
@@ -84,7 +84,8 @@ function checkinvalidcolumns(dict, argname, ncols, names)
if k isa Integer
    (0 < k <= ncols) || throw(ArgumentError("invalid column number provided in `$argname` keyword argument: $k. Column number must be 0 < i <= $ncols as detected in the data. To ignore invalid columns numbers in `$argname`, pass `validate=false`"))
else
-     Symbol(k) in names || throw(ArgumentError("invalid column name provided in `$argname` keyword argument: $k. Valid column names detected in the data are: $names. To ignore invalid columns names in `$argname`, pass `validate=false`"))
+     isvalid = (k isa Regex && any(nm -> contains(string(nm), k), names)) || Symbol(k) in names
+     isvalid || throw(ArgumentError("invalid column name provided in `$argname` keyword argument: $k. Valid column names detected in the data are: $names. To ignore invalid columns names in `$argname`, pass `validate=false`"))
end
end
return
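
The effect of the new validation branch, sketched against a `Regex` key that matches nothing (illustrative data; the `validate=false` escape hatch is the one named in the error message above):

```julia
using CSV, Test

# A Regex key that matches no detected column name fails validation with an
# ArgumentError; passing validate=false skips this check.
data = "a_col,b_col\n1,2\n"
@test_throws ArgumentError CSV.File(IOBuffer(data); types=Dict(r"_column$" => Int16))
CSV.File(IOBuffer(data); types=Dict(r"_column$" => Int16), validate=false)  # no error
```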

src/utils.jl

Lines changed: 26 additions & 1 deletion
@@ -364,7 +364,32 @@ end
getordefault(x::AbstractDict{String}, nm, i, def) = haskey(x, string(nm)) ? x[string(nm)] : def
getordefault(x::AbstractDict{Symbol}, nm, i, def) = haskey(x, nm) ? x[nm] : def
getordefault(x::AbstractDict{Int}, nm, i, def) = haskey(x, i) ? x[i] : def
- getordefault(x::AbstractDict, nm, i, def) = haskey(x, i) ? x[i] : haskey(x, nm) ? x[nm] : haskey(x, string(nm)) ? x[string(nm)] : def
+ function getordefault(x::AbstractDict{Regex}, nm, i, def)
+     for (re, T) in x
+         contains(string(nm), re) && return T
+     end
+     return def
+ end
+ function getordefault(x::AbstractDict, nm, i, def)
+     return if haskey(x, i)
+         x[i]
+     elseif haskey(x, nm)
+         x[nm]
+     elseif haskey(x, string(nm))
+         x[string(nm)]
+     else
+         val = _firstmatch(x, string(nm))
+         val !== nothing ? val : def
+     end
+ end
+
+ # return the first value in `x` with a `key::Regex` that matches on `nm`
+ function _firstmatch(x::AbstractDict, nm::AbstractString)
+     for (k, T) in x
+         k isa Regex && contains(nm, k) && return T
+     end
+     return nothing
+ end

# given a DateFormat, is it meant for parsing Date, DateTime, or Time?
function timetype(df::Parsers.Format)::Union{Type{Date}, Type{Time}, Type{DateTime}}
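
A quick sketch of the lookup order the helpers above implement; `getordefault` is an internal helper, not public API, and the call signature is taken from this diff:

```julia
using CSV

# Exact Integer/Symbol/String keys win, then the first Regex key whose pattern
# matches the column name, then the supplied default.
d = Dict{Any, Any}(r"_col$" => Int16, :a_col => Int8)
CSV.getordefault(d, :a_col, 1, nothing)  # Int8    (exact Symbol key wins)
CSV.getordefault(d, :b_col, 2, nothing)  # Int16   (Regex key matches "b_col")
CSV.getordefault(d, :c, 3, nothing)      # nothing (falls back to the default)
```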

test/basics.jl

Lines changed: 21 additions & 0 deletions
@@ -777,4 +777,25 @@ f = CSV.File(IOBuffer(join((rand(("a,$(rand())", "b,$(rand())")) for _ = 1:10^6)
f = CSV.File(IOBuffer("a\nfalse\n"))
@test eltype(f.a) == Bool

+ # 1014
+ # types is Dict{Regex}
+ data = IOBuffer("a_col,b_col,c,d\n1,2,3.14,hey\n4,2,6.5,hey\n")
+ f = CSV.File(data; types=Dict(r"_col$" => Int16))
+ @test eltype(f.a_col) == Int16
+ @test eltype(f.b_col) == Int16
+ @test_throws ArgumentError CSV.File(data; types=Dict(r"_column$" => Int16))
+ # types is Dict{Any} including `Regex` key
+ f = CSV.File(data; types=Dict(r"_col$" => Int16, "c" => Float16))
+ @test eltype(f.a_col) == Int16
+ @test eltype(f.b_col) == Int16
+ @test eltype(f.c) == Float16
+ # Regex has lower precedence than exact column name/number match
+ f = CSV.File(data; types=Dict(r"_col$" => Int16, :a_col => Int8))
+ @test eltype(f.a_col) == Int8
+ @test eltype(f.b_col) == Int16
+ # dateformat supports Regex
+ f = CSV.File(IOBuffer("time,date1,date2\n10:00:00.0,04/16/2020,04/17/2022\n"); dateformat=Dict(r"^date"=>"mm/dd/yyyy"))
+ @test f[1].date1 == Dates.Date(2020, 4, 16)
+ @test f[1].date2 == Dates.Date(2022, 4, 17)
+
end
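
The commit title also mentions the `pool` keyword. No pool test appears in this diff, so the following is only a sketch assuming the same Regex-key pattern applies there (column names and data are illustrative):

```julia
using CSV

# Pool every column whose name ends in "_id"; other columns use the default
# pooling heuristics.
data = "a_id,b_id,val\nx,y,1.0\nx,z,2.0\n"
f = CSV.File(IOBuffer(data); pool=Dict(r"_id$" => true))
```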
