You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Allow Regex keys for selecting columns in types/dateformat/pool keyword (#1016)
* Support `types` being `Dict{Regex}
* Validate `types` getting `Regex`
* Support `types` being `Dict{Any}` with `Regex` key(s)
* Test `dateformat` now also understands Regex
* fix typo in test
* Add test for when `Regex` and exact name both match
* Document that Regex can be used to identify columns by name
Copy file name to clipboardExpand all lines: docs/src/reading.md
+7-3Lines changed: 7 additions & 3 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -150,7 +150,7 @@ An ASCII `Char` argument that parsing uses when parsing quoted cells and the `qu
150
150
151
151
## [`dateformat`](@id dateformat)
152
152
153
-
A `String` or `AbstractDict` argument that controls how parsing detects datetime values in the data input. As a single `String` (or `DateFormat`) argument, the same format will be applied to _all_ columns in the file. For columns without type information provided otherwise, parsing will use the provided format string to check if the cell is parseable and if so, will attempt to parse the entire column as the datetime type (`Time`, `Date`, or `DateTime`). By default, if no `dateformat` argument is explicitly provided, parsing will try to detect any of `Time`, `Date`, or `DateTime` types following the standard `Dates.ISOTimeFormat`, `Dates.ISODateFormat`, or `Dates.ISODateTimeFormat` formats, respectively. If a datetime type is provided for a column, (see the [types](@ref types) argument), then the `dateformat` format string needs to match the format of values in that column, otherwise, a warning will be emitted and the value will be replaced with a `missing` value (this behavior is also configurable via the [strict](@ref) and [silencewarnings](@ref strict) arguments). If an `AbstractDict` is provided, different `dateformat` strings can be provided for specific columns; the provided dict can map either an `Integer` for column number, or a `String` or `Symbol` for column name to the dateformat string that should be used for that column. Columns not mapped in the dict argument will use the default format strings mentioned above.
153
+
A `String` or `AbstractDict` argument that controls how parsing detects datetime values in the data input. As a single `String` (or `DateFormat`) argument, the same format will be applied to _all_ columns in the file. For columns without type information provided otherwise, parsing will use the provided format string to check if the cell is parseable and if so, will attempt to parse the entire column as the datetime type (`Time`, `Date`, or `DateTime`). By default, if no `dateformat` argument is explicitly provided, parsing will try to detect any of `Time`, `Date`, or `DateTime` types following the standard `Dates.ISOTimeFormat`, `Dates.ISODateFormat`, or `Dates.ISODateTimeFormat` formats, respectively. If a datetime type is provided for a column, (see the [types](@ref types) argument), then the `dateformat` format string needs to match the format of values in that column, otherwise, a warning will be emitted and the value will be replaced with a `missing` value (this behavior is also configurable via the [strict](@ref) and [silencewarnings](@ref strict) arguments). If an `AbstractDict` is provided, different `dateformat` strings can be provided for specific columns; the provided dict can map either an `Integer` for column number or a `String`, `Symbol` or `Regex` for column name to the dateformat string that should be used for that column. Columns not mapped in the dict argument will use the default format strings mentioned above.
154
154
155
155
### Examples
156
156
*[DateFormat](@ref dateformat_example)
@@ -171,9 +171,13 @@ These arguments can be provided as `Vector{String}` to specify custom values tha
171
171
172
172
## [`types`](@id types)
173
173
174
-
Argument to control the types of columns that get parsed in the data input. Can be provided as a single `Type`, an `AbstractVector` of types, an `AbstractDict`, or a function. If a single type is provided, like `types=Float64`, then _all_ columns in the data input will be parsed as `Float64`. If a column's value isn't a valid `Float64` value, then a warning will be emitted, unless `silencewarnings=false` is passed, then no warning will be printed. However, if `strict=true` is passed, then an error will be thrown instead, regarldess of the `silencewarnings` argument. The `types` argument can also be provided as an `AbstractVector{Type}`, wherein the length of the vector should match the number of columns in the data input, and each element gives the type of the corresponding column in order. If provided as an `AbstractDict`, then specific columns can have their column type specified, with the key of the dict being an `Integer` for column number, or `String` or `Symbol` for column name, and the dict value being the column type. Unspecified columns will have their column type auto-detected while parsing. A function can also be provided, and should be of the form `(i, name) -> Union{T, Nothing}`, and will be applied to each detected column during initial parsing. Returning `nothing` from the function will result in the column's type being automatically detected during parsing.
174
+
Argument to control the types of columns that get parsed in the data input. Can be provided as a single `Type`, an `AbstractVector` of types, an `AbstractDict`, or a function.
175
+
- If a single type is provided, like `types=Float64`, then _all_ columns in the data input will be parsed as `Float64`. If a column's value isn't a valid `Float64` value, then a warning will be emitted, unless `silencewarnings=false` is passed, then no warning will be printed. However, if `strict=true` is passed, then an error will be thrown instead, regarldess of the `silencewarnings` argument.
176
+
- If a `AbstractVector{Type}` is provided, then the length of the vector should match the number of columns in the data input, and each element gives the type of the corresponding column in order.
177
+
- If an `AbstractDict`, then specific columns can have their column type specified with the key of the dict being an `Integer` for column number, or `String` or `Symbol` for column name or `Regex` matching column names, and the dict value being the column type. Unspecified columns will have their column type auto-detected while parsing.
178
+
- If a function, then it should be of the form `(i, name) -> Union{T, Nothing}`, and will be applied to each detected column during initial parsing. Returning `nothing` from the function will result in the column's type being automatically detected during parsing.
175
179
176
-
By default,`types=nothing`, which means all column types in the data input will be detected while parsing. Note that it isn't necessary to pass `types=Union{Float64, Missing}` if the data input contains `missing` values. Parsing will detect `missing` values if present, and promote any manually provided column types from the singular (`Float64`) to the missing equivalent (`Union{Float64, Missing}`) automatically. Standard types will be auto-detected in the following order when not otherwise specified: `Int64`, `Float64`, `Date`, `DateTime`, `Time`, `Bool`, `String`.
180
+
By default `types=nothing`, which means all column types in the data input will be detected while parsing. Note that it isn't necessary to pass `types=Union{Float64, Missing}` if the data input contains `missing` values. Parsing will detect `missing` values if present, and promote any manually provided column types from the singular (`Float65`) to the missing equivalent (`Union{Float64, Missing}`) automatically. Standard types will be auto-detected in the following order when not otherwise specified: `Int64`, `Float64`, `Date`, `DateTime`, `Time`, `Bool`, `String`.
177
181
178
182
Non-standard types can be provided, like `Dec64` from the DecFP.jl package, but must support the `Base.tryparse(T, str)` function for parsing a value from a string. This allows, for example, easily defining a custom type, like `struct Float64Array; values::Vector{Float64}; end`, as long as a corresponding `Base.tryparse` definition is defined, like `Base.tryparse(::Type{Float64Array}, str) = Float64Array(map(x -> parse(Float64, x), split(str, ';')))`, where a single cell in the data input is like `1.23;4.56;7.89`.
Copy file name to clipboardExpand all lines: src/context.jl
+3-2Lines changed: 3 additions & 2 deletions
Original file line number
Diff line number
Diff line change
@@ -6,7 +6,7 @@ Fields:
6
6
* `anymissing`: whether any missing values have been encountered while parsing; if a user provided a type like `Union{Int, Missing}`, we'll set this to `true`, or when `missing` values are encountered while parsing
7
7
* `userprovidedtype`: whether the column type was provided by the user or not; this affects whether we'll promote a column's type while parsing, or emit a warning/error depending on `strict` keyword arg
8
8
* `willdrop`: whether we'll drop this column from the final columnset; computed from select/drop keyword arguments; this will result in a column type of `HardMissing` while parsing, where an efficient parser is used to "skip" a field w/o allocating any parsed value
9
-
* `pool`: computed from `pool` keyword argument; `true` is `1.0`, `false` is `0.0`, everything else is `Float64(pool)`; once computed, this field isn't mutated at all while parsing; it's used in type detection to determine whether a column will be pooled or not once a type is detected;
9
+
* `pool`: computed from `pool` keyword argument; `true` is `1.0`, `false` is `0.0`, everything else is `Float64(pool)`; once computed, this field isn't mutated at all while parsing; it's used in type detection to determine whether a column will be pooled or not once a type is detected;
10
10
* `columnspecificpool`: if `pool` was provided via Vector or Dict by user, then `true`, other `false`; if `false`, then only string column types will attempt pooling
11
11
* `column`: the actual column vector to hold parsed values; field is typed as `AbstractVector` and while parsing, we do switches on `col.type` to assert the column type to make code concretely typed
12
12
* `lock`: in multithreaded parsing, we have a top-level set of `Vector{Column}`, then each threaded parsing task makes its own copy to parse its own chunk; when synchronizing column types/pooled refs, the task-local `Column` will `lock(col.lock)` to make changes to the parent `Column`; each task-local `Column` shares the same `lock` of the top-level `Column`
@@ -84,7 +84,8 @@ function checkinvalidcolumns(dict, argname, ncols, names)
84
84
if k isa Integer
85
85
(0< k <= ncols) ||throw(ArgumentError("invalid column number provided in `$argname` keyword argument: $k. Column number must be 0 < i <= $ncols as detected in the data. To ignore invalid columns numbers in `$argname`, pass `validate=false`"))
86
86
else
87
-
Symbol(k) in names ||throw(ArgumentError("invalid column name provided in `$argname` keyword argument: $k. Valid column names detected in the data are: $names. To ignore invalid columns names in `$argname`, pass `validate=false`"))
87
+
isvalid = (k isa Regex &&any(nm ->contains(string(nm), k), names)) ||Symbol(k) in names
88
+
isvalid ||throw(ArgumentError("invalid column name provided in `$argname` keyword argument: $k. Valid column names detected in the data are: $names. To ignore invalid columns names in `$argname`, pass `validate=false`"))
0 commit comments