|
| 1 | +# Metadata |
| 2 | + |
| 3 | +## Design of metadata support |
| 4 | + |
| 5 | +DataFrames.jl allows you to store and retrieve metadata at the table and column
| 6 | +level. This is supported using the functions defined by the DataAPI.jl interface:
| 7 | + |
| 8 | +* for table-level metadata: [`metadata`](@ref), [`metadatakeys`](@ref), |
| 9 | + [`metadata!`](@ref), [`deletemetadata!`](@ref), [`emptymetadata!`](@ref); |
| 10 | +* for column-level metadata: [`colmetadata`](@ref), [`colmetadatakeys`](@ref),
| 11 | + [`colmetadata!`](@ref), [`deletecolmetadata!`](@ref), [`emptycolmetadata!`](@ref). |
| 12 | + |
| 13 | +Assume that we work with a data frame-like object `df` that has a column `col` |
| 14 | +(referred to either via a `Symbol`, a string or an integer index). |
| 15 | + |
| 16 | +Table-level metadata are key-value pairs that are attached to `df`. |
| 17 | +Column-level metadata are key-value pairs that are attached to |
| 18 | +a specific column `col` of the `df` data frame.
| 19 | + |
| 20 | +Additionally, each metadata key-value pair has style information attached to
| 21 | +it.
| 22 | +In DataFrames.jl the metadata style influences how metadata is propagated when |
| 23 | +`df` is transformed. The following metadata styles are supported: |
| 24 | + |
| 25 | +* `:default`: Metadata having this style is considered to be attached to a concrete |
| 26 | + state of `df`. This means that any operation on this data frame |
| 27 | + invalidates such metadata and it is dropped in the result of such operation. |
| 28 | + Note that this happens even if the operation eventually does not change |
| 29 | + the data frame: the rule is that calling a function that might alter a data |
| 30 | + frame drops such metadata; in this way it is possible to statically determine |
| 31 | + whether metadata of styles other than `:note` is dropped after a function call. |
| 32 | + Only two functions are exceptions that keep non-`:note`-style metadata, as these |
| 33 | + operations are specifically designed to create an identical copy of the source |
| 34 | + data frame: |
| 35 | + - [`DataFrame`](@ref) constructor; |
| 36 | + - [`copy`](@ref) of a data frame; |
| 37 | +* `:note`: Metadata having this style is considered to be an annotation of |
| 38 | + a table or a column that should be propagated under transformations |
| 39 | + (exact propagation rules of such metadata are described below). |
| 40 | +* All other metadata styles are allowed but they are currently treated as having |
| 41 | + `:default`-style (this might change in the future if other standard metadata |
| 42 | + styles are defined). |
| 43 | + |
| 44 | +All DataAPI.jl metadata functions work with [`DataFrame`](@ref), |
| 45 | +[`SubDataFrame`](@ref), [`DataFrameRow`](@ref) |
| 46 | +objects, and objects returned by [`eachrow`](@ref) and [`eachcol`](@ref) |
| 47 | +functions. In this section these objects are collectively called
| 48 | +*data frame-like*, and they follow these rules:
| 49 | + |
| 50 | +* objects returned by |
| 51 | + [`eachrow`](@ref) and [`eachcol`](@ref) functions have the same metadata |
| 52 | + as their parent `AbstractDataFrame`; |
| 53 | +* [`SubDataFrame`](@ref) and [`DataFrameRow`](@ref) only expose metadata from |
| 54 | + their parent `DataFrame` whose style is `:note`. |
| 55 | + |
| 56 | +Notably metadata is not supported for [`GroupedDataFrame`](@ref) as it does not |
| 57 | +expose columns directly. You can inspect metadata of the `parent` of a |
| 58 | +[`GroupedDataFrame`](@ref) or of any of its groups. |
| 59 | + |
| 60 | +!!! note |
| 61 | + |
| 62 | + DataFrames.jl allows users to extract out columns of a data frame |
| 63 | + and perform operations on them. Such operations will not affect |
| 64 | + metadata. Therefore, even if some metadata has `:default` style it might |
| 65 | + no longer correctly describe the column's contents if the user mutates |
| 66 | + columns directly. |
| 67 | + |
| 68 | +### DataFrames.jl-specific design principles for use of metadata |
| 69 | + |
| 70 | +DataFrames.jl supports storing any object as metadata values. However, |
| 71 | +it is recommended to use strings as values of the metadata, |
| 72 | +as some storage formats, like for example Apache Arrow, only support |
| 73 | +strings. |
| 74 | + |
| 75 | +For all functions that operate on column-level metadata, an `ArgumentError` is |
| 76 | +thrown if the passed column is not present in the data frame.
| 77 | + |
| 78 | +If [`metadata!`](@ref) or [`colmetadata!`](@ref) is used to add metadata |
| 79 | +to a [`SubDataFrame`](@ref) or a [`DataFrameRow`](@ref) then: |
| 80 | + |
| 81 | +* using metadata that has style other than `:note` throws an error; |
| 82 | +* trying to add a key-value pair for which a mapping for the key already exists
| 83 | + with style other than `:note` in the parent data frame throws an error. |
| 84 | + |
| 85 | +DataFrames.jl is designed so that there is no performance overhead due to metadata support |
| 86 | +when there is no metadata in a data frame. Therefore, if you need maximum performance
| 87 | +of operations that do not rely on metadata, call `emptymetadata!` and
| 88 | +`emptycolmetadata!` before running these operations. |
| 89 | + |
| 90 | +Processing metadata for `SubDataFrame` and `DataFrameRow` has more overhead |
| 91 | +than for other types defined in DataFrames.jl that support metadata, because |
| 92 | +they have a more complex logic of handling it (they support only `:note`-style |
| 93 | +metadata, which means that other metadata needs to be filtered out).
| 94 | + |
| 95 | +## Examples |
| 96 | + |
| 97 | +Here is a simple example of how you can work with metadata in DataFrames.jl:
| 98 | + |
| 99 | +```jldoctest dataframe |
| 100 | +julia> using DataFrames |
| 101 | +
|
| 102 | +julia> df = DataFrame(name=["Jan Krzysztof Duda", "Jan Krzysztof Duda", |
| 103 | + "Radosław Wojtaszek", "Radosław Wojtaszek"], |
| 104 | + date=["2022-Jun", "2021-Jun", "2022-Jun", "2021-Jun"], |
| 105 | + rating=[2750, 2729, 2708, 2687]) |
| 106 | +4×3 DataFrame |
| 107 | + Row │ name date rating |
| 108 | + │ String String Int64 |
| 109 | +─────┼────────────────────────────────────── |
| 110 | + 1 │ Jan Krzysztof Duda 2022-Jun 2750 |
| 111 | + 2 │ Jan Krzysztof Duda 2021-Jun 2729 |
| 112 | + 3 │ Radosław Wojtaszek 2022-Jun 2708 |
| 113 | + 4 │ Radosław Wojtaszek 2021-Jun 2687 |
| 114 | +
|
| 115 | +julia> metadatakeys(df) |
| 116 | +() |
| 117 | +
|
| 118 | +julia> metadata!(df, "caption", "ELO ratings of chess players", style=:note); |
| 119 | +
|
| 120 | +julia> collect(metadatakeys(df)) |
| 121 | +1-element Vector{String}: |
| 122 | + "caption" |
| 123 | +
|
| 124 | +julia> metadata(df, "caption") |
| 125 | +"ELO ratings of chess players" |
| 126 | +
|
| 127 | +julia> metadata(df, "caption", style=true) |
| 128 | +("ELO ratings of chess players", :note) |
| 129 | +
|
| 130 | +julia> emptymetadata!(df); |
| 131 | +
|
| 132 | +julia> metadatakeys(df) |
| 133 | +() |
| 134 | +
|
| 135 | +julia> colmetadatakeys(df) |
| 136 | +() |
| 137 | +
|
| 138 | +julia> colmetadata!(df, :name, "label", "First and last name of a player", style=:note); |
| 139 | +
|
| 140 | +julia> colmetadata!(df, :date, "label", "Rating date in yyyy-u format", style=:note); |
| 141 | +
|
| 142 | +julia> colmetadata!(df, :rating, "label", "ELO rating in classical time control", style=:note); |
| 143 | +
|
| 144 | +julia> colmetadata(df, :rating, "label") |
| 145 | +"ELO rating in classical time control" |
| 146 | +
|
| 147 | +julia> colmetadata(df, :rating, "label", style=true) |
| 148 | +("ELO rating in classical time control", :note) |
| 149 | +
|
| 150 | +julia> collect(colmetadatakeys(df)) |
| 151 | +3-element Vector{Pair{Symbol, Base.KeySet{String, Dict{String, Tuple{Any, Any}}}}}: |
| 152 | + :date => ["label"] |
| 153 | + :rating => ["label"] |
| 154 | + :name => ["label"] |
| 155 | +
|
| 156 | +julia> [only(names(df, col)) => |
| 157 | + [key => colmetadata(df, col, key) for key in metakeys] for |
| 158 | + (col, metakeys) in colmetadatakeys(df)] |
| 159 | +3-element Vector{Pair{String, Vector{Pair{String, String}}}}: |
| 160 | + "date" => ["label" => "Rating date in yyyy-u format"] |
| 161 | + "rating" => ["label" => "ELO rating in classical time control"] |
| 162 | + "name" => ["label" => "First and last name of a player"] |
| 163 | +
|
| 164 | +julia> emptycolmetadata!(df); |
| 165 | +
|
| 166 | +julia> colmetadatakeys(df) |
| 167 | +() |
| 168 | +``` |
| 169 | + |
| 170 | +## Propagation of `:note`-style metadata |
| 171 | + |
| 172 | +An important design feature of `:note`-style metadata is how it is handled when
| 173 | +data frames are transformed. |
| 174 | + |
| 175 | +!!! note |
| 176 | + |
| 177 | + The provided rules might slightly change in the future. Any change to |
| 178 | + `:note`-style metadata propagation rules will not be considered as breaking |
| 179 | + and can be done in any minor release of DataFrames.jl. |
| 180 | + Such changes might be made based on users' feedback about what metadata |
| 181 | + propagation rules are most convenient in practice. |
| 182 | + |
| 183 | +The general design rules for propagation of `:note`-style metadata are as follows. |
| 184 | + |
| 185 | +For operations that take a single data frame as an input: |
| 186 | +* Table-level metadata is propagated to the returned data frame object.
| 187 | +* For column-level metadata: |
| 188 | + - in all cases when a single column is transformed to |
| 189 | + a single column and the name of the column does |
| 190 | + not change (or is automatically changed e.g. to de-duplicate column names or |
| 191 | + via column renaming in joins) |
| 192 | + column-level metadata is preserved (example operations of this kind are |
| 193 | + `getindex`, `subset`, joins, `mapcols`). |
| 194 | + - in all cases when a single column is transformed with `identity` or `copy` to a single column, |
| 195 | + column-level metadata is preserved even if column name is changed (example |
| 196 | + operations of this kind are `rename`, or the `:x => :y` or |
| 197 | + `:x => copy => :y` operation specification in `select`). |
| 198 | + |
| 199 | +For operations that take multiple data frames as their input two cases are distinguished: |
| 200 | + |
| 201 | +- When there is a natural main table in the operation (`append!`, `prepend!`, |
| 202 | + `leftjoin`, `leftjoin!`, `rightjoin`, `semijoin`, `antijoin`, `setindex!`): |
| 203 | + - table-level metadata is taken from the main table; |
| 204 | + - column-level metadata for columns from the main table is taken from main table; |
| 205 | + - column-level metadata for columns from the non-main table is taken only for |
| 206 | + columns not present in the main table. |
| 207 | +- When all tables are equivalent (`hcat`, `vcat`, `innerjoin`, `outerjoin`): |
| 208 | + - table-level metadata is preserved only for keys which are defined |
| 209 | + in all passed tables and have the same value; |
| 210 | + - column-level metadata is preserved only for keys which are defined |
| 211 | + in all passed tables that contain this column and have the same value. |
| 212 | +In all these operations when metadata is preserved the values in the key-value |
| 213 | +pairs are not copied (this is relevant in case of mutable values). |
| 214 | + |
| 215 | +!!! note |
| 216 | + |
| 217 | + The rules for column-level `:note`-style metadata propagation are designed |
| 218 | + to make the right decision in common cases. In particular, they assume that if |
| 219 | + source and target column name is the same then the metadata for the column is |
| 220 | + not changed. While this is valid for many operations, it is not always true |
| 221 | + in general. For example the `:x => ByRow(log) => :x` transformation might |
| 222 | + invalidate metadata if it contained unit of measure of the variable. In such |
| 223 | + cases the user must either use a different name for the output column,
| 224 | + set metadata style to `:default` before the operation, |
| 225 | + or manually drop or update such metadata from the `:x` column |
| 226 | + after the transformation. |
| 227 | + |
| 228 | +### Operations that preserve `:note`-style metadata |
| 229 | + |
| 230 | +Most of the functions in DataFrames.jl only preserve table and column metadata |
| 231 | +whose style is `:note`. |
| 232 | +Some functions use a more complex logic, even if they follow the general rules |
| 233 | +described above (in particular under any transformation all non-`:note`-style |
| 234 | +metadata is always dropped). These are: |
| 235 | + |
| 236 | +* [`describe`](@ref) drops all metadata. |
| 237 | +* [`hcat`](@ref): propagates table-level metadata only for keys which are defined |
| 238 | + in all passed tables and have the same value; |
| 239 | + column-level metadata is preserved. |
| 240 | +* [`vcat`](@ref): propagates table-level metadata only for keys which are defined |
| 241 | + in all passed tables and have the same value; |
| 242 | + column-level metadata is preserved only for keys which are defined |
| 243 | + in all passed tables that contain this column and have the same value; |
| 244 | +* [`stack`](@ref): propagates table-level metadata and column-level metadata |
| 245 | + for identifier columns. |
| 246 | +* [`unstack`](@ref): propagates table-level metadata and column-level metadata
| 247 | + for row keys columns. |
| 248 | +* [`permutedims`](@ref): propagates table-level metadata and drops column-level |
| 249 | + metadata. |
| 250 | +* broadcasted assignment does not change target metadata; |
| 251 | + under Julia earlier than 1.7 an operation of the kind `df.a .= s` does not drop non-`:note`-style
| 252 | + metadata; under Julia 1.7 or later this operation preserves only `:note`-style
| 253 | + metadata |
| 254 | +* broadcasting propagates table-level metadata if some key is present |
| 255 | + in all passed data frames and value associated with it is identical in all |
| 256 | + passed data frames; column-level metadata is propagated for columns if some |
| 257 | + key for a given column is present in all passed data frames and value |
| 258 | + associated with it is identical in all passed data frames. |
| 259 | +* `getindex` preserves table-level metadata and column-level metadata |
| 260 | + for selected columns |
| 261 | +* `setindex!` does not affect table-level and column-level metadata |
| 262 | +* [`push!`](@ref), [`pushfirst!`](@ref), [`insert!`](@ref) do not affect |
| 263 | + table-level nor column-level metadata (even if they add new columns and pushed row is |
| 264 | + a `DataFrameRow` or other value supporting metadata interface) |
| 265 | +* [`append!`](@ref) and [`prepend!`](@ref) do not change table and column-level |
| 266 | + metadata of the destination data frame, except that if new columns are added |
| 267 | + and these columns have metadata in the appended/prepended table then this |
| 268 | + metadata is preserved. |
| 269 | +* [`leftjoin!`](@ref), [`leftjoin`](@ref): table and column-level metadata is |
| 270 | + taken from the left table except for non-key columns from right table for which |
| 271 | + metadata is taken from right table; |
| 272 | +* [`rightjoin`](@ref): table and column-level metadata is taken from the right |
| 273 | + table except for non-key columns from left table for which metadata is |
| 274 | + taken from left table; |
| 275 | +* [`innerjoin`](@ref), [`outerjoin`](@ref): propagates table-level metadata only for keys |
| 276 | + that are defined in all passed data frames and have the same value; |
| 277 | + column-level metadata is propagated for all columns except for key |
| 278 | + columns, for which it is propagated only for keys that are defined |
| 279 | + in all passed data frames and have the same value. |
| 280 | +* [`semijoin`](@ref), [`antijoin`](@ref): table and column-level metadata is |
| 281 | + taken from the left table. |
| 282 | +* [`crossjoin`](@ref): propagates table-level metadata only for keys |
| 283 | + that are defined in both passed data frames and have the same value; |
| 284 | + propagates column-level metadata from both passed data frames. |
| 285 | +* [`select`](@ref), [`select!`](@ref), [`transform`](@ref),
| 286 | + [`transform!`](@ref), [`combine`](@ref): propagate table-level metadata;
| 287 | + column-level metadata is propagated if: |
| 288 | + a) a single column is transformed to a single column and the name of the column does not change |
| 289 | + (this includes all column selection operations), or |
| 290 | + b) a single column is transformed with `identity` or `copy` to a single column |
| 291 | + even if column name is changed (this includes column renaming). |
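
The `select` rules above can be illustrated with a minimal sketch. It assumes a DataFrames.jl version that supports metadata; the data frame, the keys (`"caption"`, `"checksum"`, `"label"`), their values, and the `sorted_keys` helper are all made up for illustration and are not part of the package.

```julia
using DataFrames

# Hypothetical example data; names and values are illustrative only.
df = DataFrame(x=[1.0, 10.0, 100.0])

# Table-level metadata: one :note-style and one :default-style entry.
metadata!(df, "caption", "Example measurements", style=:note)
metadata!(df, "checksum", "abc123", style=:default)

# Column-level :note-style metadata on column :x.
colmetadata!(df, :x, "label", "Measured value", style=:note)

# `copy` is one of the two documented exceptions: it keeps metadata of every style.
df_copy = copy(df)

# `select` keeps only :note-style metadata.
same_name = select(df, :x)                        # column selected, name unchanged
renamed   = select(df, :x => :y)                  # pure renaming
logged    = select(df, :x => ByRow(log) => :x)    # transformed, same name
new_col   = select(df, :x => ByRow(log) => :logx) # transformed, new name

# Hypothetical helper: metadata keys as a sorted vector, for easy inspection.
sorted_keys(ks) = sort!(collect(ks))

sorted_keys(metadatakeys(df_copy))   # both table-level keys survive `copy`
sorted_keys(metadatakeys(same_name)) # only the :note-style key survives `select`
colmetadatakeys(renamed, :y)         # the label follows the renamed column
colmetadatakeys(logged, :x)          # the label is kept because the name is unchanged
colmetadatakeys(new_col, :logx)      # no column-level metadata on the new column
```

The `logged` case is the one the note on `:x => ByRow(log) => :x` warns about: the label is still attached even though the column's contents changed, so it may need to be updated or removed by hand.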