Commit b01fd38

Metadata on data frame and column level (#3055)
1 parent 1851e11 commit b01fd38

26 files changed (+5818, -225 lines changed)

NEWS.md

Lines changed: 17 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -44,6 +44,12 @@
44 44
* New `threads` argument allows disabling multithreading in
45 45
`combine`, `select`, `select!`, `transform`, `transform!`, `subset` and `subset!`
46 46
([#3030](https://github.com/JuliaData/DataFrames.jl/pull/3030))
47+
* Add support for table-level and column-level metadata using
48+
DataAPI.jl interface
49+
([#3055](https://github.com/JuliaData/DataFrames.jl/pull/3055))
50+
* `completecases` and `nonunique` no longer throw an error when a data frame
51+
with no columns is passed
52+
([#3055](https://github.com/JuliaData/DataFrames.jl/pull/3055))
47 53
* `describe` now accepts two predefined arguments: `:nnonmissing` and `:nuniqueall`
48 54
([#3146](https://github.com/JuliaData/DataFrames.jl/pull/3146))
49 55

@@ -54,8 +60,19 @@
54 60
or older it is an in place operation.
55 61
([#3022](https://github.com/JuliaData/DataFrames.jl/pull/3022))
56 62

63+
## Internal changes
64+
65+
* `DataFrame` is now a `mutable struct` and has three new fields
66+
`metadata`, `colmetadata`, and `allnotemetadata`;
67+
this change makes `DataFrame` objects serialized under
68+
earlier versions of DataFrames.jl incompatible with version 1.4
69+
([#3055](https://github.com/JuliaData/DataFrames.jl/pull/3055))
70+
57 71
## Bug fixes
58 72

73+
* fix dispatch ambiguity in `rename` and `rename!` when only
74+
source data frame is passed
75+
([#3055](https://github.com/JuliaData/DataFrames.jl/pull/3055))
59 76
* Make sure that `AsTable` accepts only valid argument
60 77
([#3064](https://github.com/JuliaData/DataFrames.jl/pull/3064))
61 78
* Make sure we avoid aliasing when repeating the same column

Project.toml

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ Unicode = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"
26 26
[compat]
27 27
CategoricalArrays = "0.10.0"
28 28
Compat = "3.46, 4.2"
29-
DataAPI = "1.10"
29+
DataAPI = "1.11"
30 30
InvertedIndices = "1"
31 31
IteratorInterfaceExtensions = "0.1.1, 1"
32 32
Missings = "0.4.2, 1"

docs/make.jl

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@ makedocs(
38 38
"Types" => "lib/types.md",
39 39
"Functions" => "lib/functions.md",
40 40
"Indexing" => "lib/indexing.md",
41+
"Metadata" => "lib/metadata.md",
41 42
hide("Internals" => "lib/internals.md"),
42 43
]
43 44
],

docs/src/lib/functions.md

Lines changed: 14 additions & 0 deletions
@@ -189,3 +189,17 @@ pairs
189 189
```@docs
190 190
isapprox
191 191
```
192+
193+
## Metadata
194+
```@docs
195+
metadata
196+
metadatakeys
197+
metadata!
198+
deletemetadata!
199+
emptymetadata!
200+
colmetadata
201+
colmetadatakeys
202+
colmetadata!
203+
deletecolmetadata!
204+
emptycolmetadata!
205+
```

docs/src/lib/metadata.md

Lines changed: 291 additions & 0 deletions
@@ -0,0 +1,291 @@
1+
# Metadata
2+
3+
## Design of metadata support
4+
5+
DataFrames.jl allows you to store and retrieve metadata on table and column
6+
level. This is supported using the functions defined by the DataAPI.jl interface:
7+
8+
* for table-level metadata: [`metadata`](@ref), [`metadatakeys`](@ref),
9+
[`metadata!`](@ref), [`deletemetadata!`](@ref), [`emptymetadata!`](@ref);
10+
* for column-level metadata: [`colmetadata`](@ref), [`colmetadatakeys`](@ref),
11+
[`colmetadata!`](@ref), [`deletecolmetadata!`](@ref), [`emptycolmetadata!`](@ref).
12+
13+
Assume that we work with a data frame-like object `df` that has a column `col`
14+
(referred to either via a `Symbol`, a string or an integer index).
15+
16+
Table-level metadata are key-value pairs that are attached to `df`.
17+
Column-level metadata are key-value pairs that are attached to
18+
a specific column `col` of `df` data frame.
19+
20+
Additionally, each metadata key-value pair has style information attached to
21+
it.
22+
In DataFrames.jl the metadata style influences how metadata is propagated when
23+
`df` is transformed. The following metadata styles are supported:
24+
25+
* `:default`: Metadata having this style is considered to be attached to a concrete
26+
state of `df`. This means that any operation on this data frame
27+
invalidates such metadata and it is dropped in the result of such operation.
28+
Note that this happens even if the operation eventually does not change
29+
the data frame: the rule is that calling a function that might alter a data
30+
frame drops such metadata; in this way it is possible to statically determine
31+
whether metadata of styles other than `:note` is dropped after a function call.
32+
Only two functions are exceptions that keep non-`:note`-style metadata, as these
33+
operations are specifically designed to create an identical copy of the source
34+
data frame:
35+
- [`DataFrame`](@ref) constructor;
36+
- [`copy`](@ref) of a data frame;
37+
* `:note`: Metadata having this style is considered to be an annotation of
38+
a table or a column that should be propagated under transformations
39+
(exact propagation rules of such metadata are described below).
40+
* All other metadata styles are allowed but they are currently treated as having
41+
`:default`-style (this might change in the future if other standard metadata
42+
styles are defined).
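
As a rough illustration of the two styles described above, the sketch below attaches one `:note`-style and one `:default`-style table-level entry and compares `select` with `copy`. The data frame contents and the keys `"caption"` and `"checked"` are made up for this example.

```julia
using DataFrames

df = DataFrame(a=1:3)
metadata!(df, "caption", "toy table", style=:note)  # meant to survive transformations
metadata!(df, "checked", "yes")                     # style defaults to :default

df_sel = select(df, :a)  # an operation that might alter the data frame
df_cp = copy(df)         # one of the two exceptions that keep all metadata

note_kept = "caption" in metadatakeys(df_sel)
default_dropped = !("checked" in metadatakeys(df_sel))
copy_keeps_all = "checked" in metadatakeys(df_cp)
```

Under the rules above all three flags should be `true`.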
43+
44+
All DataAPI.jl metadata functions work with [`DataFrame`](@ref),
45+
[`SubDataFrame`](@ref), [`DataFrameRow`](@ref)
46+
objects, and objects returned by [`eachrow`](@ref) and [`eachcol`](@ref)
47+
functions. In this section collectively these objects will be called
48+
*data frame-like*, and follow the rules:
49+
50+
* objects returned by
51+
[`eachrow`](@ref) and [`eachcol`](@ref) functions have the same metadata
52+
as their parent `AbstractDataFrame`;
53+
* [`SubDataFrame`](@ref) and [`DataFrameRow`](@ref) only expose metadata from
54+
their parent `DataFrame` whose style is `:note`.
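
A minimal sketch of the two rules above, assuming a toy data frame with made-up keys `"caption"` and `"checked"`: a `SubDataFrame` should list only the `:note`-style key, while the object returned by `eachrow` should list the same keys as its parent.

```julia
using DataFrames

df = DataFrame(a=1:3)
metadata!(df, "caption", "toy table", style=:note)
metadata!(df, "checked", "yes")  # :default style

sdf = view(df, 1:2, :)  # SubDataFrame

# only :note-style keys are expected here
keys_in_view = Set(metadatakeys(sdf))

# the same keys as in the parent data frame are expected here
keys_in_rows = Set(metadatakeys(eachrow(df)))
```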
55+
56+
Notably metadata is not supported for [`GroupedDataFrame`](@ref) as it does not
57+
expose columns directly. You can inspect metadata of the `parent` of a
58+
[`GroupedDataFrame`](@ref) or of any of its groups.
59+
60+
!!! note
61+
62+
DataFrames.jl allows users to extract out columns of a data frame
63+
and perform operations on them. Such operations will not affect
64+
metadata. Therefore, even if some metadata has `:default` style it might
65+
no longer correctly describe the column's contents if the user mutates
66+
columns directly.
67+
68+
### DataFrames.jl-specific design principles for use of metadata
69+
70+
DataFrames.jl supports storing any object as metadata values. However,
71+
it is recommended to use strings as values of the metadata,
72+
as some storage formats, like for example Apache Arrow, only support
73+
strings.
74+
75+
For all functions that operate on column-level metadata, an `ArgumentError` is
76+
thrown if the passed column is not present in a data frame.
77+
78+
If [`metadata!`](@ref) or [`colmetadata!`](@ref) is used to add metadata
79+
to a [`SubDataFrame`](@ref) or a [`DataFrameRow`](@ref) then:
80+
81+
* using metadata that has style other than `:note` throws an error;
82+
* trying to add key-value pair for which a mapping for key already exists
83+
with style other than `:note` in the parent data frame throws an error.
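
The behavior described above can be sketched as follows. The key names are hypothetical, and the example only checks that some exception is raised, since the text above does not specify the exception type for this case.

```julia
using DataFrames

df = DataFrame(a=1:3)
sdf = view(df, 1:2, :)

# :note-style metadata added through the view becomes visible in the parent
metadata!(sdf, "caption", "toy table", style=:note)
parent_value = metadata(df, "caption")

# the default style is :default, which a SubDataFrame does not accept
rejected = try
    metadata!(sdf, "checked", "yes")
    false
catch
    true
end
```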
84+
85+
DataFrames.jl is designed so that there is no performance overhead due to metadata support
86+
when there is no metadata in a data frame. Therefore if you need maximum performance
87+
of operations that do not rely on metadata call `emptymetadata!` and
88+
`emptycolmetadata!` before running these operations.
89+
90+
Processing metadata for `SubDataFrame` and `DataFrameRow` has more overhead
91+
than for other types defined in DataFrames.jl that support metadata, because
92+
they have a more complex logic of handling it (they support only `:note`-style
93+
metadata, which means that other metadata needs to be filtered-out).
94+
95+
## Examples
96+
97+
Here is a simple example of how you can work with metadata in DataFrames.jl:
98+
99+
```jldoctest dataframe
100+
julia> using DataFrames
101+
102+
julia> df = DataFrame(name=["Jan Krzysztof Duda", "Jan Krzysztof Duda",
103+
"Radosław Wojtaszek", "Radosław Wojtaszek"],
104+
date=["2022-Jun", "2021-Jun", "2022-Jun", "2021-Jun"],
105+
rating=[2750, 2729, 2708, 2687])
106+
4×3 DataFrame
107+
Row │ name date rating
108+
│ String String Int64
109+
─────┼──────────────────────────────────────
110+
1 │ Jan Krzysztof Duda 2022-Jun 2750
111+
2 │ Jan Krzysztof Duda 2021-Jun 2729
112+
3 │ Radosław Wojtaszek 2022-Jun 2708
113+
4 │ Radosław Wojtaszek 2021-Jun 2687
114+
115+
julia> metadatakeys(df)
116+
()
117+
118+
julia> metadata!(df, "caption", "ELO ratings of chess players", style=:note);
119+
120+
julia> collect(metadatakeys(df))
121+
1-element Vector{String}:
122+
"caption"
123+
124+
julia> metadata(df, "caption")
125+
"ELO ratings of chess players"
126+
127+
julia> metadata(df, "caption", style=true)
128+
("ELO ratings of chess players", :note)
129+
130+
julia> emptymetadata!(df);
131+
132+
julia> metadatakeys(df)
133+
()
134+
135+
julia> colmetadatakeys(df)
136+
()
137+
138+
julia> colmetadata!(df, :name, "label", "First and last name of a player", style=:note);
139+
140+
julia> colmetadata!(df, :date, "label", "Rating date in yyyy-u format", style=:note);
141+
142+
julia> colmetadata!(df, :rating, "label", "ELO rating in classical time control", style=:note);
143+
144+
julia> colmetadata(df, :rating, "label")
145+
"ELO rating in classical time control"
146+
147+
julia> colmetadata(df, :rating, "label", style=true)
148+
("ELO rating in classical time control", :note)
149+
150+
julia> collect(colmetadatakeys(df))
151+
3-element Vector{Pair{Symbol, Base.KeySet{String, Dict{String, Tuple{Any, Any}}}}}:
152+
:date => ["label"]
153+
:rating => ["label"]
154+
:name => ["label"]
155+
156+
julia> [only(names(df, col)) =>
157+
[key => colmetadata(df, col, key) for key in metakeys] for
158+
(col, metakeys) in colmetadatakeys(df)]
159+
3-element Vector{Pair{String, Vector{Pair{String, String}}}}:
160+
"date" => ["label" => "Rating date in yyyy-u format"]
161+
"rating" => ["label" => "ELO rating in classical time control"]
162+
"name" => ["label" => "First and last name of a player"]
163+
164+
julia> emptycolmetadata!(df);
165+
166+
julia> colmetadatakeys(df)
167+
()
168+
```
169+
170+
## Propagation of `:note`-style metadata
171+
172+
An important design feature of `:note`-style metadata is how it is handled when
173+
data frames are transformed.
174+
175+
!!! note
176+
177+
The provided rules might slightly change in the future. Any change to
178+
`:note`-style metadata propagation rules will not be considered as breaking
179+
and can be done in any minor release of DataFrames.jl.
180+
Such changes might be made based on users' feedback about what metadata
181+
propagation rules are most convenient in practice.
182+
183+
The general design rules for propagation of `:note`-style metadata are as follows.
184+
185+
For operations that take a single data frame as an input:
186+
* Table level metadata is propagated to the returned data frame object.
187+
* For column-level metadata:
188+
- in all cases when a single column is transformed to
189+
a single column and the name of the column does
190+
not change (or is automatically changed e.g. to de-duplicate column names or
191+
via column renaming in joins)
192+
column-level metadata is preserved (example operations of this kind are
193+
`getindex`, `subset`, joins, `mapcols`).
194+
- in all cases when a single column is transformed with `identity` or `copy` to a single column,
195+
column-level metadata is preserved even if column name is changed (example
196+
operations of this kind are `rename`, or the `:x => :y` or
197+
`:x => copy => :y` operation specification in `select`).
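
The column-level rules above can be sketched with three `select` calls on a toy column. The column names, the `"label"` key and the use of `log10` are assumptions made only for illustration.

```julia
using DataFrames

df = DataFrame(x=[1.0, 10.0, 100.0])
colmetadata!(df, :x, "label", "raw measurement", style=:note)

same_name = select(df, :x => ByRow(log10) => :x)  # transformed, name unchanged
renamed = select(df, :x => :y)                    # pure renaming
new_name = select(df, :x => ByRow(log10) => :y)   # transformed and renamed

kept_same = "label" in colmetadatakeys(same_name, :x)
kept_renamed = "label" in colmetadatakeys(renamed, :y)
dropped_new = isempty(colmetadatakeys(new_name, :y))
```

The first two cases fall under the two rules listed above; the third matches neither, so the new column is expected to carry no metadata.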
198+
199+
For operations that take multiple data frames as their input two cases are distinguished:
200+
201+
- When there is a natural main table in the operation (`append!`, `prepend!`,
202+
`leftjoin`, `leftjoin!`, `rightjoin`, `semijoin`, `antijoin`, `setindex!`):
203+
- table-level metadata is taken from the main table;
204+
- column-level metadata for columns from the main table is taken from main table;
205+
- column-level metadata for columns from the non-main table is taken only for
206+
columns not present in the main table.
207+
- When all tables are equivalent (`hcat`, `vcat`, `innerjoin`, `outerjoin`):
208+
- table-level metadata is preserved only for keys which are defined
209+
in all passed tables and have the same value;
210+
- column-level metadata is preserved only for keys which are defined
211+
in all passed tables that contain this column and have the same value.
212+
In all these operations when metadata is preserved the values in the key-value
213+
pairs are not copied (this is relevant in case of mutable values).
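
A small sketch of both multi-table cases, using made-up keys `"source"` and `"wave"`: `vcat` treats its arguments as equivalent, while `append!` treats its first argument as the main table.

```julia
using DataFrames

df1 = DataFrame(id=1:2)
df2 = DataFrame(id=3:4)
for (df, wave) in ((df1, "1"), (df2, "2"))
    metadata!(df, "source", "survey", style=:note)  # same value in both tables
    metadata!(df, "wave", wave, style=:note)        # value differs between tables
end

# equivalent tables: only keys with matching values are expected to survive
stacked = vcat(df1, df2)
shared_kept = "source" in metadatakeys(stacked)
conflicting_dropped = !("wave" in metadatakeys(stacked))

# main table: df1 keeps its own table-level metadata
append!(df1, df2)
main_wave = metadata(df1, "wave")
```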
214+
215+
!!! note
216+
217+
The rules for column-level `:note`-style metadata propagation are designed
218+
to make the right decision in common cases. In particular, they assume that if
219+
source and target column name is the same then the metadata for the column is
220+
not changed. While this is valid for many operations, it is not always true
221+
in general. For example the `:x => ByRow(log) => :x` transformation might
222+
invalidate metadata if it contained unit of measure of the variable. In such
223+
cases user must either use a different name for the output column,
224+
set metadata style to `:default` before the operation,
225+
or manually drop or update such metadata from the `:x` column
226+
after the transformation.
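
One way to handle the situation described in this note is sketched below. The `"unit"` key and its values are hypothetical; the example shows the manual update and the manual drop after an in-place transformation that keeps the column name.

```julia
using DataFrames

df = DataFrame(x=[1.0, 10.0, 100.0])
colmetadata!(df, :x, "unit", "meters", style=:note)

# the column name does not change, so "unit" stays attached
transform!(df, :x => ByRow(log10) => :x)
still_attached = "unit" in colmetadatakeys(df, :x)

# option 1: update the entry so that it describes the new contents
colmetadata!(df, :x, "unit", "log10(meters)", style=:note)
updated = colmetadata(df, :x, "unit")

# option 2: drop the entry altogether
deletecolmetadata!(df, :x, "unit")
now_empty = isempty(colmetadatakeys(df, :x))
```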
227+
228+
### Operations that preserve `:note`-style metadata
229+
230+
Most of the functions in DataFrames.jl only preserve table and column metadata
231+
whose style is `:note`.
232+
Some functions use a more complex logic, even if they follow the general rules
233+
described above (in particular under any transformation all non-`:note`-style
234+
metadata is always dropped). These are:
235+
236+
* [`describe`](@ref) drops all metadata.
237+
* [`hcat`](@ref): propagates table-level metadata only for keys which are defined
238+
in all passed tables and have the same value;
239+
column-level metadata is preserved.
240+
* [`vcat`](@ref): propagates table-level metadata only for keys which are defined
241+
in all passed tables and have the same value;
242+
column-level metadata is preserved only for keys which are defined
243+
in all passed tables that contain this column and have the same value;
244+
* [`stack`](@ref): propagates table-level metadata and column-level metadata
245+
for identifier columns.
246+
* [`unstack`](@ref): propagates table-level metadata and column-level metadata
247+
for row keys columns.
248+
* [`permutedims`](@ref): propagates table-level metadata and drops column-level
249+
metadata.
250+
* broadcasted assignment does not change target metadata;
251+
under Julia earlier than 1.7 operation of kind `df.a .= s` does not drop non-`:note`-style
252+
metadata; under Julia 1.7 or later this operation preserves only `:note`-style
253+
metadata
254+
* broadcasting propagates table-level metadata if some key is present
255+
in all passed data frames and value associated with it is identical in all
256+
passed data frames; column-level metadata is propagated for columns if some
257+
key for a given column is present in all passed data frames and value
258+
associated with it is identical in all passed data frames.
259+
* `getindex` preserves table-level metadata and column-level metadata
260+
for selected columns
261+
* `setindex!` does not affect table-level and column-level metadata
262+
* [`push!`](@ref), [`pushfirst!`](@ref), [`insert!`](@ref) do not affect
263+
table-level nor column-level metadata (even if they add new columns and pushed row is
264+
a `DataFrameRow` or other value supporting metadata interface)
265+
* [`append!`](@ref) and [`prepend!`](@ref) do not change table and column-level
266+
metadata of the destination data frame, except that if new columns are added
267+
and these columns have metadata in the appended/prepended table then this
268+
metadata is preserved.
269+
* [`leftjoin!`](@ref), [`leftjoin`](@ref): table and column-level metadata is
270+
taken from the left table except for non-key columns from right table for which
271+
metadata is taken from right table;
272+
* [`rightjoin`](@ref): table and column-level metadata is taken from the right
273+
table except for non-key columns from left table for which metadata is
274+
taken from left table;
275+
* [`innerjoin`](@ref), [`outerjoin`](@ref): propagates table-level metadata only for keys
276+
that are defined in all passed data frames and have the same value;
277+
column-level metadata is propagated for all columns except for key
278+
columns, for which it is propagated only for keys that are defined
279+
in all passed data frames and have the same value.
280+
* [`semijoin`](@ref), [`antijoin`](@ref): table and column-level metadata is
281+
taken from the left table.
282+
* [`crossjoin`](@ref): propagates table-level metadata only for keys
283+
that are defined in both passed data frames and have the same value;
284+
propagates column-level metadata from both passed data frames.
285+
* [`select`](@ref), [`select!`](@ref), [`transform`](@ref),
286+
[`transform!`](@ref), [`combine`](@ref): propagate table-level metadata;
287+
column-level metadata is propagated if:
288+
a) a single column is transformed to a single column and the name of the column does not change
289+
(this includes all column selection operations), or
290+
b) a single column is transformed with `identity` or `copy` to a single column
291+
even if column name is changed (this includes column renaming).
