Commit 3a71ae5

Support adding columns to views (#2794)
1 parent f9ca4ad commit 3a71ae5

14 files changed: +2278 -134 lines changed

NEWS.md

Lines changed: 30 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -2,12 +2,16 @@
22

33
## New functionalities
44

5-
* in the `groupby` function the `sort` keyword argument now allows three values
5+
* Improve `sort` keyword argument in `groupby`
6+
([#2812](https://github.com/JuliaData/DataFrames.jl/pull/2812)).
7+
8+
In the `groupby` function the `sort` keyword argument now allows three values:
69
- `nothing` (the default) leaves the order of groups undefined and allows
710
`groupby` to pick the fastest available grouping algorithm;
811
- `true` sorts groups by key columns;
912
- `false` creates groups in the order of their appearance in the parent data
1013
frame;
14+
1115
In previous versions, the `sort` keyword argument allowed only `Bool` values
1216
and `false` (which was the default) corresponded to the new
1317
behavior when `nothing` is passed. Therefore the only user-visible change
@@ -18,6 +22,31 @@
1822
(notably `PooledArray` and `CategoricalArray`) or when they contained only
1923
integers in a small range.
2024
([#2812](https://github.com/JuliaData/DataFrames.jl/pull/2812))
25+
26+
* Allow adding new columns to a `SubDataFrame` created with `:` as column selector
27+
([#2794](https://github.com/JuliaData/DataFrames.jl/pull/2794)).
28+
29+
If `sdf` is a `SubDataFrame` created with `:` as a column selector then
30+
`insertcols!`, `setindex!`, and broadcasted assignment allow for creation
31+
of new columns, automatically filling filtered-out rows with `missing` values.
32+
33+
* Allow replacing existing columns in a `SubDataFrame` with `!` as row selector
34+
in assignment and broadcasted assignment
35+
([#2794](https://github.com/JuliaData/DataFrames.jl/pull/2794)).
36+
37+
Assignment to existing columns allocates a new column.
38+
Values already stored in filtered-out rows are copied.
39+
40+
* Allow `SubDataFrame` to be passed as an argument to `select!` and `transform!`
41+
(also on `GroupedDataFrame` created from a `SubDataFrame`)
42+
([#2794](https://github.com/JuliaData/DataFrames.jl/pull/2794)).
43+
44+
Assignment to existing columns allocates a new column.
45+
Values already stored in filtered-out rows are copied.
46+
In case of creation of new columns, filtered-out rows are automatically
47+
filled with `missing` values.
48+
If `SubDataFrame` was not created with `:` as column selector the resulting operation
49+
must produce the same column names as stored in the source `SubDataFrame` or an error is thrown.
2150
* `Tables.materializer` when passed the following types or their subtypes:
2251
`AbstractDataFrame`, `DataFrameRows`, `DataFrameColumns` returns `DataFrame`.
2352
([#2839](https://github.com/JuliaData/DataFrames.jl/pull/2839))
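
The `SubDataFrame` entries above can be illustrated with a minimal sketch. It assumes a DataFrames.jl release that already includes this commit; the data frame `df`, the view `sdf`, and the column names `b`, `c`, `d` are made up for illustration and do not come from the changelog.

```julia
using DataFrames

df = DataFrame(a=1:5)
sdf = @view df[2:3, :]               # view created with `:` as column selector

# insertcols! on the view adds column `b` to the parent data frame;
# the rows that are not part of the view are filled with `missing`
insertcols!(sdf, :b => ["x", "y"])

# broadcasted assignment to a column that does not exist yet
sdf[:, :c] .= 1.0

# setindex! with `!` as row selector also creates a new column
sdf[!, :d] = [10, 20]
```

After these calls the parent `df` holds four columns, and each new column has an element type that allows `missing`, since rows 1, 4 and 5 were never assigned.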

docs/src/lib/indexing.md

Lines changed: 38 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -144,9 +144,30 @@ so it is unsafe to use it afterwards (the column length correctness will be pres
144144
* `sdf[row, cols] = v` -> the same as `dfr = df[row, cols]; dfr[:] = v` in-place;
145145
* `sdf[rows, col] = v` -> set rows `rows` of column `col`, in-place; `v` must be an abstract vector;
146146
* `sdf[rows, cols] = v` -> set rows `rows` of columns `cols` in-place;
147-
`v` can be an `AbstractMatrix` or `v` can be `AbstractDataFrame` when column names must match;
147+
`v` can be an `AbstractMatrix` or `v` can be `AbstractDataFrame`
148+
in which case column names must match;
149+
* `sdf[!, col] = v` -> replaces `col` with `v` with copying; if `col` is present in `sdf`
150+
then filtered-out rows in newly created vector are filled with
151+
values already present in that column and `promote_type` is used
152+
to determine the `eltype` of the new column;
153+
if `col` is not present in `sdf` then the operation is only allowed
154+
if `sdf` was created with `:` as column selector, in which case
155+
filtered-out rows are filled with `missing`;
156+
equivalent to `sdf.col = v` if `col` is a valid identifier;
157+
operation is allowed if `length(v) == nrow(sdf)`;
158+
* `sdf[!, cols] = v` -> replaces existing columns `cols` in data frame `sdf` with copying;
159+
`v` must be an `AbstractMatrix` or an `AbstractDataFrame`
160+
(in the latter case column names must match);
161+
filtered-out rows in newly created vectors are filled with
162+
values already present in respective columns
163+
and `promote_type` is used to determine the `eltype` of the new columns;
164+
165+
!!! note
166+
167+
The rules above mean that `sdf[:, col] = v` is an in-place operation if `col` is present in `sdf`,
168+
therefore it will be fast in general. On the other hand using `sdf[!, col] = v`
169+
or `sdf.col = v` will always allocate a new vector, which is more expensive computationally.
148170

149-
Note that `sdf[!, col] = v`, `sdf[!, cols] = v` and `sdf.col = v` are not allowed as `sdf` can be only modified in-place.
150171

151172
`setindex!` on `DataFrameRow`:
152173
* `dfr[col] = v` -> set value of `col` in row `row` to `v` in-place;
@@ -171,7 +192,6 @@ The following broadcasting rules apply to `AbstractDataFrame` objects:
171192
Note that if broadcasting assignment operation throws an error the target data frame may be partially changed
172193
so it is unsafe to use it afterwards (the column length correctness will be preserved).
173194

174-
175195
Broadcasting `DataFrameRow` is currently not allowed (which is consistent with `NamedTuple`).
176196

177197
It is possible to assign a value to `AbstractDataFrame` and `DataFrameRow` objects using the `.=` operator.
@@ -199,8 +219,21 @@ Additional rules:
199219
Starting from Julia 1.7 if `:col` is not present in `df` then a new column will be created in `df`.
200220
* in the `sdf[CartesianIndex(row, col)] .= v`, `sdf[row, col] .= v` and `sdf[row, cols] .= v` syntaxes the assignment to `sdf` is performed in-place;
201221
* in the `sdf[rows, col] .= v` and `sdf[rows, cols] .= v` syntaxes the assignment to `sdf` is performed in-place;
202-
* `sdf.col .= v` syntax is performs an in-place assignment to an existing vector `sdf.col` and is deprecated;
203-
in the future this operation will not be allowed.
222+
if `rows` is `:` and `col` is a `Symbol` or `AbstractString`
223+
referring to a column missing from `sdf` and `sdf` was created with `:` as column selector
224+
then a new column is allocated and added;
225+
the filtered-out rows are filled with `missing`;
226+
* in the `sdf[!, col] .= v` syntax column `col` is replaced by a freshly allocated vector;
227+
the filtered-out rows are filled with values already present in `col`;
228+
if `col` is a `Symbol` or `AbstractString` referring to a column missing from `sdf`
229+
and `sdf` was created with `:` as column selector then a new column is allocated and added;
230+
in this case the filtered-out rows are filled with `missing`;
231+
* the `sdf[!, cols] .= v` syntax replaces existing columns `cols` in data frame `sdf` with freshly allocated vectors;
232+
the filtered-out rows are filled with values already present in `cols`;
233+
* `sdf.col .= v` syntax currently performs in-place assignment to an existing vector `sdf.col`;
234+
this behavior is deprecated and a new column will be allocated in the future.
235+
Starting from Julia 1.7 if `:col` is not present in `sdf` then a new column will be created in `sdf`
236+
if `sdf` was created with `:` as a column selector.
204237
* `dfr.col .= v` syntax is allowed and performs in-place assignment to a value extracted by `dfr.col`.
205238

206239
Note that `sdf[!, col] .= v` and `sdf[!, cols] .= v` syntaxes are not allowed as `sdf` can be only modified in-place.
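
The difference between the in-place `sdf[:, col] = v` and the allocating `sdf[!, col] = v` forms of `setindex!` described in this file can be sketched as follows. This is only an illustration, assuming a DataFrames.jl release that includes this commit; `df`, `sdf`, and `old` are hypothetical names.

```julia
using DataFrames

df = DataFrame(a=1:5)
sdf = @view df[2:3, :]
old = df.a                     # the vector currently stored in the parent

# `:` as row selector writes into the existing vector
sdf[:, :a] = [20, 30]
inplace_kept_vector = df.a === old

# `!` as row selector allocates a new vector in the parent; rows outside
# the view keep their old values and the element type is widened with
# promote_type(Int, Float64)
sdf[!, :a] = [2.5, 3.5]
replaced_vector = df.a !== old
```

The first assignment leaves the parent column as the same object, which is why it is the cheaper choice when the new values fit the existing element type.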

docs/src/man/split_apply_combine.md

Lines changed: 166 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -113,9 +113,10 @@ In all of these cases, `function` can return either a single row or multiple
113113
rows. As a particular rule, values wrapped in a `Ref` or a `0`-dimensional
114114
`AbstractArray` are unwrapped and then treated as a single row.
115115

116-
`select`/`select!` and `transform`/`transform!` always return a `DataFrame`
116+
`select`/`select!` and `transform`/`transform!` always return a data frame
117117
with the same number and order of rows as the source (even if `GroupedDataFrame`
118-
had its groups reordered).
118+
had its groups reordered), except when selection results in zero columns
119+
in the resulting data frame (in which case the result has zero rows).
119120

120121
For `combine`, rows in the returned object appear in the order of groups in the
121122
`GroupedDataFrame`. The functions can return an arbitrary number of rows for
@@ -617,3 +618,166 @@ julia> gd[1]
617618
─────┼───────
618619
1 │ 1
619620
```
621+
622+
# Simulating the SQL `where` clause
623+
624+
You can conveniently work on subsets of a data frame by using `SubDataFrame`s.
625+
Operations performed on such objects can either create a new data frame or be
626+
performed in-place. Here are some examples:
627+
628+
```jldoctest sac
629+
julia> df = DataFrame(a=1:5)
630+
5×1 DataFrame
631+
Row │ a
632+
│ Int64
633+
─────┼───────
634+
1 │ 1
635+
2 │ 2
636+
3 │ 3
637+
4 │ 4
638+
5 │ 5
639+
640+
julia> sdf = @view df[2:3, :]
641+
2×1 SubDataFrame
642+
Row │ a
643+
│ Int64
644+
─────┼───────
645+
1 │ 2
646+
2 │ 3
647+
648+
julia> transform(sdf, :a => ByRow(string)) # create a new data frame
649+
2×2 DataFrame
650+
Row │ a a_string
651+
│ Int64 String
652+
─────┼─────────────────
653+
1 │ 2 2
654+
2 │ 3 3
655+
656+
julia> transform!(sdf, :a => ByRow(string)) # update the source df in-place
657+
2×2 SubDataFrame
658+
Row │ a a_string
659+
│ Int64 String?
660+
─────┼─────────────────
661+
1 │ 2 2
662+
2 │ 3 3
663+
664+
julia> df # new column was created filled with missing in filtered-out rows
665+
5×2 DataFrame
666+
Row │ a a_string
667+
│ Int64 String?
668+
─────┼─────────────────
669+
1 │ 1 missing
670+
2 │ 2 2
671+
3 │ 3 3
672+
4 │ 4 missing
673+
5 │ 5 missing
674+
675+
julia> select!(sdf, :a => -, renamecols=false) # update the source df in-place
676+
2×1 SubDataFrame
677+
Row │ a
678+
│ Int64
679+
─────┼───────
680+
1 │ -2
681+
2 │ -3
682+
683+
julia> df # the column replaced an existing column; previously stored values are re-used in filtered-out rows
684+
5×1 DataFrame
685+
Row │ a
686+
│ Int64
687+
─────┼───────
688+
1 │ 1
689+
2 │ -2
690+
3 │ -3
691+
4 │ 4
692+
5 │ 5
693+
```
694+
695+
Similar operations can be performed on `GroupedDataFrame` as well:
696+
```jldoctest sac
697+
julia> df = DataFrame(a=[1, 1, 1, 2, 2, 3], b=1:6)
698+
6×2 DataFrame
699+
Row │ a b
700+
│ Int64 Int64
701+
─────┼──────────────
702+
1 │ 1 1
703+
2 │ 1 2
704+
3 │ 1 3
705+
4 │ 2 4
706+
5 │ 2 5
707+
6 │ 3 6
708+
709+
julia> sdf = @view df[2:4, :]
710+
3×2 SubDataFrame
711+
Row │ a b
712+
│ Int64 Int64
713+
─────┼──────────────
714+
1 │ 1 2
715+
2 │ 1 3
716+
3 │ 2 4
717+
718+
julia> gsdf = groupby(sdf, :a)
719+
GroupedDataFrame with 2 groups based on key: a
720+
First Group (2 rows): a = 1
721+
Row │ a b
722+
│ Int64 Int64
723+
─────┼──────────────
724+
1 │ 1 2
725+
2 │ 1 3
726+
727+
Last Group (1 row): a = 2
728+
Row │ a b
729+
│ Int64 Int64
730+
─────┼──────────────
731+
1 │ 2 4
732+
733+
julia> transform(gsdf, nrow) # create a new data frame
734+
3×3 DataFrame
735+
Row │ a b nrow
736+
│ Int64 Int64 Int64
737+
─────┼─────────────────────
738+
1 │ 1 2 2
739+
2 │ 1 3 2
740+
3 │ 2 4 1
741+
742+
julia> transform!(gsdf, nrow, :b => :b_copy)
743+
3×4 SubDataFrame
744+
Row │ a b nrow b_copy
745+
│ Int64 Int64 Int64? Int64?
746+
─────┼──────────────────────────────
747+
1 │ 1 2 2 2
748+
2 │ 1 3 2 3
749+
3 │ 2 4 1 4
750+
751+
julia> df
752+
6×4 DataFrame
753+
Row │ a b nrow b_copy
754+
│ Int64 Int64 Int64? Int64?
755+
─────┼────────────────────────────────
756+
1 │ 1 1 missing missing
757+
2 │ 1 2 2 2
758+
3 │ 1 3 2 3
759+
4 │ 2 4 1 4
760+
5 │ 2 5 missing missing
761+
6 │ 3 6 missing missing
762+
763+
julia> select!(gsdf, :b_copy, :b => sum, renamecols=false)
764+
3×3 SubDataFrame
765+
Row │ a b_copy b
766+
│ Int64 Int64? Int64
767+
─────┼──────────────────────
768+
1 │ 1 2 5
769+
2 │ 1 3 5
770+
3 │ 2 4 4
771+
772+
julia> df
773+
6×3 DataFrame
774+
Row │ a b_copy b
775+
│ Int64 Int64? Int64
776+
─────┼───────────────────────
777+
1 │ 1 missing 1
778+
2 │ 1 2 5
779+
3 │ 1 3 5
780+
4 │ 2 4 4
781+
5 │ 2 missing 5
782+
6 │ 3 missing 6
783+
```

src/abstractdataframe/selection.jl

Lines changed: 25 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -107,9 +107,10 @@ const TRANSFORMATION_COMMON_RULES =
107107
rows. As a particular rule, values wrapped in a `Ref` or a `0`-dimensional
108108
`AbstractArray` are unwrapped and then treated as a single row.
109109
110-
`select`/`select!` and `transform`/`transform!` always return a `DataFrame`
110+
`select`/`select!` and `transform`/`transform!` always return a data frame
111111
with the same number and order of rows as the source (even if `GroupedDataFrame`
112-
had its groups reordered).
112+
had its groups reordered), except when selection results in zero columns
113+
in the resulting data frame (in which case the result has zero rows).
113114
114115
For `combine`, rows in the returned object appear in the order of groups in the
115116
`GroupedDataFrame`. The functions can return an arbitrary number of rows for
@@ -623,17 +624,27 @@ function select_transform!((nc,)::Ref{Any}, df::AbstractDataFrame, newdf::DataFr
623624
end
624625

625626
"""
626-
select!(df::DataFrame, args...; renamecols::Bool=true)
627+
select!(df::AbstractDataFrame, args...; renamecols::Bool=true)
627628
select!(args::Base.Callable, df::DataFrame; renamecols::Bool=true)
628-
select!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true)
629+
select!(gd::GroupedDataFrame, args...; ungroup::Bool=true, renamecols::Bool=true)
629630
select!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true, renamecols::Bool=true)
630631
631632
Mutate `df` or `gd` in place to retain only columns or transformations specified by `args...` and
632633
return it. The result is guaranteed to have the same number of rows as `df` or
633634
parent of `gd`, except when no columns are selected (in which case the result
634635
has zero rows).
635636
636-
If `gd` is passed then it is updated to reflect the new rows of its updated
637+
If a `SubDataFrame` or `GroupedDataFrame{SubDataFrame}` is passed, the parent data frame
638+
is updated using columns generated by `args...`, following the same rules as indexing:
639+
- for existing columns filtered-out rows are filled with values present in the
640+
old columns
641+
- for new columns (which is only allowed if `SubDataFrame` was created with `:`
642+
as column selector) filtered-out rows are filled with `missing`
643+
- if `SubDataFrame` was not created with `:` as column selector then `select!`
644+
is only allowed if the transformations keep exactly the same sequence of column
645+
names as is in the passed `df`
646+
647+
If a `GroupedDataFrame` is passed then it is updated to reflect the new rows of its updated
637648
parent. If there are independent `GroupedDataFrame` objects constructed using
638649
the same parent data frame they might get corrupt.
639650
@@ -650,6 +661,9 @@ See [`select`](@ref) for examples.
650661
select!(df::DataFrame, @nospecialize(args...); renamecols::Bool=true) =
651662
_replace_columns!(df, select(df, args..., copycols=false, renamecols=renamecols))
652663

664+
select!(df::SubDataFrame, @nospecialize(args...); renamecols::Bool=true) =
665+
_replace_columns!(df, select(df, args..., copycols=true, renamecols=renamecols))
666+
653667
function select!(@nospecialize(arg::Base.Callable), df::AbstractDataFrame; renamecols::Bool=true)
654668
if arg isa Colon
655669
throw(ArgumentError("First argument must be a transformation if the second argument is a data frame"))
@@ -658,14 +672,15 @@ function select!(@nospecialize(arg::Base.Callable), df::AbstractDataFrame; renam
658672
end
659673

660674
"""
661-
transform!(df::DataFrame, args...; renamecols::Bool=true)
662-
transform!(args::Callable, df::DataFrame; renamecols::Bool=true)
663-
transform!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true)
675+
transform!(df::AbstractDataFrame, args...; renamecols::Bool=true)
676+
transform!(args::Callable, df::AbstractDataFrame; renamecols::Bool=true)
677+
transform!(gd::GroupedDataFrame, args...; ungroup::Bool=true, renamecols::Bool=true)
664678
transform!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true, renamecols::Bool=true)
665679
666680
Mutate `df` or `gd` in place to add columns specified by `args...` and return it.
667681
The result is guaranteed to have the same number of rows as `df`.
668-
Equivalent to `select!(df, :, args...)` or `select!(gd, :, args...)`.
682+
Equivalent to `select!(df, :, args...)` or `select!(gd, :, args...)`,
683+
except that column renaming performs a copy.
669684
670685
$TRANSFORMATION_COMMON_RULES
671686
@@ -677,7 +692,7 @@ $TRANSFORMATION_COMMON_RULES
677692
678693
See [`select`](@ref) for examples.
679694
"""
680-
function transform!(df::DataFrame, @nospecialize(args...); renamecols::Bool=true)
695+
function transform!(df::AbstractDataFrame, @nospecialize(args...); renamecols::Bool=true)
681696
idx = index(df)
682697
newargs = Any[if sel isa Pair{<:ColumnIndex, Symbol}
683698
idx[first(sel)] => copy => last(sel)
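
The `select!` docstring in this file states that a `SubDataFrame` created without `:` as column selector must keep exactly the same column names. A minimal sketch of that rule, assuming a DataFrames.jl release that includes this commit (`df`, `sdf`, `threw`, and the column name `z` are hypothetical):

```julia
using DataFrames

df = DataFrame(a=1:5, b=11:15)
sdf = @view df[2:3, [:a]]      # explicit column list, not `:`

# Allowed: the result has the same column names as `sdf`;
# column `a` of the parent is replaced and rows outside the view keep their values
select!(sdf, :a => (x -> -x) => :a)

# Not allowed: the result would contain column `z` instead of `a`
threw = try
    select!(sdf, :a => :z)
    false
catch
    true
end
```

The sketch only checks that the second call fails and does not rely on a specific exception type.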
