Commit 7705974

add filter and subset to documentation (#2900)
1 parent ba36297 commit 7705974

1 file changed

+131
-36
lines changed

docs/src/man/working_with_dataframes.md

Lines changed: 131 additions & 36 deletions
@@ -2,7 +2,8 @@
22

33
## Examining the Data
44

5-
The default printing of `DataFrame` objects only includes a sample of rows and columns that fits on screen:
5+
The default printing of `DataFrame` objects only includes a sample of rows and
6+
columns that fits on screen:
67

78
```jldoctest dataframe
89
julia> using DataFrames
@@ -85,9 +86,12 @@ julia> DataFrame(a = 1:2, b = [1.0, missing],
8586
we can observe that:
8687

8788
* the first column `:a` can hold elements of type `Int64`;
88-
* the second column `:b` can hold `Float64` or `Missing`, which is indicated by `?` printed after the name of type;
89-
* the third column `:c` can hold categorical data; here we notice `…`, which indicates that the actual name of the type was long and got truncated;
90-
* the type information in fourth column `:d` presents a situation where the name is both truncated and the type allows `Missing`.
89+
* the second column `:b` can hold `Float64` or `Missing`, which is indicated by
90+
`?` printed after the name of the type;
91+
* the third column `:c` can hold categorical data; here we notice `…`, which
92+
indicates that the actual name of the type was long and got truncated;
93+
* the type information in the fourth column `:d` presents a situation where the
94+
name is both truncated and the type allows `Missing`.
9195

9296
## Taking a Subset
9397

@@ -160,7 +164,8 @@ julia> df[[3, 1], [:C]]
160164
2 │ 1
161165
```
162166

163-
Do note that `df[!, [:A]]` and `df[:, [:A]]` return a `DataFrame` object, while `df[!, :A]` and `df[:, :A]` return a vector:
167+
Do note that `df[!, [:A]]` and `df[:, [:A]]` return a `DataFrame` object, while
168+
`df[!, :A]` and `df[:, :A]` return a vector:
164169

165170
```jldoctest dataframe
166171
julia> df[!, [:A]]
@@ -222,7 +227,8 @@ that a single column vector should be extracted. Note that in the first case a
222227
vector is required to be passed (not just any iterable), so e.g. `df[:, (:x1,
223228
:x2)]` is not allowed, but `df[:, [:x1, :x2]]` is valid.
224229

225-
It is also possible to use a regular expression as a selector of columns matching it:
230+
It is also possible to use a regular expression as a selector of columns
231+
matching it:
226232
```jldoctest dataframe
227233
julia> df = DataFrame(x1=1, x2=2, y=3)
228234
1×3 DataFrame
@@ -294,9 +300,9 @@ julia> df[:, Cols(x -> startswith(x, "x"))] # keep columns whose name starts wit
294300
1 │ 2 3
295301
```
296302

297-
The following examples show a more complex use of the `Cols` selector, which moves all
298-
columns whose names match `r"x"` regular expression respectively to the front
299-
and to the end of the data frame:
303+
The following examples show a more complex use of the `Cols` selector, which
304+
moves all columns whose names match the `r"x"` regular expression to the front
305+
and to the end of the data frame, respectively:
300306
```jldoctest dataframe
301307
julia> df[:, Cols(r"x", :)]
302308
1×4 DataFrame
@@ -313,7 +319,8 @@ julia> df[:, Cols(Not(r"x"), :)]
313319
1 │ 1 4 2 3
314320
```
315321

316-
The indexing syntax can also be used to select rows based on conditions on variables:
322+
The indexing syntax can also be used to select rows based on conditions on
323+
variables:
317324

318325
```jldoctest dataframe
319326
julia> df = DataFrame(A = 1:2:1000, B = repeat(1:10, inner=50), C = 1:500)
@@ -385,7 +392,9 @@ julia> df[(df.A .> 500) .& (300 .< df.C .< 400), :]
385392
99 │ 797 8 399
386393
84 rows omitted
387394
```
388-
Where a specific subset of values needs to be matched, the `in()` function can be applied:
395+
396+
Where a specific subset of values needs to be matched, the `in()` function can
397+
be applied:
389398

390399
```jldoctest dataframe
391400
julia> df[in.(df.A, Ref([1, 5, 601])), :]
@@ -409,19 +418,87 @@ a function object that tests whether each value belongs to the subset
409418

410419
The only indexing situations where data frames will **not** return a copy are:
411420

412-
- when a `!` is placed in the first indexing position (`df[!, :A]`, or `df[!, [:A, :B]]`),
421+
- when a `!` is placed in the first indexing position
422+
(`df[!, :A]`, or `df[!, [:A, :B]]`),
413423
- when using `.` (`getproperty`) notation (`df.A`),
414424
- when a single row is selected using an integer (`df[1, [:A, :B]]`),
415425
- when `view` or `@view` is used (e.g. `@view df[1:3, :A]`).
416426

417427
More details on copies, views, and references can be found
418428
in the [`getindex` and `view`](@ref) section.
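
As a minimal sketch of the rules listed above (the small data frame `df_demo` is a hypothetical example, not part of the manual), identity comparison with `===` shows which indexing forms alias the stored column and which allocate a copy:

```julia
using DataFrames

# Hypothetical example data, not taken from the manual.
df_demo = DataFrame(A = [1, 2, 3], B = [4, 5, 6])

# `!` and property access return the stored column vector itself.
df_demo[!, :A] === df_demo.A   # true

# `:` in the row position copies the column.
df_demo[:, :A] === df_demo.A   # false

# A view shares memory with its parent, so writes are visible in `df_demo`.
v = @view df_demo[1:2, :A]
v[1] = 100
df_demo.A[1]                   # 100
```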
419429

430+
### Subsetting functions
431+
432+
An alternative approach to row subsetting in a data frame is to use
433+
the [`subset`](@ref) function, or the [`subset!`](@ref) function,
434+
which is its in-place variant.
435+
436+
These functions take a data frame as their first argument. The
437+
following positional arguments (one or more) are filtering condition
438+
specifications that must be jointly met. Each condition should be passed as a
439+
`Pair` consisting of source column(s) and a function specifying the filtering
440+
condition taking this or these column(s) as arguments:
441+
442+
```jldoctest dataframe
443+
julia> subset(df, :A => a -> a .< 10, :C => c -> isodd.(c))
444+
3×3 DataFrame
445+
Row │ A B C
446+
│ Int64 Int64 Int64
447+
─────┼─────────────────────
448+
1 │ 1 1 1
449+
2 │ 5 1 3
450+
3 │ 9 1 5
451+
```
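
When a condition depends on several columns, the source part of the `Pair` can be a vector of column names, in which case the function receives one argument per column. The sketch below assumes the `df` with columns `:A`, `:B`, and `:C` used in the example above (rebuilt here so the block is self-contained) and uses `ByRow` to state the condition on scalar values:

```julia
using DataFrames

df = DataFrame(A = 1:2:1000, B = repeat(1:10, inner=50), C = 1:500)

# Two source columns; the function takes one argument per column.
# `ByRow` applies the scalar predicate to each row.
res = subset(df, [:A, :C] => ByRow((a, c) -> a < 10 && isodd(c)))
```

This should select the same three rows as the two-condition call above; which form reads better is largely a matter of taste.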
452+
453+
It is a frequent situation that `missing` values might be present in the
454+
filtering columns, which could then lead the filtering condition to return
455+
`missing` instead of the expected `true` or `false`. In order
456+
to handle this situation one can either use the `coalesce` function or pass
457+
the `skipmissing=true` keyword argument to `subset`. Here is an example:
458+
459+
```jldoctest dataframe
460+
julia> df = DataFrame(x=[1, 2, missing, 4])
461+
4×1 DataFrame
462+
Row │ x
463+
│ Int64?
464+
─────┼─────────
465+
1 │ 1
466+
2 │ 2
467+
3 │ missing
468+
4 │ 4
469+
470+
julia> subset(df, :x => x -> coalesce.(iseven.(x), false))
471+
2×1 DataFrame
472+
Row │ x
473+
│ Int64?
474+
─────┼────────
475+
1 │ 2
476+
2 │ 4
477+
478+
julia> subset(df, :x => x -> iseven.(x), skipmissing=true)
479+
2×1 DataFrame
480+
Row │ x
481+
│ Int64?
482+
─────┼────────
483+
1 │ 2
484+
2 │ 4
485+
```
486+
487+
Additionally, DataFrames.jl extends the [`filter`](@ref) and [`filter!`](@ref)
488+
functions provided in Julia Base, which can also be used to subset a data frame.
489+
Please refer to their documentation for details.
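
For comparison, a brief sketch of `filter` on the same kind of data (the data frame is rebuilt here so the block stands alone). Unlike `subset`, `filter` takes the predicate first, and with the `column => function` form the function is applied to each row's value rather than to the whole column. A `missing` result is not accepted as a condition, so missing values are handled explicitly in the predicate:

```julia
using DataFrames

df = DataFrame(x=[1, 2, missing, 4])

# Predicate comes first; it receives one element of `:x` at a time.
by_column = filter(:x => x -> !ismissing(x) && iseven(x), df)

# The same condition written against a whole row.
by_row = filter(row -> !ismissing(row.x) && iseven(row.x), df)
```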
490+
491+
It is worth mentioning that [`subset`](@ref) was designed to be
492+
consistent with how column transformations are specified in functions like
493+
[`combine`](@ref), [`select`](@ref), and [`transform`](@ref). Examples of column
494+
transformations accepted by these functions are provided in the following
495+
section.
496+
420497
### Selecting and transforming columns
421498

422499
You can also use the [`select`](@ref)/[`select!`](@ref) and
423-
[`transform`](@ref)/[`transform!`](@ref) functions to select, rename and transform
424-
columns in a data frame.
500+
[`transform`](@ref)/[`transform!`](@ref) functions to select, rename and
501+
transform columns in a data frame.
425502

426503
The `select` function creates a new data frame:
427504
```jldoctest dataframe
@@ -538,11 +615,12 @@ julia> df
538615
2 │ 4 6
539616
```
540617

541-
`transform` and `transform!` functions work identically to `select` and `select!` with the only difference that
542-
they retain all columns that are present in the source data frame. Here are some more advanced examples.
618+
`transform` and `transform!` functions work identically to `select` and
619+
`select!` with the only difference that they retain all columns that are present
620+
in the source data frame. Here are some more advanced examples.
543621

544-
First we show how to generate a column that is a sum of all other columns in the data frame
545-
using the `All()` selector:
622+
First we show how to generate a column that is a sum of all other columns in the
623+
data frame using the `All()` selector:
546624

547625
```jldoctest dataframe
548626
julia> df = DataFrame(x1=[1, 2], x2=[3, 4], y=[5, 6])
@@ -561,7 +639,10 @@ julia> transform(df, All() => +)
561639
1 │ 1 3 5 9
562640
2 │ 2 4 6 12
563641
```
564-
Using the `ByRow` wrapper, we can easily compute for each row the name of column with the highest score:
642+
643+
Using the `ByRow` wrapper, we can easily compute for each row the name of the
644+
column with the highest score:
645+
565646
```
566647
julia> using Random
567648
@@ -599,8 +680,10 @@ julia> transform(df, AsTable(:) => ByRow(argmax) => :prediction)
599680
9 │ 0.251662 0.287702 0.0856352 b
600681
10 │ 0.986666 0.859512 0.553206 a
601682
```
602-
In the following, most complex, example below we compute row-wise sum, number of elements, and mean,
603-
while ignoring missing values.
683+
684+
In the most complex example below, we compute the row-wise sum, number of
685+
elements, and mean, while ignoring missing values.
686+
604687
```
605688
julia> using Statistics
606689
@@ -628,17 +711,21 @@ julia> transform(df, AsTable(:) .=>
628711
```
629712

630713
While the DataFrames.jl package provides basic data manipulation capabilities,
631-
users are encouraged to use querying frameworks for more convenient and powerful operations:
714+
users are encouraged to use querying frameworks for more convenient and powerful
715+
operations:
632716
- the [Query.jl](https://github.com/davidanthoff/Query.jl) package provides a
633-
[LINQ](https://en.wikipedia.org/wiki/Language_Integrated_Query)-like interface to a large number of data sources
717+
[LINQ](https://en.wikipedia.org/wiki/Language_Integrated_Query)-like interface
718+
to a large number of data sources
634719
- the [DataFramesMeta.jl](https://github.com/JuliaStats/DataFramesMeta.jl)
635-
package provides interfaces similar to LINQ and [dplyr](https://dplyr.tidyverse.org)
720+
package provides interfaces similar to LINQ and
721+
[dplyr](https://dplyr.tidyverse.org)
636722

637723
See the [Data manipulation frameworks](@ref) section for more information.
638724

639725
## Summarizing Data
640726

641-
The `describe` function returns a data frame summarizing the elementary statistics and information about each column:
727+
The `describe` function returns a data frame summarizing the elementary
728+
statistics and information about each column:
642729

643730
```jldoctest dataframe
644731
julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
@@ -660,8 +747,10 @@ julia> describe(df)
660747
2 │ B F M 0 String
661748
```
662749

663-
If you are interested in describing only a subset of columns then the easiest way
664-
to do it is to pass a subset of an original data frame to `describe` like this:
750+
If you are interested in describing only a subset of columns then the easiest
751+
way to do it is to pass a subset of an original data frame to `describe` like
752+
this:
753+
665754
```jldoctest dataframe
666755
julia> describe(df[!, [:A]])
667756
1×7 DataFrame
@@ -671,15 +760,19 @@ julia> describe(df[!, [:A]])
671760
1 │ A 2.5 1 2.5 4 0 Int64
672761
```
673762

674-
Of course, one can also compute descriptive statistics directly on individual columns:
763+
Of course, one can also compute descriptive statistics directly on individual
764+
columns:
765+
675766
```jldoctest dataframe
676767
julia> using Statistics
677768
678769
julia> mean(df.A)
679770
2.5
680771
```
681772

682-
We can also apply a function to each column of a `DataFrame` using `combine`. For example:
773+
We can also apply a function to each column of a `DataFrame` using `combine`.
774+
For example:
775+
683776
```jldoctest dataframe
684777
julia> df = DataFrame(A = 1:4, B = 4.0:-1.0:1.0)
685778
4×2 DataFrame
@@ -706,8 +799,8 @@ julia> combine(df, names(df) .=> sum, names(df) .=> prod)
706799
1 │ 10 10.0 24 24.0
707800
```
708801

709-
If you would prefer the result to have the same number of rows as the source data
710-
frame use `select` instead of `combine`.
802+
If you would prefer the result to have the same number of rows as the source
803+
data frame, use `select` instead of `combine`.
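
A small sketch of that difference, assuming the same `df` as in the example above (rebuilt here so the block is self-contained): with `select`, a scalar result such as a sum is repeated for every row of the source data frame.

```julia
using DataFrames

df = DataFrame(A = 1:4, B = 4.0:-1.0:1.0)

# `combine` collapses each column to a single row.
combined = combine(df, names(df) .=> sum)   # 1×2 data frame

# `select` keeps the row count of `df`; the scalar sums are broadcast.
selected = select(df, names(df) .=> sum)    # 4×2 data frame
```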
711804

712805
## Handling of Columns Stored in a `DataFrame`
713806

@@ -731,8 +824,8 @@ julia> df2.A === df.A
731824
false
732825
```
733826

734-
On the other hand, in-place functions, whose names end with `!`, may mutate the column vectors of the
735-
`DataFrame` they take as an argument, for example:
827+
On the other hand, in-place functions, whose names end with `!`, may mutate the
828+
column vectors of the `DataFrame` they take as an argument, for example:
736829

737830
```jldoctest dataframe
738831
julia> x = [3, 1, 2];
@@ -817,8 +910,9 @@ true
817910
Note that a column obtained from a `DataFrame` using one of these methods should
818911
not be mutated without caution.
819912

820-
The exact rules of handling columns of a `DataFrame` are explained in
821-
[The design of handling of columns of a `DataFrame`](@ref man-columnhandling) section of the manual.
913+
The exact rules of handling columns of a `DataFrame` are explained in [The
914+
design of handling of columns of a `DataFrame`](@ref man-columnhandling) section
915+
of the manual.
822916

823917

824918
## Replacing Data
@@ -836,7 +930,8 @@ Replacement operations affecting a single column can be performed using `replace
836930
```jldoctest replace
837931
julia> using DataFrames
838932
839-
julia> df = DataFrame(a = ["a", "None", "b", "None"], b = 1:4, c = ["None", "j", "k", "h"], d = ["x", "y", "None", "z"])
933+
julia> df = DataFrame(a = ["a", "None", "b", "None"], b = 1:4,
934+
c = ["None", "j", "k", "h"], d = ["x", "y", "None", "z"])
840935
4×4 DataFrame
841936
Row │ a b c d
842937
│ String Int64 String String
