22
33## Examining the Data
44
5- The default printing of ` DataFrame ` objects only includes a sample of rows and columns that fits on screen:
5+ The default printing of ` DataFrame ` objects only includes a sample of rows and
6+ columns that fits on screen:
67
78``` jldoctest dataframe
89julia> using DataFrames
@@ -85,9 +86,12 @@ julia> DataFrame(a = 1:2, b = [1.0, missing],
8586we can observe that:
8687
8788* the first column ` :a ` can hold elements of type ` Int64 ` ;
88- * the second column ` :b ` can hold ` Float64 ` or ` Missing ` , which is indicated by ` ? ` printed after the name of type;
89- * the third column ` :c ` can hold categorical data; here we notice ` … ` , which indicates that the actual name of the type was long and got truncated;
90- * the type information in fourth column ` :d ` presents a situation where the name is both truncated and the type allows ` Missing ` .
89+ * the second column ` :b ` can hold ` Float64 ` or ` Missing ` , which is indicated by
90+ ` ? ` printed after the name of type;
91+ * the third column ` :c ` can hold categorical data; here we notice ` … ` , which
92+ indicates that the actual name of the type was long and got truncated;
93+ * the type information in fourth column ` :d ` presents a situation where the name
94+ is both truncated and the type allows ` Missing ` .
9195
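When the printed header truncates a type name with `…`, the full element types can be queried directly. The following is a minimal sketch on a small made-up data frame (`df_types` is a hypothetical name, not taken from the manual); it assumes only `eltype`, `eachcol`, and Base's `nonmissingtype`:

```julia
using DataFrames

# Hypothetical data frame used only for illustration.
df_types = DataFrame(a = 1:2, b = [1.0, missing])

# Full element type of every column, without the truncation applied in printing.
col_types = eltype.(eachcol(df_types))

# The same types with `Missing` stripped, i.e. the part printed before the `?`.
core_types = nonmissingtype.(col_types)
```

A column whose element type differs from its `nonmissingtype` is one that would be printed with a trailing `?`.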
9296## Taking a Subset
9397
@@ -160,7 +164,8 @@ julia> df[[3, 1], [:C]]
160164 2 │ 1
161165```
162166
163- Do note that ` df[!, [:A]] ` and ` df[:, [:A]] ` return a ` DataFrame ` object, while ` df[!, :A] ` and ` df[:, :A] ` return a vector:
167+ Do note that ` df[!, [:A]] ` and ` df[:, [:A]] ` return a ` DataFrame ` object, while
168+ ` df[!, :A] ` and ` df[:, :A] ` return a vector:
164169
165170``` jldoctest dataframe
166171julia> df[!, [:A]]
@@ -222,7 +227,8 @@ that a single column vector should be extracted. Note that in the first case a
222227vector is required to be passed (not just any iterable), so e.g. `df[:, (:x1,
223228:x2)]` is not allowed, but `df[:, [:x1, :x2]]` is valid.
224229
225- It is also possible to use a regular expression as a selector of columns matching it:
230+ It is also possible to use a regular expression as a selector of columns
231+ matching it:
226232``` jldoctest dataframe
227233julia> df = DataFrame(x1=1, x2=2, y=3)
2282341×3 DataFrame
@@ -294,9 +300,9 @@ julia> df[:, Cols(x -> startswith(x, "x"))] # keep columns whose name starts wit
294300 1 │ 2 3
295301```
296302
297- The following examples show a more complex use of the ` Cols ` selector, which moves all
298- columns whose names match ` r"x" ` regular expression respectively to the front
299- and to the end of the data frame:
303+ The following examples show a more complex use of the ` Cols ` selector, which
304+ moves all columns whose names match ` r"x" ` regular expression respectively to
305+ the front and to the end of the data frame:
300306``` jldoctest dataframe
301307julia> df[:, Cols(r"x", :)]
3023081×4 DataFrame
@@ -313,7 +319,8 @@ julia> df[:, Cols(Not(r"x"), :)]
313319 1 │ 1 4 2 3
314320```
315321
316- The indexing syntax can also be used to select rows based on conditions on variables:
322+ The indexing syntax can also be used to select rows based on conditions on
323+ variables:
317324
318325``` jldoctest dataframe
319326julia> df = DataFrame(A = 1:2:1000, B = repeat(1:10, inner=50), C = 1:500)
@@ -385,7 +392,9 @@ julia> df[(df.A .> 500) .& (300 .< df.C .< 400), :]
385392 99 │ 797 8 399
386393 84 rows omitted
387394```
388- Where a specific subset of values needs to be matched, the ` in() ` function can be applied:
395+
396+ Where a specific subset of values needs to be matched, the ` in() ` function can
397+ be applied:
389398
390399``` jldoctest dataframe
391400julia> df[in.(df.A, Ref([1, 5, 601])), :]
@@ -409,19 +418,87 @@ a function object that tests whether each value belongs to the subset
409418
410419 The only indexing situations where data frames will **not** return a copy are:
411420
412- - when a `!` is placed in the first indexing position (`df[!, :A]`, or `df[!, [:A, :B]]`),
421+ - when a `!` is placed in the first indexing position
422+ (`df[!, :A]`, or `df[!, [:A, :B]]`),
413423 - when using `.` (`getproperty`) notation (`df.A`),
414424 - when a single row is selected using an integer (`df[1, [:A, :B]]`),
415425 - when `view` or `@view` is used (e.g. `@view df[1:3, :A]`).
416426
417427 More details on copies, views, and references can be found
418428 in the [`getindex` and `view`](@ref) section.
419429
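The difference between copying and non-copying access can be checked with `===` and by mutating the result. This is a small sketch on a made-up data frame (`df_alias` and the variable names are hypothetical):

```julia
using DataFrames

# Hypothetical data frame used only for illustration.
df_alias = DataFrame(A = [1, 2, 3], B = [4, 5, 6])

no_copy = df_alias[!, :A]    # the stored column itself
copied = df_alias[:, :A]     # a freshly allocated copy

copied[1] = 100              # does not affect df_alias
no_copy[2] = 200             # writes through to df_alias

v = @view df_alias[1:2, :A]  # a view into the stored column
v[1] = -1                    # also writes through to df_alias
```

After these assignments `df_alias.A` reflects the writes made through `no_copy` and `v`, but not the one made through `copied`.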
430+ ### Subsetting functions
431+
432+ An alternative approach to row subsetting in a data frame is to use
433+ the [ ` subset ` ] ( @ref ) function, or the [ ` subset! ` ] ( @ref ) function,
434+ which is its in-place variant.
435+
436+ These functions take a data frame as their first argument. The
437+ following positional arguments (one or more) are filtering condition
438+ specifications that must be jointly met. Each condition should be passed as a
439+ ` Pair ` consisting of source column(s) and a function specifying the filtering
440+ condition taking this or these column(s) as arguments:
441+
442+ ``` jldoctest dataframe
443+ julia> subset(df, :A => a -> a .< 10, :C => c -> isodd.(c))
444+ 3×3 DataFrame
445+ Row │ A B C
446+ │ Int64 Int64 Int64
447+ ─────┼─────────────────────
448+ 1 │ 1 1 1
449+ 2 │ 5 1 3
450+ 3 │ 9 1 5
451+ ```
452+
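A condition may also depend on several source columns at once. The sketch below uses a made-up data frame (`df_demo` is a hypothetical name) and shows the same condition written once on whole columns and once per row with `ByRow`:

```julia
using DataFrames

# Hypothetical data frame used only for illustration.
df_demo = DataFrame(A = [1, 5, 9, 13], C = [3, 4, 10, 2])

# Two source columns are passed to the function as two positional arguments;
# the function must return one Bool per row.
wider = subset(df_demo, [:A, :C] => ((a, c) -> a .< c))

# `ByRow` lets the condition be written for a single row instead.
same = subset(df_demo, [:A, :C] => ByRow((a, c) -> a < c))
```

Both calls keep the rows in which `A` is smaller than `C`.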
453+ It is a frequent situation that ` missing ` values might be present in the
454+ filtering columns, which could then lead the filtering condition to return
455+ ` missing ` instead of the expected ` true ` or ` false ` . In order
456+ to handle this situation one can either use the ` coalesce ` function or pass
457+ the ` skipmissing=true ` keyword argument to ` subset ` . Here is an example:
458+
459+ ``` jldoctest dataframe
460+ julia> df = DataFrame(x=[1, 2, missing, 4])
461+ 4×1 DataFrame
462+ Row │ x
463+ │ Int64?
464+ ─────┼─────────
465+ 1 │ 1
466+ 2 │ 2
467+ 3 │ missing
468+ 4 │ 4
469+
470+ julia> subset(df, :x => x -> coalesce.(iseven.(x), false))
471+ 2×1 DataFrame
472+ Row │ x
473+ │ Int64?
474+ ─────┼────────
475+ 1 │ 2
476+ 2 │ 4
477+
478+ julia> subset(df, :x => x -> iseven.(x), skipmissing=true)
479+ 2×1 DataFrame
480+ Row │ x
481+ │ Int64?
482+ ─────┼────────
483+ 1 │ 2
484+ 2 │ 4
485+ ```
486+
487+ Additionally, DataFrames.jl extends the [ ` filter ` ] ( @ref ) and [ ` filter! ` ] ( @ref )
488+ functions provided in Julia Base, which can also be used to subset a data frame.
489+ Please refer to their documentation for details.
490+
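As a rough illustration of `filter` on a data frame, here is a minimal sketch with a made-up data frame (`df_f` is a hypothetical name). Unlike `subset`, the predicate is the first argument and is applied row by row, so no broadcasting is needed:

```julia
using DataFrames

# Hypothetical data frame used only for illustration.
df_f = DataFrame(x = [1, 2, 3, 4], y = ["a", "b", "c", "d"])

# The predicate comes first and receives one value of `:x` per row.
evens = filter(:x => iseven, df_f)

# With several source columns the predicate receives one value per column.
picked = filter([:x, :y] => ((x, y) -> x > 2 && y != "d"), df_f)
```

`filter!` accepts the same arguments and removes the non-matching rows from the data frame in place.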
491+ It is worth mentioning that [ ` subset ` ] ( @ref ) was designed to be
492+ consistent with how column transformations are specified in functions like
493+ [ ` combine ` ] ( @ref ) , [ ` select ` ] ( @ref ) , and [ ` transform ` ] ( @ref ) . Examples of column
494+ transformations accepted by these functions are provided in the following
495+ section.
496+
420497### Selecting and transforming columns
421498
422499You can also use the [ ` select ` ] ( @ref ) /[ ` select! ` ] ( @ref ) and
423- [ ` transform ` ] ( @ref ) /[ ` transform! ` ] ( @ref ) functions to select, rename and transform
424- columns in a data frame.
500+ [ ` transform ` ] ( @ref ) /[ ` transform! ` ] ( @ref ) functions to select, rename and
501+ transform columns in a data frame.
425502
426503The ` select ` function creates a new data frame:
427504``` jldoctest dataframe
@@ -538,11 +615,12 @@ julia> df
538615 2 │ 4 6
539616```
540617
541- ` transform ` and ` transform! ` functions work identically to ` select ` and ` select! ` with the only difference that
542- they retain all columns that are present in the source data frame. Here are some more advanced examples.
618+ ` transform ` and ` transform! ` functions work identically to ` select ` and
619+ ` select! ` with the only difference that they retain all columns that are present
620+ in the source data frame. Here are some more advanced examples.
543621
544- First we show how to generate a column that is a sum of all other columns in the data frame
545- using the ` All() ` selector:
622+ First we show how to generate a column that is a sum of all other columns in the
623+ data frame using the ` All() ` selector:
546624
547625``` jldoctest dataframe
548626julia> df = DataFrame(x1=[1, 2], x2=[3, 4], y=[5, 6])
@@ -561,7 +639,10 @@ julia> transform(df, All() => +)
561639 1 │ 1 3 5 9
562640 2 │ 2 4 6 12
563641```
564- Using the ` ByRow ` wrapper, we can easily compute for each row the name of column with the highest score:
642+
643+ Using the ` ByRow ` wrapper, we can easily compute for each row the name of the
644+ column with the highest score:
645+
565646```
566647julia> using Random
567648
@@ -599,8 +680,10 @@ julia> transform(df, AsTable(:) => ByRow(argmax) => :prediction)
599680 9 │ 0.251662 0.287702 0.0856352 b
600681 10 │ 0.986666 0.859512 0.553206 a
601682```
602- In the following, most complex, example below we compute row-wise sum, number of elements, and mean,
603- while ignoring missing values.
683+
684+ In the following, most complex example, we compute the row-wise sum, number of
685+ elements, and mean, while ignoring missing values.
686+
604687```
605688julia> using Statistics
606689
@@ -628,17 +711,21 @@ julia> transform(df, AsTable(:) .=>
628711```
629712
630713While the DataFrames.jl package provides basic data manipulation capabilities,
631- users are encouraged to use querying frameworks for more convenient and powerful operations:
714+ users are encouraged to use querying frameworks for more convenient and powerful
715+ operations:
632716- the [ Query.jl] ( https://github.com/davidanthoff/Query.jl ) package provides a
633- [ LINQ] ( https://en.wikipedia.org/wiki/Language_Integrated_Query ) -like interface to a large number of data sources
717+ [ LINQ] ( https://en.wikipedia.org/wiki/Language_Integrated_Query ) -like interface
718+ to a large number of data sources
634719- the [ DataFramesMeta.jl] ( https://github.com/JuliaStats/DataFramesMeta.jl )
635- package provides interfaces similar to LINQ and [ dplyr] ( https://dplyr.tidyverse.org )
720+ package provides interfaces similar to LINQ and
721+ [ dplyr] ( https://dplyr.tidyverse.org )
636722
637723See the [ Data manipulation frameworks] ( @ref ) section for more information.
638724
639725## Summarizing Data
640726
641- The ` describe ` function returns a data frame summarizing the elementary statistics and information about each column:
727+ The ` describe ` function returns a data frame summarizing the elementary
728+ statistics and information about each column:
642729
643730``` jldoctest dataframe
644731julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
@@ -660,8 +747,10 @@ julia> describe(df)
660747 2 │ B F M 0 String
661748```
662749
663- If you are interested in describing only a subset of columns then the easiest way
664- to do it is to pass a subset of an original data frame to ` describe ` like this:
750+ If you are interested in describing only a subset of columns then the easiest
751+ way to do it is to pass a subset of an original data frame to ` describe ` like
752+ this:
753+
665754``` jldoctest dataframe
666755julia> describe(df[!, [:A]])
6677561×7 DataFrame
@@ -671,15 +760,19 @@ julia> describe(df[!, [:A]])
671760 1 │ A 2.5 1 2.5 4 0 Int64
672761```
673762
674- Of course, one can also compute descriptive statistics directly on individual columns:
763+ Of course, one can also compute descriptive statistics directly on individual
764+ columns:
765+
675766``` jldoctest dataframe
676767julia> using Statistics
677768
678769julia> mean(df.A)
6797702.5
680771```
681772
682- We can also apply a function to each column of a ` DataFrame ` using ` combine ` . For example:
773+ We can also apply a function to each column of a ` DataFrame ` using ` combine ` .
774+ For example:
775+
683776``` jldoctest dataframe
684777julia> df = DataFrame(A = 1:4, B = 4.0:-1.0:1.0)
6857784×2 DataFrame
@@ -706,8 +799,8 @@ julia> combine(df, names(df) .=> sum, names(df) .=> prod)
706799 1 │ 10 10.0 24 24.0
707800```
708801
709- If you would prefer the result to have the same number of rows as the source data
710- frame use ` select ` instead of ` combine ` .
802+ If you would prefer the result to have the same number of rows as the source
803+ data frame use ` select ` instead of ` combine ` .
711804
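The difference in row counts can be seen side by side. This is a small sketch on a made-up data frame (`df_s` is a hypothetical name); with `select`, the scalar returned by `sum` is repeated for every row of the source:

```julia
using DataFrames

# Hypothetical data frame used only for illustration.
df_s = DataFrame(A = 1:4, B = 4.0:-1.0:1.0)

collapsed = combine(df_s, names(df_s) .=> sum)  # one row holding the sums
expanded = select(df_s, names(df_s) .=> sum)    # four rows, sums repeated
```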
712805## Handling of Columns Stored in a ` DataFrame `
713806
@@ -731,8 +824,8 @@ julia> df2.A === df.A
731824false
732825```
733826
734- On the other hand, in-place functions, whose names end with ` ! ` , may mutate the column vectors of the
735- ` DataFrame ` they take as an argument, for example:
827+ On the other hand, in-place functions, whose names end with ` ! ` , may mutate the
828+ column vectors of the ` DataFrame ` they take as an argument, for example:
736829
737830``` jldoctest dataframe
738831julia> x = [3, 1, 2];
817910Note that a column obtained from a ` DataFrame ` using one of these methods should
818911not be mutated without caution.
819912
820- The exact rules of handling columns of a ` DataFrame ` are explained in
821- [ The design of handling of columns of a ` DataFrame ` ] (@ref man-columnhandling) section of the manual.
913+ The exact rules of handling columns of a ` DataFrame ` are explained in [ The
914+ design of handling of columns of a ` DataFrame ` ] (@ref man-columnhandling) section
915+ of the manual.
822916
823917
824918## Replacing Data
@@ -836,7 +930,8 @@ Replacement operations affecting a single column can be performed using `replace
836930``` jldoctest replace
837931julia> using DataFrames
838932
839- julia> df = DataFrame(a = ["a", "None", "b", "None"], b = 1:4, c = ["None", "j", "k", "h"], d = ["x", "y", "None", "z"])
933+ julia> df = DataFrame(a = ["a", "None", "b", "None"], b = 1:4,
934+ c = ["None", "j", "k", "h"], d = ["x", "y", "None", "z"])
8409354×4 DataFrame
841936 Row │ a b c d
842937 │ String Int64 String String