Commit 04287cd

docs: rewrite the doc entirely
Co-authored-by: Billy Lanchantin <[email protected]>
1 parent b6cbcd3 commit 04287cd

1 file changed

+109
-80
lines changed

lib/explorer/data_frame.ex

Lines changed: 109 additions & 80 deletions
@@ -5455,98 +5455,127 @@ defmodule Explorer.DataFrame do
54555455
54565456
## Examples
54575457
5458-
Note how the dates in `gdp` and `population` don’t quite match. If we join them using
5459-
`join_asof` and `strategy: :backward`, then each date from `population` which doesn’t have an
5460-
exact match is matched with the closest earlier date from `gdp`:
5458+
As a reminder, let's start with an example of a `:left` join:
54615459
5462-
iex> gdp = Explorer.DataFrame.new(
5463-
...> date: [~D[2016-01-01], ~D[2017-01-01], ~D[2018-01-01], ~D[2019-01-01],~D[2020-01-01]],
5464-
...> gdp: [4164, 4411, 4566, 4696, 4827]
5465-
...> )
5466-
iex> population = Explorer.DataFrame.new(
5467-
...> date: [~D[2016-03-01], ~D[2018-08-01], ~D[2019-01-01]],
5468-
...> population: [82.19, 82.66, 83.12]
5469-
...> )
5470-
iex> Explorer.DataFrame.join_asof(population, gdp, strategy: :backward)
5471-
#Explorer.DataFrame<
5472-
Polars[3 x 3]
5473-
date date [2016-03-01, 2018-08-01, 2019-01-01]
5474-
population f64 [82.19, 82.66, 83.12]
5475-
gdp s64 [4164, 4566, 4696]
5476-
>
5460+
iex> alias Explorer.DataFrame, as: DF
5461+
iex> lhs = DF.new(number: [10, 20, 30], upper: ["A", "B", "C"])
5462+
iex> rhs = DF.new(number: [10, 20], lower: ["x", "y"])
5463+
iex> lhs |> DF.join(rhs, on: "number", how: :left) |> DF.to_table_string()
5464+
\"\"\"
5465+
+---------------------------------------------+
5466+
| Explorer DataFrame: [rows: 3, columns: 3] |
5467+
+-------------+---------------+---------------+
5468+
| number | upper | lower |
5469+
| <s64> | <string> | <string> |
5470+
+=============+===============+===============+
5471+
| 10 | A | x |
5472+
| 20 | B | y |
5473+
| 30 | C | nil |
5474+
+-------------+---------------+---------------+
5475+
\"\"\"
54775476
5478-
Note how:
5477+
Even though `rhs` has no corresponding row where `rhs["number"] == 30`, the
5478+
resulting join preserves that row from `lhs` because, per the definition of a
5479+
`:left` join, all rows from `lhs` must remain after the join.
54795480
5480-
* date `2016-03-01` from `population` is matched with `2016-01-01` from `gdp`;
5481-
* date `2018-08-01` from `population` is matched with `2018-01-01` from `gdp`.
5481+
The `:asof` join works in a similar way. All rows of `lhs` will be preserved,
5482+
but the matching criterion for the `:on` column is more flexible than checking
5483+
for strict equality. For example:
54825484
5483-
If we instead use `strategy: :forward`, then each date from `population` which doesn’t have an
5484-
exact match is matched with the closest later date from `gdp`:
5485+
iex> alias Explorer.DataFrame, as: DF
5486+
iex> lhs = DF.new(number: [10, 20, 30], upper: ["A", "B", "C"])
5487+
iex> rhs2 = DF.new(number: [ 1, 11, 21], lower: ["x", "y", "z"])
5488+
iex> lhs |> DF.join_asof(rhs2, strategy: :backward) |> DF.to_table_string()
5489+
\"\"\"
5490+
+---------------------------------------------+
5491+
| Explorer DataFrame: [rows: 3, columns: 3] |
5492+
+-------------+---------------+---------------+
5493+
| number | upper | lower |
5494+
| <s64> | <string> | <string> |
5495+
+=============+===============+===============+
5496+
| 10 | A | x |
5497+
| 20 | B | y |
5498+
| 30 | C | z |
5499+
+-------------+---------------+---------------+
5500+
\"\"\"
54855501
5486-
iex> gdp = Explorer.DataFrame.new(
5487-
...> date: [~D[2016-01-01], ~D[2017-01-01], ~D[2018-01-01], ~D[2019-01-01],~D[2020-01-01]],
5488-
...> gdp: [4164, 4411, 4566, 4696, 4827]
5489-
...> )
5490-
iex> population = Explorer.DataFrame.new(
5491-
...> date: [~D[2016-03-01], ~D[2018-08-01], ~D[2019-01-01]],
5492-
...> population: [82.19, 82.66, 83.12]
5493-
...> )
5494-
iex> Explorer.DataFrame.join_asof(population, gdp, strategy: :forward)
5495-
#Explorer.DataFrame<
5496-
Polars[3 x 3]
5497-
date date [2016-03-01, 2018-08-01, 2019-01-01]
5498-
population f64 [82.19, 82.66, 83.12]
5499-
gdp s64 [4411, 4696, 4696]
5500-
>
5502+
Here we've used `strategy: :backward`. This indicates that the matching
5503+
criterion is, for each row in `lhs`, to look for the closest row in `rhs2`
5504+
such that `rhs2["number"] <= lhs["number"]`.
55015505
5502-
Note how:
5506+
`strategy: :forward` works similarly, except that the condition is `rhs2["number"] >= lhs["number"]`:
55035507
5504-
* date `2016-03-01` from `population` is matched with `2017-01-01` from `gdp`;
5505-
* date `2018-08-01` from `population` is matched with `2019-01-01` from `gdp`.
5508+
iex> alias Explorer.DataFrame, as: DF
5509+
iex> lhs = DF.new(number: [10, 20, 30], upper: ["A", "B", "C"])
5510+
iex> rhs2 = DF.new(number: [ 1, 11, 21], lower: ["x", "y", "z"])
5511+
iex> lhs |> DF.join_asof(rhs2, strategy: :forward) |> DF.to_table_string()
5512+
\"\"\"
5513+
+---------------------------------------------+
5514+
| Explorer DataFrame: [rows: 3, columns: 3] |
5515+
+-------------+---------------+---------------+
5516+
| number | upper | lower |
5517+
| <s64> | <string> | <string> |
5518+
+=============+===============+===============+
5519+
| 10 | A | y |
5520+
| 20 | B | z |
5521+
| 30 | C | nil |
5522+
+-------------+---------------+---------------+
5523+
\"\"\"
55065524
5507-
Finally, `strategy: :nearest` gives us a mix of the two results above, as each date from
5508-
`population` which doesn’t have an exact match is matched with the closest date from `gdp`,
5509-
regardless of whether it’s earlier or later:
5525+
Again, all rows from `lhs` were preserved despite there being no row in `rhs2`
5526+
such that `rhs2["number"] >= 30`.
55105527
5511-
iex> gdp = Explorer.DataFrame.new(
5512-
...> date: [~D[2016-01-01], ~D[2017-01-01], ~D[2018-01-01], ~D[2019-01-01],~D[2020-01-01]],
5513-
...> gdp: [4164, 4411, 4566, 4696, 4827]
5514-
...> )
5515-
iex> population = Explorer.DataFrame.new(
5516-
...> date: [~D[2016-03-01], ~D[2018-08-01], ~D[2019-01-01]],
5517-
...> population: [82.19, 82.66, 83.12]
5518-
...> )
5519-
iex> Explorer.DataFrame.join_asof(population, gdp, strategy: :nearest)
5520-
#Explorer.DataFrame<
5521-
Polars[3 x 3]
5522-
date date [2016-03-01, 2018-08-01, 2019-01-01]
5523-
population f64 [82.19, 82.66, 83.12]
5524-
gdp s64 [4164, 4696, 4696]
5525-
>
5528+
The last strategy, `:nearest`, combines `:backward` and `:forward`: it finds
5529+
both candidates, then picks whichever match is closer:
55265530
5527-
The `by` argument allows left-joining on another column (or columns) first,
5528-
before the asof join. In this example we left-join by `country` first, then
5529-
asof join by `date`, as above.
5531+
iex> alias Explorer.DataFrame, as: DF
5532+
iex> lhs = DF.new(number: [10, 20, 30], upper: ["A", "B", "C"])
5533+
iex> rhs2 = DF.new(number: [ 1, 11, 21], lower: ["x", "y", "z"])
5534+
iex> lhs |> DF.join_asof(rhs2, strategy: :nearest) |> DF.to_table_string()
5535+
\"\"\"
5536+
+---------------------------------------------+
5537+
| Explorer DataFrame: [rows: 3, columns: 3] |
5538+
+-------------+---------------+---------------+
5539+
| number | upper | lower |
5540+
| <s64> | <string> | <string> |
5541+
+=============+===============+===============+
5542+
| 10 | A | y |
5543+
| 20 | B | z |
5544+
| 30 | C | z |
5545+
+-------------+---------------+---------------+
5546+
\"\"\"
55305547
5531-
iex> gdp = Explorer.DataFrame.new(
5532-
...> date: [~D[2016-01-01], ~D[2017-01-01], ~D[2018-01-01], ~D[2019-01-01], ~D[2016-01-01], ~D[2017-01-01], ~D[2018-01-01], ~D[2019-01-01]],
5533-
...> country: ["Germany", "Germany", "Germany", "Germany", "Netherlands", "Netherlands", "Netherlands", "Netherlands"],
5534-
...> gdp: [4164, 4411, 4566, 4696, 784, 833, 914, 1000]
5535-
...> )
5536-
iex> population = Explorer.DataFrame.new(
5537-
...> date: [~D[2016-03-01], ~D[2018-08-01], ~D[2016-03-01], ~D[2018-08-01]],
5538-
...> country: ["Germany", "Germany", "Netherlands", "Netherlands"],
5539-
...> population: [82.19, 82.66, 17.08, 17.18]
5540-
...> )
5541-
iex> Explorer.DataFrame.join_asof(population, gdp, by: :country, on: :date, strategy: :nearest)
5542-
#Explorer.DataFrame<
5543-
Polars[4 x 4]
5544-
date date [2016-03-01, 2018-08-01, 2016-03-01, 2018-08-01]
5545-
country string ["Germany", "Germany", "Netherlands", "Netherlands"]
5546-
population f64 [82.19, 82.66, 17.08, 17.18]
5547-
gdp s64 [4164, 4696, 784, 1000]
5548-
>
5548+
Notice how the row `%{"number" => 21, "lower" => "z"}` from `rhs2` was matched
5549+
twice since it was the nearest for both `%{"number" => 20, ...}` and
5550+
`%{"number" => 30, ...}`.
5551+
5552+
The `:by` option allows for additional matching criteria by also requiring
5553+
that matching rows from both DataFrames are strictly equal in the `:by`
5554+
column(s):
5555+
5556+
iex> alias Explorer.DataFrame, as: DF
5557+
iex> lhs_color = DF.new(number: [10, 20, 30], color: ["red", "blue", "blue"])
5558+
iex> rhs_blue = DF.new(number: [ 1, 11, 21], color: ["blue", "blue", "blue"], lower: ["x", "y", "z"])
5559+
iex> lhs_color |> DF.join_asof(rhs_blue, on: "number", by: "color") |> DF.to_table_string()
5560+
\"\"\"
5561+
+---------------------------------------------+
5562+
| Explorer DataFrame: [rows: 3, columns: 3] |
5563+
+-------------+---------------+---------------+
5564+
| number | color | lower |
5565+
| <s64> | <string> | <string> |
5566+
+=============+===============+===============+
5567+
| 10 | red | nil |
5568+
| 20 | blue | y |
5569+
| 30 | blue | z |
5570+
+-------------+---------------+---------------+
5571+
\"\"\"
55495572
5573+
This is somewhat like grouping the DataFrames by the `:by` column(s) first,
5574+
then checking for an "asof" match within each group only. In the example, rows
5575+
`%{"number" => 20, ...}` and `%{"number" => 30, ...}` in `lhs_color` match as
5576+
before because all rows in `rhs_blue` have `%{color: "blue", ...}`. But the
5577+
row `%{"number" => 10, ...}` gets no match because the `%{color: "red", ...}`
5578+
"group" has no rows in `rhs_blue`.
55505579
"""
55515580
@doc type: :multi
55525581
@spec join_asof(left :: DataFrame.t(), right :: DataFrame.t(), opts :: Keyword.t()) ::
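
The three strategies described in the new docstring can be sketched in plain Elixir, without Explorer, to make the matching rule explicit. This is only an illustration under stated assumptions: `AsofSketch`, `match/3`, `pick/3`, and `closest/2` are hypothetical names, the right-hand side is modeled as a list of `{key, value}` tuples, and tie-breaking for `:nearest` is not modeled and may differ from what the Polars backend does.

```elixir
defmodule AsofSketch do
  # Hypothetical helper, not part of Explorer: for each left key, pick the
  # value of the matching {key, value} pair from `right` (or nil if none).
  def match(left_keys, right, strategy) do
    Enum.map(left_keys, fn key -> {key, pick(key, right, strategy)} end)
  end

  # :backward -> closest right key that is <= the left key
  defp pick(key, right, :backward) do
    right
    |> Enum.filter(fn {k, _v} -> k <= key end)
    |> closest(key)
  end

  # :forward -> closest right key that is >= the left key
  defp pick(key, right, :forward) do
    right
    |> Enum.filter(fn {k, _v} -> k >= key end)
    |> closest(key)
  end

  # :nearest -> closest right key in either direction
  defp pick(key, right, :nearest), do: closest(right, key)

  # No candidate left after filtering: the left row is kept with a nil match.
  defp closest([], _key), do: nil

  defp closest(candidates, key) do
    {_k, value} = Enum.min_by(candidates, fn {k, _v} -> abs(k - key) end)
    value
  end
end

# Same keys as `lhs` and `rhs2` in the docstring examples.
right = [{1, "x"}, {11, "y"}, {21, "z"}]

IO.inspect(AsofSketch.match([10, 20, 30], right, :backward))
# [{10, "x"}, {20, "y"}, {30, "z"}]
IO.inspect(AsofSketch.match([10, 20, 30], right, :forward))
# [{10, "y"}, {20, "z"}, {30, nil}]
IO.inspect(AsofSketch.match([10, 20, 30], right, :nearest))
# [{10, "y"}, {20, "z"}, {30, "z"}]
```

The results line up with the `lower` column in the three tables above, including the `nil` for `30` under `:forward`, where no right key is greater than or equal to it.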
