@@ -5455,98 +5455,127 @@ defmodule Explorer.DataFrame do
54555455
54565456 ## Examples
54575457
5458- Note how the dates in `gdp` and `population` don’t quite match. If we join them using
5459- `join_asof` and `strategy: :backward`, then each date from `population` which doesn’t have an
5460- exact match is matched with the closest earlier date from `gdp`:
5458+ As a reminder, let's start with an example of a `:left` join:
54615459
5462- iex> gdp = Explorer.DataFrame.new(
5463- ...> date: [~D[2016-01-01], ~D[2017-01-01], ~D[2018-01-01], ~D[2019-01-01],~D[2020-01-01]],
5464- ...> gdp: [4164, 4411, 4566, 4696, 4827]
5465- ...> )
5466- iex> population = Explorer.DataFrame.new(
5467- ...> date: [~D[2016-03-01], ~D[2018-08-01], ~D[2019-01-01]],
5468- ...> population: [82.19, 82.66, 83.12]
5469- ...> )
5470- iex> Explorer.DataFrame.join_asof(population, gdp, strategy: :backward)
5471- #Explorer.DataFrame<
5472- Polars[3 x 3]
5473- date date [2016-03-01, 2018-08-01, 2019-01-01]
5474- population f64 [82.19, 82.66, 83.12]
5475- gdp s64 [4164, 4566, 4696]
5476- >
5460+ iex> alias Explorer.DataFrame, as: DF
5461+ iex> lhs = DF.new(number: [10, 20, 30], upper: ["A", "B", "C"])
5462+ iex> rhs = DF.new(number: [10, 20], lower: ["x", "y"])
5463+ iex> lhs |> DF.join(rhs, on: "number", how: :left) |> DF.to_table_string()
5464+ \"\"\"
5465+ +---------------------------------------------+
5466+ | Explorer DataFrame: [rows: 3, columns: 3] |
5467+ +-------------+---------------+---------------+
5468+ | number | upper | lower |
5469+ | <s64> | <string> | <string> |
5470+ +=============+===============+===============+
5471+ | 10 | A | x |
5472+ | 20 | B | y |
5473+ | 30 | C | nil |
5474+ +-------------+---------------+---------------+
5475+ \"\"\"
54775476
5478- Note how:
5477+ Even though `rhs` has no corresponding row where `rhs["number"] == 30`, the
5478+ resulting join preserves that row from `lhs` because, per the definition of a
5479+ `:left` join, all rows from `lhs` must remain after the join.
54795480
5480- * date `2016-03-01` from `population` is matched with `2016-01-01` from `gdp`;
5481- * date `2018-08-01` from `population` is matched with `2018-01-01` from `gdp`.
5481+ The `:asof` join works in a similar way. All rows of `lhs` will be preserved,
5482+ but the matching criterion for the `:on` column is more flexible than checking
5483+ for strict equality. For example:
54825484
5483- If we instead use `strategy: :forward`, then each date from `population` which doesn’t have an
5484- exact match is matched with the closest later date from `gdp`:
5485+ iex> alias Explorer.DataFrame, as: DF
5486+ iex> lhs = DF.new(number: [10, 20, 30], upper: ["A", "B", "C"])
5487+ iex> rhs2 = DF.new(number: [ 1, 11, 21], lower: ["x", "y", "z"])
5488+ iex> lhs |> DF.join_asof(rhs2, strategy: :backward) |> DF.to_table_string()
5489+ \"\"\"
5490+ +---------------------------------------------+
5491+ | Explorer DataFrame: [rows: 3, columns: 3] |
5492+ +-------------+---------------+---------------+
5493+ | number | upper | lower |
5494+ | <s64> | <string> | <string> |
5495+ +=============+===============+===============+
5496+ | 10 | A | x |
5497+ | 20 | B | y |
5498+ | 30 | C | z |
5499+ +-------------+---------------+---------------+
5500+ \"\"\"
54855501
5486- iex> gdp = Explorer.DataFrame.new(
5487- ...> date: [~D[2016-01-01], ~D[2017-01-01], ~D[2018-01-01], ~D[2019-01-01],~D[2020-01-01]],
5488- ...> gdp: [4164, 4411, 4566, 4696, 4827]
5489- ...> )
5490- iex> population = Explorer.DataFrame.new(
5491- ...> date: [~D[2016-03-01], ~D[2018-08-01], ~D[2019-01-01]],
5492- ...> population: [82.19, 82.66, 83.12]
5493- ...> )
5494- iex> Explorer.DataFrame.join_asof(population, gdp, strategy: :forward)
5495- #Explorer.DataFrame<
5496- Polars[3 x 3]
5497- date date [2016-03-01, 2018-08-01, 2019-01-01]
5498- population f64 [82.19, 82.66, 83.12]
5499- gdp s64 [4411, 4696, 4696]
5500- >
5502+ Here we've used `strategy: :backward`. For each row in `lhs`, this matches
5503+ the row in `rhs2` with the largest `number` such that
5504+ `rhs2["number"] <= lhs["number"]`.
55015505
5502- Note how :
5506+ `strategy: :forward` works similarly, except it matches the smallest `number` in `rhs2` such that `rhs2["number"] >= lhs["number"]`:
55035507
5504- * date `2016-03-01` from `population` is matched with `2017-01-01` from `gdp`;
5505- * date `2018-08-01` from `population` is matched with `2019-01-01` from `gdp`.
5508+ iex> alias Explorer.DataFrame, as: DF
5509+ iex> lhs = DF.new(number: [10, 20, 30], upper: ["A", "B", "C"])
5510+ iex> rhs2 = DF.new(number: [ 1, 11, 21], lower: ["x", "y", "z"])
5511+ iex> lhs |> DF.join_asof(rhs2, strategy: :forward) |> DF.to_table_string()
5512+ \"\"\"
5513+ +---------------------------------------------+
5514+ | Explorer DataFrame: [rows: 3, columns: 3] |
5515+ +-------------+---------------+---------------+
5516+ | number | upper | lower |
5517+ | <s64> | <string> | <string> |
5518+ +=============+===============+===============+
5519+ | 10 | A | y |
5520+ | 20 | B | z |
5521+ | 30 | C | nil |
5522+ +-------------+---------------+---------------+
5523+ \"\"\"
55065524
5507- Finally, `strategy: :nearest` gives us a mix of the two results above, as each date from
5508- `population` which doesn’t have an exact match is matched with the closest date from `gdp`,
5509- regardless of whether it’s earlier or later:
5525+ Again, all rows from `lhs` were preserved, even though `rhs2` has no row
5526+ such that `rhs2["number"] >= 30`.
55105527
5511- iex> gdp = Explorer.DataFrame.new(
5512- ...> date: [~D[2016-01-01], ~D[2017-01-01], ~D[2018-01-01], ~D[2019-01-01],~D[2020-01-01]],
5513- ...> gdp: [4164, 4411, 4566, 4696, 4827]
5514- ...> )
5515- iex> population = Explorer.DataFrame.new(
5516- ...> date: [~D[2016-03-01], ~D[2018-08-01], ~D[2019-01-01]],
5517- ...> population: [82.19, 82.66, 83.12]
5518- ...> )
5519- iex> Explorer.DataFrame.join_asof(population, gdp, strategy: :nearest)
5520- #Explorer.DataFrame<
5521- Polars[3 x 3]
5522- date date [2016-03-01, 2018-08-01, 2019-01-01]
5523- population f64 [82.19, 82.66, 83.12]
5524- gdp s64 [4164, 4696, 4696]
5525- >
5528+ The last strategy, `:nearest`, combines `:backward` and `:forward`: it
5529+ considers both candidates and picks whichever match is closer:
55265530
5527- The `by` argument allows left-joining on another column (or columns) first,
5528- before the asof join. In this example we left-join by `country` first, then
5529- asof join by `date`, as above.
5531+ iex> alias Explorer.DataFrame, as: DF
5532+ iex> lhs = DF.new(number: [10, 20, 30], upper: ["A", "B", "C"])
5533+ iex> rhs2 = DF.new(number: [ 1, 11, 21], lower: ["x", "y", "z"])
5534+ iex> lhs |> DF.join_asof(rhs2, strategy: :nearest) |> DF.to_table_string()
5535+ \"\"\"
5536+ +---------------------------------------------+
5537+ | Explorer DataFrame: [rows: 3, columns: 3] |
5538+ +-------------+---------------+---------------+
5539+ | number | upper | lower |
5540+ | <s64> | <string> | <string> |
5541+ +=============+===============+===============+
5542+ | 10 | A | y |
5543+ | 20 | B | z |
5544+ | 30 | C | z |
5545+ +-------------+---------------+---------------+
5546+ \"\"\"
55305547
5531- iex> gdp = Explorer.DataFrame.new(
5532- ...> date: [~D[2016-01-01], ~D[2017-01-01], ~D[2018-01-01], ~D[2019-01-01], ~D[2016-01-01], ~D[2017-01-01], ~D[2018-01-01], ~D[2019-01-01]],
5533- ...> country: ["Germany", "Germany", "Germany", "Germany", "Netherlands", "Netherlands", "Netherlands", "Netherlands"],
5534- ...> gdp: [4164, 4411, 4566, 4696, 784, 833, 914, 1000]
5535- ...> )
5536- iex> population = Explorer.DataFrame.new(
5537- ...> date: [~D[2016-03-01], ~D[2018-08-01], ~D[2016-03-01], ~D[2018-08-01]],
5538- ...> country: ["Germany", "Germany", "Netherlands", "Netherlands"],
5539- ...> population: [82.19, 82.66, 17.08, 17.18]
5540- ...> )
5541- iex> Explorer.DataFrame.join_asof(population, gdp, by: :country, on: :date, strategy: :nearest)
5542- #Explorer.DataFrame<
5543- Polars[4 x 4]
5544- date date [2016-03-01, 2018-08-01, 2016-03-01, 2018-08-01]
5545- country string ["Germany", "Germany", "Netherlands", "Netherlands"]
5546- population f64 [82.19, 82.66, 17.08, 17.18]
5547- gdp s64 [4164, 4696, 784, 1000]
5548- >
5548+ Notice how the row `%{"number" => 21, "lower" => "z"}` from `rhs2` was matched
5549+ twice, since it was the nearest for both `%{"number" => 20, ...}` and
5550+ `%{"number" => 30, ...}`.
5551+
5552+ The `:by` option allows for additional matching criteria by also requiring
5553+ that matching rows from both DataFrames are strictly equal in the `:by`
5554+ column(s):
5555+
5556+ iex> alias Explorer.DataFrame, as: DF
5557+ iex> lhs_color = DF.new(number: [10, 20, 30], color: ["red", "blue", "blue"])
5558+ iex> rhs_blue = DF.new(number: [ 1, 11, 21], color: ["blue", "blue", "blue"], lower: ["x", "y", "z"])
5559+ iex> lhs_color |> DF.join_asof(rhs_blue, on: "number", by: "color") |> DF.to_table_string()
5560+ \"\"\"
5561+ +---------------------------------------------+
5562+ | Explorer DataFrame: [rows: 3, columns: 3] |
5563+ +-------------+---------------+---------------+
5564+ | number | color | lower |
5565+ | <s64> | <string> | <string> |
5566+ +=============+===============+===============+
5567+ | 10 | red | nil |
5568+ | 20 | blue | y |
5569+ | 30 | blue | z |
5570+ +-------------+---------------+---------------+
5571+ \"\"\"
55495572
5573+ This is somewhat like grouping the DataFrames by the `:by` column(s) first,
5574+ then checking for an "asof" match within each group only. In the example, rows
5575+ `%{"number" => 20, ...}` and `%{"number" => 30, ...}` in `lhs_color` match as
5576+ before because all rows in `rhs_blue` have `%{"color" => "blue", ...}`. But
5577+ the row `%{"number" => 10, ...}` gets no match because the
5578+ `%{"color" => "red", ...}` "group" has no rows in `rhs_blue`.
55505579 """
55515580 @doc type: :multi
55525581 @spec join_asof(left :: DataFrame.t(), right :: DataFrame.t(), opts :: Keyword.t()) ::
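
As a sanity check on the strategy descriptions in the docstring, the matching rule can be modeled on plain lists of keys. This is only a sketch under stated assumptions: `AsofSketch` and `match/3` are hypothetical names that are not part of Explorer, the code says nothing about how Explorer or Polars implement the join, and the tie-breaking for `:nearest` (the first minimum wins) is an assumption rather than documented behavior.

```elixir
defmodule AsofSketch do
  @moduledoc """
  Plain-list model of the three asof strategies (hypothetical helper,
  not Explorer's implementation).
  """

  # Returns the key from `right_keys` that `left_key` would match, or nil.

  # :backward -> largest right key that is <= the left key.
  def match(left_key, right_keys, :backward) do
    case Enum.filter(right_keys, &(&1 <= left_key)) do
      [] -> nil
      candidates -> Enum.max(candidates)
    end
  end

  # :forward -> smallest right key that is >= the left key.
  def match(left_key, right_keys, :forward) do
    case Enum.filter(right_keys, &(&1 >= left_key)) do
      [] -> nil
      candidates -> Enum.min(candidates)
    end
  end

  # :nearest -> right key with the smallest absolute distance.
  # Assumption: on a tie, the first such key in the list wins.
  def match(_left_key, [], :nearest), do: nil

  def match(left_key, right_keys, :nearest) do
    Enum.min_by(right_keys, &abs(&1 - left_key))
  end
end

lhs = [10, 20, 30]
rhs2 = [1, 11, 21]

for strategy <- [:backward, :forward, :nearest] do
  IO.inspect({strategy, Enum.map(lhs, &AsofSketch.match(&1, rhs2, strategy))})
end

# prints:
# {:backward, [1, 11, 21]}
# {:forward, [11, 21, nil]}
# {:nearest, [11, 21, 21]}
```

With `rhs2` keys `1`, `11`, `21` mapping to `"x"`, `"y"`, `"z"`, these results line up with the `lower` columns in the three doctest tables: `x, y, z` for `:backward`, `y, z, nil` for `:forward`, and `y, z, z` for `:nearest`.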