-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Description
Background and motivation
Background
LINQ has a Join operator, which, like its SQL INNER JOIN counterpart, correlates elements of two sequences based on matching keys; the LINQ join implementation internally creates a Lookup for the inner sequence, and then loops over the outer sequence, doing a lookup for the matching inner elements. In SQL database parlance, this is known as the hash join strategy (SQL Server docs, PostgreSQL docs as well as this useful post).
In addition to the above, SQL also has LEFT JOIN, which returns outer elements even if there's no corresponding inner ones; LINQ, in contrast, lacks this operator. The LINQ conceptual documentation shows how to combine existing operators to achieve a left join:
var query = students
.GroupJoin(departments, student => student.DepartmentID, department => department.ID, (student, departmentList) => new { student, subgroup = departmentList })
.SelectMany(
joinedSet => joinedSet.subgroup.DefaultIfEmpty(),
(student, department) => new
{
student.student.FirstName,
student.student.LastName,
Department = department.Name
});There are two issues with the above suggestion:
- It's complicated, requiring combining 3 different LINQ operators in a specific way, and is easy to accidentally get wrong. Many EF users have complained about the complexity of this construct for expressing a simple SQL LEFT JOIN.
- It is inefficient - the combination of operators adds significant overhead compared to a single operator using a hash join (as Join does).
Proposal
This proposes introducing a 1st-class LeftJoin operator, which operates very similar to Join, except that it returns outer elements for which no inner element could be correlated. Aside from being much simpler to use than GroupJoin/SelectMany, it would also simply use Lookup internally - just like Join - and would therefore be much faster.
An initial implementation shows significant performance improvement compared to GroupJoin/SelectMany; LeftJoin is always faster than the equivalent GroupJoin/SelectMany construct, since GroupJoin itself constructs and uses a Lookup internally to implement an inner join internally - just like LeftJoin does - but also adds additional work on top.
BenchmarkDotNet v0.14.0, macOS Sequoia 15.1.1 (24B91) [Darwin 24.1.0]
Apple M2 Max, 1 CPU, 12 logical and 12 physical cores
.NET SDK 9.0.100
[Host] : .NET 9.0.0 (9.0.24.52809), Arm64 RyuJIT AdvSIMD
DefaultJob : .NET 9.0.0 (9.0.24.52809), Arm64 RyuJIT AdvSIMD
| Method | InnersPerOuter | OuterCount | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LeftJoin | 1 | 1 | 97.73 ns | 0.402 ns | 0.376 ns | 0.64 | 0.00 | 0.0573 | - | - | 480 B | 0.71 |
| GroupJoin_SelectMany | 1 | 1 | 153.02 ns | 0.370 ns | 0.309 ns | 1.00 | 0.00 | 0.0811 | - | - | 680 B | 1.00 |
| LeftJoin | 1 | 10 | 488.66 ns | 5.764 ns | 5.391 ns | 0.57 | 0.01 | 0.2031 | - | - | 1704 B | 0.57 |
| GroupJoin_SelectMany | 1 | 10 | 856.40 ns | 4.305 ns | 4.027 ns | 1.00 | 0.01 | 0.3567 | - | - | 2984 B | 1.00 |
| LeftJoin | 1 | 100 | 4,814.61 ns | 19.884 ns | 17.627 ns | 0.59 | 0.00 | 1.7090 | 0.0534 | - | 14344 B | 0.54 |
| GroupJoin_SelectMany | 1 | 100 | 8,173.26 ns | 37.560 ns | 35.134 ns | 1.00 | 0.01 | 3.1586 | 0.1068 | - | 26424 B | 1.00 |
| LeftJoin | 1 | 1000 | 47,619.93 ns | 227.566 ns | 201.731 ns | 0.59 | 0.00 | 16.2964 | 3.4790 | - | 136728 B | 0.53 |
| GroupJoin_SelectMany | 1 | 1000 | 80,828.42 ns | 323.010 ns | 302.144 ns | 1.00 | 0.01 | 30.6396 | 6.5918 | - | 256808 B | 1.00 |
| LeftJoin | 10 | 1 | 300.03 ns | 3.089 ns | 2.580 ns | 0.70 | 0.01 | 0.1316 | - | - | 1104 B | 0.85 |
| GroupJoin_SelectMany | 10 | 1 | 430.06 ns | 3.919 ns | 3.273 ns | 1.00 | 0.01 | 0.1554 | - | - | 1304 B | 1.00 |
| LeftJoin | 10 | 10 | 2,954.84 ns | 28.864 ns | 25.587 ns | 0.87 | 0.01 | 0.9460 | 0.0076 | - | 7944 B | 0.86 |
| GroupJoin_SelectMany | 10 | 10 | 3,397.03 ns | 9.497 ns | 8.419 ns | 1.00 | 0.00 | 1.1024 | 0.0114 | - | 9224 B | 1.00 |
| LeftJoin | 10 | 100 | 31,873.63 ns | 120.088 ns | 106.455 ns | 0.81 | 0.02 | 9.1553 | 0.7324 | - | 76744 B | 0.86 |
| GroupJoin_SelectMany | 10 | 100 | 39,299.04 ns | 743.548 ns | 856.271 ns | 1.00 | 0.03 | 10.5591 | 0.8545 | - | 88824 B | 1.00 |
| LeftJoin | 10 | 1000 | 402,837.52 ns | 5,140.127 ns | 4,808.078 ns | 0.88 | 0.01 | 90.8203 | 30.2734 | - | 760728 B | 0.86 |
| GroupJoin_SelectMany | 10 | 1000 | 457,658.94 ns | 5,117.136 ns | 4,786.572 ns | 1.00 | 0.01 | 104.9805 | 34.6680 | - | 880808 B | 1.00 |
| LeftJoin | 100 | 1 | 1,906.55 ns | 9.026 ns | 8.443 ns | 0.83 | 0.01 | 0.6981 | 0.0019 | - | 5848 B | 0.97 |
| GroupJoin_SelectMany | 100 | 1 | 2,310.49 ns | 29.862 ns | 27.933 ns | 1.00 | 0.02 | 0.7210 | 0.0038 | - | 6048 B | 1.00 |
| LeftJoin | 100 | 10 | 18,861.98 ns | 366.707 ns | 560.001 ns | 0.85 | 0.03 | 6.5918 | 0.2747 | - | 55384 B | 0.98 |
| GroupJoin_SelectMany | 100 | 10 | 22,285.02 ns | 189.605 ns | 177.356 ns | 1.00 | 0.01 | 6.7444 | 0.2747 | - | 56664 B | 1.00 |
| LeftJoin | 100 | 100 | 197,041.60 ns | 2,033.092 ns | 1,802.283 ns | 0.83 | 0.01 | 65.6738 | 15.8691 | - | 551144 B | 0.98 |
| GroupJoin_SelectMany | 100 | 100 | 236,428.77 ns | 1,920.429 ns | 1,702.410 ns | 1.00 | 0.01 | 67.1387 | 16.6016 | - | 563224 B | 1.00 |
| LeftJoin | 100 | 1000 | 2,628,960.79 ns | 22,819.664 ns | 21,345.528 ns | 0.85 | 0.03 | 656.2500 | 316.4063 | - | 5504731 B | 0.98 |
| GroupJoin_SelectMany | 100 | 1000 | 3,113,006.15 ns | 62,259.419 ns | 95,076.717 ns | 1.00 | 0.04 | 671.8750 | 328.1250 | - | 5624811 B | 1.00 |
| LeftJoin | 1000 | 1 | 17,487.96 ns | 203.775 ns | 180.641 ns | 0.84 | 0.01 | 5.8594 | 0.1221 | - | 49056 B | 1.00 |
| GroupJoin_SelectMany | 1000 | 1 | 20,818.61 ns | 251.607 ns | 210.103 ns | 1.00 | 0.01 | 5.8594 | 0.1831 | - | 49256 B | 1.00 |
| LeftJoin | 1000 | 10 | 175,500.86 ns | 2,377.458 ns | 1,856.162 ns | 0.77 | 0.02 | 58.1055 | 11.9629 | - | 487464 B | 1.00 |
| GroupJoin_SelectMany | 1000 | 10 | 228,064.76 ns | 4,414.845 ns | 4,907.089 ns | 1.00 | 0.03 | 58.3496 | 12.6953 | - | 488744 B | 1.00 |
| LeftJoin | 1000 | 100 | 2,092,691.16 ns | 39,429.196 ns | 77,829.392 ns | 0.86 | 0.04 | 582.0313 | 250.0000 | - | 4871947 B | 1.00 |
| GroupJoin_SelectMany | 1000 | 100 | 2,422,638.76 ns | 47,560.763 ns | 84,539.215 ns | 1.00 | 0.05 | 582.0313 | 246.0938 | - | 4884027 B | 1.00 |
| LeftJoin | 1000 | 1000 | 42,875,766.40 ns | 738,934.130 ns | 691,199.444 ns | 0.89 | 0.02 | 5900.0000 | 1700.0000 | 100.0000 | 48712842 B | 1.00 |
| GroupJoin_SelectMany | 1000 | 1000 | 47,966,803.96 ns | 925,858.818 ns | 1,066,220.392 ns | 1.00 | 0.03 | 5888.8889 | 1666.6667 | 111.1111 | 48832891 B | 1.00 |
Benchmark code
[MemoryDiagnoser]
public class LeftJoinBenchmarks
{
[Params(1, 10, 100, 1000, Priority = 1)]
// [Params(1, Priority = 1)]
public int InnersPerOuter { get; set; }
[Params(1, 10, 100, 1000, Priority = 2)]
// [Params(1, Priority = 2)]
public int OuterCount { get; set; }
private Outer[] _outers = null!;
private Inner[] _inners = null!;
class Outer
{
public int Id { get; set; }
public string? OuterPayload { get; set; }
}
class Inner
{
public int Id { get; set; }
public int OuterId { get; set; }
public string? InnerPayload { get; set; }
}
private const int RandomSeed = 42;
[GlobalSetup]
public void Setup()
{
_outers = new Outer[OuterCount];
_inners = new Inner[OuterCount * InnersPerOuter];
var remainingInners = new List<Inner>(OuterCount * InnersPerOuter);
for (var outerId = 0; outerId < OuterCount; outerId++)
{
_outers[outerId] = new Outer { Id = outerId, OuterPayload = $"Outer{outerId}" };
for (var j = 0; j < InnersPerOuter; j++)
{
var innerId = outerId * j + j;
remainingInners.Add(new Inner { Id = innerId, OuterId = outerId, InnerPayload = $"Inner{innerId}" });
}
}
var random = new Random(RandomSeed);
for (var i = 0; i < _inners.Length; i++)
{
var j = random.Next(0, remainingInners.Count);
_inners[i] = remainingInners[j];
remainingInners.RemoveAt(j);
}
Debug.Assert(remainingInners.Count == 0);
}
[Benchmark]
public int LeftJoin()
=> _outers.LeftJoin(_inners, o => o.Id, i => i.OuterId, (o, i) => new { o.OuterPayload, i?.InnerPayload }).Count();
[Benchmark(Baseline = true)]
public int GroupJoin_SelectMany()
=> _outers
.GroupJoin(_inners, o => o.Id, i => i.OuterId, (o, inners) => new { Outer = o, Inners = inners })
.SelectMany(
joinedSet => joinedSet.Inners.DefaultIfEmpty(),
(o, i) => new
{
o.Outer.OuterPayload,
i?.InnerPayload
})
.Count();
}LeftJoin operator prototype implementation
private static IEnumerable<TResult> LeftJoinIterator<TOuter, TInner, TKey, TResult>(IEnumerable<TOuter> outer,
IEnumerable<TInner> inner, Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
Func<TOuter, TInner?, TResult> resultSelector, IEqualityComparer<TKey>? comparer)
{
using (IEnumerator<TOuter> e = outer.GetEnumerator())
{
if (e.MoveNext())
{
Lookup<TKey, TInner> innerLookup =
Lookup<TKey, TInner>.CreateForJoin(inner, innerKeySelector, comparer);
do
{
TOuter item = e.Current;
Grouping<TKey, TInner>? g = innerLookup.GetGrouping(outerKeySelector(item), create: false);
if (g is null)
{
yield return resultSelector(item, default);
}
else
{
int count = g._count;
TInner[] elements = g._elements;
for (int i = 0; i != count; ++i)
{
yield return resultSelector(item, elements[i]);
}
}
} while (e.MoveNext());
}
}
}Note: The current LINQ documentation for GroupJoin/SelectMany shows using AsQueryable, for no apparent reason. The addition of AsQueryable here adds very significant perf overhead - see dotnet/docs#43807 for benchmarks and a proposal to remove AsQueryable from that code sample.
Additional Notes
- This proposal also helps LINQ provider such as EF - the ability to directly express a SQL LEFT JOIN comes up regularly.
- For inner value types, the operator returns
defaultwhen an outer has no inners; this makes it impossible to distinguish between an inner not being found, and the inner being found but being null. This is similar to e.g. FirstOrDefault; although it's quite contrived for LeftJoin, see below for some notes and alternative API designs. - Mostly for symmetry's sake, we could also introduce
RightJoin(), which is the reverse ofLeftJoin()(i.e. elements from the inner sequence are returned if no correlated outer is found). Right joins are seldom used in SQL, and it's always possible to flip the sequences around to express the join as a left join instead. - In theory, we could also introduce an operator to perform a FULL OUTER JOIN, which is a combination of LEFT and RIGHT JOIN (elements from both outer and inner are returned even if there's no correlated element on the other side). This is, however, a rare operation and not generally very useful, and is more complicated to implement efficiently. We can do this later if the need arises, but should keep naming in mind.
- Naming-wise:
- Following the full operation name, we could call the operator LeftOuterJoin instead of LeftJoin; though the OUTER keyword is optional in all major SQL databases, and routinely dropped. LeftJoin is a lighter, more accessible name.
- It may seem a bit odd to introduce the very SQL-oriented LeftJoin, RightJoin. However, LINQ operators already follow SQL naming (e.g. Where/Select rather than Filter/Map).
- We could have JoinLeft, JoinRight to have all join operators grouped together in Intellisense (though the naming is less intuitive).
- Corresponding syntax could be added to LINQ expression syntax, on the C# side (just like
join) - but this is optional. Proposal: Proposal: introduce left and right join clauses to C# query expression syntax csharplang#8892. - We should consider an analyzer+code fix to transform the current recommended GroupJoin/SelectMany into the new LeftJoin - including with the unneeded AsQueryable shown in the docs - as many people have been copy-pasting that and have very slow code (see this).
/cc @jeffhandley @dotnet/area-system-linq @dotnet/efteam
API Proposal
namespace System.Linq;
public static class Enumerable
{
// Alternative naming: LeftOuterJoin
public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
this IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, TInner?, TResult> resultSelector);
public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
this IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, TInner?, TResult> resultSelector,
IEqualityComparer<TKey>? comparer);
// ... plus optionally RightJoin version of the above
}
public static class Queryable
{
public static System.Linq.IQueryable<TResult> LeftJoin<TOuter,TInner,TKey,TResult>(
this System.Linq.IQueryable<TOuter> outer,
System.Collections.Generic.IEnumerable<TInner> inner,
System.Linq.Expressions.Expression<Func<TOuter,TKey>> outerKeySelector,
System.Linq.Expressions.Expression<Func<TInner,TKey>> innerKeySelector,
System.Linq.Expressions.Expression<Func<TOuter,TInner?,TResult>> resultSelector);
public static System.Linq.IQueryable<TResult> Join<TOuter,TInner,TKey,TResult>(
this System.Linq.IQueryable<TOuter> outer,
System.Collections.Generic.IEnumerable<TInner> inner,
System.Linq.Expressions.Expression<Func<TOuter,TKey>> outerKeySelector,
System.Linq.Expressions.Expression<Func<TInner,TKey>> innerKeySelector,
System.Linq.Expressions.Expression<Func<TOuter,TInner?,TResult>> resultSelector,
System.Collections.Generic.IEqualityComparer<TKey>? comparer);
}
// If we decide to also introduce RightJoin (or RightOuterJoin):
public static class Enumerable
{
// Alternative naming: RightOuterJoin
public static IEnumerable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
this IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
Func<TOuter?, TInner, TResult> resultSelector);
public static IEnumerable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
this IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
Func<TOuter?, TInner, TResult> resultSelector,
IEqualityComparer<TKey>? comparer);
// ... plus optionally RightJoin version of the above
}
public static class Queryable
{
public static System.Linq.IQueryable<TResult> RightJoin<TOuter,TInner,TKey,TResult>(
this System.Linq.IQueryable<TOuter> outer,
System.Collections.Generic.IEnumerable<TInner> inner,
System.Linq.Expressions.Expression<Func<TOuter,TKey>> outerKeySelector,
System.Linq.Expressions.Expression<Func<TInner,TKey>> innerKeySelector,
System.Linq.Expressions.Expression<Func<TOuter?,TInner,TResult>> resultSelector);
public static System.Linq.IQueryable<TResult> RightJoin<TOuter,TInner,TKey,TResult>(
this System.Linq.IQueryable<TOuter> outer,
System.Collections.Generic.IEnumerable<TInner> inner,
System.Linq.Expressions.Expression<Func<TOuter,TKey>> outerKeySelector,
System.Linq.Expressions.Expression<Func<TInner,TKey>> innerKeySelector,
System.Linq.Expressions.Expression<Func<TOuter?,TInner,TResult>> resultSelector,
System.Collections.Generic.IEqualityComparer<TKey>? comparer);
}API Usage
var query = outers.LeftJoin(inners, o => o.Id, i => i.OuterId, (o, i) => new { o.OuterPayload, i?.InnerPayload });Value types and alternative LeftJoin API shapes
As pointed out above, the fact that the result selector accepts a defaultable inner means that it's impossible to distinguish between no inner being found for an outer, and the situation where the inner itself happens to be the default (thanks for discussion on this, @eiriktsarpalis). This problem isn't specific to this proposal, other operators (e.g. FirstOrDefault) have the same problem.
An alternative API to address this would pass a boolean to the result selector, representing whether an inner was found or not:
public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
this IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, bool, TInner, TResult> resultSelector);Alternatively, with the upcoming introduction of discriminated unions to .NET, an Optional type could allow the same thing:
public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
this IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, Optional<TInner>, TResult> resultSelector);However, it seems that cases where the distinction between "not found" and "found but default" matters are especially rare for joining; the inner key selector would have to accept a null/default inner and "extract" a key out of that (matching the outer key); not impossible, but definitely feels contrived. It's also possible to work around the ambiguity (in some cases) by switching to a nullable value type.
In other words, we should IMHO avoid making the API more complex/heavy for everyone because of a 1% case.
Previous related issues
- Please consider to extend Linq to add a new keyword leftjoin or rightjoin csharplang#361
- Add new keyword "leftjoin" to make LINQ more readable csharplang#508
- Add leftjoin/rightjoin syntax to linq to end the horrific syntax currently required #53236
- Support the new .NET 10 LeftJoin operators efcore#12793