Skip to content

[API Proposal]: Introduce LeftJoin LINQ operator #110292

@roji

Description

@roji

Background and motivation

Background

LINQ has a Join operator, which, like its SQL INNER JOIN counterpart, correlates elements of two sequences based on matching keys; the LINQ join implementation internally creates a Lookup for the inner sequence, and then loops over the outer sequence, doing a lookup for the matching inner elements. In SQL database parlance, this is known as the hash join strategy (SQL Server docs, PostgreSQL docs as well as this useful post).

In addition to the above, SQL also has LEFT JOIN, which returns outer elements even if there's no corresponding inner ones; LINQ, in contrast, lacks this operator. The LINQ conceptual documentation shows how to combine existing operators to achieve a left join:

var query = students
    .GroupJoin(departments, student => student.DepartmentID, department => department.ID, (student, departmentList) => new { student, subgroup = departmentList })
    .SelectMany(
        joinedSet => joinedSet.subgroup.DefaultIfEmpty(),
        (student, department) => new
        {
            student.student.FirstName,
            student.student.LastName,
            Department = department.Name
        });

There are two issues with the above suggestion:

  • It's complicated, requiring combining 3 different LINQ operators in a specific way, and is easy to accidentally get wrong. Many EF users have complained about the complexity of this construct for expressing a simple SQL LEFT JOIN.
  • It is inefficient - the combination of operators adds significant overhead compared to a single operator using a hash join (as Join does).

Proposal

This proposes introducing a 1st-class LeftJoin operator, which operates very similar to Join, except that it returns outer elements for which no inner element could be correlated. Aside from being much simpler to use than GroupJoin/SelectMany, it would also simply use Lookup internally - just like Join - and would therefore be much faster.

An initial implementation shows significant performance improvement compared to GroupJoin/SelectMany; LeftJoin is always faster than the equivalent GroupJoin/SelectMany construct, since GroupJoin itself constructs and uses a Lookup internally to implement an inner join internally - just like LeftJoin does - but also adds additional work on top.

BenchmarkDotNet v0.14.0, macOS Sequoia 15.1.1 (24B91) [Darwin 24.1.0]
Apple M2 Max, 1 CPU, 12 logical and 12 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 9.0.0 (9.0.24.52809), Arm64 RyuJIT AdvSIMD
  DefaultJob : .NET 9.0.0 (9.0.24.52809), Arm64 RyuJIT AdvSIMD
Method InnersPerOuter OuterCount Mean Error StdDev Ratio RatioSD Gen0 Gen1 Gen2 Allocated Alloc Ratio
LeftJoin 1 1 97.73 ns 0.402 ns 0.376 ns 0.64 0.00 0.0573 - - 480 B 0.71
GroupJoin_SelectMany 1 1 153.02 ns 0.370 ns 0.309 ns 1.00 0.00 0.0811 - - 680 B 1.00
LeftJoin 1 10 488.66 ns 5.764 ns 5.391 ns 0.57 0.01 0.2031 - - 1704 B 0.57
GroupJoin_SelectMany 1 10 856.40 ns 4.305 ns 4.027 ns 1.00 0.01 0.3567 - - 2984 B 1.00
LeftJoin 1 100 4,814.61 ns 19.884 ns 17.627 ns 0.59 0.00 1.7090 0.0534 - 14344 B 0.54
GroupJoin_SelectMany 1 100 8,173.26 ns 37.560 ns 35.134 ns 1.00 0.01 3.1586 0.1068 - 26424 B 1.00
LeftJoin 1 1000 47,619.93 ns 227.566 ns 201.731 ns 0.59 0.00 16.2964 3.4790 - 136728 B 0.53
GroupJoin_SelectMany 1 1000 80,828.42 ns 323.010 ns 302.144 ns 1.00 0.01 30.6396 6.5918 - 256808 B 1.00
LeftJoin 10 1 300.03 ns 3.089 ns 2.580 ns 0.70 0.01 0.1316 - - 1104 B 0.85
GroupJoin_SelectMany 10 1 430.06 ns 3.919 ns 3.273 ns 1.00 0.01 0.1554 - - 1304 B 1.00
LeftJoin 10 10 2,954.84 ns 28.864 ns 25.587 ns 0.87 0.01 0.9460 0.0076 - 7944 B 0.86
GroupJoin_SelectMany 10 10 3,397.03 ns 9.497 ns 8.419 ns 1.00 0.00 1.1024 0.0114 - 9224 B 1.00
LeftJoin 10 100 31,873.63 ns 120.088 ns 106.455 ns 0.81 0.02 9.1553 0.7324 - 76744 B 0.86
GroupJoin_SelectMany 10 100 39,299.04 ns 743.548 ns 856.271 ns 1.00 0.03 10.5591 0.8545 - 88824 B 1.00
LeftJoin 10 1000 402,837.52 ns 5,140.127 ns 4,808.078 ns 0.88 0.01 90.8203 30.2734 - 760728 B 0.86
GroupJoin_SelectMany 10 1000 457,658.94 ns 5,117.136 ns 4,786.572 ns 1.00 0.01 104.9805 34.6680 - 880808 B 1.00
LeftJoin 100 1 1,906.55 ns 9.026 ns 8.443 ns 0.83 0.01 0.6981 0.0019 - 5848 B 0.97
GroupJoin_SelectMany 100 1 2,310.49 ns 29.862 ns 27.933 ns 1.00 0.02 0.7210 0.0038 - 6048 B 1.00
LeftJoin 100 10 18,861.98 ns 366.707 ns 560.001 ns 0.85 0.03 6.5918 0.2747 - 55384 B 0.98
GroupJoin_SelectMany 100 10 22,285.02 ns 189.605 ns 177.356 ns 1.00 0.01 6.7444 0.2747 - 56664 B 1.00
LeftJoin 100 100 197,041.60 ns 2,033.092 ns 1,802.283 ns 0.83 0.01 65.6738 15.8691 - 551144 B 0.98
GroupJoin_SelectMany 100 100 236,428.77 ns 1,920.429 ns 1,702.410 ns 1.00 0.01 67.1387 16.6016 - 563224 B 1.00
LeftJoin 100 1000 2,628,960.79 ns 22,819.664 ns 21,345.528 ns 0.85 0.03 656.2500 316.4063 - 5504731 B 0.98
GroupJoin_SelectMany 100 1000 3,113,006.15 ns 62,259.419 ns 95,076.717 ns 1.00 0.04 671.8750 328.1250 - 5624811 B 1.00
LeftJoin 1000 1 17,487.96 ns 203.775 ns 180.641 ns 0.84 0.01 5.8594 0.1221 - 49056 B 1.00
GroupJoin_SelectMany 1000 1 20,818.61 ns 251.607 ns 210.103 ns 1.00 0.01 5.8594 0.1831 - 49256 B 1.00
LeftJoin 1000 10 175,500.86 ns 2,377.458 ns 1,856.162 ns 0.77 0.02 58.1055 11.9629 - 487464 B 1.00
GroupJoin_SelectMany 1000 10 228,064.76 ns 4,414.845 ns 4,907.089 ns 1.00 0.03 58.3496 12.6953 - 488744 B 1.00
LeftJoin 1000 100 2,092,691.16 ns 39,429.196 ns 77,829.392 ns 0.86 0.04 582.0313 250.0000 - 4871947 B 1.00
GroupJoin_SelectMany 1000 100 2,422,638.76 ns 47,560.763 ns 84,539.215 ns 1.00 0.05 582.0313 246.0938 - 4884027 B 1.00
LeftJoin 1000 1000 42,875,766.40 ns 738,934.130 ns 691,199.444 ns 0.89 0.02 5900.0000 1700.0000 100.0000 48712842 B 1.00
GroupJoin_SelectMany 1000 1000 47,966,803.96 ns 925,858.818 ns 1,066,220.392 ns 1.00 0.03 5888.8889 1666.6667 111.1111 48832891 B 1.00
Benchmark code
[MemoryDiagnoser]
public class LeftJoinBenchmarks
{
    [Params(1, 10, 100, 1000, Priority = 1)]
    // [Params(1, Priority = 1)]
    public int InnersPerOuter { get; set; }

    [Params(1, 10, 100, 1000, Priority = 2)]
    // [Params(1, Priority = 2)]
    public int OuterCount { get; set; }

    private Outer[] _outers = null!;
    private Inner[] _inners = null!;

    class Outer
    {
        public int Id { get; set; }
        public string? OuterPayload { get; set; }
    }

    class Inner
    {
        public int Id { get; set; }
        public int OuterId { get; set; }
        public string? InnerPayload { get; set; }
    }

    private const int RandomSeed = 42;

    [GlobalSetup]
    public void Setup()
    {
        _outers = new Outer[OuterCount];
        _inners = new Inner[OuterCount * InnersPerOuter];

        var remainingInners = new List<Inner>(OuterCount * InnersPerOuter);
        for (var outerId = 0; outerId < OuterCount; outerId++)
        {
            _outers[outerId] = new Outer { Id = outerId, OuterPayload = $"Outer{outerId}" };

            for (var j = 0; j < InnersPerOuter; j++)
            {
                var innerId = outerId * j + j;
                remainingInners.Add(new Inner { Id = innerId, OuterId = outerId, InnerPayload = $"Inner{innerId}" });
            }
        }

        var random = new Random(RandomSeed);

        for (var i = 0; i < _inners.Length; i++)
        {
            var j = random.Next(0, remainingInners.Count);
            _inners[i] = remainingInners[j];
            remainingInners.RemoveAt(j);
        }

        Debug.Assert(remainingInners.Count == 0);
    }

    [Benchmark]
    public int LeftJoin()
        => _outers.LeftJoin(_inners, o => o.Id, i => i.OuterId, (o, i) => new { o.OuterPayload, i?.InnerPayload }).Count();

    [Benchmark(Baseline = true)]
    public int GroupJoin_SelectMany()
        => _outers
            .GroupJoin(_inners, o => o.Id, i => i.OuterId, (o, inners) => new { Outer = o, Inners = inners })
            .SelectMany(
                joinedSet => joinedSet.Inners.DefaultIfEmpty(),
                (o, i) => new
                {
                    o.Outer.OuterPayload,
                    i?.InnerPayload
                })
            .Count();
}
LeftJoin operator prototype implementation
private static IEnumerable<TResult> LeftJoinIterator<TOuter, TInner, TKey, TResult>(IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner, Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner?, TResult> resultSelector, IEqualityComparer<TKey>? comparer)
{
    using (IEnumerator<TOuter> e = outer.GetEnumerator())
    {
        if (e.MoveNext())
        {
            Lookup<TKey, TInner> innerLookup =
                Lookup<TKey, TInner>.CreateForJoin(inner, innerKeySelector, comparer);
            do
            {
                TOuter item = e.Current;
                Grouping<TKey, TInner>? g = innerLookup.GetGrouping(outerKeySelector(item), create: false);
                if (g is null)
                {
                    yield return resultSelector(item, default);
                }
                else
                {
                    int count = g._count;
                    TInner[] elements = g._elements;
                    for (int i = 0; i != count; ++i)
                    {
                        yield return resultSelector(item, elements[i]);
                    }
                }
            } while (e.MoveNext());
        }
    }
}

Note: The current LINQ documentation for GroupJoin/SelectMany shows using AsQueryable, for no apparent reason. The addition of AsQueryable here adds very significant perf overhead - see dotnet/docs#43807 for benchmarks and a proposal to remove AsQueryable from that code sample.

Additional Notes

  • This proposal also helps LINQ provider such as EF - the ability to directly express a SQL LEFT JOIN comes up regularly.
  • For inner value types, the operator returns default when an outer has no inners; this makes it impossible to distinguish between an inner not being found, and the inner being found but being null. This is similar to e.g. FirstOrDefault; although it's quite contrived for LeftJoin, see below for some notes and alternative API designs.
  • Mostly for symmetry's sake, we could also introduce RightJoin(), which is the reverse of LeftJoin() (i.e. elements from the inner sequence are returned if no correlated outer is found). Right joins are seldom used in SQL, and it's always possible to flip the sequences around to express the join as a left join instead.
  • In theory, we could also introduce an operator to perform a FULL OUTER JOIN, which is a combination of LEFT and RIGHT JOIN (elements from both outer and inner are returned even if there's no correlated element on the other side). This is, however, a rare operation and not generally very useful, and is more complicated to implement efficiently. We can do this later if the need arises, but should keep naming in mind.
  • Naming-wise:
    • Following the full operation name, we could call the operator LeftOuterJoin instead of LeftJoin; though the OUTER keyword is optional in all major SQL databases, and routinely dropped. LeftJoin is a lighter, more accessible name.
    • It may seem a bit odd to introduce the very SQL-oriented LeftJoin, RightJoin. However, LINQ operators already follow SQL naming (e.g. Where/Select rather than Filter/Map).
    • We could have JoinLeft, JoinRight to have all join operators grouped together in Intellisense (though the naming is less intuitive).
  • Corresponding syntax could be added to LINQ expression syntax, on the C# side (just like join) - but this is optional. Proposal: Proposal: introduce left and right join clauses to C# query expression syntax csharplang#8892.
  • We should consider an analyzer+code fix to transform the current recommended GroupJoin/SelectMany into the new LeftJoin - including with the unneeded AsQueryable shown in the docs - as many people have been copy-pasting that and have very slow code (see this).

/cc @jeffhandley @dotnet/area-system-linq @dotnet/efteam

API Proposal

namespace System.Linq;

public static class Enumerable
{
    // Alternative naming: LeftOuterJoin
    public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter, TInner?, TResult> resultSelector);

    public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter, TInner?, TResult> resultSelector,
        IEqualityComparer<TKey>? comparer);

// ... plus optionally RightJoin version of the above
}

public static class Queryable
{
    public static System.Linq.IQueryable<TResult> LeftJoin<TOuter,TInner,TKey,TResult>(
        this System.Linq.IQueryable<TOuter> outer,
        System.Collections.Generic.IEnumerable<TInner> inner,
        System.Linq.Expressions.Expression<Func<TOuter,TKey>> outerKeySelector,
        System.Linq.Expressions.Expression<Func<TInner,TKey>> innerKeySelector,
        System.Linq.Expressions.Expression<Func<TOuter,TInner?,TResult>> resultSelector);
	
    public static System.Linq.IQueryable<TResult> Join<TOuter,TInner,TKey,TResult>(
        this System.Linq.IQueryable<TOuter> outer,
        System.Collections.Generic.IEnumerable<TInner> inner,
        System.Linq.Expressions.Expression<Func<TOuter,TKey>> outerKeySelector,
        System.Linq.Expressions.Expression<Func<TInner,TKey>> innerKeySelector,
        System.Linq.Expressions.Expression<Func<TOuter,TInner?,TResult>> resultSelector,
        System.Collections.Generic.IEqualityComparer<TKey>? comparer);
}

// If we decide to also introduce RightJoin (or RightOuterJoin):

public static class Enumerable
{
    // Alternative naming: RightOuterJoin
    public static IEnumerable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
        Func<TOuter?, TInner, TResult> resultSelector);

    public static IEnumerable<TResult> RightJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,  
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
        Func<TOuter?, TInner, TResult> resultSelector,
        IEqualityComparer<TKey>? comparer);

// ... plus optionally RightJoin version of the above
}

public static class Queryable
{
    public static System.Linq.IQueryable<TResult> RightJoin<TOuter,TInner,TKey,TResult>(
        this System.Linq.IQueryable<TOuter> outer,
        System.Collections.Generic.IEnumerable<TInner> inner,
        System.Linq.Expressions.Expression<Func<TOuter,TKey>> outerKeySelector,
        System.Linq.Expressions.Expression<Func<TInner,TKey>> innerKeySelector,
        System.Linq.Expressions.Expression<Func<TOuter?,TInner,TResult>> resultSelector);
	
    public static System.Linq.IQueryable<TResult> RightJoin<TOuter,TInner,TKey,TResult>(
        this System.Linq.IQueryable<TOuter> outer,
        System.Collections.Generic.IEnumerable<TInner> inner,
        System.Linq.Expressions.Expression<Func<TOuter,TKey>> outerKeySelector,
        System.Linq.Expressions.Expression<Func<TInner,TKey>> innerKeySelector,
        System.Linq.Expressions.Expression<Func<TOuter?,TInner,TResult>> resultSelector,
        System.Collections.Generic.IEqualityComparer<TKey>? comparer);
}

API Usage

var query = outers.LeftJoin(inners, o => o.Id, i => i.OuterId, (o, i) => new { o.OuterPayload, i?.InnerPayload });

Value types and alternative LeftJoin API shapes

As pointed out above, the fact that the result selector accepts a defaultable inner means that it's impossible to distinguish between no inner being found for an outer, and the situation where the inner itself happens to be the default (thanks for discussion on this, @eiriktsarpalis). This problem isn't specific to this proposal, other operators (e.g. FirstOrDefault) have the same problem.

An alternative API to address this would pass a boolean to the result selector, representing whether an inner was found or not:

public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
	this IEnumerable<TOuter> outer,  
	IEnumerable<TInner> inner,
	Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
	Func<TOuter, bool, TInner, TResult> resultSelector);

Alternatively, with the upcoming introduction of discriminated unions to .NET, an Optional type could allow the same thing:

public static IEnumerable<TResult> LeftJoin<TOuter, TInner, TKey, TResult>(
	this IEnumerable<TOuter> outer,  
	IEnumerable<TInner> inner,
	Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
	Func<TOuter, Optional<TInner>, TResult> resultSelector);

However, it seems that cases where the distinction between "not found" and "found but default" matters are especially rare for joining; the inner key selector would have to accept a null/default inner and "extract" a key out of that (matching the outer key); not impossible, but definitely feels contrived. It's also possible to work around the ambiguity (in some cases) by switching to a nullable value type.

In other words, we should IMHO avoid making the API more complex/heavy for everyone because of a 1% case.

Previous related issues

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions