Skip to content

Implement distinct qualifier for aggregate function calls#5740

Merged
nwt merged 6 commits intomainfrom
distinct-for-aggregates
Apr 3, 2025
Merged

Implement distinct qualifier for aggregate function calls#5740
nwt merged 6 commits intomainfrom
distinct-for-aggregates

Conversation

@nwt
Copy link
Copy Markdown
Member

@nwt nwt commented Mar 26, 2025

This enables "count(distinct a)" etc.

Fixes #5708.

I didn't see a straightforward way to add this to the documentation so I didn't do so.

This enables "count(distinct a)" etc.
@nwt nwt force-pushed the distinct-for-aggregates branch from d826c43 to de76784 Compare April 2, 2025 15:47
@nwt nwt requested review from a team and philrz April 2, 2025 15:49
@nwt nwt marked this pull request as ready for review April 2, 2025 15:49
@philrz
Copy link
Copy Markdown
Contributor

philrz commented Apr 2, 2025

I've been testing out this branch as of commit 99f3988.

In positive news, the query I showed in linked issue #5708 now runs and returns the expected result.

$ super -version
Version: v1.18.0-357-g99f39885

$ SUPER_VAM=1 super -c "SELECT COUNT(DISTINCT client_ip) FROM 'bench2.csup'"
{count:2432416(uint64)}

However, I also have a bunch of ClickBench queries that were waiting on COUNT(DISTINCT ...) support, and now that I can run those, one of them causes a panic. Of course that doesn't mean that the changes in this branch caused the panic, but since I couldn't run it at all before, I have no way of being sure, so I submit it here for consideration and would be happy to put it in a separate issue if it's confirmed to be a separate issue. The referenced test data is available at s3://brim-sampledata/mgbench/bench0.parquet.

$ SUPER_VAM=1 super -c "SELECT RegionID, COUNT(DISTINCT UserID) AS u FROM 'bench0.parquet' GROUP BY RegionID ORDER BY u DESC LIMIT 10;"
panic: interface conversion: vector.Any is *vector.View, not *vector.Bytes

goroutine 49 [running]:
github.com/brimdata/super/runtime/vam/expr/agg.(*distinct).ConsumeAsPartial(0xc0bc840050, {0xb37ebf8, 0xc0be7ca180})
	/Users/phil/work/super/runtime/vam/expr/agg/distinct.go:44 +0x205
github.com/brimdata/super/runtime/vam/op/aggregate.(*superTable).update(0xc0bc840000, {0xc0b714e140, 0x1, 0x2}, {0xc0b714e150, 0x1, 0xc019ce2b18?})
	/Users/phil/work/super/runtime/vam/op/aggregate/aggtable.go:67 +0x506
github.com/brimdata/super/runtime/vam/op/aggregate.(*Aggregate).consume(0xc0002ece70, {0xc0b714e140, 0x1, 0x2}, {0xc0b714e150, 0x1, 0x1})
	/Users/phil/work/super/runtime/vam/op/aggregate/aggregate.go:102 +0x27e
github.com/brimdata/super/runtime/vam/op/aggregate.(*Aggregate).Pull.func1({0xc0b714e140, 0x2, 0xc019ce2c30?})
	/Users/phil/work/super/runtime/vam/op/aggregate/aggregate.go:83 +0x75
github.com/brimdata/super/vector.Apply(0x80?, 0xb37e598?, {0xc0b714e140?, 0x0?, 0xb13d900?})
	/Users/phil/work/super/vector/apply.go:17 +0xa3
github.com/brimdata/super/runtime/vam/op/aggregate.(*Aggregate).Pull(0xc0002ece70, 0x0?)
	/Users/phil/work/super/runtime/vam/op/aggregate/aggregate.go:82 +0x13b
github.com/brimdata/super/runtime/vam/op.(*Yield).Pull(0xc019cc75c0, 0x0)
	/Users/phil/work/super/runtime/vam/op/yield.go:27 +0x44
github.com/brimdata/super/runtime/vam.(*Materializer).Pull(0x0?, 0x0?)
	/Users/phil/work/super/runtime/vam/materialize.go:26 +0x27
github.com/brimdata/super/runtime/sam/op/sort.(*Op).run(0xc019cc8180)
	/Users/phil/work/super/runtime/sam/op/sort/sort.go:77 +0xda
created by github.com/brimdata/super/runtime/sam/op/sort.(*Op).Pull.func1 in goroutine 1
	/Users/phil/work/super/runtime/sam/op/sort/sort.go:50 +0x66

@philrz
Copy link
Copy Markdown
Contributor

philrz commented Apr 2, 2025

More testing on this branch, now at commit aa04445.

It looks like there's a correctness problem in vector runtime with the following ClickBench query. It uses the same test data as referenced in the prior comment.

$ super -version
Version: v1.18.0-359-gaa044456

$ SUPER_VAM=1 super -c "SELECT MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM 'bench0.parquet' WHERE MobilePhoneModel <> '' GROUP BY MobilePhoneModel ORDER BY u DESC LIMIT 10;"
{MobilePhoneModel:"iPad",u:5106874(uint64)}
{MobilePhoneModel:"iPhone",u:229128(uint64)}
{MobilePhoneModel:"A500",u:75723(uint64)}
{MobilePhoneModel:"N8-00",u:24880(uint64)}
{MobilePhoneModel:"ONE TOUCH 6030A",u:12436(uint64)}
{MobilePhoneModel:"iPho",u:10163(uint64)}
{MobilePhoneModel:"GT-P7300B",u:9217(uint64)}
{MobilePhoneModel:"3110000",u:8975(uint64)}
{MobilePhoneModel:"HTC Desire",u:7129(uint64)}
{MobilePhoneModel:"eagle75",u:6084(uint64)}

Whereas in sequential runtime it's:

$ super -c "SELECT MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM 'bench0.parquet' WHERE MobilePhoneModel <> '' GROUP BY MobilePhoneModel ORDER BY u DESC LIMIT 10;"
{MobilePhoneModel:"iPad",u:1090347(uint64)}
{MobilePhoneModel:"iPhone",u:45758(uint64)}
{MobilePhoneModel:"A500",u:16046(uint64)}
{MobilePhoneModel:"N8-00",u:5565(uint64)}
{MobilePhoneModel:"iPho",u:3300(uint64)}
{MobilePhoneModel:"ONE TOUCH 6030A",u:2759(uint64)}
{MobilePhoneModel:"GT-P7300B",u:1907(uint64)}
{MobilePhoneModel:"3110000",u:1871(uint64)}
{MobilePhoneModel:"GT-I9500",u:1598(uint64)}
{MobilePhoneModel:"eagle75",u:1492(uint64)}

And the sequential result matches the one from DuckDB.

$ duckdb -c "SELECT MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM 'bench0.parquet' WHERE MobilePhoneModel <> '' GROUP BY MobilePhoneModel ORDER BY u DESC LIMIT 10;"
┌──────────────────┬─────────┐
│ MobilePhoneModel │    u    │
│     varchar      │  int64  │
├──────────────────┼─────────┤
│ iPad             │ 1090347 │
│ iPhone           │   45758 │
│ A500             │   16046 │
│ N8-00            │    5565 │
│ iPho             │    3300 │
│ ONE TOUCH 6030A  │    2759 │
│ GT-P7300B        │    1907 │
│ 3110000          │    1871 │
│ GT-I9500         │    1598 │
│ eagle75          │    1492 │
├──────────────────┴─────────┤
│ 10 rows          2 columns │
└────────────────────────────┘

@nwt
Copy link
Copy Markdown
Member Author

nwt commented Apr 3, 2025

@philrz: Thanks for the bug reports. I've fixed both.

@philrz
Copy link
Copy Markdown
Contributor

philrz commented Apr 3, 2025

This is a functional 👍 for me too. In addition to confirming the two bugs cited above have been addressed for those specific queries, I've just completed a run across seven ClickBench queries I previously did in an all-pipes, two-step workaround due to lack of COUNT(DISTINCT ...) support, now running them with this branch at commit bfbc114 in their original SQL form that uses COUNT(DISTINCT ...), and they all ran successfully and produced the same query results as a traditional SQL RDBMS.

@nwt nwt merged commit 9b5d70c into main Apr 3, 2025
3 checks passed
@nwt nwt deleted the distinct-for-aggregates branch April 3, 2025 19:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support SQL COUNT(DISTINCT ...)

3 participants