@mccanne mccanne commented Oct 6, 2025

This commit adds type checking to the semantic analyzer, thus bringing static typing to dynamic data. As far as we know, this has not been done before in a general fashion in any SQL-like query language for dynamically typed data (e.g., SQL++, Asterix, search languages, etc.).

This works by computing fused types of each operator's output and propagating these types in a dataflow analysis. When types are unknown, the analysis flexibly models them as having any possible type. The CSUP and BSUP formats for dynamically typed data will be updated in future PRs to include fused-type information so robust type checking can be carried out for any super-structured data.
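As a rough illustration of the idea described above, here is a minimal Go sketch of fusing operator output types and treating an unknown type as "could be anything". The `Type` struct, `Of`, `Fuse`, and `MayBeNumber` are hypothetical names invented for this example and are not the analyzer's actual data structures; the sketch also assumes a check should only fail when no alternative of a fused type could be valid.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// Type is a hypothetical, simplified analyzer type: either unknown or a
// sorted set of primitive type names (a flat "fused" type).
type Type struct {
	Unknown bool
	Alts    []string
}

// Of builds a fused type from primitive names, sorted and de-duplicated.
func Of(names ...string) Type {
	alts := slices.Clone(names)
	slices.Sort(alts)
	return Type{Alts: slices.Compact(alts)}
}

// Fuse merges the types flowing in from two dataflow paths. If either side
// is unknown, the result is unknown, since nothing can be ruled out.
func Fuse(a, b Type) Type {
	if a.Unknown || b.Unknown {
		return Type{Unknown: true}
	}
	return Of(append(slices.Clone(a.Alts), b.Alts...)...)
}

// MayBeNumber reports whether a value of type t could be numeric.
// Unknown types are assumed to allow any type (an assumption of this sketch).
func MayBeNumber(t Type) bool {
	if t.Unknown {
		return true
	}
	return slices.ContainsFunc(t.Alts, func(name string) bool {
		return name == "int64" || name == "float64"
	})
}

func (t Type) String() string {
	if t.Unknown {
		return "unknown"
	}
	return strings.Join(t.Alts, "|")
}

func main() {
	fused := Fuse(Of("int64"), Of("string"))
	fmt.Println(fused, MayBeNumber(fused))
	fmt.Println(MayBeNumber(Of("ip")))
	fmt.Println(MayBeNumber(Fuse(fused, Type{Unknown: true})))
}
```

Under these assumptions, an arithmetic operator would only be rejected statically for an operand such as `ip`, where no alternative is numeric, while fused or unknown operands pass the check and are left to the runtime.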

Type checking for built-in functions and aggregate functions is not yet done as we need support from the functions packages to provide type signatures. This will be done in a subsequent PR.

Many existing tests were updated since they had problematic type behavior. A number of new tests were added to test the type checker but coverage is light.

philrz commented Oct 7, 2025

The sqllogictests spotted something that broke in this branch. Here's the baseline working as expected on current tip of main:

$ super -version
Version: 525853d26

$ super -f parquet -o data.parquet -c "values {col1:1,col2:1},{col1:2,col2:3},{col1:4,col2:4}" &&
  super -c "SELECT * FROM data.parquet WHERE col1 IN (col2);"

{col1:1,col2:1}
{col1:4,col2:4}

Here's Postgres handling it the same.

$ psql postgres
psql (17.6 (Homebrew))
Type "help" for help.

postgres=# CREATE TABLE DATA (col1 INTEGER, col2 INTEGER);
CREATE TABLE
postgres=# INSERT INTO DATA (col1, col2) VALUES (1,1),(2,3),(4,4);
INSERT 0 3

postgres=# SELECT * FROM data WHERE col1 IN (col2);
 col1 | col2 
------+------
    1 |    1
    4 |    4
(2 rows)

And here it is failing on the branch:

$ super -f parquet -o data.parquet -c "values {col1:1,col2:1},{col1:2,col2:3},{col1:4,col2:4}" &&
  super -c "SELECT * FROM data.parquet WHERE col1 IN (col2);"

bad type for right-hand side of in operator: int64 at line 1, column 43:
SELECT * FROM data.parquet WHERE col1 IN (col2);
                                          ~~~~

mccanne commented Oct 7, 2025

@philrz this behavior was already present and was merely exposed by the type-checking system. Currently, the RHS of ... IN ( expr ) doesn't compile into a tuple/array, but the runtime IN operator returns true for scalar IN scalar. I think this is questionable and we should revisit these semantics. So, we need to fix the parser to recognize ... IN ( expr ) as a tuple and update the runtime to produce dynamic errors when testing whether something is IN a scalar value.

I would recommend merging this PR then fixing the existing problems in a subsequent PR.

Comment on lines +6 to +8
indexed entity is not indexable at line 1, column 9:
fn z(a):a[0]
        ~
Member

Nit: This error will be confusing in a program with multiple calls to z. Is there any way to include the call site as context?

Collaborator Author

I agree but we don't have a way to report errors tying together different locations in the code. We need to add this. How about we fix this in a subsequent PR?

Member

Just a nit so later is fine.

outputs:
- name: stderr
data: |
"z" no such field at line 1, column 25:
Member

Nit: Error might read better like this.

Suggested change
- "z" no such field at line 1, column 25:
+ no such field "z" at line 1, column 25:

Comment on lines 5 to 6
values 10.1.1.1 + 1
       ~~~~~~~~
Member

Nit: Maybe underline the entire offending expression instead of just one operand?

Comment on lines 8 to 9
fn foo(f,x):f(x)
            ~~~~
Member

This error is covered by this test only. If you can't maintain it here, maybe add a new test that covers it?

- name: stderr
data: |
join requires two upstream parallel query paths
join requires two query inputs at line 1, column 1:
Member

Nit: "query" feels unnecessary here.

Suggested change
- join requires two query inputs at line 1, column 1:
+ join requires two inputs at line 1, column 1:

Comment on lines 887 to 891
for _, t := range u.Types {
	if hasUnknown(t) {
		return true
	}
}
Member

Nits:

Suggested change
- for _, t := range u.Types {
- 	if hasUnknown(t) {
- 		return true
- 	}
- }
+ if slices.ContainsFunc(u.Types, hasUnknown) {
+ 	return true
+ }
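For context, here is a small self-contained sketch of how a recursive predicate like `hasUnknown` can be written with `slices.ContainsFunc` from the Go standard library. The type definitions (`Type`, `TypeUnknown`, `TypePrimitive`, `TypeArray`, `TypeUnion`) are simplified stand-ins invented for this example, not the project's real types.

```go
package main

import (
	"fmt"
	"slices"
)

// Hypothetical, simplified type representations for illustration only.
type Type interface{}

type TypeUnknown struct{}
type TypePrimitive struct{ Name string }
type TypeArray struct{ Elem Type }
type TypeUnion struct{ Types []Type }

// hasUnknown reports whether t is, or contains, an unknown type.
func hasUnknown(t Type) bool {
	switch t := t.(type) {
	case *TypeUnknown:
		return true
	case *TypeArray:
		return hasUnknown(t.Elem)
	case *TypeUnion:
		// ContainsFunc takes the predicate as a function value (hasUnknown),
		// not the result of calling it, and stops at the first match.
		return slices.ContainsFunc(t.Types, hasUnknown)
	}
	return false
}

func main() {
	u := &TypeUnion{Types: []Type{
		&TypePrimitive{Name: "int64"},
		&TypeArray{Elem: &TypeUnknown{}},
	}}
	fmt.Println(hasUnknown(u))
	fmt.Println(hasUnknown(&TypePrimitive{Name: "string"}))
}
```

The loop and the `slices.ContainsFunc` form short-circuit in the same way; the helper mainly removes boilerplate from each of the similar predicates.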

Comment on lines 867 to 871
for _, t := range typ.Types {
	if hasString(t) {
		return true
	}
}
Member

Nit:

Suggested change
- for _, t := range typ.Types {
- 	if hasString(t) {
- 		return true
- 	}
- }
+ return slices.ContainsFunc(typ.Types, hasString)

Comment on lines 191 to 192
{Name: op.LeftAlias, Type: types[0]},
{Name: op.RightAlias, Type: types[1]},
Member

Nit:

Suggested change
- {Name: op.LeftAlias, Type: types[0]},
- {Name: op.RightAlias, Type: types[1]},
+ super.NewField(op.LeftAlias, types[0]),
+ super.NewField(op.RightAlias, types[1]),

Comment on lines 851 to 855
for _, t := range u.Types {
	if hasNumber(t) {
		return true
	}
}
Member

Nit:

Suggested change
- for _, t := range u.Types {
- 	if hasNumber(t) {
- 		return true
- 	}
- }
+ if slices.ContainsFunc(u.Types, hasNumber) {
+ 	return true
+ }

mccanne commented Oct 7, 2025

Ok, regarding the SQL problem with the IN operator, we decided on Zoom today that IN semantics will include scalar equality in addition to containment equality. This means we don't need to change the runtime. Instead, I just pushed changes to update the type checker. Docs for the IN operator will come soon to reflect this.
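To make the agreed semantics concrete, here is a minimal Go sketch of an IN test that falls back to scalar equality when the right-hand side is not a container. The `In` function and the use of `[]any` to model arrays are assumptions made for this example; it is not the runtime's actual implementation.

```go
package main

import (
	"fmt"
	"reflect"
	"slices"
)

// In reports whether elem is IN rhs under the semantics described above:
// containment when rhs is a list, plain equality when rhs is a scalar.
// Lists are modeled as []any purely for illustration.
func In(elem, rhs any) bool {
	if list, ok := rhs.([]any); ok {
		return slices.ContainsFunc(list, func(v any) bool {
			return reflect.DeepEqual(v, elem)
		})
	}
	return reflect.DeepEqual(elem, rhs)
}

func main() {
	// Rows from the example above: WHERE col1 IN (col2).
	rows := [][2]int64{{1, 1}, {2, 3}, {4, 4}}
	for _, r := range rows {
		fmt.Println(r, In(r[0], r[1]))
	}
	// Containment still works when the right-hand side is a list.
	fmt.Println(In(int64(3), []any{int64(1), int64(3)}))
}
```

Applied to the three rows from the sqllogictest reproduction, this keeps the first and third rows, matching the Postgres result shown earlier in the conversation.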

@mccanne mccanne merged commit c7dcd26 into main Oct 7, 2025
3 checks passed
@mccanne mccanne deleted the type-checker branch October 7, 2025 18:11
philrz commented Oct 7, 2025

FYI, once this was merged to main (commit c7dcd26), I re-ran the 1+ million sqllogictests that had previously been successful and 100% of them passed. So, the IN fix is confirmed a success and no other surprises were hiding behind that one. 👍
