Add an intermediate representation for the new compiler by mcy · Pull Request #419 · bufbuild/protocompile

mcy · 2025-01-08T17:57:14Z

This PR is the first draft of the intermediate representation for symbol resolution, type inference (of option types) and eventually, generation of FileDescriptorProtos.

This PR is essentially just data structure boilerplate: it does not contain everything (for example, there is no support for services yet) nor does it include code for constructing from an AST or an imported FDP. The purpose of this PR is to get the first draft checked in so that followup PRs can iterate on the API as symbol resolution, type checking, and option resolution are implemented.

experimental/ast/predeclared/predeclared.go

internal/enum/enum.go.tmpl

jhump · 2025-01-09T20:17:32Z

internal/enum/enum.go.tmpl

+	"github.com/bufbuild/protocompile/internal/iter"
+)
+
+var _ iter.Seq[int] // Mark iter as used.


I know this is the simplest way to do this. But it would be nice to just emit only the imports that are actually needed. You could emit:

import ( "fmt"

And then loop through methods and if/when you see "all" emit

"github.com/bufbuild/protocompile/internal/iter"

And emit the closing ) after that.

Another option might be not emit this line but still unconditionally emit the imports, and then always run make lintfix after go generate in the Makefile, which should fix up formatting, including organizing imports.

But I think just emitting the correct imports is fine for now.

Another option is, if the All method isn't used in very many places, just make them hand-written instead of generated.

I don't think it's really worth the complexity to get the imports right. No one is going to read this file (or should, at least).

experimental/internal/taxa/noun.go

jhump · 2025-01-09T20:26:56Z

experimental/ir/presence/kind.go

+
+	// The field is singular and presence is equivalent to the field having its
+	// zero value. This corresponds to a non-message field not marked "optional"
+	// in proto3.


or editions?

Yes, but this comment is mostly meant to be giving an example, not be exhaustive.

jhump · 2025-01-09T21:04:16Z

experimental/ir/option.go

+// Type is the element type of this value. For arrays, this is the type of the
+// element; Protobuf does not have a concept of an "array type".


It may behoove us to provide an "array" type, even though Protobuf descriptors do not. That could enable simpler code that inspects and works with types, to treat them more uniformly. Though I think doing so requires adding another byte to Type, to put an isArray flag there 🤷 (which may not be desirable since, for alignment, it means likely adding 4 or 8 bytes to the Type struct in practice...).

I am slightly skeptical of that. My gut instinct would be to not add that for now, and see how painful it makes things.

jhump · 2025-01-09T21:07:18Z

experimental/ir/option.go

+	// Unknown means this is a message type.
+	// Enum types are predeclared.Int32.
+	ty predeclared.Name


These seems unfortunate, that you can't inspect the enum type of an enum value. Maybe instead encode enums/messages/groups as negative values here and map to a separate enum for NamedType?

Well, one thing that you could do is say that predeclared.Unknown represents "is a user type", and determining whether it's a message or enum requires looking at the type.

experimental/ir/option.go

jhump · 2025-01-09T21:31:23Z

experimental/ir/option.go

+
+type rawMessageValueEntry struct {
+	key   intern.ID
+	field int32 // Index of the field within rawMessageValue.ty; -1 if unknown.


We need more here to handle extensions. I think you could instead include a ref[rawField] here, instead of this field, and then interpret its ref.file field as an index into fields of rawMessageValue.typ if it is negative but not -1. (And in such cases, ref.ptr is ignored.)

I think the representation for an expanded Any will be kind of funky, but there's really no avoiding that, and I think this can handle that correctly. We just have to resolve a type URL "field name" to a ref[rawType] and associate that with the field's value.

Yeah, I think that for any we need to have both a ref[rawField] and ref[rawType]. Which is unfortunate, since most messages will only contain an int32.

Perhaps we'll wind up having different kinds of entries... we'll see. I'm kind of trying to punt on extensions in message literals for now, since it's a niche feature and I want to prioritize getting things working first.

Added a TODO.

The TODO only mentions Any and not extensions. My original comment was about extensions -- you need more than just the tag number for those since you'll also need to know the file from which it came.

Also, FWIW, you shouldn't need both a ref[rawField] and ref[rawType] because you never need both at the same type. A field in a message literal is exactly one of:

A normal field

An extension field

A reference to a type URL for expanded Any syntax.

So you could overload a single 64-bit value (such as a ref[rawField]). Extension numbers are only 31 bits (usually 29, but 31 for the case of message set wire format). Normal field numbers are only 29 bits. So the scheme I mentioned above would work, and you could even interpret the reference as a type instead of field when the top two highest bits are set. (That allows over a billion types per file, which seems like it should be plenty...).

Turns out I was being dumb. All message literals know their type, so if you write { [google.com/foo.Bar]: { ... } } that will be typed correctly: the value's field's type will be Any, but the value as a message will be foo.Bar.

Changing to a ref[rawField] makes everything work out.

jhump · 2025-03-03T19:24:34Z

experimental/ir/option.go

+
+	isArray                  bool // If set, value is an arena.Pointer[[]rawValueBits].
+	isUninterpretedPath      bool // If set, value is an intern.ID.
+	isUninterpretedAggregate bool // If set, value is an intern.ID.


Perhaps this should be an arena.Pointer[string]?

Agreed that it's almost certainly not worth interning the aggregate value. I'm not even sure we need to compute it/store it all. I thin it would be better to leave value unset/nothing and rely solely on the AST expression node to interpret. If we ever do need the aggregate string (which would only be if we were to populate a DescriptorProto before/without interpreting options), it could be computed on-demand from the AST node.

jhump · 2025-03-03T19:26:09Z

experimental/ir/ir.go

+// We can't just embed With into File, because then it would be an exported
+// field.


My issue with this naming convention is that it's unclear what bug number is really referring to. I usually see this in test code to indicate an issue in the owning repo (like a protocompile bug number). I think it makes the code more readable to name it semantically and then put a full link to the bug at its definition.

jhump · 2025-03-03T19:39:41Z

experimental/ir/option.go

+
+type rawMessageValueEntry struct {
+	key   intern.ID
+	field int32 // Index of the field within rawMessageValue.ty; -1 if unknown.


The TODO only mentions Any and not extensions. My original comment was about extensions -- you need more than just the tag number for those since you'll also need to know the file from which it came.

Also, FWIW, you shouldn't need both a ref[rawField] and ref[rawType] because you never need both at the same type. A field in a message literal is exactly one of:

A normal field

An extension field

A reference to a type URL for expanded Any syntax.

So you could overload a single 64-bit value (such as a ref[rawField]). Extension numbers are only 31 bits (usually 29, but 31 for the case of message set wire format). Normal field numbers are only 29 bits. So the scheme I mentioned above would work, and you could even interpret the reference as a type instead of field when the top two highest bits are set. (That allows over a billion types per file, which seems like it should be plenty...).

jhump · 2025-03-03T19:41:26Z

experimental/ir/doc.go

I'm a little confused by this commit (2553cb4). Why do the files need an ir_ prefix to group together? What other prefixes do you expect to see in this package in the future?

I expect there to be a lower_ prefix for the AST->IR lowering code.

mcy added 2 commits January 8, 2025 11:37

take a *E in Wrap functions

b88e58b

add the ir package

826f69a

mcy requested a review from jhump January 8, 2025 17:57

jhump reviewed Jan 9, 2025

View reviewed changes

mcy added 3 commits January 10, 2025 12:09

cr

bc5296b

Merge remote-tracking branch 'origin/main' into mcy/ir2

74a8e25

merge fixes

7c2cae6

mcy requested a review from jhump February 18, 2025 20:38

mcy added 6 commits February 18, 2025 12:43

lint

955433c

preemptively add an ir_ prefix to file names so they sort together

2553cb4

lint

7c93787

capitalize package

317326c

Merge remote-tracking branch 'origin/main' into mcy/ir2

019f129

merge fixes

79634e3

jhump approved these changes Mar 3, 2025

View reviewed changes

mcy added 2 commits March 3, 2025 13:59

cr

224b7c2

lint

e554205

mcy enabled auto-merge (squash) March 3, 2025 22:01

mcy merged commit 0ef451e into main Mar 3, 2025
6 checks passed

mcy deleted the mcy/ir2 branch March 3, 2025 22:05

		// Type is the element type of this value. For arrays, this is the type of the
		// element; Protobuf does not have a concept of an "array type".

		// We can't just embed With into File, because then it would be an exported
		// field.

Conversation

mcy commented Jan 8, 2025

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants