Skip to content

Add an intermediate representation for the new compiler#419

Merged
mcy merged 13 commits intomainfrom
mcy/ir2
Mar 3, 2025
Merged

Add an intermediate representation for the new compiler#419
mcy merged 13 commits intomainfrom
mcy/ir2

Conversation

@mcy
Copy link
Member

@mcy mcy commented Jan 8, 2025

This PR is the first draft of the intermediate representation for symbol resolution, type inference (of option types) and eventually, generation of FileDescriptorProtos.

This PR is essentially just data structure boilerplate: it does not contain everything (for example, there is no support for services yet) nor does it include code for constructing from an AST or an imported FDP. The purpose of this PR is to get the first draft checked in so that followup PRs can iterate on the API as symbol resolution, type checking, and option resolution are implemented.

@mcy mcy requested a review from jhump January 8, 2025 17:57
"github.com/bufbuild/protocompile/internal/iter"
)

var _ iter.Seq[int] // Mark iter as used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is the simplest way to do this. But it would be nice to just emit only the imports that are actually needed. You could emit:

import (
	"fmt"

And then loop through methods and if/when you see "all" emit


	"github.com/bufbuild/protocompile/internal/iter"

And emit the closing ) after that.

Another option might be not emit this line but still unconditionally emit the imports, and then always run make lintfix after go generate in the Makefile, which should fix up formatting, including organizing imports.

But I think just emitting the correct imports is fine for now.

Another option is, if the All method isn't used in very many places, just make them hand-written instead of generated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's really worth the complexity to get the imports right. No one is going to read this file (or should, at least).


// The field is singular and presence is equivalent to the field having its
// zero value. This corresponds to a non-message field not marked "optional"
// in proto3.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or editions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but this comment is mostly meant to be giving an example, not be exhaustive.

Comment on lines 104 to 105
// Type is the element type of this value. For arrays, this is the type of the
// element; Protobuf does not have a concept of an "array type".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may behoove us to provide an "array" type, even though Protobuf descriptors do not. That could enable simpler code that inspects and works with types, to treat them more uniformly. Though I think doing so requires adding another byte to Type, to put an isArray flag there 🤷 (which may not be desirable since, for alignment, it means likely adding 4 or 8 bytes to the Type struct in practice...).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am slightly skeptical of that. My gut instinct would be to not add that for now, and see how painful it makes things.

Comment on lines 146 to 148
// Unknown means this is a message type.
// Enum types are predeclared.Int32.
ty predeclared.Name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seems unfortunate, that you can't inspect the enum type of an enum value. Maybe instead encode enums/messages/groups as negative values here and map to a separate enum for NamedType?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, one thing that you could do is say that predeclared.Unknown represents "is a user type", and determining whether it's a message or enum requires looking at the type.


type rawMessageValueEntry struct {
key intern.ID
field int32 // Index of the field within rawMessageValue.ty; -1 if unknown.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need more here to handle extensions. I think you could instead include a ref[rawField] here, instead of this field, and then interpret its ref.file field as an index into fields of rawMessageValue.typ if it is negative but not -1. (And in such cases, ref.ptr is ignored.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the representation for an expanded Any will be kind of funky, but there's really no avoiding that, and I think this can handle that correctly. We just have to resolve a type URL "field name" to a ref[rawType] and associate that with the field's value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that for any we need to have both a ref[rawField] and ref[rawType]. Which is unfortunate, since most messages will only contain an int32.

Perhaps we'll wind up having different kinds of entries... we'll see. I'm kind of trying to punt on extensions in message literals for now, since it's a niche feature and I want to prioritize getting things working first.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TODO only mentions Any and not extensions. My original comment was about extensions -- you need more than just the tag number for those since you'll also need to know the file from which it came.

Also, FWIW, you shouldn't need both a ref[rawField] and ref[rawType] because you never need both at the same type. A field in a message literal is exactly one of:

  1. A normal field
  2. An extension field
  3. A reference to a type URL for expanded Any syntax.

So you could overload a single 64-bit value (such as a ref[rawField]). Extension numbers are only 31 bits (usually 29, but 31 for the case of message set wire format). Normal field numbers are only 29 bits. So the scheme I mentioned above would work, and you could even interpret the reference as a type instead of field when the top two highest bits are set. (That allows over a billion types per file, which seems like it should be plenty...).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out I was being dumb. All message literals know their type, so if you write { [google.com/foo.Bar]: { ... } } that will be typed correctly: the value's field's type will be Any, but the value as a message will be foo.Bar.

Changing to a ref[rawField] makes everything work out.

@mcy mcy requested a review from jhump February 18, 2025 20:38

isArray bool // If set, value is an arena.Pointer[[]rawValueBits].
isUninterpretedPath bool // If set, value is an intern.ID.
isUninterpretedAggregate bool // If set, value is an intern.ID.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should be an arena.Pointer[string]?

Agreed that it's almost certainly not worth interning the aggregate value. I'm not even sure we need to compute it/store it all. I thin it would be better to leave value unset/nothing and rely solely on the AST expression node to interpret. If we ever do need the aggregate string (which would only be if we were to populate a DescriptorProto before/without interpreting options), it could be computed on-demand from the AST node.

Comment on lines 105 to 106
// We can't just embed With into File, because then it would be an exported
// field.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My issue with this naming convention is that it's unclear what bug number is really referring to. I usually see this in test code to indicate an issue in the owning repo (like a protocompile bug number). I think it makes the code more readable to name it semantically and then put a full link to the bug at its definition.


type rawMessageValueEntry struct {
key intern.ID
field int32 // Index of the field within rawMessageValue.ty; -1 if unknown.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TODO only mentions Any and not extensions. My original comment was about extensions -- you need more than just the tag number for those since you'll also need to know the file from which it came.

Also, FWIW, you shouldn't need both a ref[rawField] and ref[rawType] because you never need both at the same type. A field in a message literal is exactly one of:

  1. A normal field
  2. An extension field
  3. A reference to a type URL for expanded Any syntax.

So you could overload a single 64-bit value (such as a ref[rawField]). Extension numbers are only 31 bits (usually 29, but 31 for the case of message set wire format). Normal field numbers are only 29 bits. So the scheme I mentioned above would work, and you could even interpret the reference as a type instead of field when the top two highest bits are set. (That allows over a billion types per file, which seems like it should be plenty...).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused by this commit (2553cb4). Why do the files need an ir_ prefix to group together? What other prefixes do you expect to see in this package in the future?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect there to be a lower_ prefix for the AST->IR lowering code.

@mcy mcy enabled auto-merge (squash) March 3, 2025 22:01
@mcy mcy merged commit 0ef451e into main Mar 3, 2025
6 checks passed
@mcy mcy deleted the mcy/ir2 branch March 3, 2025 22:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants