Component Translation in HIR2 #360

bitwalker · 2024-12-14T20:33:03Z

bitwalker
Dec 14, 2024
Maintainer

This write-up is the result of trying to pin down the details of how components must be represented in the IR, and how that representation will be lowered into Miden Assembly. This representation must not only be suitable for translation from Wasm, but it must preserve enough of the original component structure (as described in WIT) so that the resulting Miden package can be processed as a Wasm component. Additionally, it must preserve all of the necessary metadata so that we can correctly emit lifting/lowering glue code at call sites which cross component (and Miden context) boundaries.

In this write-up, I document some assumptions and design considerations, as well as how I believe component translation should proceed (from Wasm to IR primarily, but to MASM as well).

Note that this write-up mixes three things: actual details of how components are represented in Wasm; conceptual details of how that concrete representation corresponds to components described in WIT; and my personal opinion on how we should translate them into the IR and ultimately into MASM packages. As a result, keep in mind that some statements below are made matter-of-factly but rest on implied assumptions or design considerations that go beyond what exists in the Wasm Component Model or in the current frontend. If you have any specific questions about a particular point, just leave a comment and I will follow up with more detail.

First, I would like to start by laying out some background/design considerations that will be relevant when we get into implementation details.

Producers

There are three classes of producer (i.e. compiler or other tool) of Wasm components that we need to consider in any design decisions we make:

First-party frontends, i.e. those languages/compilers for which we are maintaining the tooling. This is essentially just rustc for now. We can make assumptions about what is produced by these frontends, and in some cases may be able to even dictate how components are produced. I'd describe this as the second-most important producer class of interest at the moment.
Miden packages. These are not strictly speaking Wasm components, however the format in which a Miden package will ultimately be described is in Component Model terms, via WIT. The actual code of a package of course is MAST, and metadata is provided to map the raw MAST to Miden Assembly concepts, i.e. modules and procedures, but at the current point in time, there is no direct connection between a WIT component description and the underlying MASM modules/procedures. Furthermore, tooling does not yet exist for creating packages directly from a Miden Assembly source project, and such tooling will need to exist. What this means for this discussion is that we must have a convention for structuring Miden Assembly modules/procedures that corresponds to the component structure we expect to be described in WIT. For obvious reasons, this is the most important "producer" of components, in that it is the substrate for all distribution of code in the Miden ecosystem - so any constraints imposed on components due to requirements of packaging necessarily limit what we can accept from other producers.
Third-party frontends, i.e. languages/compilers/tools maintained by someone else, using Wasm as their output format, and leveraging midenc to compile to a Miden package. There may also be frontends that directly emit their own Miden packages, but we can assume for this discussion that we interact with those the same as any Miden package we emit. This class is primarily interesting from the perspective of what we're allowed to assume about the Wasm we're given. We are allowed to dictate constraints on what we'll accept, but we should not do so unless there is a fundamental reason why the constraint is necessary. As an example, we probably should not assume anything too specific about the exact way in which a component is instantiated. On the flip side, we should be able to dictate certain things, such as requiring there to be a single top-level component, and that dependencies on other Miden packages be represented as component-level imports, as well as what things we allow to be exported/imported at the component level.

Packaging

As a practical matter, it is essential that the lowering of Wasm components to Miden Assembly preserve enough component structure so that the assembler is able to reason about components in terms of Miden Assembly primitives (i.e. libraries, namespaces, modules, procedures). There are a few reasons for this:

All existing Miden projects are maintained as handwritten Miden Assembly sources. There must be a path for migrating these projects so that they can be packaged as components. It follows then, that the resulting project structure must have some kind of intuitive mapping to components.
The compiler itself is constrained by needing to emit Miden Assembly artifacts to feed into the assembler. The closer the correspondence between Wasm component-level concepts and MASM artifacts, the better, not only to avoid confusion about how a given component will be represented in MASM, but to avoid unnecessary complexity in the compiler itself.
Both the VM and the compiler must be able to load a component description from a package, and map that to the underlying MAST. Right now, the only way to reason about MAST is in terms of MASM libraries, modules, and procedures. We don't control the WIT format, and forking it isn't practical, so either we need a new WIT-like format that we use specifically for packages that allows us to be more precise about how the described component maps to MAST, or we rely on convention. My approach in this document is based on my belief that convention will be sufficient.

The bottom line though, is that packaging dictates the structure of the Miden Assembly we produce, which in turn puts significant pressure on us to structure the IR in such a way that the lowering to MASM is straightforward and preserves all the metadata needed for packaging. The more distance there is between the IR and the resulting MASM, the more complicated compilation will be. Conversely, the more distance there is between Wasm components and IR-level components, the more work must be done in the frontend to translate them, and the stricter we will need to be about the Wasm we accept.

Miden Assembly Components

So given the above, here's my take on the correspondence we should aim for between Wasm components and Miden Assembly primitives (what I will generally refer to from now on as a "Miden component"). I will tie them together using WIT terminology to make things a little clearer, and also provide a way to correlate this to how a user will describe their component in WIT.

Worlds and Interfaces

A WIT world corresponds to a top-level Wasm component. While a world can directly export functions, in practice worlds primarily export interfaces. A world corresponds to a MASM Library (or Program, in cases where the component represents something executable). The component Wasm currently emitted by rustc already maps cleanly to this model, i.e. we get a single top-level component which exports component instances corresponding to each interface of the world described in WIT.

In terms of Miden Assembly source projects, we can represent a world quite easily. Let's use the example of a WIT package called miden:base, version 1.0.0, which exports two interfaces core and tx, and a top-level function called init.

Note

There are some issues with the current implementation of LibraryPath in Miden Assembly that I'm going to gloss over a bit here. I think we should change the syntax of paths, and importantly, I think we should allow for an optional version component to the path, which always comes at the end when present, to allow for disambiguating modules/procedures when different versions of the same component are present in the compilation graph. The new syntax would support these variations:

package@version
package#item@version
package/module@version
package/module#item@version

Where item refers to anything exportable from a module, primarily procedures, but other items could be supported in the future.

An additional assumption here is that package@version, as a path, implicitly refers to the root module of the package. Thus package#item@version refers to an export item defined in mod.masm of the project.

Examples:

miden:base@1.0.0
miden:base#<item>@1.0.0
miden:base/<module>@1.0.0
miden:base/tx#<item>@1.0.0
miden:base/foo::bar#baz
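To make the proposed grammar concrete, here is a minimal parsing sketch in Rust. The `ItemPath` type and the `parse_item_path` function are hypothetical names made up for illustration; they are not the existing `LibraryPath` API, and the version is kept as an opaque string rather than parsed as semver.

```rust
/// Hypothetical parsed form of the proposed path syntax.
/// This is NOT the current `LibraryPath` type from Miden Assembly.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ItemPath {
    /// Package name, e.g. `miden:base`; used as the namespace.
    pub package: String,
    /// Module path under the package; `None` means the root module (`mod.masm`).
    pub module: Option<String>,
    /// Exported item (usually a procedure), if the path names one.
    pub item: Option<String>,
    /// Optional trailing version, kept as an opaque string in this sketch.
    pub version: Option<String>,
}

/// Parse `package[/module][#item][@version]`.
pub fn parse_item_path(input: &str) -> Result<ItemPath, String> {
    // The version, when present, always trails the rest of the path.
    let (rest, version) = match input.rsplit_once('@') {
        Some((rest, v)) if !v.is_empty() => (rest, Some(v.to_string())),
        Some(_) => return Err(format!("empty version in `{}`", input)),
        None => (input, None),
    };
    // The item, when present, follows `#`.
    let (rest, item) = match rest.split_once('#') {
        Some((rest, i)) if !i.is_empty() => (rest, Some(i.to_string())),
        Some(_) => return Err(format!("empty item in `{}`", input)),
        None => (rest, None),
    };
    // Everything before the first `/` is the package; the remainder is the module path.
    let (package, module) = match rest.split_once('/') {
        Some((p, m)) if !m.is_empty() => (p, Some(m.to_string())),
        Some(_) => return Err(format!("empty module in `{}`", input)),
        None => (rest, None),
    };
    if package.is_empty() {
        return Err(format!("empty package in `{}`", input));
    }
    Ok(ItemPath {
        package: package.to_string(),
        module,
        item,
        version,
    })
}
```

Because the version is stripped from the right before anything else is examined, an unversioned path such as `miden:base/foo::bar#baz` parses the same way as its versioned form, just with `version` set to `None`.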

With that out of the way, let's get back to how our example Miden Assembly component, called miden:base, would be structured in source form:

The root directory of the project corresponds to the miden:base namespace, which is derived from the package name (and so will be used as the resulting LibraryNamespace in MASM terms)
Top-level items (if any) are placed in mod.masm in the project root; this is where the init function will be exported. This module is not explicitly named, i.e. its LibraryPath will consist only of the LibraryNamespace (and version in my proposed changes) with no additional path components.
Each interface exported from the world corresponds to a MASM module. So in our example, we'd expect to see core.masm and tx.masm source files, defining modules of the same name. Their LibraryPath would be composed of the miden:base namespace, and their module name, e.g. miden:base/tx. These modules would be presumed to export procedures corresponding to the functions exported from their respective WIT interface. Note, however, that these modules do not have to contain the actual definitions of those procedures, they could be re-exported from elsewhere in the project - the only requirement is that the interface is fully accounted for in that module.
The version string could either be made part of the namespace, or a new special component of any given LibraryPath. We'll ignore it for now.

When this project is assembled to MAST, we'll end up with the following metadata about that MAST:

Three modules: miden:base, miden:base/core, and miden:base/tx
Some number of procedures corresponding to exports of those modules. For example, the top-level init function would be exported with a LibraryPath where the namespace is miden:base and with a single path component init, plus the version component when fully-qualified.

With that metadata, and given the WIT for the package, we can use nothing other than the conventions described above to look up the MAST root corresponding to some function declared in the WIT. All in all, this is pretty straightforward, and gives us a clean mapping from WIT to MAST.
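As a rough illustration of that lookup, the sketch below builds the procedure path implied by the convention and uses it as a key into a table of MAST roots. Everything here is an assumption for illustration: `WitFunction`, `procedure_path`, and `lookup_mast_root` are hypothetical names, the table is a plain `HashMap`, and `MastRoot` is a placeholder for whatever digest type the package metadata actually uses.

```rust
use std::collections::HashMap;

/// Placeholder for a MAST root digest; the real type lives in the VM crates.
pub type MastRoot = [u64; 4];

/// A function as it would be named from WIT (hypothetical helper type).
pub struct WitFunction<'a> {
    /// Package name, e.g. `miden:base`.
    pub package: &'a str,
    /// Interface name; `None` for a function exported by the world itself.
    pub interface: Option<&'a str>,
    /// Function name as declared in WIT.
    pub name: &'a str,
    /// Package version, if known.
    pub version: Option<&'a str>,
}

/// Build the procedure path implied by the convention, in the proposed syntax:
/// `package[/interface]#name[@version]`.
pub fn procedure_path(f: &WitFunction) -> String {
    let mut path = String::from(f.package);
    if let Some(interface) = f.interface {
        path.push('/');
        path.push_str(interface);
    }
    path.push('#');
    path.push_str(f.name);
    if let Some(version) = f.version {
        path.push('@');
        path.push_str(version);
    }
    path
}

/// Look up the MAST root for a WIT function using only the naming convention.
pub fn lookup_mast_root(
    table: &HashMap<String, MastRoot>,
    f: &WitFunction,
) -> Option<MastRoot> {
    table.get(&procedure_path(f)).copied()
}
```

A top-level function maps to a path with no module segment, which matches the assumption that the root module's path consists only of the namespace.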

Wasm -> IR Translation

So now that we know what kind of MASM structure we want to end up with, we need to figure out how, in the compiler, we can receive a Wasm component as input, recover the information necessary to reason about that component in terms of WIT worlds and interfaces, and lower that to HIR. We must additionally work out what dependencies are required by the component, and how to resolve them. The former is what we're interested in here. Luckily for us, recovering the basic WIT structure is fairly straightforward:

Worlds and Interfaces

The Wasm component we are given by definition consists of a top-level Wasm component that corresponds to the WIT world from which the component was derived. Thus, any exports of the top-level component must correspond to items exported from the WIT world. In practice, these exports will almost always be of kind instance (a component instance, corresponding to a WIT interface), func (a top-level component function using the Component Model ABI), or type (which we can ignore for purposes of this discussion).

As a pre-requisite step, we must evaluate the instantiation of items in the top-level component, in order to work out not only what specifically is being exported (and with what names), but also to work out call site metadata for call sites which require ABI lifting/lowering, and the details of that. This is of particular importance, as lifting/lowering is completely implicit, so in order to determine when and what is needed, we must preserve not only the actual target of a function call (after linking the component and resolving the real callee), but the fact that the call passes through a canon lift or canon lower declaration.

I've been able to validate the above assumptions (so far) by looking at the Wasm generated by various tests in our test suite, and the sources from which it was derived. It appears that we should be able to safely rely on the fact that a top-level component instantiates and exports one or more component instances that correspond to WIT interfaces.

Functions

Component-level function exports, so far as I have been able to substantiate, are always the product of aliasing an export from a core module, lifting it into the Canonical ABI using canon lift, and then exporting the synthetic function defined by the canon lift declaration. In other words, they do not simply represent an alias of some core Wasm function definition, but rather an actual function definition that is intended to be synthesized or provided by the Wasm runtime, and when called, implements the lowering/lifting of function arguments and results, respectively, to adapt a core Wasm function to the high-level Canonical ABI. As a result, these synthetic functions must be preserved in the IR in some form.

It should also be noted that function exports from a component have a different name than the actual underlying core function definition. We must preserve both names, since the core function could be directly referenced by other functions in the same module (or in a sibling module), as well as via the component-level export name. I believe this is another reason why preserving the synthetic functions may be valuable.

Lastly, it seems to me that we should use the synthetic functions to hold any actual lowering/lifting code needed. This means that, in cases where ABI lowering/lifting code is required, the synthetic function declaration corresponds to an actual function definition that is generated by the compiler. In cases where the ABI does not require any glue code (e.g. because the arguments are all scalar integral values), then the synthetic function declaration would correspond to a re-export of the referenced core function definition once lowered to MASM.
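A minimal sketch of that decision, under simplifying assumptions: only a handful of Canonical ABI types are modeled, "scalar integral" is taken to mean the integer variants listed here, and all names (`CanonType`, `Signature`, `SyntheticBody`, `plan_synthetic`) are hypothetical rather than existing compiler types.

```rust
/// Simplified stand-in for Canonical ABI value types; only a few are modeled.
#[derive(Debug, Clone, PartialEq)]
pub enum CanonType {
    U32,
    S32,
    U64,
    Str,
    List(Box<CanonType>),
}

impl CanonType {
    /// In this sketch, scalar integral values are assumed to need no glue code.
    pub fn is_scalar(&self) -> bool {
        matches!(self, CanonType::U32 | CanonType::S32 | CanonType::U64)
    }
}

/// Component-level signature of a lifted function.
pub struct Signature {
    pub params: Vec<CanonType>,
    pub results: Vec<CanonType>,
}

/// What the synthetic (canon lift) function becomes when lowered to MASM.
#[derive(Debug, PartialEq)]
pub enum SyntheticBody {
    /// No glue needed: lower as a re-export of the core function.
    ReExport { core_name: String },
    /// Glue needed: the compiler generates a definition wrapping the core function.
    Generated { core_name: String },
}

/// Decide how the synthetic function for `core_name` should be materialized.
pub fn plan_synthetic(sig: &Signature, core_name: &str) -> SyntheticBody {
    let all_scalar = sig
        .params
        .iter()
        .chain(sig.results.iter())
        .all(CanonType::is_scalar);
    if all_scalar {
        SyntheticBody::ReExport { core_name: core_name.to_string() }
    } else {
        SyntheticBody::Generated { core_name: core_name.to_string() }
    }
}
```

Either way the synthetic function keeps its own identity in the IR, so the component-level export name and the core function name both survive.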

To elaborate on how this is all expressed in Wasm component terms:

A function is exported from the component with (export "foo" (func <function_index>) (func (type <type_index>)))
The function referenced by <function_index> is a component function declared with (func (type <type_index>) (canon lift (core func <core_function_index>)))
The function referenced by <core_function_index> is a core function, exported from a core module, brought into scope with (alias core export <core_module_index> "interface@version#foo" (core func <core_function_index>))

In a downstream component, these component-level exports are consumed by:

Importing the component function into the component index space with (import "foo" ..)
Declaring a core function at component-level using (core func (canon lower (func <function_index>)))
Instantiating the core module and providing the core function synthesized in the previous step as the definition for an import of the form (import "package/module@version" "foo" (func <core_function_index>))

As you can see, the component-level function declarations via canon lift and canon lower are important, and represent something more than just an alias of the underlying core function definition.
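The export chain described above (export, then canon lift, then core alias) can be sketched as a walk over two index spaces. This is a toy model with hypothetical types, not the frontend's actual data structures; it only shows that resolving the real callee and remembering that the call passes through a canon lift are both recoverable in one pass.

```rust
/// Entry in the core function index space of a component (toy model).
pub enum CoreFunc {
    /// `(alias core export <module> "<name>" (core func ..))`
    AliasExport { module: u32, name: String },
    /// `(core func (canon lower (func <idx>)))`
    Lower { func: u32 },
}

/// Entry in the component function index space (toy model).
pub enum ComponentFunc {
    /// `(func (canon lift (core func <idx>)))`
    Lift { core_func: u32 },
    /// `(import "<name>" (func ..))`
    Import { name: String },
}

/// Result of resolving a component-level export back to its core definition.
pub struct ResolvedExport {
    /// Name under which the function is exported from the component.
    pub export_name: String,
    /// Index of the core module that defines the function.
    pub core_module: u32,
    /// Name of the core export, e.g. `interface@version#foo`.
    pub core_name: String,
}

/// Follow export -> canon lift -> core alias. Returns `None` if the export is
/// unknown or does not follow that shape.
pub fn resolve_export(
    exports: &[(String, u32)],
    funcs: &[ComponentFunc],
    core_funcs: &[CoreFunc],
    name: &str,
) -> Option<ResolvedExport> {
    let func_idx = exports.iter().find(|e| e.0 == name).map(|e| e.1)?;
    match funcs.get(func_idx as usize)? {
        ComponentFunc::Lift { core_func } => match core_funcs.get(*core_func as usize)? {
            CoreFunc::AliasExport { module, name: core_name } => Some(ResolvedExport {
                export_name: name.to_string(),
                core_module: *module,
                core_name: core_name.clone(),
            }),
            // A lift of a lowered function is not handled in this sketch.
            CoreFunc::Lower { .. } => None,
        },
        // An exported import is not a lifted core function.
        ComponentFunc::Import { .. } => None,
    }
}
```

Note that `ResolvedExport` carries both names, which is the property argued for above: the core name stays addressable from sibling functions while the export name is what downstream components import.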

IR Components

Fundamentally, I think we want the IR to represent things in a more WIT-like manner, i.e.:

A Component represents the top-level component/world, and ultimately corresponds to a Miden package. A Component consists of one or more Interface or Module items.
An Interface represents component instances exported from the top-level component, and corresponds to the original WIT interfaces. Functions in an Interface can be declarations or definitions, but in both cases always use the Canonical ABI.
A Module represents a core module within a component. Functions in a Module always use the core Wasm ABI (or Miden ABI).

Interfaces are lowered to MASM by either re-exporting a procedure from a sibling Module, or by lowering the function definition in the Interface itself (which presumably internally references a procedure in a sibling Module). Interfaces consisting solely of declarations can be used to represent information about external dependencies in the IR.

Modules are lowered to MASM 1:1, as you'd expect.
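To make the shape of this proposal easier to discuss, here is a minimal data-structure sketch. The type and field names are hypothetical and are not the HIR2 definitions; the Miden ABI alternative for modules is left out, and function bodies are reduced to a single `is_definition` flag.

```rust
/// Which ABI a function uses (the Miden ABI option for modules is omitted here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Abi {
    Canonical,
    Core,
}

pub struct Function {
    pub name: String,
    pub abi: Abi,
    /// `false` for a declaration, `true` for a definition.
    pub is_definition: bool,
}

/// Corresponds to a WIT interface / exported component instance.
pub struct Interface {
    pub name: String,
    pub functions: Vec<Function>,
}

/// Corresponds to a core module within the component.
pub struct Module {
    pub name: String,
    pub functions: Vec<Function>,
}

pub enum Item {
    Interface(Interface),
    Module(Module),
}

/// Corresponds to the top-level component/world, and ultimately a Miden package.
pub struct Component {
    pub name: String,
    pub items: Vec<Item>,
}

impl Interface {
    /// An interface consisting solely of declarations describes an external dependency.
    pub fn is_external(&self) -> bool {
        !self.functions.is_empty() && self.functions.iter().all(|f| !f.is_definition)
    }
}

impl Component {
    /// Check the invariants stated above: at least one item, Canonical ABI in
    /// interfaces, core ABI in modules.
    pub fn validate(&self) -> Result<(), String> {
        if self.items.is_empty() {
            return Err(format!("component `{}` has no items", self.name));
        }
        for item in &self.items {
            let (kind, name, functions, expected) = match item {
                Item::Interface(i) => ("interface", &i.name, &i.functions, Abi::Canonical),
                Item::Module(m) => ("module", &m.name, &m.functions, Abi::Core),
            };
            for f in functions {
                if f.abi != expected {
                    return Err(format!(
                        "{} `{}`: function `{}` has ABI {:?}, expected {:?}",
                        kind, name, f.name, f.abi, expected
                    ));
                }
            }
        }
        Ok(())
    }
}
```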

This does not get into how data segments and global variables are handled; suffice it to say that those are less problematic than the items mentioned here, and we have more freedom in how to handle them.

--

NOTE: I have a few follow up thoughts/notes as I've explored things further, but the above represents more or less the direction I'm planning on taking things in with this, depending on the outcome of this discussion.

greenhat · 2024-12-16T06:42:32Z

greenhat
Dec 16, 2024
Collaborator

Sounds great!

> There are some issues with the current implementation of LibraryPath in Miden Assembly that I'm going to gloss over a bit here. I think we should change the syntax of paths, and importantly, I think we should allow for an optional version component to the path, which always comes at the end when present, to allow for disambiguating modules/procedures when different versions of the same component are present in the compilation graph. The new syntax would support these variations:

Is there a particular reason why the version component should always come at the end? In Wasm CM if the function name is specified in the path, the version is specified before the function name, i.e. namespace:package/interface@version#function. It'd be nice to keep our naming scheme in sync with Wasm CM.


bitwalker Dec 17, 2024
Maintainer Author

> Is there a particular reason why the version component should always come at the end?

Yes, primarily because it should be possible to optionally specify a version to disambiguate a reference to a procedure, while allowing the assembler to infer the version for symbols in the current package being assembled. In general, Miden Assembly projects do not need to use versions at all, but to the extent that they wish to do so, adding it at the end is the least invasive, syntactically.

Something to bear in mind is that symbols in Miden Assembly do not look like the Component Model identifiers, even though they contain essentially the same information, i.e. it is possible for us to print a LibraryPath like a Component Model identifier, but the syntax for these paths is not the same in Miden Assembly itself.

Fundamentally, there are three components to these identifiers:

Namespace (e.g. the package name in WIT). All items exported from a package should share this namespace. This has a clean conceptual mapping from the Component Model to Miden Assembly today. To be clear, everything before the / in the Component Model identifier is being considered part of the namespace here.
Name (e.g. interfaces or other top-level exports in WIT). All items exported from a package must have a unique name within the namespace. In Miden Assembly, we must also support referencing internal symbols using absolute paths, which may or may not be nested under the top-level package namespace. For example, something like miden:base/foo::bar#baz. This would correspond to a module whose source file was placed in <root>/foo/bar.masm. In this example, the bar module itself is not exported from the package.
Version (e.g. the package version in WIT, technically optional). In Miden Assembly sources, these are essentially useless, and we definitely do not want to require them for referencing symbols within the same project. Where this component comes into play though, is when merging together MAST that includes procedures from different versions of the same package. Versioned symbols allow us to safely disambiguate callers. The compiler will always use versioned symbols when available, and the assembler will attach version information to symbols internally as they are resolved. We can then define rules for how to resolve unversioned symbols to a versioned symbol (if resolving a procedure by name, and multiple versions of the symbol are present), but I suspect that we will not need this except in the assembler internals.

Basically, versions are not just another segment of the symbol name, but a distinct, semantically significant property of a symbol. In other words, one could match a symbol by name while ignoring the version. By making the version always trail the symbol, it is more obvious that it is vestigial and not required. Additionally, this is how versioned symbols in other environments work as well - e.g. if dynamically linking symbols from multiple versions of libc into a program, you'll see those symbols suffixed with the version they came from, with something like @2.0.0.
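Here is a small sketch of what treating the version as a separate property could look like during symbol resolution. The names (`Symbol`, `Resolution`, `parse_symbol`, `resolve`) are hypothetical, and the rule used for unversioned references (succeed if exactly one version is present, otherwise report ambiguity) is only one possible policy, assumed here for illustration.

```rust
/// A symbol whose version is a separate property rather than part of the name.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Symbol {
    pub name: String,
    pub version: Option<String>,
}

/// Split a trailing `@version` off a symbol, if present.
pub fn parse_symbol(s: &str) -> Symbol {
    match s.rsplit_once('@') {
        Some((name, version)) => Symbol {
            name: name.to_string(),
            version: Some(version.to_string()),
        },
        None => Symbol { name: s.to_string(), version: None },
    }
}

/// Outcome of resolving a reference against the known definitions.
#[derive(Debug, PartialEq)]
pub enum Resolution<'a> {
    Found(&'a Symbol),
    /// Several versions match an unversioned reference.
    Ambiguous(Vec<&'a Symbol>),
    NotFound,
}

/// Resolve `reference` against `defs`.
/// A versioned reference must match name and version exactly; an unversioned
/// reference matches by name alone (assumed policy, see lead-in).
pub fn resolve<'a>(defs: &'a [Symbol], reference: &Symbol) -> Resolution<'a> {
    let candidates: Vec<&Symbol> = defs
        .iter()
        .filter(|d| d.name == reference.name)
        .filter(|d| reference.version.is_none() || d.version == reference.version)
        .collect();
    match candidates.len() {
        0 => Resolution::NotFound,
        1 => Resolution::Found(candidates[0]),
        _ => Resolution::Ambiguous(candidates),
    }
}
```

With the version stored separately, matching by name while ignoring the version is just the first filter; the second filter is only consulted when the reference carries a version.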

I do want to ensure that we can faithfully translate Component Model identifiers into Miden Assembly and back into the Component Model (i.e. roundtripping without loss of information), but we must also consider how any changes affect existing and future Miden Assembly source projects. There is also the distinction between what information can be represented in the AST, and what information can be represented syntactically, as well as between what we can assume the compiler emits as AST, and what the MASM parser emits as AST. The latter is much more user-oriented/flexible (and should be for now, by design). In the future we can be stricter, require more explicit syntax, once MASM is no longer the predominant way in which people build for the Miden VM.
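As a quick check that the two orderings carry the same information, the following sketch converts between a Component Model style identifier (`namespace:package/interface@version#function`) and the trailing-version form proposed for Miden Assembly. The function names are hypothetical, and the sketch assumes that neither `@` nor `#` appears inside a name or version.

```rust
/// Split off the first `sep` in `s`, returning the head and the optional tail.
fn split_first(s: &str, sep: char) -> (&str, Option<&str>) {
    match s.split_once(sep) {
        Some((head, tail)) => (head, Some(tail)),
        None => (s, None),
    }
}

/// `ns:pkg/iface@version#func` -> `ns:pkg/iface#func@version`
pub fn cm_to_miden(id: &str) -> String {
    // In the Component Model form the function trails the version, so peel it first.
    let (head, func) = split_first(id, '#');
    let (path, version) = split_first(head, '@');
    let mut out = String::from(path);
    if let Some(f) = func {
        out.push('#');
        out.push_str(f);
    }
    if let Some(v) = version {
        out.push('@');
        out.push_str(v);
    }
    out
}

/// `ns:pkg/iface#func@version` -> `ns:pkg/iface@version#func`
pub fn miden_to_cm(path: &str) -> String {
    // In the proposed Miden form the version trails everything, so peel it first.
    let (head, version) = match path.rsplit_once('@') {
        Some((h, v)) => (h, Some(v)),
        None => (path, None),
    };
    let (base, func) = split_first(head, '#');
    let mut out = String::from(base);
    if let Some(v) = version {
        out.push('@');
        out.push_str(v);
    }
    if let Some(f) = func {
        out.push('#');
        out.push_str(f);
    }
    out
}
```

Unversioned identifiers are identical in both forms, so the conversion only reorders segments when a version is actually present.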

So mostly, I'm looking to thread the needle here: allow us to represent Component Model things in Miden Assembly, and Miden Assembly things in the Component Model, and ensure that we can describe both in terms of the other, or more precisely, in WIT.
