TODO

Lessons from Nyxt

Nyxt continuous testing/packaging
Good Common Lisp Documentation Patterns
- Dynamic/special/global variables with *earmuffs*.
- Constants with +plus-signs+.
- If internal functions and variables must be exported, wrap them with %percent-signs%.
- Predicates with -p suffix.
- README layout:
  1. Title and description - General information about the project. Goal and features.
  2. Getting started - How to install and load.
  3. Examples - Short code snippets for the most common use-cases.
  4. How it works - Explanations of the underlying model and implementation details.
- Function/method doc-strings:
```
"First line as a single sentence less than 80 chars long—for readability.
Return VALUE or
- VALUE-1
- VALUE-2
- etc.

ARGUMENT-1 represents ...
BOOLEAN-ARGUMENT mandates whether ...

See also `another-symbol'."
        
```
- Generics should come from a defgeneric form for increased inspectability and reliability of access. Methods can be undocumented, because they are generally just implementations for a documented generic. Method documentation is where exceptional implementation details end up. https://aartaka.me/my-generics.html https://lispcookbook.github.io/cl-cookbook/clos.html#methods https://www.reddit.com/r/Common_Lisp/comments/1j6u0wc/comment/mgrmnd2
- Give very clear documentation about macros. Include many examples for all common cases in the documentation!

Look at Others

These are some other HDL langauges that I may want to look at.

HIRL HDL
SAIL2 - Single source of truth for ISA. From formal & functional specification, can be:
- Mechanically checked
- Generate human-readable documentation
- Generate boilerplate for simulators & compilers & HDLs
VHDLisp
DLang has a bunch of cool features. Here are 10 cool features.

Interesting Common Lisp Libraries

Many of these libraries are listed here.

Clingon - Command Line argument parser
cl-readline - Add readline functions to REPL
Linedit - Readline-style library
BordeauxThreads - Portable threading
Sento - Message passing actors, similar to Erlang
ACL2 - Logic and programming language for modeling & proving
closer-mop - Compatibility layer for using CLOS’s MOP
postmodern - Interact with PostgreSQL
binary-structures - Make parsing & interacting with binary files/structures easier
Parachute - Testing framework that Kandria uses
trivial-features - Make interacting with *FEATURES* more portable
- *FEATURES* has things like which platform or ISA you are running on.
unix-opts - Make your command-line interface to your program a Unix one.
- I don’t know how this compares against Clingon yet.
cl-difflib - Produce diffs between sequences
- This could be useful if I want to be able to diff ASTs or something.
log4cl-extras - Extra goodies for log4cl logging, JSON formatting for instance.

More interesting libraries can be found on 40 Ant’s Lisp Project of the Day.

Compilation Targets

A variety of compilation targets should be available, either for debugging or performance. To make code “steppable” you must have set debug to be greater than speed, space, and compilation-speed. This can be done with the declaim function.

Developer/Designer Experience

LSP support would be an important part of this. Bull No 1 wrote an article about how to make an LSP server in something other than the “default” Node.JS runtime.

LSP is a way for langage-specific tools to provide language-specific information to any generic text editor without needing to implement language-specific features in each editor. Succinctly, this means the $M × N$ problem of language support turns into $M + N$, just like in a compiler. LSP has become very popular (as of making this TODO list) and is supported by Emacs, Vim, NeoVim, Helix, Zen, VS Code, and most other editors.

If I use a Lisp-based language, the REPL-driven development style already strongly supports this behavior by virtue of working inside a live and dynamic programming environment. However, LSPs also offer refactoring tools, like “rename all instances of this symbol” of “find all referrers to this symbol”. These would be particularly important to have when writing hardware code, since symbols are often defined far away from where they are used (at least in my experience).

This blog post discusses how to make an LSP client (not server!) in “just 200 lines of Clojure” (https://vlaaad.github.io/lsp-client-in-200-lines-of-code).

Optimizing the Common Lisp in the Compiler

https://www.fosskers.ca/en/blog/optimizing-common-lisp

Testing

For any tests that are unrelated, they should be launched in their own process? https://sunshowers.io/posts/nextest-process-per-test/
Property-based testing of pure functions. For example, chil:log2up should be property tested. This way I only have to write the properties required, and not any actual implementation. Possible candidates:
- check-it (Last update 2015-06-05) Does not play well with lisp-unit2 because lisp-unit2:assert-error does not return a value when an error is thrown.
```
;; This should be a successful test.
(lisp-unit2:assert-error 'simple-error (chil/utils:log2up -1))
; No values

;; This should be a failing assertion/test.
(lisp-unit2:assert-error 'simple-error (- 2 1))
1 (1 bit, #x1, #o1, #b1)
        
```
  However, check-it expects the lambda predicate to return true or false depending on the result of the value.
- cl-quickcheck (Last update 2020-05-08) (Seems abandoned.)
- Write my own in the style of guile-quickcheck or Racket’s Quickcheck.
  - Another QuickCheck-like implementation direction is the one used by Rust’s Proptest. Proptest generates and shrinks based on a Strategy object, rather than the types alone. See Proptest’s manual for how it works and its README for a brief comparison between Proptest and Quickcheck-like systems. If I write my own, we should read Adrian Sampson’s “Automated Test-Case Reduction” blog post.
  - https://stevana.github.io/the_sad_state_of_property-based_testing_libraries.html
  - https://www.well-typed.com/blog/2019/01/qsm-in-depth/
Property-based testing of single modules. Module is written like normal. Asserts are present in module. Provide random inputs to module to verify correctness. Follows “stateful property-based testing” from https://stevana.github.io/the_sad_state_of_property-based_testing_libraries.html Inputs should be random both in terms of value (the bit/byte value) AND in their arrival distribution.
Automated generation of test programs for modules requiring simulation (integration testing). Interesting works in software:
- YARPGen
The unit testing framework should allow for a variety of underlying runners. A runner is the thing that runs the test. The default runner for cargo test is shared-process, where every test runs inside the same process. But one alternative is to have a process per test, leveraging process-boundary isolation to prevent shared memory gotchas. https://sunshowers.io/posts/nextest-process-per-test/ Once this far, you could go the next step and have test-per-container for namespace/filesystem isolation. If this problem gets reframed into an actor model, then each test is an actor and there can be a hierarchy, allowing for distributing these tests across machines easily.
Generated output (Verilog, VHDL) should be checked against simulators for linting. For Verilog, use Verilator & Icarus. For VHDL, use GHDL.
There should be an interpreter/simulator for the top-level language that is used (Host language simulation). See the Simulator Section. This solves the problem where only the emitted language can be verified, and not the host language.
Any unit tests for modules (whether in the standard library or written by the designer) must be synthesizable. Down to the low-level language.
Need the ability to collect host-language coverage information out of tests. The more semantic information available should mean tracking coverage and finding cases where there is no test-case coverage should be eaiser. For example, the higher-level language knows what is an FSM, and should be able to test all possible cases for it. The lower-level generated language may not understand that information and just blindly test.
AFTER EVERYTHING ELSE DONE: EDA tooling for Chil. Design Verification workflows & debug should be able to be performed on Chil, rather than its outputs.

Formal Methods

Hardware is extensively validated and verified with formal methods. Chil should support writing a formal specification of the hardware, which means we need a way to express these kinds of concepts. There are several kinds of formal methods that we should investigate and try to support:

Model & Property Checking (Lightweight formal methods) We can take the core of our randomized property testing from guile-quickcheck? The forge language built on top of Racket might also be a good resource to look at.
Formal Specification & Theorem Proving (Heavyweight formal methods) We might be able to piggy-back of ACL2 for this.

I am not sure we want to support this in Chil directly, because this might be more generally useful. It might make more sense for this to be a separate project that Chil then relies on. It remains to be seen which is better, but initial development will start here I think. If it seems better to factor these formal methods tools out to a separate repository, then we will tackle that problem later.

Many of the concepts discussed in this section come from Jakob Kreuze’s blog post about their expeirence with formal methods in courses.

Need the ability to embed arbitrary property assertions, without having to shell out to other languages/tools. For example, temporal assertions (TLA-style) should be native to the language, and not an afterthought requiring inlining another language in the host language.

Higher-level Hardware

Create higher-level versions of chil:module that is less painful to use, but can be converted into low-level Verilog-like format currently being used. Should support an implicit reset & clock, which can be overridden with a (with-reset/clock ...) macro(?).
- Higher-level version should NOT have Verilog-specific information included in its definition. This includes things like timescale. timescale should be handled at the Verilog level, but needs to be passed through as metadata attached to the higher-level module.
This higher-level hardware should support things like mixins. Chisel has the ability to create a new module that extend-s another, so that the new one inherits that hardware. It also has the ability to use composition, so you can say a signal “bundle” must and will contain these other signals, which have certain methods already defined for them.
- See Chapter 2.1 (Hooks) of Common Lisp Condition System for underying idea on how to implement mixins similar to Chisel. Should use catch/signal/error/handler-bind for real thing though. See Chapter 2.2 for that.
- Might want to use restarts instead?
- Reference the Common Lisp Cookbook
- Investigate how Kandria did mixins for their simulator. https://github.com/Shinmera/talks/tree/master
- I THINK mixins would be most useful for RISC-V CSRs. This way you can define the CSR and say it is WARL/WPRI/whatever without having to write the specific Reg & when logic. This would also make it easier to figure out exactly what is going on with a CSR without needing to actually read its implementation.
One-way enum for FSM Specialization of an enum/FSM that only allows you to traverse in one direction. (next oneway-enum signal) moves you to next state when signal goes high. Special-case this because complicated FSMs typically have cycles in their control flow (looping).
Enumerated values must be supported. Chisel gained support for proper enumerated values quite some time ago, but they still lack some quality-of-life tools. For example, take the micro-op enumeration from BOOM:
```
/**
 * Mixin for scalar operation constants
 */
trait ScalarOpConstants
{
  // Micro-op opcodes
  // TODO change micro-op opcodes into using enum
  val UOPC_SZ = 7
  val uopX    = BitPat.dontCare(UOPC_SZ)
  val uopNOP  =  0.U(UOPC_SZ.W)
  val uopLD   =  1.U(UOPC_SZ.W)
  val uopSTA  =  2.U(UOPC_SZ.W)  // store address generation
  val uopSTD  =  3.U(UOPC_SZ.W)  // store data generation
  val uopLUI  =  4.U(UOPC_SZ.W)
  // ...
  // Many more uops
}
    
```
I envision this being something like the following
```
(chil:defenum boom-micro-ops
  UOPC_X ; Don't care. Should be auto-generated?
  UOPC_SZ ; Auto-generated based on final number of elements in enum
  UOP_LD
  UOP_STA
  UOP_STD
  UOP_LUI
  ;; ...
  )
    
```
Ideally, there would be some kind of namespacing here too, so you could not just refer to UOP_STA, but must qualify it somehow.

You should be allowed to pattern match on enumerated values, and exhaustiveness must be required. The pattern matching should be proper pattern matching with no fall-through semantics. Obviously, we will need a catch-all/don’t care representation, since you likely will not want to work with all of the enum cases at the same time. We also need a way to state that a branch of the pattern-match should be unreachable.

When pattern matching, we need to decide if the patterns are ordered or not. Lisp (both cond and trivia:match) and Haskell use first-to-last/top-down matching. Effectively, the patterns are matched in the order they are written in. Is there something interesting we could do by pattern matching against all branches simultaneously?

Union/Bit-mask style enum. The goal with this is two-fold. First, if the user uses a non-enumerated value, an error is raised. Second, the user needs to only define the enumerated values, and you get masking “for free”. This second point means the user does not need to select hard-coded values for the enumerated values any more. This is more subtle, so an example in Chisel from BOOM’s functional unit bit-mask is shown below:

/**
 * Functional unit constants
 */
object FUConstants
{
  // bit mask, since a given execution pipeline may support multiple functional units
  val FUC_SZ = 10
  val FU_X   = BitPat.dontCare(FUC_SZ)
  val FU_ALU =   1.U(FUC_SZ.W)
  val FU_JMP =   2.U(FUC_SZ.W)
  val FU_MEM =   4.U(FUC_SZ.W)
  val FU_MUL =   8.U(FUC_SZ.W)
  val FU_DIV =  16.U(FUC_SZ.W)
  val FU_CSR =  32.U(FUC_SZ.W)
  val FU_FPU =  64.U(FUC_SZ.W)
  val FU_FDV = 128.U(FUC_SZ.W)
  val FU_I2F = 256.U(FUC_SZ.W)
  val FU_F2I = 512.U(FUC_SZ.W)

  // FP stores generate data through FP F2I, and generate address through MemAddrCalc
  val FU_F2IMEM = 516.U(FUC_SZ.W)
}

See how each functional unit needed to have a bit value manually assigned to it and FU_F2IMEM is a bit-mask that needed to be manually computed? Chil would ideally provide some helper syntax for this:

(chil:defenum FUConstants
  ;; :properties could be hidden behind a function too
  :properties '(bitmask)
  FU_X ; Perhaps this should be auto-generated?
  FU_SZ ; Auto-generated based on final width of all elements in bitmask.
  FU_ALU
  FU_JMP
  FU_MEM
  ;; ...
  FU_F2I

  (FU_F2IMEM FU_F2I FU_MEM)
  ;; Or identically
  (FU_F2IMEM FU_F2I FU_MEM))

The :properties field could allow us to define other kinds of enums too, like one-hot enums.

Like Chisel have Valid and Decoupled wrapper modules, but prevent data use/writing without first entering an environment/scope where the valid~/~ready signal is first checked. Something like
```
;; This should work
(with-valid wrapper-bundle (assign local-wire underlying-bits))
;; This should fail, since we are not in an environment/scope where valid has been checked.
(assign local-wire (bits wrapper-bundle))

;; These should probably desugar to a cond/Mux.
(defmacro with-valid (wrapper-bundle @body b)
  (cond (valid wrapper-bundle)
    (t b)
    (else do-nothing)))

(with-ready wrapper-bundle (assign underlying-bits 3) (assert-valid))
    
```
This would prevent use-without-valid and signal asynchrony errors as identified by “Debugging in the Brave New World of Reconfigurable Hardware”.
- We may not want to introduce another layer of indirection like Chisel’s Valid does. Chisel Valid adds a valid bit and moves all the fields of the bundle provided to it under the .bits namespace. This was probably to avoid naming collisions. But there are cases (namely around nesting depth and length of identifiers) where this would become extremely annoying (See BOOM’s ROB). We need to check to ensure the user does not have a .valid signal in their bundle already.
I want a way to mark implementations that are deliberately incomplete. This is like Rust’s ~todo!()~ and ~unimplemented!()~ macros or Scala’s Predef.??? operator.
The equivalent to Chisel’s Flipped constructor could be a macro that just switches all (inputs ...) to (outputs ...). (defmacro ... `(,module (inputs ,(module-outputs)) (outputs ,(module-inputs)) rest is same?)
Need to provide a way to disable any implicit signals installed (clock, reset, etc.). Implicit clocks make it harder to specify clock domains & gating logic when interfacing with non-Chil hardware. (Perhaps this is obviated by the fact that Chil will read Verilog & add it to the final IR?) Implicit resets make it harder to pipeline reset logic & add balanced flop trees.
Need a way to control naming.
You should be able to mark components (modules, bundles/IO) as debug at creation time, to prevent them from being optimized away in normal builds. You need to run a high-speed low-debug build to have these be optimized away. Effectively, I want a way to mark things as Chisel’s dontTouch at their point of instantiation and never have them get optimized away unless I have explicitly asked for it. This could also be extended as a kind of “optional part” where the user can choose to not instantiate the component marked as debug so that irrelevant pieces can be omitted from the source.

A Language to Build Correct-by-Construction Concurrent Software

Dezyne is a language that is designed to produce correct-by-construction concurrent software. It has a variety of tools to specify, validate, verify, simulate, document, and implement concurrent control software.

Dezyne has model-based checking tools and design-by-contract systems. Further, Dezyne has its formal semantics expressed for verification.

Importantly, Dezyne treats the language primitives of the message passing programming model as first-class citizens.

This could be an interesting upper-level language to lower into a Chil language.

Separating FSM Definition from Their Usage

Hardware is controlled (almost) completely by finite state machines. Traditional hardware languages (Verilog/SystemVerilog and VHDL) and even modern HDLs (Chisel, SpinalHDL, etc.) do not let you define a finite state machine and its transitions separately from the FSMs use. In other languages, defining an FSM would be a completely different step compared to using it.

;; Syntax taken from "Sham: A DSL for Fast DSLs".
(define-fsa M init (end)
  [init ([c more])]
  [more ([a more] [d more] [r end])]
  [end ()])

;; (define-fsa name start (final ...)
;;   [state ([input next] ...)] ...)

Language for Writing Encode/Decode Tables

Rocket had to write a decode table class for their instruction decoder. They used Quine-McCluskey minimization, but also support Espresso.

The problem is, the error messages for their decode table implementation do not explain why things are going wrong. It would be nice to have to cover all patterns somehow.

Diplomacy-like System for Delayed Hardware Design

Chisel has a tool called Diplomacy, which is a way to delay hardware generation until parameters are fully known. Some parameters in a hardware design are not known by the programmer at the time they write the HDL. For example, how many address bits do you need in a cross-bar? That depends on the number of devices attached to the cross-bar. What if you want to make the cross-bar implementation a library, to reuse the cross-bar everywhere? How can you get the number of devices without having the whole design?

Diplomacy solves these problems by introducing a new phase before Chisel hardware generation. You (as the designer) mark Chisel modules as “diplomatic” by introducing Diplomacy parameters to the module. Then, when compiling, the Diplomacy framework goes over a design, passing these parameters around to all the diplomatic modules in the design. The parameters are then concretized into the Chisel code before the Chisel compiler is run.

Modules in this setup need to be marked as lazy, so that the Chisel compiler will accept the symbol’s definition as being valid, without having an actual definition yet. (lazy is a lazy evaluation in this case). This lazy marker is required to make sure the compiler does not complain when a module has an implementation that depends on resolved diplomatic parameters.

I wonder what would happen if we flip the script and make everything diplomatic, rather than having to explicitly opt-in. If modules do not need diplomatic parameters, the outer wrapper can be silently unwrapped. With Lisp’s code-staging through symbol recognition (gexps in Guix are just symbols that are a “specially-named quote” in this metaphor), the notion of lazy may not be needed anymore.

Network-on-Chip Extension

This section is taken from Jerry Zhou’s Constellation NoC generator. Can a Diplomacy-like framework in Chil allow for expression of NoCs? Chisel’s Diplomacy cannot do this because Diplomacy can only describe acyclic networks. UC-Berkeley has implemented Constellation’s cyclic descriptions into Diplomacy-generated acyclic ones by providing translators.

Would a general cyclic NoC language be able to express any acyclic interconnect system too? Are there problems there? Can you prove the acyclic interconnect out of a potentially cyclic description and then change tactics (for example, more aggressive optimization)?

Such an expression language must include:

A specification language that includes the topology, routing, protocol, and coherence.
- Logical specification: Flows & endpoints. How many nodes (endpoints) are there? How are they logically connected? What are the logical flows the NoC must handle? What are the conditions for deadlock-free execution (conditions to always make forward progress) in the NoC? As part of the flow specification, we can limit what design points we generate HW for, because not all flows are possible given allthe other constraints in the specification.
- Physical specification: Topology, microarch, and channels. What are the physical properties of this network? How wide is a channel? What is the topology of nodes in the network? What is the specific implementation details of the nodes? How many buffer entries are in the network?
- Routing specification: Routing policy, allocation, and arbitration. How do packets/flits reach one end of the network from the other? What resources are allocated as a packet/flit traverse the network? What is the arbitration scheme to determine what resources get allocated? “Marries logical spec to physical” (Zhou, 2023). The routing table will be generated for each router node:
  1. Compute all possible paths for all possiblef lows.
  2. For each router, compute precisely which flows might arrive.
  3. Construct an abstract truth table for routing.
    1. Input is flow, currently occupied Virtual Channel
    2. Output is a Boolean for each output Virtual Channel
  4. Use logic minimization to generate HW implementation of routing table. Espesso will often be better here because the routing table is likely to be quite large and exact minimization algorithms (Quine-McCluskey) will take inordinate amounts of time.
A specification translator that can generate behavioral and transactional simulators. These will be used to verify correctness of implementations of this specification.
A language for implementing the behaviors of the network itself.
Multi-protocol networks, where multiple protocols either interface through endpoints/adapters, or work on the exact same physical specification.
Multi-network systems:
- Separate performance-critical traffic from control traffic. The performance network can be high-bandwidth, high-power, and low-latency, while control can be lower-bandwidth.

This NoC framework must validate (and preferably prove):

The network is actually routable.
There is no deadlock in the protocol’s specification
There is no deadlock in the protocol’s implementation.

Basic notes about NoCs:

Packets are used
Packets may be bigger than what the network can actually transmit. In this case, packets are further decomposed into flits. There is a header/tailer flit to encode the start/end of a packet stream.
Wormhole routing is a fairly standard way to implement a routing policy. In this case, flits move through the network, one at a time. The header flit starts the process and subsequent flits exactly trail the header as it moves through the network. This makes the sequence of flits look like a worm moving through the network. Such a routing policy means wormhole routing is just a resource-allocation policy.

All of this can be done with normal Lisp code, without needing to drop to Chil, because no hardware has been generated yet. Only once the spec and its implementation have been shown to not cause problems is hardware actually generated.

Precise & Formal ISA Specification

This is a topic that more ISA specifications & CPU vendors are using nowadays. Instead of an imprecise natural-language specification of the instructions in an ISA, use a formalized programming language to precisely describe what each instruction does. Using this precise language, golden emulators and simulators can be generated, along with documentation rendered nicely.

Sail is the language that RISC-V uses to specify their ISA, generate their golden emulators and simulators, and generate test cases for “differential simulation”.

This could be combined with Compiler Instruction Generation/Backend to generate compiler backends that are tightly coupled to the ISA’s specification in the formal language.

Alastair Reid has several good blog posts about why Machine-Readable Specs are important:

Compiler Instruction Generation/Backend

This is an idea taken from/inspired by ”Generate Compilers from Hardware Models”.

The essence is that compiler backends need to know the ISA of the CPU architecture they are targeting. The RTL needs to know the ISA of the CPU architecture that is being implemented. Currently, these two users create different implementations and then need to communicate information between these two. The paper presents an idea where the compiler’s understanding of the ISA’s instructions is generated by the hardware that implements those instructions. This combines with the RTL-level knowledge of how the instructions are implemented to provide the necessary costs to the instruction selection system.

What if this is combined with Precise & Formal ISA Specification?

Building/Elaborating

For any realistic Chil project, a build system will be needed to automate the work of taking a Chil description and lowering it to another format. Look through Build Systems à la Carte for more information about this topic.

Implementing this could be done just by piggy-backing off of Common Lisp’s already-present asdf. Then for larger scale automation, some utilities may be provided.

Veryl is very similar to Verilog, with minor conveniences added to it. Its real draw is that it has a set of integrated tools that help manage your project, with commands similar to Rust’s cargo tool.

There should be a define-able style guide which can be enforced by a linter. An example of a Verilog Style Guide.

Something that SBT does that I think is really nice is that you can add a ~~~ to any sbt command, and it will “watch” the dependencies. This means that if you update a dependency for the command, the command is automatically re-run. For example, after saving edits to a file, the unit tests for that file run again automatically, with the necessary builds done in between.

Montana offered to use a database behind-the-scenes to manage compilation, which allowed tool-writers to hook into the compilation flow itself. This provided features similar to LSPs and high-quality IR semantic analyzers today, before those were widely available for languages like C++.

Scala’s Mill is kind of what I am aiming for.

Compilation of modules should be thread-safe, so two separate functions can be generated and compiled at the same time.

Notes after reading Build Systems à la Carte

We want a suspending scheduler for the build system, where each thread/process building the project can be paused until its inputs are ready. But given Common Lisp’s restart system, a restarting scheduler could be far more feasible. Another problem for suspending scheduler is that Common Lisp does not have good support for continuation-passing style?

Comparison to Chisel

Chisel uses the Scala Build System (SBT) to define and declare projects, and uses Java’s default file hierarchy to find files. But SBT does not work for projects that need to leave the Scala world? Hence, larger projects like Chipyard need a combination of scripts, Makefiles, and Scala-generated Makefiles to make everything happen.

Chisel, Chipyard, Rocket, etc. all moved to using Mill instead of SBT.

Annotations

My thoughts about Annotations and Hardware Construction Languages and how they can be used in Chil:

Annotations should not be an after-thought.
They are a key way to pass circuit metadata down through the compiler’s phases.
Should annotations be allowed in the circuit description itself? Or in another file altogether?
Annotations indirectly refer to parts of the circuit. Just use the name, rather than a pointer or another structure. This naming indirection allows passes to rename components in the actual circuit without needing to do massive cross-cutting modifications.

Non-Compilation Passes

In addition to lowering passes needed to compile a high-level circuit construction to the final circuit, we also need to provide passes that do not alter the circuit. These passes can provide information or feedback about your circuit at points in its life. The Nanopass framework supports this with transforms that take a language in and do not produce an output language.

Some ideas for these passes include:

FIRRTL Pass for Area and Timing
Generating target-device-specific configuration files. For example, an accelerator may need an XML file to describe the hardware that is being added. A pass could take in the IR, figure out what is being asked, and return an XML file describing the written circuit.

Type-Checking

Considerations on Codecrafting has a blog post about how they believe you should make good type errors in a typed language. Designing type inference for high quality type errors.

Documentation

Language documentation should be clear and easy to read. When possible, it should be concise, but should not limit itself when deeper explanation is necessary. The entire public-facing interface for the language should be documented, and hopefully all the internals too.

The list below is taken from the blog post Why is language documentation still so terrible?:

A canonical language documentation written for real human beings
Docs themselves should be versioned, so you do not have to sift through information that doesn’t apply to the version you care about
A reference/appendix section that contains the language specification (syntax, operator precedence, keywords, etc.)
An individual page for each standard library class or built in type
- Class and method descriptions should answer at least the first 2, but preferably all 3 of the following questions:
  1. What does this do (effect)?
  2. How does it do it (internal implementation)?
  3. Why would I want it to (use-case, comparison to similar methods, etc.)?
- Link directly to the source code of the internal implementation.
- That page must be as uncluttered as possible
- That page must contain (not link to) every method, and the descriptions of those methods, that can be called by that class, preferably including all inherited functions.
  - Most methods should have at least 1 example
  - There should be a sidebar or equivalent that contains all the method names in alphabetical order for easy searching and jumping
- Code examples should be at least lightly syntax highlighted
- examples, descriptions, and function signatures should link internally as much as possible
- non-cryptic names, or at least like… tell me what your 8 byte contraction expands to
Preferably on a publicly accessible website, styled in a way that doesn’t make my eyes bleed (dark mode option), and that responds appropriately to at least both full screen (16:9) and half screen (8:9) sizes
A search function that isn’t just lmgtfy?????? Are we for real???

The language documentation the author believes satisfied all of these criteria was Rust’s standard library documentation system. The author further pointed out that even 3rd party crates get a similar documentation website generated for them, just by using the doc-comments in the files, and publicly-exported tools.

Toolchain Driver

If I intend to support multiple input formats and output formats, there will need to be a series of steps to define actions to take to produce an output. This may involve running the Chil compiler, but it might also involve running other tools (like a script to convert a JSON description of memory into a dat format). If I also want to have a “workflow” kind of language so that I can provide a design and the desired end target, then I would need this too. Effectively, this would become the unified way to work with anything in my Chil language.

fud2 - A Compiler driver for orchestrating the Calyx ecosystem. It handles building a design (including lowering from Dahlia, their HLS language) and turning it into SystemVerilog, which is then merged with their SystemVerilog standard library. It can interpret the Calyx using their interpreter, Cider. It can also take the final SystemVerilog and run it through Verilator, Icarus, or even FPGA workflows for synthesis. Currently (2024-08-16), fud2 uses a breadth-first search to find a path in the graph of operations from the input to the requested output. However, they are also investigating other methods, like using E-Graphs (Equivalence Graphs) through egglog, or constraint programming through Datalog.

Common Lisp has an implementation of Datalog as a DSL on GitHub called cl-datalog. Datalog was originally implemented in Clojure, with this Overview of Datalog?

Information Gathered from fud2

fud2 is undergoing a rework that I think is “the right” approach for making a toolchain driver. Instead of representing the state of things you are working on as a graph of states with transforms between the states as a simple directed graph, represent the thing as a hypergraph.

Hypergraphs have some interesting possible representations:

An incidence (adjacency) matrix format
A bipartite graph with edges on one side and vertices on another

There are several things to consider here:

Data Model

How do we represent the hypergraph, its nodes and its hyperedges? The data model specifies the data structures for states (nodes) and ops/transforms (edges). Ultimately, the quality of the Data Model will determine who esay/hard it is to make/use the other components.

Something we must support is a way for a single node to flow into multiple for later. For example, Calyx uses a wrapper for the AXI protocol family called YXI.

     +-----calyx-----+
yxi--+--axi-wrapper--+--combine-yxi-calyx--> next

With Calyx, the YXI must flow through an op to produce two states, one produces Calyx RTL and the other produces an AXI wrapper. In a later step, both of those files (Calyx and AXI wrapper) must be combined by another op to produce another state/node. According to Issue 1958, this produces a hyperedge in the hyperpath.

Scripting Interface

The scripting interface is (effectively) how you (as a user) declare the nodes and edges in the hypergraph. How do you translate a set of input nodes into a set of output nodes?

This is language that you use to manipulate and work with the data model.

Something that we should consider is that states may want to be parametric. For example, Calyx has multiple states to handle Verilog:

verilog
verilog-noverify
synth-verilog
verilog-refmem
verilog-refmem-noverify

Many of these are duplicates about pre-/post-verification and whether or not what is produced should be synthesizable. The fud2 Next Steps Doc presents the idea that a state be parameterizable. For example, there is a single verilog state that accepts what to do as parameters: verilog[verify?, synth?, refmem?].

One interesting thing this setup allows is a transformer/edge to be less strict about the inputs it allows. For example, an op that takes in only synthesizable verilog would specify its Verilog input as verilog[X, true, X], where X means “don’t care”.

Planner

The planner is a tool that semi-automatically explores the hypergraph to determine what must be done and in what order. It will start from the set of starting points to produce a set of final/output points. What the planner finds is called a hyperpath. Finding such a hyperpath is not an easy task, but there are solutions/algorithms out there. There are likely a bunch of algorithms to look at from recommender systems, image retrieval, and competition networks. A potential problem about our case is that we provide multiple non-sibling starting points in the hypergraph which might not even have a connection between them.

One common set of starting points is that you have both an RTL design and a memory file to give to a simulator. But what if the memory file is described as a textual file (or even a program!) such as JSON? You need to first compile that into a raw binary that the simulator understands. fud2 has this exact problem as taken from Issue 1958.

calyx --calyx--> verilog --icarus--> simulator --+
dat --dat2sim--> simdata ------------------------+--simulate--> dat

We must provide a “no-op”/manual planner; a planner that does nothing and falls back to doing what the user says. This amounts to not using a planner at all. In my opinion, the manual planner should accept a hyperpath of steps to take from a file, similar to a task description graph.

One interesting way to construct this planner would be to use Datalog as the contraint handler. It would be interesting to explore and minimize this space by using e-graphs, but I don’t know if e-graphs would even work on hypergraphs.

Execution Engine

fud2 is written in a combination of Rust and Rhai and outputs Ninja files. These Ninja files contain the steps required to do what the planner decided and is what is actually executed. The Rust and Rhai of fud2 are intended just to represent the problem space, search it for a way to complete the action the user requested, and generate the Ninja. Once the Ninja is generated, Calyx “leaves” the world of Rust and Rhai and just invoke Ninja with the generated Ninja files.

Since we are in the dynamic world of Common Lisp, could actually have our system run inside of the “current” Common Lisp image itself. Whether or not that is a good idea, remains to be seen.

Driver Plan Serialization

fud2, the Calyx infrastructure’s toolchain driver, is adding plan serialization and deserialization support. I think this is a good idea, because it could let us bypass having to invoke the hypergraph-traversal in common cases. For example, think of the common case where you are doing quick incremental development (fixing an error, recompiling, loop). Traversing the hypergraph might be a very expensive operation (I don’t know yet, but I imagine it will be). In this quick case, it is unlikely that the chosen plan will change much, if at all. By serializing and caching the plan that the hypergraph traversal found, we could make Chil’s toolchain react much faster.

fud2 currently supports JSON and is planning on adding a custom language of their own, called flang. In my case, I think the plan will just be an s-exp tree. Converting the s-exp tree to JSON is relatively trivial and we do not need to do too much to create a parser for the language (just read the s-exps).

Optimization

Within Chil, I would like to have an optimization framework for the higher-level language. I am not sure how much optimization is possible in the long-run. But for the small actively-working capacity of my mind, the Nanopass Framework makes the most sense to me.

I might have to implement the Nanopass Framework in ANSI Common Lisp…
- If I did that, I might be able to get that upstreamed?

Pass Ideas

Nanopass uses very small passes that do relatively little work. They rewrite, modify, or analyze a very small subset of an AST to do something. One example is to convert instances of let* in Scheme to a ladder of let and lambda.

Some ideas for passes that I could write are:

CheckWidths: FIRRTL has a pass to check if dynamic shifting uses a dynamic shift amount that has a bit-width $> 20$. This is the firrtl.passes.CheckWidths pass, particularly the $DshlTooBig top-level function.

Outputs

Generate other low-level HDLs.

FIRRTL?
CIRCT?
VHDL
SystemVerilog

Simulator

Chil should include a simulator alongside it. Requirements:

Should be multi-threaded, to improve execution speed, if possible.
If a “core” assertion in the simulation testbench fails, then a Lisp core image should be saved (sb-ext:save-lisp-and-die).
This core image should allow for “rewinding” the world to see the sequence of events that caused an assertion violation.
We should support both 2-state and 4-state simulation. This helps reveal initialization errors that propagate through the circuit. As a reminder, 2-state only allows 0 and 1, with nets initialized to 0; 4-state allows 0, 1, X (unknown), Z (competing drivers, floating, high-impedance).

Methods to achieve requirements:

Simulator should use transactional memory?
- SMTX Common Lisp library makes it easy to use transactional memory in CL.
- This may also make multithreading the simulator easier?
- If the simulator’s core image dump (sb-ext:save-lisp-and-die) includes the log of memory transactions internally, rewinding the image is simple, without dependencies.
- Goblins implemented this with transactional heaps. Goblins Distributed Debugger with Time Travel is almost exactly what I would like.
- Could use Lisp Flavoured Erlang too, and have Erlang actors handle that. I don’t know if there is a way for a “core dump” to be made though, as LFE compiles to BEAM bytecode and runs on top of there.
- Transactional Heaps?
  - Safe and Efficient Persistent Heaps
  - NV-Heaps: Making Persistent Objects Fast and Safe
Simulator must record the state changes in the circuit to a DB for rewind? Does the transactional memory allow that too? If the transaction log of memory allows for recording to disk, then replay should be somewhat trivial.
- Jason recommended RRDTool as a time-series database. If a database is needed, that might make more sense.
Propagators?

Kanata Logs

Konata is a tool to interactively view how instructions flow through a pipeline. It also supports Out-of-order execution information.

Konata uses a log format called Kanata. The log file is a text file format whose format is described here.

Verification

Proof-Carrying Code
Compare/contrast with SymbiYosis, Yosys’s front-end to formal HW verification flows

Synthesis

There are three main parts to synthesizing a design from HDL down to actual circuits. There are actually many sub-portions to each of these tasks, but these highlight the major steps when lowering an HDL to circuits.

Logical Synthesis (Synthesis in Vivado’s terms) Turns your HDL into a technology-independent netlist. Many optimizations are done at this level, because the most information is available now. This can be used to do very rough timing analysis, analyze potential critical paths, and most importantly, see what your HDL actually synthesizes into.
Technology Mapping/Library Binding This is like instruction selection in compilers. You must figure out and optimize the set of gates that the manufacturer has implemented for that technology for what you synthesized into. For example, an AOI3 can have a special circuit mapping.
Physical Synthesis (Implementation in Vivado’s terms) This takes the logical description of physical components and maps them onto the actual hardware. This involves layout compaction, partitioning, floorplanning, placement, and routing.

Vivado Synthesis Steps

The information for this section is taken from: AMD’s Vivado Synthesis User Guide (UG901), AMD’s Vivado Implementation User Guide (UG904), and this Vivado Synthesis question & response. You can look at AMD’s Vivado Design Suite User & Reference Guides (UG949) to get a top-level view of all user-guides.

Synthesis (Logical Synthesis)
1. Elaborates the design, resolving parameters, generate blocks, and other high-level RTL details. At the end of this, there is an instantiated module and connection for everything. Vivado’s output from this are “Generic Technology Cells”. GTCs are abstract items, like addres, comparators, registers, arbitrarily wide gates, infinite fan-out, etc. This is an abstract netlist.
2. Apply constraints. These constraints are specified in the XDC format, Xilinx’s extension to the standard SDC format. XDC = Xilinx Design Constraints, SDC = Synopsys Design Constraints.
3. Perform high-level optimizations. These optimizations take advantage of the constraints that we placed on the netlist. They can condense multi-level combinational logic, add abstract buffers for timing, and anything else that does not rely on implementation specific information. In particular, the following optimizations cannot happen yet:
  - Implementation device selection (mapping an abstract adder to a DSP slice for instance.)
  - Implementation timing latencies (BRAM vs. LUT for large logic storage)
  - Implementation power profiles (BRAM vs. LUT for large logic storage)
4. Perform technology mapping. Vivado needs to know what you are targeting, and attempts to map multiple levels of logic to components on the physical device. At this point, the device’s features are the limiting factor; routing, power consumption, and latency/timing do not play a major factor here.
5. Perform lower-level optimizations to logic design. Optimizations at this point can take advantage of the fact that particular portions of the circuit have been mapped to specific pieces of the device.
Implementation (Physical Synthesis)
1. Opt Design: Optimizes the logical design to make it easier to fit onto the target AMD device.
2. Power Opt Design (Optional): Optimize physical design to reduce power demands
3. Place Design: Place the abstract physical design onto the target device. Fan-out replication is performed here.
4. Post-place Power Optimization (Optional): Use placement knowledge to reduce power.
5. Post-place Physical Optimization Design (Optional): Use placement knowledge to improve timing.
6. Route Design: Route the design on the target device.
7. Post-Route Physical Optimization (Optional): Optimize the design using the placement and routing knowledge. This optimization step can take advantage of the highly-accurate and device-specific timing information present on the final device.
8. Write Bitstream: Generate the design bitstream for flashing.

Examples

Simple counter
ALU
Single-Error Correct, Double-Error Detect ECC Unit
N-point FFT
Branchless UTF-8 Encoder https://cceckman.com/writing/branchless-utf8-encoding/
Cryptographic cores/accelerators
1. AES-256
2. SHA-256
IEEE 754 compliant Floaing-point unit (Similar to Berkeley’s hardfloat)
1. Addition
2. Subtraction
3. Multiplication
4. Division
5. Pipelined
Communications protocol (AXII, AHB-to-APB bridge)
1. AXI4 Implementation for AXI4, AXI4 Lite, and AXI4 Stream.
RISC-V core (Should support RISC-V GC, to boot Linux) Getting many of these built will make my stuff equivalent to Berkeley’s RISC-V SODOR.
1. Hardware support for single-, double-, and quad-precision floating point. See Berkeley’s HardFloat.
2. Single-cycle
3. Multi-cycle
4. Pipelined (single issue)
  1. mrisc32
5. Multi-issue in-order pipelined
6. Single-issue out-of-order
  1. Design feature? Loads/stores are 2 uops, one to compute the absolute address with result going into physical register, the second uop uses that absolute address and submits the memory request.
7. Multi-issue out-of-order
  1. SoomRV is an example of this.
tiny-gpu: A minimal GPU that executes a single kernel at a time with many threads per core. This architecture also includes a small amount of possible configuration too.
raster-i: A minimal rasterizing GPU for tile-based rendering. NOTE: This repo is implemented in C++ and Chisel 3.
turbo9: Pipelined Motorola 6809 design
Caster: Electrophoretics Display (eInk) Controller. Used by Glider.
CHERI in Hardware This has already been done with ARM, MIPS, and recently RISC-V. But I want to implement on this.
Custom architecture Something to think about

https://docs.kernel.org/next/RCU/whatisRCU.html#whatisrcu

… [T]he typical RCU update sequence goes something like the following:

a. Remove pointers to a data structure, so that subsequent readers cannot gain a reference to it. b. Wait for all previous readers to complete their RCU read-side critical sections. c. At this point, there cannot be any readers who hold references to the data structure, so it now may safely be reclaimed (e.g., kfree()d).

Step (b) above is the key idea underlying RCU’s deferred destruction. The ability to wait until all readers are done allows RCU readers to use much lighter-weight synchronization, in some cases, absolutely no synchronization at all. In contrast, in more conventional lock-based schemes, readers must use heavy-weight synchronization in order to prevent an updater from deleting the data structure out from under them. This is because lock-based updaters typically update data items in place, and must therefore exclude readers. In contrast, RCU-based updaters typically take advantage of the fact that writes to single aligned pointers are atomic on modern CPUs, allowing atomic insertion, removal, and replacement of data items in a linked structure without disrupting readers. Concurrent RCU readers can then continue accessing the old versions, and can dispense with the atomic operations, memory barriers, and communications cache misses that are so expensive on present-day SMP computer systems, even in absence of lock contention.

FilesExpand file tree

TODO.org

Latest commit

History