FSM Style
This is a work in progress.
TODO: make a diagram of combinatorial logic -> regular state machine -> both the auto-pipelined and clock-marked FSM styles (root is comb logic).
This page describes the 'finite state machine style' of writing PipelineC code. Inspired by Silice's clock step operator, PipelineC introduced a __clk() function that allows the generation of complex derived finite state machines.
Consider these three examples:
// No clock cycles, combinatorial logic
uint32_t test0(uint32_t x)
{
uint32_t rv = x + 1;
return rv;
}
// One clock cycle before add, a FSM
uint32_t test1(uint32_t x)
{
__clk();
uint32_t rv = x + 1;
return rv;
}
// One clock cycle after add, a FSM
uint32_t test2(uint32_t x)
{
uint32_t rv = x + 1;
__clk();
return rv;
}
and how they behave over time:

Each simulation drives input x with 2, thus expecting 3 as the returned output.
test0 is combinatorial logic. The result is available at the output immediately, in the same cycle.
test1 introduces a clock cycle of delay with __clk(). Notice how the rv state register registers_r.state_regs.rv compares between test1 and test2: in test1 there is a one clock cycle delay before the rv state register takes its value.
However, note the return value output of the function return_output.return_output and its associated handshake valid signal return_output.output_valid: in both test1 and test2 the returned value is 3 during the second cycle. This is because both functions describe two clock cycles of computation, just with slightly different combinatorial logic data paths.
In test1 the x input value is registered and saved for use after the __clk(). This means the data path for test1 runs from an x register through an adder to the output return value port. (The rv register is actually unused and optimizes away in synthesis.)
In test2 the x input value is used before the __clk(). This means the data path for test2 runs from the input port x (no register) through an adder as combinatorial logic into an rv register. In the second cycle the rv register drives the output return value port.
__clk() essentially serves as a dividing marker for where combinatorial logic is split up, i.e. where registers will be inserted.
It is possible to write a state machine in 'regular' PipelineC without __clk() operators: each FSM state and its transition logic is explicitly written as PipelineC code (similar to regular HDL). This is verbose but works. The FSM style discussed here is essentially a layer of code generation that creates that verbose code for the user, with states derived from __clk() use.
The full syntax for declaring an instance of the test1 FSM looks like so:
uint32_t test1(uint32_t x)
{
__clk();
uint32_t rv = x + 1;
return rv;
}
#include "test1_FSM.h"
#pragma MAIN test1_wrapper
uint32_t test1_wrapper()
{
test1_INPUT_t i;
i.x = 2;
i.input_valid = 1;
i.output_ready = 1;
test1_OUTPUT_t o = test1_FSM(i);
return o.return_output;
}
The #include "test1_FSM.h" signals that the test1 function is a finite state machine, and is also the name of the generated file where PipelineC code describing the derived FSM will go.
These FSM style functions include input and output handshaking signals (_valid/_ready) for function entry and return. As such, the generated _FSM.h contains structs that define input and output types like test1_INPUT_t and test1_OUTPUT_t. These types are used with the generated test1_FSM function, which is used when instantiating the module in the top level MAIN of the design.
Using the generated _FSM() function and handshake signal structs is only needed when the function is at the top level of the design or connecting with other non-FSM style PipelineC. Calling other FSM functions from within an FSM looks as expected: regular C function call syntax.
Consider the below function that first waits for 5 clocks, then 4, and so on down to 1. It does this via a wait() function that presumably calls __clk() a number of times specified by its input argument. This code compiles fine.
uint32_t wait_test()
{
uint32_t clocks_to_wait = 5;
while(clocks_to_wait > 0)
{
wait(clocks_to_wait);
clocks_to_wait -= 1;
}
return clocks_to_wait;
}
This implementation of the wait() function will not compile:
uint32_t wait(uint32_t clocks)
{
while(clocks > 0)
{
clocks -= 1;
}
return clocks;
}
By default all loops are unrolled, i.e. the tool could decrement the clocks value any number of times in one clock cycle by concatenating subtractors in combinatorial logic. The tool does not know how to spread the loop across multiple clock cycles unless instructed how to do so. A __clk() can be added to the loop body and the code then compiles fine.
uint32_t wait(uint32_t clocks)
{
while(clocks > 0)
{
__clk();
clocks -= 1;
}
return clocks;
}
The wait_test() function does not need a __clk() in its loop body since use of the FSM style wait() function now includes the clock delay markers.
By default in PipelineC, multiple calls of a function at multiple times/places in code do not refer to the same hardware. Instead, a new duplicate instance of the subroutine FSM logic is typically created for each ~'call location'.
This can be overridden by signaling that a certain function should not be duplicated; then only a single instance is shared among the many calling locations in code.
uint8_t atomic_add(uint8_t val, uint8_t tid)
{
static uint8_t the_reg;
uint8_t new_reg_val = the_reg + val;
printf("Thread %d incremented value from %d -> %d.\n",
tid, the_reg, new_reg_val);
the_reg = new_reg_val;
return the_reg;
}
// Signal to tool to derive a single instance
// FSM region, ex. only one 8b the_reg in the design
// shared among N calling instances
#include "atomic_add_SINGLE_INST.h"
The above function FSM only exists once in the design; arbitration logic (muxing, etc.) is added to share the valid/ready FSM entry/return handshaking signals.
A setup of N 'thread' instances of an incrementer_thread FSM (a function that repeatedly calls atomic_add) is seen below:
// Top level N parallel instances of incrementer_thread
#define N_THREADS 10
// Connect FSM outputs to top level output
// So it doesn't synthesize away
typedef struct n_thread_outputs_t
{
incrementer_thread_OUTPUT_t data[N_THREADS];
}n_thread_outputs_t;
#pragma MAIN_MHZ main 100.0
n_thread_outputs_t main()
{
// Registers keeping time count and knowing when done
static uint32_t clk_cycle_count;
// N parallel instances of incrementer_thread
uint8_t tid;
incrementer_thread_INPUT_t inputs[N_THREADS];
n_thread_outputs_t outputs;
for(tid=0; tid<N_THREADS; tid+=1)
{
inputs[tid].tid = tid;
inputs[tid].input_valid = 1;
inputs[tid].output_ready = 1;
outputs.data[tid] = incrementer_thread_FSM(inputs[tid]);
if(outputs.data[tid].input_ready)
{
printf("Clock# %d: Thread %d starting.\n",
clk_cycle_count, tid);
}
if(outputs.data[tid].output_valid)
{
printf("Clock# %d: Thread %d output value = %d.\n",
clk_cycle_count, tid, outputs.data[tid].return_output);
}
}
// Func occurring each clk cycle
clk_cycle_count += 1;
// Wire up outputs
return outputs;
}
The above code has the following behavior in simulation:

As expected, only a single calling FSM 'thread' at a time is able to gain access and ~'atomically increment' the register value inside the atomic_add function. The rest of the threads block, waiting for their turn to access the FSM in a later cycle.
Sharing a single FSM among N calling threads can at worst be 1/N times slower than having a not-shared duplicate copy of the FSM nearby to invoke whenever a calling thread needs it. Additionally, there is a cost to arbitrating between the N calling threads: 1-to-N and N-to-1 muxes for both the input arguments and output return value of the function (ex. if the function takes a lot of inputs or outputs then sharing becomes costly quickly). Finally, if trying to round-robin/pipeline/time-share single instance FSMs among N calling threads, it is best to have N or more other single instance resources/'steps of the computation' in sequence before returning to the same shared resource again. Ex. having a thread that just repeatedly invokes a shared resource in a tight loop is not efficient. Instead, invoke single instance functions and move on to do some work elsewhere, giving other threads the chance to access the resource with less competition.
This only works/makes sense for FSMs wired to non-FSM code, i.e. when using the _FSM() functions from generated headers.

