
Shared Resource Bus


This page describes 'shared resource buses', which are very similar to system buses like AXI: managers make requests to and receive responses from subordinates using five separate valid+ready handshake channels, allowing simultaneous reads and writes.

These buses are used in a graphics demo to have multiple 'host threads' share frame buffers to play Game of Life at 480p.

Requests and Responses

PipelineC's shared_resource_bus.h generic shared bus header is used to create hosts that make read+write requests to and receive read+write responses from devices.

diagram

For example, if the device is a simple byte-addressed memory mapped RAM, requests and responses can be configured like so:

  • write_req_data_t: Write request data
    • ex. RAM needs address to request write
      typedef struct write_req_data_t
      {
        uint32_t addr;
        /*// AXI:
        // Address
        //  "Write this stream to a location in memory"
        id_t awid;
        addr_t awaddr; 
        uint8_t awlen; // Number of transfer cycles minus one
        uint3_t awsize; // 2^size = Transfer width in bytes
        uint2_t awburst;*/
      }write_req_data_t;
  • write_data_word_t: Write data word
    • ex. RAM writes some data element, some number of bytes, to addressed location
      typedef struct write_data_word_t
      {
        uint8_t data;
        /*// AXI:
        // Data stream to be written to memory
        uint8_t wdata[4]; // 4 bytes, 32b
        uint1_t wstrb[4];*/
      }write_data_word_t;
  • write_resp_data_t: Write response data
    • ex. RAM write returns dummy single valid/done/complete bit
      typedef struct write_resp_data_t
      {
        uint1_t dummy;
        /*// AXI:
        // Write response
        id_t bid;
        uint2_t bresp; // Error code*/
      } write_resp_data_t;
  • read_req_data_t: Read request data
    • ex. RAM needs address to request read
      typedef struct read_req_data_t
      {
        uint32_t addr;
        /*// AXI:
        // Address
        //   "Give me a stream from a place in memory"
        id_t arid;
        addr_t araddr;
        uint8_t arlen; // Number of transfer cycles minus one
        uint3_t arsize; // 2^size = Transfer width in bytes
        uint2_t arburst;*/
      } read_req_data_t;
  • read_data_resp_word_t: Read data and response word
    • ex. RAM read returns some data element
      typedef struct read_data_resp_word_t
      {
        uint8_t data;
        /*// AXI:
        // Read response
        id_t rid;
        uint2_t rresp;
        // Data stream from memory
        uint8_t rdata[4]; // 4 bytes, 32b;*/
      } read_data_resp_word_t;

Valid + Ready Handshakes

Shared resource buses use valid+ready handshaking just like AXI. Each of the five channels (write request, write data, write response, read request, and read data) has its own handshaking signals.
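
For example (a sketch only, with illustrative names rather than the exact generated signals), a channel pairs its payload with a valid flag from the sender, while the receiver drives a ready flag back; a word is transferred on any cycle where both valid and ready are 1.

// Sketch only: one bus channel, illustrative names (not the generated ones)
typedef struct example_write_req_channel_t
{
  write_req_data_t payload; // ex. the write request struct from above
  uint1_t valid;            // Driven by the sender: payload is meaningful
}example_write_req_channel_t;
// The receiver drives a single 'ready' bit back toward the sender.
// A transfer completes only on cycles where valid and ready are both 1.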

Bursts and Pipelining

Again, like AXI, these buses have burst (packet last boundary) and pipelining (multiple IDs for transactions in flight) signals.
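
For instance, a read data word might carry fields like the following (a sketch mirroring the AXI-style fields commented in the structs above; names here are illustrative):

// Sketch only: burst + pipelining fields on a read data word
typedef struct example_read_data_word_t
{
  uint8_t data;  // Payload word
  uint1_t last;  // 1 on the final word of a burst
  uint8_t id;    // Transaction ID matching the read request that caused it
}example_read_data_word_t;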

The Shared Bus Declaration

A kind/type of 'shared bus' is declared using the SHARED_BUS_TYPE_DEF macro. Instances of the shared bus are declared using the SHARED_BUS_DECL macro. The macros declare types, helper functions, and a pair of global wires. One of the global wires carries device-to-host data, while the other carries data in the opposite direction, from host to device. PipelineC's #pragma INST_ARRAY shared global variables are used to resolve the multiple simultaneous drivers of these wire pairs into shared resource bus arbitration.

SHARED_BUS_TYPE_DEF(
  ram_bus, // Bus 'type' name
  uint32_t, // Write request type (ex. RAM address)
  uint8_t, // Write data type (ex. RAM data)
  uint1_t, // Write response type (ex. dummy value for RAM)
  uint32_t, // Read request type (ex. RAM address)
  uint8_t // Read data type (ex. RAM data)
)

SHARED_BUS_DECL(
  ram_bus, // Bus 'type' name
  uint32_t, // Write request type (ex. RAM address)
  uint8_t, // Write data type (ex. RAM data)
  uint1_t, // Write response type (ex. dummy value for RAM)
  uint32_t, // Read request type (ex. RAM address)
  uint8_t, // Read data type (ex. RAM data)
  the_bus_name, // Instance name
  NUM_HOST_PORTS,
  NUM_DEV_PORTS
)

Connecting the Device to the Shared Bus

The SHARED_BUS_TYPE_DEF declares types like <bus_type>_dev_to_host_t and <bus_type>_host_to_dev_t.
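
Conceptually, the host-to-device direction bundles the three host-driven channels (write request, write data, read request) along with ready bits for the two device-driven channels, and device-to-host bundles the reverse. A rough sketch of that grouping, using hypothetical field names rather than the actual generated ones:

// Sketch only, hypothetical names: how the five channels split between
// the two generated bundle directions for the example RAM bus
typedef struct sketch_host_to_dev_t
{
  // Payload + valid for the host-driven channels
  uint32_t write_req_addr;  uint1_t write_req_valid;
  uint8_t  write_data;      uint1_t write_data_valid;
  uint32_t read_req_addr;   uint1_t read_req_valid;
  // Ready bits flowing toward the device-driven channels
  uint1_t write_resp_ready;
  uint1_t read_data_ready;
}sketch_host_to_dev_t;
typedef struct sketch_dev_to_host_t
{
  // Payload + valid for the device-driven channels
  uint1_t write_resp;       uint1_t write_resp_valid;
  uint8_t read_data;        uint1_t read_data_valid;
  // Ready bits flowing toward the host-driven channels
  uint1_t write_req_ready;
  uint1_t write_data_ready;
  uint1_t read_req_ready;
}sketch_dev_to_host_t;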

Arbitrary devices are connected to the bus via controller modules that roughly convert the host-to/from-device bus signals into device-specific signals.

Again for example, a simple RAM device might have a controller module like:

// Controller Outputs:
typedef struct ram_ctrl_t{
  // ex. RAM inputs
  uint32_t addr;
  uint32_t wr_data;
  uint32_t wr_enable;
  // Bus signals driven to host
  ram_bus_dev_to_host_t to_host;
}ram_ctrl_t;
ram_ctrl_t ram_ctrl(
  // Controller Inputs:
  // Ex. RAM outputs
  uint32_t rd_data,
  // Bus signals from the host
  ram_bus_host_to_dev_t from_host
);

Inside that ram_ctrl module, RAM-specific signals are connected to the five valid+ready handshakes going to_host (ex. out from the RAM) and from_host (ex. into the RAM).

A full example of a controller can be found in the shared frame buffer example code discussed in later sections.

Multiple Hosts and Instances of Devices

In the above sections a controller function, ex. ram_ctrl, describes how a device connects to a shared bus. Using the SHARED_BUS_ARB macro, the code below instantiates the arbitration that connects the multiple hosts and devices together through <instance_name>_from_host and <instance_name>_to_host wires for each device.

MAIN_MHZ(ram_arb_connect, DEV_CLK_MHZ)
void ram_arb_connect()
{
  // Arbitrate M hosts to N devs
  // Macro declares the_bus_name_from_host and the_bus_name_to_host
  SHARED_BUS_ARB(ram_bus, the_bus_name, NUM_DEV_PORTS)

  // Connect devs to arbiter ports
  uint32_t i;
  for (i = 0; i < NUM_DEV_PORTS; i+=1)
  {
    ram_ctrl_t port_ctrl
      = ram_ctrl(<RAM outputs>, the_bus_name_from_host[i]);
    <RAM inputs> = port_ctrl....;
    the_bus_name_to_host[i] = port_ctrl.to_host;
  }
}

Temporary Extra Syntax Required

Not everything can be squeezed into macros. Specifically, the five clock-crossing asynchronous FIFOs (one for each request/response handshake) needed for each host<->device connection being arbitrated must each be #included separately, like so:

// First host port FIFO declaration
SHARED_BUS_ASYNC_FIFO_DECL(ram_bus, the_bus_name, 0)
// Temporary extra syntax specifically for 5 handshaking FIFOs
#include "clock_crossing/the_bus_name_fifo0_write_req.h"
#include "clock_crossing/the_bus_name_fifo0_write_data.h"
#include "clock_crossing/the_bus_name_fifo0_write_resp.h"
#include "clock_crossing/the_bus_name_fifo0_read_req.h"
#include "clock_crossing/the_bus_name_fifo0_read_data.h"
// A second host port FIFO declaration
SHARED_BUS_ASYNC_FIFO_DECL(ram_bus, the_bus_name, 1)
// Temporary extra syntax specifically for 5 handshaking FIFOs
#include "clock_crossing/the_bus_name_fifo1_write_req.h"
#include "clock_crossing/the_bus_name_fifo1_write_data.h"
#include "clock_crossing/the_bus_name_fifo1_write_resp.h"
#include "clock_crossing/the_bus_name_fifo1_read_req.h"
#include "clock_crossing/the_bus_name_fifo1_read_data.h"
// As many host ports, fifo2, etc

Each of those FIFOs must be connected on the host-side and device-side of the clock crossing like so:

// Wire ASYNC FIFOs to dev-host wires
MAIN_MHZ(host_side_fifo_wiring, HOST_CLK_MHZ)
void host_side_fifo_wiring()
{
  SHARED_BUS_ASYNC_FIFO_HOST_WIRING(ram_bus, the_bus_name, 0)
  SHARED_BUS_ASYNC_FIFO_HOST_WIRING(ram_bus, the_bus_name, 1)
  // As many host ports, 2, etc
}
MAIN_MHZ(dev_side_fifo_wiring, DEV_CLK_MHZ)
void dev_side_fifo_wiring()
{
  SHARED_BUS_ASYNC_FIFO_DEV_WIRING(ram_bus, the_bus_name, 0)
  SHARED_BUS_ASYNC_FIFO_DEV_WIRING(ram_bus, the_bus_name, 1)
  // As many host ports, 2, etc
}

Using the Device from Host Threads

The SHARED_BUS_DECL macro declares derived finite state machine helper functions for reading and writing the shared resource bus. These functions are to be used from NUM_HOST_PORTS simultaneous host FSM 'threads'.

For example, below are the generated signatures for reading and writing the example shared bus RAM:

uint8_t the_bus_name_read(uint32_t addr);
uint1_t the_bus_name_write(uint32_t addr, uint8_t data); // Dummy return value
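
For example, a host 'thread' FSM might use these helpers like so (a hypothetical copy loop for illustration, not part of the demo code):

// Hypothetical host FSM 'thread' using the generated helpers:
// copy a small block of bytes from one address range to another
void host_copy_thread()
{
  uint32_t i;
  while(1)
  {
    for(i = 0; i < 16; i += 1)
    {
      // Each call suspends this thread until the shared RAM responds
      uint8_t data = the_bus_name_read(i);
      uint1_t dummy = the_bus_name_write(i + 16, data);
    }
  }
}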

Graphics Demo

graphics demo diagram

Dual Frame Buffer

The graphics_demo.c file is an example exercising two frame buffer devices as shared bus resources from shared_dual_frame_buffer.c.

For example, the state machine below reads from shared resource frame_buf0_shared_bus or frame_buf1_shared_bus based on a select variable:

uint1_t frame_buffer_read_port_sel;
n_pixels_t dual_frame_buf_read(uint16_t x_buffer_index, uint16_t y)
{
  uint32_t addr = pos_to_addr(x_buffer_index, y);
  n_pixels_t resp;
  if(frame_buffer_read_port_sel){
    resp = frame_buf1_shared_bus_read(addr);
  }else{
    resp = frame_buf0_shared_bus_read(addr);
  }
  return resp;
}
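
The corresponding write helper, dual_frame_buf_write, used by the rendering threads later on, is not shown on this page. A minimal sketch, assuming writes always target the frame buffer not currently selected for reading, and that the frame_bufN_shared_bus_write helpers follow the generated <instance>_write naming pattern shown earlier:

// Sketch only (assumed to mirror the read helper above):
// writes go to the frame buffer NOT currently selected for display reads
void dual_frame_buf_write(uint16_t x_buffer_index, uint16_t y, n_pixels_t pixels)
{
  uint32_t addr = pos_to_addr(x_buffer_index, y);
  uint1_t dummy; // Write response is just a dummy valid/done bit
  if(frame_buffer_read_port_sel){
    dummy = frame_buf0_shared_bus_write(addr, pixels); // Reading 1, write 0
  }else{
    dummy = frame_buf1_shared_bus_write(addr, pixels); // Reading 0, write 1
  }
}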

One of the host threads using the frame buffers is always-reading logic that pushes pixels out of the VGA port for display.

void host_vga_reader()
{
  vga_pos_t vga_pos;
  while(1)
  {
    // Read the pixels at x,y pos
    uint16_t x_buffer_index = vga_pos.x >> RAM_PIXEL_BUFFER_SIZE_LOG2;
    n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, vga_pos.y);
    
    // Write it into async fifo feeding vga pmod for display
    pmod_async_fifo_write(pixels);

    // Execute a cycle of vga timing to get x,y and increment for next time
    vga_pos = vga_frame_pos_increment(vga_pos, RAM_PIXEL_BUFFER_SIZE);
  }
}

Threads + Kernel

RAM Access Width

The frame buffer from frame_buffer.c stores RAM_PIXEL_BUFFER_SIZE pixels at each RAM address. This is done by defining a wrapper 'chunk of n pixels' struct.

// Must be divisor of FRAME_WIDTH across x direction
typedef struct n_pixels_t{
  uint1_t data[RAM_PIXEL_BUFFER_SIZE];
}n_pixels_t;
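
The pos_to_addr helper used by dual_frame_buf_read above maps an (x chunk index, y) position to a RAM address. One plausible sketch, assuming chunks are stored in row-major order:

// Sketch only: one plausible pos_to_addr, assuming row-major storage of
// RAM_PIXEL_BUFFER_SIZE-pixel chunks
uint32_t pos_to_addr(uint16_t x_buffer_index, uint16_t y)
{
  // Number of pixel chunks per line of the frame
  uint32_t chunks_per_line = FRAME_WIDTH >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  return (y * chunks_per_line) + x_buffer_index;
}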

Computation Kernel

The pixels_buffer_kernel function reads an n_pixels_t worth of pixels, runs some kernel function on each pixel sequentially, and then writes the resulting group of pixel values back.

void pixels_buffer_kernel(uint16_t x_buffer_index, uint16_t y)
{
  // Read the pixels from the 'read' frame buffer
  n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, y);

  // Run kernel for each pixel
  uint32_t i;
  uint16_t x = x_buffer_index << RAM_PIXEL_BUFFER_SIZE_LOG2;
  for (i = 0; i < RAM_PIXEL_BUFFER_SIZE; i+=1)
  { 
    pixels.data[i] = some_kernel_func(pixels.data[i], x+i, y);
  }  
  
  // Write pixels back to the 'write' frame buffer 
  dual_frame_buf_write(x_buffer_index, y, pixels);
}

The pixels_kernel_seq_range function iterates over a range of the frame area, executing pixels_buffer_kernel for each set of pixels. The frame area is defined by start and end x and y positions.

void pixels_kernel_seq_range(
  uint16_t x_start, uint16_t x_end, 
  uint16_t y_start, uint16_t y_end)
{
  uint16_t x_buffer_index_start = x_start >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  uint16_t x_buffer_index_end = x_end >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  uint16_t x_buffer_index;
  uint16_t y;
  for(y=y_start; y<=y_end; y+=1)
  {
    for(x_buffer_index=x_buffer_index_start; x_buffer_index<=x_buffer_index_end; x_buffer_index+=1)
    {
      pixels_buffer_kernel(x_buffer_index, y);
    }
  }
}

Multiple Threads

Multiple host threads can read and write the frame buffers, each executing its own copy of pixels_kernel_seq_range. This is accomplished by manually instantiating multiple derived FSM thread pixels_kernel_seq_range_FSM modules inside of a function called render_frame. The NUM_TOTAL_THREADS = (NUM_X_THREADS*NUM_Y_THREADS) copies of pixels_kernel_seq_range all run in parallel, splitting FRAME_WIDTH across NUM_X_THREADS threads and FRAME_HEIGHT across NUM_Y_THREADS threads.

// Module that runs pixel_kernel for every pixel
// by instantiating multiple simultaneous 'threads' of pixel_kernel_seq_range
void render_frame()
{
  // Wire up N parallel pixel_kernel_seq_range_FSM instances
  uint1_t thread_done[NUM_X_THREADS][NUM_Y_THREADS];
  uint32_t i,j;
  uint1_t all_threads_done;
  while(!all_threads_done)
  {
    pixels_kernel_seq_range_INPUT_t fsm_in[NUM_X_THREADS][NUM_Y_THREADS];
    pixels_kernel_seq_range_OUTPUT_t fsm_out[NUM_X_THREADS][NUM_Y_THREADS];
    all_threads_done = 1;
    
    uint16_t THREAD_X_SIZE = FRAME_WIDTH / NUM_X_THREADS;
    uint16_t THREAD_Y_SIZE = FRAME_HEIGHT / NUM_Y_THREADS;
    for (i = 0; i < NUM_X_THREADS; i+=1)
    {
      for (j = 0; j < NUM_Y_THREADS; j+=1)
      {
        if(!thread_done[i][j])
        {
          fsm_in[i][j].input_valid = 1;
          fsm_in[i][j].output_ready = 1;
          fsm_in[i][j].x_start = THREAD_X_SIZE*i;
          fsm_in[i][j].x_end = (THREAD_X_SIZE*(i+1))-1;
          fsm_in[i][j].y_start = THREAD_Y_SIZE*j;
          fsm_in[i][j].y_end = (THREAD_Y_SIZE*(j+1))-1;
          fsm_out[i][j] = pixels_kernel_seq_range_FSM(fsm_in[i][j]);
          thread_done[i][j] = fsm_out[i][j].output_valid;
        }
        all_threads_done &= thread_done[i][j];
      }
    }
    __clk();
  }
  // Final step in rendering frame is switching to read from newly rendered frame buffer
  frame_buffer_read_port_sel = !frame_buffer_read_port_sel;
}

render_frame is then simply run in a loop, trying for the fastest frames per second possible.

void main()
{
  while(1)
  {
    render_frame();
  }
}

Game of Life Demo

Using the multi-threaded dual frame buffer graphics demo setup discussed above, the final specifics for a Game of Life demo are ready to assemble:

The per-pixel kernel function implementing Game of Life runs the familiar alive neighbor cell counting algorithm to compute the cell's next alive/dead state:

// Func run for every n_pixels_t chunk
void pixels_buffer_kernel(uint16_t x_buffer_index, uint16_t y)
{
  // Read the pixels from the 'read' frame buffer
  n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, y);

  // Run Game of Life kernel for each pixel
  uint32_t i;
  uint16_t x = x_buffer_index << RAM_PIXEL_BUFFER_SIZE_LOG2;
  for (i = 0; i < RAM_PIXEL_BUFFER_SIZE; i+=1)
  { 
    pixels.data[i] = cell_next_state(pixels.data[i], x+i, y);
  }  
  
  // Write pixels back to the 'write' frame buffer 
  dual_frame_buf_write(x_buffer_index, y, pixels);
}
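
The cell_next_state function itself is not shown on this page. A minimal sketch following the standard Game of Life rules, built on the single-cell pixel_buf_read helper described in the next section (neighbor coordinates wrap at the frame edges here; the real demo may handle edges differently):

// Sketch only: classic Game of Life rule, 1 = alive, 0 = dead
uint1_t cell_next_state(uint1_t alive, uint32_t x, uint32_t y)
{
  uint32_t live_neighbors = 0;
  int32_t dx, dy;
  for(dy = -1; dy <= 1; dy += 1)
  {
    for(dx = -1; dx <= 1; dx += 1)
    {
      if(!((dx==0) & (dy==0)))
      {
        // Wrap neighbor coordinates at the frame edges
        uint32_t nx = (x + dx + FRAME_WIDTH) % FRAME_WIDTH;
        uint32_t ny = (y + dy + FRAME_HEIGHT) % FRAME_HEIGHT;
        live_neighbors += pixel_buf_read(nx, ny);
      }
    }
  }
  // Alive with 2 or 3 neighbors stays alive; dead with exactly 3 is born
  uint1_t next_alive = 0;
  if(alive & ((live_neighbors==2) | (live_neighbors==3))) next_alive = 1;
  if(!alive & (live_neighbors==3)) next_alive = 1;
  return next_alive;
}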

Working N Pixels at a Time

Memory is accessed RAM_PIXEL_BUFFER_SIZE pixels/cells at a time. However, simple implementations of Game of Life typically have individually addressable pixels/cells. To accommodate this a wrapper pixel_buf_read function is used to read single pixels.

// Frame buffer reads N pixels at a time
// ~Convert that into function calls reading 1 pixel/cell at a time
// VERY INEFFICIENT, reading N pixels to return just 1 for now...
uint1_t pixel_buf_read(uint32_t x, uint32_t y)
{
  // Read the pixels from the 'read' frame buffer
  uint16_t x_buffer_index = x >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, y);
  // Select the single pixel offset of interest (bottom bits of x)
  ram_pixel_offset_t x_offset = x;
  return pixels.data[x_offset];
}

Results and Improvements

Initially, with just one thread rendering the entire screen, the frame rate is a slow ~0.5 FPS:

// ~0.5 FPS
#define NUM_X_THREADS 1 
#define NUM_Y_THREADS 1
#define HOST_CLK_MHZ 25.0 // Threads clock
#define DEV_CLK_MHZ 100.0 // Frame buffers clock
#define RAM_PIXEL_BUFFER_SIZE 16 // Frame buffer RAM width in pixels

Multiple Threads

The first and easiest way of scaling up this design for higher FPS is to use more threads to tile the screen rendering.

// 1.1 FPS
// 2 threads
#define NUM_X_THREADS 2
#define NUM_Y_THREADS 1
// 2.3 FPS
// 4 threads
#define NUM_X_THREADS 2
#define NUM_Y_THREADS 2
// 4.5 FPS
// 8 threads
#define NUM_X_THREADS 4
#define NUM_Y_THREADS 2

At this point my personal FPGA is at about 50% LUT resource usage. These derived finite state machine threads have lots of room for optimization and are generally what limits expansion to more threads at this time.

Cached Reads

As described above, the current 'read a single cell' function is very inefficient. It reads RAM_PIXEL_BUFFER_SIZE pixels and selects just one to return.

uint1_t pixel_buf_read(uint32_t x, uint32_t y);

However, when computing the Game of Life next-state kernel function, the same section of RAM_PIXEL_BUFFER_SIZE pixels is read multiple times (when counting living neighbor cells).

Simple 1 Entry Cache

The simplest way to avoid repeated reads is to keep around the last read's result and re-use it if requested again:

// 8.0 FPS
// 8 threads, 1 cached read
uint1_t pixel_buf_read(uint32_t x, uint32_t y)
{
  // Cache registers
  static uint16_t cache_x_buffer_index = FRAME_WIDTH; // invalid init
  static uint16_t cache_y = FRAME_HEIGHT; // invalid init
  static n_pixels_t cache_pixels;
  // Read the pixels from the 'read' frame buffer or from cache
  n_pixels_t pixels;
  uint16_t x_buffer_index = x >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  uint1_t cache_match = (x_buffer_index==cache_x_buffer_index) & (y==cache_y);
  if(cache_match)
  {
    // Use cache
    pixels = cache_pixels;
  }
  else
  {
    // Read RAM and update cache
    pixels = dual_frame_buf_read(x_buffer_index, y);
    cache_x_buffer_index = x_buffer_index;
    cache_y = y;
    cache_pixels = pixels;
  }
  // Select the single pixel offset of interest (bottom bits of x)
  ram_pixel_offset_t x_offset = x;
  return pixels.data[x_offset];
}

This increases rendering to 8.0 FPS.

3 Line Cache

Game of Life repeatedly reads the 3x3 grid of cells around each cell. RAM_PIXEL_BUFFER_SIZE is typically >3, so one read can capture the entire x direction for several cells, but multiple reads are needed for the three y-direction lines:

// 13.9 FPS
// 3 'y' lines of reads cached
uint1_t pixel_buf_read(uint32_t x, uint32_t y)
{
  // Cache registers
  static uint16_t cache_x_buffer_index = FRAME_WIDTH;
  static uint16_t cache_y[3] = {FRAME_HEIGHT, FRAME_HEIGHT, FRAME_HEIGHT};
  static n_pixels_t cache_pixels[3];
  // Read the pixels from the 'read' frame buffer or from cache
  n_pixels_t pixels;
  uint16_t x_buffer_index = x >> RAM_PIXEL_BUFFER_SIZE_LOG2;
  // Check cache for match (only one will match)
  uint1_t cache_match = 0;
  uint8_t cache_sel; // Which of 3 cache lines
  uint32_t i;
  for(i=0; i<3; i+=1)
  {
    uint1_t match_i = (x_buffer_index==cache_x_buffer_index) & (y==cache_y[i]);
    cache_match |= match_i;
    if(match_i){
      cache_sel = i;
    }
  } 
  if(cache_match)
  {
    pixels = cache_pixels[cache_sel];
  }
  else
  {
    // Read RAM and update cache
    pixels = dual_frame_buf_read(x_buffer_index, y);
    // If got a new x pos to read then default clear/invalidate entire cache
    if(x_buffer_index != cache_x_buffer_index)
    {
      ARRAY_SET(cache_y, FRAME_HEIGHT, 3)
    }
    cache_x_buffer_index = x_buffer_index;
    // Least recently used style shift out cache entries
    // to make room for keeping new one most recent at [0]
    ARRAY_1SHIFT_INTO_BOTTOM(cache_y, 3, y)
    ARRAY_1SHIFT_INTO_BOTTOM(cache_pixels, 3, pixels)
  }
  // Select the single pixel offset of interest (bottom bits of x)
  ram_pixel_offset_t x_offset = x;
  return pixels.data[x_offset];
}

Rendering speed is increased to 13.9 FPS.

Wider RAM Data Bus

In the above tests RAM_PIXEL_BUFFER_SIZE=16, that is, a group of N=16 pixels is stored at each RAM address. Increasing the width of the data bus (while keeping clock rates the same) results in more available memory bandwidth, making it possible to write/read more pixels per second from RAM.

// 16.1 FPS
#define RAM_PIXEL_BUFFER_SIZE 32

If the width is made very wide, ex. RAM_PIXEL_BUFFER_SIZE=FRAME_WIDTH, then the RAM is essentially a line buffer holding FRAME_HEIGHT lines (the entire screen).

However, especially given the read data caching described above, this design is not currently memory bandwidth limited (at 480p, even tens of frames per second of reads and writes amounts to only tens of megapixels per second, far below what a 100MHz, 16-pixel-wide RAM port can supply). Increasing the RAM data width shows diminishing returns, ex. 64 bits gets to just 17 FPS for double the caching resources.

It is actually of greater benefit to save resources by using the original 16b wide RAM data, as this just barely allows for another doubling of the number of threads:

// ~30 FPS, 16 threads
#define RAM_PIXEL_BUFFER_SIZE 16
#define NUM_X_THREADS 4
#define NUM_Y_THREADS 4

Faster Clock Rates

Increasing the clock rates and eventually 'overclocking' the design is the final easy axis to explore for increasing rendering speed.

// ~30 FPS, 16 threads
#define HOST_CLK_MHZ 25.0 // Threads clock
#define DEV_CLK_MHZ 100.0 // Frame buffers clock
#define RAM_PIXEL_BUFFER_SIZE 16
#define NUM_X_THREADS 4
#define NUM_Y_THREADS 4

The host and device clock domains were chosen to easily meet timing. However, there is room to push the design further to see where timing fails and where visible glitches occur.

The host clock domain of user threads is currently frequency limited by excess logic produced from derived finite state machines. The device clock domain containing the frame buffer RAM and arbitration logic is currently limited by the arbitration implementation built into shared_resource_bus.h.

The design begins to fail timing with just a slightly higher host clock of 30MHz:

// 32 FPS, 16 threads
#define HOST_CLK_MHZ 30.0 
#define DEV_CLK_MHZ 100.0

However, in hardware testing the design runs with no visible issues using a host clock as fast as 45MHz:

// 46 FPS, 16 threads
#define HOST_CLK_MHZ 45 
#define DEV_CLK_MHZ 100.0

Any faster host clock results in a design that fails to work: it shows only a still image, with execution seemingly stalled on the very first frame.

Finally, the device clock running the frame buffers can be increased as well:

// 48 FPS, 16 threads
#define HOST_CLK_MHZ 45 
#define DEV_CLK_MHZ 150.0

This frame buffer RAM clock has been seen to work as high as ~250MHz. However, this is unnecessary since the design is not memory bandwidth limited and the returns quickly diminish, jumping by only 2 FPS for an extra 50MHz clock rate increase from the original 100MHz. Beyond a 250MHz device clock the design fails to display any image on the monitor.
