Shared Resource Bus
WORK IN PROGRESS
This page describes 'shared resource buses', which are very similar to system buses like AXI, where managers make requests to and receive responses from subordinates using five separate valid+ready handshake channels for simultaneous reads and writes.
These buses are used in a graphics demo to have multiple 'host threads' share frame buffers to play Game of Life at 480p.
PipelineC's shared_resource_bus.h generic shared bus header is used to create hosts that make read+write requests to and receive read+write responses from devices.
For example, if the device is a simple byte-address memory mapped RAM, request and responses can be configured like so:
- write_req_data_t: Write request data - ex. RAM needs address to request write
typedef struct write_req_data_t
{
  uint32_t addr;
  /*// AXI:
  // Address
  // "Write this stream to a location in memory"
  id_t awid;
  addr_t awaddr;
  uint8_t awlen; // Number of transfer cycles minus one
  uint3_t awsize; // 2^size = Transfer width in bytes
  uint2_t awburst;*/
}write_req_data_t;
- write_data_word_t: Write data word - ex. RAM writes some data element, some number of bytes, to addressed location
typedef struct write_data_word_t
{
  uint8_t data;
  /*// AXI:
  // Data stream to be written to memory
  uint8_t wdata[4]; // 4 bytes, 32b
  uint1_t wstrb[4];*/
}write_data_word_t;
- write_resp_data_t: Write response data - ex. RAM write returns dummy single valid/done/complete bit
typedef struct write_resp_data_t
{
  uint1_t dummy;
  /*// AXI:
  // Write response
  id_t bid;
  uint2_t bresp; // Error code*/
}write_resp_data_t;
- read_req_data_t: Read request data - ex. RAM needs address to request read
typedef struct read_req_data_t
{
  uint32_t addr;
  /*// AXI:
  // Address
  // "Give me a stream from a place in memory"
  id_t arid;
  addr_t araddr;
  uint8_t arlen; // Number of transfer cycles minus one
  uint3_t arsize; // 2^size = Transfer width in bytes
  uint2_t arburst;*/
}read_req_data_t;
- read_data_resp_word_t: Read data and response word - ex. RAM read returns some data element
typedef struct read_data_resp_word_t
{
  uint8_t data;
  /*// AXI:
  // Read response
  id_t rid;
  uint2_t rresp;
  // Data stream from memory
  uint8_t rdata[4]; // 4 bytes, 32b*/
}read_data_resp_word_t;
Shared resource buses use valid+ready handshaking just like AXI. Each of the five channels (write request, write data, write response, read request, and read data) has its own handshaking signals.
Again, like AXI, these buses have burst (packet last boundary) and pipelining (multiple IDs for transactions in flight) signals.
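For intuition, each channel in a given direction can be thought of as a small bundle of payload plus these extra fields, with a single ready signal flowing back the other way. Below is a rough illustrative sketch only; the struct name and field names are placeholders, not the exact types generated by shared_resource_bus.h:
// Illustrative sketch: roughly what one host-driven channel carries.
// The real field names/types come from the shared_resource_bus.h macros.
typedef struct example_write_req_channel_t
{
  write_req_data_t data; // User request payload (ex. RAM address)
  uint8_t id;            // Transaction ID, allows multiple requests in flight
  uint1_t burst_last;    // Marks the last transfer of a burst/packet
  uint1_t valid;         // Sender asserts valid; the transfer occurs when
                         // the receiver's separate 'ready' is also high
}example_write_req_channel_t;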
A kind/type of 'shared bus' is declared by using the SHARED_BUS_TYPE_DEF macro. Instances of the shared bus are declared using the SHARED_BUS_DECL macro. The macros declare types, helper functions, and a pair of global wires. One of the global variable wires is used for device to host data, while the other wire is for the opposite direction host to device. PipelineC's #pragma INST_ARRAY shared global variables are used to resolve multiple simultaneous drivers of the wire pairs into shared resource bus arbitration.
SHARED_BUS_TYPE_DEF(
ram_bus, // Bus 'type' name
uint32_t, // Write request type (ex. RAM address)
uint8_t, // Write data type (ex. RAM data)
uint1_t, // Write response type (ex. dummy value for RAM)
uint32_t, // Read request type (ex. RAM address)
uint8_t // Read data type (ex. RAM data)
)
SHARED_BUS_DECL(
ram_bus, // Bus 'type' name
uint32_t, // Write request type (ex. RAM address)
uint8_t, // Write data type (ex. RAM data)
uint1_t, // Write response type (ex. dummy value for RAM)
uint32_t, // Read request type (ex. RAM address)
uint8_t, // Read data type (ex. RAM data)
the_bus_name, // Instance name
NUM_HOST_PORTS,
NUM_DEV_PORTS
)
The SHARED_BUS_TYPE_DEF declares types like <bus_type>_dev_to_host_t and <bus_type>_host_to_dev_t.
Arbitrary devices are connected to the bus via controller modules that ~convert the host-to/from-device bus signals into device-specific signals.
Again for example, a simple RAM device might have a controller module like:
// Controller Outputs:
typedef struct ram_ctrl_t{
// ex. RAM inputs
uint32_t addr;
uint32_t wr_data;
uint32_t wr_enable;
// Bus signals driven to host
ram_bus_dev_to_host_t to_host;
}ram_ctrl_t;
ram_ctrl_t ram_ctrl(
// Controller Inputs:
// Ex. RAM outputs
uint32_t rd_data,
// Bus signals from the host
ram_bus_host_to_dev_t from_host
);
Inside that ram_ctrl module RAM specific signals are connected to the five valid+ready handshakes going to_host (ex. out from RAM) and from_host (ex. into RAM).
A full example of a controller can be found in the shared frame buffer example code discussed in later sections.
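As a rough idea of what goes inside such a controller, the sketch below shows a simplified read-only skeleton. It is illustrative only: the field paths on from_host/to_host are hypothetical placeholders rather than the exact names generated by shared_resource_bus.h, write channels and IDs/bursts are ignored, and a RAM that registers its address and returns data the next cycle is assumed.
// Simplified, illustrative read-only controller skeleton.
// Bus struct field names below are placeholders, not the exact ones
// produced by the shared_resource_bus.h macros.
ram_ctrl_t ram_ctrl(uint32_t rd_data, ram_bus_host_to_dev_t from_host)
{
  // Was a read request accepted last cycle? (RAM has 1 cycle of latency)
  static uint1_t read_in_flight;
  ram_ctrl_t o;
  // Drive the RAM address directly from the read request payload
  o.addr = from_host.read_req.data; // illustrative field path
  o.wr_data = 0;
  o.wr_enable = 0; // reads only in this sketch
  // Read data channel: RAM output is valid the cycle after a request
  o.to_host.read_data.data = rd_data;        // illustrative field path
  o.to_host.read_data.valid = read_in_flight;
  // Read request channel: ready for a new request when no response is
  // pending, or when the host takes the pending response this cycle
  uint1_t resp_done = read_in_flight & from_host.read_data_ready; // illustrative
  uint1_t req_ready = !read_in_flight | resp_done;
  o.to_host.read_req_ready = req_ready;      // illustrative field path
  // Track whether a request was accepted this cycle
  read_in_flight = (from_host.read_req.valid & req_ready) | (read_in_flight & !resp_done);
  return o;
}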
In the above sections a controller function, ex. ram_ctrl, describes how a device connects to a shared bus. Using the SHARED_BUS_ARB macro, the code below instantiates the arbitration that connects the multiple hosts and devices together through <instance_name>_from_host and <instance_name>_to_host wires for each device.
MAIN_MHZ(ram_arb_connect, DEV_CLK_MHZ)
void ram_arb_connect()
{
// Arbitrate M hosts to N devs
// Macro declares the_bus_name_from_host and the_bus_name_to_host
SHARED_BUS_ARB(ram_bus, the_bus_name, NUM_DEV_PORTS)
// Connect devs to arbiter ports
uint32_t i;
for (i = 0; i < NUM_DEV_PORTS; i+=1)
{
ram_ctrl_t port_ctrl
= ram_ctrl(<RAM outputs>, the_bus_name_from_host[i]);
<RAM inputs> = port_ctrl....;
the_bus_name_to_host[i] = port_ctrl.to_host;
}
}
Not everything can be squeezed into macros. Specifically, the five clock crossing asynchronous FIFOs (one for each request/response handshake) needed for each host<->device connection being arbitrated must be #included individually like so:
// First host port FIFO declaration
SHARED_BUS_ASYNC_FIFO_DECL(ram_bus, the_bus_name, 0)
// Temporary extra syntax specifically for 5 handshaking FIFOs
#include "clock_crossing/the_bus_name_fifo0_write_req.h"
#include "clock_crossing/the_bus_name_fifo0_write_data.h"
#include "clock_crossing/the_bus_name_fifo0_write_resp.h"
#include "clock_crossing/the_bus_name_fifo0_read_req.h"
#include "clock_crossing/the_bus_name_fifo0_read_data.h"
// A second host port FIFO declaration
SHARED_BUS_ASYNC_FIFO_DECL(ram_bus, the_bus_name, 1)
// Temporary extra syntax specifically for 5 handshaking FIFOs
#include "clock_crossing/the_bus_name_fifo1_write_req.h"
#include "clock_crossing/the_bus_name_fifo1_write_data.h"
#include "clock_crossing/the_bus_name_fifo1_write_resp.h"
#include "clock_crossing/the_bus_name_fifo1_read_req.h"
#include "clock_crossing/the_bus_name_fifo1_read_data.h"
// As many host ports, fifo2, etc
Each of those FIFOs must be connected on the host-side and device-side of the clock crossing like so:
// Wire ASYNC FIFOs to dev-host wires
MAIN_MHZ(host_side_fifo_wiring, HOST_CLK_MHZ)
void host_side_fifo_wiring()
{
SHARED_BUS_ASYNC_FIFO_HOST_WIRING(ram_bus, the_bus_name, 0)
SHARED_BUS_ASYNC_FIFO_HOST_WIRING(ram_bus, the_bus_name, 1)
// As many host ports, 2, etc
}
MAIN_MHZ(dev_side_fifo_wiring, DEV_CLK_MHZ)
void dev_side_fifo_wiring()
{
SHARED_BUS_ASYNC_FIFO_DEV_WIRING(ram_bus, the_bus_name, 0)
SHARED_BUS_ASYNC_FIFO_DEV_WIRING(ram_bus, the_bus_name, 1)
// As many host ports, 2, etc
}
The SHARED_BUS_DECL macro declares derived finite state machine helper functions for reading and writing the shared resource bus. These functions are to be used from NUM_HOST_PORTS simultaneous host FSM 'threads'.
Below, for example, are the generated signatures for reading and writing the example shared bus RAM:
uint8_t the_bus_name_read(uint32_t addr);
uint1_t the_bus_name_write(uint32_t addr, uint8_t data); // Dummy return value
The graphics_demo.c file is an example exercising two frame buffer devices as shared bus resources from shared_dual_frame_buffer.c.
For example, the state machine below reads from the shared resource frame_buf0_shared_bus or frame_buf1_shared_bus based on a select variable:
uint1_t frame_buffer_read_port_sel;
n_pixels_t dual_frame_buf_read(uint16_t x_buffer_index, uint16_t y)
{
uint32_t addr = pos_to_addr(x_buffer_index, y);
n_pixels_t resp;
if(frame_buffer_read_port_sel){
resp = frame_buf1_shared_bus_read(addr);
}else{
resp = frame_buf0_shared_bus_read(addr);
}
return resp;
}
One of the host threads using the frame buffers is always-reading logic to push pixels out the VGA port for display.
void host_vga_reader()
{
vga_pos_t vga_pos;
while(1)
{
// Read the pixels at x,y pos
uint16_t x_buffer_index = vga_pos.x >> RAM_PIXEL_BUFFER_SIZE_LOG2;
n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, vga_pos.y);
// Write it into async fifo feeding vga pmod for display
pmod_async_fifo_write(pixels);
// Execute a cycle of vga timing to get x,y and increment for next time
vga_pos = vga_frame_pos_increment(vga_pos, RAM_PIXEL_BUFFER_SIZE);
}
}
The frame buffer from frame_buffer.c stores RAM_PIXEL_BUFFER_SIZE pixels at each RAM address. This is done by defining a wrapper 'chunk of n pixels' struct.
// Must be divisor of FRAME_WIDTH across x direction
typedef struct n_pixels_t{
uint1_t data[RAM_PIXEL_BUFFER_SIZE];
}n_pixels_t;
The pixels_buffer_kernel function reads an n_pixels_t worth of pixels, runs some kernel function on each pixel sequentially, and then writes the resulting group of pixel values back.
void pixels_buffer_kernel(uint16_t x_buffer_index, uint16_t y)
{
// Read the pixels from the 'read' frame buffer
n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, y);
// Run kernel for each pixel
uint32_t i;
uint16_t x = x_buffer_index << RAM_PIXEL_BUFFER_SIZE_LOG2;
for (i = 0; i < RAM_PIXEL_BUFFER_SIZE; i+=1)
{
pixels.data[i] = some_kernel_func(pixels.data[i], x+i, y);
}
// Write pixels back to the 'write' frame buffer
dual_frame_buf_write(x_buffer_index, y, pixels);
}
The pixels_kernel_seq_range function iterates over a range of the frame area, executing pixels_buffer_kernel for each set of pixels. The frame area is defined by start and end x and y positions.
void pixels_kernel_seq_range(
uint16_t x_start, uint16_t x_end,
uint16_t y_start, uint16_t y_end)
{
uint16_t x_buffer_index_start = x_start >> RAM_PIXEL_BUFFER_SIZE_LOG2;
uint16_t x_buffer_index_end = x_end >> RAM_PIXEL_BUFFER_SIZE_LOG2;
uint16_t x_buffer_index;
uint16_t y;
for(y=y_start; y<=y_end; y+=1)
{
for(x_buffer_index=x_buffer_index_start; x_buffer_index<=x_buffer_index_end; x_buffer_index+=1)
{
pixels_buffer_kernel(x_buffer_index, y);
}
}
}
Multiple host threads can read and write the frame buffers, each executing its own copy of pixels_kernel_seq_range. This is accomplished by manually instantiating multiple derived FSM thread pixels_kernel_seq_range_FSM modules inside a function called render_frame. The NUM_TOTAL_THREADS = (NUM_X_THREADS*NUM_Y_THREADS) copies of pixels_kernel_seq_range all run in parallel, splitting FRAME_WIDTH across NUM_X_THREADS threads and FRAME_HEIGHT across NUM_Y_THREADS threads.
// Module that runs pixel_kernel for every pixel
// by instantiating multiple simultaneous 'threads' of pixel_kernel_seq_range
void render_frame()
{
// Wire up N parallel pixel_kernel_seq_range_FSM instances
uint1_t thread_done[NUM_X_THREADS][NUM_Y_THREADS];
uint32_t i,j;
uint1_t all_threads_done;
while(!all_threads_done)
{
pixels_kernel_seq_range_INPUT_t fsm_in[NUM_X_THREADS][NUM_Y_THREADS];
pixels_kernel_seq_range_OUTPUT_t fsm_out[NUM_X_THREADS][NUM_Y_THREADS];
all_threads_done = 1;
uint16_t THREAD_X_SIZE = FRAME_WIDTH / NUM_X_THREADS;
uint16_t THREAD_Y_SIZE = FRAME_HEIGHT / NUM_Y_THREADS;
for (i = 0; i < NUM_X_THREADS; i+=1)
{
for (j = 0; j < NUM_Y_THREADS; j+=1)
{
if(!thread_done[i][j])
{
fsm_in[i][j].input_valid = 1;
fsm_in[i][j].output_ready = 1;
fsm_in[i][j].x_start = THREAD_X_SIZE*i;
fsm_in[i][j].x_end = (THREAD_X_SIZE*(i+1))-1;
fsm_in[i][j].y_start = THREAD_Y_SIZE*j;
fsm_in[i][j].y_end = (THREAD_Y_SIZE*(j+1))-1;
fsm_out[i][j] = pixels_kernel_seq_range_FSM(fsm_in[i][j]);
thread_done[i][j] = fsm_out[i][j].output_valid;
}
all_threads_done &= thread_done[i][j];
}
}
__clk();
}
// Final step in rendering frame is switching to read from newly rendered frame buffer
frame_buffer_read_port_sel = !frame_buffer_read_port_sel;
}
render_frame is then simply run in a loop, trying for the fastest frames per second possible.
void main()
{
while(1)
{
render_frame();
}
}
Using the multi-threaded dual frame buffer graphics demo setup discussed above, the final specifics for a Game of Life demo are ready to assemble:
The per-pixel kernel function implementing Game of Life runs the familiar alive neighbor cell counting algorithm to compute the cell's next alive/dead state:
// Func run for every n_pixels_t chunk
void pixels_buffer_kernel(uint16_t x_buffer_index, uint16_t y)
{
// Read the pixels from the 'read' frame buffer
n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, y);
// Run Game of Life kernel for each pixel
uint32_t i;
uint16_t x = x_buffer_index << RAM_PIXEL_BUFFER_SIZE_LOG2;
for (i = 0; i < RAM_PIXEL_BUFFER_SIZE; i+=1)
{
pixels.data[i] = cell_next_state(pixels.data[i], x+i, y);
}
// Write pixels back to the 'write' frame buffer
dual_frame_buf_write(x_buffer_index, y, pixels);
}
Memory is accessed RAM_PIXEL_BUFFER_SIZE pixels/cells at a time. However, simple implementations of Game of Life typically have individually addressable pixels/cells. To accommodate this, a wrapper pixel_buf_read function is used to read single pixels.
// Frame buffer reads N pixels at a time
// ~Convert that into function calls reading 1 pixel/cell at a time
// VERY INEFFICIENT, reading N pixels to return just 1 for now...
uint1_t pixel_buf_read(uint32_t x, uint32_t y)
{
// Read the pixels from the 'read' frame buffer
uint16_t x_buffer_index = x >> RAM_PIXEL_BUFFER_SIZE_LOG2;
n_pixels_t pixels = dual_frame_buf_read(x_buffer_index, y);
// Select the single pixel offset of interest (bottom bits of x)
ram_pixel_offset_t x_offset = x;
return pixels.data[x_offset];
}
Initially, with just one thread rendering the entire screen, the frame rate is a slow ~0.5 FPS:
// ~0.5 FPS
#define NUM_X_THREADS 1
#define NUM_Y_THREADS 1
#define HOST_CLK_MHZ 25.0 // Threads clock
#define DEV_CLK_MHZ 100.0 // Frame buffers clock
#define RAM_PIXEL_BUFFER_SIZE 16 // Frame buffer RAM width in pixels
The first and easiest way of scaling up this design for higher FPS is to use more threads to tile the screen rendering.
// 1.1 FPS
// 2 threads
#define NUM_X_THREADS 2
#define NUM_Y_THREADS 1
// 2.3 FPS
// 4 threads
#define NUM_X_THREADS 2
#define NUM_Y_THREADS 2
// 4.5 FPS
// 8 threads
#define NUM_X_THREADS 4
#define NUM_Y_THREADS 2
At this point my personal FPGA runs out of resources for more threads. These derived finite state machine threads have lots of room for optimizations and are what limit expansion to more threads at this time.
As described above, the current 'read a single cell' function is very inefficient. It reads RAM_PIXEL_BUFFER_SIZE pixels and selects just one to return.
uint1_t pixel_buf_read(uint32_t x, uint32_t y);
However, in computing the Game of Life next state kernel function, the same section of RAM_PIXEL_BUFFER_SIZE pixels is read multiple times (when counting living neighbor cells).
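To see why, consider what a neighbor-counting next-state function looks like. The sketch below is illustrative only (the demo's actual cell_next_state may count neighbors and handle frame edges differently); the point is that the eight pixel_buf_read calls per cell mostly land in the same few RAM words:
// Illustrative Game of Life next-state function built on pixel_buf_read.
// Hypothetical sketch: the demo's real cell_next_state may differ
// (e.g. in edge handling), but the access pattern is the same idea.
uint1_t cell_next_state(uint1_t alive, uint32_t x, uint32_t y)
{
  // Count living neighbors in the 3x3 window around (x,y)
  uint32_t count = 0;
  int32_t dx, dy;
  for(dy = -1; dy <= 1; dy += 1)
  {
    for(dx = -1; dx <= 1; dx += 1)
    {
      if(!((dx==0) & (dy==0)))
      {
        // Wrap around the frame edges (one possible edge policy)
        uint32_t nx = (x + dx + FRAME_WIDTH) % FRAME_WIDTH;
        uint32_t ny = (y + dy + FRAME_HEIGHT) % FRAME_HEIGHT;
        // Without caching, each call is a full RAM_PIXEL_BUFFER_SIZE-wide read
        count += pixel_buf_read(nx, ny);
      }
    }
  }
  // Standard Game of Life rules
  uint1_t next_alive = 0;
  if(alive & ((count==2) | (count==3))) next_alive = 1;
  if(!alive & (count==3)) next_alive = 1;
  return next_alive;
}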
The simplest way to avoid repeated reads is to keep around the last read's result and re-use it if requested again:
// 8.0 FPS
// 1 cached read
uint1_t pixel_buf_read(uint32_t x, uint32_t y)
{
// Cache registers
static uint16_t cache_x_buffer_index = FRAME_WIDTH; // invalid init
static uint16_t cache_y = FRAME_HEIGHT; // invalid init
static n_pixels_t cache_pixels;
// Read the pixels from the 'read' frame buffer or from cache
n_pixels_t pixels;
uint16_t x_buffer_index = x >> RAM_PIXEL_BUFFER_SIZE_LOG2;
uint1_t cache_match = (x_buffer_index==cache_x_buffer_index) & (y==cache_y);
if(cache_match)
{
// Use cache
pixels = cache_pixels;
}
else
{
// Read RAM and update cache
pixels = dual_frame_buf_read(x_buffer_index, y);
cache_x_buffer_index = x_buffer_index;
cache_y = y;
cache_pixels = pixels;
}
// Select the single pixel offset of interest (bottom bits of x)
ram_pixel_offset_t x_offset = x;
return pixels.data[x_offset];
}
This increases rendering to 8.0 FPS.
Game of Life repeatedly reads in a 3x3 grid around each cell. RAM_PIXEL_BUFFER_SIZE is typically >3 so one read can capture the entire x direction for several cells, but multiple reads are needed for the three y direction lines:
// 13.9 FPS
// 3 'y' lines of reads cached
uint1_t pixel_buf_read(uint32_t x, uint32_t y)
{
// Cache registers
static uint16_t cache_x_buffer_index = FRAME_WIDTH;
static uint16_t cache_y[3] = {FRAME_HEIGHT, FRAME_HEIGHT, FRAME_HEIGHT};
static n_pixels_t cache_pixels[3];
// Read the pixels from the 'read' frame buffer or from cache
n_pixels_t pixels;
uint16_t x_buffer_index = x >> RAM_PIXEL_BUFFER_SIZE_LOG2;
// Check cache for match (only one will match)
uint1_t cache_match = 0;
uint8_t cache_sel; // Which of 3 cache lines
uint32_t i;
for(i=0; i<3; i+=1)
{
uint1_t match_i = (x_buffer_index==cache_x_buffer_index) & (y==cache_y[i]);
cache_match |= match_i;
if(match_i){
cache_sel = i;
}
}
if(cache_match)
{
pixels = cache_pixels[cache_sel];
}
else
{
// Read RAM and update cache
pixels = dual_frame_buf_read(x_buffer_index, y);
// If got a new x pos to read then default clear/invalidate entire cache
if(x_buffer_index != cache_x_buffer_index)
{
ARRAY_SET(cache_y, FRAME_HEIGHT, 3)
}
cache_x_buffer_index = x_buffer_index;
// Least recently used style shift out cache entries
// to make room for keeping new one most recent at [0]
ARRAY_1SHIFT_INTO_BOTTOM(cache_y, 3, y)
ARRAY_1SHIFT_INTO_BOTTOM(cache_pixels, 3, pixels)
}
// Select the single pixel offset of interest (bottom bits of x)
ram_pixel_offset_t x_offset = x;
return pixels.data[x_offset];
}
Rendering speed is increased to 13.9 FPS.
In the above tests RAM_PIXEL_BUFFER_SIZE=16, that is, a group of N=16 pixels is stored at each RAM address. Increasing the width of the data bus (while keeping clock rates the same) results in more available memory bandwidth, making it possible to write/read more pixels per second to/from RAM.
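As a rough back-of-the-envelope example (assuming at best one frame buffer word transferred per device clock cycle): at DEV_CLK_MHZ = 100 MHz, 16 one-bit pixels per word gives a peak of about 100 MHz * 16 = 1.6 Gpixels/s per port, and doubling the word to 32 pixels doubles that to roughly 3.2 Gpixels/s.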
// 16.1 FPS
#define RAM_PIXEL_BUFFER_SIZE 32
If the width is made very wide, ex. RAM_PIXEL_BUFFER_SIZE=FRAME_WIDTH, then essentially the RAM is a per-line buffer of FRAME_HEIGHT lines (the entire screen).
However, especially given the read data caching described above, this design is not currently memory bandwidth limited. Increasing the RAM data width shows diminishing returns, ex. 64 bits gets to just 17 FPS for double the caching resources.
Increasing the clock rates and eventually 'overclocking' the design is the final easy axis to explore for increasing rendering speed.
#define HOST_CLK_MHZ 25.0 // Threads clock
#define DEV_CLK_MHZ 100.0 // Frame buffers clock
The host and device clock domains were chosen to easily meet timing. However, there is room to push the design further and see where timing fails, and where visible glitches occur.
The host clock domain of user threads is currently frequency limited by excess logic produced from derived finite state machines. The device clock domain containing the frame buffer RAM and arbitration logic is currently limited by the arbitration implementation built into shared_resource_bus.h.
TODO FASTER CLOCK MEASUREMENTS

