This repo contains our fully verified RISC-V CPU, which implements the full RV32I base instruction set and includes pipelining and a data cache.
- main contains the verified pipelined CPU with a 2-way set-associative cache.
- SingleCycle contains the verified single-cycle CPU.
- Pipelined contains the verified pipelined CPU with hazard unit.
- Pipelinedw/L1L2, Superscalar, and Branch-prediction are still works in progress.
- The Team
  - Links to personal statements
  - Contribution Table
- Single Cycle
  - The instruction set and CPU design
  - Basic explanation of our logic and implementation
- Pipelined
  - CPU design
  - Basic explanation of logic and implementation
- Data Cache
  - Basic explanation of logic and implementation

| Name | Personal Statements | F1 High Score (ms) |
|---|---|---|
| Abraham Lin - Repo Master | Abraham.md | 162 |
| Charlotte Maxwell | charlotte.md | 207 |
| Shravan Kumar | shravan_kumar.md | 304 |
| Shreeya Agarwal | shreeya.md | 277 |
| Component | Shreeya Agarwal | Shravan Kumar | Charlotte Maxwell | Abraham Lin |
|---|---|---|---|---|
| PC | * | * | * | |
| ALU | * | * | * | |
| Register File | * | * | | |
| Instruction Memory | * | | | |
| Control Unit | * | * | | |
| Sign Extend | * | * | * | |
| Data Path | * | * | * | |
| Data Memory | * | | | |
| Top Level Assembly | * | * | * | * |
| Unit Tests | * | | | |
| Testbench & debugging | * | | | |
| F1.s | * | * | | |
| Component | Shreeya Agarwal | Shravan Kumar | Charlotte Maxwell | Abraham Lin |
|---|---|---|---|---|
| FF1 | * | * | * | |
| FF2 | * | * | * | |
| FF3 | * | * | * | |
| FF4 | * | * | | |
| Hazard Unit | * | * | | |
| Fetch | * | | | |
| Decode | * | | | |
| Execute | * | | | |
| Memory | * | | | |
| WriteBack | * | | | |
| Top | * | * | | |
| Testbench and debugging | * | | | |
| Component | Shreeya Agarwal | Shravan Kumar | Charlotte Maxwell | Abraham Lin |
|---|---|---|---|---|
| Direct Mapped Cache | * | | | |
| 2-Way Set Associative Cache | * | * | | |
| Memory and Top implementation | * | | | |
| Unit Tests | * | * | * | |
| Testbench and debugging | * | | | |
Please refer to everyone's individual personal statements to read up on the extra work done by the team.
The Aims
- Branch Prediction
- L1/L2 Cache
- Single Cycle SuperScalar model
The CPU diagrams in this document were designed by Abraham.
Please note the following when looking at the commit history:
- Shreeya had an issue with her GitHub setup where her commits were recorded under the name root, so all root commits belong to Shreeya.
- Charlotte accidentally contributed from multiple accounts, so her commits appear under two accounts rather than one.
- Members often met up to discuss the project, and as a group we showed good team spirit, working together to ensure that everyone:
  - was on track with tasks
  - learned from this project and about all the components within it.
- Some small-looking changes may have taken hours to debug and correct; this is evident in the quality of our work.
For Lab 4 information, see Lab_4.
The following videos show the F1 program and the 4 waveform PDF programs.
F1 Lights
F1.Light.Demo.mp4
Gaussian
PDF_gaussian.MOV
Sine
PDF_sine.MOV
Triangle
PDF_triangle.MOV
Noisy
PDF_noisy.MOV
First, refer to the branch details here
Next, run the following commands to run the tests:

git checkout <target-branch>
cd tb/
chmod +x assemble.sh # Give assemble.sh executable permission
bash -x ./doit.sh <Optional: file name> # For example ./tests/verify.cpp

Ensure that Vbuddy is connected correctly. This command might be helpful:

~/Documents/iac/lab0-devtools/tools/attach_usb.sh

Check that the port Vbuddy is connected to is consistent with vbuddy.cfg, which is set to /dev/ttyUSB0. You can check with this command:

ls /dev/ttyU*

Then we can run the plotting program:

git checkout <target-branch>
cd tb/plots/
bash -x ./doit_plot.sh <Optional signal name> # Default is Gaussian; options also include sine, noisy, triangle

To run the F1 lights program, first follow the steps above to ensure that Vbuddy is connected and configured correctly, then run the commands below:

git checkout Pipelinedw/Cache
cd tb/F1_Light
bash -x ./doit_F1.sh
Building on Lab 4, we implemented the single-cycle version. The main challenges were implementing all the instructions and adding the new data memory.
The textbook and lecture slides recommended the following diagram:
Following the project brief after lab 4, the main requirements we had were:
- Changes in the control unit to implement all the instructions for the RISC-V 32I instruction set (e.g. JAL, Load, Store)
- Determining the machine code to implement the F1 light cycle
- Adding a Data Memory and a Multiplexor, and the logic for adding
We adapted this to the following diagram. The main changes made were:
- Addition of a comparator
- The trigger input into DataMem
- ALU: adding the load immediate instruction to the ALU
- Top-level module checks and testing: ensuring variable names are consistent, debugging, and simulating in GTKWave to check that the machine code works and was implemented properly.
The memory map shows that the data memory goes from 0x10000 to 0x1FFFF, so we needed a memory array with 2^17 locations, with the data file loaded starting at 0x00010000:
// Memory array: 2^17 locations 0x00000000 to 0x0001FFFF, each MEM_WIDTH bits wide
logic [MEM_WIDTH-1:0] mem [0:2**17-1];
initial begin
$readmemh("data.hex", mem, 32'h00010000);
$display("Data Memory Contents After Initialization:");
$display("mem[0] = %h", mem[32'h00010000]);
$display("mem[1] = %h", mem[32'h00010001]);
$display("mem[2] = %h", mem[32'h00010002]);
$display("mem[3] = %h", mem[32'h00010003]);
end
We then implemented load and store instructions for the single cycle. Specifically, we handled the logic for reading from and writing to memory, with support for byte, half-word, and word accesses. Whether a load is signed or unsigned is determined by the funct3 field in the instruction.
The Read Logic for Load Instructions
This block defines how data is read from the memory during load instructions, based on the instruction's funct3 field and the address A.
If the address A equals 32'h000000FC, the code fetches the value of the trigger signal and places it in the LSB of RD. This is for hardware functionality. If the address doesn't match the MMIO address, the code uses a case statement to determine how to fetch data based on funct3.
// Read logic for load instructions
always_comb begin
if (A == 32'h000000FC) begin
// MMIO read from trigger address
RD = {31'b0, trigger}; // Return trigger in LSB
end
else begin
// Regular memory read
case (funct3)
3'b000: RD = {{24{mem[A][7]}}, mem[A]}; // lb
3'b001: RD = {{16{mem[A+1][7]}}, mem[A+1], mem[A]}; // lh (sign bit is bit 7 of the upper byte)
3'b010: RD = {mem[A+3], mem[A+2], mem[A+1], mem[A]}; // lw
3'b100: RD = {24'b0, mem[A]}; // lbu
3'b101: RD = {16'b0, mem[A+1], mem[A]}; // lhu
default: RD = 32'b0; // Default case
endcase
end
end

Write Logic for Store Instructions
This block defines how data is written to memory during store instructions, based on the instruction's funct3 field and the address A.
always_ff @(posedge clk) begin
if (WE) begin // Store instruction
case (funct3)
3'b000: mem[A] <= WD[7:0]; // sb
3'b001: begin
mem[A + 1] <= WD[15:8]; // sh
mem[A] <= WD[7:0];
end
3'b010: begin
mem[A + 3] <= WD[31:24]; // sw
mem[A + 2] <= WD[23:16];
mem[A + 1] <= WD[15:8];
mem[A] <= WD[7:0];
end
endcase
end
end
Memory writes are performed only when the WE signal is active, and the case statement determines how the write data (WD) is stored to memory based on funct3.
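As an illustration only, the load and store behaviour described above can be sketched as a small Python software model. This is not the repo's SystemVerilog: the names `DataMemoryModel`, `sign_extend`, and `MMIO_TRIGGER_ADDR` are hypothetical, and the model assumes a byte-addressed, little-endian memory where unwritten bytes read as 0.

```python
MMIO_TRIGGER_ADDR = 0x000000FC  # trigger address quoted in the text


def sign_extend(value, bits):
    """Sign-extend a `bits`-wide value to 32 bits (returned as an unsigned int)."""
    sign_bit = 1 << (bits - 1)
    return ((value ^ sign_bit) - sign_bit) & 0xFFFFFFFF


class DataMemoryModel:
    """Hypothetical byte-addressed, little-endian memory model."""

    def __init__(self):
        self.mem = {}  # sparse store: byte address -> byte; unwritten bytes read as 0

    def _byte(self, addr):
        return self.mem.get(addr, 0)

    def load(self, addr, funct3, trigger=0):
        if addr == MMIO_TRIGGER_ADDR:  # MMIO read: trigger in the LSB
            return trigger & 1
        half = (self._byte(addr + 1) << 8) | self._byte(addr)
        word = (self._byte(addr + 3) << 24) | (self._byte(addr + 2) << 16) | half
        if funct3 == 0b000:  # lb
            return sign_extend(self._byte(addr), 8)
        if funct3 == 0b001:  # lh
            return sign_extend(half, 16)
        if funct3 == 0b010:  # lw
            return word
        if funct3 == 0b100:  # lbu
            return self._byte(addr)
        if funct3 == 0b101:  # lhu
            return half
        return 0  # default case

    def store(self, addr, funct3, wd, we=True):
        if not we:  # writes only happen when WE is active
            return
        nbytes = {0b000: 1, 0b001: 2, 0b010: 4}.get(funct3, 0)  # sb, sh, sw
        for i in range(nbytes):
            self.mem[addr + i] = (wd >> (8 * i)) & 0xFF
```

For example, after storing the word 0xDEADBE80 at 0x10000 with `store(0x10000, 0b010, 0xDEADBE80)`, an lb from the same address returns 0xFFFFFF80 while an lbu returns 0x00000080, which shows the signed and unsigned variants selected by funct3.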
Full RISC-V 32I instruction list implemented

The CU generates control signals based on:
- op (used to determine the instruction type)
- funct3 (used to determine specific operations within an instruction type, e.g. ALU operations)
- funct7_5 (bit 5 of the funct7 field, used in some R-type instructions to differentiate between operations, e.g. add vs. subtract)

The always_comb block is re-evaluated whenever any of its input signals change. It generates the control signals based on the op and funct3 values. We made sure to initialise all the values.
This code generates control signals for a RISC-V processor based on the instruction opcode (op) and the function fields (funct3, funct7_5). These control signals drive the ALU, memory, register file, and branching mechanisms, enabling the processor to execute the appropriate operation for each instruction type.
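To make the decoding idea concrete, here is a minimal Python sketch of how op, funct3, and funct7_5 can be extracted from a 32-bit instruction and turned into control signals. The opcode values are the standard RV32I encodings; the function name `decode` and the signal names (`reg_write`, `mem_write`, `branch`, `jump`, `alu_op`) are hypothetical and do not necessarily match our control unit's ports.

```python
# Standard RV32I opcodes; the labels and returned signal names are hypothetical.
OPCODE_KIND = {
    0b0110011: "r_type",
    0b0010011: "i_type_alu",
    0b0000011: "load",
    0b0100011: "store",
    0b1100011: "branch",
    0b1101111: "jal",
    0b1100111: "jalr",
    0b0110111: "lui",
    0b0010111: "auipc",
}
ALU_BY_FUNCT3 = {0b001: "sll", 0b010: "slt", 0b011: "sltu",
                 0b100: "xor", 0b110: "or", 0b111: "and"}


def decode(instr):
    op = instr & 0x7F
    funct3 = (instr >> 12) & 0x7
    funct7_5 = (instr >> 30) & 0x1
    kind = OPCODE_KIND.get(op, "unknown")
    # Give every signal a default first, then override per instruction type.
    ctrl = {"kind": kind, "reg_write": False, "mem_write": False,
            "branch": False, "jump": False, "alu_op": "add"}
    if kind in ("r_type", "i_type_alu"):
        ctrl["reg_write"] = True
        if funct3 == 0b000:
            # Bit 30 selects sub only for R-type; for addi it is part of the immediate.
            ctrl["alu_op"] = "sub" if (kind == "r_type" and funct7_5) else "add"
        elif funct3 == 0b101:
            ctrl["alu_op"] = "sra" if funct7_5 else "srl"
        else:
            ctrl["alu_op"] = ALU_BY_FUNCT3[funct3]
    elif kind == "load":
        ctrl["reg_write"] = True
    elif kind == "store":
        ctrl["mem_write"] = True
    elif kind == "branch":
        ctrl["branch"] = True
    elif kind in ("jal", "jalr"):
        ctrl["reg_write"] = True
        ctrl["jump"] = True
    elif kind in ("lui", "auipc"):
        ctrl["reg_write"] = True
    return ctrl
```

Setting defaults before the per-type overrides mirrors the point made above about initialising every value in the always_comb block: an unknown opcode simply leaves all write and jump signals inactive.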
For our single cycle, we wrote unit tests to ensure the modules worked accurately and to isolate errors for easier debugging.
To run an individual unit test, follow these commands:

git checkout SingleCycle
cd tb
bash -x ./doit.sh ./tests/<testbench_to_run> # For example ./tests/sign_ext_tb.cpp

This allowed us to check the expected behaviour of each control/data path signal in a module.
To run the test for single cycle
Building on our single-cycle implementation, we developed the pipelined version. The main challenge was splitting the design into 5 stages: fetch, decode, execute, memory, and writeback. Adding the flip flops and the hazard unit was also an interesting design challenge.
The textbook and lecture slides recommended we use the following diagram:
We adapted this to the following diagram. The main changes we implemented were:
- The addition of 4 flip flops (the pipeline registers FF1 to FF4)
- A Hazard Unit to implement flush and stall logic for instructions
- Splitting up the implementation into the 4 stages (fetch, decode, memory, execute) and then adapting the top.sv file to combine the logic

Since the control unit needed no major adaptations from the single-cycle design, implementing pipelining went smoothly, especially as tackling it in 4 separate stages and combining them in the top file made it easier to debug and to keep track of all the inputs and outputs.
In the Hazard Unit we had to implement methods to handle RAW dependency hazards, load-word (LW) dependency hazards, and control hazards caused by branches and jumps.
In the hazard unit, we handled data hazards by determining when stalling or forwarding was necessary. The main goal was to ensure the processor handles the cases where the instructions depend on data that is yet to be written back (e.g. when the result of one instruction is needed by another instruction before it's available).
Default Outputs
// default outputs
Stall = 1'b0;
Flush = (PCSrc || rst) ? 1'b1 : 1'b0; // Flush on reset or when a jump/branch is taken (next PC is not PC+4)
ForwardA = 2'b00;
ForwardB = 2'b00;

- Stall: indicates whether the pipeline is stalled (i.e. no new instructions should proceed to the next stage). If a load instruction's result is needed by the following instruction, the pipeline is stalled to wait for the load instruction to complete.
- Flush: determines whether the pipeline should be cleared of its current state; this occurs when there is a jump or a taken branch.
- ForwardA and ForwardB: select where the data for the source registers (RS1 and RS2) is forwarded from.
Forwarding
if (RegWriteM && (RdM == Rs1E) && (Rs1E != 5'b00000)) begin
ForwardA = 2'b10; //forward from MEM stage if not Load; result ready as ALUoutM
end
else if (RegWriteW && (RdW == Rs1E) && (Rs1E != 5'b00000)) begin
ForwardA = 2'b01; //forward from WB stage
end

This part checks whether register Rs1 (used by the instruction currently in the EX stage) is being written by an instruction in the MEM or WB stage.
If the instruction in the MEM stage is writing to the same register Rs1E, the result is forwarded from the MEM stage (using ALUoutM), so ForwardA = 2'b10. Otherwise, if the instruction in the WB stage writes to Rs1E, the result is forwarded from the WB stage, so ForwardA = 2'b01.
Similar logic applies to the forwarding of Rs2E.
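The forwarding decision can be written once and reused for both operands. The following Python sketch is an illustrative software model of the logic above, not the repo's RTL; the function name `forward_select` and its argument names are hypothetical.

```python
def forward_select(rs_e, rd_m, reg_write_m, rd_w, reg_write_w):
    """Return the 2-bit forwarding select for one EX-stage source register."""
    if reg_write_m and rd_m == rs_e and rs_e != 0:
        return 0b10  # forward from MEM stage (ALU result)
    if reg_write_w and rd_w == rs_e and rs_e != 0:
        return 0b01  # forward from WB stage
    return 0b00      # no hazard: use the register file value


# ForwardA and ForwardB use the same rule with Rs1E and Rs2E respectively.
forward_a = forward_select(rs_e=5, rd_m=5, reg_write_m=True, rd_w=5, reg_write_w=True)
forward_b = forward_select(rs_e=6, rd_m=5, reg_write_m=True, rd_w=6, reg_write_w=True)
```

The MEM-stage check comes first because, when both later stages write the same register, the MEM stage holds the more recent value. Register x0 is excluded because it always reads as zero.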
Stall Conditions
if(LoadE && ((RdE == Rs1D) || (RdE == Rs2D)))
Stall = 1'b1;

This part detects load-use hazards, which occur when a load instruction is followed by an instruction that depends on the value being loaded.
The Stall and Flush logic was then added to the 4 flip flops, and all of this was combined into our top.sv file.
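As a rough software illustration of the stall and flush conditions, the sketch below mirrors the two expressions shown earlier. The function name `hazard_controls` and its argument names are hypothetical; like the condition quoted above, it does not special-case register x0.

```python
def hazard_controls(load_e, rd_e, rs1_d, rs2_d, pc_src, rst=False):
    """Return (stall, flush) for one cycle of the pipeline."""
    # Load-use hazard: the load in EX writes a register that the
    # instruction in Decode wants to read.
    stall = bool(load_e and (rd_e == rs1_d or rd_e == rs2_d))
    # Flush on reset or when the next PC is not PC+4 (jump or taken branch).
    flush = bool(pc_src or rst)
    return stall, flush


# lw x5, 0(x1) in EX followed by add x6, x5, x2 in Decode: one stall cycle.
stall, flush = hazard_controls(load_e=True, rd_e=5, rs1_d=5, rs2_d=2, pc_src=False)
```

After the single stall cycle the load reaches the WB stage relative to the dependent instruction in EX, so the forwarding path can supply the loaded value.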
The cache was introduced to give the processor faster access to memory by keeping a limited amount of data in storage that is directly accessible to the processor.
We first made a direct-mapped cache by following the recommended structure of the textbook; each cache line is 60 bits wide:
- 32 bits (LSB) assigned to DATA
- 27 bits (NSB) assigned to TAG
- 1 bit (MSB) assigned to VALID

A 32-bit memory address maps onto the cache as follows:
- bits [1:0] form the BYTE OFFSET, to accommodate word and byte addressing
- bits [4:2] form the SET, which indexes the cache storage
- bits [31:5] form the TAG, which identifies the data stored in the direct-mapped cache.
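The address breakdown can be checked with a few lines of Python. This is an illustrative helper rather than code from the repo; the name `split_address` and its parameters are hypothetical, with defaults matching the 8-set direct-mapped layout (2 offset bits, 3 set bits, 27 tag bits).

```python
def split_address(address, set_bits=3, offset_bits=2):
    """Split a 32-bit address into (tag, set_index, byte_offset)."""
    address &= 0xFFFFFFFF
    byte_offset = address & ((1 << offset_bits) - 1)
    set_index = (address >> offset_bits) & ((1 << set_bits) - 1)
    tag = address >> (offset_bits + set_bits)
    return tag, set_index, byte_offset


tag, set_index, byte_offset = split_address(0x00010024)
# tag == 0x801, set_index == 1, byte_offset == 0
```

With 2 offset bits and 3 set bits, the tag takes the remaining 32 - 2 - 3 = 27 bits. Calling the same helper with `set_bits=2` models a 4-set cache, where the tag grows to 28 bits.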
Cache Structure
typedef struct packed {
logic ValitdityBit;
logic [26:0] tag;
logic [DATA_WIDTH-1:0] data;
} CacheType;
CacheType cache [8]; // define 8 set cache
This defines the structure of each cache entry, where:
- ValitdityBit indicates whether the entry is valid
- tag determines whether the cache entry matches the requested address

This forms an array of 8 cache entries, where each address maps to exactly one cache set.
Cache Read Logic
logic [DATA_WIDTH-1:0] RD;
logic [26:0] tag;
logic [2:0] set;
always_comb begin
tag = address[31:5];
set = address[4:2];
if(cache[set].ValitdityBit && (cache[set].tag == tag)) begin
hit = 1;
CacheData = cache[set].data;
end
else begin
hit = 0;
CacheData = CacheMissData_RMM;
end
end

The cache read logic checks if the requested address is in the cache by comparing the tag and checking if the cache entry is valid. If the data is found in the cache (hit), it is returned immediately; otherwise, the data from the main memory is returned (miss). The cache write logic writes data into the cache only if there's a miss and the write enable signal (WE) is active.
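The hit and miss behaviour can be sketched as a small Python model. This is an illustrative software model, not the repo's SystemVerilog: the class name `DirectMappedCache`, the `access` method, and the use of a dict as main memory are assumptions, and WE is treated here simply as an enable for filling the line on a miss.

```python
class DirectMappedCache:
    """Hypothetical model of an 8-set direct-mapped cache holding one word per set."""

    NUM_SETS = 8

    def __init__(self):
        self.lines = [{"valid": False, "tag": 0, "data": 0}
                      for _ in range(self.NUM_SETS)]

    def access(self, address, main_memory, we=True):
        """Return (hit, data) for a word read at `address`."""
        set_index = (address >> 2) & 0x7   # address[4:2]
        tag = address >> 5                 # address[31:5]
        line = self.lines[set_index]
        if line["valid"] and line["tag"] == tag:
            return True, line["data"]      # hit: data comes from the cache
        data = main_memory.get(address & ~0x3, 0)  # miss: data comes from memory
        if we:                             # fill the line only when enabled
            line.update(valid=True, tag=tag, data=data)
        return False, data


memory = {0x10000: 0xAAAA, 0x10020: 0xBBBB}  # both addresses map to set 0
cache = DirectMappedCache()
first = cache.access(0x10000, memory)     # miss, fills set 0
second = cache.access(0x10000, memory)    # hit
conflict = cache.access(0x10020, memory)  # miss, replaces the line in set 0
again = cache.access(0x10000, memory)     # miss again: a conflict miss
```

The last two accesses show the main weakness of a direct-mapped cache: two addresses that share a set keep evicting each other even though the other seven sets are empty.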
The set-associative cache improves upon the direct-mapped cache by allowing multiple data blocks to reside in the same cache set, reducing conflict misses and improving overall cache hit rates. It also incorporates write-back policy, dirty bits, and LRU replacement. We followed the logic of the diagram provided in the textbook and lecture slides:
This was our final CPU design with the 2 way set-associative cache implemented.

1) Mapping
- Direct-Mapped Cache: each memory block can be placed in only one specific cache line, determined by the address. If there's a conflict (i.e. multiple memory blocks map to the same cache line), only one block can reside in that cache line at a time, so accessing the other block results in a cache miss.
- Set-Associative Cache: each memory block can be placed in any of the lines (ways) of the set it maps to. The number of lines in a set is determined by the associativity of the cache. Here, each set can hold 2 different blocks of data.
2) Cache Lookup
- Direct-Mapped: a single tag comparison per cache line determines if it's a hit or a miss.
- Set-Associative Cache: here, 2 ways are checked for a cache hit (each set can hold 2 memory blocks). This increases the hit probability.
3) Replacement Policy
- Direct-Mapped: there is no choice of which cache line to evict on a miss, because each block maps to a specific cache line.
- Set-Associative Cache: when a miss occurs and the set is full, the cache has to choose one of the ways to evict. In this code, the LRU (Least Recently Used) policy is used to decide which way to evict, using the U bit to track which way was last used.
Cache Structure
typedef struct packed {
logic U; // LRU bit -- 1 = [Way 0 recently used]; 0 = [Way 1 recently used] => should write to the other way
// Way 1
logic ValitdityBit1;
logic DB1; // Dirty bit
logic [27:0] tag1;
logic [DATA_WIDTH-1:0] data1;
// Way 0
logic ValitdityBit0;
logic DB0; // Dirty bit
logic [27:0] tag0;
logic [DATA_WIDTH-1:0] data0;
} CacheType;
The cache is organised into 4 sets, each of which has 2 slots (ways) for storing data. Since the number of sets has halved from 8 to 4, the set index shrinks from 3 bits to 2, so the TAG is now 28 bits instead of 27.
U is the bit indicating the LRU state. The dirty bits DB0, DB1 indicate if the data in the cache line is modified and needs to be written back to memory.
Cache Access
The logic checks whether the address matches either of the 2 possible ways in the selected set (Way0 or Way1). This drives the hit and stall signals, depending on whether there is a hit or a miss.
Write-Back and Write-Allocate Policies
If a cache line is evicted and its data is dirty, the data is written back to main memory.
On a write miss, data is fetched from the memory into cache, before the write operation is performed.
LRU Logic
The U bit keeps track of which way was last used. After a cache hit, the U bit is updated to record the way that was just accessed, so the other way becomes the least recently used. This allows the system to evict the least recently used way when a miss occurs and both ways in the set are occupied.
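The interaction between the U bit, the dirty bits, write-back, and write-allocate can be illustrated with a short Python model. It is a simplified sketch under stated assumptions, not the repo's RTL: the class `TwoWayCache` and its methods are hypothetical, main memory is a dict of word addresses, only word accesses are modelled, and filling an invalid way before evicting is an assumption of the sketch.

```python
class TwoWayCache:
    """Hypothetical model: 4 sets x 2 ways, write-back, write-allocate, LRU via U bit."""

    NUM_SETS = 4

    def __init__(self, main_memory):
        self.mem = main_memory  # dict: word address -> 32-bit word
        self.sets = [{"U": 0, "ways": [self._empty(), self._empty()]}
                     for _ in range(self.NUM_SETS)]

    @staticmethod
    def _empty():
        return {"valid": False, "dirty": False, "tag": 0, "data": 0}

    def _find_or_fill(self, address):
        """Return (hit, set, way_index), fetching the word on a miss."""
        set_index = (address >> 2) & 0x3   # address[3:2]
        tag = address >> 4                 # address[31:4], 28 bits
        s = self.sets[set_index]
        for i, way in enumerate(s["ways"]):
            if way["valid"] and way["tag"] == tag:
                return True, s, i
        # Miss: use an invalid way if there is one, else evict the LRU way.
        # U == 1 means way 0 was used most recently, so way 1 is the victim.
        invalid = [i for i, w in enumerate(s["ways"]) if not w["valid"]]
        victim = invalid[0] if invalid else (1 if s["U"] == 1 else 0)
        way = s["ways"][victim]
        if way["valid"] and way["dirty"]:  # write-back of the evicted line
            self.mem[(way["tag"] << 4) | (set_index << 2)] = way["data"]
        way.update(valid=True, dirty=False, tag=tag,
                   data=self.mem.get(address & ~0x3, 0))
        return False, s, victim

    def read(self, address):
        hit, s, i = self._find_or_fill(address)
        s["U"] = 1 if i == 0 else 0        # record the way just used
        return hit, s["ways"][i]["data"]

    def write(self, address, data):
        hit, s, i = self._find_or_fill(address)  # write-allocate on a miss
        s["ways"][i].update(data=data & 0xFFFFFFFF, dirty=True)
        s["U"] = 1 if i == 0 else 0
        return hit
```

In this model a written word only reaches main memory when its line is later evicted, which is the defining property of a write-back cache. Two addresses that share a set (for example 0x10000 and 0x10010) can both stay cached, unlike in the direct-mapped design.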
Memory-Mapped I/O (MMIO)
The cache handles MMIO accesses, where the address 0x000000FC is specifically designated for memory-mapped I/O. When an address matches this, the cache directly outputs RD (read data from memory) without any cache lookup.
Read/Write Operations
- Read: If the cache contains the data, it's returned based on the funct3 value, which determines the load type (e.g., byte, halfword, word).
- Write: On a write miss, the cache fetches the data from memory and updates the cache accordingly. The funct3 value determines the type of store (byte, halfword, or word).
Reset Functionality
On reset (rst), all cache entries are cleared, and the validity bits and dirty bits are reset to their default values.






