RISC-V RV32I CPU — Progressive Implementation

A progressive, from-scratch implementation of a RISC-V RV32I processor in Verilog, developed across multiple evaluation levels. The project begins with a single-cycle baseline, advances through a 5-stage pipelined design, and culminates in a 2-wide dual-issue superscalar core. The board-level designs (level2 through level4) target the Xilinx Spartan-7 50T FPGA (xc7s50csga324-1) using Vivado 2025.2; the historical eval2 variants target an Artix-7 part, and the superscalar core is simulation only.

1. Repository Layout

riscv/
├── src/                        # Root: standalone single-cycle RV32I core
├── sim/                        # Root: single-cycle simulation hex + testbench
├── mem/                        # Root: output from gen_hex.py
├── gen_hex.py                  # Level-0 test program generator (19 instructions)
├── comparison.txt              # Eval2 architecture comparison table
├── eval2_report.txt            # Eval2 detailed synthesis + simulation report
│
├── eval2/                      # Historical: single-cycle baseline (eval2)
├── eval2_multicycle/           # Historical: FSM multicycle (eval2)
├── eval2_pipeline/             # Historical: 4-stage pipeline IF/ID/EX/WB (eval2)
├── eval2_superscalar/          # Historical: early 2-wide superscalar attempt (eval2)
├── eval2_complete/             # Historical: Vivado project version (eval2)
├── eval1_presentation/         # LaTeX/PDF presentation slides
│
├── level2/pipeline/            # 5-stage pipeline: R/I/load/store/branch
│   ├── src/                    # 19 Verilog source files
│   ├── sim/                    # Testbench, program.hex, run.tcl
│   ├── synth.tcl               # Synthesis script (riscv_core standalone)
│   ├── impl.tcl                # Implementation script (riscv_core standalone)
│   ├── board_impl_boolean.tcl  # Full boolean_top implementation + program
│   ├── boolean.xdc             # Boolean Board constraints
│   ├── arty_s7.xdc             # Arty S7-50 constraints
│   ├── impl_out/               # Synthesis reports: riscv_core standalone
│   └── board_out_boolean/      # Synthesis reports: boolean_top full build
│
├── level3/pipeline/            # Level2 + JAL/JALR/LUI/AUIPC
│   ├── src/                    # Verilog source files
│   ├── sim/                    # Testbenches, program.hex, program_new.hex
│   ├── assemble.py             # Python RV32I assembler
│   ├── boolean.xdc
│   ├── synth.tcl
│   └── board_out_boolean/      # Synthesis reports
│
├── level4/pipeline/            # Level3 + comprehensive 69-instruction test
│   ├── src/                    # Verilog source files (identical RTL to level3)
│   ├── sim/                    # Testbench, 69-instruction program.hex
│   ├── assemble.py             # Enhanced assembler (all RV32I instructions)
│   ├── boolean.xdc
│   ├── synth.tcl
│   └── board_out_boolean/      # Synthesis reports
│
└── l4_superscalar/             # 2-wide dual-issue superscalar (simulation only)
    ├── src/                    # Monolithic riscv_core.v + supporting modules
    └── sim/                    # Testbench, program.hex

2. Architecture Overview

The project follows a linear progression:

Level 0 (root src/)       Single-cycle — all instructions in one clock
        |
Eval2 variants            Single-cycle / multicycle FSM / 4-stage pipeline /
        |                 early superscalar (historical, for comparison)
        |
Level 2 (5-stage)         IF → ID → EX → MEM → WB
        |                 R/I-ALU, LW, SW, all 6 branch types
        |
Level 3 (5-stage)         Level2 + LUI, AUIPC, JAL, JALR
        |
Level 4 (5-stage)         Level3 RTL, 69-instruction comprehensive test suite
        |
L4 Superscalar            2-wide dual-issue, 4-stage, full RV32I + byte/half memory

All pipelined designs (level2–level4) share the same 5-stage pipeline skeleton with a common hazard/forwarding architecture. The superscalar is a separate, monolithic design.

3. ISA Coverage Summary

Instruction Group	Level 2	Level 3	Level 4	L4 Superscalar
R-type (ADD/SUB/AND/OR/XOR/SLL/SRL/SRA/SLT/SLTU)	Yes	Yes	Yes	Yes
I-type ALU (ADDI/ANDI/ORI/XORI/SLTI/SLTIU/SLLI/SRLI/SRAI)	Yes	Yes	Yes	Yes
LW	Yes	Yes	Yes	Yes
SW	Yes	Yes	Yes	Yes
LB / LH (byte/halfword loads)	No	No	No	Yes
SB / SH (byte/halfword stores)	No	No	No	Yes
BEQ / BNE	Yes	Yes	Yes	Yes
BLT / BGE / BLTU / BGEU	Yes	Yes	Yes	Yes
LUI	No	Yes	Yes	Yes
AUIPC	No	Yes	Yes	Yes
JAL	No	Yes	Yes	Yes
JALR	No	Yes	Yes	Yes
FENCE / ECALL / EBREAK	No	No	No	Decoded (NOP)

4. Level-by-Level Description

Root `src/` — Single-Cycle Core

A standalone single-cycle RV32I processor in src/riscv_core.v. All seven source files are self-contained (no stage hierarchy):

File	Description
`riscv_core.v`	Top-level single-cycle core; PC, decode, execute, memory, writeback all in one module
`alu.v`	32-bit ALU
`control_unit.v`	Combinational control decoder
`imm_gen.v`	Immediate generator (all types)
`regfile.v`	32×32 register file
`instr_mem.v`	Synchronous instruction memory (`$readmemh`)
`data_mem.v`	Data memory

The test program in sim/program.hex is generated by gen_hex.py (19 instructions: ADDI, ADD, SUB, AND, OR, XOR, SLL, SRL, SLT, SLTU, ANDI, ORI, XORI, SLLI, SRLI, SRAI, JAL x0 halt).

Eval2 Historical Variants

Four designs implemented for comparative evaluation (eval2/, eval2_multicycle/, eval2_pipeline/, eval2_superscalar/). All target the Artix-7 xc7a35tcpg236-1 and use a 12-instruction ALU test program. See Section 12 for a full comparison table.

Level 2 — 5-Stage Pipeline (Baseline)

Path: level2/pipeline/

The first fully pipelined design. Implements a classic 5-stage RISC pipeline with full forwarding and hazard detection.

Supported ISA:

All 10 R-type: ADD, SUB, AND, OR, XOR, SLL, SRL, SRA, SLT, SLTU
All 9 I-type ALU: ADDI, ANDI, ORI, XORI, SLTI, SLTIU, SLLI, SRLI, SRAI
LW (word load)
SW (word store)
All 6 branches: BEQ, BNE, BLT, BGE, BLTU, BGEU

Immediate types supported: I-type, S-type, B-type (no U or J).

Boards: Boolean Board (boolean.xdc) and Arty S7-50 (arty_s7.xdc). Both use the same device xc7s50csga324-1 (Spartan-7 50T).

Test program (sim/program.hex): 29 instructions. Exercises all supported instruction types, including a loop that sums 1–5 (expected result: 15 in x10).

Source files:

File	Description
`riscv_core.v`	Top-level: instantiates all pipeline stages
`if_stage.v`	Instruction fetch; PC register and IMem read
`if_id_reg.v`	IF/ID pipeline register
`id_stage.v`	Instruction decode; regfile read
`id_ex_reg.v`	ID/EX pipeline register
`ex_stage.v`	Execute; ALU, forwarding muxes, branch logic
`ex_mem_reg.v`	EX/MEM pipeline register
`mem_stage.v`	Memory access; data memory read/write
`mem_wb_reg.v`	MEM/WB pipeline register
`wb_stage.v`	Writeback; selects ALU result or load data
`hazard_fwd_unit.v`	Load-use stall detection + forwarding control
`alu.v`	32-bit ALU (11 operations)
`decoder.v`	Combinational instruction decoder
`imm_gen.v`	Immediate sign-extension (I/S/B types)
`regfile.v`	32×32 register file, write-through
`instr_mem.v`	Synchronous instruction memory
`data_mem.v`	Synchronous word-addressed data memory
`boolean_top.v`	Boolean Board top-level wrapper
`arty_s7_top.v`	Arty S7-50 top-level wrapper
`seg7_ctrl.v`	Time-multiplexed 4-digit hex display driver

Level 3 — Full Control Flow

Path: level3/pipeline/

Extends level2 with the remaining non-memory, non-ALU base instructions: LUI, AUIPC, JAL, JALR. This completes the control-flow subset of RV32I.

New in level3 vs level2:

Feature	Detail
`imm_gen.v`	Adds U-type (`lui`/`auipc`) and J-type (`jal`) immediates
`decoder.v`	Adds output signals: `jump`, `jalr`, `lui`, `auipc`, `link`
`id_ex_reg.v`	Carries 5 additional control bits through the pipeline
`ex_stage.v`	`alu_a` mux: selects PC for AUIPC; `alu_result` mux: selects PC+4 for link (JAL/JALR); computes `jalr_target = (rs1+imm) & ~1`, `jal_target = PC+imm`
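
The target arithmetic in the `ex_stage.v` row can be modelled in a few lines of Python. This is a behavioural sketch rather than the RTL: the function names are hypothetical, and 32-bit wraparound is assumed, as RV32I address arithmetic requires.

```python
MASK32 = 0xFFFFFFFF

def jal_target(pc, imm):
    """JAL: PC-relative target, wrapped to 32 bits."""
    return (pc + imm) & MASK32

def jalr_target(rs1, imm):
    """JALR: rs1 + imm with bit 0 cleared, i.e. (rs1+imm) & ~1."""
    return (rs1 + imm) & MASK32 & ~1

def link_value(pc):
    """Value written to rd by JAL/JALR: address of the next instruction."""
    return (pc + 4) & MASK32

print(hex(jal_target(0x40, -16)))    # backward jump
print(hex(jalr_target(0x1003, 4)))   # odd sum, bit 0 is cleared
print(hex(link_value(0x40)))
```

Clearing bit 0 matters only when `rs1 + imm` is odd; the second call shows that case.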

assemble.py: Full Python RV32I assembler. Supports all R/I/S/B/U/J instruction formats, forward and backward label references, and all 6 branch types. Assembles source text directly to sim/program.hex.
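
As an illustration of what an assembler for these formats has to do, the sketch below encodes three instructions using the standard RV32I field layouts. It is not taken from assemble.py: the helper names are hypothetical, and the one-word-per-line hex output is an assumption about what `$readmemh` is given.

```python
def encode_i(opcode, funct3, rd, rs1, imm):
    """I-type: imm[11:0] | rs1 | funct3 | rd | opcode."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

def encode_u(opcode, rd, imm20):
    """U-type: imm[31:12] | rd | opcode."""
    return ((imm20 & 0xFFFFF) << 12) | (rd << 7) | opcode

def encode_b(funct3, rs1, rs2, offset):
    """B-type: the 13-bit byte offset is scattered across the word."""
    imm = offset & 0x1FFF
    return ((((imm >> 12) & 1) << 31) | (((imm >> 5) & 0x3F) << 25)
            | (rs2 << 20) | (rs1 << 15) | (funct3 << 12)
            | (((imm >> 1) & 0xF) << 8) | (((imm >> 11) & 1) << 7) | 0x63)

words = [
    encode_i(0x13, 0b000, rd=1, rs1=0, imm=5),   # addi x1, x0, 5
    encode_u(0x37, rd=3, imm20=0x12345),         # lui  x3, 0x12345
    encode_b(0b000, rs1=1, rs2=2, offset=8),     # beq  x1, x2, +8
]
for w in words:
    print(f"{w:08x}")   # one 32-bit word per line
```

Forward label references are typically resolved with a second pass: the first pass records each label's byte address, and the second computes `offset = label_address - pc` before calling an encoder like `encode_b`.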

Test programs:

sim/program.hex: 17-instruction nested loop / function-call test via JAL
sim/program_new.hex: 69-instruction comprehensive test covering all level3 instructions (assembled by assemble.py)

Testbenches:

sim/tb_riscv_core.v: Stack operations, nested loops, JAL call/return
sim/tb_new_program.v: Elaborate per-instruction pass/fail checking for all level3 ISA

Board: Boolean Board only.

Level 4 — Comprehensive Verification

Path: level4/pipeline/

The RTL is identical to level3 — same source files, same module hierarchy, same synthesis results. Level4 is distinguished by its enhanced assemble.py and test suite, which provide exhaustive coverage of all implemented instructions.

Test program (sim/program.hex): 69 instructions assembled by assemble.py. Covers:

LUI: loads 0x12345000 into x3
AUIPC: loads PC + 0xABCDE000 into x4 (expected: 0xABCDE008)
All 10 R-type operations
All 9 I-type ALU operations
SW + LW round-trip (pattern: 0xDEAD)
All 6 branch types (counter incremented once per taken branch, final expected: 6)
JAL + JALR with stack push/pop via SP (x2)
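
The LUI and AUIPC expectations in the list above can be reproduced with a short calculation. This is a sketch with hypothetical helper names; the AUIPC address of 0x8 is an inference from the expected value 0xABCDE008, not something the test program listing confirms.

```python
MASK32 = 0xFFFFFFFF

def lui(imm20):
    """LUI: place the 20-bit immediate in bits [31:12], low bits zero."""
    return (imm20 << 12) & MASK32

def auipc(pc, imm20):
    """AUIPC: PC plus the shifted immediate, wrapped to 32 bits."""
    return (pc + (imm20 << 12)) & MASK32

x3 = lui(0x12345)
x4 = auipc(0x8, 0xABCDE)   # assumption: the AUIPC sits at byte address 0x8
print(hex(x3), hex(x4))
```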

Testbench (sim/tb_riscv_core.v): Checks each result with explicit $display/$fatal assertions. Final pass message confirms all checks passed.

Board: Boolean Board only.

L4 Superscalar — 2-Wide Dual-Issue

Path: l4_superscalar/

A 2-wide dual-issue superscalar processor. This is a simulation-only design (no board top, no XDC, no synthesis TCL). The architecture differs significantly from the single-issue levels: it is a monolithic core with a 4-stage pipeline rather than the shared 5-stage skeleton.

Key design decisions:

Feature	Detail
Fetch	Fetches 2 instructions per cycle (`instr0`, `instr1`) from `instr_mem`
Issue gate (`dual_ok`)	Issues the pair only when all of the following hold: both slots are valid; neither is a control instruction (branch/jal/jalr); there is no intra-group RAW hazard (slot0.rd matches neither slot1.rs1 nor slot1.rs2); the two are not both memory ops; neither is a FENCE
PC advance	`PC + 8` when dual_ok, `PC + 4` when single-issue, hold on stall, redirect on branch taken
Pipeline stages	4 stages: IF/ID → EX → MEM → WB (pairs tracked as e0/e1, m0/m1, w0/w1)
Dual ALUs	`u_alu0` + `u_alu1` operating in parallel
Forwarding	2-bit `fa0/fb0/fa1/fb1`: 00=none, 01=MEM slot0, 10=MEM slot1, 11=WB either
Register file	Dual-write port (we0/we1), 4 read ports (rs1_0/rs2_0/rs1_1/rs2_1), write-through on all read ports
Data memory	Byte-addressable with `width[1:0]` and `mem_unsigned` for LB/LH/LW/SB/SH/SW
`control_unit.v`	Extended decoder: adds `mem_to_reg`, `fence`, `ecall`, `ebreak`, `mem_width[2:0]`, `mem_unsigned`
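
The issue gate and PC advance rules in the table can be written as a small behavioural model. This is a sketch, not the RTL: the `Slot` fields are hypothetical names, qualifying the RAW check with `reg_write` and `rd != 0` is an assumption, and so is giving a branch redirect priority over a stall.

```python
from dataclasses import dataclass

@dataclass
class Slot:
    valid: bool = True
    rd: int = 0
    rs1: int = 0
    rs2: int = 0
    reg_write: bool = False
    is_ctrl: bool = False    # branch / jal / jalr
    is_mem: bool = False     # load or store
    is_fence: bool = False

def dual_ok(s0, s1):
    """True when slot 0 and slot 1 may issue together."""
    # Assumption: a write to x0 or a non-writing slot 0 creates no hazard.
    raw = s0.reg_write and s0.rd != 0 and s0.rd in (s1.rs1, s1.rs2)
    return (s0.valid and s1.valid
            and not (s0.is_ctrl or s1.is_ctrl)
            and not raw
            and not (s0.is_mem and s1.is_mem)
            and not (s0.is_fence or s1.is_fence))

def next_pc(pc, s0, s1, stall=False, redirect=None):
    """PC update: redirect, hold, +8 on dual issue, +4 otherwise."""
    if redirect is not None:
        return redirect
    if stall:
        return pc
    return pc + 8 if dual_ok(s0, s1) else pc + 4

producer = Slot(rd=1, reg_write=True)
consumer = Slot(rd=2, rs1=1, reg_write=True)      # reads x1: RAW on slot 0
independent = Slot(rd=3, rs1=4, rs2=5, reg_write=True)
print(dual_ok(producer, consumer), dual_ok(producer, independent))
```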

Note: instr_mem.v contains a hardcoded Windows-style default path for MEM_FILE. This must be overridden at instantiation on Linux (pass MEM_FILE as a parameter or edit the default).

Test program (sim/program.hex): 17-instruction simple loop and function-call sequence. Testbench prints a full per-cycle pipeline trace (both slots, EX/MEM/WB stages) and checks x1–x19 against expected values.

5. Pipeline Architecture

All level2/3/4 designs share this 5-stage pipeline:

         ┌────────┐    ┌────────┐    ┌────────┐    ┌────────┐    ┌────────┐
  clk ──>│   IF   │───>│   ID   │───>│   EX   │───>│  MEM   │───>│   WB   │
         └────────┘    └────────┘    └────────┘    └────────┘    └────────┘
              │    if_id   │    id_ex   │    ex_mem  │    mem_wb  │
              │   ══════>  │   ══════>  │   ══════>  │   ══════>  │
              │            │            │            │            │
         PC + IMem    Decode +      ALU +        DMem         Regfile
                       RegRead     Branch                      Write
                                   Resolve

Pipeline registers: if_id_reg, id_ex_reg, ex_mem_reg, mem_wb_reg — all synchronous, reset-to-NOP on flush.

Key microarchitectural properties:

Property	Value / Description
Clock	100 MHz (10 ns period), single `sys_clk` domain
Reset	Active-low synchronous (`rst_n`)
Instruction memory	Synchronous read, word-addressed, `$readmemh` initialized
Data memory	Synchronous write, combinational read, word-addressed via `addr[31:2]`
Register file	Write-through: WB write visible to ID-stage reads in same cycle
Branch resolution	EX stage — 2-cycle penalty; IF and ID stages flushed on redirect
Forwarding	EX→EX (from EX/MEM register) and MEM→EX (from MEM/WB register)
Debug port	`dbg_addr[4:0]` / `dbg_data[31:0]` on regfile for board inspection
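
The write-through register file property is what removes the need for a third forwarding path from WB. A minimal behavioural sketch, assuming a bypass from the write port to the read ports (the class and method names are hypothetical, not the RTL interface):

```python
class RegFile:
    """32x32 register file with x0 hardwired to zero and write-through reads."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, addr, we=False, waddr=0, wdata=0):
        """Combinational read; a write in flight this cycle is bypassed."""
        if addr == 0:
            return 0
        if we and waddr == addr:
            return wdata & 0xFFFFFFFF
        return self.regs[addr]

    def clock(self, we, waddr, wdata):
        """Rising clock edge: commit the write; x0 is never written."""
        if we and waddr != 0:
            self.regs[waddr] = wdata & 0xFFFFFFFF

rf = RegFile()
same_cycle = rf.read(5, we=True, waddr=5, wdata=0x1234)  # WB write seen by ID
rf.clock(True, 5, 0x1234)
print(hex(same_cycle), hex(rf.read(5)))
```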

ALU operations (4-bit control):

Code	Operation	Code	Operation
0000	ADD	0101	SRL
0001	SUB	0110	SLT
0010	AND	0111	SLTU
0011	OR	1000	SRA
0100	XOR	1001	SLL
—	—	1010	PASS_B (for LUI)
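
The control codes in the table map onto a compact reference model, which can be handy for generating expected values in a testbench. This is a sketch using the codes listed above; the function names are hypothetical, and the 5-bit shift amount follows the RV32I rule that only the low five bits of operand B are used.

```python
MASK32 = 0xFFFFFFFF

def signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def alu(ctrl, a, b):
    """Reference model of the 4-bit-controlled ALU."""
    a &= MASK32
    b &= MASK32
    sh = b & 0x1F
    ops = {
        0b0000: a + b,                        # ADD
        0b0001: a - b,                        # SUB
        0b0010: a & b,                        # AND
        0b0011: a | b,                        # OR
        0b0100: a ^ b,                        # XOR
        0b0101: a >> sh,                      # SRL (logical)
        0b0110: int(signed(a) < signed(b)),   # SLT
        0b0111: int(a < b),                   # SLTU
        0b1000: signed(a) >> sh,              # SRA (arithmetic)
        0b1001: a << sh,                      # SLL
        0b1010: b,                            # PASS_B (for LUI)
    }
    return ops[ctrl] & MASK32

print(hex(alu(0b1000, 0x80000000, 4)), hex(alu(0b0101, 0x80000000, 4)))
```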

6. Hazard Handling

Implemented in hazard_fwd_unit.v (instantiated inside ex_stage.v).

Forwarding

Forward paths resolve RAW hazards without stalling when the producing instruction has already computed its result:

`fwd_a` / `fwd_b`	Meaning
`2'b00`	No forwarding — use register file output
`2'b01`	Forward from EX/MEM register (one cycle ago)
`2'b10`	Forward from MEM/WB register (two cycles ago)

Forwarding conditions (for fwd_a — fwd_b is symmetric):

// Default: no forwarding, use the register file output
fwd_a = 2'b00;
// EX/MEM forward (higher priority: the most recent producer wins)
if (ex_mem_reg_write && ex_mem_rd != 0 && ex_mem_rd == id_ex_rs1)
    fwd_a = 2'b01;
// MEM/WB forward
else if (mem_wb_reg_write && mem_wb_rd != 0 && mem_wb_rd == id_ex_rs1)
    fwd_a = 2'b10;
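
The same priority scheme can be expressed as a Python reference model for checking forwarding decisions against a trace. This is a sketch with hypothetical function names; it mirrors the conditions above and assumes `2'b00` is the default when neither condition matches.

```python
def fwd_select(id_ex_rs, ex_mem_reg_write, ex_mem_rd, mem_wb_reg_write, mem_wb_rd):
    """Return the 2-bit forwarding select for one source register."""
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return 0b01   # newest value: EX/MEM register
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return 0b10   # older value: MEM/WB register
    return 0b00       # no hazard: register file output

def operand(sel, rf_value, ex_mem_value, mem_wb_value):
    """Forwarding mux in front of an ALU input."""
    return {0b00: rf_value, 0b01: ex_mem_value, 0b10: mem_wb_value}[sel]

# Both older instructions write x5; the EX/MEM copy must win.
sel = fwd_select(5, True, 5, True, 5)
print(sel, operand(sel, rf_value=1, ex_mem_value=2, mem_wb_value=3))
```

The `rd != 0` term keeps a write to x0 from being forwarded as if it were a real value.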

Load-Use Stall

A 1-cycle bubble is inserted when a load is immediately followed by a dependent instruction. Detected in riscv_core.v:

stall = id_ex_mem_read && (id_ex_rd != 0) &&
        (id_ex_rd == id_rs1 || id_ex_rd == id_rs2);

On stall: PC is held, IF/ID register is held, ID/EX register is flushed to NOP.
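
A behavioural model of the stall condition and its three effects, useful for predicting where bubbles appear in a trace. This is a sketch; the function and key names are hypothetical.

```python
def load_use_stall(id_ex_mem_read, id_ex_rd, id_rs1, id_rs2):
    """True when the instruction in ID needs the result of a load in EX."""
    return bool(id_ex_mem_read) and id_ex_rd != 0 and id_ex_rd in (id_rs1, id_rs2)

def stall_actions(stall):
    """Control outputs on a stall: hold PC and IF/ID, bubble ID/EX."""
    return {"pc_hold": stall, "if_id_hold": stall, "id_ex_flush": stall}

# lw x5, 0(x1) in EX; add x6, x5, x2 in ID: one bubble is needed.
stall = load_use_stall(True, 5, 5, 2)
print(stall, stall_actions(stall))
```

After the bubble, the load data sits in the MEM/WB register and reaches the dependent instruction through the MEM→EX forwarding path.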

Branch Penalty

Branches are resolved in the EX stage. On a taken branch:

pc_redirect signal asserted, pc_target driven to branch/jump target
IF/ID and ID/EX pipeline registers flushed, discarding the two instructions fetched after the branch (2-cycle penalty)
Not-taken branches incur no penalty (no branch prediction — assume not-taken)
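
The penalties in this section give a simple cycle-count estimate. The sketch below assumes an ideal pipeline that needs `depth - 1` cycles to fill, plus the 2-cycle redirect penalty and 1-cycle load-use bubble described above; the function name and the example loop are hypothetical, not measurements from the testbenches.

```python
def pipeline_cycles(n_instr, taken_redirects, load_use_stalls=0, depth=5):
    """Cycles to retire n_instr dynamic instructions."""
    fill = depth - 1
    return n_instr + fill + 2 * taken_redirects + load_use_stalls

# Hypothetical loop: 5 iterations of 3 instructions, back-branch taken 4 times.
n = 15
cycles = pipeline_cycles(n, taken_redirects=4)
print(cycles, round(cycles / n, 2))   # total cycles and effective CPI
```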

7. Module Reference

The following table describes every module used in the level2/3/4 pipeline (with notes on level3/4 additions):

Module	Instantiated as	Description
`riscv_core`	`u_core`	Pipeline top-level; wires all stages; contains stall logic
`if_stage`	`u_if`	PC register, PC+4/redirect mux, instruction memory read
`if_id_reg`	`u_if_id`	IF/ID pipeline register; flushed to NOP on redirect, held on a load-use stall
`id_stage`	`u_id`	Decoder, immediate generator, register file read
`id_ex_reg`	`u_id_ex`	ID/EX pipeline register; carries control + data; has synchronous reset signal `flush_decode`
`ex_stage`	`u_ex`	ALU + forwarding muxes + branch condition evaluation; level3+ adds jump/lui/auipc muxes
`ex_mem_reg`	`u_ex_mem`	EX/MEM pipeline register
`mem_stage`	`u_mem`	Data memory instantiation; LW/SW
`mem_wb_reg`	`u_mem_wb`	MEM/WB pipeline register
`wb_stage`	`u_wb`	Writeback mux: ALU result or load data
`hazard_fwd_unit`	`u_haz`	Forwarding logic (inside `ex_stage`); stall signal to core
`alu`	`u_alu`	32-bit combinational ALU; 4-bit control
`decoder`	`u_dec`	Combinational control signal generation from opcode/funct3/7
`imm_gen`	`u_imm`	Sign-extended immediate; level3+ adds U and J types
`regfile`	`u_rf`	32×32 register file; x0 hardwired to zero; write-through
`instr_mem`	`u_imem`	Synchronous ROM; parameterized `MEM_FILE`
`data_mem`	`u_dmem`	Synchronous RAM; word-addressed
`boolean_top`	(top)	Boolean Board wrapper: clock buffer, reset, `seg7_ctrl`, switch/LED I/O
`arty_s7_top`	(top)	Arty S7-50 wrapper (level2 only)
`seg7_ctrl`	`u_seg7`	4-digit 7-segment display driver; time-multiplexed

8. Simulation

Prerequisites

Icarus Verilog (iverilog / vvp)
(Optional) GTKWave for VCD waveform viewing

Level 2 — Quick Start

cd level2/pipeline/sim

# Compile
iverilog -o sim.out -s tb_riscv_core \
    tb_riscv_core.v \
    ../src/riscv_core.v \
    ../src/if_stage.v \
    ../src/if_id_reg.v \
    ../src/id_stage.v \
    ../src/id_ex_reg.v \
    ../src/ex_stage.v \
    ../src/ex_mem_reg.v \
    ../src/mem_stage.v \
    ../src/mem_wb_reg.v \
    ../src/wb_stage.v \
    ../src/hazard_fwd_unit.v \
    ../src/alu.v \
    ../src/decoder.v \
    ../src/imm_gen.v \
    ../src/regfile.v \
    ../src/instr_mem.v \
    ../src/data_mem.v

# Run (program.hex must be in the working directory or MEM_FILE path correct)
vvp sim.out

# View waveforms
gtkwave tb_riscv_core.vcd

Or use the provided TCL script from Vivado:

source sim/run.tcl

Level 3 / Level 4

Same flow. Level3 has two testbenches:

# Basic JAL/stack test
iverilog -o sim.out -s tb_riscv_core  tb_riscv_core.v  [sources...]
# Comprehensive all-instruction test
iverilog -o sim.out -s tb_new_program tb_new_program.v [sources...]

Assembling a New Program

The assemble.py in level3 and level4 assembles RV32I assembly source to hex:

cd level3/pipeline   # or level4/pipeline
python3 assemble.py  # reads inline assembly, writes sim/program.hex

To modify the test program, edit the assembly string inside assemble.py and re-run.

L4 Superscalar

cd l4_superscalar/sim

iverilog -o sim.out -s tb_riscv_core \
    tb_riscv_core.v \
    ../src/riscv_core.v \
    ../src/control_unit.v \
    ../src/alu.v \
    ../src/regfile.v \
    ../src/instr_mem.v \
    ../src/data_mem.v

vvp sim.out

Note: Edit instr_mem.v to fix the MEM_FILE default path (currently a Windows absolute path) or pass MEM_FILE as a parameter.

9. Synthesis and Implementation

Toolchain

Vivado 2025.2 (lin64), Build 6299465
Device: xc7s50csga324-1 (Spartan-7 50T)
Boolean Board clock: 100 MHz, pin F14, LVCMOS33
Arty S7 clock: 100 MHz, pin P14

Level 2 — Standalone Core Synthesis

# In Vivado TCL console or batch mode:
source level2/pipeline/synth.tcl   # Synthesize riscv_core only
source level2/pipeline/impl.tcl    # Implement riscv_core (no board I/O)

Reports land in level2/pipeline/impl_out/.

Level 2 — Full Board Build (Boolean Board)

source level2/pipeline/board_impl_boolean.tcl
# This script also calls program_board.tcl to flash the bitstream

Reports land in level2/pipeline/board_out_boolean/.

Level 3 / Level 4 — Board Build

source level3/pipeline/synth.tcl   # or level4/pipeline/synth.tcl

Reports land in levelN/pipeline/board_out_boolean/.

10. Hardware Deployment — Board I/O

Boolean Board (`boolean_top.v`)

All three levels (2, 3, 4) use an identical boolean_top.v wrapper.

Signal	Direction	Pins / Width	Description
`clk`	Input	F14	100 MHz system clock
`btn[0]`	Input	1 bit	Active-high reset (synchronous)
`sw[4:0]`	Input	5 bits	Register select — choose which x0–x31 register to display
`led[15:0]`	Output	16 bits	Lower 16 bits of the selected register
`seg[6:0]`	Output	7 bits	7-segment cathode signals (active-low)
`an[3:0]`	Output	4 bits	7-segment anode enables (active-low, time-multiplexed)
`dp`	Output	1 bit	Decimal point (driven low / unused)

Usage:

Program the board with the generated bitstream
Press btn[0] to reset the CPU; it will begin executing from address 0
Set sw[4:0] to the register number you want to inspect (e.g., 01010 = x10)
led[15:0] shows the lower 16 bits of that register immediately
The 4-digit hex display shows the selected register in hexadecimal, one nibble per digit; four digits cover 16 bits of the 32-bit value

Segment display encoding: seg7_ctrl cycles through digits 0–3 at a frequency derived from the 100 MHz clock (typically ~1 kHz digit refresh). Each digit displays one hex nibble (0–F).
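
The arithmetic behind a time-multiplexed display can be sketched as follows. The digit ordering (digit 0 is the least significant nibble) and the exact 1 kHz target are assumptions; `seg7_ctrl` may use a different divider.

```python
CLK_HZ = 100_000_000

def nibbles(value16):
    """Split a 16-bit value into four hex digits, digit 0 least significant."""
    return [(value16 >> (4 * i)) & 0xF for i in range(4)]

def refresh_divider(digit_hz=1000, clk_hz=CLK_HZ):
    """Clock cycles each digit stays lit, and the counter width that needs."""
    cycles = clk_hz // digit_hz
    width = (cycles - 1).bit_length()
    return cycles, width

print([f"{n:X}" for n in nibbles(0xBEEF)])
print(refresh_divider())
```

In hardware the common shortcut is to skip the exact divider and use the top two bits of a free-running counter as the digit select, which gives a power-of-two refresh rate close to the target.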

Arty S7-50 (`arty_s7_top.v`)

Level2 only. Same functional mapping as boolean_top but with Arty S7 pin assignments.

Signal	Direction	Pin	Description
`clk`	Input	P14	100 MHz system clock
`btn[0]`	Input	—	Reset
`sw[3:0]`	Input	—	Register select (lower 4 bits)
`led[3:0]`	Output	—	Lower 4 bits of selected register

11. Synthesis Results

Device Capacity Reference (xc7s50csga324-1, Spartan-7 50T)

Resource	Total Available
Slice LUTs	32,600
Slice Registers	65,200
Block RAM	75 (36Kb each)
DSP48E1	120
IOBs	210

Level 2 — Standalone `riscv_core` (no board I/O, `impl_out/`)

Note: Most logic was optimized away (constant inputs); these numbers reflect the pruned implementation, not the full core.

Resource	Used	Available	Utilization
Slice LUTs	2	32,600	< 0.01%
Slice Registers	30	65,200	0.05%
Block RAM	0	75	0%

Timing (post-route, standalone): WNS = +7.161 ns — timing met. Critical path: PC adder carry chain, 9 logic levels, 2.839 ns.

Power (post-route, standalone): Total = 0.101 W, Dynamic = 0.030 W.

Board Builds — Boolean Board (`board_out_boolean/`)

Metric	Level 2	Level 3	Level 4
Slice LUTs (total)	1,463	1,666	1,666
— LUT as Logic	885	1,088	1,088
— LUT as Dist. RAM	578	578	578
CARRY4	40	66	66
Slice Registers (FFs)	344	437	437
F7/F8 Muxes	384	385	385
Slices	465	529	529
Block RAM	0	0	0
IOBs	53 (25.2%)	53 (25.2%)	53 (25.2%)
BUFGCTRL	1	1	1

LUT utilization: L2 = 4.49%, L3/L4 = 5.11% of device.

Level2 → Level3/4 increase: +203 LUTs (+14%), +93 FFs (+27%). This overhead comes from the 5 additional control signals (jump, jalr, lui, auipc, link) propagating through the ID/EX pipeline register, the EX-stage mux tree, and the extended imm_gen.
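
The utilization percentages and deltas above follow directly from the table and the device capacity. A short check, using only numbers from this section:

```python
LUTS_AVAILABLE = 32_600
luts = {"L2": 1463, "L3/L4": 1666}
ffs = {"L2": 344, "L3/L4": 437}

def pct(part, whole):
    """Percentage of whole represented by part."""
    return 100.0 * part / whole

for level, used in luts.items():
    print(f"{level}: {pct(used, LUTS_AVAILABLE):.2f}% of device LUTs")

lut_delta = luts["L3/L4"] - luts["L2"]
ff_delta = ffs["L3/L4"] - ffs["L2"]
print(f"+{lut_delta} LUTs ({pct(lut_delta, luts['L2']):.0f}%), "
      f"+{ff_delta} FFs ({pct(ff_delta, ffs['L2']):.0f}%)")
```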

Level3 vs Level4: Identical RTL, identical synthesis results. The difference is only in the test program and testbench.

Timing — Boolean Board (post-route, 100 MHz constraint)

Level	WNS (ns)	TNS (ns)	Failing Endpoints	Status
Level 2	-2.306	-335.187	162 / 6,592	NOT MET
Level 3	-3.407	-743.564	256 / 6,802	NOT MET
Level 4	-3.407	-743.564	256 / 6,802	NOT MET

All hold-time constraints are met (WHS > 0 in all cases).

Level 2 critical path: id_ex.out_rs2[0] → forwarding unit → ALU (SRL carry chain, bits [4–8]) → zero flag → pc_redirect flop reset. 12 logic levels, 11.811 ns data path delay, 9.805 ns routing. Slack = -2.306 ns → minimum period 12.306 ns, estimated Fmax ≈ 81.3 MHz.

Level 3/4 critical path: ex_mem.out_rd[2] → forwarding unit (fwd_a) → ALU input mux → ALU bit [26] carry chain → zero flag → pc_redirect → id_ex pipeline register reset. 14 logic levels, 12.901 ns data path delay, 10.709 ns routing. Slack = -3.407 ns → minimum period 13.407 ns, estimated Fmax ≈ 74.6 MHz.

Why timing fails: The critical path passes through: (1) the MEM/WB or EX/MEM forwarding comparators, (2) ALU operand selection muxes, (3) the 32-bit ALU carry chain (especially for shift/compare operations that the synthesizer cannot map to CARRY4 efficiently), (4) the zero/branch-taken combinational logic, and (5) the conditional flush/redirect path to the pipeline register reset pin — all in a single cycle. The routing component (83% of path delay) indicates the design is spread across the fabric, which inflates net delays significantly. Reducing the clock to ~70 MHz or retiming the branch-taken → flush path would resolve violations.
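
The achievable clock can be estimated from the reported slack with the usual relation Fmax = 1 / (T − WNS), where T is the constrained period. The sketch below applies it to the figures in the timing table; the function name is hypothetical, and the result is an estimate because a re-run at a different constraint may place and route differently.

```python
def fmax_mhz(period_ns, wns_ns):
    """Estimated Fmax: the shortest period that would just meet timing."""
    return 1000.0 / (period_ns - wns_ns)

for name, wns in [("Level 2", -2.306), ("Level 3/4", -3.407)]:
    print(f"{name}: {fmax_mhz(10.0, wns):.1f} MHz")

# Share of the critical path spent in routing (routing / data path delay).
print(f"L2 routing share:   {9.805 / 11.811:.0%}")
print(f"L3/4 routing share: {10.709 / 12.901:.0%}")
```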

Power — Boolean Board (post-route, 100 MHz, typical process)

Metric	Level 2	Level 3	Level 4
Total On-Chip Power (W)	0.078	0.132	0.132
Dynamic Power (W)	0.006	0.061	0.061
Device Static (W)	0.072	0.072	0.072
Junction Temperature (C)	25.4	25.7	25.7
Confidence Level	Low	Low	Low

The low confidence level is expected (no simulation activity file provided). The dominant dynamic power consumer in level3/4 is I/O (0.041 W at Vcco33), followed by signals (0.010 W) and slice logic (0.005 W).

12. Eval2 Architecture Comparison

Historical designs implemented for direct architectural comparison. All run at 100 MHz reference; device: Artix-7 xc7a35tcpg236-1; 12-instruction ALU-only test program.

Design	Stages	CPI (eff.)	IPC (eff.)	Throughput @ 100 MHz	Fmax (est.)
Single-cycle	1	1.00	1.00	100 MIPS	~110–140 MHz
Multicycle FSM	4 (FSM)	4.00	0.25	25 MIPS	~150 MHz
4-stage pipeline	4 (IF/ID/EX/WB)	1.25	0.80	80 MIPS	~133 MHz
2-wide superscalar	4 (IF/ID/EX/WB)	0.75	1.33	133 MIPS	—
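
The IPC and throughput columns are derived from the CPI column: IPC = 1 / CPI and throughput = f × IPC. A short check using the CPI values from the table (the dictionary and function names are hypothetical):

```python
CPI = {
    "Single-cycle": 1.00,
    "Multicycle FSM": 4.00,
    "4-stage pipeline": 1.25,
    "2-wide superscalar": 0.75,
}

def ipc(cpi):
    """Instructions per cycle."""
    return 1.0 / cpi

def mips(cpi, f_mhz=100.0):
    """Throughput in millions of instructions per second at f_mhz."""
    return f_mhz / cpi

for name, cpi in CPI.items():
    print(f"{name:20s} IPC={ipc(cpi):.2f}  {mips(cpi):.0f} MIPS")
```

For the 12-instruction test program these CPI values correspond to 15 cycles on the 4-stage pipeline and 9 cycles on the superscalar, which is consistent with three fill cycles plus 12 single issues or 6 dual issues.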

Eval2 Resource Utilization (Artix-7 xc7a35tcpg236-1):

Design	LUTs	FFs	BRAM
Single-cycle	487	32	0
Multicycle FSM	517	200	0
4-stage pipeline	570	216	1

Full details: see eval2_report.txt and comparison.txt.

13. Known Issues

Issue	Affected Level	Description
Timing not met at 100 MHz	L2, L3, L4 board builds	Critical path through forwarding + ALU + branch-taken → pipeline flush. Estimated Fmax from post-route slack: ~81 MHz (L2), ~75 MHz (L3/L4). No functional impact in simulation. For deployment, reduce the clock to 50–70 MHz via a clock divider or MMCM.
Hardcoded Windows path in `instr_mem.v`	`l4_superscalar/src/instr_mem.v`	The default `MEM_FILE` parameter uses an absolute Windows path. Must be overridden at instantiation on Linux/Mac.
No board support for superscalar	`l4_superscalar/`	No `boolean_top.v`, no XDC, no synthesis TCL. Simulation only.
Branch prediction: assume not-taken	All pipeline levels	All branches incur a 2-cycle penalty when taken. No dynamic predictor.
Word-only memory in L2–L4	L2, L3, L4	Data memory is word-addressed; no byte or halfword access (LB/LH/SB/SH not supported). Only `l4_superscalar` implements sub-word memory access.
