docs: Add minimalistic docs

TheMightyDuckOfDoom · TheMightyDuckOfDoom · commit 3fa4b15cb25a · 2025-12-16T21:00:12.000+01:00
diff --git a/README.md b/README.md
@@ -27,7 +27,21 @@
 The Architecture of BGPU is most similar to NVIDIA GPUs starting from the Fermi-Microarchitecture.
 We implement a form of Independent-Thread-Scheduling (ITS) similiar to the NVIDIA Volta Architecture.
 
-TODO: As with all projects, documentation is still a work-in-progress...
+For an overview of the design, please have a look at [`BGPU Architecture`](docs/arch.md).
+
+## Quickstart
+
+To run simple tests at different levels of the hierarchy, use the following targets:
+```bash
+make tb_compute_unit
+make tb_compute_cluster
+make tb_bgpu_soc
+```
+
+To see what each target executes, have a look at:
+- [`test/tb_compute_unit.sv`](test/tb_compute_unit.sv)
+- [`test/tb_compute_cluster.sv`](test/tb_compute_cluster.sv)
+- [`test/tb_bgpu_soc.sv`](test/tb_bgpu_soc.sv)
 
 ## Helpfull References
 
diff --git a/docs/arch.md b/docs/arch.md
@@ -0,0 +1,45 @@
+# BGPU: A Bad GPU
+
+## BGPU SoC
+
+This is the topmost module in the BGPU architecture.
+It currently contains the following:
+- Control Domain: Allows JTAG access and the launching of kernels
+- One or more Compute Clusters
+- Various AXI-Interconnects to connect the Control Domain and Compute Clusters to a Memory Controller
+
+## Control Domain
+
+The Control Domain allows us to control the Compute Clusters.
+
+It contains the following modules:
+- RISC-V JTAG Debug Module
+- RISC-V RV32I Processor
+- Thread Engine
+- AXI and OBI Interconnects
+
+## Compute Clusters
+
+A Compute Cluster is composed of one or more Compute Units.
+
+## Compute Unit
+
+The Compute Unit is the heart of the BGPU.
+This is the place where we actually do useful computations (hopefully).
+
+The following diagram gives an overview of the Compute Unit:
+
+<img src="fig/compute_unit.drawio.svg">
+
+An instruction flows through these stages, ending up in one of the four execution units (Branch, Integer, Floating Point, or Load Store):
+- Fetcher: Selects the PC of a Warp that should fetch new instructions
+- Instruction Cache: Retrieves one or more (if FetchWidth > 1) instructions at the PC
+- Decoder: Decodes the instructions and tells the Fetcher where the next PC of the Warp will be
+- Multi Warp Dispatcher: Keeps instructions in a Wait Buffer until they are allowed to be executed. Dispatches one or more (if DispatchWidth > 1) instructions to collect their operands
+- Register Operand Collector Stage: Reads the operands of the instructions
+- Execution Unit Demultiplexer: Sends the Instructions to their respective Execution Unit
+- Branch Unit: Calculates the PC for Conditional Branches
+- Integer Unit: Performs integer operations and housekeeping operations (index within threadblock, get parameter address, ...)
+- Floating Point Unit: Performs Floating Point operations
+- Load Store Unit: Performs loads from and stores to Memory
+- Result Collector: Arbitrates between Execution Unit Results and sends them to the Register File
diff --git a/docs/fig/compute_unit.drawio.svg b/docs/fig/compute_unit.drawio.svg