|
1 | | -# chisel-empty |
| 1 | +# Timy GPU |
| 2 | +The goal of the Timy GPU project is to develop a miniature GPU capable of parallel processing, specifically one able to execute programs similar to "compute shaders". The project will be written in the Chisel HDL, with the goal of testing the design on a physical FPGA. |
| 3 | +## Design |
| 4 | +The GPU will consist of memory, a single core with many threads, and logic to load programs into memory and dispatch them for execution. |
| 5 | +### Memory |
| 6 | +A single memory block will be shared between cores. A memory controller will manage simultaneous read and write requests from multiple cores. Memory will have 16 bits of addressable space and will hold both program data and application data. There will also be a stack with a size of 16 bits. |
| 7 | +### GPU State |
| 8 | +The GPU is always in one of three states: idle, program load, or execute. During the program load state, the GPU reads data and an address from external wires and loads the data into memory at that address. |
| 9 | +```cs |
| 10 | +in byte state // idle | program load | execute |
2 | 11 |
|
3 | | -An almost empty chisel project (and adder) as a starting point for hardware design. |
| 12 | +out bool writeReady |
| 13 | +in byte writeData |
| 14 | +in byte writeAddress |
| 15 | +in bool write |
4 | 16 |
|
5 | | -To generate Verilog code for the adder execute: |
6 | | -```bash |
7 | | -make |
8 | | -``` |
9 | | - |
10 | | -Run the tests with: |
11 | | -```bash |
12 | | -make test |
13 | | -``` |
| 17 | +out bool readReady |
| 18 | +out byte readData |
| 19 | +in byte readAddress |
| 20 | +in bool read |
14 | 21 |
|
15 | | -Cleanup the repository with: |
16 | | -```bash |
17 | | -make clean |
| 22 | +in byte startPointer |
18 | 23 | ``` |
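
As a rough software sketch of the program load behaviour (plain Scala, not Chisel), the write port can be modelled as below. The names `GpuStateSketch`, `Model`, and `write` are hypothetical, the memory is assumed to be 2^16 byte-wide cells, and ignoring writes outside the program load state is an assumption rather than something the design above specifies.

```scala
// Software model of the GPU state and write port. Illustrative only.
object GpuStateSketch {
  sealed trait State
  case object Idle extends State
  case object ProgramLoad extends State
  case object Execute extends State

  final class Model {
    // Assumption: 2^16 byte-wide cells, matching the 16-bit address space.
    val memory: Array[Int] = new Array[Int](1 << 16)

    // Mirrors the write port. Assumption: a write only lands while the GPU
    // is in the program load state. Returns true when the write was accepted.
    def write(state: State, address: Int, data: Int): Boolean = state match {
      case ProgramLoad =>
        memory(address & 0xFFFF) = data & 0xFF
        true
      case _ => false
    }
  }
}
```

For example, `new GpuStateSketch.Model().write(GpuStateSketch.ProgramLoad, 3, 0x2A)` stores the byte, while the same call with `GpuStateSketch.Idle` leaves memory untouched.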
| 24 | +### Core |
| 25 | +Theoretically, the GPU could be expanded to execute multiple programs at once by adding more cores; however, for simplicity's sake I'm aiming to have only a single core for the moment. Once a single core is working, I don't imagine supporting more cores will be very difficult. A core will consist of memory access logic, a dispatcher / synchronizer, and a number of threads. The dispatcher / synchronizer loads the program from memory and wires it to the threads for execution. (This is already somewhat working as of 10-16.) |
| 26 | +### Thread |
| 27 | +A single thread contains a program counter, ALU, LSU, and registers. Threads take in loaded operations from their parent core and execute them. Once more development is done on the registers, I'll have a clearer idea of how many / which registers a thread needs, but I'm imagining initially we'll have a few 16-bit registers: |
| 28 | +1. stack register |
| 29 | +2. a, b, and c registers |
| 30 | +### Instruction Set |
| 31 | +Each instruction is 24 bits wide: 8 bits for the opcode byte and 16 bits for an immediate. The first 5 bits of the opcode byte specify the instruction. The next 3 specify the target or source / destination registers, if the instruction uses them. |
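
A minimal software sketch of this layout (plain Scala, not Chisel) might look like the following. The bit placement is an assumption: the 5-bit operation in bits 23 to 19, the 3-bit register field in bits 18 to 16, and the immediate in bits 15 to 0. How the 3 bits split between src and dst is not specified here, so the sketch treats them as one opaque field. `InstructionSketch`, `Decoded`, `encode`, and `decode` are hypothetical names.

```scala
// Software sketch of the 24-bit instruction layout. Illustrative only.
// Assumed bit placement (not confirmed by the design):
//   [23:19] 5-bit operation, [18:16] 3-bit register field, [15:0] immediate.
object InstructionSketch {
  // A few operation codes copied from the instruction list below.
  val MoveImm: Int = Integer.parseInt("00001", 2)
  val Add: Int     = Integer.parseInt("00100", 2)
  val Term: Int    = Integer.parseInt("10101", 2)

  final case class Decoded(op: Int, regField: Int, imm: Int)

  // Pack the three fields into the low 24 bits of an Int.
  def encode(op: Int, regField: Int, imm: Int): Int = {
    require(op >= 0 && op < 32, "operation must fit in 5 bits")
    require(regField >= 0 && regField < 8, "register field must fit in 3 bits")
    require(imm >= 0 && imm <= 0xFFFF, "immediate must fit in 16 bits")
    (op << 19) | (regField << 16) | imm
  }

  // Split a 24-bit word back into its three fields.
  def decode(word: Int): Decoded =
    Decoded((word >>> 19) & 0x1F, (word >>> 16) & 0x7, word & 0xFFFF)
}
```

Under these assumptions, `InstructionSketch.encode(InstructionSketch.MoveImm, 2, 0x1234)` packs a move-immediate with register field 2, and `decode` recovers the same three fields.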
| 32 | +- move |
| 33 | + `00001` + target --> moves immediate into register |
| 34 | + `00010` + src/dst --> moves value in register to register |
| 35 | +- load |
| 36 | + `00011` + src/dst --> takes address from register and loads memory into other register |
| 37 | +- add |
| 38 | + `00100` + src/dst --> adds value in src to dst and stores in dst |
| 39 | +- mul |
| 40 | + `00101` + src/dst --> multiplies value in src by dst and stores in dst |
| 41 | +- cmp |
| 42 | + `00110` + src/dst --> compares value in src to dst and stores result in nzp flag of alu |
| 43 | +- jmp |
| 44 | + `00111` + target --> jumps program pointer to value specified in register |
| 45 | + `01000` + target --> jumps program pointer to value specified in register if negative flag is set |
| 46 | + `01001` + target --> jumps program pointer to value specified in register if positive flag is set |
| 47 | + `01010` + target --> jumps program pointer to value specified in register if zero flag is set |
| 48 | + `01011` + target --> jumps program pointer to value specified in register if not zero flag is set |
| 49 | +- or |
| 50 | + `01100` + src/dst --> does bitwise or of src and dst registers and stores result in dst |
| 51 | +- and |
| 52 | + `01101` + src/dst --> does bitwise and of src and dst registers and stores result in dst |
| 53 | +- xor |
| 54 | + `01110` + src/dst --> does bitwise xor of src and dst registers and stores result in dst |
| 55 | +- not |
| 56 | + `01111` + src/dst --> does bitwise not of src register and stores result in dst |
| 57 | +- shift R |
| 58 | + `10000` + target --> shifts bits to the right and pads 0s at beginning of target register |
| 59 | +- shift L |
| 60 | + `10001` + target --> shifts bits to the left and pads 0s at end of target register |
| 61 | +- push |
| 62 | + `10010` + target --> pushes value in target register to stack |
| 63 | +- pop |
| 64 | + `10011` + target --> pops value on top of stack into register |
| 65 | +- sync (experimental?) |
| 66 | + `10100` + target --> tells the core dispatcher to not dispatch any threads until all threads have reached the program pointer specified in the specified register |
| 67 | +- term |
| 68 | + `10101` --> signals that the thread has finished execution |
| 69 | +- store |
| 70 | + `10110` + src/dst --> takes address from src register and stores the value in dst register into memory |
| 71 | + |
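
To make the `cmp` and `jmp` entries above concrete, here is a small software sketch (plain Scala, not Chisel) of how the nzp flags could drive the program pointer. Several things are assumptions: `cmp` is taken to compare src against dst as signed 16-bit values, the program pointer is assumed to advance by one per instruction, and the names `FlagSketch`, `Nzp`, `cmp`, and `nextPc` are hypothetical.

```scala
// Software sketch of nzp flags and conditional jumps. Illustrative only.
object FlagSketch {
  // Exactly one of the three flags is set after a compare.
  final case class Nzp(negative: Boolean, zero: Boolean, positive: Boolean)

  // Assumption: flags describe src relative to dst, as signed 16-bit values.
  def cmp(src: Int, dst: Int): Nzp = {
    val s = src.toShort.toInt
    val d = dst.toShort.toInt
    Nzp(negative = s < d, zero = s == d, positive = s > d)
  }

  // Returns the next program pointer for the five jump operation codes.
  // Assumption: a non-taken jump simply moves to the next instruction (pc + 1).
  def nextPc(op: Int, pc: Int, target: Int, flags: Nzp): Int = {
    val taken = op match {
      case 0x07 => true            // 00111: unconditional
      case 0x08 => flags.negative  // 01000: jump if negative
      case 0x09 => flags.positive  // 01001: jump if positive
      case 0x0A => flags.zero      // 01010: jump if zero
      case 0x0B => !flags.zero     // 01011: jump if not zero
      case _    => false           // not a jump
    }
    if (taken) target & 0xFFFF else (pc + 1) & 0xFFFF
  }
}
```

With equal operands, `cmp` sets only the zero flag, so a `01010` jump is taken and a `01000` jump falls through to the next instruction.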
| 72 | +We may need more instructions, but I can't think of any others that we need right now. |
| 73 | +# How To Run |
| 74 | +`sbt run test` |