Replies: 5 comments 3 replies
I kinda agree. The caches are not useful for the internal IMEM/DMEM, i.e. when those are enabled. Once they are disabled, the caches make sense though. In my SMP experiment I actively use such an arrangement: multiple harts, each with its own I$ and D$, all accessing a single shared SDRAM through the external memory interface, with the internal IMEM and DMEM disabled. My two cents: why not do both? Keep the existing caches as a sort of level-1 cache (maybe rename them?), but also introduce caches at the places you mentioned as a sort of level-2 cache. We could also choose to implement write-back instead of write-through. This could fix the latency issue with internal IMEM/DMEM?
I have been working on a more generic cache module that (hopefully) provides better efficiency by implementing "write-back" and "write-allocate" strategies instead of the current d-cache's "write-through" strategy. Right now, the cache is still direct-mapped. A 2-way set-associative cache might be much better, especially when using external memory for both data and instructions. I tried to address this by implementing a "virtual splitting" of the new cache (VSPLIT). With this feature enabled, the lower half of the cache blocks is reserved for data only and the upper half for instructions only. For some workloads (like code that uses the …). The cache is still at an early stage and needs some more optimization, but synthesis and simulation already look fine: https://github.com/stnolting/neorv32/blob/generic_cache/rtl/core/neorv32_cache.vhd I made some simple benchmarks / tests:
Interestingly, the setup with the "external cache" (having the same size as the old I-cache + D-cache combined) is about 10% faster (for a very synthetic workload). It seems like the bus congestion caused by the CPU's instruction and data interfaces is not a big performance issue. Anyway, more tests are required. But I wonder if we should simply remove the CPU caches and just stick with the external bus cache? 🤔 Btw, the new cache module is truly generic, so it could also be used for something like a multi-core setup ... just saying 😉
Sorry to necropost, but related to this: I have a closed-source fork of your design tailored to Avalon/Qsys. It supports a compile-time-configured SPI boot copier from flash to RAM at address 0, or XIP from DSPI or QSPI flash. The Qsys hw.tcl file detects the target device and looks up the maximum size of the FPGA hardware image, then sets a generic on the CPU to the next 64kB block boundary beyond that image. My build tools then extract this generic from the .sopcinfo to write the linker script and supply it as a #define for user software.

To implement XIP, I basically made a new instruction and data cache connected to the CPU fetch unit and LSU, respectively. In XIP mode, when OCD is off, the CPU connects directly to the XIP icache and no other slaves are available. If and only if OCD is enabled are IO bus switches synthesised to allow the CPU access to either the XIP icache or OCD, so with no OCD you get minimum latency between XIP and CPU at the cost of no on-chip executable code. The XIP dcache attaches to the IO bus switch (my internal switch only has 4 entries because I ripped out every peripheral except the boot copier/XIP, CFS, CLINT and OCD; everything else is Avalon/Qsys). I allocated 16MB of the IO space to the XIP.

The icache and dcache send requests to either a DSPI or QSPI master which reads entire cache lines from flash (so the cache line length of both caches must be the same). The dcache is of course read-only. For optional write access to the SPI flash, I offer single-byte writes and control of the CS line from a separate IO register, though I've not really used that yet.
The CPU LSU can access other devices, but when OCD is disabled the fetch unit connects directly to XIP. Of course this restricts you to only running code from flash, but I dread how complex the linker-script generation would get with options for random Avalon slaves attached. In XIP mode the LSU connects first to an Avalon data master which looks for "local access" to addresses at 0xF0000000 and above and routes those to the IO bridge. I removed support for the atomic operations because, while Avalon could support them, I don't think they are worth implementing for my use case. The DSPI and QSPI masters weren't particularly hard to do; they only send one command to flash (0xBB or 0xEB respectively), then read back an entire cache line.
An illustration: Boot RAM is a 1kB 32-bit RAM. Its content is a C boot-copier program that is built after the main executable is finished. The size of the main executable is supplied to the boot copier during compilation, as well as the location where the build scripts will place the user executable in flash memory (which in turn is derived from the FPGA part number), as above. The boot copier contains its own simple 32-bit-wide SPI register which can be read and written as part of the bootloader memory map. Build tooling auto-injects the fresh bootloader into the .sof and builds .pof and .jic files for the flash image.

XIP mode, OCD = false: the fetch unit connects directly to the SPI cache, which can only fetch from SPI. The data master is not cached; local accesses to 0xF0000000 and above are redirected to the IO switch.

XIP mode, OCD = true: the fetch unit connects to an IO switch with only two memory regions enabled: SPI in 0xF0000000 to 0xF3FFFFFF and OCD from 0xFC000000 to 0xFFFFFFFF. The data master routes as before.

The XIP caches are at minimum a 1kB ring-buffer prefetch holding two blocks, or a direct-mapped cache derived from neorv32_cache but with a reduced tag size due to the 24-bit address range supported by the DSPI/QSPI interface.
In general, a cache can improve overall performance by hiding the latency of slow memories, exploiting temporal and spatial data locality. Right now we have two optional caches right next to the CPU: a read-only cache for instructions and a read/write cache for data.
The default processor-internal / FPGA-internal memories have an access latency of 1 cycle, so the CPU can access them as fast as possible. If the caches are enabled, the overall performance drops significantly, as these maximum-speed memories are cached, which requires additional time. Even worse, the data cache uses a write-through architecture, so any store operation bypasses the cache with an extra delay of one cycle.
A setup that only uses internal IMEM/DMEM in combination with the caches enabled has about 40% less CoreMark performance than a system without caches.
Right now there are just two modules that have more than 1 cycle of access latency: the execute-in-place module (XIP) and the external memory interface (WISHBONE). So accesses to these modules really can benefit from having the caches implemented.
Now the question: do we really need the caches? And do we need them where they are right now?
I've been thinking about relocating the caches. The read-only i-cache could be moved inside the XIP module while the read/write d-cache could be moved to the external bus interface. Thus, we would only cache accesses that for sure have a very large latency compared to the other internal modules.
Of course the caches are quite handy where they are right now. They reduce bus traffic, as the CPU's instruction and data ports have to share a single processor-wide bus. But actually, this is not a big deal, as the CPU is equipped with an instruction prefetch buffer that speculatively fetches instructions whenever the CPU does not perform a load/store operation. So I think it won't hurt (too much) not having the caches right in front of the CPU.
Anyway, I'm curious what you think. 😉