-
Notifications
You must be signed in to change notification settings - Fork 3
Cache Prefetching
Hüseyin Tuğrul BÜYÜKIŞIK edited this page Mar 16, 2021
·
14 revisions
For a single-thread application, prefetching helps overlapping I/O of next page to be accessed. Calling prefetch method only for one element is enough to fully load the page that holds the element at index. Just like get/set/bulk operations, prefetch method is thread-safe too. Any number of prefetch calls can be made as long as number of cache lines (active pages) is enough to serve all asnchronous page loads. Example usage:
VirtualMultiArray<Object> arr(...,pageSize,...);
for(size_t i=0;i<testSize;i++)
{
Object o = arr[i];
// when first element of a page is reached, asynchronously load next page into cache
if((i%pageSize)==0)
arr.prefetch(i+pageSize);
}
Prefetching efficiency depends on:
- latency from pcie for loading a page (+storing if previous accesses have edited elements)
- time required for computing all elements until next page
- if prefetch operation is complete before the processing loop arrives at that page
-
- then: loop continues iterations without pcie latency
-
- else: prefetching adds an extra latency of a mutex lock and LRU iteration
Benchmark timings from an arbitrary no-compute example:
simple write: 589735907 nanoseconds (bandwidth = 671.49 MB/s) (throughput = 393.16 nanoseconds per iteration)
simple write: 759817996 nanoseconds (bandwidth = 521.18 MB/s) (throughput = 506.55 nanoseconds per iteration)
simple write: 762645884 nanoseconds (bandwidth = 519.24 MB/s) (throughput = 508.43 nanoseconds per iteration)
simple write: 761231767 nanoseconds (bandwidth = 520.21 MB/s) (throughput = 507.49 nanoseconds per iteration)
simple write: 760679226 nanoseconds (bandwidth = 520.59 MB/s) (throughput = 507.12 nanoseconds per iteration)
prefetch write: 479190183 nanoseconds (bandwidth = 826.39 MB/s) (throughput = 319.46 nanoseconds per iteration)
prefetch write: 477348518 nanoseconds (bandwidth = 829.58 MB/s) (throughput = 318.23 nanoseconds per iteration)
prefetch write: 476916204 nanoseconds (bandwidth = 830.33 MB/s) (throughput = 317.94 nanoseconds per iteration)
prefetch write: 477104499 nanoseconds (bandwidth = 830.01 MB/s) (throughput = 318.07 nanoseconds per iteration)
prefetch write: 476079865 nanoseconds (bandwidth = 831.79 MB/s) (throughput = 317.39 nanoseconds per iteration)
simple read: 450068799 nanoseconds (bandwidth = 879.87 MB/s) (throughput = 300.05 nanoseconds per iteration)
simple read: 360594634 nanoseconds (bandwidth = 1098.19 MB/s) (throughput = 240.40 nanoseconds per iteration)
simple read: 360009298 nanoseconds (bandwidth = 1099.97 MB/s) (throughput = 240.01 nanoseconds per iteration)
simple read: 360932279 nanoseconds (bandwidth = 1097.16 MB/s) (throughput = 240.62 nanoseconds per iteration)
simple read: 360074952 nanoseconds (bandwidth = 1099.77 MB/s) (throughput = 240.05 nanoseconds per iteration)
prefetch read: 230803171 nanoseconds (bandwidth = 1715.75 MB/s) (throughput = 153.87 nanoseconds per iteration)
prefetch read: 229898585 nanoseconds (bandwidth = 1722.50 MB/s) (throughput = 153.27 nanoseconds per iteration)
prefetch read: 231080183 nanoseconds (bandwidth = 1713.69 MB/s) (throughput = 154.05 nanoseconds per iteration)
prefetch read: 229643488 nanoseconds (bandwidth = 1724.41 MB/s) (throughput = 153.10 nanoseconds per iteration)
prefetch read: 228486597 nanoseconds (bandwidth = 1733.14 MB/s) (throughput = 152.32 nanoseconds per iteration)
Source code:
#include "GraphicsCardSupplyDepot.h"
#include "VirtualMultiArray.h"
#include "CpuBenchmarker.h"
#include <random>
class Object
{
public:
Object(){id=-1;}
Object(size_t i){id=i; data[id%(256)]='A';}
char data[256];
size_t id;
};
int main()
{
GraphicsCardSupplyDepot depot;
const size_t n = 15000000;
const size_t pageSize=10000;
const int maxActivePagesPerGpu = 5;
VirtualMultiArray<Object> arr(n,depot.requestGpus(),pageSize,maxActivePagesPerGpu);
const int testSize = n/10;
for(int j=0;j<5;j++)
{
CpuBenchmarker bench(testSize*sizeof(Object),"simple write",testSize);
std::random_device rd;
std::mt19937 rng(rd());
std::uniform_real_distribution<float> rnd(0,testSize);
for(size_t i=0;i<testSize;i++)
{
arr[i]=Object(rnd(rng));
}
}
for(int j=0;j<5;j++)
{
CpuBenchmarker bench(testSize*sizeof(Object),"prefetch write",testSize);
std::random_device rd;
std::mt19937 rng(rd());
std::uniform_real_distribution<float> rnd(0,testSize);
for(size_t i=0;i<testSize;i++)
{
arr[i]=Object(rnd(rng));
if((i%pageSize)==0)
arr.prefetch(i+pageSize);
}
}
for(int j=0;j<5;j++)
{
CpuBenchmarker bench(testSize*sizeof(Object),"simple read",testSize);
for(size_t i=0;i<testSize;i++)
{
Object o = arr[i];
}
}
for(int j=0;j<5;j++)
{
CpuBenchmarker bench(testSize*sizeof(Object),"prefetch read",testSize);
for(size_t i=0;i<testSize;i++)
{
Object o = arr[i];
if((i%pageSize)==0)
arr.prefetch(i+pageSize);
}
}
std::cout<<arr.get(0).data[0]<<std::endl;
return 0;
}