Cache Prefetching

For a single-thread application, prefetching helps overlapping I/O of next page to be accessed. Calling prefetch method only for one element is enough to fully load the page that holds the element at index. Just like get/set/bulk operations, prefetch method is thread-safe too. Any number of prefetch calls can be made as long as number of cache lines (active pages) is enough to serve all asnchronous page loads. Example usage:

VirtualMultiArray<Object> arr(...,pageSize,...);
for(size_t i=0;i<testSize;i++)
{
    Object o = arr[i];

    // when first element of a page is reached, asynchronously load next page into cache
    if((i%pageSize)==0)
        arr.prefetch(i+pageSize);
}

Prefetching efficiency depends on:

latency from pcie for loading a page (+storing if previous accesses have edited elements)
time required for computing all elements until next page
if prefetch operation is complete before the processing loop arrives at that page
- then: loop continues iterations without pcie latency
- else: prefetching adds an extra latency of a mutex lock and LRU iteration

Benchmark timings from an arbitrary no-compute example:

simple write: 589735907 nanoseconds     (bandwidth = 671.49 MB/s)      (throughput = 393.16 nanoseconds per iteration) 
simple write: 759817996 nanoseconds     (bandwidth = 521.18 MB/s)      (throughput = 506.55 nanoseconds per iteration) 
simple write: 762645884 nanoseconds     (bandwidth = 519.24 MB/s)      (throughput = 508.43 nanoseconds per iteration) 
simple write: 761231767 nanoseconds     (bandwidth = 520.21 MB/s)      (throughput = 507.49 nanoseconds per iteration) 
simple write: 760679226 nanoseconds     (bandwidth = 520.59 MB/s)      (throughput = 507.12 nanoseconds per iteration) 
prefetch write: 479190183 nanoseconds     (bandwidth = 826.39 MB/s)      (throughput = 319.46 nanoseconds per iteration) 
prefetch write: 477348518 nanoseconds     (bandwidth = 829.58 MB/s)      (throughput = 318.23 nanoseconds per iteration) 
prefetch write: 476916204 nanoseconds     (bandwidth = 830.33 MB/s)      (throughput = 317.94 nanoseconds per iteration) 
prefetch write: 477104499 nanoseconds     (bandwidth = 830.01 MB/s)      (throughput = 318.07 nanoseconds per iteration) 
prefetch write: 476079865 nanoseconds     (bandwidth = 831.79 MB/s)      (throughput = 317.39 nanoseconds per iteration) 
simple read: 450068799 nanoseconds     (bandwidth = 879.87 MB/s)      (throughput = 300.05 nanoseconds per iteration) 
simple read: 360594634 nanoseconds     (bandwidth = 1098.19 MB/s)      (throughput = 240.40 nanoseconds per iteration) 
simple read: 360009298 nanoseconds     (bandwidth = 1099.97 MB/s)      (throughput = 240.01 nanoseconds per iteration) 
simple read: 360932279 nanoseconds     (bandwidth = 1097.16 MB/s)      (throughput = 240.62 nanoseconds per iteration) 
simple read: 360074952 nanoseconds     (bandwidth = 1099.77 MB/s)      (throughput = 240.05 nanoseconds per iteration) 
prefetch read: 230803171 nanoseconds     (bandwidth = 1715.75 MB/s)      (throughput = 153.87 nanoseconds per iteration) 
prefetch read: 229898585 nanoseconds     (bandwidth = 1722.50 MB/s)      (throughput = 153.27 nanoseconds per iteration) 
prefetch read: 231080183 nanoseconds     (bandwidth = 1713.69 MB/s)      (throughput = 154.05 nanoseconds per iteration) 
prefetch read: 229643488 nanoseconds     (bandwidth = 1724.41 MB/s)      (throughput = 153.10 nanoseconds per iteration) 
prefetch read: 228486597 nanoseconds     (bandwidth = 1733.14 MB/s)      (throughput = 152.32 nanoseconds per iteration)

Source code:

#include "GraphicsCardSupplyDepot.h"
#include "VirtualMultiArray.h"
#include "CpuBenchmarker.h"
#include <random>

class Object
{
public:
	Object(){id=-1;}
	Object(size_t i){id=i; data[id%(256)]='A';}
	char data[256];
	size_t id;
};
int main()
{
	GraphicsCardSupplyDepot depot;
	const size_t n = 15000000;
	const size_t pageSize=10000;
	const int maxActivePagesPerGpu = 5;

	VirtualMultiArray<Object> arr(n,depot.requestGpus(),pageSize,maxActivePagesPerGpu);
	const int testSize = n/10;
	for(int j=0;j<5;j++)
	{
		CpuBenchmarker bench(testSize*sizeof(Object),"simple write",testSize);
		std::random_device rd;
		std::mt19937 rng(rd());
		std::uniform_real_distribution<float> rnd(0,testSize);

		for(size_t i=0;i<testSize;i++)
		{
			arr[i]=Object(rnd(rng));
		}
	}

	for(int j=0;j<5;j++)
	{
		CpuBenchmarker bench(testSize*sizeof(Object),"prefetch write",testSize);
		std::random_device rd;
		std::mt19937 rng(rd());
		std::uniform_real_distribution<float> rnd(0,testSize);
		for(size_t i=0;i<testSize;i++)
		{
			arr[i]=Object(rnd(rng));
			if((i%pageSize)==0)
				arr.prefetch(i+pageSize);

		}
	}

	for(int j=0;j<5;j++)
	{
		CpuBenchmarker bench(testSize*sizeof(Object),"simple read",testSize);
		for(size_t i=0;i<testSize;i++)
		{
			Object o = arr[i];
		}
	}

	for(int j=0;j<5;j++)
	{
		CpuBenchmarker bench(testSize*sizeof(Object),"prefetch read",testSize);
		for(size_t i=0;i<testSize;i++)
		{
			Object o = arr[i];
			if((i%pageSize)==0)
				arr.prefetch(i+pageSize);

		}
	}
	std::cout<<arr.get(0).data[0]<<std::endl;
	return 0;
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Cache Prefetching

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally