Skip to content

Cache Prefetching

Hüseyin Tuğrul BÜYÜKIŞIK edited this page Mar 16, 2021 · 14 revisions

For a single-thread application, prefetching helps overlapping I/O of next page to be accessed. Calling prefetch method only for one element is enough to fully load the page that holds the element at index. Just like get/set/bulk operations, prefetch method is thread-safe too. Any number of prefetch calls can be made as long as number of cache lines (active pages) is enough to serve all asnchronous page loads. Example usage:

VirtualMultiArray<Object> arr(...,pageSize,...);
for(size_t i=0;i<testSize;i++)
{
    Object o = arr[i];

    // when first element of a page is reached, asynchronously load next page into cache
    if((i%pageSize)==0)
        arr.prefetch(i+pageSize);
}

Prefetching efficiency depends on:

  • latency from pcie for loading a page (+storing if previous accesses have edited elements)
  • time required for computing all elements until next page
  • if prefetch operation is complete before the processing loop arrives at that page
    • then: loop continues iterations without pcie latency
    • else: prefetching adds an extra latency of a mutex lock and LRU iteration

Benchmark timings from an arbitrary no-compute example:

simple write: 589735907 nanoseconds     (bandwidth = 671.49 MB/s)      (throughput = 393.16 nanoseconds per iteration) 
simple write: 759817996 nanoseconds     (bandwidth = 521.18 MB/s)      (throughput = 506.55 nanoseconds per iteration) 
simple write: 762645884 nanoseconds     (bandwidth = 519.24 MB/s)      (throughput = 508.43 nanoseconds per iteration) 
simple write: 761231767 nanoseconds     (bandwidth = 520.21 MB/s)      (throughput = 507.49 nanoseconds per iteration) 
simple write: 760679226 nanoseconds     (bandwidth = 520.59 MB/s)      (throughput = 507.12 nanoseconds per iteration) 
prefetch write: 479190183 nanoseconds     (bandwidth = 826.39 MB/s)      (throughput = 319.46 nanoseconds per iteration) 
prefetch write: 477348518 nanoseconds     (bandwidth = 829.58 MB/s)      (throughput = 318.23 nanoseconds per iteration) 
prefetch write: 476916204 nanoseconds     (bandwidth = 830.33 MB/s)      (throughput = 317.94 nanoseconds per iteration) 
prefetch write: 477104499 nanoseconds     (bandwidth = 830.01 MB/s)      (throughput = 318.07 nanoseconds per iteration) 
prefetch write: 476079865 nanoseconds     (bandwidth = 831.79 MB/s)      (throughput = 317.39 nanoseconds per iteration) 
simple read: 450068799 nanoseconds     (bandwidth = 879.87 MB/s)      (throughput = 300.05 nanoseconds per iteration) 
simple read: 360594634 nanoseconds     (bandwidth = 1098.19 MB/s)      (throughput = 240.40 nanoseconds per iteration) 
simple read: 360009298 nanoseconds     (bandwidth = 1099.97 MB/s)      (throughput = 240.01 nanoseconds per iteration) 
simple read: 360932279 nanoseconds     (bandwidth = 1097.16 MB/s)      (throughput = 240.62 nanoseconds per iteration) 
simple read: 360074952 nanoseconds     (bandwidth = 1099.77 MB/s)      (throughput = 240.05 nanoseconds per iteration) 
prefetch read: 230803171 nanoseconds     (bandwidth = 1715.75 MB/s)      (throughput = 153.87 nanoseconds per iteration) 
prefetch read: 229898585 nanoseconds     (bandwidth = 1722.50 MB/s)      (throughput = 153.27 nanoseconds per iteration) 
prefetch read: 231080183 nanoseconds     (bandwidth = 1713.69 MB/s)      (throughput = 154.05 nanoseconds per iteration) 
prefetch read: 229643488 nanoseconds     (bandwidth = 1724.41 MB/s)      (throughput = 153.10 nanoseconds per iteration) 
prefetch read: 228486597 nanoseconds     (bandwidth = 1733.14 MB/s)      (throughput = 152.32 nanoseconds per iteration) 

Source code:

#include "GraphicsCardSupplyDepot.h"
#include "VirtualMultiArray.h"
#include "CpuBenchmarker.h"
#include <random>

class Object
{
public:
	Object(){id=-1;}
	Object(size_t i){id=i; data[id%(256)]='A';}
	char data[256];
	size_t id;
};
int main()
{
	GraphicsCardSupplyDepot depot;
	const size_t n = 15000000;
	const size_t pageSize=10000;
	const int maxActivePagesPerGpu = 5;

	VirtualMultiArray<Object> arr(n,depot.requestGpus(),pageSize,maxActivePagesPerGpu);
	const int testSize = n/10;
	for(int j=0;j<5;j++)
	{
		CpuBenchmarker bench(testSize*sizeof(Object),"simple write",testSize);
		std::random_device rd;
		std::mt19937 rng(rd());
		std::uniform_real_distribution<float> rnd(0,testSize);

		for(size_t i=0;i<testSize;i++)
		{
			arr[i]=Object(rnd(rng));
		}
	}

	for(int j=0;j<5;j++)
	{
		CpuBenchmarker bench(testSize*sizeof(Object),"prefetch write",testSize);
		std::random_device rd;
		std::mt19937 rng(rd());
		std::uniform_real_distribution<float> rnd(0,testSize);
		for(size_t i=0;i<testSize;i++)
		{
			arr[i]=Object(rnd(rng));
			if((i%pageSize)==0)
				arr.prefetch(i+pageSize);

		}
	}

	for(int j=0;j<5;j++)
	{
		CpuBenchmarker bench(testSize*sizeof(Object),"simple read",testSize);
		for(size_t i=0;i<testSize;i++)
		{
			Object o = arr[i];
		}
	}

	for(int j=0;j<5;j++)
	{
		CpuBenchmarker bench(testSize*sizeof(Object),"prefetch read",testSize);
		for(size_t i=0;i<testSize;i++)
		{
			Object o = arr[i];
			if((i%pageSize)==0)
				arr.prefetch(i+pageSize);

		}
	}
	std::cout<<arr.get(0).data[0]<<std::endl;
	return 0;
}

Clone this wiki locally