# Region-based Heterogeneous Memory Management

## Design

### Usage

To allocate 4KB of CPU memory:

```cpp
p = memory::Alloc(platform::CPUPlace(), 4*1024);
```

To allocate 4KB of memory on the 3rd GPU (device ID 2):

```cpp
p = memory::Alloc(platform::GPUPlace(2), 4*1024);
```

To free memory and to check the amount of memory used so far on a place:

```cpp
auto pl = platform::GPUPlace(0);
p = memory::Alloc(pl, 4*1024);
cout << memory::Used(pl);
memory::Free(pl, p);
```

### API

In `paddle/memory/memory.h` we have:

```cpp
namespace memory {
template <typename Place> void* Alloc(Place, size_t);
template <typename Place> void Free(Place, void*);
template <typename Place> size_t Used(Place);
}  // namespace memory
```

These function templates have specializations for either `platform::CPUPlace` or `platform::GPUPlace`:

```cpp
template<>
void* Alloc<CPUPlace>(CPUPlace p, size_t size) {
  return GetCPUBuddyAllocator()->Alloc(size);
}
```

and

```cpp
template<>
void* Alloc<GPUPlace>(GPUPlace p, size_t size) {
  return GetGPUBuddyAllocator(p.id)->Alloc(size);
}
```

Similar specializations exist for `Free` and `Used`.

### Implementation

`GetCPUBuddyAllocator` and `GetGPUBuddyAllocator` are singletons.

```cpp
BuddyAllocator* GetCPUBuddyAllocator() {
  static BuddyAllocator* a = NULL;
  if (a == NULL) {
    a = new BuddyAllocator(new CPUAllocator /*backup allocator*/, ...);
  }
  return a;
}

BuddyAllocator* GetGPUBuddyAllocator(int gpu_id) {
  static BuddyAllocator** as = NULL;
  if (as == NULL) {
    as = new BuddyAllocator*[platform::NumGPUs()];
    for (int gpu = 0; gpu < platform::NumGPUs(); gpu++) {
      as[gpu] = new BuddyAllocator(new GPUAllocator(gpu) /*backup allocator*/, ...);
    }
  }
  return as[gpu_id];
}
```

#### `BuddyAllocator`

`BuddyAllocator` implements the buddy allocation algorithm. Its constructor takes only parameters related to the algorithm:

```cpp
BuddyAllocator::BuddyAllocator(initial_pool_size, max_pool_size) {
  ...
}
```

Please be aware that **`BuddyAllocator` always allocates aligned memory**, aligned to 32 bytes, which can hold a `BuddyAllocator::Block` object:

```cpp
class BuddyAllocator {
 private:
  struct Block {
    size_t size;
    Block *left, *right;
    size_t index;  // allocator id
  };
  ...
};
```

Because `BuddyAllocator` keeps metadata for each block, it can trace the used memory -- it records the amount returned by `Alloc` and reclaimed by `Free`. In contrast, `CPUAllocator` and `GPUAllocator` don't know the size of a freed memory block and therefore cannot do this tracing.

#### System Allocators

`GPUAllocator` and `CPUAllocator` are called *system allocators*. They work as the fallback allocators of `BuddyAllocator`.

## Justification

I got inspiration from Majel and Caffe2, though the design above looks different from both.

### Caffe2

In Caffe2, `Tensor<Context>::mutable_data()` allocates the memory. In particular, [`Tensor<Context>::mutable_data`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/tensor.h#L523) calls [`Tensor<Context>::raw_mutable_data`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/tensor.h#L459), which in turn calls [`Context::New`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/tensor.h#L479).

There are two implementations of `Context`:

1. [`CPUContext`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context.h#L105), whose [`New` method](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context.h#L131) calls [`g_cpu_allocator.get()->New(size_t)`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context.cc#L15) to allocate the memory.
1. [`CUDAContext`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context_gpu.h#L99), which has a data member [`int gpu_id_`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context_gpu.h#L202). This looks very similar to class `majel::GPUPlace`, which also has an `int id_` data member. `CUDAContext::New(size_t)` calls [`g_cub_allocator->DeviceAllocate(&ptr, nbytes)`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context_gpu.cu#L355) to allocate the memory.

### Majel

In Majel, there are basically two allocator types:

1. `cpu::SystemAllocator`, which has similar functionality to `caffe2::CPUContext::New/Delete`.
1. `gpu::SystemAllocator`, which has similar functionality to `caffe2::CUDAContext::New/Delete`.

However, programs do not allocate memory via these two allocators directly; both are defined in hidden namespaces.

In Majel there are hidden global variables like:

1. `cpu::SystemAllocator g_cpu_allocator`, and
1. `vector<gpu::SystemAllocator*> g_gpu_allocators(NUM_GPUS)`.

Programs allocate memory via a `BuddyAllocator`, which can take `g_cpu_allocator` or a `g_gpu_allocators[gpu_id]` as its *fallback allocator*, so that if the `BuddyAllocator` cannot find a block in its memory pool, it extends the pool by calling the fallback allocator's `New(size_t)`.