PR on Virtual Memory Management #929
Conversation
@RaulMoldes:
There is no problem. Good luck with everything you have on your plate :)
I'm going to have to do a proper full review of the code, but I skimmed a bit over the implementation and appreciated the comments.
To answer your question: the change to MultiStream shouldn't impact the VirtualStorage trait, since each stream has its own memory pool. This is to ensure correct synchronization between different streams where the order of operations coupled with lazy execution can cause problems. We automatically add sync events between CUDA streams when a tensor allocation from one stream is used by another stream. To summarize, having multiple memory pools per stream is important for sync events, not so much for allocations, since they need to be aligned with Rust atomics anyway.
Also, we just merged a new memory pool (persistent memory pool yeah, naming is hard). The key idea is that parameters in models normally don't change much in size during execution, so we can create a pool of their size that minimizes padding. We still need dynamic memory, though, and I believe that the virtual memory pool could be an improvement over what we have currently in some cases. For training, most of the memory isn't used by model parameters but by backward states, since those tensors are normally batch_size larger than the weights, and for LLMs/transformers, batch_size * seq_length larger than the weights! Virtual memory could reduce the amount of memory used for training.
So we will ship Burn 0.19, then I'll do a proper review/testing of virtual memory that could land in Burn 0.20. From the look of it, I think the design decisions are solid, so it's encouraging.
```toml
enumset = { workspace = true }
foldhash = { workspace = true }
hashbrown = { workspace = true }
libc = "0.2.176"
```
I'm unsure this would be OK for wasm deployment with wgpu
```rust
pub(crate) struct VirtualMemoryPage {
    /// Map from offset to slice ID
    /// Uses a btree map instead of hashmap to ensure offsets are ordered.
    pub slices: HashMap<u64, SliceId>,
}
```
The comment is not valid here: it says a btree map is used to keep offsets ordered, but the field is a HashMap.
PR on Virtual Memory Management for CubeCL.
Ensured `cargo run checks` and `cargo xtask validate` have been executed. I know this PR is a pretty big one, and I have been told to explain myself better, so I will try to do so.
Over the past few weeks I have been trying to implement a system that uses the virtual memory management APIs of GPU platforms to improve CubeCL's memory management. These low-level virtual memory APIs have two main advantages:
I have also discovered that CUDA allows you to provide a hint when reserving an address space, so that the reservation starts at the address you choose.
Allocate Batch A of size [64 x 256 x 256 x 3]
Allocate Batch B of size [32 x 256 x 256 x 3]
Deallocate A (Memory pool now has a chunk of s = 64 * img_size, where img_size is 256 x 256 x 3).
Allocate Batch C of size [128 x 256 x 256 x 3] -> Does not fit on A_free, needs to reallocate.
...
This is a very simplified example, but it gives the intuition.
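The fragmentation scenario above can be sketched with a toy allocator. This is an illustrative model only, not CubeCL code; the name `ToyPool` and its API are hypothetical:

```rust
// Toy model of a classic (non-virtual) memory pool: a freed chunk can only be
// reused by a request that fits entirely inside that single contiguous chunk.
struct ToyPool {
    free_chunks: Vec<u64>, // sizes of free contiguous chunks
    total_reserved: u64,   // total physical memory requested from the driver
}

impl ToyPool {
    fn new() -> Self {
        Self { free_chunks: Vec::new(), total_reserved: 0 }
    }

    /// Allocate `size` bytes. Returns true when a fresh driver allocation was
    /// needed because no free chunk was big enough.
    fn alloc(&mut self, size: u64) -> bool {
        if let Some(pos) = self.free_chunks.iter().position(|&c| c >= size) {
            let chunk = self.free_chunks.remove(pos);
            if chunk > size {
                // The leftover stays behind as a smaller fragment.
                self.free_chunks.push(chunk - size);
            }
            false
        } else {
            self.total_reserved += size;
            true
        }
    }

    fn free(&mut self, size: u64) {
        self.free_chunks.push(size);
    }
}

fn main() {
    let img = 256 * 256 * 3u64;
    let mut pool = ToyPool::new();
    pool.alloc(64 * img); // batch A
    pool.alloc(32 * img); // batch B
    pool.free(64 * img);  // deallocate A
    // Batch C does not fit in the freed 64-image chunk, so the pool must
    // reserve another 128 images worth of memory even though 64 sit free.
    let needed_fresh_memory = pool.alloc(128 * img);
    assert!(needed_fresh_memory);
    assert_eq!(pool.total_reserved, (64 + 32 + 128) * img);
}
```

The pool ends up reserving 224 images worth of physical memory while only 160 are ever live at once: the freed chunk from A is wasted.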
Using virtual memory we can work around this. Here is the same example with a physical block size of [16 x img_size]:
Allocate batch A: 4 physical blocks get allocated and mapped.
Allocate batch B: 2 physical blocks get allocated and mapped.
Deallocate (unmap A): 4 physical blocks are ready to be reused.
Allocate (re-map) C: The scattered blocks of A are remapped into a new address space, the missing spaces are obtained from the driver and no physical memory is lost.
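The block-reuse accounting above can be sketched in a few lines. Again a toy model with a hypothetical name, counting physical blocks rather than calling any real driver API:

```rust
// Toy model of a virtual-memory pool: physical blocks are fixed-size and can
// be remapped into any virtual address range, so freed blocks are never wasted.
struct VirtualToyPool {
    free_blocks: u64,    // physical blocks currently unmapped and reusable
    blocks_created: u64, // physical blocks ever requested from the driver
}

impl VirtualToyPool {
    fn new() -> Self {
        Self { free_blocks: 0, blocks_created: 0 }
    }

    /// Map `n` blocks into a fresh virtual range, reusing free blocks first
    /// and only asking the driver for the missing ones.
    fn alloc(&mut self, n: u64) {
        let reused = n.min(self.free_blocks);
        self.free_blocks -= reused;
        self.blocks_created += n - reused;
    }

    /// Unmap `n` blocks; the physical memory stays available for remapping.
    fn free(&mut self, n: u64) {
        self.free_blocks += n;
    }
}

fn main() {
    // Physical block size = 16 images, as in the example above.
    let mut pool = VirtualToyPool::new();
    pool.alloc(4); // batch A: 64 images -> 4 blocks
    pool.alloc(2); // batch B: 32 images -> 2 blocks
    pool.free(4);  // unmap A
    pool.alloc(8); // batch C: remap A's 4 blocks + 4 new ones
    // Only 10 physical blocks were ever created, versus 14 (224 images) with
    // the classic pool in the previous example.
    assert_eq!(pool.blocks_created, 10);
}
```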
What is explained above is the main idea. However, my implementation goes a bit further by taking advantage of two additional observations:
The amount of virtual memory that the OS can provide is generally unlimited. This means that even if your DRAM is of size, let's say, X MB, you can theoretically make virtual reservations much larger than X, as long as those addresses are not mapped to physical memory. This can save time by reducing the number of API calls needed to reserve virtual memory: instead of reserving repeatedly, you reserve once and keep reusing those address spaces. We just need to periodically unmap address spaces that are not in use, so they can be reused.
At least the CUDA driver API (I do not know about other vendors) allows you to provide a hint when reserving a virtual address space, so that the address space starts where you (the runtime) want it to. This is not guaranteed: if you ask the driver for a reservation starting at an address that is already reserved, the driver will return an arbitrary address instead. However, if we know the ending address of our last reservation, we can use it as the hint for the next one, allowing us to later compact two contiguous address spaces into a bigger one, which facilitates memory reuse.
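In the CUDA driver API this hint is the `addr` argument of `cuMemAddressReserve`. The reuse logic can be modeled without a GPU; the `reserve` function and `Reservation` struct below are illustrative assumptions, not the PR's actual code:

```rust
// Toy model of hint-based virtual address reservation: if the driver honors a
// hint equal to the end of the previous reservation, the two ranges become
// contiguous and can later be compacted into one larger address space.
struct Reservation {
    start: u64,
    size: u64,
}

/// Reserve `size` bytes, preferring to start at `hint`. The hint is only
/// honored when the requested range is free; otherwise the driver returns an
/// arbitrary address (modeled here as a fixed fallback value).
fn reserve(hint: u64, size: u64, already_taken: &[Reservation]) -> Reservation {
    let hint_is_free = already_taken
        .iter()
        .all(|r| hint >= r.start + r.size || hint + size <= r.start);
    let start = if hint_is_free { hint } else { 0xDEAD_0000 };
    Reservation { start, size }
}

fn main() {
    let first = reserve(0x1000_0000, 4096, &[]);
    let taken = [Reservation { start: first.start, size: first.size }];
    // Hint the next reservation to start exactly where the first one ends.
    let second = reserve(first.start + first.size, 4096, &taken);
    // The two reservations are contiguous and can be compacted into one.
    assert_eq!(second.start, first.start + first.size);
}
```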
That is the theory; here is how I implemented it in practice on cubecl-cuda.
VirtualStorage integration
CubeCL works with traits. Currently, the main abstraction for storages is the [ComputeStorage] trait. However, this trait does not provide the granularity you need to work with these VMM APIs.
Therefore, the solution has been to create a new trait, [VirtualStorage] (in `cubecl_runtime/src/storage/virtual_memory/base.rs`). This trait provides the extra functionality that compute storages need to work with virtual memory.
To integrate it into the CubeCL architecture, I would like to make it OPTIONAL, so all methods have a default implementation, and backends that do not support virtual memory can simply derive the defaults.
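A minimal sketch of what such an optional trait could look like. The method names and signatures here are hypothetical illustrations of the default-implementation pattern, not the actual [VirtualStorage] API from the PR:

```rust
// Hypothetical sketch: every method has a default "unsupported" implementation,
// so a backend without virtual memory support opts in with an empty impl block.
pub trait VirtualStorage {
    /// Whether the backend actually supports virtual memory.
    fn supports_virtual_memory(&self) -> bool {
        false
    }

    /// Reserve a virtual address range of `size` bytes, optionally hinting a
    /// start address. Unsupported backends simply return None.
    fn reserve(&mut self, size: u64, hint: Option<u64>) -> Option<u64> {
        let _ = (size, hint);
        None
    }

    /// Map `size` bytes of physical memory at virtual address `addr`.
    /// Returns false when mapping is not supported.
    fn map(&mut self, addr: u64, size: u64) -> bool {
        let _ = (addr, size);
        false
    }

    /// Unmap a previously mapped range, keeping the physical memory reusable.
    fn unmap(&mut self, addr: u64, size: u64) {
        let _ = (addr, size);
    }
}

// A backend without virtual memory support derives the defaults like this:
struct CpuStorage;
impl VirtualStorage for CpuStorage {}

fn main() {
    let mut s = CpuStorage;
    assert!(!s.supports_virtual_memory());
    assert!(s.reserve(4096, None).is_none());
}
```

The memory manager can then query `supports_virtual_memory()` at runtime and fall back to the classic pools when it returns false.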
This is an explanation of the methods in the trait:
I have also provided and tested an example of how I believe this trait should be implemented in cubecl-cuda (`crates/cubecl-cuda/compute/storage/gpu.rs`, look for GpuVirtualStorage). The main idea is to keep the storage as close to a hardware interface as possible; the magic then happens at the memory pool level. I believe the code is well structured and documented, so I refer you to that implementation if you want to review further.

VirtualMemoryPool
I know I am very bad at naming things. I have provided a large comment at the top of the virtual memory pool module (`cubecl_runtime/src/memory_management/memory_pool/virtual_pool.rs`) so that you can understand what it does.

I was reviewing the implementation of the SlicedPool, which I think is in fact incomplete. Anyway, I noticed that it has some data structures that could be reused for my implementation, but I needed some extra functionality, so I decided to make them generic. These are:
RingBuffer: I did not create a trait for it; I just made it generic so that it can be used to search efficiently for free [MemoryFragments] on sets of [MemoryChunks].
MemoryPage: There was a comment at the top of this struct suggesting to make it generic, so I did not hesitate to do so. The generic trait is called MemoryChunk. It represents a region in memory guaranteed to be contiguous.
Slice: In this case I just needed a trait to decouple [MemoryChunk] and Slice, so I created MemoryFragment, which is a fraction of a MemoryChunk.
I have reused these structures in my own implementation. The [VirtualMemoryPool] is basically a memory pool that can safely be used by any [MemoryManager] that contains a [VirtualStorage].
At runtime, there is a flag in the [MemoryDeviceProperties] that determines whether virtual memory is supported on the target device.
If it is, it will create two new [DynamicPools::Virtual] pools: the first makes physical allocations of the target device's page size, and the second makes allocations of 10 times that. This idea of having two buckets (or two pools) is copied from PyTorch (as are the sizes), but it won't be much of a problem to change it if you decide to, for example, generate buckets as you do in the [ExclusivePool].
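As a rough illustration of the two-bucket idea, here is one possible routing rule. The threshold and the function itself are assumptions for illustration only; the PR text only specifies the two block sizes (page size and 10x page size), not the exact routing policy:

```rust
// Hypothetical two-bucket routing: small requests use the small-block pool,
// larger requests use the large-block pool so fewer mappings are needed per
// allocation. The threshold choice here is an assumption, not the PR's policy.
fn pick_block_size(request: u64, page_size: u64) -> u64 {
    let small = page_size;
    let large = 10 * page_size;
    if request <= small {
        small
    } else {
        large
    }
}

fn main() {
    // e.g. a 2 MiB device allocation granularity
    let page = 2 * 1024 * 1024u64;
    assert_eq!(pick_block_size(1024, page), page);
    assert_eq!(pick_block_size(64 * 1024 * 1024, page), 10 * page);
}
```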
The memory pool works similarly to the sliced pool in a sense:
The main idea is that you can further optimize the memory management strategy by ensuring that the address space is as compact as possible, and potentially reducing the number of api calls required.
If you think about it, at the next allocation the free slices will be available for reuse, and the pool won't have to do more than re-map them to their assigned fraction of the memory space.
Testing:
I have added two main test suites for my implementations.
Test suite A: Virtual Storage:
I have added the following tests. The first three validate the normal functionality of the VirtualStorage, while the last two focus on specific situations and attempt to demonstrate that the assumptions I stated at the beginning of this document are correct.
Test suite B: Virtual Memory Pool:
I saw that, to test the memory manager, you have a separate data structure called BytesStorage that simulates compute storage using heap allocations. To test my virtual memory pool I needed a similar structure, but it has to use virtual memory, which cannot be done using only `core` stuff, as you need an OS to back you. I have implemented a simulated virtual memory allocator called [BytesVirtualStorage], which is intended only for testing purposes. The only thing I was not able to simulate was the contiguous reservation of memory pages.

Notes:
I have the following questions.
Implementing this has taken me some time. I respect my time and yours, and I really appreciate the reviewers reviewing my PR, because you help me learn from my mistakes as an engineer. However, I am not sure how to integrate this with the new [MultiStream] that you have built, which will surely be really helpful for improving CubeCL performance. My concern is that you cannot use the VMM API to allocate memory on a specific stream.
I would appreciate your opinion on my implementation.
Thanks!