
@RaulMoldes

PR on Virtual Memory Management for CubeCL.

Ensured that `cargo run-checks` and `cargo xtask validate` have been executed.

I know this PR is a pretty big one, and I have been told to better explain myself, so I will try to do so here.

Over the past few weeks I have been implementing a system that uses the virtual memory management (VMM) APIs of GPU platforms to improve CubeCL's memory management. These low-level virtual memory APIs have two main advantages:

  1. You can tell the operating system to map discarded physical memory allocations to aligned virtual address ranges. The OS makes sure the memory always appears aligned and contiguous to the user, even if it is physically fragmented, by maintaining the mapping in an internal pool.

I have also discovered that CUDA allows you to provide a hint when reserving an address space, so that the reservation starts where you like.

  2. For machine learning models this is especially useful during inference, particularly when the batch size changes across iterations. Imagine the following situation:
  • Allocate Batch A of size [64 x 256 x 256 x 3]

  • Allocate Batch B of size [32 x 256 x 256 x 3]

  • Deallocate A (the memory pool now has a free chunk of size s = 64 * img_size, where img_size = 256 x 256 x 3).

  • Allocate Batch C of size [128 x 256 x 256 x 3] -> it does not fit in A's freed chunk, so a fresh allocation is needed.

  • ...

This is a very simplified example, but it conveys the intuition: C needs 128 images' worth of contiguous memory, so it cannot reuse A's freed 64-image chunk, and the pool allocates from scratch while the freed physical memory sits idle.

Using virtual memory we can work around this. Here is the same sequence with a physical block size of [16 x img_size]:

  • Allocate batch A: 4 physical blocks get allocated and mapped.

  • Allocate batch B: 2 physical blocks get allocated and mapped.

  • Deallocate (unmap) A: 4 physical blocks are ready to be reused.

  • Allocate (re-map) C: the scattered blocks of A are remapped into a new address space, the missing blocks are obtained from the driver, and no physical memory is lost. A driver-level sketch of this step follows.
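To make the batch C step concrete, here is a minimal sketch at the driver level, using raw CUDA driver API calls in the same style as the `reserve` implementation shown later. Everything here is illustration, not the PR's actual code: `block_size`, `alignment`, `prop` (a CUmemAllocationProp), `access_desc` (a CUmemAccessDesc) and `freed_blocks` are assumed to be set up elsewhere, and error handling is elided.

// Hypothetical sketch: building batch C out of A's freed blocks plus new ones.
unsafe fn remap_batch_c(
    freed_blocks: &[CUmemGenericAllocationHandle],
    block_size: usize,
    alignment: usize,
    prop: &CUmemAllocationProp,
    access_desc: &CUmemAccessDesc,
) -> CUdeviceptr {
    let total = 8 * block_size; // batch C spans 8 blocks of [16 x img_size]

    // Reserve one contiguous virtual range for the whole batch.
    let mut va: CUdeviceptr = 0;
    cuMemAddressReserve(&mut va, total, alignment, 0, 0);

    // Re-map the 4 physical blocks previously backing batch A...
    for (i, handle) in freed_blocks.iter().enumerate() {
        cuMemMap(va + (i * block_size) as u64, block_size, 0, *handle, 0);
    }

    // ...and obtain the missing blocks from the driver.
    for i in freed_blocks.len()..8 {
        let mut handle: CUmemGenericAllocationHandle = 0;
        cuMemCreate(&mut handle, block_size, prop, 0);
        cuMemMap(va + (i * block_size) as u64, block_size, 0, handle, 0);
    }

    // Make the whole range accessible to the device before use.
    cuMemSetAccess(va, total, access_desc, 1);
    va
}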

What is explained above is the main idea. However, my implementation goes a bit further by taking advantage of two additional facts:

  1. The amount of virtual address space the OS can provide is generally unlimited. Even if your DRAM has size, let's say, X MB, you can theoretically make virtual reservations much larger than X, as long as those addresses are not mapped to physical memory. This saves time by reducing the number of API calls needed to reserve virtual memory: instead of reserving repeatedly, you reserve once and keep reusing that address space. We just need to periodically unmap address ranges that are not in use, so that they can be reused.

  2. The CUDA driver API (I do not know about other vendors) allows you to provide a hint when reserving a virtual address space, so that the reservation starts where you (the runtime) want it to. This is not guaranteed: if you ask the driver for a reservation starting at an address that is already reserved, it will give you an arbitrary address instead. However, if we know the end address of our last reservation, we can use it as the hint for the next one, which later allows compacting two contiguous address spaces into a bigger one and facilitates memory reuse.

This is how I implemented it in CubeCL-CUDA, in practice:

    /// Reserves a virtual address space of the target size.
    /// The parameter `start_addr` is a hint telling CUDA where we want the allocation to start.
    /// In practice, the CUDA documentation says the allocation is not guaranteed to start there.
    /// Returns a storage handle pointing to the reserved virtual address space.
    fn reserve(
        &mut self,
        size: u64,
        start_addr: Option<StorageId>,
    ) -> Result<StorageHandle, IoError> {
        if !self.is_virtual_mem_enabled() {
            return Err(IoError::Unknown("Virtual memory is disabled!".to_string()));
        }

        let aligned_size = size.next_multiple_of(self.mem_alignment as u64);

        let addr = if let Some(prev) = start_addr
            && let Some(space) = self.virtual_memory.get(&prev)
        {
            space.ptr() + space.size()
        } else {
            0 // Zero lets CUDA choose the starting address itself
        };

        unsafe {
            let mut virtual_addr: CUdeviceptr = 0;

            // Note: CUDA is not guaranteed to reserve the address range at the address we request.
            // The fourth argument to [cuMemAddressReserve] acts as a 'hint' telling the driver that we would
            // like the reservation to start at that point. This is useful for expanding virtual memory ranges when new memory is required.
            let result = cuMemAddressReserve(
                &mut virtual_addr,
                aligned_size as usize,
                self.mem_alignment,
                addr,
                0,
            );

            match result {
                CUresult::CUDA_SUCCESS => {
                    let id = StorageId::new();
                    let addr = GpuVirtualAddressSpace::new(virtual_addr, aligned_size);

                    self.virtual_memory.insert(id, addr);

                    let handle = StorageHandle::new(id, StorageUtilization { size, offset: 0 });
                    Ok(handle)
                }

                CUresult::CUDA_ERROR_OUT_OF_MEMORY => {
                    Err(IoError::BufferTooBig(aligned_size as usize))
                }
                other => Err(IoError::Unknown(format!(
                    "CUDA reserve failed: {:?}",
                    other
                ))),
            }
        }
    }


    /// Checks if two address spaces are contiguous in memory (the first one ends where the second one starts).
    /// This is useful to perform defragmentation.
    fn are_aligned(&self, lhs: &StorageId, rhs: &StorageId) -> bool {
        if let Some(a) = self.virtual_memory.get(lhs)
            && let Some(b) = self.virtual_memory.get(rhs)
        {
            return a.ptr() + a.size() == b.ptr();
        };
        false
    }
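As a usage sketch of the hint mechanism (hypothetical glue code, following the two signatures above; the `id` field access matches how `StorageHandle::new` is used in `reserve`): reserving a second range with `start_addr = Some(first.id)` should, in the best case, produce a reservation that `are_aligned` confirms is contiguous with the first.

// Hypothetical: try to grow the address space contiguously via the hint.
fn grow(storage: &mut GpuVirtualStorage, len: u64) -> Result<(), IoError> {
    let first = storage.reserve(len, None)?;            // driver picks the start
    let second = storage.reserve(len, Some(first.id))?; // hint: start where `first` ends
    if storage.are_aligned(&first.id, &second.id) {
        // Best case: the driver honored the hint, so the two reservations
        // form one contiguous 2*len span that can later be compacted.
    }
    Ok(())
}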

After this theoretical explanation, here is what I have done.

VirtualStorage integration

CubeCL works with traits. Currently, the main abstraction for storage is the [ComputeStorage] trait. However, this trait does not provide enough flexibility for the granularity these VMM APIs require.

Therefore, the solution has been to create a new trait, [VirtualStorage] (in cubecl_runtime/src/storage/virtual_memory/base.rs).
This trait provides the extra functionality that compute storages need to work with virtual memory.
To integrate it into the CubeCL architecture, I would like to make it OPTIONAL, so all methods have a default implementation and backends that do not support virtual memory can derive it this way:

/// Override VirtualStorage
impl VirtualStorage for WgpuStorage {}

This is an explanation of the methods in the trait:

/// Virtual Storage trait.
/// I want to make this trait optional. However, to be able to use it in the memory manager I have to restrict the storage type with VirtualStorage trait bounds.
/// Therefore, all methods have a default implementation.
/// By making ComputeStorage inherit from it, all storages that implement ComputeStorage automatically get the default implementation of VirtualStorage.
/// Then, at runtime, the memory pools can check the method [`is_virtual_mem_enabled`] to verify whether virtual memory is supported on the target backend.
pub trait VirtualStorage {
    /// Retrieves the minimum allocation granularity of this storage. All physical and virtual allocations should be aligned.
    fn granularity(&self) -> usize {
        0 // Default granularity is zero when virtual memory is not supported.
    }

    /// Check whether virtual mem is supported
    fn is_virtual_mem_enabled(&self) -> bool {
        false
    }

    /// Allocate physical memory of the requested size
    fn allocate(&mut self, _size: u64) -> Result<PhysicalStorageHandle, IoError> {
        Err(IoError::Unknown(
            "Virtual memory is not supported!".to_string(),
        ))
    }

    /// Releases a physical memory handle to the driver (explicit).
    fn release(&mut self, _id: PhysicalStorageId) {}

    /// Reserves an address space of a given size. Padding should be automatically added to meet the granularity requirements. The parameter `start_addr` is the id of the address space which should end at the beginning of the next reservation (if applicable).
    fn reserve(
        &mut self,
        _size: u64,
        _start_addr: Option<StorageId>,
    ) -> Result<StorageHandle, IoError> {
        Err(IoError::Unknown(
            "Virtual memory is not supported!".to_string(),
        ))
    }

    /// Releases the virtual address range associated with this handle.
    fn free(&mut self, _id: StorageId) {}

    /// Map physical memory to a range of virtual addresses
    fn map(
        &mut self,
        _id: StorageId,
        _offset: u64,
        _physical_storage: &mut PhysicalStorageHandle,
    ) -> Result<StorageHandle, IoError> {
        Err(IoError::Unknown(
            "Virtual memory is not supported!".to_string(),
        ))
    }

    /// Unmap the handles
    fn unmap(&mut self, _id: StorageId, _offset: u64, _physical: &mut PhysicalStorageHandle) {}

    /// Checks if two address spaces are contiguous in memory (the first one ends where the second one starts).
    /// This is useful to perform defragmentation.
    fn are_aligned(&self, _lhs: &StorageId, _rhs: &StorageId) -> bool {
        false
    }
}
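To show how a memory pool is expected to drive this trait, here is a hedged round-trip sketch. The `id` field on StorageHandle follows its constructor usage above; the `phys.id()` accessor on PhysicalStorageHandle is an assumption, not a confirmed API.

// Hypothetical alloc/dealloc round trip through VirtualStorage.
fn round_trip<S: VirtualStorage>(storage: &mut S, size: u64) -> Result<(), IoError> {
    if !storage.is_virtual_mem_enabled() {
        // Fall back to the plain ComputeStorage path on unsupported backends.
        return Ok(());
    }

    // Reserve a virtual range, back it with physical memory, and map it.
    let va = storage.reserve(size, None)?;
    let mut phys = storage.allocate(size)?;
    let mapped = storage.map(va.id, 0, &mut phys)?;

    // ...the `mapped` handle is what gets handed out to the user.
    // On deallocation, unmapping keeps the physical block reusable:
    storage.unmap(mapped.id, 0, &mut phys);

    // Only when memory should actually go back to the driver:
    storage.release(phys.id()); // `id()` accessor is assumed
    storage.free(va.id);
    Ok(())
}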

I have also provided and tested an example of how I believe this trait should be implemented in CubeCL-CUDA (crates/cubecl-cuda/compute/storage/gpu.rs; look for GpuVirtualStorage). The main idea is to keep the storage as close to a hardware interface as possible; the magic then happens at the memory pool level. I believe the code is well structured and documented, so I refer you to that implementation for further review.

VirtualMemoryPool

I know I am very bad at naming things. I have provided a large comment at the top of the virtual memory pool module so that you can understand what it does (cubecl_runtime/src/memory_management/memory_pool/virtual_pool.rs).

I was reviewing the implementation of the SlicedPool, which I actually think is incomplete. In any case, I noticed it had some data structures that could be reused for my implementation, but I needed some extra functionality, so I decided to make them generic. These are:

  • RingBuffer: I did not create a trait for it; I just made it generic so that it can be used to search efficiently for free [MemoryFragments] across sets of [MemoryChunks].

  • MemoryPage: There was a comment at the top of this struct suggesting to make it generic, so I did not hesitate to do so. The generic trait is called MemoryChunk; it represents a region of memory guaranteed to be contiguous.

  • Slice: In this case I just needed a trait to decouple [MemoryChunk] and Slice, so I created MemoryFragment, which represents a fraction of a MemoryChunk.

I have reused these structures in my own implementation. The [VirtualMemoryPool] is basically a memory pool that can safely be used by any [MemoryManager] that contains a [VirtualStorage].

At runtime, there is a flag in the [MemoryDeviceProperties] that determines whether virtual memory is supported on the target device.
If it is, two new [DynamicPools::Virtual] pools are created: the first makes physical allocations of the target device's page size, and the second makes allocations ten times that size. This idea of having two buckets (or two pools) is copied from PyTorch (the sizes as well), but it won't be much of a problem to change it if you decide to, for example, generate buckets as you do in the [ExclusivePool].
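A rough sketch of that setup follows; the `VirtualMemoryPool::new(block_size)` constructor signature here is hypothetical, only the two bucket sizes are the point.

// Hypothetical construction of the two Virtual pools, mirroring PyTorch's
// two-bucket scheme: one pool with 1-page physical blocks, one with 10 pages.
fn create_virtual_pools(page_size: u64) -> (VirtualMemoryPool, VirtualMemoryPool) {
    (
        VirtualMemoryPool::new(page_size),      // small allocations
        VirtualMemoryPool::new(10 * page_size), // large allocations
    )
}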

In a sense, the memory pool works similarly to the sliced pool:

1. It tries to find a free slice of sufficient size. If it finds one bigger than requested, the slice is split using the ring buffer and a slice of the requested size is returned to the user.

2. If no slice is found, it goes directly to the driver to get a new one, maps it, and returns it to the user.

3. The main difference is that this pool can defragment itself at cleanup. Every [self.dealloc_period] deallocations:

    3.1. First, free slices that have not been recently allocated are collected and unmapped, returning their physical memory to a free list for reuse.

    3.2. Second, the pages are merged and compacted. The idea is to take advantage of the CUDA feature that allows you to reserve address spaces contiguously (one right after the other). The memory pool maintains a linked list, reset at each cleanup, that records the order in which pages were allocated. If the reservations happen to be contiguous, they can be merged into a single big page (in the best case), or at least into some bigger fragments (in the worst).

    3.3. After that, the remaining pages are compacted, merging contiguous free slices into a single one.

    3.4. If explicitly called, it will also return pages that have become fully free to the driver, and clean up physical memory.

The main idea is that you can further optimize the memory management strategy by keeping the address space as compact as possible, potentially reducing the number of API calls required.

If you think about it, at the next allocation the free slices will be available for reuse, and the pool won't have to do more than re-map them to their assigned fraction of the address space. A sketch of the page-merge step follows.
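Conceptually, the merge step of 3.2 reduces to grouping pages, in allocation order, into runs of contiguous reservations via `are_aligned`. A minimal sketch, assuming `StorageId` is `Copy` (as id types in CubeCL typically are); the helper name is hypothetical.

// Group pages (in allocation order) into runs of contiguous reservations;
// each run can then be merged into a single bigger page.
fn contiguous_runs<S: VirtualStorage>(storage: &S, pages: &[StorageId]) -> Vec<Vec<StorageId>> {
    let mut runs: Vec<Vec<StorageId>> = Vec::new();
    for &page in pages {
        match runs.last_mut() {
            // The previous page ends exactly where this one starts.
            Some(run) if storage.are_aligned(run.last().unwrap(), &page) => run.push(page),
            _ => runs.push(vec![page]),
        }
    }
    runs
}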

Testing:

I have added two main test suites for my implementations.

Test suite A: Virtual Storage:

I have added the following tests. The first three validate the normal functionality of the VirtualStorage, while the last two focus on specific situations and attempt to demonstrate that the assumptions stated at the beginning of this document are correct.

/// Validates a simple physical memory allocation pattern.
fn test_physical_memory_allocation()
/// Validates a simple virtual address space reservation
fn test_virtual_address_space_reservation()
/// Validates a simple virtual address space mapping
fn test_memory_mapping()
/// Demonstrates that we can reserve contiguous address spaces thanks to VMM capabilities.
fn test_contiguous_address_space_reservations()
/// Demonstrates that [`reserve`] does not fail even when reserving more than the available physical memory.
fn test_virtual_memory_overcommitting()

Test suite B: Virtual Memory Pool:

I saw that, to test the memory manager, you have a separate data structure called BytesStorage that simulates a compute storage using heap allocations. To test my virtual memory pool I needed a similar structure, but it has to use virtual memory, which cannot be done with core-only code, as you need an OS to back you. I have implemented a simulated virtual memory allocator called [BytesVirtualStorage], intended only for testing purposes. The only thing I was not able to simulate was the contiguous reservation of memory pages. The simulation idea is sketched below, before the test list.
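The idea is roughly the following (a hypothetical sketch, not the actual [BytesVirtualStorage] code): physical blocks are plain heap buffers, and "mapping" just attaches a buffer to an offset inside a reserved range.

use std::collections::HashMap;

// Hypothetical test double: a "virtual reservation" is a size plus a map
// from offset to the heap buffer currently mapped there; unmapped buffers
// go back to a free list, like physical blocks in the real pool.
struct FakeVirtualSpace {
    size: u64,
    mappings: HashMap<u64, Vec<u8>>, // offset -> backing heap buffer
}

impl FakeVirtualSpace {
    fn reserve(size: u64) -> Self {
        Self { size, mappings: HashMap::new() }
    }

    fn map(&mut self, offset: u64, block: Vec<u8>) {
        assert!(offset + block.len() as u64 <= self.size, "mapping out of range");
        self.mappings.insert(offset, block);
    }

    fn unmap(&mut self, offset: u64) -> Option<Vec<u8>> {
        self.mappings.remove(&offset)
    }
}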

/// Validates a simple allocation pattern.
/// Checks for the data integrity of allocations.
fn test_virtual_pool_alloc();

/// Validates that defragmentation happens after explicitly calling cleanup
fn test_virtual_pool_cleanup();

/// Validates the defragmentation behaviour when cleanup is triggered non-explicitly.
/// The main difference is that this path does not free physical memory, and recently allocated pages are also preserved.
fn test_virtual_pool_cleanup_noexplicit()

Notes:

I have the following questions.

Implementing this has taken me some time. I respect my time and yours, and I really appreciate the reviewers taking the time to review my PR, because you help me learn from my mistakes as an engineer. However, I am not sure how to integrate this with the new [MultiStream] system you have built, which will surely be really helpful for improving CubeCL performance. My concern is that you cannot use the VMM API to allocate memory on a specific stream.

If you can tell me your opinion on my implementation, I will appreciate it.

Thanks!

@nathanielsimard
Member

@RaulMoldes
I'll need to review this more carefully soon. I'm currently focused on fixing issues for the next release of Burn/CubeCL, but right after, I'll examine this PR more closely. Thanks!

@RaulMoldes
Author

No problem. Good luck with everything you have in hand :)

@nathanielsimard (Member) left a comment

I'm going to have to do a proper full review of the code, but I skimmed a bit over the implementation and appreciated the comments.

To answer your question: the change to MultiStream shouldn't impact the VirtualStorage trait, since each stream has its own memory pool. This is to ensure correct synchronization between different streams where the order of operations coupled with lazy execution can cause problems. We automatically add sync events between CUDA streams when a tensor allocation from one stream is used by another stream. To summarize, having multiple memory pools per stream is important for sync events, not so much for allocations, since they need to be aligned with Rust atomics anyway.

Also, we just merged a new memory pool (persistent memory pool yeah, naming is hard). The key idea is that parameters in models normally don't change much in size during execution, so we can create a pool of their size that minimizes padding. We still need dynamic memory, though, and I believe that the virtual memory pool could be an improvement over what we have currently in some cases. For training, most of the memory isn't used by model parameters but by backward states, since those tensors are normally batch_size larger than the weights, and for LLMs/transformers, batch_size * seq_length larger than the weights! Virtual memory could reduce the amount of memory used for training.

So we will ship Burn 0.19, then I'll do a proper review/testing of virtual memory that could land in Burn 0.20. From the look of it, I think the design decisions are solid, so it's encouraging.

enumset = { workspace = true }
foldhash = { workspace = true }
hashbrown = { workspace = true }
libc = "0.2.176"
I'm unsure this would be OK for wasm deployment with wgpu

pub(crate) struct VirtualMemoryPage {
/// Map from offset to slice ID
/// Uses a btree map instead of hashmap to ensure offsets are ordered.
pub slices: HashMap<u64, SliceId>,
The comment is not valid here.
