Conversation

JohannesGaessler
Collaborator

This PR makes it so that on exit a breakdown of the memory use is printed. For example:

llama_print_memory_breakdown: memory breakdown:      total   free     self   model   context   compute    unaccounted
llama_print_memory_breakdown:   - CUDA0 (RTX 4090):  24080 = 9436 + (14193 = 13169 +      38 +     985) +         451
llama_print_memory_breakdown:   - CUDA1 (RTX 4090):  24080 = 9868 + (13756 = 13169 +      38 +     548) +         456
llama_print_memory_breakdown:   - CUDA2 (RTX 4090):  24080 = 9868 + (13756 = 13169 +      38 +     548) +         456
llama_print_memory_breakdown:   - CPU (EPYC 7742):  515628              72 =     0 +      57 +      15

Explanation:

size_t memory_total;             // total memory as reported by the device
size_t memory_free;              // free memory as reported by the device
size_t memory_used_self;         // sum of model, context, and compute
size_t memory_used_self_model;   // memory allocated for the model
size_t memory_used_self_context; // memory allocated for the context
size_t memory_used_self_compute; // memory allocated for temporary compute buffers
size_t memory_used_unaccounted;  // memory with unknown use, e.g. drivers or other programs, total - (free + self)
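
For example, taking the CUDA0 row of the table above: self = model + context + compute (13169 + 38 + 985 ≈ 14193, up to rounding), and unaccounted = total - (free + self) = 24080 - (9436 + 14193) = 451.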

The intended immediate use is to make it easier to efficiently distribute models across devices. I also intend to re-use this code to determine automatically which parts of the model to put on which device for optimal performance. Long-term I would also want to expose this information via the HTTP server to establish a Pareto frontier of quality vs. memory use for different quantizations of different models.

Open problems:

  • I added a function llama_print_memory_breakdown to the llama API which produces the above table on the console. Internally this function uses another new function llama_backend_info which returns a struct with information about the backends used by a llama_context. I'm not sure whether the latter should be part of the public API, and if so, in what form.
  • I added methods like llama_model::memory_use(ggml_backend_dev_t dev) which return the memory used on a specified device. But I'm not sure whether the device is the correct argument type here. Would it make more sense to pass a ggml_backend_buffer_type_t? In particular, I think this is the only correct way to handle e.g. CUDA_Host buffers.
  • The memory for e.g. the CUDA pools is currently under "unaccounted", but it should be under "compute". Currently it is not possible for llama.cpp to retrieve this information. I think it would make sense to extend the ggml backend interface with a function that returns the total amount of device memory allocated by the backend (a rough sketch of what such an extension could look like follows this list).
  • I'm not sure what to show, if anything, for the CPU. "Free" memory in this context does not have a clear-cut definition, so I'm only showing total memory and memory that is definitely allocated for the CPU backend.
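
Purely illustrative: a rough sketch of what such an extension to the ggml backend interface could look like. The function name and signature below are hypothetical and do not exist in ggml today:

// Hypothetical getter: total device memory currently allocated by this backend,
// including internal pools such as the CUDA memory pool, so that it could be
// attributed to "compute" instead of "unaccounted".
GGML_API size_t ggml_backend_get_allocated_memory(ggml_backend_t backend);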

include/llama.h (outdated)
Comment on lines 1361 to 1380:

LLAMA_API size_t llama_backend_count(const struct llama_context * ctx);

struct llama_backend_info_data {
    const char * name;

    struct {
        const char * name;
        const char * description;

        // device memory is in bytes
        size_t memory_total;             // total memory as reported by the device
        size_t memory_free;              // free memory as reported by the device
        size_t memory_used_self;         // sum of model, context, and compute
        size_t memory_used_self_model;   // memory allocated for the model
        size_t memory_used_self_context; // memory allocated for the context
        size_t memory_used_self_compute; // memory allocated for temporary compute buffers
        size_t memory_used_unaccounted;  // memory with unknown use, e.g. drivers or other programs, total - (free + self)
    } device;
};
Member


I don't think tracking memory usage per-backend is the right way to do this. There are two reasonable options:

  • Tracking memory per device
  • Tracking memory per buffer type

In practice, it can be hard to map a buffer type to a device. For example, should a CUDA_Host buffer count as the CPU device, or as a CUDA device? What device should a CUDA_Split buffer belong to? It allocates memory from multiple devices.

Therefore, I think the only reasonable way to do this is per buffer type.

@JohannesGaessler
Collaborator Author

JohannesGaessler commented Sep 14, 2025

Thank you for the pointers, I've changed the code as follows:

  • For now expose only a function llama_memory_breakdown_print via the llama API. There is an internal method llama_context::memory_breakdown which returns a map from buffer type to the memory used for model, context, and compute.
  • On master the function ggml_backend_sched_get_buffer_size filters based on ggml_backend_t; I've changed this to filter by ggml_backend_buft_t instead, since I've interpreted your comments to mean that this would be the more correct way to do it in general. If a buffer type is not used, return 0 instead of aborting.
  • According to the comments for ggml_backend_dev_type, as of right now only GPU-type devices have their own memory. So I'm reporting the memory use for each individual GPU-type device while summing up the host memory.

Example print:

llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 4090)   | 24080 = 14350 + ( 9236 =  8231 +      20 +     985) +         493 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 4090)   | 24080 = 16434 + ( 7152 =  6584 +      19 +     548) +         494 |
llama_memory_breakdown_print: |   - CUDA2 (RTX 4090)   | 24080 = 16434 + ( 7152 =  6584 +      19 +     548) +         494 |
llama_memory_breakdown_print: |   - CUDA3 (RTX 4090)   | 24080 = 18074 + ( 5505 =  4938 +      17 +     548) +         501 |
llama_memory_breakdown_print: |   - CUDA4 (RTX 4090)   | 24080 = 16434 + ( 7152 =  6584 +      19 +     548) +         494 |
llama_memory_breakdown_print: |   - CUDA5 (RTX 4090)   | 24080 = 16434 + ( 7152 =  6584 +      19 +     548) +         494 |
llama_memory_breakdown_print: |   - Host               |                  20943 = 20928 +       0 +      15                |

Edit: I accidentally copy-pasted a table produced by a buggy version.

@github-actions github-actions bot added the examples and ggml (changes relating to the ggml tensor library for machine learning) labels Sep 14, 2025
@slaren
Member

slaren commented Sep 16, 2025

  • On master the function ggml_backend_sched_get_buffer_size filters based on ggml_backend_t; I've changed this to filter by ggml_backend_buft_t instead, since I've interpreted your comments to mean that this would be the more correct way to do it in general. If a buffer type is not used, return 0 instead of aborting.

I don't think this change is necessary, because the way ggml_backend_sched is intended to work is that it allocates a compute buffer for each backend, and ggml_backend_sched_get_buffer_size returns the size of the compute buffer associated with that backend. By default, the buffer type used for each backend is the default buffer type of the backend, but users can override this if necessary. This is used for example to allocate a host buffer as the compute buffer for the CPU backend.

So I think this makes the API a bit awkward to use, because it forces users to know which buffer types the backends are using, which they may not know if they are using the default buffer type. It will also be a breaking API change for many ggml applications.

Instead, I would suggest adding a function ggml_backend_sched_get_buffer_type to obtain the buffer type of a backend. Then you can do something like this to obtain the allocated size of each buffer type:

std::map<ggml_backend_buffer_type_t, size_t> sizes;
for (auto * backend : backends) {
    auto * buft = ggml_backend_sched_get_buffer_type(sched, backend);
    auto   size = ggml_backend_sched_get_buffer_size(sched, backend);
    sizes[buft] += size;
}

@JohannesGaessler
Collaborator Author

I'm noticing that in llama-kv-cache.cpp and llama-recurrent.cpp there are private methods which return the cumulative size of all buffers. This is the same behavior that I implemented with memory_use, except without support for filtering by buffer type. I think it would make sense to unify these methods, using total_size as the name and making the buffer type argument optional: if set, return only the size of the buffers matching that type.
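
A minimal sketch of what such a unified helper could look like, assuming a hypothetical cache-like class whose bufs member stands in for the actual buffer list; none of these names are the real llama.cpp types:

#include <vector>

#include "ggml-backend.h"

// Hypothetical cache-like class; `bufs` stands in for the buffers the cache owns.
struct kv_cache_sketch {
    std::vector<ggml_backend_buffer_t> bufs;

    // Cumulative size of all buffers; if `buft` is given, only buffers of that type count.
    size_t total_size(ggml_backend_buffer_type_t buft = nullptr) const {
        size_t size = 0;
        for (ggml_backend_buffer_t buf : bufs) {
            if (buft == nullptr || ggml_backend_buffer_get_type(buf) == buft) {
                size += ggml_backend_buffer_get_size(buf);
            }
        }
        return size;
    }
};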

Currently when using mmap a size of 0 MB is reported for the amount of host memory used for the model. This is correct in the sense that no actual memory buffers have been allocated, but I think it can be confusing. So I think the correct way to handle this is to print either Host (mmap) or Host (no mmap) as the label. I don't see a good way to determine mmap status on shutdown, should I explicitly store it?

@slaren
Member

slaren commented Sep 17, 2025

So I think the correct way to handle this is to print either Host (mmap) or Host (no mmap) as the label. I don't see a good way to determine mmap status on shutdown, should I explicitly store it?

I will take a more in-depth look at this later, but mmapped and non-mmapped buffers have different buffer types. For mmap, you should get a CPU_Mapped buffer, while for non-mmap you would get either a CPU buffer, or a host buffer such as CUDA_Host.
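
Illustrative only: one way the label could be derived from the buffer type itself rather than from stored mmap state. The name-suffix check below is an assumption made for the sake of the example, not the PR's code:

#include <cstring>

#include "ggml-backend.h"

// Heuristic: mmapped weights show up as "*_Mapped" buffer types, e.g. "CPU_Mapped"
// or "Metal_Mapped", so the printed label can be chosen from the type name alone.
static bool buft_is_mapped(ggml_backend_buffer_type_t buft) {
    return std::strstr(ggml_backend_buft_name(buft), "_Mapped") != nullptr;
}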

@JohannesGaessler
Collaborator Author

I was not aware of the existence of CPU_Mapped buffers; they are currently not included in the calculation. This is probably a bug then; I'll look into it.

@slaren
Member

slaren commented Sep 17, 2025

There are other buffer types, such as CPU_REPACK for repacked weights. It would be more reliable to look at every allocated buffer and get its buffer type, rather than trying to enumerate the buffer types and looking for buffers of that type.
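
A minimal sketch of that approach, assuming a hypothetical bufs vector that stands in for whatever list of allocated buffers the model and context keep:

#include <map>
#include <vector>

#include "ggml-backend.h"

// Walk the buffers that were actually allocated and bucket their sizes by buffer type,
// instead of enumerating buffer types up front and searching for matching buffers.
static std::map<ggml_backend_buffer_type_t, size_t> sizes_by_buft(
        const std::vector<ggml_backend_buffer_t> & bufs) {
    std::map<ggml_backend_buffer_type_t, size_t> sizes;
    for (ggml_backend_buffer_t buf : bufs) {
        sizes[ggml_backend_buffer_get_type(buf)] += ggml_backend_buffer_get_size(buf);
    }
    return sizes;
}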

@JohannesGaessler
Collaborator Author

I refactored the implementation: instead of returning a single size_t value for a specified buffer type, the methods now return a map from buffer type to memory use, and the maps are merged in a bottom-up way. For the print, memory use is reported per GPU device and the rest is consolidated as host memory.
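
A rough sketch of the bottom-up merge, with illustrative names only (each component, e.g. model weights, KV cache, scheduler compute buffers, would produce its own map):

#include <map>

#include "ggml-backend.h"

// Accumulate one component's per-buffer-type memory use into the combined map.
static void merge_into(std::map<ggml_backend_buffer_type_t, size_t>       & dst,
                       const std::map<ggml_backend_buffer_type_t, size_t> & src) {
    for (const auto & [buft, size] : src) {
        dst[buft] += size;
    }
}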

Member

@slaren slaren left a comment

The assumption that every buffer whose type is not the default buffer type of a GPU device is a host buffer is wrong, and it will already break when using -sm row, since it will count the split buffers as host memory.

IMO this cannot be done reliably without changing the ggml-backend API, and it is not worth trying to hack it. I think showing a breakdown per buffer type should be good enough.

@JohannesGaessler
Collaborator Author

In the current version, memory is printed per GPU device in conjunction with its default buffer type. All other buffer types are printed separately with only the memory use. For -sm row the print is ugly, with large amounts of unaccounted memory per device and the split buffers listed below:

llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 4090)   | 24080 = 19268 + ( 350 =     0 +     240 +     109) +        4462 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 4090)   | 24080 = 22696 + ( 274 =     0 +      16 +     258) +        1110 |
llama_memory_breakdown_print: |   - CUDA0_Split        |                  3510 =  3510 +       0 +       0                |
llama_memory_breakdown_print: |   - CUDA1_Split        |                   644 =   644 +       0 +       0                |
llama_memory_breakdown_print: |   - CPU_Mapped         |                   281 =   281 +       0 +       0                |
llama_memory_breakdown_print: |   - CUDA_Host          |                     9 =     0 +       0 +       9                |

To my knowledge it is not possible to retrieve the actually allocated memory per device, though. So for now the memory breakdown print, and later the automation of distributing tensors to devices, will not work properly. I'll revisit this after a more general tensor parallelism implementation.

@slaren slaren requested a review from ggerganov September 22, 2025 22:18
Member

@ggerganov ggerganov left a comment

On Mac, the Metal_Mapped buffers appear to be counted a second time as "unaccounted" for the main device:

0.01.817.191 I llama_memory_breakdown_print: | memory breakdown [MiB]   | total    free     self   model   context   compute    unaccounted |
0.01.817.194 I llama_memory_breakdown_print: |   - Metal (Apple M4 Max) | 28753 = 10079 + ( 2156 =     0 +     384 +    1772) +       16518 |
0.01.817.195 I llama_memory_breakdown_print: |   - Metal_Mapped         |                  16497 = 16497 +       0 +       0                |
0.01.817.196 I llama_memory_breakdown_print: |   - CPU                  |                     12 =     0 +       0 +      12                |
0.01.817.197 I llama_memory_breakdown_print: |   - CPU_Mapped           |                    166 =   166 +       0 +       0                |

If I run with --no-mmap, it looks ok:

0.06.757.370 I llama_memory_breakdown_print: | memory breakdown [MiB]   | total    free     self   model   context   compute    unaccounted |
0.06.757.371 I llama_memory_breakdown_print: |   - Metal (Apple M4 Max) | 28753 = 10246 + (18486 = 16330 +     384 +    1772) +          20 |
0.06.757.371 I llama_memory_breakdown_print: |   - CPU                  |                    178 =   166 +       0 +      12                |

I don't think it is an issue, so we can merge. Though let me know if you have ideas for how to fix it.

@JohannesGaessler
Collaborator Author

I was implicitly assuming that (other than split buffers) each device has only a single buffer type. I changed the logic as follows (a minimal sketch is shown after the list):

  • Get a map of buffer type -> memory breakdown.
  • Iterate over buffer types:
    • If ggml_backend_buft_is_host, then accumulate memory breakdown for host.
    • If ggml_backend_buft_get_device, then accumulate memory breakdown for that device.
  • Print memory breakdown for each device.
  • Print memory breakdown for host.
  • Print memory breakdown for each buffer type that was not used so far.
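
A minimal sketch of this bucketing logic; the struct and function names, and the precedence of the host check over the device check, are assumptions for illustration rather than the actual implementation:

#include <map>

#include "ggml-backend.h"

// Illustrative per-buffer-type breakdown, split into the categories that get printed.
struct memory_breakdown {
    size_t model   = 0;
    size_t context = 0;
    size_t compute = 0;
};

static void accumulate(memory_breakdown & dst, const memory_breakdown & src) {
    dst.model   += src.model;
    dst.context += src.context;
    dst.compute += src.compute;
}

static void bucket_by_device(
        const std::map<ggml_backend_buffer_type_t, memory_breakdown> & per_buft,
        std::map<ggml_backend_dev_t, memory_breakdown>               & per_device,
        memory_breakdown                                             & host) {
    for (const auto & [buft, mb] : per_buft) {
        if (ggml_backend_buft_is_host(buft)) {
            accumulate(host, mb);            // host memory is summed up across buffer types
        } else if (ggml_backend_dev_t dev = ggml_backend_buft_get_device(buft)) {
            accumulate(per_device[dev], mb); // attributed to the owning device
        }
        // buffer types that are neither host nor tied to a device are printed separately
    }
}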

@ggerganov please re-test.

@slaren
Member

slaren commented Sep 23, 2025

  • If ggml_backend_buft_is_host, then accumulate memory breakdown for host.

is_host also requires that the tensors are in standard ggml layout, which excludes repacked buffer types.

@ggerganov
Member

@JohannesGaessler The output on Mac looks like this now, which I think is OK:

0.07.327.664 I llama_memory_breakdown_print: | memory breakdown [MiB]   | total    free     self   model   context   compute    unaccounted |
0.07.327.664 I llama_memory_breakdown_print: |   - Metal (Apple M4 Max) | 28753 = 11700 + (17018 = 16330 +     384 +     304) +          34 |
0.07.327.665 I llama_memory_breakdown_print: |   - Host                 |                    178 =   166 +       0 +      12                |

@JohannesGaessler
Collaborator Author

To my understanding, the CI failures are not caused by this PR. The repacked buffers not being accumulated as "Host" and instead being reported separately is a cosmetic issue, but I think not a severe one. I would merge the PR as-is unless one of you vetoes this.

@ggerganov
Member

For reference, this is how it reports with a CPU-only build and repack buffers:

0.06.644.657 I llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
0.06.644.657 I llama_memory_breakdown_print: |   - Host               |                 1297 =   526 +     254 +     517                |
0.06.644.658 I llama_memory_breakdown_print: |   - CPU_REPACK         |                 1721 =  1721 +       0 +       0                |

I think it is OK.

@JohannesGaessler JohannesGaessler merged commit e789095 into ggml-org:master Sep 24, 2025
61 of 67 checks passed