Conversation

JohannesGaessler
Collaborator

This PR makes it so that on exit a breakdown of the memory use is printed. For example:

llama_print_memory_breakdown: memory breakdown:      total   free     self   model   context   compute    unaccounted
llama_print_memory_breakdown:   - CUDA0 (RTX 4090):  24080 = 9436 + (14193 = 13169 +      38 +     985) +         451
llama_print_memory_breakdown:   - CUDA1 (RTX 4090):  24080 = 9868 + (13756 = 13169 +      38 +     548) +         456
llama_print_memory_breakdown:   - CUDA2 (RTX 4090):  24080 = 9868 + (13756 = 13169 +      38 +     548) +         456
llama_print_memory_breakdown:   - CPU (EPYC 7742):  515628              72 =     0 +      57 +      15

Explanation:

size_t memory_total;             // total memory as reported by the device
size_t memory_free;              // free memory as reported by the device
size_t memory_used_self;         // sum of model, context, and compute
size_t memory_used_self_model;   // memory allocated for the model
size_t memory_used_self_context; // memory allocated for the context
size_t memory_used_self_compute; // memory allocated for temporary compute buffers
size_t memory_used_unaccounted;  // memory with unknown use, e.g. drivers or other programs, total - (free + self)
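
For example, taking the CUDA0 row of the table above: self = model + context + compute (13169 + 38 + 985 ≈ 14193, up to rounding), and unaccounted = total - (free + self) = 24080 - (9436 + 14193) = 451.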

The intended immediate use is to make it easier to efficiently distribute models across devices. I also intend to re-use this code to determine automatically which parts of the model to put on which device for optimal performance. Long-term I would also want to expose this information via the HTTP server to establish a Pareto frontier of quality vs. memory use for different quantizations of different models.

Open problems:

  • I added a function llama_print_memory_breakdown to the llama API which produces the above table on the console. Internally this function uses another new function llama_backend_info which returns a struct with information about the backends used by a llama_context. I'm not sure whether the latter should be part of the public API, and if so, in what form.
  • I added methods like llama_model::memory_use(ggml_backend_dev_t dev) which return the memory used on a specified device. But I'm not sure whether the device is the correct argument type here. Would it make more sense to pass a ggml_backend_buffer_type_t? In particular, I think this is the only correct way to handle e.g. CUDA_Host buffers.
  • The memory for e.g. the CUDA pools is currently under "unaccounted", but it should be under "compute". Currently it is not possible for llama.cpp to retrieve this information. I think it would make sense to extend the ggml backend interface with a function that returns the total amount of device memory allocated by the backend (a rough sketch of what such an extension could look like follows this list).
  • I'm not sure what to show, if anything, for the CPU. "Free" memory in this context does not have a clear-cut definition, so I'm only showing total memory and memory that is definitely allocated for the CPU backend.
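
Purely illustrative: a rough sketch of what such an extension to the ggml backend interface could look like. The function name and signature below are hypothetical and do not exist in ggml today:

// Hypothetical getter: total device memory currently allocated by this backend,
// including internal pools such as the CUDA memory pool, so that it could be
// attributed to "compute" instead of "unaccounted".
GGML_API size_t ggml_backend_get_allocated_memory(ggml_backend_t backend);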

include/llama.h (outdated)
Comment on lines 1361 to 1380:

LLAMA_API size_t llama_backend_count(const struct llama_context * ctx);

struct llama_backend_info_data {
    const char * name;

    struct {
        const char * name;
        const char * description;

        // device memory is in bytes
        size_t memory_total;             // total memory as reported by the device
        size_t memory_free;              // free memory as reported by the device
        size_t memory_used_self;         // sum of model, context, and compute
        size_t memory_used_self_model;   // memory allocated for the model
        size_t memory_used_self_context; // memory allocated for the context
        size_t memory_used_self_compute; // memory allocated for temporary compute buffers
        size_t memory_used_unaccounted;  // memory with unknown use, e.g. drivers or other programs, total - (free + self)
    } device;
};
Member


I don't think tracking memory usage per-backend is the right way to do this. There are two reasonable options:

  • Tracking memory per device
  • Tracking memory per buffer type

In practice, it can be hard to map a buffer type to a device. For example, should a CUDA_Host buffer count as the CPU device, or as a CUDA device? What device should a CUDA_Split buffer belong to? It allocates memory from multiple devices.

Therefore, I think the only reasonable way to do this is per buffer type.

@JohannesGaessler
Collaborator Author

JohannesGaessler commented Sep 14, 2025

Thank you for the pointers, I've changed the code as follows:

  • For now expose only a function llama_memory_breakdown_print via the llama API. There is an internal method llama_context::memory_breakdown which returns a map from buffer type to the memory used for model, context, and compute.
  • On master the function ggml_backend_sched_get_buffer_size filters based on ggml_backend_t; I've changed this to filter by ggml_backend_buft_t instead, since I've interpreted your comments to mean that this would be the more correct way to do it in general. If a buffer type is not used, return 0 instead of aborting.
  • According to the comments for ggml_backend_dev_type, as of right now only GPU-type devices have their own memory. So I'm reporting the memory use for each individual GPU-type device while summing up the host memory.

Example print:

llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 4090)   | 24080 = 14350 + ( 9236 =  8231 +      20 +     985) +         493 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 4090)   | 24080 = 16434 + ( 7152 =  6584 +      19 +     548) +         494 |
llama_memory_breakdown_print: |   - CUDA2 (RTX 4090)   | 24080 = 16434 + ( 7152 =  6584 +      19 +     548) +         494 |
llama_memory_breakdown_print: |   - CUDA3 (RTX 4090)   | 24080 = 18074 + ( 5505 =  4938 +      17 +     548) +         501 |
llama_memory_breakdown_print: |   - CUDA4 (RTX 4090)   | 24080 = 16434 + ( 7152 =  6584 +      19 +     548) +         494 |
llama_memory_breakdown_print: |   - CUDA5 (RTX 4090)   | 24080 = 16434 + ( 7152 =  6584 +      19 +     548) +         494 |
llama_memory_breakdown_print: |   - Host               |                  20943 = 20928 +       0 +      15                |

Edit: I accidentally copy-pasted a table produced by a buggy version.

@github-actions github-actions bot added the examples and ggml (changes relating to the ggml tensor library for machine learning) labels Sep 14, 2025
@slaren
Member

slaren commented Sep 16, 2025

  • On master the function ggml_backend_sched_get_buffer_size filters based on ggml_backend_t; I've changed this to filter by ggml_backend_buft_t instead, since I've interpreted your comments to mean that this would be the more correct way to do it in general. If a buffer type is not used, return 0 instead of aborting.

I don't think this change is necessary, because the way ggml_backend_sched is intended to work is that it allocates a compute buffer for each backend, and ggml_backend_sched_get_buffer_size returns the size of the compute buffer associated with that backend. By default, the buffer type used for each backend is the default buffer type of the backend, but users can override this if necessary. This is used for example to allocate a host buffer as the compute buffer for the CPU backend.

So I think this makes the API a bit awkward to use, because it forces users to know which buffer types the backends are using, which they may not know if they are using the default buffer type. It will also be a breaking API change for many ggml applications.

Instead, I would suggest adding a function ggml_backend_sched_get_buffer_type to obtain the buffer type of a backend. Then you can do something like this to obtain the allocated size of each buffer type:

std::map<ggml_backend_buffer_type_t, size_t> sizes;
for (auto * backend : backends) {
    auto * buft = ggml_backend_sched_get_buffer_type(sched, backend);
    auto   size = ggml_backend_sched_get_buffer_size(sched, backend);
    sizes[buft] += size;
}

@JohannesGaessler
Collaborator Author

I'm noticing that in llama-kv-cache.cpp and llama-recurrent.cpp there are private methods which return the cumulative size of all buffers. This is the same behavior that I implemented with memory_use, except without support for filtering by buffer type. I think it would make sense to unify these methods, using total_size as the name and making the buffer type argument optional: if set, return only the size of the buffers matching that type.
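
A minimal sketch of what such a unified helper could look like, assuming a hypothetical cache-like class whose bufs member stands in for the actual buffer list; none of these names are the real llama.cpp types:

#include <vector>

#include "ggml-backend.h"

// Hypothetical cache-like class; `bufs` stands in for the buffers the cache owns.
struct kv_cache_sketch {
    std::vector<ggml_backend_buffer_t> bufs;

    // Cumulative size of all buffers; if `buft` is given, only buffers of that type count.
    size_t total_size(ggml_backend_buffer_type_t buft = nullptr) const {
        size_t size = 0;
        for (ggml_backend_buffer_t buf : bufs) {
            if (buft == nullptr || ggml_backend_buffer_get_type(buf) == buft) {
                size += ggml_backend_buffer_get_size(buf);
            }
        }
        return size;
    }
};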

Currently when using mmap a size of 0 MB is reported for the amount of host memory used for the model. This is correct in the sense that no actual memory buffers have been allocated, but I think it can be confusing. So I think the correct way to handle this is to print either Host (mmap) or Host (no mmap) as the label. I don't see a good way to determine mmap status on shutdown, should I explicitly store it?

@slaren
Member

slaren commented Sep 17, 2025

So I think the correct way to handle this is to print either Host (mmap) or Host (no mmap) as the label. I don't see a good way to determine mmap status on shutdown, should I explicitly store it?

I will take a more in-depth look at this later, but mmapped and non-mmapped buffers have different buffer types. For mmap, you should get a CPU_Mapped buffer, while for non-mmap you would get either a CPU buffer, or a host buffer such as CUDA_Host.
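
Illustrative only: one way the label could be derived from the buffer type itself rather than from stored mmap state. The name-suffix check below is an assumption made for the sake of the example, not the PR's code:

#include <cstring>

#include "ggml-backend.h"

// Heuristic: mmapped weights show up as "*_Mapped" buffer types, e.g. "CPU_Mapped"
// or "Metal_Mapped", so the printed label can be chosen from the type name alone.
static bool buft_is_mapped(ggml_backend_buffer_type_t buft) {
    return std::strstr(ggml_backend_buft_name(buft), "_Mapped") != nullptr;
}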

@JohannesGaessler
Collaborator Author

I was not aware of the existence of CPU_Mapped buffers; they are currently not included in the calculation. This is probably a bug then; I'll look into it.

@slaren
Member

slaren commented Sep 17, 2025

There are other buffer types, such as CPU_REPACK for repacked weights. It would be more reliable to look at every allocated buffer and get its buffer type, rather than trying to enumerate the buffer types and looking for buffers of that type.
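
A minimal sketch of that approach, assuming a hypothetical bufs vector that stands in for whatever list of allocated buffers the model and context keep:

#include <map>
#include <vector>

#include "ggml-backend.h"

// Walk the buffers that were actually allocated and bucket their sizes by buffer type,
// instead of enumerating buffer types up front and searching for matching buffers.
static std::map<ggml_backend_buffer_type_t, size_t> sizes_by_buft(
        const std::vector<ggml_backend_buffer_t> & bufs) {
    std::map<ggml_backend_buffer_type_t, size_t> sizes;
    for (ggml_backend_buffer_t buf : bufs) {
        sizes[ggml_backend_buffer_get_type(buf)] += ggml_backend_buffer_get_size(buf);
    }
    return sizes;
}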

@JohannesGaessler
Collaborator Author

I refactored the implementation: instead of returning a single size_t value for a specified buffer type, the methods now return a map from buffer type to memory use, and the maps are merged in a bottom-up way. For the print, memory use is reported per GPU device and the rest is consolidated as host memory.
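
A rough sketch of the bottom-up merge, with illustrative names only (each component, e.g. model weights, KV cache, scheduler compute buffers, would produce its own map):

#include <map>

#include "ggml-backend.h"

// Accumulate one component's per-buffer-type memory use into the combined map.
static void merge_into(std::map<ggml_backend_buffer_type_t, size_t>       & dst,
                       const std::map<ggml_backend_buffer_type_t, size_t> & src) {
    for (const auto & [buft, size] : src) {
        dst[buft] += size;
    }
}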

Member

@slaren slaren left a comment

The assumption that every buffer whose type is not the default buffer type of a GPU device is a host buffer is wrong, and it will already break when using -sm row, since it will count the split buffers as host memory.

IMO this cannot be done reliably without changing the ggml-backend API, and it is not worth trying to hack it. I think showing a breakdown per buffer type should be good enough.

@JohannesGaessler
Collaborator Author

In the current version, memory is printed per GPU device in conjunction with its default buffer type. All other buffer types are printed separately with only the memory use. For -sm row the print is ugly, with large amounts of unaccounted memory per device and the split buffers listed below:

llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 4090)   | 24080 = 19268 + ( 350 =     0 +     240 +     109) +        4462 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 4090)   | 24080 = 22696 + ( 274 =     0 +      16 +     258) +        1110 |
llama_memory_breakdown_print: |   - CUDA0_Split        |                  3510 =  3510 +       0 +       0                |
llama_memory_breakdown_print: |   - CUDA1_Split        |                   644 =   644 +       0 +       0                |
llama_memory_breakdown_print: |   - CPU_Mapped         |                   281 =   281 +       0 +       0                |
llama_memory_breakdown_print: |   - CUDA_Host          |                     9 =     0 +       0 +       9                |

To my knowledge it is not possible to retrieve the actually allocated memory per device, though. So for now the memory breakdown print, and later the automation of distributing tensors to devices, will not work properly. I'll revisit this after a more general tensor parallelism implementation.

@slaren slaren requested a review from ggerganov September 22, 2025 22:18
Member

@ggerganov ggerganov left a comment

On Mac, the Metal_Mapped buffers appear to be counted a second time as "unaccounted" for the main device:

0.01.817.191 I llama_memory_breakdown_print: | memory breakdown [MiB]   | total    free     self   model   context   compute    unaccounted |
0.01.817.194 I llama_memory_breakdown_print: |   - Metal (Apple M4 Max) | 28753 = 10079 + ( 2156 =     0 +     384 +    1772) +       16518 |
0.01.817.195 I llama_memory_breakdown_print: |   - Metal_Mapped         |                  16497 = 16497 +       0 +       0                |
0.01.817.196 I llama_memory_breakdown_print: |   - CPU                  |                     12 =     0 +       0 +      12                |
0.01.817.197 I llama_memory_breakdown_print: |   - CPU_Mapped           |                    166 =   166 +       0 +       0                |

If I run with --no-mmap, it looks ok:

0.06.757.370 I llama_memory_breakdown_print: | memory breakdown [MiB]   | total    free     self   model   context   compute    unaccounted |
0.06.757.371 I llama_memory_breakdown_print: |   - Metal (Apple M4 Max) | 28753 = 10246 + (18486 = 16330 +     384 +    1772) +          20 |
0.06.757.371 I llama_memory_breakdown_print: |   - CPU                  |                    178 =   166 +       0 +      12                |

I don't think it is an issue, so we can merge. Though let me know if you have ideas for how to fix it.

@JohannesGaessler
Collaborator Author

I was implicitly assuming that (other than split buffers) each device has only a single buffer type. I changed the logic as follows (a minimal sketch is shown after the list):

  • Get a map of buffer type -> memory breakdown.
  • Iterate over buffer types:
    • If ggml_backend_buft_is_host, then accumulate memory breakdown for host.
    • If ggml_backend_buft_get_device, then accumulate memory breakdown for that device.
  • Print memory breakdown for each device.
  • Print memory breakdown for host.
  • Print memory breakdown for each buffer type that was not used so far.
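
A minimal sketch of this bucketing logic; the struct and function names, and the precedence of the host check over the device check, are assumptions for illustration rather than the actual implementation:

#include <map>

#include "ggml-backend.h"

// Illustrative per-buffer-type breakdown, split into the categories that get printed.
struct memory_breakdown {
    size_t model   = 0;
    size_t context = 0;
    size_t compute = 0;
};

static void accumulate(memory_breakdown & dst, const memory_breakdown & src) {
    dst.model   += src.model;
    dst.context += src.context;
    dst.compute += src.compute;
}

static void bucket_by_device(
        const std::map<ggml_backend_buffer_type_t, memory_breakdown> & per_buft,
        std::map<ggml_backend_dev_t, memory_breakdown>               & per_device,
        memory_breakdown                                             & host) {
    for (const auto & [buft, mb] : per_buft) {
        if (ggml_backend_buft_is_host(buft)) {
            accumulate(host, mb);            // host memory is summed up across buffer types
        } else if (ggml_backend_dev_t dev = ggml_backend_buft_get_device(buft)) {
            accumulate(per_device[dev], mb); // attributed to the owning device
        }
        // buffer types that are neither host nor tied to a device are printed separately
    }
}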

@ggerganov please re-test.

@slaren
Member

slaren commented Sep 23, 2025

  • If ggml_backend_buft_is_host, then accumulate memory breakdown for host.

is_host also requires that the tensors are in standard ggml layout, which excludes repacked buffer types.

@ggerganov
Member

@JohannesGaessler The output on Mac looks like this now, which I think is OK:

0.07.327.664 I llama_memory_breakdown_print: | memory breakdown [MiB]   | total    free     self   model   context   compute    unaccounted |
0.07.327.664 I llama_memory_breakdown_print: |   - Metal (Apple M4 Max) | 28753 = 11700 + (17018 = 16330 +     384 +     304) +          34 |
0.07.327.665 I llama_memory_breakdown_print: |   - Host                 |                    178 =   166 +       0 +      12                |

@JohannesGaessler
Collaborator Author

To my understanding, the CI failures are not caused by this PR. The repacked buffers not being accumulated as "Host" and instead being reported separately is a cosmetic issue, but I think not a severe one. I would merge the PR as-is unless one of you vetoes this.

@ggerganov
Member

For reference, this is how it reports with a CPU-only build and repack buffers:

0.06.644.657 I llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
0.06.644.657 I llama_memory_breakdown_print: |   - Host               |                 1297 =   526 +     254 +     517                |
0.06.644.658 I llama_memory_breakdown_print: |   - CPU_REPACK         |                 1721 =  1721 +       0 +       0                |

I think it is OK.

@JohannesGaessler JohannesGaessler merged commit e789095 into ggml-org:master Sep 24, 2025
61 of 67 checks passed