llama: print memory breakdown on exit #15860
Conversation
include/llama.h (Outdated)

```c
LLAMA_API size_t llama_backend_count(const struct llama_context * ctx);

struct llama_backend_info_data {
    const char * name;

    struct {
        const char * name;
        const char * description;

        // device memory is in bytes
        size_t memory_total;             // total memory as reported by the device
        size_t memory_free;              // free memory as reported by the device
        size_t memory_used_self;         // sum of model, context, and compute
        size_t memory_used_self_model;   // memory allocated for the model
        size_t memory_used_self_context; // memory allocated for the context
        size_t memory_used_self_compute; // memory allocated for temporary compute buffers
        size_t memory_used_unaccounted;  // memory with unknown use, e.g. drivers or other programs, total - (free + self)
    } device;
};
```
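For illustration only, a minimal sketch of how an application might have consumed this (later reworked) API. `llama_backend_count` is taken from the quoted diff; `llama_backend_info` is named in the PR description but its signature is never shown in this thread, so the call below is an assumption, not the merged API.

```cpp
#include <cstdio>
#include "llama.h"

// Hedged usage sketch for the outdated proposal above.
static void print_backend_memory(const struct llama_context * ctx) {
    for (size_t i = 0; i < llama_backend_count(ctx); i++) {
        // llama_backend_info(ctx, i) is a hypothetical accessor assumed for this example.
        struct llama_backend_info_data info = llama_backend_info(ctx, i);
        printf("%s (%s): model=%zu B, context=%zu B, compute=%zu B, unaccounted=%zu B\n",
               info.name, info.device.name,
               info.device.memory_used_self_model,
               info.device.memory_used_self_context,
               info.device.memory_used_self_compute,
               info.device.memory_used_unaccounted);
    }
}
```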
I don't think tracking memory usage per-backend is the right way to do this. There are two reasonable options:
- Tracking memory per device
- Tracking memory per buffer type
In practice, it can be hard to map a buffer type to a device. For example, should a `CUDA_Host` buffer count as the CPU device, or as a CUDA device? What device should a `CUDA_Split` buffer belong to? It allocates memory from multiple devices.
Therefore, I think the only reasonable way to do this is per buffer type.
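To make the ambiguity concrete, here is a small sketch (an illustration of how one might probe the mapping via the ggml-backend device API, not code from this PR):

```cpp
#include <cstdio>
#include "ggml-backend.h"

// Illustrative only: for special buffer types such as CUDA_Host or CUDA_Split, the device
// reported here (possibly NULL, an assumption) does not necessarily match where the memory
// is actually consumed, which is why a per-buffer-type breakdown sidesteps the mapping problem.
static void print_buft_device(ggml_backend_buffer_type_t buft) {
    ggml_backend_dev_t dev = ggml_backend_buft_get_device(buft);
    printf("buffer type %-16s -> device %s\n",
           ggml_backend_buft_name(buft),
           dev ? ggml_backend_dev_name(dev) : "(none)");
}
```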
Force-pushed from 3d03a7a to 5c44c51.
Thank you for the pointers, I've changed the code as follows:
Example print:
Edit: I accidentally copy-pasted a table produced by a bugged version.
Force-pushed from 5c44c51 to ceea1b9.
I don't think this change is necessary. It makes the API a bit awkward to use because it forces users to know which buffer types the backends are using, which they may not know if they are using the default buffer type. It will also be a breaking API change for many ggml applications. Instead, I would suggest adding a function like `ggml_backend_sched_get_buffer_type`, so that callers can do something like:

```cpp
std::map<ggml_backend_buffer_type_t, size_t> sizes;
for (auto * backend : backends) {
    auto * buft = ggml_backend_sched_get_buffer_type(sched, backend);
    auto size   = ggml_backend_sched_get_buffer_size(sched, backend);
    sizes[buft] += size;
}
```
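Continuing from the `sizes` map in the sketch above, the accumulated sizes could then be printed per buffer type; the formatting here is only an illustration, not the PR's actual output:

```cpp
// Print one line per buffer type (requires <cstdio>);
// ggml_backend_buft_name is part of the ggml-backend API.
for (const auto & [buft, size] : sizes) {
    printf("%-24s %10.2f MiB\n", ggml_backend_buft_name(buft), size / (1024.0 * 1024.0));
}
```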
One thing I'm noticing: currently, when using mmap, a size of 0 MB is reported for the amount of host memory used for the model. This is correct in the sense that no actual memory buffers have been allocated, but I think it can be confusing. So I think the correct way to handle this is to print either [...]
I will take a more in-depth look at this later, but mmapped and non-mmapped buffers have different buffer types. For mmap, you should get a `CPU_Mapped` buffer type.
I was not aware of the existence of these mapped buffer types.
There are other buffer types as well, e.g. the repacked CPU buffer types.
I refactored the implementation: instead of returning a single struct per backend, memory is now reported per device, with buffers that are not a GPU device's default buffer type counted as host memory.
The assumption that every buffer that is not the default buffer type of a GPU device is a host buffer is wrong, and it will already break when using `-sm row`, since it will count the split buffers as host memory.
IMO this cannot be done reliably without changing the ggml-backend API, and it is not worth trying to hack it. I think showing a breakdown per buffer type should be good enough.
In the current version memory is printed per GPU device in conjunction with its default buffer type. All other buffer types are printed separately with only the memory use. For split buffers it is to my knowledge not possible to retrieve the actually allocated memory per device though, so for now the memory breakdown print (and later the automation of distributing tensors to devices) will not work properly in that case. I'll revisit this after a more general tensor parallelism implementation.
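A minimal sketch of the classification just described, assuming the ggml-backend device API is used for the device-level numbers; this illustrates the idea, not the PR's actual code:

```cpp
#include <cstdio>
#include "ggml-backend.h"

// Fold a buffer type into a GPU device's row only if it is that device's default buffer
// type; any other buffer type gets its own row showing just the memory allocated from it.
// "unaccounted" follows the formula from the header comment: total - (free + self).
static void print_row(ggml_backend_dev_t dev, ggml_backend_buffer_type_t buft, size_t self) {
    if (dev && buft == ggml_backend_dev_buffer_type(dev)) {
        size_t free, total;
        ggml_backend_dev_memory(dev, &free, &total); // free/total as reported by the device
        const size_t unaccounted = total > free + self ? total - (free + self) : 0;
        printf("%-24s | total=%zu free=%zu self=%zu unaccounted=%zu\n",
               ggml_backend_dev_name(dev), total, free, self, unaccounted);
    } else {
        printf("%-24s | self=%zu\n", ggml_backend_buft_name(buft), self);
    }
}
```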
On Mac, the `Metal_Mapped` buffers appear to be counted a second time as "unaccounted" for the main device:

```
0.01.817.191 I llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
0.01.817.194 I llama_memory_breakdown_print: | - Metal (Apple M4 Max) | 28753 = 10079 + ( 2156 = 0 + 384 + 1772) + 16518 |
0.01.817.195 I llama_memory_breakdown_print: | - Metal_Mapped          | 16497 = 16497 + 0 + 0 |
0.01.817.196 I llama_memory_breakdown_print: | - CPU                   | 12 = 0 + 0 + 12 |
0.01.817.197 I llama_memory_breakdown_print: | - CPU_Mapped            | 166 = 166 + 0 + 0 |
```
If I run with `--no-mmap`, it looks ok:
```
0.06.757.370 I llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
0.06.757.371 I llama_memory_breakdown_print: | - Metal (Apple M4 Max) | 28753 = 10246 + (18486 = 16330 + 384 + 1772) + 20 |
0.06.757.371 I llama_memory_breakdown_print: | - CPU                   | 178 = 166 + 0 + 12 |
```
Don't think it is an issue, so we can merge. Though let me know if you have ideas how to fix it.
I was implicitly assuming that (other than split buffers) each device had only a single buffer type. I changed the logic as follows:
@ggerganov please re-test.
@JohannesGaessler The output on Mac looks like this now, which I think is OK:
Force-pushed from be6400d to f2b3c1d.
The CI failures are, to my understanding, not caused by this PR. The repacked buffers not being accumulated under "Host" and instead being reported separately is a cosmetic issue, but I think not a severe one. I would merge the PR as-is unless one of you vetoes this.
For reference, this is how it reports with a CPU-only build and repack buffers:
I think it is OK.
* llama: print memory breakdown on exit
This PR makes it so that on exit a breakdown of the memory use is printed. For example:
Explanation:
The intended immediate use is to make it easier to efficiently distribute models across devices. I also intend to re-use this code to determine automatically which parts of the model to put on which device for optimal performance. Long-term I would also want to expose this information via the HTTP server to establish a Pareto frontier of quality vs. memory use for different quantizations of different models.
Open problems:
- The PR adds a function `llama_print_memory_breakdown` to the llama API which produces the above table on the console. Internally this function uses another new function `llama_backend_info` which returns a struct with information about the backends used by a `llama_context`. I'm not sure whether the latter should be part of the public API, and if yes, in what form.
- Internally there are new functions like `llama_model::memory_use(ggml_backend_dev_t dev)` which return the memory used on a specified device. But I'm not sure whether the device is the correct argument type here. Would it make more sense to pass a `ggml_backend_buffer_type_t`? In particular, I think this is the only correct way to handle e.g. `CUDA_Host` buffers (see the sketch after this list).
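For illustration, a hedged sketch of what a buffer-type-based variant of that accounting could look like; the wrapper struct and its `bufs` member are assumptions made for this example, not the PR's actual implementation:

```cpp
#include <vector>
#include "ggml-backend.h"

// Hypothetical model wrapper used only for this sketch.
struct example_model {
    std::vector<ggml_backend_buffer_t> bufs; // assumed: all buffers holding model weights

    // Sum the sizes of all model buffers allocated from a given buffer type.
    // With a buffer type as the argument, CUDA_Host weights are attributed to the
    // CUDA_Host buffer type instead of being ambiguously assigned to a device.
    size_t memory_use(ggml_backend_buffer_type_t buft) const {
        size_t total = 0;
        for (ggml_backend_buffer_t buf : bufs) {
            if (ggml_backend_buffer_get_type(buf) == buft) {
                total += ggml_backend_buffer_get_size(buf);
            }
        }
        return total;
    }
};
```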