
Conversation

@AlekseiNikiforovIBM
Contributor

Allow loading little endian models on big endian systems.
This would allow using any models downloaded via ollama unmodified.

@github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jan 14, 2025
@JohannesGaessler
Collaborator

I don't have any big endian systems for testing or maintenance. Do you pledge to help with maintaining big endian support long-term?

@AlekseiNikiforovIBM
Contributor Author

Yes, we intend to maintain big endian support long term.

We could also help integrate a z/VM guest as a test system into the CI if you'd like.

ggml/src/ggml.c Outdated
Comment on lines 39 to 66
#if defined(__gnu_linux__)
#include <endian.h>
#else
#define le64toh(x) (x)
#define le32toh(x) (x)
#define le16toh(x) (x)
#endif

// endianness conversion
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define convert_from_le16(x) UNUSED(x)
#define convert_from_le32(x) UNUSED(x)
#define convert_from_le64(x) UNUSED(x)
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
static inline void convert_from_le16(void * value) {
    *((uint16_t*)value) = le16toh(*((uint16_t*)value));
}

static inline void convert_from_le32(void * value) {
    *((uint32_t*)value) = le32toh(*((uint32_t*)value));
}

static inline void convert_from_le64(void * value) {
    *((uint64_t*)value) = le64toh(*((uint64_t*)value));
}
#else
#error Unexpected or undefined __BYTE_ORDER__
#endif
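To illustrate how these helpers are intended to be used, here is a hypothetical caller (a sketch, not code from the PR; read_le_u32, buf and offset are illustrative names, and <stdint.h>/<string.h> are assumed to be available as they are in ggml.c): a multi-byte field read from a little-endian GGUF file is converted in place, which is a no-op on little-endian hosts and a byteswap on big-endian hosts.

// Hypothetical usage sketch (not from the PR): read a uint32 stored
// little-endian in a GGUF buffer and return it in host byte order,
// relying on the convert_from_le32() helper defined above.
static uint32_t read_le_u32(const uint8_t * buf, size_t offset) {
    uint32_t value;
    memcpy(&value, buf + offset, sizeof(value)); // copy the raw little-endian bytes
    convert_from_le32(&value);                   // no-op on LE hosts, byteswap on BE hosts
    return value;
}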

Comment on lines 6548 to 6590
static void ggml_byteswap_q4_0(void * restrict buffer, size_t elements) {
    block_q4_0 * data_ptr = (block_q4_0 *) buffer;
    for (size_t i = 0; i < elements; ++i) {
        convert_from_le16(&(data_ptr[i].d));
    }
}
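For context, the q4_0 block that this function iterates over looks roughly like this (layout reproduced from memory of ggml's common headers, so treat it as a sketch): the only multi-byte member is the FP16 scale d, while the packed 4-bit quants qs are single bytes and therefore have no byte order of their own.

#define QK4_0 32
typedef struct {
    ggml_fp16_t d;           // scale (delta), a 16-bit float: multi-byte, needs byteswapping
    uint8_t     qs[QK4_0/2]; // 32 x 4-bit quants packed two per byte: endian-independent as bytes
} block_q4_0;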
Collaborator

Why is the endianness converted only for the scale d and not for the quantized values? More generally, quantized models require a lot of bit manipulation in order to work. Did you test this code with any quantized models, or only with models using conventional scalar data types such as FP16?

Contributor Author

I didn't test this particular function, but ggml_byteswap_q4_k and ggml_byteswap_q6_k are used, and I implemented this function the same way.

I can disable all functions which are not explicitly tested yet, or I could test them if you point me to models that use them.

Collaborator

More generally speaking, I've thought about the problem of endianness some more and there are quite a lot of potential pitfalls because we do a lot of low-level bitwise manipulations. There's also e.g. the question of how to handle interaction with backends other than CPU which would likely still assume little endianness. Quite frankly I'm not sure whether this is a feature that would be worth the complexity to support given how rare big endian processors are. My personal opinion is that something like this would be fine to merge if it does not interfere with any little endian code and it is clear that big endian issues are wholly your responsibility to determine and resolve. @ggerganov @slaren your input would be appreciated.

Member

I think it would be hard to justify the maintenance cost. I am not sure how this interacts with the existing support for big endian gguf files in the gguf python library and convert_hf_to_gguf.py. It might make more sense to instead improve on that, for example by adding a tool to convert a gguf file to a different endianness. Model repositories like ollama can deal with this by serving both versions on demand, if they wish.

Contributor Author

I think you're correct about existing big endian files. I'll add a cmdline switch to disable byteswapping on s390x to allow loading such existing files.

Unfortunately, everyone uploading two copies of each model, little endian and big endian, is unlikely to happen. That is why we'd like to have a solution for loading little endian models on big endian systems.

Of course, byteswapping in advance would help, and would enable, for example, using mmap. But an additional big-endian-specific step isn't always possible, as in the case of ollama, and it can instead be done while loading the model, as in this PR. While I didn't measure performance, I expect loading a model to take considerably less time than actually running it, and I don't expect byteswapping during loading to change that noticeably.

As for the gguf python library, if there is a test suite beyond the one described here (https://github.com/ggerganov/llama.cpp/blob/master/ci/README.md), then I could also check it.

@taronaeo
Collaborator

If I may chime in on the consideration of support, this PR is a pretty big step forward as umbrella support for zSystems and for problematic/unsupported models, while we actively work on implementing SIMD support individually (taronaeo/llama.cpp-s390x@891922f).

At present, even with --bigendian conversion using the convert_hf_to_gguf.py tool, Llama-3.2-1B fails to run correctly; the behaviour is similar to what is seen in #3298 (comment), as demonstrated here:

$ build/bin/llama-cli -m /opt/hf_models/llama-3.2-1b-be.F32.gguf -n 50 -t 8 -p "Write me a dog walking business idea  1." --seed 1568795874

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : VXE = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

sampler seed: 1568795874
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 50, n_keep = 1

Write me a dog walking business idea  1.GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

llama_perf_sampler_print:    sampling time =       3.97 ms /    62 runs   (    0.06 ms per token, 15625.00 tokens per second)
llama_perf_context_print:        load time =     702.51 ms
llama_perf_context_print: prompt eval time =     459.48 ms /    12 tokens (   38.29 ms per token,    26.12 tokens per second)
llama_perf_context_print:        eval time =    9103.29 ms /    49 runs   (  185.78 ms per token,     5.38 tokens per second)
llama_perf_context_print:       total time =    9602.03 ms /    61 tokens

If support for converting little-endian to big-endian on the fly is provided, albeit understandably with a performance penalty, models such as Llama-3.2 can start running properly from little-endian files while we work on a patch to make them work as big-endian.

@AlekseiNikiforovIBM
Contributor Author

I've implemented the '--no-byteswap' switch and tested it with a model converted with the help of #11349.

Introduce byteswap functions for ggml data.
Implement some of them.
Currently tested on llama3.2.
@AlekseiNikiforovIBM
Contributor Author

Rebased PR to resolve merge conflicts

@AlekseiNikiforovIBM
Contributor Author

I've added an option to disable byteswapping on s390x. While byteswapping could also be implemented for little-endian systems, I believe most models are little endian and there's no demand for such an option at the moment.

I'd like to proceed with this PR. What's needed for that?

@slaren
Member

slaren commented Jan 30, 2025

I don't think this can be merged in its current state. As it is, it is just going to be a maintenance burden and will increase the technical debt that we will need to pay when we eventually come back to work on this code and have to figure out a way to deal with this mess.

Fundamentally, the problem is that the gguf loader is not designed to support byteswapping. It just wasn't a consideration when it was created, and it would take a major effort to redesign the system to allow transparent byteswapping during loading.

Additionally, you need to figure out how to deal with big endian GGUF files. Adding a command line flag that just dumps the responsibility of figuring that out onto the user is not a good way to deal with this. I can see two options: remove support for big endian GGUF files entirely, or add the necessary logic to support both types.

Alternatively, you could instead extend the existing support for big endian GGUF files. This would be by far the easiest route, and the most likely to be merged.

@AlekseiNikiforovIBM
Contributor Author

I've checked gguf-py and it determines the byte order based on the version field. I'll remove the cmdline switch and rework the C++ code to use the same logic.

As for byteswapping itself, I'll try reworking it.
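For reference, a minimal sketch of that version-based check, as I understand the gguf-py heuristic (illustrative only, not the PR's actual code): valid GGUF versions are small integers such as 3, so when the 32-bit version field is read in the wrong byte order its low 16 bits come out as zero.

// Illustrative sketch: decide whether a GGUF file was written in the
// opposite byte order by inspecting the version field read as-is.
// A byteswapped version 3 reads back as 0x03000000, so its low 16 bits
// are zero, which can never happen for a valid native-order version.
static bool gguf_version_looks_byteswapped(uint32_t version_as_read) {
    return version_as_read != 0 && (version_as_read & 0xFFFFu) == 0;
}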

Bitfields are allocated in a different order on s390x.
Add ggml_ prefix to their names.
Now little-endian systems can load big-endian models too.
The model endianness heuristic is based on guessing from the
model version field.

Additional fixes for the testsuite after removing the
capability to write non-native-endian files.
@github-actions bot added the python (python script changes) label on Feb 18, 2025
@AlekseiNikiforovIBM
Contributor Author

I've reworked the PR a bit. First of all, I've removed the command line option I introduced earlier. Now there is a heuristic similar to the one in the Python code, which guesses the endianness based on the value of the version field.

I've removed the capability to save gguf files in non-native endianness. Files are now always saved in native endianness.

I couldn't check byteswapping during asynchronous data loading in llama_model_loader::load_all_data, even with CUDA on x86, because I couldn't force it onto that code path: I got upload_backend == nullptr every way I tried. I'm pretty sure it should work, but if you could point me to how to force llama.cpp to run this code path, I could verify that it works too.

Please let me know if there are still any pain points present in this PR.

@AlekseiNikiforovIBM
Contributor Author

Are those 3 failures related to changes in this PR? If yes, what exactly is wrong?

@slaren
Member

slaren commented Feb 26, 2025

The changes look like a good improvement; however, the main problem is still that this is not transparent to the users of the gguf library, and requires additional code to check and byteswap the tensor data if necessary. llama.cpp is not the only application of ggml.

Ideally, this should be completely transparent to the user, but that's not possible without also moving the data loading code (including mmap support) to the gguf library, which would be a major change. When the gguf loader is used without no_alloc, it will also not perform byteswapping of the tensor data, resulting in the application silently loading incorrect tensor data.

As it is, other applications that use the gguf library will fail silently when loading a gguf file with a non-native byte order, which is not great. It will also force us to add all the boilerplate for byteswapping to all new code using the gguf library. I am not saying that this is necessarily a deal-breaker when it comes to the decision of merging this PR, but it would make the decision a lot easier if these issues were addressed. If that's not possible, I will defer to @ggerganov or other maintainers.

@slaren
Member

slaren commented Feb 26, 2025

Are those 3 failures related to changes in this PR? If yes, what exactly is wrong?

I am not sure what the issue is, but I don't think it's caused by the changes in this PR. Merging the latest changes from master may fix it.

@ggerganov
Member

If I understand correctly, on master, big-endian systems are currently capable of working with big-endian GGUFs. I think this is already good enough support. I can see that being able to load an LE GGUF on a BE system would be convenient, but I agree that the changes proposed here do not justify the associated maintenance. I personally haven't seen/used a BE machine in my life, so my assumption is that this is a very niche use case - please correct me if I am wrong.

The ideal solution is indeed to handle all this seamlessly in the gguf implementation, but as @slaren pointed out, it appears to be a really big change. So unless we see some strong argument that this feature is needed, I would also prefer not to merge as it is.

@AlekseiNikiforovIBM
Contributor Author

The changes look like a good improvement, however the main problem is still that this is not transparent to the users of the gguf library, and requires additional code to check and byteswap the tensor data if necessary. llama.cpp is not the only application of ggml.

What if I add a new field to struct gguf_init_params, named something like allow_byteswapping? It would allow all clients to opt in to this feature. I'd make it off by default and enable it for llama.cpp. This would keep other potential gguf library clients from breaking.
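A minimal sketch of what that opt-in could look like (allow_byteswapping is the proposed, not-yet-existing field; no_alloc and ctx are the fields gguf_init_params already has, and the wrapper function below is purely illustrative):

// Sketch of the proposed opt-in; not final API.
struct gguf_init_params {
    bool no_alloc;
    struct ggml_context ** ctx;
    bool allow_byteswapping; // proposed: off by default, so existing gguf
                             // library users keep their current behaviour
};

// Hypothetical usage from llama.cpp's loader, which would opt in explicitly:
static struct gguf_context * load_gguf_with_byteswap(const char * fname, struct ggml_context ** meta) {
    struct gguf_init_params params = {
        /*.no_alloc           =*/ true,
        /*.ctx                =*/ meta,
        /*.allow_byteswapping =*/ true,
    };
    return gguf_init_from_file(fname, params);
}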

When the gguf loader is used without no_alloc, it will also not perform byteswapping of the tensor data, resulting in the application silently loading incorrect tensor data.

I'll disable byteswapping if no_alloc is enabled.

I agree that the changes proposed here do not justify the maintenance associated.

As for maintaining the byteswapping parts and/or big endian support, my colleagues and I (currently https://github.com/Andreas-Krebbel when I'm not available) are willing to support those parts. Feel free to assign any issue arising in this context to me or my colleagues. We could also set up an s390x CI runner to validate any changes on a BE target.

I personally haven't seen/used a BE machine in my life, so my assumption is that this is a very niche use case - please correct me if I am wrong.

While the IBM Z platform may be a niche platform in terms of the number of systems out there, it powers critical workloads around the world in finance, airline reservations, insurance, retail and many more. Among the biggest players in these industries, the IBM Z platform is seeing good market adoption. The ability to run llama.cpp on IBM Z without too much hassle allows large enterprises to use it in combination with their traditional workloads.

Enable it for llama.cpp
@AlekseiNikiforovIBM
Contributor Author

Disabling byteswapping when no_alloc is enabled didn't work out. But with the latest commit, byteswapping is opt-in.

@JohannesGaessler
Collaborator

My understanding is that BE systems can already run llama.cpp, but it's inconvenient. So BE support with mainline model files would be a convenience gain for only a small number of people. How critical those BE systems are is not relevant. So any implementation must be done in a way that is minimally invasive and does not induce any side effects.

@AlekseiNikiforovIBM
Contributor Author

My understanding is that BE systems can already run llama.cpp, but it's inconvenient.

While llama.cpp on BE systems can run BE models, it cannot run LE models.

One of the users of llama.cpp is ollama, and it does not work on BE systems, because ollama downloads models from an online repository and those models are always LE. There could be BE models there, but I have not seen any in their repositories yet.

And while for llama.cpp it is possible to run an external Python script to convert a model from LE to BE or the other way around, for ollama there is, as far as I can see at the moment, no good place where this could be done. The only good option is to do it on the fly.

So any implementations must be done in a way that is minimally invasive and does not induce any side effects.

I'm trying to create such an implementation, and I'd be grateful if you could point out any remaining issues of that kind in the current implementation.

@AlekseiNikiforovIBM
Contributor Author

Do you have any suggestions for improving this code further in order to bring it to an acceptable state?

@ggerganov
Member

I am afraid the changes still add too much extra logic that we will have to work around over time, and it does not justify the small added benefit. Currently this code handles the quantization types on a case-by-case basis, which is OK but not great. However, a GGUF can contain more generic data (such as the unicode strings that you added code to handle in llama.cpp). Without a generic way to tag the data in the GGUF that requires byte swapping, and to handle it seamlessly for the user of the gguf library, I don't think it is worth adding the technical debt.

One of the users of llama.cpp is ollama

That's a good point and I think the better solution would be for providers like ollama to offer big-endian models seamlessly. However, my understanding is that ollama no longer uses llama.cpp, so I guess this is no longer a factor in the decision to add BE support or not.

@ericcurtin
Collaborator

ericcurtin commented Apr 30, 2025

That's a good point and I think the better solution would be for providers like ollama to offer big-endian models seamlessly. However, my understanding is that ollama no longer uses llama.cpp, so I guess this is no longer a factor in the decision to add BE support or not.

FWIW, @AlekseiNikiforovIBM and @ggerganov, RamaLama would probably pick up the slack here; it's one of the reasons RamaLama exists :)

If it works with upstream llama.cpp, we can make it work with RamaLama.

@AlekseiNikiforovIBM
Contributor Author

Closing this PR as abandoned
