
Conversation

@AlekseiNikiforovIBM
Contributor

Allow loading little endian models on big endian systems.
This would allow using any models downloaded via ollama unmodified.

@github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jan 14, 2025
@JohannesGaessler
Collaborator

I don't have any big endian systems for testing or maintenance. Do you pledge to help with maintaining big endian support long-term?

@AlekseiNikiforovIBM
Contributor Author

Yes, we intend to maintain big endian support long term.

We could also help integrate a z/VM guest as a test system into the CI if you'd like.

ggml/src/ggml.c Outdated
Comment on lines 39 to 66
#if defined(__gnu_linux__)
#include <endian.h>
#else
#define le64toh(x) (x)
#define le32toh(x) (x)
#define le16toh(x) (x)
#endif

// endianness conversion
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define convert_from_le16(x) UNUSED(x)
#define convert_from_le32(x) UNUSED(x)
#define convert_from_le64(x) UNUSED(x)
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
static inline void convert_from_le16(void * value) {
    *((uint16_t*)value) = le16toh(*((uint16_t*)value));
}

static inline void convert_from_le32(void * value) {
    *((uint32_t*)value) = le32toh(*((uint32_t*)value));
}

static inline void convert_from_le64(void * value) {
    *((uint64_t*)value) = le64toh(*((uint64_t*)value));
}
#else
#error Unexpected or undefined __BYTE_ORDER__
#endif
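To illustrate how these helpers are intended to be used, here is a hypothetical caller (a sketch, not code from the PR; read_le_u32, buf and offset are illustrative names, and <stdint.h>/<string.h> are assumed to be available as they are in ggml.c): a multi-byte field read from a little-endian GGUF file is converted in place, which is a no-op on little-endian hosts and a byteswap on big-endian hosts.

// Hypothetical usage sketch (not from the PR): read a uint32 stored
// little-endian in a GGUF buffer and return it in host byte order,
// relying on the convert_from_le32() helper defined above.
static uint32_t read_le_u32(const uint8_t * buf, size_t offset) {
    uint32_t value;
    memcpy(&value, buf + offset, sizeof(value)); // copy the raw little-endian bytes
    convert_from_le32(&value);                   // no-op on LE hosts, byteswap on BE hosts
    return value;
}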

Comment on lines 6548 to 6590
static void ggml_byteswap_q4_0(void * restrict buffer, size_t elements) {
    block_q4_0 * data_ptr = (block_q4_0 *) buffer;
    for (size_t i = 0; i < elements; ++i) {
        convert_from_le16(&(data_ptr[i].d));
    }
}
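For context, the q4_0 block that this function iterates over looks roughly like this (layout reproduced from memory of ggml's common headers, so treat it as a sketch): the only multi-byte member is the FP16 scale d, while the packed 4-bit quants qs are single bytes and therefore have no byte order of their own.

#define QK4_0 32
typedef struct {
    ggml_fp16_t d;           // scale (delta), a 16-bit float: multi-byte, needs byteswapping
    uint8_t     qs[QK4_0/2]; // 32 x 4-bit quants packed two per byte: endian-independent as bytes
} block_q4_0;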
Collaborator

Why is the endianness converted only for the scale d and not for the quantized values? More generally, quantized models require a lot of bit manipulation in order to work. Did you test this code with any quantized models, or only with models using conventional scalar data types such as FP16?

Contributor Author

I didn't test this particular function, but ggml_byteswap_q4_k and ggml_byteswap_q6_k are used, and I implemented this function the same way.

I can disable all functions which are not explicitly tested yet, or I could test them if you point me to models that use them.

Collaborator

More generally speaking, I've thought about the problem of endianness some more and there are quite a lot of potential pitfalls because we do a lot of low-level bitwise manipulations. There's also e.g. the question of how to handle interaction with backends other than CPU which would likely still assume little endianness. Quite frankly I'm not sure whether this is a feature that would be worth the complexity to support given how rare big endian processors are. My personal opinion is that something like this would be fine to merge if it does not interfere with any little endian code and it is clear that big endian issues are wholly your responsibility to determine and resolve. @ggerganov @slaren your input would be appreciated.

Member

I think it would be hard to justify the maintenance cost. I am not sure how this interacts with the existing support for big endian gguf files in the gguf python library and convert_hf_to_gguf.py. It might make more sense to instead improve on that, for example by adding a tool to convert a gguf file to a different endianness. Model repositories like ollama can deal with this by serving both versions on demand, if they wish.

Contributor Author

I think you're correct about existing big endian files. I'll add a cmdline switch to disable byteswapping on s390x to allow loading such existing files.

Unfortunately, everyone uploading two copies of each model, little endian and big endian, is unlikely to happen. That is why we'd like to have a solution for loading little endian models on big endian systems.

Of course, byteswapping in advance would help, and would enable, for example, using mmap. But an additional big-endian-specific step isn't always possible, as in the case of ollama, and it can instead be done while loading the model, as in this PR. While I didn't measure performance, I expect loading a model to take considerably less time than actually running it, and I don't expect byteswapping during loading to change that noticeably.

As for the gguf python library, if there is a test suite beyond the one described here (https://github.com/ggerganov/llama.cpp/blob/master/ci/README.md), then I could also check it.

@taronaeo
Collaborator

If I may chime in on the consideration of support, this PR is a pretty big step forward as umbrella support for zSystems and for problematic/unsupported models, while we actively work on implementing SIMD support individually (taronaeo/llama.cpp-s390x@891922f).

At present, even with --bigendian conversion using the convert_hf_to_gguf.py tool, Llama-3.2-1B fails to run correctly; the behaviour is similar to what is seen in #3298 (comment), as demonstrated here:

$ build/bin/llama-cli -m /opt/hf_models/llama-3.2-1b-be.F32.gguf -n 50 -t 8 -p "Write me a dog walking business idea  1." --seed 1568795874

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : VXE = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

sampler seed: 1568795874
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 50, n_keep = 1

Write me a dog walking business idea  1.GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

llama_perf_sampler_print:    sampling time =       3.97 ms /    62 runs   (    0.06 ms per token, 15625.00 tokens per second)
llama_perf_context_print:        load time =     702.51 ms
llama_perf_context_print: prompt eval time =     459.48 ms /    12 tokens (   38.29 ms per token,    26.12 tokens per second)
llama_perf_context_print:        eval time =    9103.29 ms /    49 runs   (  185.78 ms per token,     5.38 tokens per second)
llama_perf_context_print:       total time =    9602.03 ms /    61 tokens

If support for converting little-endian to big-endian on the fly is provided, albeit understandably with a performance penalty, models such as Llama-3.2 can start running properly from little-endian files while we work on a patch to make them work as big-endian.

@AlekseiNikiforovIBM
Contributor Author

I've implemented the '--no-byteswap' switch and tested it with a model converted with the help of #11349.

Introduce byteswap functions for ggml data.
Implement some of them.
Currently tested on llama3.2.
@AlekseiNikiforovIBM
Contributor Author

Rebased PR to resolve merge conflicts

@AlekseiNikiforovIBM
Contributor Author

I've added an option to disable byteswapping on s390x. While byteswapping could also be implemented for little-endian systems, I believe most models are little endian and there's no demand for such an option at the moment.

I'd like to proceed with this PR. What's needed for that?

@slaren
Member

slaren commented Jan 30, 2025

I don't think this can be merged in its current state. As it is, it is just going to be a maintenance burden and will increase the technical debt that we will need to pay when we eventually come back to work on this code and have to figure out a way to deal with this mess.

Fundamentally, the problem is that the gguf loader is not designed to support byteswapping. It just wasn't a consideration when it was created, and it would take a major effort to redesign the system to allow transparent byteswapping during loading.

Additionally, you need to figure out how to deal with big endian GGUF files. Adding a command line flag that just dumps the responsibility of figuring that out onto the user is not a good way to deal with this. I can see two options: remove support for big endian GGUF files entirely, or add the necessary logic to support both types.

Alternatively, you could instead extend the existing support for big endian GGUF files. This would be by far the easiest route, and the most likely to be merged.

@AlekseiNikiforovIBM
Contributor Author

I've checked gguf-py and it determines the byte order based on the version field. I'll remove the cmdline switch and rework the C++ code to use the same logic.

As for byteswapping itself, I'll try reworking it.
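For reference, a minimal sketch of that version-based check, as I understand the gguf-py heuristic (illustrative only, not the PR's actual code): valid GGUF versions are small integers such as 3, so when the 32-bit version field is read in the wrong byte order its low 16 bits come out as zero.

// Illustrative sketch: decide whether a GGUF file was written in the
// opposite byte order by inspecting the version field read as-is.
// A byteswapped version 3 reads back as 0x03000000, so its low 16 bits
// are zero, which can never happen for a valid native-order version.
static bool gguf_version_looks_byteswapped(uint32_t version_as_read) {
    return version_as_read != 0 && (version_as_read & 0xFFFFu) == 0;
}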

Bitfields are allocated in a different order on s390x.
Add ggml_ prefix to their names.
Now little-endian systems can load big-endian models too.
The model endianness heuristic is based on guessing from the
model version field.

Additional fixes for the testsuite after removing the
capability to write non-native-endian files.
@github-actions bot added the python (python script changes) label on Feb 18, 2025
@AlekseiNikiforovIBM
Contributor Author

I've reworked the PR a bit. First of all, I've removed the command line option I introduced earlier. Now there is a heuristic similar to the one in the Python code, which guesses the endianness based on the value of the version field.

I've removed the capability to save gguf files in non-native endianness. Files are now always saved in native endianness.

I couldn't check byteswapping during asynchronous data loading in llama_model_loader::load_all_data, even with CUDA on x86, because I couldn't force it onto that code path: I got upload_backend == nullptr every way I tried. I'm pretty sure it should work, but if you could point me to how to force llama.cpp to run this code path, I could verify that it works too.

Please let me know if there are still any pain points present in this PR.

@AlekseiNikiforovIBM
Contributor Author

Are those 3 failures related to changes in this PR? If yes, what exactly is wrong?

@slaren
Member

slaren commented Feb 26, 2025

The changes look like a good improvement; however, the main problem is still that this is not transparent to the users of the gguf library, and requires additional code to check and byteswap the tensor data if necessary. llama.cpp is not the only application of ggml.

Ideally, this should be completely transparent to the user, but that's not possible without also moving the data loading code (including mmap support) to the gguf library, which would be a major change. When the gguf loader is used without no_alloc, it will also not perform byteswapping of the tensor data, resulting in the application silently loading incorrect tensor data.

As it is, other applications that use the gguf library will fail silently when loading a gguf file with a non-native byte order, which is not great. It will also force us to add all the boilerplate for byteswapping to all new code using the gguf library. I am not saying that this is necessarily a deal-breaker when it comes to the decision of merging this PR, but it would make the decision a lot easier if these issues were addressed. If that's not possible, I will defer to @ggerganov or other maintainers.

@slaren
Member

slaren commented Feb 26, 2025

Are those 3 failures related to changes in this PR? If yes, what exactly is wrong?

I am not sure what the issue is, but I don't think it's caused by the changes in this PR. Merging the latest changes from master may fix it.

@ggerganov
Member

If I understand correctly, on master, big-endian systems are currently capable of working with big-endian GGUFs. I think this is already good enough support. I can see that being able to load an LE GGUF on a BE system would be convenient, but I agree that the changes proposed here do not justify the associated maintenance. I personally haven't seen/used a BE machine in my life, so my assumption is that this is a very niche use case - please correct me if I am wrong.

The ideal solution is indeed to handle all this seamlessly in the gguf implementation, but as @slaren pointed out, it appears to be a really big change. So unless we see some strong argument that this feature is needed, I would also prefer not to merge as it is.

@AlekseiNikiforovIBM
Contributor Author

The changes look like a good improvement, however the main problem is still that this is not transparent to the users of the gguf library, and requires additional code to check and byteswap the tensor data if necessary. llama.cpp is not the only application of ggml.

What if I add a new field to struct gguf_init_params, named something like allow_byteswapping? It would allow all clients to opt in to this feature. I'd make it off by default and enable it for llama.cpp. This would keep other potential gguf library clients from breaking.
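A minimal sketch of what that opt-in could look like (allow_byteswapping is the proposed, not-yet-existing field; no_alloc and ctx are the fields gguf_init_params already has, and the wrapper function below is purely illustrative):

// Sketch of the proposed opt-in; not final API.
struct gguf_init_params {
    bool no_alloc;
    struct ggml_context ** ctx;
    bool allow_byteswapping; // proposed: off by default, so existing gguf
                             // library users keep their current behaviour
};

// Hypothetical usage from llama.cpp's loader, which would opt in explicitly:
static struct gguf_context * load_gguf_with_byteswap(const char * fname, struct ggml_context ** meta) {
    struct gguf_init_params params = {
        /*.no_alloc           =*/ true,
        /*.ctx                =*/ meta,
        /*.allow_byteswapping =*/ true,
    };
    return gguf_init_from_file(fname, params);
}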

When the gguf loader is used without no_alloc, it will also not perform byteswapping of the tensor data, resulting in the application silently loading incorrect tensor data.

I'll disable byteswapping if no_alloc is enabled.

I agree that the changes proposed here do not justify the maintenance associated.

As for maintaining the byteswapping parts and/or big endian support, my colleagues and I (currently https://github.com/Andreas-Krebbel when I'm not available) are willing to support those parts. Feel free to assign any issue arising in this context to me or my colleagues. We could also set up an s390x CI runner to validate any changes on a BE target.

I personally haven't seen/used a BE machine in my life, so my assumption is that this is a very niche use case - please correct me if I am wrong.

While the IBM Z platform may be a niche platform in terms of the number of systems out there, it powers critical workloads around the world in finance, airline reservations, insurance, retail and many more. Among the biggest players in these industries, the IBM Z platform is seeing good market adoption. The ability to run llama.cpp on IBM Z without too much hassle allows large enterprises to use it in combination with their traditional workloads.

Enable it for llama.cpp
@AlekseiNikiforovIBM
Contributor Author

Disabling byteswapping when no_alloc is enabled didn't work out. But with the latest commit, byteswapping is opt-in.

@JohannesGaessler
Collaborator

My understanding is that BE systems can already run llama.cpp, but it's inconvenient. So BE support with mainline model files would be a convenience gain for only a small number of people. How critical those BE systems are is not relevant. So any implementation must be done in a way that is minimally invasive and does not induce any side effects.

@AlekseiNikiforovIBM
Contributor Author

My understanding is that BE systems can already run llama.cpp, but it's inconvenient.

While llama.cpp on BE systems can run BE models, it cannot run LE models.

One of the users of llama.cpp is ollama, and it does not work on BE systems, because ollama downloads models from an online repository and those models are always LE. There could be BE models there, but I have not seen any in their repositories yet.

And while for llama.cpp it is possible to run an external Python script to convert a model from LE to BE or the other way around, for ollama there is, as far as I can see at the moment, no good place where this could be done. The only good option is to do it on the fly.

So any implementations must be done in a way that is minimally invasive and does not induce any side effects.

I'm trying to create such an implementation, and I'd be grateful if you could point out any remaining issues of that kind in the current implementation.

@AlekseiNikiforovIBM
Contributor Author

Do you have any suggestions for improving this code further in order to bring it to an acceptable state?

@ggerganov
Member

I am afraid the changes still add too much extra logic that we will have to work around over time, and it does not justify the small added benefit. Currently this code handles the quantization types on a case-by-case basis, which is OK but not great. However, a GGUF can contain more generic data (such as the unicode strings that you added code to handle in llama.cpp). Without a generic way to tag the data in the GGUF that requires byte swapping, and to handle it seamlessly for the user of the gguf library, I don't think it is worth adding the technical debt.

One of the users of llama.cpp is ollama

That's a good point and I think the better solution would be for providers like ollama to offer big-endian models seamlessly. However, my understanding is that ollama no longer uses llama.cpp, so I guess this is no longer a factor in the decision to add BE support or not.

@ericcurtin
Collaborator

ericcurtin commented Apr 30, 2025

That's a good point and I think the better solution would be for providers like ollama to offer big-endian models seamlessly. However, my understanding is that ollama no longer uses llama.cpp, so I guess this is no longer a factor in the decision to add BE support or not.

FWIW, @AlekseiNikiforovIBM and @ggerganov, RamaLama would probably pick up the slack here; it's one of the reasons RamaLama exists :)

If it works with upstream llama.cpp, we can make it work with RamaLama.

@AlekseiNikiforovIBM
Contributor Author

Closing this PR as abandoned
