
Conversation

@Thireus (Contributor) commented Nov 17, 2025

First, thank you for maintaining this project — it has been very useful, and I appreciate the work that has gone into it.

I initially created a fork to add automated Windows builds for my own use, since I needed ready-to-use binaries. Since this is functionality that could benefit other users as well, I’m submitting this pull request so the Windows build workflow can live directly in the main repository instead of in my fork.

This PR includes:

  1. GitHub Actions workflow for building the project on Windows
    
  2. Automatic artifact uploads so users can download ready-to-use Windows builds
    
  3. Other small tweaks to ensure the code compiles on Windows.
    

Builds must be triggered manually (I believe they could be automated to run after each commit, but I have not had the chance to dig into it). There is also some cleanup left to do, specifically around other automated jobs that run checks which are not meaningful for this repository. The original workflow code was taken from mainline llama.cpp and tweaked to adapt it to ik_llama.cpp.

My goal was to make the project more accessible to Windows users, specifically those who, like myself, cannot find the time to set up a development environment on Windows or lack the knowledge to do so. If you’d prefer changes to the structure, naming, workflow triggers, or anything else, I’m happy to adjust the PR accordingly.

Thanks again for the project and for taking the time to review this!

    };
    auto compute_1row = [&] (const float * xr) {
-       float weight[kBlockSize];
+       std::vector<float> weight(kBlockSize); // float weight[kBlockSize]; - Fix for error C2131: expression did not evaluate to a constant
@ikawrakow (Owner) commented:

This code has been disabled in #707
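For context, a minimal, illustrative reconstruction of why the change was made for MSVC (the surrounding names are stand-ins, not the real ik_llama.cpp definitions): MSVC requires array bounds to be constant expressions and reports C2131 otherwise, whereas GCC and Clang accept a runtime bound as a variable-length-array extension, so switching to std::vector keeps the code portable.

    // Illustrative sketch only, not the actual ik_llama.cpp code; `kBlockSize`
    // here is a stand-in for a size MSVC cannot treat as a constant expression.
    #include <vector>

    void demo(int kBlockSize) {
        // float weight[kBlockSize];           // GCC/Clang accept this as a VLA extension,
        //                                     // MSVC rejects it with error C2131
        std::vector<float> weight(kBlockSize); // portable: the size is set at run time
        (void)weight;                          // suppress unused-variable warnings
    }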

    if (GGML_AVX512)
        list(APPEND ARCH_FLAGS -mavx512f)
        list(APPEND ARCH_FLAGS -mavx512bw)
        list(APPEND ARCH_FLAGS -mavx512dq)
@ikawrakow (Owner) commented:

This does not actually enable the fast AVX512 matrix multiplication version. The fast version requires -mavx512vnni in addition to these.

@Thireus (Contributor, Author) commented:

Got it, that would be the block below, which produces the avx512_vnni build variant:

        if (GGML_AVX512_VNNI)
            list(APPEND ARCH_FLAGS -mavx512vnni)
        endif()

Is there any advantage to running avx512 without vnni over avx2?

@ikawrakow (Owner) commented Nov 18, 2025

Very little, if any. If VNNI is not enabled, the matrix multiplication implementation will use the AVX2 version. There will be some portions in flash attention that will get processed with 512-bit registers/instructions, but the main cost (the K*Q matrix multiplication) will still be AVX2.

I haven't done a more fine-grained implementation of matrix multiplications because my CPU (Ryzen-7950X, so Zen4 core) uses "double-pumping", i.e., 512-bit instructions get executed as two 256-bit instructions, so the only time you gain something is when you can use AVX512 instructions that are not there (as 256-bit equivalents) in AVX2. The most important such instruction is _mm512_dpbusd_epi32, which is part of the VNNI extension. Without that instruction, using 512-bit registers/instructions brings zero benefit on Zen4 cores. Things are different on Zen5 and some of the high-end Intel CPUs, but without regular access to such CPUs there is no meaningful way to develop more fine-grained versions of the matrix multiplication kernels.
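For illustration, a minimal sketch (not the actual ik_llama.cpp kernels; the helper name is made up) of the difference described above: with VNNI the u8·s8 dot-product accumulation is a single instruction over a 512-bit register, while the AVX2 path needs a three-instruction sequence over 256-bit registers and goes through saturating 16-bit intermediates.

    // Illustrative only. Accumulate a dot product of unsigned 8-bit values (`u8`)
    // with signed 8-bit values (`s8`) into the 32-bit lanes of `acc`.
    #include <immintrin.h>

    #if defined(__AVX512VNNI__)
    // A single instruction handles 64 byte pairs and accumulates directly into i32.
    static inline __m512i dot_accum(__m512i acc, __m512i u8, __m512i s8) {
        return _mm512_dpbusd_epi32(acc, u8, s8);
    }
    #elif defined(__AVX2__)
    // Three instructions handle 32 byte pairs, and _mm256_maddubs_epi16 saturates
    // its 16-bit intermediate sums, which the VNNI instruction avoids.
    static inline __m256i dot_accum(__m256i acc, __m256i u8, __m256i s8) {
        __m256i p16 = _mm256_maddubs_epi16(u8, s8);                 // u8*s8 -> adjacent i16 sums
        __m256i p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1)); // widen pairs -> i32
        return _mm256_add_epi32(acc, p32);
    }
    #endif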

inline __m512i operator^(const __m512i& a, const __m512i& b) { return _mm512_xor_si512(a, b); }

// AVX2 integer bitwise operators
inline __m256i operator|(const __m256i& a, const __m256i& b) { return _mm256_or_si256(a, b); }
@ikawrakow (Owner) commented:

Not necessary.

inline __m256i operator^(const __m256i& a, const __m256i& b) { return _mm256_xor_si256(a, b); }

// NEON
#ifdef __ARM_NEON
@ikawrakow (Owner) commented:

Not necessary. The |, &, ^ operators ended up in the AVX512 version by mistake, but they are not used in any of the other implementations.

@ikawrakow (Owner) commented:

So, I specifically threw out all of the llama.cpp CI and GitHub Actions. And now we will have them all back, almost all failing?

@Thireus (Contributor, Author) commented Nov 17, 2025

Thank you for the comments.

If the goal is wider adoption, then removing CI and builds works against it. The builds I produce have a few dozen users, and other projects (e.g. Jan: janhq/jan#6917) are actively asking for ready-to-download ik_llama.cpp artifacts.

The demand is there. But without CI and artifact generation, only power users can realistically adopt ik_llama.cpp, which is unfortunate, since users with lower-end consumer hardware, the ones who benefit most from ik_llama.cpp’s optimisations, are in my opinion the least likely to build from source. The same reasoning applies, I suppose, to why users keep sticking to Windows even though it is clearly not the best OS for running LLMs.

From my perspective, facilitating integration with other frameworks and lowering the barrier to use will naturally boost adoption.

@ikawrakow (Owner) commented:

I can see the benefit of producing ready builds, but if so, then only for the platforms that are actually supported. To me it looks like this has been copied over from llama.cpp, so there is SYCL, RISC-V, and the kitchen sink.

llama.cpp has put way more effort into making it work on as many platforms as possible. Their back-ends are loaded dynamically, which makes it easier to provide build artifacts. None of this is true here. I would have thought that you had done a few Windows configurations (CPU-only, CPU+CUDA, etc.), and that was that. A lot of the llama.cpp stuff just does not apply here anymore.

@Thireus Thireus marked this pull request as draft November 17, 2025 16:41
@Thireus (Contributor, Author) commented Nov 17, 2025

I did what I could with the knowledge and resources I had. I had to start from something that worked and hack my way into making it function for ik_llama.cpp. I acknowledge that this CI is definitely “dirty” code and much of it could be thrown away.

I can’t afford to spend the amount of time required for full DevOps and cross-platform support right now. That’s why everything except Windows CUDA is disabled. I tried multiple times to build without CUDA and for other OSs, but I couldn’t get it to work.

This setup certainly requires cleaning and isn’t ready to merge if the goal is fully clean, production-quality code. It was intended more as a starting point.

@Kreatifchk commented:

Do you have a CPU-only version? I didn’t find one.

@Thireus (Contributor, Author) commented Nov 24, 2025

Unfortunately not. I tried that a while ago and spent countless hours on it, but it fails to build without CUDA.
