Conversation

@ochafik (Collaborator) commented Jan 14, 2025

  • Building each specific arch separately to get maximum performance, the smallest package size, and the shortest build times possible (compare a build for 7.5+8.0 vs. just 7.5, for instance: libggml-cuda.so is almost twice the size, ~70MB per arch)

  • Colab one-liner to install the CUDA build of llama.cpp (example usage below; the URL will need adjusting to GitHub releases, and I'll write an install script when releases are available; a sanity-check snippet follows these commands):

    # Temporarily, hosting binaries on my own server.
    # The first subshell extracts the CUDA driver version (e.g. 12.2), the
    # second the GPU compute capability (e.g. 7.5).
    !wget -O llama-cpp.zip "https://download.ochafik.com/llama.cpp/llama-cpp-master-cuda-$( nvidia-smi | grep "CUDA Version: " | sed -E 's/.*Version: ([0-9]+\.[0-9]+).*/\1/' )-cap-$( nvidia-smi --query-gpu=compute_cap --format=csv | tail -n 1 ).zip" && unzip -o llama-cpp.zip
    
    # Once this PR gets merged: resolve the matching asset URL from the
    # latest GitHub release, then download it.
    !wget -O llama-cpp.zip "$( curl --silent "https://api.github.com/repos/ggerganov/llama.cpp/releases/latest" | grep cuda-cu$( nvidia-smi | grep "CUDA Version: " | sed -E 's/.*Version: ([0-9]+\.[0-9]+).*/\1/' )-cap$( nvidia-smi --query-gpu=compute_cap --format=csv | tail -n 1 ) | grep browser_download_url | sed -E 's/.*(https:.*)"/\1/' )"
    

TODO

  • Merge ci: ccache for all github workflows #11516
  • Fix build on CI
  • Compare the benefit of separate archives for server / cli vs. a full archive?
  • Trigger a branch release if possible (to test the entire mechanics)
  • Incubate install.sh (Unix incl. WSL) & install.ps1 (Windows) scripts that detect OS, arch, CPU & GPU capabilities and install the right release (maybe through brew); a detection sketch follows this list
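
A rough sketch of what that detection logic could look like for the Unix script; the asset naming scheme (llama-cpp-<os>-<arch>-<backend>.zip) is hypothetical and only mirrors the pattern used in the Colab one-liner above:

    #!/usr/bin/env sh
    # Hypothetical install.sh detection sketch; asset names are assumptions.
    OS=$(uname -s | tr '[:upper:]' '[:lower:]')   # linux / darwin
    ARCH=$(uname -m)                              # x86_64 / arm64 / aarch64
    BACKEND=cpu
    if command -v nvidia-smi >/dev/null 2>&1; then
      CUDA_VER=$(nvidia-smi | grep 'CUDA Version: ' | sed -E 's/.*Version: ([0-9]+\.[0-9]+).*/\1/')
      CAP=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
      BACKEND="cuda-cu${CUDA_VER}-cap${CAP}"
    fi
    echo "would fetch: llama-cpp-${OS}-${ARCH}-${BACKEND}.zip"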

@github-actions bot added the devops (improvements to build systems and github actions) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jan 14, 2025
@slaren (Member) commented Jan 14, 2025

What's the reason for making a different release for each arch?

@ochafik (Collaborator, Author) commented Jan 14, 2025

> What's the reason for making a different release for each arch?

@slaren Building for a single arch seems a lot faster, and having separate artifacts instead of (CUDA-)fat binaries means smaller downloads and quicker setup on Colab. I couldn't finish a full build with all the architectures locally yet, though; maybe I'll try this to see how much overhead per arch we're talking about.
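
For context on the size/time tradeoff, a sketch of building one arch vs. a two-arch fat binary with llama.cpp's standard CMake flags (-DGGML_CUDA=ON and CMAKE_CUDA_ARCHITECTURES are assumed here; exact option names may vary by version):

    # Single arch (e.g. Turing, 7.5): fastest compile, smallest libggml-cuda.so
    cmake -B build-75 -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=75
    cmake --build build-75 -j
    
    # Fat binary for 7.5 + 8.0: roughly doubles the device code (~70MB per arch)
    cmake -B build-7580 -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="75;80"
    cmake --build build-7580 -j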
