This seems like one of the best quantization options for the new LLaMA-3 70B model, so that it can be run on 1-2 consumer-grade GPUs. However, it looks like support for MQA is not present in llama.py, so I don't think it will work as-is.
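For reference, here is a minimal sketch of how one might check which attention variant the checkpoint actually uses by inspecting its Hugging Face config (this assumes the gated `meta-llama/Meta-Llama-3-70B` repo name and that `transformers` is installed):

```python
# Sketch: inspect the model config to see whether the 70B checkpoint uses
# MQA, GQA, or plain multi-head attention, since llama.py support may hinge
# on how key/value heads are handled. The repo name below is an assumption
# and the checkpoint is gated (requires accepting the license on HF).
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-70B")
print("attention heads:", cfg.num_attention_heads)
print("key/value heads:", cfg.num_key_value_heads)

if cfg.num_key_value_heads == 1:
    print("-> multi-query attention (MQA)")
elif cfg.num_key_value_heads < cfg.num_attention_heads:
    print("-> grouped-query attention (GQA)")
else:
    print("-> standard multi-head attention")
```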
Are you planning to add support for LLaMA-3?