Commit 3348585

Update README.md: Add Huggingface repo for 7B and 13B quantization (#142)
* Update README.md: Add Huggingface repo for 7B and 13B quantization
* Update requirements.txt to pin PEFT and BNB versions

Reason:
- For BNB: tloen/alpaca-lora#350
- For PEFT: huggingface/peft@c21afbe#diff-b3b90f453dea37bf90203fd395e9dedc21b21c9a38464c6b1572368c049ef8b2L116-L128
1 parent aca32f6 commit 3348585

2 files changed, +4 −3 lines changed

large_language_models/alpaca-qlora/requirements.txt

Lines changed: 3 additions & 3 deletions
@@ -3,6 +3,6 @@ loralib
 sentencepiece
 git+https://github.com/huggingface/transformers.git
 accelerate
-bitsandbytes
-git+https://github.com/huggingface/peft.git
-gradio
+bitsandbytes==0.37.2
+peft==0.2.0
+gradio
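These pins hold bitsandbytes at 0.37.2 and peft at 0.2.0 to avoid the breakage reported in the issues linked in the commit message. Below is a minimal sketch, assuming the usual 8-bit LoRA inference path those versions support; the base model and adapter paths are placeholders, not taken from this commit.

```python
# Sketch only: 8-bit LoRA inference with the pinned bitsandbytes/peft versions.
# "base_model" and "lora_weights" are placeholder paths, not from this commit.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel  # peft==0.2.0

base_model = "decapoda-research/llama-7b-hf"  # assumed base checkpoint
lora_weights = "path/to/lora-adapter"         # assumed adapter directory

tokenizer = LlamaTokenizer.from_pretrained(base_model)
model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,            # int8 weights via bitsandbytes==0.37.2
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, lora_weights, torch_dtype=torch.float16)
model.eval()

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```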

large_language_models/llama/quantization/README.md

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 ### Update News
+- LLaMA-7B and 13B quantization are also available [here](https://huggingface.co/cnbeining/sparsebit-llama-quantization-7b-13b).
 - We have updated a llama-13b checkpoint with 3-bit 128-group quantization [here](https://drive.google.com/file/d/1LjZmOU8tr2VT6HdAP_WbuX8cqmrs5DrR). For config_cache and tokenizer_cache, the files can be found [here in huggingface](https://huggingface.co/decapoda-research/llama-13b-hf).
 - We implemented a cuda kernel for groupsize=128(int3/int4) & groupsize=64(int2). In our experiments, setting groupsize=128(int3) can make all quantization models achieve a significant increase in ppl compared to groupsize=-1. All results are updated in Table A.
 - We add `--single_device_mode` to support all quant models run in a single GPU(i.e. 2080ti). Please refer to the inference section for details.
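The added README line points to a Hugging Face repo hosting the quantized 7B and 13B artifacts. A small sketch of fetching them with huggingface_hub follows; the repo id comes from the diff above, but its internal file layout is not described in this commit, so the snapshot is downloaded whole.

```python
# Sketch only: download the quantized LLaMA-7B/13B snapshot referenced above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="cnbeining/sparsebit-llama-quantization-7b-13b",
)
print("Quantized checkpoints downloaded to:", local_dir)
```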
