A Jupyter Notebook demo for running Falcon‑7B‑Instruct with 4‑bit quantization using BitsAndBytes and Hugging Face, enabling efficient inference on limited GPU resources.
This project demonstrates how to:
- Load Falcon‑7B‑Instruct with 4-bit quantization (via BitsAndBytes)
- Perform inference using Hugging Face’s Transformers pipeline
- Run the model on GPUs with limited VRAM (≤ 8 GB)
- Easily customize generation settings for varied use cases
- Model Loading: Uses `BitsAndBytesConfig` to enable `load_in_4bit=True`, significantly reducing VRAM usage.
- Inference Pipeline: Leverages the Transformers `pipeline("text-generation")` for easy prompt-to-text generation.
- Configurable Generation: Supports tuning of `max_length`, `top_k`, `top_p`, `temperature`, `num_return_sequences`, etc.
- Jupyter-Friendly: All code lives in an interactive notebook, ideal for experimentation and easy sharing.
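To see why 4-bit quantization makes an 8 GB GPU workable, here is a rough back-of-envelope estimate of weight storage alone (the figures are approximate and ignore activation and KV-cache overhead, so actual usage will be somewhat higher):

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a model with n_params parameters."""
    return n_params * bits_per_param / 8 / 1024**3

n = 7e9  # Falcon-7B parameter count (approximate)
for label, bits in [("fp16", 16), ("int8", 8), ("4-bit (NF4)", 4)]:
    print(f"{label}: ~{weight_memory_gib(n, bits):.1f} GiB")
```

At 4 bits per weight the model drops from roughly 13 GiB (fp16) to about 3.3 GiB, leaving headroom for activations on an 8 GB card.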
Install the essential libraries:

```bash
pip install transformers bitsandbytes accelerate einops
```

Add any other libraries the notebook requires (e.g., `torch`).
- Clone the repo:

  ```bash
  git clone https://github.com/tatuskarjaiwanth/quantizedfalcon.git
  cd quantizedfalcon
  ```

- Install dependencies (see above).
- Open and run the notebook:

  ```bash
  jupyter notebook quantizedfalcon.ipynb
  ```
- Modify prompt settings as desired and execute cells for quantized inference.
- BitsAndBytesConfig Setup: Configures 4-bit quantization with options like `bnb_4bit_compute_dtype`, `bnb_4bit_quant_type`, and `bnb_4bit_use_double_quant`.
- Model Load: Loads `tiiuae/falcon-7b-instruct` with quantization and `device_map="auto"`.
- Inference Pipeline: Builds a `pipeline()` for text generation.
- Demo Prompts: Try sample prompts with different settings to see quantized performance.
- Add benchmarking cells (execution time, VRAM usage)
- Integrate with Hugging Face Inference API or TGI server
- Explore quantization alternatives: 8-bit, NF4 vs FP4, GPT-Q
- Extend to fine-tuning with PEFT or QLoRA on custom datasets
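For the benchmarking idea above, a simple helper cell could time a generation call and report peak CUDA memory. This is only a sketch: the `benchmark` function name is illustrative, and the `torch` import is guarded so the timing part also runs without a GPU or without PyTorch installed.

```python
import time

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # torch not installed; timing still works, VRAM reads as 0
    _HAS_CUDA = False

def benchmark(fn, *args, **kwargs):
    """Time a callable and report peak CUDA memory in GiB (0.0 without a GPU)."""
    if _HAS_CUDA:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if _HAS_CUDA:
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3 if _HAS_CUDA else 0.0
    return result, elapsed, peak_gib

# Example usage in the notebook (pipe defined as in the code below):
# out, secs, gib = benchmark(pipe, "Explain quantization.", max_length=50)
# print(f"{secs:.2f}s, peak VRAM {gib:.2f} GiB")
```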
```python
from transformers import pipeline, BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM
import torch

# Configure 4-bit NF4 quantization with double quantization and fp16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Load the quantized model; device_map="auto" lets Accelerate place layers
# across the available devices
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct", use_fast=True)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# do_sample=True is required for top_k and temperature to take effect
print(pipe(
    "What is the capital of France?",
    max_length=50,
    do_sample=True,
    top_k=50,
    temperature=0.7,
)[0]["generated_text"])
```