QuantizedFalcon 🦅

A Jupyter Notebook demo for running Falcon‑7B‑Instruct with 4‑bit quantization using BitsAndBytes and Hugging Face, enabling efficient inference on limited GPU resources.

🚀 Overview

This project demonstrates how to:

  • Load Falcon‑7B‑Instruct with 4-bit quantization (via BitsAndBytes)
  • Perform inference using Hugging Face’s Transformers pipeline
  • Run the model on GPUs with limited VRAM (≤ 8 GB)
  • Easily customize generation settings for varied use cases

⚙️ Features

  • Model Loading: Uses BitsAndBytesConfig to enable load_in_4bit=True, significantly reducing VRAM and memory usage.
  • Inference Pipeline: Leverages Transformers pipeline("text-generation") for easy prompt-to-text generation.
  • Configurable Generation: Supports tuning of max_length, top_k, top_p, temperature, num_return_sequences, etc.
  • Jupyter-Friendly: All code is in an interactive notebook—ideal for experimentation and easy sharing.
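
To build intuition for what the top_k and top_p knobs above actually do, here is a toy, pure-Python re-implementation of top-k / nucleus filtering on a made-up next-token distribution (an illustration only, not the library's internal code):

```python
def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Toy sketch of top-k / nucleus (top-p) filtering over a dict
    of token -> probability. Illustrative, not Transformers' internals."""
    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]          # keep only the k most likely tokens
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:      # stop once the "nucleus" covers top_p mass
                break
        ranked = kept
    total = sum(p for _, p in ranked)    # renormalise the surviving mass
    return {token: p / total for token, p in ranked}

probs = {"Paris": 0.6, "Lyon": 0.2, "Nice": 0.15, "Oslo": 0.05}
print(top_k_top_p_filter(probs, top_p=0.8))  # keeps only "Paris" and "Lyon"
```

Lower top_p or top_k narrows sampling to the most likely tokens (more focused output); higher values admit more of the tail (more varied output).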

🧩 Requirements

Install essential libraries:

pip install transformers bitsandbytes accelerate einops

Also install any other libraries the notebook imports (e.g., torch).

💾 Getting Started

  1. Clone the repo:
    git clone https://github.com/tatuskarjaiwanth/quantizedfalcon.git
    cd quantizedfalcon
  2. Install dependencies (see above).
  3. Open and run the notebook:
    jupyter notebook quantizedfalcon.ipynb
  4. Modify prompt settings as desired and execute cells for quantized inference.

📌 How It Works (Notebook Walkthrough)

  • BitsAndBytesConfig Setup: Configures 4-bit quantization with options like bnb_4bit_compute_dtype, bnb_4bit_quant_type, and bnb_4bit_use_double_quant.
  • Model Load: Loads tiiuae/falcon-7b-instruct with quantization and device_map="auto".
  • Inference Pipeline: Builds a pipeline() for text generation.
  • Demo Prompts: Try sample prompts with different settings to see quantized performance.
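
A rough back-of-envelope calculation shows why 4-bit loading is what makes Falcon-7B fit on a ≤ 8 GB card (weight storage only; activations and the KV cache need additional headroom, and real usage varies):

```python
# Approximate weight-memory footprint of a ~7B-parameter model
# at different precisions. Rough estimate, not a measurement.
PARAMS = 7e9  # ~7 billion parameters

def weight_memory_gib(bytes_per_param):
    return PARAMS * bytes_per_param / 1024**3

print(f"fp16 : {weight_memory_gib(2):.1f} GiB")    # ~13 GiB: too big for an 8 GB GPU
print(f"int8 : {weight_memory_gib(1):.1f} GiB")    # ~6.5 GiB: a tight fit
print(f"4-bit: {weight_memory_gib(0.5):.1f} GiB")  # ~3.3 GiB: leaves room for activations
```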

🧠 Suggested Improvements

  • Add benchmarking cells (execution time, VRAM usage)
  • Integrate with Hugging Face Inference API or TGI server
  • Explore quantization alternatives: 8-bit, NF4 vs FP4, GPTQ
  • Extend to fine-tuning with PEFT or QLoRA on custom datasets
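
For the first suggestion above, a minimal benchmarking cell could look like the following sketch. The `benchmark` helper and its dummy callable are hypothetical names introduced here for illustration; in the notebook you would pass the real pipeline instead:

```python
import time

def benchmark(generate_fn, prompt, n_runs=3):
    """Hypothetical helper: time a generation callable over a few runs.
    `generate_fn` stands in for e.g. a Transformers text-generation pipeline."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        output = generate_fn(prompt)
        timings.append(time.perf_counter() - start)
    return {"output": output, "avg_seconds": sum(timings) / len(timings)}

# On a real GPU run, pair this with torch.cuda.reset_peak_memory_stats()
# before and torch.cuda.max_memory_allocated() after to also log peak VRAM.
result = benchmark(lambda p: p.upper(), "hello")  # dummy callable for illustration
print(f"avg: {result['avg_seconds']:.6f}s")
```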

🛠️ Usage Example

from transformers import pipeline, BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM
import torch

bnb_config = BitsAndBytesConfig(
  load_in_4bit=True,                      # store weights in 4-bit precision
  bnb_4bit_compute_dtype=torch.float16,   # run matmuls in fp16 for speed
  bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
  bnb_4bit_use_double_quant=True          # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
  "tiiuae/falcon-7b-instruct",
  quantization_config=bnb_config,
  device_map="auto"                       # place layers on available devices automatically
)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct", use_fast=True)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# do_sample=True is needed for temperature/top_k to take effect;
# max_new_tokens counts only generated tokens, unlike max_length.
print(pipe("What is the capital of France?", max_new_tokens=50, do_sample=True, top_k=50, temperature=0.7)[0]["generated_text"])
