A Jupyter Notebook demo for running Falcon‑7B‑Instruct with 4‑bit quantization using BitsAndBytes and Hugging Face, enabling efficient inference on limited GPU resources.
This project demonstrates how to:
- Load Falcon‑7B‑Instruct with 4-bit quantization (via BitsAndBytes)
- Perform inference using Hugging Face’s Transformers pipeline
- Run the model on GPUs with limited VRAM (≤ 8 GB)
- Easily customize generation settings for varied use cases
- Model Loading: Uses `BitsAndBytesConfig` to enable `load_in_4bit=True`, significantly reducing VRAM usage.
- Inference Pipeline: Leverages the Transformers `pipeline("text-generation")` for easy prompt-to-text generation.
- Configurable Generation: Supports tuning of `max_length`, `top_k`, `top_p`, `temperature`, `num_return_sequences`, etc.
- Jupyter-Friendly: All code lives in an interactive notebook, ideal for experimentation and easy sharing.
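To see why 4-bit quantization makes an 8 GB GPU workable, here is a rough back-of-envelope estimate of weight storage alone (the figures are approximate and ignore activation and KV-cache overhead, so actual usage will be somewhat higher):

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a model with n_params parameters."""
    return n_params * bits_per_param / 8 / 1024**3

n = 7e9  # Falcon-7B parameter count (approximate)
for label, bits in [("fp16", 16), ("int8", 8), ("4-bit (NF4)", 4)]:
    print(f"{label}: ~{weight_memory_gib(n, bits):.1f} GiB")
```

At 4 bits per weight the model drops from roughly 13 GiB (fp16) to about 3.3 GiB, leaving headroom for activations on an 8 GB card.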
Install the essential libraries:

```bash
pip install transformers bitsandbytes accelerate einops
```

Add any other libraries the notebook requires (e.g., `torch`).
- Clone the repo:

  ```bash
  git clone https://github.com/tatuskarjaiwanth/quantizedfalcon.git
  cd quantizedfalcon
  ```

- Install dependencies (see above).
- Open and run the notebook:

  ```bash
  jupyter notebook quantizedfalcon.ipynb
  ```
- Modify prompt settings as desired and execute cells for quantized inference.
- BitsAndBytesConfig Setup: Configures 4-bit quantization with options like `bnb_4bit_compute_dtype`, `bnb_4bit_quant_type`, and `bnb_4bit_use_double_quant`.
- Model Load: Loads `tiiuae/falcon-7b-instruct` with quantization and `device_map="auto"`.
- Inference Pipeline: Builds a `pipeline()` for text generation.
- Demo Prompts: Try sample prompts with different settings to see quantized performance.
- Add benchmarking cells (execution time, VRAM usage)
- Integrate with Hugging Face Inference API or TGI server
- Explore quantization alternatives: 8-bit, NF4 vs FP4, GPT-Q
- Extend to fine-tuning with PEFT or QLoRA on custom datasets
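For the benchmarking idea above, a simple helper cell could time a generation call and report peak CUDA memory. This is only a sketch: the `benchmark` function name is illustrative, and the `torch` import is guarded so the timing part also runs without a GPU or without PyTorch installed.

```python
import time

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # torch not installed; timing still works, VRAM reads as 0
    _HAS_CUDA = False

def benchmark(fn, *args, **kwargs):
    """Time a callable and report peak CUDA memory in GiB (0.0 without a GPU)."""
    if _HAS_CUDA:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if _HAS_CUDA:
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3 if _HAS_CUDA else 0.0
    return result, elapsed, peak_gib

# Example usage in the notebook (pipe defined as in the code below):
# out, secs, gib = benchmark(pipe, "Explain quantization.", max_length=50)
# print(f"{secs:.2f}s, peak VRAM {gib:.2f} GiB")
```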
```python
from transformers import pipeline, BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM
import torch

# Configure 4-bit NF4 quantization with double quantization and fp16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Load the quantized model; device_map="auto" lets Accelerate place layers
# across the available devices
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct", use_fast=True)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# do_sample=True is required for top_k and temperature to take effect
print(pipe(
    "What is the capital of France?",
    max_length=50,
    do_sample=True,
    top_k=50,
    temperature=0.7,
)[0]["generated_text"])
```