Generate comprehensive, dense portrait descriptions using a vision‑language model.
This repository provides a command-line tool (`infer.py`) that:

- Loads the `Qwen2.5-VL-7B-Instruct-abliterated` model
- Processes images (supported formats: `.png`, `.jpg`, `.jpeg`, `.webp`, `.bmp`, `.gif`)
- Produces rich descriptive paragraphs covering emotional expression, posture, clothing or nudity, body type, hair, and environmental context
- Outputs one `.txt` file per image
```bash
git clone https://github.com/anto18671/image-to-dense-caption.git
cd image-to-dense-caption
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
git lfs install
git clone https://huggingface.co/huihui-ai/Qwen2.5-VL-7B-Instruct-abliterated
```

Ensure the folder `Qwen2.5-VL-7B-Instruct-abliterated` sits in the same directory as `infer.py`.
- Place your images in a subfolder (default: `images/`)
- Run the script:

```bash
python infer.py
```

This will:
- Scan the folder for valid image files
- Generate a `.txt` file with dense descriptions for each image
```
images/photo1.jpg    → images/photo1.txt
images/portrait.webp → images/portrait.txt
```
Each `.txt` file includes a paragraph describing emotional expression, posture, clothing/nudity status, body type, hair, and environment.
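The input-to-output naming above can be sketched with `pathlib`. Note that `caption_path` is a hypothetical helper for illustration, not a function from `infer.py`:

```python
from pathlib import Path

def caption_path(image_path: str) -> str:
    """Derive the output .txt path by swapping the image extension."""
    return str(Path(image_path).with_suffix(".txt"))

print(caption_path("images/photo1.jpg"))     # images/photo1.txt
print(caption_path("images/portrait.webp"))  # images/portrait.txt
```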
- GPU: preferably use a GPU with ≥ 16 GB VRAM.
- Memory options:
  - Use 8-bit quantization (via `bitsandbytes`) for lower VRAM.
  - Switch to `torch_dtype=torch.float16` if supported by your setup.
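These memory options boil down to how the model is loaded. The sketch below shows both; the model class name and the loading flags are assumptions based on recent `transformers` releases, not code taken from `infer.py`:

```python
# Sketch only: loading the model with a reduced memory footprint.
# Assumes a recent transformers release plus bitsandbytes installed;
# the exact class used by infer.py may differ.
import torch
from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

model_path = "Qwen2.5-VL-7B-Instruct-abliterated"

# Option 1: half precision (roughly halves VRAM versus float32)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Option 2: 8-bit quantization via bitsandbytes (lower VRAM still)
model_8bit = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```

This is a configuration sketch: it only runs once the model weights from the Hugging Face clone step are in place.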
- Custom folder: change the `image_folder` path in `infer.py` if needed.
- OSError / model not found: confirm the model folder is correctly named and in place.
- CUDA out-of-memory:
  - Reduce VRAM usage by quantizing the model.
  - Run on CPU by removing `.to("cuda")`; this will be slower.
- Non-image files: unsupported extensions are automatically skipped.
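The skip behaviour can be sketched as a case-insensitive suffix check. The set below mirrors the supported formats listed at the top of this README; the actual filtering code in `infer.py` may differ:

```python
from pathlib import Path

# Supported extensions from the feature list above (assumed check).
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".bmp", ".gif"}

def is_supported_image(path: str) -> bool:
    """Return True if the file's extension is a supported image format."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS

print(is_supported_image("images/photo1.JPG"))  # True (case-insensitive)
print(is_supported_image("images/notes.txt"))   # False, so it is skipped
```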
- This script is released under the MIT License.
- Model usage is subject to Hugging Face terms (see `huihui-ai/Qwen2.5-VL-7B-Instruct-abliterated` for details).