This document covers the complete data preparation pipeline for Being-VL training, including VQ token extraction, vBPE training, and dataset tokenization.
Before starting, ensure you have:
- Set up the workspace directory structure (see README.md)
- Installed Being-VL and dependencies
- Downloaded the required pre-trained models:
  - Llama-3.1-8B checkpoint
  - Being VQ-GAN model (8K vocabulary)
  - Being tokenizer config: `beingvl/config/being-tokenizer-config`
- Downloaded raw image datasets for visual BPE tokenizer training
- Downloaded caption/conversation datasets for PT/SFT training (same format as LLaVA)

Due to data licensing restrictions, we cannot provide direct downloads; you will need to obtain these datasets yourself. Details of the data components can be found in our paper.
The first step is preparing VQ tokens for training the visual BPE tokenizer.
Convert your image dataset to VQ tokens using the pre-trained VQ-GAN model. This creates a compressed representation of images as discrete tokens.

```python
import torch
import numpy as np
from PIL import Image
from transformers import ChameleonProcessor, ChameleonVQVAE

# Load processor and VQ-GAN model (using consistent workspace paths)
processor = ChameleonProcessor.from_pretrained("/path/to/your/workspace/models/being-tokenizer")
vqgan = ChameleonVQVAE.from_pretrained(
    "/path/to/your/workspace/models/BeingVL-VQ-8K",
    torch_dtype=torch.bfloat16,
)
vqgan.eval()

# Prepare your image list from workspace
images_list = [
    "/path/to/your/workspace/data/images/image1.jpg",
    "/path/to/your/workspace/data/images/image2.jpg",
    "/path/to/your/workspace/data/images/image3.jpg",
]

# Load and process images
images = [Image.open(image_path).convert('RGB') for image_path in images_list]
images_pixels = processor.image_processor(
    images,
    return_tensors='pt'
)["pixel_values"].to(vqgan.device, vqgan.dtype)

# Encode images to discrete VQ token indices (no gradients needed for inference)
with torch.no_grad():
    _, _, toks = vqgan.encode(images_pixels)

# Save tokens to workspace vq_tokens directory.
# Expected shape: (num_images, 32, 32); some transformers versions return
# flattened indices, in which case reshape: toks = toks.reshape(-1, 32, 32)
toks = toks.cpu().numpy()
np.save("/path/to/your/workspace/data/vq_tokens/train_tokens.npy", toks)
print(f"Saved {toks.shape[0]} image tokens with shape {toks.shape}")
```
Once you have VQ tokens, train the visual BPE tokenizer to create an extended vocabulary for efficient image representation.

```bash
python beingvl/train/train_vbpe.py \
    --data_path /path/to/your/workspace/data/vq_tokens/train_tokens.npy \
    --output_path /path/to/your/workspace/data/vbpe/vbpe.pkl \
    --num_merges 8192 \
    --init_size 8192 \
    --vocab_pad 128256 \
    --vocab_end 136448 \
    --validate_data
```

Training Parameters:
- `--num_merges`: Number of BPE merge operations (default: 8192)
- `--init_size`: Initial vocabulary size (8192 for VQ tokens)
- `--vocab_pad`: Padding to the end of the original text vocabulary (128256 for LLaMA 3)
- `--vocab_end`: End position for VQ tokens before BPE tokens start (136448)
- `--validate_data`: Validate input data format before training
- `--save_intermediate`: Save intermediate training statistics

Note that `--vocab_end` equals `--vocab_pad` plus `--init_size` (128256 + 8192 = 136448): the 8K VQ tokens sit directly after the text vocabulary, and the merged BPE tokens start at 136448.
Expected Output:
- `vbpe.pkl` file containing the trained visual BPE tokenizer
- Training logs showing merge operations and compression statistics
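To make the merge operation concrete, the toy sketch below applies byte-pair-style merging to flattened VQ token sequences: the most frequent adjacent token pair is replaced by a new token id, and repeating this `--num_merges` times builds the extended vocabulary. This is illustrative only; the `most_frequent_pair`/`apply_merge` helpers are our own simplification and ignore details of the actual `train_vbpe.py` implementation, such as the 2D structure of the token grid and the vocabulary offsets above.

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Count adjacent token pairs across all sequences and return the top one."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0]

def apply_merge(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Toy corpus: three "images" flattened to 1D VQ token sequences
seqs = [[5, 7, 5, 7, 2], [5, 7, 1], [2, 5, 7]]
pair = most_frequent_pair(seqs)                    # (5, 7) occurs most often
seqs = [apply_merge(s, pair, new_id=8192) for s in seqs]
print(seqs)                                        # [[8192, 8192, 2], [8192, 1], [2, 8192]]
```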
After vBPE training, prepare your datasets for the 3-stage training pipeline.
Being-VL accepts data in LLaVA format. See LLaVA Data for format details.
Input Format Example:

```json
{
  "image": "coco/train2017/000000175217.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nWhat is the color of the sink and vanity in the bathroom?"
    },
    {
      "from": "gpt",
      "value": "The sink and vanity in the bathroom are beige in color."
    }
  ]
}
```

Output Format (after tokenization):
```json
{
  "image": "coco/train2017/000000175217.jpg",
  "conversations": [...],
  "token_0": [128000, 128260],
  "token_1": [128258, 3923, 374, ...],
  "token_image_new": [131257, 130547, 135495, ...]
}
```

Convert your prepared datasets into tokenized format for training:
```bash
# For pretraining datasets
python beingvl/utils/tokenize_dataset.py \
    --mode pt \
    --json_path /path/to/your/workspace/data/annotations/pretrain_data.json \
    --image_path /path/to/your/workspace/data/images \
    --output_dir /path/to/your/workspace/data/tokenized/pt \
    --model_path /path/to/your/workspace/models/beingvl/base \
    --vbpe_path /path/to/your/workspace/data/vbpe/vbpe.pkl \
    --dataset_name "pretrain_data" \
    --batch_size 32 \
    --num_workers 32 \
    --torch_dtype bfloat16

# For fine-tuning datasets (Stage 2)
python beingvl/utils/tokenize_dataset.py \
    --mode sft \
    --json_path /path/to/your/workspace/data/annotations/sft_stage2_data.json \
    --image_path /path/to/your/workspace/data/images \
    --output_dir /path/to/your/workspace/data/tokenized/sft_stage2 \
    --model_path /path/to/your/workspace/models/beingvl/base \
    --vbpe_path /path/to/your/workspace/data/vbpe/vbpe.pkl \
    --dataset_name "sft_stage2_data" \
    --batch_size 32 \
    --num_workers 32 \
    --torch_dtype bfloat16

# For fine-tuning datasets (Stage 3)
python beingvl/utils/tokenize_dataset.py \
    --mode sft \
    --json_path /path/to/your/workspace/data/annotations/sft_stage3_data.json \
    --image_path /path/to/your/workspace/data/images \
    --output_dir /path/to/your/workspace/data/tokenized/sft_stage3 \
    --model_path /path/to/your/workspace/models/beingvl/base \
    --vbpe_path /path/to/your/workspace/data/vbpe/vbpe.pkl \
    --dataset_name "sft_stage3_data" \
    --batch_size 32 \
    --num_workers 32 \
    --torch_dtype bfloat16
```

Key parameters:
- `--mode`: Set to `pt` for pretraining or `sft` for fine-tuning
- `--model_path`: Use the initialized beingvl-base model (from the Step 1 model initialization)
- `--vbpe_path`: Path to the trained vBPE tokenizer (from the vBPE training step)
- `--dataset_name`: Custom name for the dataset (used in output filenames)
- `--batch_size`: Batch size for tokenization (adjust based on your GPU memory)
- `--num_workers`: Number of parallel workers for data loading (adjust based on your CPU cores)
Output files:
- `*_base.jsonl`: Base tokenization without vBPE compression
- `*_vbpe.jsonl`: Enhanced tokenization with vBPE compression (use this for training)
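As a final check before training, you can inspect the first record of a tokenized output file. This is a minimal sketch; the filename is an assumption based on `--dataset_name "pretrain_data"` and the `*_vbpe.jsonl` pattern above, so adjust it to whatever the script actually writes:

```python
import json

# Assumed filename: "<dataset_name>_vbpe.jsonl" under the chosen --output_dir
path = "/path/to/your/workspace/data/tokenized/pt/pretrain_data_vbpe.jsonl"
with open(path) as f:
    record = json.loads(f.readline())

print(sorted(record.keys()))           # expect image, conversations, token_* fields
print(len(record["token_image_new"]))  # image token count after vBPE compression
```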