# SFT Data Preprocessing

## 1. 🐳 Docker (Recommended)

We strongly recommend using the Docker environment for a seamless experience.

```bash
# Clone the repository
git clone https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5.git
cd LLaVA-OneVision-1.5

# Build the image
docker build -t llava_megatron:25.04 .

# Run the container; -w sets the working directory to the mounted volume
docker run -it --gpus all \
    --ipc host --net host --privileged --cap-add IPC_LOCK \
    --ulimit memlock=-1 --ulimit stack=67108864 --rm \
    -v $(pwd):/workspace/LLaVA-OneVision-1.5 \
    -w /workspace/LLaVA-OneVision-1.5 \
    --name "llava_megatron_container" \
    llava_megatron:25.04 /bin/bash
```
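
Once inside the container, you can optionally confirm that the GPUs are visible:

```bash
nvidia-smi
```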

## 2. Data Download

Download the LLaVA-NeXT-780k dataset from [LLaVA-NeXT-780k](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data).
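
One way to fetch it, assuming the Hugging Face CLI is installed (the local directory name is a suggestion; the processing script below expects the parquet files under `LLaVA-NeXT-Data/data`):

```bash
pip install -U "huggingface_hub[cli]"

# Download the dataset repo into ./LLaVA-NeXT-Data
huggingface-cli download lmms-lab/LLaVA-NeXT-Data \
    --repo-type dataset \
    --local-dir LLaVA-NeXT-Data
```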

## 3. Execute the WebDataset Conversion

### 3.1. Raw Data Processing

The script below converts the downloaded parquet files into the LLaVA-OneVision format: it extracts the embedded images to disk and merges all conversations into a single JSON file.

```python
import os
import json
import pandas as pd
from PIL import Image
from io import BytesIO
from tqdm import tqdm

# Configuration
PARQUET_DIR = "LLaVA-NeXT-Data/data"
OUTPUT_IMAGE_DIR = "images"
OUTPUT_JSON_FILE = "mllm_mix.json"

os.makedirs(OUTPUT_IMAGE_DIR, exist_ok=True)

merged_data = []
for filename in tqdm(sorted(f for f in os.listdir(PARQUET_DIR) if f.endswith('.parquet'))):
    df = pd.read_parquet(os.path.join(PARQUET_DIR, filename),
                         columns=['id', 'conversations', 'image'])

    for _, row in df.iterrows():
        # Text-only samples carry no image payload
        if row['image'] is None:
            merged_data.append({
                "id": row['id'],
                "messages": row['conversations'].tolist()
            })
            continue

        # Decode the embedded image bytes and write them to disk
        img = Image.open(BytesIO(row['image']['bytes']))
        ext = 'jpg' if img.format in ['JPEG', 'JPG'] else 'png'
        img_name = f"{row['id']}.{ext}"
        img_path = os.path.join(OUTPUT_IMAGE_DIR, img_name)
        # Some ids contain subdirectories, so create parents as needed
        os.makedirs(os.path.dirname(img_path), exist_ok=True)
        img.save(img_path)
        merged_data.append({
            "id": row['id'],
            "messages": row['conversations'].tolist(),
            "images": [img_name]
        })

with open(OUTPUT_JSON_FILE, 'w', encoding='utf-8') as f:
    json.dump(merged_data, f, ensure_ascii=False, indent=4)
```
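
The resulting `mllm_mix.json` is a flat list of samples. An illustrative entry (the values here are made up; the `from`/`value` message fields follow the usual LLaVA conversation style):

```json
{
    "id": "000000033471",
    "messages": [
        {"from": "human", "value": "<image>\nWhat is unusual about this image?"},
        {"from": "gpt", "value": "The man is ironing clothes on the back of a moving taxi."}
    ],
    "images": ["000000033471.jpg"]
}
```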

### 3.2. WebDataset Generation

Run the conversion script, providing the necessary arguments for your data type.

```bash
python tools/data_preprocess/convert_to_webdataset.py \
    --output_dir wds \
    --json_file mllm_mix.json \
    --image_dir images \
    --maxcount 10000
```

**Key parameters**

| Parameter | Type | Required | Description |
| :--- | :--- | :--- | :--- |
| **`--output_dir`** | `str` | Yes | Directory in which to save the final WebDataset shards. |
| **`--json_file`** | `str` | Yes | Path to the main JSON file containing the dataset metadata. |
| **`--image_dir`** | `str` | No | Directory containing the image files. Required if **media** is **image** or **mix**. |
| **`--video_dir`** | `str` | No | Directory containing the video files. Required if **media** is **video** or **mix**. |
| **`--media`** | `str` | No | Type of media to process: **image**, **video**, or **mix** (default: **mix**). |
| **`--maxcount`** | `int` | No | Maximum number of samples per WebDataset shard (default: 10000). |
| **`--maxsize`** | `int` | No | Maximum byte size of each shard (default: 3 GB). |
| **`--columns_messages`** | `str` | No | Key in each JSON entry that holds the conversational messages (default: **messages**). |
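
To sanity-check the output, you can iterate over the generated shards with the `webdataset` library. This is a minimal sketch; the shard naming pattern is an assumption, so adjust the glob to whatever `convert_to_webdataset.py` actually writes:

```python
import glob
import webdataset as wds

# Shard file pattern is an assumption; adjust to the script's output naming
shards = sorted(glob.glob("wds/*.tar"))

# Each sample is a dict keyed by the file extensions stored in the shard
dataset = wds.WebDataset(shards)
for sample in dataset:
    print(sample["__key__"], list(sample.keys()))
    break
```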