# SFT Data Preprocessing

## 1. 🐳 Docker (Recommended)

We strongly recommend using the Docker environment for a seamless experience.

```bash
# Clone repository
git clone https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5.git
cd LLaVA-OneVision-1.5

# Build the image
docker build -t llava_megatron:25.04 .

# Run container with -w to set the working directory directly to the mounted volume
docker run -it --gpus all \
    --ipc host --net host --privileged --cap-add IPC_LOCK \
    --ulimit memlock=-1 --ulimit stack=67108864 --rm \
    -v $(pwd):/workspace/LLaVA-OneVision-1.5 \
    -w /workspace/LLaVA-OneVision-1.5 \
    --name "llava_megatron_container" \
    llava_megatron:25.04 /bin/bash
```

## 2. Data Download

Download the LLaVA-NeXT-780k webdataset from [LLaVA-NeXT-780k](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data).

## 3. Execute the WebDataset Conversion

### 3.1. Raw Data Processing

Here we provide code guidance for converting the dataset into the LLaVA-OneVision format.

```python
import os
import json
import pandas as pd
from PIL import Image
from io import BytesIO
from tqdm import tqdm

# Configuration
PARQUET_DIR = "LLaVA-NeXT-Data/data"
OUTPUT_IMAGE_DIR = "images"
OUTPUT_JSON_FILE = "mllm_mix.json"

os.makedirs(OUTPUT_IMAGE_DIR, exist_ok=True)

merged_data = []
for filename in tqdm(sorted(f for f in os.listdir(PARQUET_DIR) if f.endswith('.parquet'))):
    df = pd.read_parquet(os.path.join(PARQUET_DIR, filename),
                         columns=['id', 'conversations', 'image'])

    for _, row in df.iterrows():
        # Text-only samples carry no image payload
        if row['image'] is None:
            merged_data.append({
                "id": row['id'],
                "messages": row['conversations'].tolist()
            })
            continue

        # Decode the embedded image bytes and write them to disk
        img = Image.open(BytesIO(row['image']['bytes']))
        ext = 'jpg' if img.format in ['JPEG', 'JPG'] else 'png'
        img_name = f"{row['id']}.{ext}"
        img_path = os.path.join(OUTPUT_IMAGE_DIR, img_name)
        # Some ids contain subdirectories, so create parents as needed
        os.makedirs(os.path.dirname(img_path), exist_ok=True)
        img.save(img_path)
        merged_data.append({
            "id": row['id'],
            "messages": row['conversations'].tolist(),
            "images": [img_name]
        })

with open(OUTPUT_JSON_FILE, 'w', encoding='utf-8') as f:
    json.dump(merged_data, f, ensure_ascii=False, indent=4)
```
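
Before moving on to shard generation, it can help to sanity-check the records the script emits. The sketch below is a minimal, hypothetical validator (not part of the repository's tooling) that assumes the LLaVA-style turn format of `{"from": ..., "value": ...}` dictionaries shown above:

```python
def validate_record(record):
    """Check that a merged record matches the expected schema:
    'id' and 'messages' are always present; 'images' only for image samples."""
    if not isinstance(record.get("id"), str):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    # Each turn should be a dict carrying at least 'from' and 'value'
    if not all(isinstance(m, dict) and {"from", "value"} <= m.keys() for m in messages):
        return False
    # 'images' is optional, but must be a list of filenames when present
    if "images" in record and not isinstance(record["images"], list):
        return False
    return True

# Hypothetical record in the shape produced by the conversion script
sample = {
    "id": "sample_0001",
    "messages": [
        {"from": "human", "value": "<image>\nWhat is shown?"},
        {"from": "gpt", "value": "A cat."},
    ],
    "images": ["sample_0001.jpg"],
}
print(validate_record(sample))  # True
```

Running this over a few entries of `mllm_mix.json` before conversion can catch empty conversations or malformed turns early.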

### 3.2. WebDataset Generation

Run the conversion script, providing the necessary arguments based on your data type.

```bash
python tools/data_preprocess/convert_to_webdataset.py \
    --output_dir wds \
    --json_file mllm_mix.json \
    --image_dir images \
    --maxcount 10000
```

Key parameters:

| Parameter | Type | Required | Description |
| :--- | :--- | :--- | :--- |
| **`--output_dir`** | `str` | Yes | The directory to save the final WebDataset files. |
| **`--json_file`** | `str` | Yes | The path to the main JSON file containing the dataset metadata. |
| **`--image_dir`** | `str` | No | The directory containing the image files. Required if **media** is **image** or **mix**. |
| **`--video_dir`** | `str` | No | The directory containing the video files. Required if **media** is **video** or **mix**. |
| **`--media`** | `str` | No | The type of media to process: **image**, **video**, or **mix** (default: **mix**). |
| **`--maxcount`** | `int` | No | Maximum number of samples per WebDataset shard (default: 10000). |
| **`--maxsize`** | `int` | No | Maximum byte size of each shard (default: 3 GB). |
| **`--columns_messages`** | `str` | No | The key in the JSON entry that holds the conversational messages (default: **messages**). |
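
WebDataset shards are plain tar archives in which members sharing the same key (the filename prefix before the first dot, e.g. `0001.json` and `0001.jpg`) form one sample. As a rough illustration (this is not the repository's tooling, and the shard contents here are invented), you can inspect a shard with the standard library and group members back into samples:

```python
import io
import tarfile
from collections import defaultdict

def group_by_key(names):
    """Group tar member names into samples by the prefix before the
    first dot, the way WebDataset pairs '0001.json' with '0001.jpg'."""
    samples = defaultdict(set)
    for name in names:
        key, _, ext = name.partition(".")
        samples[key].add(ext)
    return dict(samples)

# Build a tiny in-memory tar mimicking one shard (hypothetical contents);
# for a real shard you would open e.g. wds/00000.tar from --output_dir
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [("0001.json", b"{}"), ("0001.jpg", b"\xff\xd8"),
                          ("0002.json", b"{}")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    samples = group_by_key(tar.getnames())

for key in sorted(samples):
    print(key, sorted(samples[key]))
# 0001 ['jpg', 'json']
# 0002 ['json']
```

A quick listing like this confirms that every image sample in a shard carries both its `.json` metadata and its media file.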
