
Commit 2436203

NanoCode012, winglian, and SalmanMohammadi authored

fix: force train split for json,csv,txt for test_datasets and misc doc changes (#3226)

* fix: force train split for json,csv,txt for test_datasets
* feat(doc): add info on mixing datasets for VLM
* feat(doc): max memory
* fix(doc): clarify lr groups
* fix: add info on vision not being dropped
* feat: add qwen3-vl to multimodal docs
* fix: add moe blocks to arch list
* feat(doc): improve mistral docs
* chore: add helpful link [skip-e2e]
* fix: add vram usage for mistral small
* Update link in docs/faq.qmd
  Co-authored-by: salman <[email protected]>

---------

Co-authored-by: Wing Lian <[email protected]>
Co-authored-by: salman <[email protected]>
1 parent 3750fdc commit 2436203

9 files changed: +88 -4 lines

docs/faq.qmd (8 additions, 0 deletions)

@@ -63,6 +63,14 @@ description: Frequently asked questions

> A: There seems to be a wheel issue with FA2 2.8.0 on CUDA 12.4. Try CUDA 12.6 instead or downgrade to FA2 2.7.4. Please refer to the upstream issue: https://github.com/Dao-AILab/flash-attention/issues/1717.

+**Q: Can we mix text and text+image datasets for VLM training?**
+
+> A: Yes, you can for newer VLM architectures. The ones that do not work are the LLaVA / Pixtral architectures. If you notice one not working, please let us know!
+
+**Q: Why is `memory/max_*` different from `nvidia-smi`?**
+
+> A: We use `torch` APIs to retrieve this information. See https://docs.pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management for more information.
+
### Chat templates

**Q: `jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'content' / 'role' / ____`**
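To make the `memory/max_*` answer above concrete, here is a minimal standalone sketch (not part of the diff) of reading peak memory from the `torch` CUDA APIs the FAQ links to. The dictionary keys are illustrative only; the exact counters Axolotl logs may differ.

```python
import torch


def peak_cuda_memory_gib(device: int = 0) -> dict:
    """Peak CUDA memory as tracked by torch's caching allocator.

    These counters only cover memory that PyTorch itself manages, which is
    why they can be lower than what nvidia-smi shows: nvidia-smi also counts
    the CUDA context, other processes, and allocator fragmentation.
    """
    return {
        # Peak memory occupied by live tensors.
        "max_allocated_gib": torch.cuda.max_memory_allocated(device) / 2**30,
        # Peak memory reserved by the caching allocator (>= allocated).
        "max_reserved_gib": torch.cuda.max_memory_reserved(device) / 2**30,
    }


if __name__ == "__main__":
    if torch.cuda.is_available():
        x = torch.randn(1024, 1024, device="cuda")  # allocate something
        print(peak_cuda_memory_gib())
```

`nvidia-smi`, by contrast, reports device-wide usage including the CUDA context and any other processes, so its numbers are expected to be higher.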

docs/lr_groups.qmd (6 additions, 0 deletions)

@@ -27,3 +27,9 @@ learning_rate: 2e-5

In this example, we have a default learning rate of 2e-5 across the entire model, but a separate learning rate
of 1e-6 for all the self attention `o_proj` modules across all layers, and a learning rate of 1e-5 for the 3rd layer's
self attention `q_proj` module.
+
+::: {.callout-note}
+
+We currently only support varying `lr`. If you're interested in adding support for other options (e.g. `weight_decay`), we welcome PRs. See https://github.com/axolotl-ai-cloud/axolotl/blob/613bcf90e58f3ab81d3827e7fc572319908db9fb/src/axolotl/core/trainers/mixins/optimizer.py#L17
+
+:::
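As background for the callout above, per-module learning rates ultimately map onto PyTorch optimizer parameter groups. The following is a minimal standalone sketch of that mechanism with a toy model mirroring the doc's `o_proj` example; it is not Axolotl's optimizer mixin, and the module names are placeholders.

```python
import torch
from torch import nn

# Toy stand-in for a model with attention projection modules.
model = nn.ModuleDict(
    {
        "q_proj": nn.Linear(16, 16),
        "o_proj": nn.Linear(16, 16),
        "up_proj": nn.Linear(16, 16),
    }
)

# Split parameters: `o_proj` gets its own learning rate, everything else
# falls back to the optimizer-wide default.
o_proj_params = [p for n, p in model.named_parameters() if "o_proj" in n]
default_params = [p for n, p in model.named_parameters() if "o_proj" not in n]

optimizer = torch.optim.AdamW(
    [
        {"params": default_params},             # uses the default lr (2e-5)
        {"params": o_proj_params, "lr": 1e-6},  # per-group override
    ],
    lr=2e-5,
)

for group in optimizer.param_groups:
    print(len(group["params"]), group["lr"])
```

Axolotl builds groups like these from the `lr_groups` config; per the callout, only `lr` can currently be overridden per group.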

docs/multimodal.qmd (13 additions, 1 deletion)

@@ -56,10 +56,14 @@ image_resize_algorithm: bilinear

Please see the [examples](https://github.com/axolotl-ai/axolotl/tree/main/examples) folder for full configs.

-::: {.callout-warning}
+::: {.callout-tip}
Some of our chat_templates have been extended to support broader dataset types. This should not break any existing configs.
:::

+::: {.callout-note}
+As of now, we do not truncate or drop samples based on `sequence_len`, since each architecture processes non-text tokens differently. We are looking for help on this.
+:::
+
### Mllama {#sec-mllama}

```yaml

@@ -168,6 +172,14 @@ base_model: Qwen/Qwen2.5-VL-7B-Instruct
chat_template: qwen2_vl # same as qwen2-vl
```

+### Qwen3-VL {#sec-qwen3-vl}
+
+```yaml
+base_model: Qwen/Qwen3-VL-4B-Instruct
+
+chat_template: qwen2_vl # same as qwen2-vl
+```
+
### SmolVLM2 {#sec-smolvlm2}

::: {.callout-tip}

examples/magistral/think/README.md (1 addition, 1 deletion)

@@ -12,7 +12,7 @@ Before starting, ensure you have:
Run the thinking model fine-tuning:

```bash
-axolotl train magistral-small-think-qlora.yaml
+axolotl train examples/magistral/think/magistral-small-think-qlora.yaml
```

This config uses about 19.1 GiB VRAM.

examples/magistral/vision/README.md (1 addition, 1 deletion)

@@ -21,7 +21,7 @@ Before starting, ensure you have:

3. Run the fine-tuning:
```bash
-axolotl train magistral-small-vision-24B-qlora.yml
+axolotl train examples/magistral/vision/magistral-small-vision-24B-qlora.yml
```

This config uses about 17 GiB VRAM.
examples/mistral/mistral-small/README.md (new file, 51 lines added)

# Mistral Small 3.1/3.2 Fine-tuning

This guide covers fine-tuning [Mistral Small 3.1](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503) and [Mistral Small 3.2](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506) with vision capabilities using Axolotl.

## Prerequisites

Before starting, ensure you have:

- Installed Axolotl (see the [Installation docs](https://docs.axolotl.ai/docs/installation.html))

## Getting Started

1. Install the required vision library:
```bash
pip install 'mistral-common[opencv]==1.8.5'
```

2. Download the example dataset image:
```bash
wget https://huggingface.co/datasets/Nanobit/text-vision-2k-test/resolve/main/African_elephant.jpg
```

3. Run the fine-tuning:
```bash
axolotl train examples/mistral/mistral-small/mistral-small-3.1-24B-lora.yml
```

This config uses about 29.4 GiB VRAM.

## Dataset Format

The vision model requires the multi-modal dataset format documented [here](https://docs.axolotl.ai/docs/multimodal.html#dataset-format).

One exception: passing `"image": PIL.Image` is not supported. `MistralTokenizer` only supports `path`, `url`, and `base64` for now.

Example:
```json
{
  "messages": [
    {"role": "system", "content": [{"type": "text", "text": "{SYSTEM_PROMPT}"}]},
    {"role": "user", "content": [
      {"type": "text", "text": "What's in this image?"},
      {"type": "image", "path": "path/to/image.jpg"}
    ]},
    {"role": "assistant", "content": [{"type": "text", "text": "..."}]}
  ]
}
```

## Limitations

- Sample packing is not currently supported for multi-modal training.

examples/mistral/mistral-small/mistral-small-3.1-24B-lora.yml (1 addition, 1 deletion)

@@ -39,7 +39,7 @@ wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
-micro_batch_size: 1
+micro_batch_size: 2
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine

src/axolotl/common/architectures.py (2 additions, 0 deletions)

@@ -12,7 +12,9 @@
    "mixtral": "MixtralSparseMoeBlock",
    "qwen2_moe": "Qwen2MoeSparseMoeBlock",
    "qwen3_moe": "Qwen3MoeSparseMoeBlock",
+   "qwen3_vl_moe": "Qwen3VLMoeTextSparseMoeBlock",
    "deepseek_v2": "DeepseekV2MoE",
+   "deepseek_v3": "DeepseekV3MoE",
    "gpt_oss": "GptOssDecoderLayer",
    "lfm2_moe": "Lfm2MoeSparseMoeBlock",
}

src/axolotl/utils/data/shared.py (5 additions, 0 deletions)

@@ -239,6 +239,11 @@ def _load_from_local_path(
        return load_dataset(dataset_config.path, **load_dataset_kwargs)
    elif local_path.is_file():
        dataset_type = get_dataset_type(dataset_config)
+
+        # For single file datasets, HF always creates only a "train" split
+        if dataset_type in ("json", "csv", "text"):
+            load_dataset_kwargs["split"] = "train"
+
        return load_dataset(
            dataset_type,
            data_files=dataset_config.path,
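To illustrate the Hugging Face `datasets` behaviour that the forced `split="train"` relies on, here is a minimal standalone sketch (not part of the diff); the file name is hypothetical.

```python
import json

from datasets import load_dataset

# Hypothetical single-file dataset on disk.
with open("my_eval.json", "w", encoding="utf-8") as f:
    json.dump([{"text": "hello"}, {"text": "world"}], f)

# Without an explicit split, a single local file comes back wrapped in a
# DatasetDict whose only split is "train", even when the file is meant to
# be used as a test/eval set.
as_dict = load_dataset("json", data_files="my_eval.json")
print(list(as_dict.keys()))  # ['train']

# Forcing split="train", as the change above does for json/csv/text files,
# unwraps it to the Dataset itself.
as_dataset = load_dataset("json", data_files="my_eval.json", split="train")
print(as_dataset[0])  # {'text': 'hello'}
```

Because a single-file json/csv/text dataset only ever produces a `train` split, forcing it is safe even when the file is configured under `test_datasets`.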
