A10G not using VRAM after generating training split in AutoTrain #925
Unanswered
LeivurGargiulo asked this question in Q&A
Hi everyone,
I recently purchased a Hugging Face AutoTrain Space with an NVIDIA A10G (24 GB, of which ~22.49 GiB is reported as usable) to fine-tune nothingiisreal/MN-12B-Celeste-V1.9 on josecannete/large_spanish_corpus.
When I start training, AutoTrain first generates the training split. During that step, VRAM usage is basically zero (around 2.88 MiB/22.49 GiB). After the split finishes, the process just stops — no training actually begins, and GPU usage never increases.
I expected VRAM usage to spike when training started, but it seems the job never reaches that stage.
Has anyone else experienced this with AutoTrain + A10G?
Could this be an issue with:
- Dataset size or format?
- The LoRA/PEFT + quantization setup I'm using?
- Some AutoTrain pipeline bug for large models?
Any help would be appreciated. I mainly want to confirm whether this is normal behavior for the split step, and to understand why the actual training never starts.
Thanks in advance!
I want to train Celeste V1.9 to learn Spanish first, then Spanish books with PleIAs/Spanish-PD-Books, and finally Argentine Spanish with ylacombe/google-argentinian-spanish. However, I'm not sure whether my current JSON config for the Spanish corpus is correct, or how to set up the JSON configs for the later steps.
I've already wasted a ton of money and time on this, so I need to know whether there's a solution. Thanks!
This is my JSON:
I also tried this YAML:
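Since my own files may not be right, here is roughly the shape I'm trying to follow. This is only a sketch adapted from the AutoTrain Advanced llm-sft example configs, not my exact file, and the field names and values below are my best guess for this setup:

```yaml
# Sketch of an AutoTrain Advanced llm-sft config for this job.
# Adapted from the public example configs; values are guesses, not my real file.
task: llm-sft
base_model: nothingiisreal/MN-12B-Celeste-V1.9
project_name: celeste-spanish-sft     # placeholder project name
log: tensorboard
backend: local                        # assuming training runs on the Space's own A10G

data:
  path: josecannete/large_spanish_corpus
  train_split: train
  valid_split: null
  chat_template: null                 # plain-text corpus, no chat formatting
  column_mapping:
    text_column: text                 # assuming the corpus's text field is named "text"

params:
  block_size: 1024
  model_max_length: 2048
  epochs: 1
  batch_size: 1
  lr: 2e-4
  peft: true                          # LoRA via PEFT
  quantization: int4                  # 4-bit quantization so a 12B model fits in ~22 GiB
  target_modules: all-linear
  padding: right
  optimizer: paged_adamw_8bit
  scheduler: cosine
  gradient_accumulation: 8
  mixed_precision: bf16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true
```

As far as I understand, a config like this can be launched with `autotrain --config config.yml`, and the later steps (PleIAs/Spanish-PD-Books, then ylacombe/google-argentinian-spanish) would reuse the same structure with a different `data.path`. Please correct me if any of these fields are wrong for an AutoTrain Space.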