Thanks for your great work!
I'm trying to figure out how to train transformers models with DeepSpeed. My code looks like this:
```python
from accelerate import Accelerator

# Qwen3_5ForConditionalGeneration, FlashAdamW, and build_dataloader
# come from my own setup.
accelerator = Accelerator(...)
model = Qwen3_5ForConditionalGeneration.from_pretrained(...)
optimizer = FlashAdamW(model.parameters(), ...)
train_dataloader = build_dataloader(...)

# Let Accelerate wrap the model, optimizer, and dataloader.
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

for batch in train_dataloader:
    with accelerator.accumulate(model):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```
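
If I understand the Accelerate docs correctly, DeepSpeed can be enabled either by running `accelerate config` and selecting DeepSpeed before launching with `accelerate launch`, or by passing a `DeepSpeedPlugin` to the `Accelerator` in code. Below is a minimal sketch of the second option; the `zero_stage` and `gradient_accumulation_steps` values are placeholders I picked, not requirements:

```python
from accelerate import Accelerator, DeepSpeedPlugin

# Sketch only: zero_stage=2 and gradient_accumulation_steps=4 are
# assumed example values, not recommendations.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,                   # ZeRO stage for optimizer/gradient sharding
    gradient_accumulation_steps=4,  # should match the accumulation used in the loop
)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
```

My understanding is that the rest of the training loop above would stay unchanged, since `accelerator.prepare(...)` wraps the model in a DeepSpeed engine and `accelerator.backward(loss)` dispatches to it. Is that the intended way to do this?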