Copilot AI commented on Oct 4, 2025

Problem

Users encountered an IndexError when running finetune_lora and the other finetune scripts on Gemma models (and potentially on any model, depending on sequence length). The error occurred during training whenever a batch's sequence length hit the edge case described below.

Root Cause

When lm_head_chunk_size=128 is used in the model forward pass, logits are returned as a list of chunks. The finetune code applies a shift operation to align predictions with targets:

logits = model(input_ids, lm_head_chunk_size=128)
logits[-1] = logits[-1][..., :-1, :]  # Shift to align output n with token n+1
loss = chunked_cross_entropy(logits, targets[..., 1:])

The bug: when the last chunk has a sequence length of exactly 1, the slicing operation [..., :-1, :] produces a chunk of length 0. This empty chunk then causes chunked_cross_entropy to fail, because PyTorch's split() function does not accept a split size of 0.

This occurs whenever the total sequence length has the form 128*n + 1 (e.g., 1, 129, 257, 385).
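
The condition is easy to reproduce with plain tensors. Below is a minimal sketch (the batch size, sequence length, and vocab size are illustrative, not taken from the PR):

import torch

# A 129-token sequence (128*1 + 1) split into 128-sized chunks,
# mimicking what lm_head_chunk_size=128 produces.
logits = list(torch.randn(1, 129, 64).split(128, dim=1))
print([c.size(1) for c in logits])  # [128, 1] -- the last chunk has length 1
logits[-1] = logits[-1][..., :-1, :]
print(logits[-1].size(1))  # 0 -- this empty chunk breaks the loss computation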

Solution

Added a simple check after the shift operation to remove empty chunks:

logits = model(input_ids, lm_head_chunk_size=128)
# shift the targets such that output n predicts token n+1
logits[-1] = logits[-1][..., :-1, :]
# Remove empty chunks (can happen when last chunk has size 1)
if logits[-1].size(1) == 0:
    logits = logits[:-1]
loss = chunked_cross_entropy(logits, targets[..., 1:])

This ensures all chunks passed to the loss function have non-zero sequence length.
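
As a quick sanity check (a standalone sketch, not code from the PR), the guard leaves exactly as many logit positions as there are shifted targets:

import torch

# 129 input tokens -> targets[..., 1:] has 128 positions
chunks = list(torch.randn(1, 129, 64).split(128, dim=1))
chunks[-1] = chunks[-1][..., :-1, :]  # last chunk shrinks from length 1 to 0
if chunks[-1].size(1) == 0:
    chunks = chunks[:-1]  # the fix: drop the empty chunk
assert sum(c.size(1) for c in chunks) == 128  # matches the shifted targets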

Changes

  • Applied fix to all affected finetune scripts:
    • litgpt/finetune/lora.py
    • litgpt/finetune/adapter.py
    • litgpt/finetune/adapter_v2.py
    • litgpt/finetune/lora_legacy.py
    • extensions/xla/finetune/adapter.py
  • Added a regression test, test_chunked_cross_entropy_with_empty_last_chunk(), to validate the fix (a sketch of what it plausibly covers follows below)
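
The following sketch shows what such a test might look like; it assumes the loss helper lives at litgpt.utils.chunked_cross_entropy, and the exact merged test body may differ:

import torch

from litgpt.utils import chunked_cross_entropy

def test_chunked_cross_entropy_with_empty_last_chunk():
    # 129 tokens: after the shift, the second 128-sized chunk would be empty
    B, T, V = 1, 129, 50
    logits = list(torch.randn(B, T, V).split(128, dim=1))
    targets = torch.randint(0, V, (B, T))
    logits[-1] = logits[-1][..., :-1, :]
    if logits[-1].size(1) == 0:
        logits = logits[:-1]
    loss = chunked_cross_entropy(logits, targets[..., 1:])
    assert torch.isfinite(loss)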

Impact

  • Fixes reported bug: Users can now finetune Gemma and other models without encountering IndexError on edge case sequence lengths
  • Minimal change: Only 3 lines added per file
  • Backward compatible: No changes to existing functionality for normal sequences
  • Model-agnostic: Benefits all models, not just Gemma

Fixes #2020


Original prompt

This section describes the original issue to be resolved.

Issue title: finetune_lora on gemma bug

Bug description

I am trying to use finetune_lora to do PEFT on gemma model, and I have tried:

  • litgpt0.5.8.dev1: gemma-3-12b-it, gemma-3-27b-it
  • litgpt0.5.7: gemma-2-27b-it

both encounter an IndexError. I have also tried other model series such as QwQ and Llama, and they all look fine.
It seems some people have met a similar bug (but on gemma-7b); I am not sure whether it is the same problem.

What operating system are you using?

Linux

LitGPT Version

litgpt0.5.7 & litgpt0.5.8.dev1


Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

[rank: 0] Seed set to 1337
[rank: 1] Seed set to 1337
[rank: 2] Seed set to 1337
[rank: 3] Seed set to 1337
Number of trainable parameters: 10,616,832
Number of non-trainable parameters: 12,772,421,376
The longest sequence length in the train data is 460, the model's maximum sequence length is 460 and context length is 131072
Verifying settings ...
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/demo/miniconda3/envs/dobby_dev/bin/litgpt", line 8, in <module>
[rank1]:     sys.exit(main())
[rank1]:              ^^^^^^
[rank1]:   File "/home/demo/miniconda3/envs/dobby_dev/lib/python3.11/site-packages/litgpt/__main__.py", line 69, in main
[rank1]:     CLI(parser_data)
[rank1]:   File "/home/demo/miniconda3/envs/dobby_dev/lib/python3.11/site-packages/jsonargparse/_cli.py", line 23, in CLI
[rank1]:     return auto_cli(*args, _stacklevel=3, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/demo/miniconda3/envs/dobby_dev/lib/python3.11/site-packages/jsonargparse/_cli.py", line 125, in auto_cli
[rank1]:     return _run_component(component, init.get(subcommand))
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/demo/miniconda3/envs/dobby_dev/lib/python3.11/site-packages/jsonargparse/_cli.py", line 210, in _run_component
[rank1]:     return component(**cfg)
[rank1]:            ^^^^^^^^^^^^^^^^
[rank1]:   File "/home/demo/miniconda3/envs/dobby_dev/lib/python3.11/site-packages/litgpt/finetune/lora.py", line 170, in setup
[rank1]:     fabric.launch(main, devices, seed, config, data, checkpoint_dir, out_dir, train, eval, optimizer, num_nodes)
[rank1]:   File "/home/demo/miniconda3/envs/dobby_dev/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 837, in launch
[rank1]:     return self._wrap_and_launch(function, self, *args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/demo/miniconda3/envs/dobby_dev/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 923, in _wrap_and_launch
[rank1]:     return to_run(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/demo/miniconda3/envs/dobby_dev/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 928, in _wrap_with_setup
[rank1]:     return to_run(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/demo/miniconda3/envs/dobby_dev/lib/python3.11/site-packages/litgpt/finetune/lora.py", line 229, in main
[rank1]:     token_counts = fit(
[rank1]:                    ^^^^
[rank1]:   File "/home/demo/miniconda3/envs/dobby_dev/lib/python3.11/site-packages/litgpt/finetune/lora.py", line 298, in fit
[rank1]:     validate(fabric, model, val_dataloader, dataclasses.replace(eval, max_iters=2), verbose=False)  # sanity check
[rank1]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/demo/miniconda3/envs/dobby_dev/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/demo/miniconda3/envs/dobby_dev/lib/python3.11/site-packages/litgpt/finetune/lora.py", line 426, in validate
[rank1]:     logits = model(input_ids)
[rank1]:              ^^^^^^^^^^^^^^^^
[rank1]:   File "/home/demo/miniconda3/envs/dobby_dev/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/demo/miniconda3/envs/dobby_dev/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^...

Fixes Lightning-AI/litgpt#2020

