
Conversation

@NickLucche (Collaborator) commented Aug 31, 2025

Add LoRA support to gemma3n.

FIX #21746

@NickLucche requested a review from hmellor as a code owner August 31, 2025 09:43
@mergify bot added the documentation label Aug 31, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request adds LoRA support for the Gemma3n model. The changes involve updating the Gemma3nForConditionalGeneration model to inherit from SupportsLoRA and updating the documentation to reflect this new capability.

My review identifies two main issues:

  1. The LoRA support for Gemma3nForConditionalGeneration appears incomplete as it does not include support for embedding layers, which is a crucial part of LoRA fine-tuning.
  2. The documentation is updated to claim LoRA support for Gemma3nForCausalLM, but the corresponding implementation changes are missing in this PR, leading to an inconsistency.

Both issues are of high severity and should be addressed to ensure complete and consistent feature implementation.
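
For reference, declaring LoRA support in vLLM is done by mixing SupportsLoRA into the model class; a minimal sketch follows, where the packed_modules_mapping contents are illustrative assumptions rather than the actual diff in this PR:

from torch import nn

from vllm.model_executor.models.interfaces import SupportsLoRA


class Gemma3nForConditionalGeneration(nn.Module, SupportsLoRA):
    # Maps vLLM's fused linear layers back to the per-projection module
    # names LoRA adapters are trained against (illustrative values only;
    # the real mapping lives in the gemma3n model file).
    packed_modules_mapping = {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
        "gate_up_proj": ["gate_proj", "up_proj"],
    }
    ...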

@pratapyash (Contributor)

No, it does not seem to work. Loading up the vLLM server with a LoRA adapter fails during torch compilation / CUDA graph generation. The adapter I used only targets language/text modules.

I also raised a bug issue with detailed context about the problem: #23970

Attached are some log files reproducing the issue on my side, on the latest commit of this PR: dec277bc07d2c9368d88ee7c69ce2e2f7a97c6d5

  1. With compilation == level 3, cudagraph == piecewise
python -m vllm.entrypoints.openai.api_server \
  --model "$MODEL" \
  --host "$HOST" \
  --port "$PORT" \
  --gpu-memory-utilization "$GPU_MEM_UTIL" \
  --tensor-parallel-size "$TENSOR_PARALLEL_SIZE" \
  --max-model-len "$MAX_MODEL_LEN" \
  --trust-remote-code \
  --enable-lora \
  --max-lora-rank 32 \
  --lora-modules "$LORA_NAME=$LORA_PATH" \
  --served-model-name "$MODEL" \
  --compilation-config '{"level": 3, "cudagraph_mode": "PIECEWISE"}' \
  2>&1 | tee "$LOG_FILE"

vllm_server_lora_lev3_compilation.log

  2. With compilation == level 0, cudagraph == none
python -m vllm.entrypoints.openai.api_server \
  --model "$MODEL" \
  --host "$HOST" \
  --port "$PORT" \
  --gpu-memory-utilization "$GPU_MEM_UTIL" \
  --tensor-parallel-size "$TENSOR_PARALLEL_SIZE" \
  --max-model-len "$MAX_MODEL_LEN" \
  --trust-remote-code \
  --enable-lora \
  --max-lora-rank 32 \
  --lora-modules "$LORA_NAME=$LORA_PATH" \
  --served-model-name "$MODEL" \
  --compilation-config '{"level": 0, "cudagraph_mode": "None"}' \
  2>&1 | tee "$LOG_FILE"

vllm_server_lora.log

@pratapyash (Contributor) commented Aug 31, 2025

@NickLucche, getting LoRA support for the audio and vision encoder modules, too, would be great.

@pratapyash (Contributor)

I checked running a LoRA adapter on the latest commit of this PR; the server does not load up. https://github.com/vllm-project/vllm/pull/24003/commits/ccafe5e0da8d6b36a1858567e9bd65b970495016

(EngineCore_0 pid=5359)   File "/home/ubuntu/user/exp/vllm/vllm/v1/worker/utils.py", line 150, in sanity_check_mm_encoder_outputs
(EngineCore_0 pid=5359)     assert len(mm_embeddings) == expected_num_items, (
(EngineCore_0 pid=5359)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=5359) AssertionError: Expected number of multimodal embeddings to match number of input items: 7, but got len(mm_embeddings)=1792 instead. This is most likely due to the incorrect implementation of the model's `get_multimodal_embeddings` method.
[rank0]:[W901 09:45:15.516942456 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

vllm_server_lora.log

Scripts to reproduce.
run_server.sh
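
For context, the failing check can be reconstructed from the traceback above as the following sketch (a paraphrase of the helper in vllm/v1/worker/utils.py, not its exact source):

def sanity_check_mm_encoder_outputs(mm_embeddings, expected_num_items: int) -> None:
    # The worker expects one embedding entry per multimodal input item;
    # returning a flat per-token tensor (1792 rows in the log above, vs.
    # the 7 expected items) trips this assertion.
    assert len(mm_embeddings) == expected_num_items, (
        "Expected number of multimodal embeddings to match number of input "
        f"items: {expected_num_items}, but got len(mm_embeddings)="
        f"{len(mm_embeddings)} instead. This is most likely due to the "
        "incorrect implementation of the model's `get_multimodal_embeddings` "
        "method."
    )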

@mergify bot commented Sep 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Sep 1, 2025
@pratapyash (Contributor)

Any progress on LoRA support?

@NickLucche (Collaborator, Author)

@pratapyash Apologies for the delay, I was off for the last few days.
Any chance you can share an adapter that reproduces the compilation issue, so I can look into it when I have some bandwidth?

@pratapyash (Contributor)

I have uploaded an adapter for gemma3n-E4B-it to this publicly available HuggingFace repo: yashpratap/gemma3n-e4b-it-generic-adapter
URL: https://huggingface.co/yashpratap/gemma3n-e4b-it-generic-adapter

Hope this helps you reproduce the issue, @NickLucche.

@maot-github

Hello, is there any update on the progress of this PR?

@NickLucche (Collaborator, Author)

Found the cause of issue #23970, but it requires global changes (not gemma3n-specific).

@jeejeelee I would appreciate your review here as you have a lot more context.

Also, I am now getting

[rank0]: ValueError: While loading /dev-network-share/users/engine/hub_cache/models--yashpratap--gemma3n-e4b-it-generic-adapter/snapshots/a64ad70475441c76bd3e33d2bd6bcf9109349630, expected target modules in ['1', 'modality_router', 'up_proj', 'embedding_projection', 'per_layer_input_gate', 'gate_proj', 'down_proj', 'prediction_coefs', 'o_proj', '0', 'linear_left', 'per_layer_model_projection', 'correction_coefs', '2', 'k_proj', 'linear_right', 'per_layer_projection', 'q_proj', 'v_proj'] but received ['audio_tower.conformer.0.attention.post', 'audio_tower.conformer.0.attention.post', 'audio_tower.conformer.0.lconv1d.linear_end', 'audio_tower.conformer.0.lconv1d.linear_end', 'audio_tower.conformer.0.lconv1d.linear_start', 'audio_tower.conformer.0.lconv1d.linear_start', 'audio_tower.conformer.1.attention.post', 'audio_tower.conformer.1.attention.post', 'audio_tower.conformer.1.lconv1d.linear_end', 'audio_tower.conformer.1.lconv1d.linear_end', 'audio_tower.conformer.1.lconv1d.linear_start', 'audio_tower.conformer.1.lconv1d.linear_start', 'audio_tower.conformer.10.attention.post', 'audio_tower.conformer.10.attention.post', 'audio_tower.conformer.10.lconv1d.linear_end', 'audio_tower.conformer.10.lconv1d.linear_end', 'audio_tower.conformer.10.lconv1d.linear_start', 'audio_tower.conformer.10.lconv1d.linear_start', 'audio_tower.conformer.11.attention.post', 'audio_tower.conformer.11.attention.post', 'audio_tower.conformer.11.lconv1d.linear_end', 'audio_tower.conformer.11.lconv1d.linear_end', 'audio_tower.conformer.11.lconv1d.linear_start', 'audio_tower.conformer.11.lconv1d.linear_start', 'audio_tower.conformer.2.attention.post', 'audio_tower.conformer.2.attention.post', 'audio_tower.conformer.2.lconv1d.linear_end', 'audio_tower.conformer.2.lconv1d.linear_end', 'audio_tower.conformer.2.lconv1d.linear_start', 'audio_tower.conformer.2.lconv1d.linear_start', 'audio_tower.conformer.3.attention.post', 'audio_tower.conformer.3.attention.post', 'audio_tower.conformer.3.lconv1d.linear_end', 'audio_tower.conformer.3.lconv1d.linear_end', 'audio_tower.conformer.3.lconv1d.linear_start', 'audio_tower.conformer.3.lconv1d.linear_start', 'audio_tower.conformer.4.attention.post', 'audio_tower.conformer.4.attention.post', 'audio_tower.conformer.4.lconv1d.linear_end', 'audio_tower.conformer.4.lconv1d.linear_end', 'audio_tower.conformer.4.lconv1d.linear_start', 'audio_tower.conformer.4.lconv1d.linear_start', 'audio_tower.conformer.5.attention.post', 'audio_tower.conformer.5.attention.post', 'audio_tower.conformer.5.lconv1d.linear_end', 'audio_tower.conformer.5.lconv1d.linear_end', 'audio_tower.conformer.5.lconv1d.linear_start', 'audio_tower.conformer.5.lconv1d.linear_start', 'audio_tower.conformer.6.attention.post', 'audio_tower.conformer.6.attention.post', 'audio_tower.conformer.6.lconv1d.linear_end', 'audio_tower.conformer.6.lconv1d.linear_end', 'audio_tower.conformer.6.lconv1d.linear_start', 'audio_tower.conformer.6.lconv1d.linear_start', 'audio_tower.conformer.7.attention.post', 'audio_tower.conformer.7.attention.post', 'audio_tower.conformer.7.lconv1d.linear_end', 'audio_tower.conformer.7.lconv1d.linear_end', 'audio_tower.conformer.7.lconv1d.linear_start', 'audio_tower.conformer.7.lconv1d.linear_start', 'audio_tower.conformer.8.attention.post', 'audio_tower.conformer.8.attention.post', 'audio_tower.conformer.8.lconv1d.linear_end', 'audio_tower.conformer.8.lconv1d.linear_end', 'audio_tower.conformer.8.lconv1d.linear_start', 'audio_tower.conformer.8.lconv1d.linear_start', 'audio_tower.conformer.9.attention.post', 
'audio_tower.conformer.9.attention.post', 'audio_tower.conformer.9.lconv1d.linear_end', 'audio_tower.conformer.9.lconv1d.linear_end', 'audio_tower.conformer.9.lconv1d.linear_start', 'audio_tower.conformer.9.lconv1d.linear_start']. Please verify that the loaded LoRA module is correct

when adding LoRAs, but I think we should be able to filter out the tower weights easily.
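
A rough illustration of what filtering the tower weights could look like (a hypothetical helper, not the change made in this PR; PEFT-style tensor names are assumed):

def filter_tower_lora_weights(adapter_tensors: dict) -> dict:
    """Keep only language-model LoRA tensors, dropping audio/vision tower ones.

    `adapter_tensors` maps PEFT-style names such as
    'base_model.model.audio_tower.conformer.0.attention.post.lora_A.weight'
    to their tensors.
    """
    skipped_prefixes = ("audio_tower.", "vision_tower.")
    kept = {}
    for name, tensor in adapter_tensors.items():
        # Strip the PEFT wrapper prefix before checking the module path.
        module_path = name.removeprefix("base_model.model.")
        if module_path.startswith(skipped_prefixes):
            continue  # encoder-tower module: no LoRA support here, drop it
        kept[name] = tensor
    return kept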

@pratapyash (Contributor)

Is torch compilation level=3 and piecewise cudagraph working normally with LoRA?

Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
@mergify bot commented Oct 8, 2025

Documentation preview: https://vllm--24003.org.readthedocs.build/en/24003/

@varun-sundar-rabindranath (Contributor)

Hi @pratapyash - I tried running,

python -m vllm.entrypoints.openai.api_server \
  --model google/gemma-3n-E2B-it \
  --gpu-memory-utilization 0.85 \
  --tensor-parallel-size 1 \
  --trust-remote-code \
  --enable-lora \
  --max-lora-rank 32 \
  --lora-modules "lora0=yashpratap/gemma3n-e4b-it-generic-adapter" \
  --served-model-name google/gemma-3n-E2B-it \
  --compilation-config '{"level": 3, "cudagraph_mode": "PIECEWISE"}'

I hit the error:

[rank0]: ValueError: While loading /dev-network-share/users/engine/hub_cache/models--yashpratap--gemma3n-e4b-it-generic-adapter/snapshots/a64ad70475441c76bd3e33d2bd6bcf9109349630, expected target modules in ['1', 'modality_router', 'up_proj', 'embedding_projection', 'per_layer_input_gate', 'gate_proj', 'down_proj', 'prediction_coefs', 'o_proj', '0', 'linear_left', 'per_layer_model_projection', 'correction_coefs', '2', 'k_proj', 'linear_right', 'per_layer_projection', 'q_proj', 'v_proj'] but received ['audio_tower.conformer.0.attention.post', 'audio_tower.conformer.0.attention.post', 'audio_tower.conformer.0.lconv1d.linear_end', 'audio_tower.conformer.0.lconv1d.linear_end', 'audio_tower.conformer.0.lconv1d.linear_start', 'audio_tower.conformer.0.lconv1d.linear_start', 'audio_tower.conformer.1.attention.post', 'audio_tower.conformer.1.attention.post', 'audio_tower.conformer.1.lconv1d.linear_end', 'audio_tower.conformer.1.lconv1d.linear_end', 'audio_tower.conformer.1.lconv1d.linear_start', 'audio_tower.conformer.1.lconv1d.linear_start', 
 ...
'audio_tower.conformer.4.lconv1d.linear_end', 'audio_tower.conformer.4.lconv1d.linear_end', 'audio_tower.conformer.4.lconv1d.linear_start', 'audio_tower.conformer.4.lconv1d.linear_start', 'audio_tower.conformer.5.attention.post', 'audio_tower.conformer.5.attention.post', 'audio_tower.conformer.5.lconv1d.linear_end', 'audio_tower.conformer.5.lconv1d.linear_end', 'audio_tower.conformer.5.lconv1d.linear_start', 'audio_tower.conformer.5.lconv1d.linear_start', 'audio_tower.conformer.6.attention.post', 'audio_tower.conformer.6.attention.post', 'audio_tower.conformer.6.lconv1d.linear_end', 'audio_tower.conformer.6.lconv1d.linear_end', 'audio_tower.conformer.6.lconv1d.linear_start', 'audio_tower.conformer.6.lconv]. Please verify that the loaded LoRA module is correct

This is because, when loading LoRA adapters, we specifically check that we load only the modules that vLLM supports, and audio_tower isn't supported. I have a PR, NickLucche#3, that could fix this. However, I don't think vLLM should be doing what I have on that PR (I wanted to see if it'd work as a POC). I believe the LoRA adapter should contain only the language_model adapters so vLLM can ensure correctness. Maybe I am missing a valid use case / convenience for having an audio/vision adapter and not using it.
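
Roughly, the validation compares the last component of each adapter target module against the leaf module names the model supports, which is why 'audio_tower.conformer.0.attention.post' is rejected ('post' is not in the supported set). A sketch of the idea, not the actual vLLM code:

def check_target_modules(adapter_targets: list[str], supported_leaves: set[str]) -> None:
    # Match only the last path component against the supported leaf names.
    unexpected = sorted(
        t for t in adapter_targets if t.rsplit(".", 1)[-1] not in supported_leaves
    )
    if unexpected:
        raise ValueError(
            f"expected target modules in {sorted(supported_leaves)} "
            f"but received {unexpected}. "
            "Please verify that the loaded LoRA module is correct"
        )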

With that PR, I still couldn't get the server to come up successfully. I think yashpratap/gemma3n-e4b-it-generic-adapter is incorrectly configured: the "intermediate" size of gate_proj and up_proj in the base model is 8192, but in the LoRA adapter it is 16384.
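
One way to spot that kind of mismatch is to inspect the adapter's saved tensors directly. A sketch using the safetensors API; the file name follows the usual PEFT layout and the 8192 figure comes from the comment above, so treat both as assumptions:

from safetensors import safe_open

# For gate_proj/up_proj, lora_B maps the LoRA rank up to the MLP intermediate
# size, so its output dimension should match the base model's.
EXPECTED_INTERMEDIATE_SIZE = 8192

with safe_open("adapter_model.safetensors", framework="pt") as f:
    for key in f.keys():
        if ("gate_proj" in key or "up_proj" in key) and "lora_B" in key:
            out_dim = f.get_slice(key).get_shape()[0]
            if out_dim != EXPECTED_INTERMEDIATE_SIZE:
                print(f"{key}: output dim {out_dim} != {EXPECTED_INTERMEDIATE_SIZE}")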

Can you test the PR with a LoRA model that contains just the language-model adapters? If that passes, I think we should land this PR, and we can consider NickLucche#3 separately and decide whether we want to land it.

cc @jeejeelee

@jeejeelee (Collaborator)

@NickLucche Sorry for missing this PR due to the holiday :bowtie:

@NickLucche (Collaborator, Author)

Looking for some alternative adapter for testing cc @pratapyash

@pratapyash (Contributor) commented Oct 10, 2025

> Looking for some alternative adapter for testing cc @pratapyash

I had used this Unsloth notebook; you can run training for a single step to get adapter weights for testing: Unsloth Notebook

Signed-off-by: NickLucche <[email protected]>
@NickLucche (Collaborator, Author)

Loading different adapters for testing, as suggested by @varun-sundar-rabindranath, appears to work fine (text-only, --max-lora-rank 32):

  • mohamed-stifi/final-lora-nia-gen-darija-gemma-3n-e2b-chat
  • sharmax-vikas/gemma-3n-E2B-donald-trump-2ep-LORA

Let me know what you think @jeejeelee
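
For anyone who wants to repeat the text-only check above offline, a minimal sketch using vLLM's LLM/LoRARequest API (the adapter repo may need to be downloaded to a local path first):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Text-only LoRA smoke test with one of the adapters listed above.
llm = LLM(
    model="google/gemma-3n-E2B-it",
    enable_lora=True,
    max_lora_rank=32,
)
outputs = llm.generate(
    ["Write a one-line greeting."],
    SamplingParams(max_tokens=32),
    # The path may need to point at a local snapshot of the HF repo.
    lora_request=LoRARequest(
        "gemma3n_lora", 1, "sharmax-vikas/gemma-3n-E2B-donald-trump-2ep-LORA"
    ),
)
print(outputs[0].outputs[0].text)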

@mergify bot commented Oct 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Oct 14, 2025
@maot-github

Hello, I'm happy to see there has been progress on this! Is there any update?

@NickLucche (Collaborator, Author)

Just waiting on @jeejeelee's review here, as per my last update.
cc @lucianommartins, who might be interested in this.

