Testing adaptation of the DINOv2/3 encoders for vision tasks with Low-Rank Adaptation (LoRA)


[2025-09-09] Added further exploration of the DINOv3 patch embedding in this post-training repository.
[2025-08-27] Added DINOv3 weights to compare with DINOv2 experiments.
[2025-08-25] Added the ability to finetune DINOv3 encoders!

Finetuning DINOv2, DINOv3 with LoRA for Image Segmentation

This repository explores finetuning DINOv3 (Siméoni et al., 2025) or DINOv2 (Oquab et al., 2024) encoder weights using Low-Rank Adaptation (LoRA) (Hu et al., 2021) and a simple 1x1 convolution decoder. LoRA makes it easier to finetune to new tasks without adjusting the original encoder weights: it adds a small set of trainable low-rank weights alongside each encoder block. The DINOv2 and DINOv3 encoder weights are learned by self-supervised learning and accurately capture the natural image domain. For example, by applying PCA to the encoder outputs, we can obtain a coarse segmentation of the objects in an image, with semantically similar objects colored the same.
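As a rough sketch of the LoRA idea (plain NumPy, not the repository's implementation; the sizes are illustrative, not the repo defaults), a LoRA layer keeps the frozen pretrained weight W and learns only a low-rank update B·A scaled by alpha/r:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 64, 64, 4, 8  # illustrative sizes, not the repo defaults

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, initialized small
B = np.zeros((d_out, r))                   # trainable, zero-init so the update starts at 0

def lora_forward(x):
    """y = W x + (alpha / r) * B (A x) — only A and B are trained."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zero-initialized, the LoRA branch contributes nothing at the start,
# so the adapted layer reproduces the frozen layer exactly:
assert np.allclose(lora_forward(x), W @ x)
```

Because only A and B are updated, the number of trainable parameters per layer is r * (d_in + d_out) instead of d_in * d_out.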

Check out the Explanation.ipynb notebook for a more detailed walkthrough of the code and ideas behind it.

DINOv3. Less noise is visible when comparing the PCA outputs from DINOv3 with those of the previous DINOv2.

Previously, DINOv2 could only produce high-resolution PCA videos with FeatUp, but with DINOv3 we can scale to high-resolution videos without it. See Embedding_visualization.ipynb.
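The PCA visualization itself is simple: project the encoder's patch tokens onto their top three principal components and map them to RGB. A minimal NumPy sketch, using random features as a stand-in for real patch tokens (with a real model these would come from the DINO encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an (h * w, c) grid of encoder patch tokens.
h, w, c = 22, 22, 384
tokens = rng.standard_normal((h * w, c))

# PCA via SVD on mean-centered features: project onto the top 3 components.
centered = tokens - tokens.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ vt[:3].T  # (h * w, 3)

# Min-max normalize each component to [0, 1] and reshape to an RGB image.
proj = (proj - proj.min(axis=0)) / (proj.max(axis=0) - proj.min(axis=0))
rgb = proj.reshape(h, w, 3)
assert rgb.shape == (22, 22, 3)
```

On real patch tokens, semantically similar regions end up close in PCA space and therefore share similar colors, which is what produces the coarse segmentations shown above.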

output_dinov3.mp4

Setup

Install the packages using the requirements.txt file.

# using conda
conda create --name dino python=3.11
conda activate dino
# Install the package so the dino_finetune imports resolve
pip install -e .

FeatUp is an optional dependency if you want to investigate the encoder features at higher resolution. I recreated methods to process videos and images in the notebook Embedding_visualization.ipynb. To run it yourself you need to install the FeatUp repository, and because it uses a custom CUDA kernel, you need to make sure all the CUDA environment variables are configured properly.

# For CUDA_HOME/nvcc, make sure you install the cudatoolkit-dev tools
conda install -c conda-forge cudatoolkit-dev -y
# Now you should be able to run:
nvcc -V
# So you can set the CUDA_HOME path
export CUDA_HOME=$CONDA_PREFIX
# For the LD_LIBRARY_PATH, install cudnn
conda install -c conda-forge cudnn
# And set the variable
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib

The section below explains the flags used in main.py to finetune on different datasets.

Usage

An example to run finetuning on the VOC dataset with LoRA and an FPN decoder, using either DINOv3 or DINOv2:

python main.py --exp_name base_voc --dataset voc --size base --dino_type dinov3 --img_dim 308 308 --epochs 50 --use_fpn

Flags. Some explanation of the more useful flags when running experiments:

  • --exp_name (str): The name of the experiment, used to identify the run and save results accordingly.
  • --debug (flag): A boolean flag indicating whether to debug the main.py training code.
  • --dataset (str): The name of the dataset to use, either voc or ade20k.
  • --size (str): The size configuration for the backbone: small, base, large, or giant.
  • --r (int): The LoRA rank (r), which determines the number of trainable LoRA parameters. Usually a small value, e.g. 3-9.
  • --use_lora (flag): A boolean flag indicating whether to use Low-Rank Adaptation (LoRA). If this flag is present, LoRA is used.
  • --dino_type (str): The DINO version to use, either dinov2 or dinov3.
  • --use_fpn (flag): A boolean flag indicating whether to use the FPN decoder.
  • --lora_weights (str): Path to the file to load the LoRA weights and decoder head from.
  • --img_dim (tuple of int): The dimensions of the input images (height width), specified as two integers, e.g. 308 308.
  • --epochs (int): The number of training epochs, i.e. how many passes the model makes over the training dataset. Example: 50.

There are a few more training parameters, such as the learning rate and batch size.
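The flags above can be sketched as a reduced argparse parser. This is an illustrative reconstruction, not the actual parser in main.py (which defines more options, and whose defaults may differ):

```python
import argparse

# Reduced, illustrative parser mirroring the flags described above.
parser = argparse.ArgumentParser()
parser.add_argument("--exp_name", type=str, required=True)
parser.add_argument("--debug", action="store_true")
parser.add_argument("--dataset", type=str, choices=["voc", "ade20k"], default="voc")
parser.add_argument("--size", type=str, choices=["small", "base", "large", "giant"], default="base")
parser.add_argument("--r", type=int, default=4)
parser.add_argument("--use_lora", action="store_true")
parser.add_argument("--dino_type", type=str, choices=["dinov2", "dinov3"], default="dinov3")
parser.add_argument("--use_fpn", action="store_true")
parser.add_argument("--lora_weights", type=str, default=None)
parser.add_argument("--img_dim", type=int, nargs=2, default=[308, 308])
parser.add_argument("--epochs", type=int, default=50)

# Parse the example command line from the Usage section.
args = parser.parse_args(
    "--exp_name base_voc --dataset voc --size base --dino_type dinov3 "
    "--img_dim 308 308 --epochs 50 --use_fpn".split()
)
print(args.img_dim)  # [308, 308]
```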

Results

Pascal VOC
I achieve a validation mean IoU of approximately 71.8% using LoRA and a 1x1 convolution decoder with DINOv3 ViT-L weights. When applying ImageNet-C corruptions (Hendrycks & Dietterich, 2019) to test robustness on Pascal VOC, the validation mean IoU drops to 65.7% at corruption severity level 5 (the maximum). The corrupted evaluation does fluctuate, by an estimated 2-5% depending on the type of finetuning; this also holds for the ADE20k dataset. The decoder-only and LoRA + 1x1 convolution settings fluctuate less than the FPN decoder. The qualitative performance of DINOv2 with LoRA and a 1x1 decoder is illustrated in the figure below. Based on their qualitative and quantitative performance, these pre-trained weights handle image corruption effectively.
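Mean IoU here is the standard per-class intersection-over-union averaged over classes. A minimal NumPy sketch of the metric (not the repository's evaluation code; it averages only over classes present in either prediction or target):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Per-class IoU averaged over classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
# class 0: inter 1, union 2 -> 0.5; class 1: inter 2, union 3 -> 2/3
print(round(mean_iou(pred, target, num_classes=2), 3))  # 0.583
```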

You can load the pre-trained weights using the --lora_weights flag or the load_parameters function call. Registers here mean that extra global context tokens are learned; see Darcet et al. (2024). All models are finetuned for 100 epochs.

I also ran DINOv3 experiments at sizes ViT-B and ViT-H, obtaining 71.5% and 73.3% mIoU respectively when finetuning with LoRA and a linear head. The base model, while performing similarly, is less robust to the corruptions.

| Finetuned components | Pre-training model | # of params | With registers | Pascal VOC Validation mIoU | Pascal VOC-C level 5 Validation mIoU | Directory |
| --- | --- | --- | --- | --- | --- | --- |
| 1x1 Conv decoder | DINOv2 ViT-L/14 | 300 M | | 49.2% | 40.0% | output/dinov2/large_voc_no_lora.pt |
| LoRA + 1x1 Conv decoder | DINOv3 ViT-L/16 | 300 M | | 71.8% | 65.4% | output/dinov3/large_base_voc_lora.pt |
| LoRA + 1x1 Conv decoder | DINOv2 ViT-L/14 | 300 M | | 67.7% | 57.3% | output/dinov2/large_base_voc_lora.pt |
| LoRA + FPN decoder | DINOv2 ViT-L/14 | 300 M | | 54.9% | 46.7% | output/dinov2/large_voc_fpn.pt |

ADE20k
I achieve a validation mean IoU of approximately 40.0% using LoRA and a 1x1 convolution decoder with DINOv3 ViT-L weights. With ADE20k-C (corruption severity level 5) the performance drops to 33.3%. A qualitative example of the DINOv2 LoRA + 1x1 decoder is illustrated in the figure below.

| Finetuned components | Pre-training model | # of params | With registers | ADE20k Validation mIoU | ADE20k-C level 5 Validation mIoU | Directory |
| --- | --- | --- | --- | --- | --- | --- |
| 1x1 Conv decoder | DINOv2 ViT-L/14 | 300 M | | 31.3% | 26.8% | output/dinov2/large_ade20k_no_lora.pt |
| LoRA + 1x1 Conv decoder | DINOv3 ViT-L/16 | 300 M | | 40.0% | 33.3% | output/dinov3/large_ade20k_lora.pt |
| LoRA + 1x1 Conv decoder | DINOv2 ViT-L/14 | 300 M | | 39.0% | 30.1% | output/dinov2/large_ade20k_lora.pt |
| LoRA + FPN decoder | DINOv2 ViT-L/14 | 300 M | | 36.9% | 28.9% | output/dinov2/large_ade20k_fpn.pt |

Citing

If you reference or use the codebase in your research, please cite:

@misc{2024dinov2_lora_seg,
    title={Finetuning DINOv2 and DINOv3 with LoRA for Image Segmentation},
    author = {Van Gastel, Rob},
    year={2024}
}

References

Siméoni, O., Vo, H. V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., … Bojanowski, P. (2025). DINOv3 (No. arXiv:2508.10104). arXiv. https://doi.org/10.48550/arXiv.2508.10104

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., … Bojanowski, P. (2024). DINOv2: Learning Robust Visual Features without Supervision (arXiv:2304.07193). arXiv. http://arxiv.org/abs/2304.07193

Darcet, T., Oquab, M., Mairal, J., & Bojanowski, P. (2024). Vision Transformers Need Registers (arXiv:2309.16588). arXiv. https://doi.org/10.48550/arXiv.2309.16588

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685). arXiv. http://arxiv.org/abs/2106.09685

Hendrycks, D., & Dietterich, T. G. (2019). Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations (arXiv:1807.01697). arXiv. https://doi.org/10.48550/arXiv.1807.01697
