[2025-09-09] Added further exploration of the DINOv3 patch embedding in this post-training repository.
[2025-08-27] Added DINOv3 weights to compare with DINOv2 experiments.
[2025-08-25] Added the ability to finetune DINOv3 encoders!
This repository explores finetuning DINOv3 (Siméoni et al., 2025) or DINOv2 (Oquab et al., 2024) encoder weights using Low-Rank Adaptation (LoRA) (Hu et al., 2021) and a simple 1x1 convolution decoder. LoRA makes it easier to finetune to new tasks without adjusting the original encoder weights, by adding a small set of trainable weights alongside each encoder block. The DINOv2 and DINOv3 encoder weights are learned by self-supervised learning and accurately capture the natural image domain. For example, by applying PCA to the outputs of the encoders, we get a coarse segmentation of the objects in an image, with semantically similar objects shown in similar colors.
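To make the LoRA idea concrete, here is a conceptual sketch of a LoRA-adapted linear layer in plain numpy. This is not the repository's implementation (which wraps the ViT attention layers in PyTorch); it only illustrates the low-rank update `(alpha / r) * B @ A` added on top of a frozen weight `W`:

```python
import numpy as np

class LoRALinear:
    """Conceptual sketch of a LoRA-adapted linear layer (Hu et al., 2021).

    The pre-trained weight W stays frozen; only the low-rank factors
    A (r x in_dim) and B (out_dim x r) would be trained, adding the
    update (alpha / r) * B @ A on top of W.
    """

    def __init__(self, W, r=4, alpha=4.0, seed=0):
        out_dim, in_dim = W.shape
        rng = np.random.default_rng(seed)
        self.W = W                                  # frozen encoder weight
        self.A = rng.normal(0, 0.01, (r, in_dim))   # small random init
        self.B = np.zeros((out_dim, r))             # zero init: no change at start
        self.scale = alpha / r

    def __call__(self, x):
        # x: (batch, in_dim) -> (batch, out_dim)
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen pre-trained layer, and training only moves the small `A`/`B` factors.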
Check out the Explanation.ipynb notebook for a more detailed walkthrough of the code and ideas behind it.
DINOv3. Less noise is visible when comparing the PCA outputs of DINOv3 with those of the previous DINOv2.

Previously, DINOv2 could only produce high-resolution PCA videos with FeatUp. With DINOv3 we can scale to high-resolution videos without FeatUp; see the Embedding_visualization.ipynb notebook.
output_dinov3.mp4
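The PCA visualizations above boil down to projecting the per-patch features onto their top three principal components and rendering those as RGB. A minimal sketch of that step (the function name and shapes are illustrative, not the notebook's exact code):

```python
import numpy as np

def pca_rgb(patch_feats: np.ndarray, h: int, w: int) -> np.ndarray:
    """Project (h*w, dim) patch features onto their top-3 principal
    components and rescale to [0, 1] so they can be shown as an RGB image."""
    centered = patch_feats - patch_feats.mean(axis=0, keepdims=True)
    # SVD of the centered features; rows of vt are the principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T                         # (h*w, 3)
    proj = (proj - proj.min(axis=0)) / (np.ptp(proj, axis=0) + 1e-8)
    return proj.reshape(h, w, 3)
```

For a ViT with 16x16 patches on a 308x308 crop, `h = w = 308 // 16` and `patch_feats` would be the encoder's patch tokens; semantically similar patches end up with similar RGB values.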
Install the packages using the requirements.txt file.
# using conda
conda create --name dino python=3.11
conda activate dino
# Install the package for dino_finetune imports,
pip install -e .
FeatUp is an optional extra dependency if you want to investigate the encoder features at higher resolution. I recreated methods to process videos and images in the notebook Embedding_visualization.ipynb. To run the notebook yourself you need to install the FeatUp repository, and because it uses a custom CUDA kernel you need to make sure all the CUDA environment variables are configured properly.
# For CUDA_HOME/nvcc, make sure you install the cudatoolkit-dev tools
conda install -c conda-forge cudatoolkit-dev -y
# Now you should be able to run,
nvcc -V
# So you can set the CUDA_HOME path
export CUDA_HOME=$CONDA_PREFIX
# For the LD_LIBRARY_PATH install cudnn
conda install -c conda-forge cudnn
# And set the variable
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib
In the section below I explain all the flags used in main.py to finetune on the different datasets.
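Before attempting to build FeatUp's kernel, it can save time to verify the environment variables above are actually visible. A small helper for that (a convenience sketch, not part of the repository):

```python
import os
import shutil

def check_cuda_env():
    """Quick sanity check that the variables needed to build FeatUp's
    custom CUDA kernel are visible in the current environment."""
    return {
        "CUDA_HOME": os.environ.get("CUDA_HOME") is not None,
        "nvcc_on_path": shutil.which("nvcc") is not None,
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH") is not None,
    }

print(check_cuda_env())
```

If any entry prints `False`, revisit the corresponding `conda install` / `export` step before installing FeatUp.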
An example of running finetuning on the VOC dataset with LoRA and an FPN decoder, using either DINOv3 or DINOv2:
python main.py --exp_name base_voc --dataset voc --size base --dino_type dinov3 --img_dim 308 308 --epochs 50 --use_fpn
Flags
Some explanation of the more useful flags to use when running experiments.
- --exp_name (str): The name of the experiment. This is used to identify the experiment and save results accordingly.
- --debug (flag): A boolean flag to indicate whether to debug the main.py training code.
- --dataset (str): The name of the dataset to use, either voc or ade20k.
- --size (str): The size configuration for the DINO backbone, one of small, base, large, or giant.
- --r (int): The LoRA rank (r) parameter, which determines the number of trainable parameters. Usually a small value like 3-9.
- --use_lora (flag): A boolean flag indicating whether to use Low-Rank Adaptation (LoRA). If this flag is present, LoRA is used.
- --dino_type (str): The DINO version to use, either dinov2 or dinov3.
- --use_fpn (flag): A boolean flag to indicate whether to use the FPN decoder.
- --lora_weights (str): Path to the file location to load the LoRA weights and decoder head from.
- --img_dim (tuple of int): The dimensions of the input images (height width). This should be specified as two integers. Example: 308 308.
- --epochs (int): The number of training epochs. This determines how many times the model will pass through the entire training dataset. Example: 50.
There are some additional training parameters not listed here, such as the learning rate and batch size.
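The flag list above maps naturally onto an argparse parser. The sketch below is inferred from the flag descriptions and the example command; the defaults are assumptions, not necessarily the exact values in main.py:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of how the documented flags could be declared; defaults are
    # guesses based on the README, not necessarily those used by main.py.
    p = argparse.ArgumentParser(description="Finetune DINOv2/DINOv3 with LoRA")
    p.add_argument("--exp_name", type=str, required=True)
    p.add_argument("--debug", action="store_true")
    p.add_argument("--dataset", type=str, choices=["voc", "ade20k"], default="voc")
    p.add_argument("--size", type=str, choices=["small", "base", "large", "giant"], default="base")
    p.add_argument("--r", type=int, default=4)
    p.add_argument("--use_lora", action="store_true")
    p.add_argument("--dino_type", type=str, choices=["dinov2", "dinov3"], default="dinov3")
    p.add_argument("--use_fpn", action="store_true")
    p.add_argument("--lora_weights", type=str, default=None)
    p.add_argument("--img_dim", type=int, nargs=2, default=[308, 308])
    p.add_argument("--epochs", type=int, default=50)
    return p
```

For example, `build_parser().parse_args(["--exp_name", "base_voc", "--img_dim", "308", "308", "--use_fpn"])` yields a namespace with `img_dim == [308, 308]` and `use_fpn == True`.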
Pascal VOC
I achieve a validation mean IoU of approximately 71.8% using LoRA and a 1x1 convolution decoder with DINOv3 ViT-L weights. When applying ImageNet-C corruptions (Hendrycks & Dietterich, 2019) to test robustness on Pascal VOC, the validation mean IoU drops to 65.7% at corruption severity level 5 (the maximum). The performance on the corrupted evaluation does fluctuate, I estimate by 2-5% depending on the type of finetuning; this also holds for the ADE20k dataset. Finetuning just the decoder, or LoRA with the 1x1 convolution decoder, fluctuates less than the FPN decoder. The qualitative performance of DINOv2 with LoRA and a 1x1 decoder is illustrated in the figure below. Based on their qualitative and quantitative performance, these pre-trained weights handle image corruption effectively.
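For reference, the mean IoU numbers reported in the tables below follow the standard per-class intersection-over-union, averaged over the classes present. A minimal sketch of the metric (not the repository's exact evaluation code):

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union over classes, computed from
    predicted and ground-truth label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:          # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))
```

In practice the per-class intersections and unions are accumulated over the whole validation set before dividing, rather than averaged per image.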
You can load the pre-trained weights using the --lora_weights flag or the load_parameters function call. Registers here mean that extra global context tokens are learned; see Darcet et al. (2024). All models are finetuned for 100 epochs.
I also ran experiments with DINOv3 at sizes ViT-B and ViT-H, obtaining 71.5% and 73.3% mIoU respectively when finetuning with LoRA and a linear head. The base model, while performing similarly, is less robust to the corruptions.
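ImageNet-C applies a fixed set of corruption types at five severity levels. As a flavor of what severity level 5 means, here is a sketch of one such corruption, additive Gaussian noise; the per-severity sigma values are placeholders in the spirit of Hendrycks & Dietterich (2019), not their exact constants:

```python
import numpy as np

def gaussian_noise(img: np.ndarray, severity: int = 5) -> np.ndarray:
    """One ImageNet-C style corruption: additive Gaussian noise on an
    image with values in [0, 1]. Sigma values per severity are
    illustrative placeholders, not the official ImageNet-C constants."""
    sigma = [0.04, 0.08, 0.12, 0.18, 0.26][severity - 1]
    noisy = img + np.random.normal(scale=sigma, size=img.shape)
    return np.clip(noisy, 0.0, 1.0)
```

The benchmark averages performance over many such corruptions (noise, blur, weather, digital), which is why the corrupted mIoU figures below sit well under the clean ones.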
| finetuned components | pre-training | model | # of params | with registers | Pascal VOC Validation mIoU | Pascal VOC-C level 5 Validation mIoU | Directory |
|---|---|---|---|---|---|---|---|
| 1x1 Conv decoder | DINOv2 | ViT-L/14 | 300 M | ✅ | 49.2% | 40.0% | output/dinov2/large_voc_no_lora.pt |
| LoRA + 1x1 Conv decoder | DINOv3 | ViT-L/16 | 300 M | ✅ | 71.8% | 65.4% | output/dinov3/large_base_voc_lora.pt |
| LoRA + 1x1 Conv decoder | DINOv2 | ViT-L/14 | 300 M | ✅ | 67.7% | 57.3% | output/dinov2/large_base_voc_lora.pt |
| LoRA + FPN decoder | DINOv2 | ViT-L/14 | 300 M | ✅ | 54.9% | 46.7% | output/dinov2/large_voc_fpn.pt |
ADE20k
I achieve a validation mean IoU of approximately 40.0% using LoRA and a 1x1 convolution decoder with DINOv3 ViT-L weights. With ADE20k-C (corruption severity level 5) the performance drops to 33.3%. A qualitative performance example of the DINOv2 LoRA + 1x1 decoder is illustrated in the figure below.
| finetuned components | pre-training | model | # of params | with registers | ADE20k Validation mIoU | ADE20k-C level 5 Validation mIoU | Directory |
|---|---|---|---|---|---|---|---|
| 1x1 Conv decoder | DINOv2 | ViT-L/14 | 300 M | ✅ | 31.3% | 26.8% | output/dinov2/large_ade20k_no_lora.pt |
| LoRA + 1x1 Conv decoder | DINOv3 | ViT-L/16 | 300 M | ✅ | 40.0% | 33.3% | output/dinov3/large_ade20k_lora.pt |
| LoRA + 1x1 Conv decoder | DINOv2 | ViT-L/14 | 300 M | ✅ | 39.0% | 30.1% | output/dinov2/large_ade20k_lora.pt |
| LoRA + FPN decoder | DINOv2 | ViT-L/14 | 300 M | ✅ | 36.9% | 28.9% | output/dinov2/large_ade20k_fpn.pt |
If you reference or use the codebase in your research, please cite:
@misc{2024dinov2_lora_seg,
title={Finetuning DINOv2 and DINOv3 with LoRA for Image Segmentation},
author = {Van Gastel, Rob},
year={2024}
}
Siméoni, O., Vo, H. V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., … Bojanowski, P. (2025). DINOv3 (No. arXiv:2508.10104). arXiv. https://doi.org/10.48550/arXiv.2508.10104
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., … Bojanowski, P. (2024). DINOv2: Learning Robust Visual Features without Supervision (arXiv:2304.07193). arXiv. http://arxiv.org/abs/2304.07193
Darcet, T., Oquab, M., Mairal, J., & Bojanowski, P. (2024). Vision Transformers Need Registers (arXiv:2309.16588). arXiv. https://doi.org/10.48550/arXiv.2309.16588
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685). arXiv. http://arxiv.org/abs/2106.09685
Hendrycks, D., & Dietterich, T. G. (2019). Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations (arXiv:1807.01697). arXiv. https://doi.org/10.48550/arXiv.1807.01697

