This log records the major milestones of the training process, as well as failed attempts to improve it. The training of the transformer and the segmentation model itself is documented only briefly.
training/train.py segnet
This will download the dataset and start the training. Training takes about 1 hour.
Prerequisites:
- Linux
- CUDA
- rsvg-convert, e.g. via
sudo apt install librsvg2-bin
Download the datasets and convert them to the format required for training:
- training/datasets/convert_primus.py
- training/datasets/convert_grandstaff.py
- training/datasets/convert_lieder.py
This will also download and run MuseScore as an AppImage. If this fails, check your setup to ensure that you can run datasets/MuseScore.
Some checks:
- training/transformer/training_vocabulary.py: Check that the vocabulary is complete
- training/validate_music_xml_conversion.py: Can be used to test changes in MusicXML parsing and generation
- training/validate_music_xml_conversion.py: Visualize datasets; takes one of datasets/*/index.txt as an argument
Finally, start the training itself with: training/train.py transformer.
This takes around 2–4 days.
The homr pipeline is a two-step system consisting of staff detection (segnet) followed by music transcription (TrOmr transformer).
Accordingly, we report results from two different validation setups with different purposes and scopes.
The Transcription Smoke Test evaluates only the transformer-based transcription component on a small, fixed dataset.
It is intended as a fast indicator of transcription model quality and is mainly used to detect regressions or larger performance changes during development.
Implementation: symbol_error_rate_torch.py
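The SER reported throughout this log is a symbol error rate over token sequences; as a rough illustration (not the actual logic in symbol_error_rate_torch.py, and the token names are made up), it can be computed as the edit distance between the predicted and reference token sequences, normalized by the reference length:

```python
def levenshtein(ref, hyp):
    """Edit distance between two token sequences (insertions,
    deletions, substitutions), using a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (r != h))) # substitution
        prev = cur
    return prev[-1]

def symbol_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

# One substitution in a four-token reference -> SER 25%
print(symbol_error_rate(["clef-G2", "note-C4", "note-E4", "barline"],
                        ["clef-G2", "note-C4", "note-F4", "barline"]))  # 0.25
```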
The System Level Validation evaluates the full homr pipeline, including staff detection and transcription, on a dedicated test dataset.
This validation provides a more representative indication of overall system performance and is used to compare training runs and pipeline changes.
Note: The test dataset cannot be published due to copyright restrictions. In addition, the dataset is subject to change over time, which may affect the comparability of results across different runs.
Implementation: rate_validation_result.py
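The "diffs" figures below count units that differ between the pipeline output and the reference. As a hedged sketch of such a metric (the actual computation lives in rate_validation_result.py and may differ), a line-level diff count can be built on Python's difflib:

```python
import difflib

def count_diffs(expected: str, actual: str) -> int:
    """Count lines that appear only on one side of a diff between
    a reference encoding and the transcribed output."""
    diff = difflib.ndiff(expected.splitlines(), actual.splitlines())
    # '- ' lines exist only in the reference, '+ ' only in the output;
    # '? ' hint lines are ignored.
    return sum(1 for line in diff if line.startswith(("- ", "+ ")))

expected = "clef-G2\nnote-C4\nnote-E4\nbarline"
actual = "clef-G2\nnote-C4\nnote-F4\nbarline"
print(count_diffs(expected, actual))  # one replaced line counts twice -> 2
```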
Commit: 6d996c3d118c1e183f8412832383168e52630ce8
Day: 17 Feb 2026
Transformer Smoke Test: SER 6%
System Level: 10.2 diffs, SER: 13.6%
convnextv2_base, #59
Commit: bb0ced2ff0cacdbd8ee33db4533a04c9e77f0ca8
Day: 17 Feb 2026
Transformer Smoke Test: SER 18%
System Level: 14.4 diffs, SER: 14.5%
convnextv2_tiny, #59
Commit: e10346542968cc71fbcce0c0696f3ac963f11ae1
Day: 17 Feb 2026
Transformer Smoke Test: SER 6%
System Level: 5.3 diffs, SER: 3.4%
Scheduled Sampling, #59
Commit: a78209527e2b8a4fb866fba9b2ef8540f4b8dad9
Day: 14 Feb 2026
Transformer Smoke Test: SER 10%
System Level: 16.0 diffs, SER: 9.3%
Harder augmentation, #59
Commit: 290d4e79aa377681523ca676b984b9cee3eb16ce
Day: 13 Feb 2026
Transformer Smoke Test: SER 13%
System Level: 42.8 diffs, SER: 32.8%
Backtracking, removed 90deg rotation and sinusoidal bias, #59
Commit: fd3d66d7d989003ec4cadd1d594ca2e820ece941
Day: 12 Feb 2026
Transformer Smoke Test: SER 12%
System Level: 18.1 diffs, SER: 22.4%
Improved pitch accuracy, #59
Commit: 6f72a0bc2577907503e7ec84ac9850a5a972ded0
Day: 4 Feb 2026
Transformer Smoke Test: SER 15%
System Level: 25.9 diffs, SER: 15.5%
ConvNext, #59
Commit: 87d30ed79a81b4f07a38a8f6419334c59633709a
Day: 30 Jan 2026
Transformer Smoke Test: SER 14% (the SER reported for the previous runs was too large due to an unreasonably large temperature setting during the smoke test)
System Level: 6.9 diffs, SER: 5.7%
Updated segnet model for staff detection.
Commit: 0daf75fea21e6ea6a865405e03a4bc7e73e9aa14
Day: 4 Jan 2026
Transformer Smoke Test: SER 26% (higher, due to an error in the smoke test)
System Level: 7.4 diffs
After fixing an issue with accidentals during the conversion of the PrIMuS dataset, Run 242 (a00be6debbedf617acdf39558c89ba6113c06af3) was used as the basis of a 15-epoch run which only trained the lift decoder.
Commit: a00be6debbedf617acdf39558c89ba6113c06af3
Day: 9 Dec 2025
Transformer Smoke Test: SER 23% (some errors in the smoke test were fixed; the value is still elevated because some issues remained)
System Level: 7.2 diffs after fixing an error in the validation result calculation itself (8.1 before that fix)
Single-staff images now use the full resolution.
Commit: 922ad08f8895f6d9c0ae61954cd78a021ff950a7
Day: 26 Oct 2025
Transformer Smoke Test: SER 37% (higher, due to an error in the smoke test)
System Level: 9.8 diffs, 8.6 after some tweaks to the staff image preparation
Volta brackets, bf16
Commit: ea96f0150ec74388df8cb0bb78ee2c36782a00d9
Day: 01 Oct 2025
Transformer Smoke Test: SER 39% (higher, due to an error in the smoke test)
System Level: 9.8 diffs
Grandstaff support.
Some notes on other experiments which were performed on the Lieder dataset with only 15 epochs:
- Baseline
- Final eval loss: 0.6007007360458374
- SER: 112%
- State
- Final eval loss: 0.5414645671844482
- SER: 109%
- Clef unification
- Final eval loss: 1.849281907081604
- SER: 146%
- fp16 and flash attention
- Final eval loss: 1.522333025932312
- SER: 144%
- fp16 and no flash attention
- Final eval loss: 1.00400710105896
- SER: 130%
- Trains about twice as fast, so the poorer performance could be addressed with additional epochs
Commit: c50aec7de6469480cf6f547695f48aed76d8422e
Day: 05 Sep 2025
Transformer Smoke Test: SER 8%
System Level: 8.8 diffs
Added articulations and other symbols.
Commit: 4c8d68b941c647c96f82d977ac0bb59d4f2b7a8c
Day: 05 Aug 2025
Transformer Smoke Test: SER 10%
System Level: 9.1 diffs
New decoder branch just for triplets and dots. The result works mostly fine, but it is too eager to detect triplets.
Commit: 4915073f892f6ab199844b1bff0c968cdf8be03e
Day: 01 Aug 2025
Transformer Smoke Test: SER 8%
System Level: 8.0 diffs
Larger encoder.
Commit: 4915073f892f6ab199844b1bff0c968cdf8be03e
Day: 01 Aug 2025
Transformer Smoke Test: SER 8%
System Level: 8.3 diffs
Larger encoder.
Commit: 3f0631db15012e928ad3d4da739817f92d958979
Day: 30 Jul 2025
Transformer Smoke Test: SER 10%
System Level: 7.9 diffs
Removed cases of incorrect triplets from the dataset.
Commit: 74d500a5d94e553f24dbbd57a0e71b8566e2e554
Day: 25 Jul 2025
Transformer Smoke Test: SER 11%
System Level: 9.6 diffs
Larger encoder.
Note: This branch was rebased; the commit hash was updated to match the version that was merged into main.
Commit: a1ec2fff7d7ba562807f03badf5ed963b48649a5
Day: 27 Jul 2025
Transformer Smoke Test: SER 9%
System Level: 11.6 diffs
Increased degrees of freedom in decoder.
Commit: eb5fbfd4692b56d24d615e2fa3586903ad681132
Day: 26 Jul 2025
Transformer Smoke Test: SER 8%
System Level: 9.7 diffs
Added triplets.
Commit: 1cd1d06543e885e4d64a74d985b4725c50054c2a
Day: 11 Jul 2025
Transformer Smoke Test: SER 10%
System Level: 8.0 diffs
Transformer depth of 8.
Commit: bf39c935c9081d04dc1d97e25dcda68ebb0ca40c
Day: 10 Jul 2025
Transformer Smoke Test: SER 9%
System Level: 7.6 diffs
Transformer depth of 6.
Commit: a9dd113eb203979b6c2b21403574832da39fee76
Day: 09 Jul 2025
Transformer Smoke Test: SER 11%
System Level: 8.4 diffs
Transformer depth of 4 (as Polyphonic-TrOMR uses). Training was stopped at epoch 59 by a Windows Update.
Commit: 46ff7e18fd85d9d2026f9ed18eacf7ae0638a14c
Day: 07 Jul 2025
Transformer Smoke Test: SER 9%
System Level: 7.4 diffs
Updated staff detection:
- Resnet18 after 3 epochs (1240eedca553155b3c75fc9c7f643465383430a0): 7.4
- Resnet18 after 10 epochs (66dd2392759d1746cc9458c097e25aaaa1559fc5): 10.8 (overfitting?)
- Resnet34 after 3 epochs (1cd1d06543e885e4d64a74d985b4725c50054c2a): 7.3
- Resnet34 after 10 epochs (a9dd113eb203979b6c2b21403574832da39fee76): 8.3
Note: at this point the transformer depth is 8 for the decoder and 12 for the encoder.
Commit: 46ff7e18fd85d9d2026f9ed18eacf7ae0638a14c
Day: 05 Jul 2025
Transformer Smoke Test: SER 9%
System Level: 8.3 diffs
Updated dependencies.
Commit: 4a0d7991b3824f2a667a237b1370a8999cd3695e
Day: 29 Jun 2025
Transformer Smoke Test: SER 9%
System Level: 7.8 diffs
Updated dependencies.
Commit: ba12ebef4606948816a06f4a011248d07a6f06da
Date: 10 Sep 2024
Transformer Smoke Test: SER 9%
System Level: 6.4 diffs
Training runs now pick the last iteration and not the one with the lowest validation loss.
Commit: e317d1ba4452798036d2b24a20f37061b8441bae
Date: 10 Sep 2024
Transformer Smoke Test: SER 14%
System Level: 7.6 diffs
Increased model depth from 4 to 8.
Commit: 11c1eeaf5760d617f09678e276866d31253a5ace
Date: 30 Jun 2024
Transformer Smoke Test: SER 16%
System Level: 8.6 diffs
Fixed an issue with courtesy accidentals in the Lieder dataset and with splitting naturals into tokens. Removed the CPMS dataset, as it seems impossible to reliably tell whether a natural is present in an image.
Commit: 78ace9d99ff38cde0196e47ab2a04309037b1e91
Date: 28 Jun 2024
Transformer Smoke Test: SER 17%
System Level: 8.9 diffs
Fixed another issue with backups in MusicXML. The poorer validation result seems to be mainly caused by one piece where it fails to detect the naturals.
Commit: e38ad001a548ffd9be89591ce68ed732565a38ae
Date: 21 Jun 2024
Transformer Smoke Test: SER 26%
System Level: 8.1 diffs
Fixed an issue with parallel voices in the dataset. Added the Lieder dataset.
Commit: f00a627e030828844c45ecde762146db719d72aa
Date: 9 Jun 2024
SER: 44%
System Level: 8.0 diffs
Set dropout to 0, increased the number of samples and decreased the number of epochs.
Commit: 6a288bc25c99a10cdcdf19982d5df79d65c82910
Date: 16 May 2024
Transformer Smoke Test: SER 46%
Removed the negative data set and fixed the ordering of chords.
Commit: 8107bb8bdfaaeb3300477ec534b49cbf1c2a70c6
Date: 12 May 2024
Transformer Smoke Test: SER 53%
First training run within the homr repo.
These runs were performed inside the Polyphonic-TrOMR repo.
Date: 6 May 2024
Training time: ~14h (fast option)
Commit: 8f774545179f3e7bfdbd58fe1a6c55473b8d4343
System Level: 14.5 diffs
Date: 3 May 2024
Training time: ~14h (fast option)
Commit: b22104265be285b5a1d461c3fab2aa4589eb08cc
System Level: 17.9 diffs
Date: 1 May 2024
Training time: ~17h (fast option)
Commit: cf7313f0bcec82f4f7da738fbacabd56084f6604
System Level: 17.5 diffs
Date: 30 Apr 2024
Training time: ~18h (fast option)
Commit: 80896fdba4dbe4f9b2bbba3dd66377b3b0d1faa5
Enabled CustomVisionTransformer again.
Date: 29 Apr 2024
Training time: ~18h (fast option)
Commit: acbdf6dc235f393ef75158bdcf539e3b2e5b435e
System Level: 12.9 diffs
Increased alpha to 0.2.
Date: 29 Apr 2024
Training time: ~18h (fast option)
Commit: 185c235cd0979faa2c087e59e71dbba684a68fb6
System Level: 13.1 diffs
Reverted 9e2c14122607a63c25253d1c5378c706859395ab, going back to a depth of 4.
Date: 28 Apr 2024
Training time: ~18h (fast option)
Commit: 840318915929e5efe780780a543ea053b479d375
Date: 27 Apr 2024
Training time: ~18h (fast option)
Commit: f732c3abc10b5b0b3e8942f722d695eb725e3e53
System Level: 80.9 diffs
So far we used the format which TrOMR seems to use: the semantic format, but with accidentals encoded depending on how they are printed.
E.g. if the semantic format is Key D Major, Note C#, Note Cb, Note Cb, then the TrOMR encoding is Key D Major, Note C, Note Cb, Note C, because the flat is the only accidental visible in the image.
With this attempt we try to use the semantic format without any changes to the accidentals.
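The visible-accidental re-encoding described above can be sketched as follows. This is a simplified illustration, not TrOMR's actual logic: it only handles major keys with sharps, tracks alterations per letter within a single measure, and the function and token names are made up:

```python
# Letters sharpened by a key signature, in circle-of-fifths order,
# and the number of sharps per major key (enough for the D-major example).
SHARPS_ORDER = ["F", "C", "G", "D", "A", "E", "B"]
SHARP_KEYS = {"C": 0, "G": 1, "D": 2, "A": 3, "E": 4, "B": 5, "F#": 6}

def visible_tokens(key, notes):
    """Re-encode semantic notes (letter plus accidental: 'C', 'C#', 'Cb')
    so that only accidentals actually printed in the image remain."""
    sharped = set(SHARPS_ORDER[:SHARP_KEYS[key]])
    # Current alteration per letter, seeded from the key signature.
    state = {letter: "#" if letter in sharped else "" for letter in "ABCDEFG"}
    out = []
    for note in notes:
        letter, acc = note[0], note[1:]
        if acc == state[letter]:
            out.append(letter)   # alteration implied, nothing printed
        else:
            out.append(note)     # accidental printed in the image
            state[letter] = acc  # stays in effect for the measure
    return out

# The example from the log: D major with notes C#, Cb, Cb
print(visible_tokens("D", ["C#", "Cb", "Cb"]))  # -> ['C', 'Cb', 'C']
```

Only the second note carries a printed flat; the first C# is implied by the key signature, and the third note inherits the flat already in effect.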
Date: 26 Apr 2024
Training time: ~19h (fast option)
Commit: 9e2c14122607a63c25253d1c5378c706859395ab
System Level: 22.3 diffs
Encoder & decoder depth was increased from 4 to 6.
Date: 25 Apr 2024
Training time: ~16h (fast option)
Commit: 75d8688719494169f4b629fc51224d4aa846eee7
Fixed an issue where the training data didn't contain any natural accidentals.
Date: 24 Apr 2024
Training time: ~24h (fast option)
Commit: b4af54249fca5bf93650c518c7220f5de98c843c
After experiments with focal loss and weight decay, we are backtracking to run 63.
Date: 23 Apr 2024
Training time: ~24h (fast option)
Commit: 6580500e71602d5c74decde2946498c8e883392e
Adding a weight to the lift/accidental tokens.
Date: 22 Apr 2024
Training time: ~17h (fast option)
Commit: 3b92eee2e56647fcb538b4ef5ef3704f12bfb2d1
Reduced weight decay.
Date: 21 Apr 2024
Training time: ~17h (fast option), aborted after epoch 16 from 25
Commit: a6b87b71b3b69d87d424f3c86500081f6146d436
It looks like a focal loss doesn't help to improve the performance of the lift detection.
Date: 11 Apr 2024
Training time: ~26h (fast option)
Commit: c360ab726df18879973e6829a1423c627a99afd5
System Level: 13.7 diffs
Increased the dataset size by introducing a negative dataset with no musical symbols, and by using the positive datasets more often with different mask values.
Date: 07 Apr 2024
Training time: ~14h (fast option)
Commit: 3fc893c0ab547fe1958adf500b0afaf0f6990f80
Changes to the conversion of the grandstaff dataset haven't been applied yet.
Date: 07 Apr 2024
Training time: ~14h (fast option)
Commit: 5ec6beaf461c034340ad0d2f832d842bef8bee75
System Level: 13.8 diffs
Changes to the conversion of the grandstaff dataset haven't been applied yet.
Date: 06 Apr 2024
Training time: ~14h (fast option)
Commit: d73d5a9d342d4d934c21409632f4e2854d14d333
System Level: 17.0 diffs
Changes to the conversion of the grandstaff dataset haven't been applied yet.
Start of dropout tests; the value ranges for the dropouts are mainly based on https://arxiv.org/pdf/2303.01500.pdf.
Date: 05 Apr 2024
Training time: ~14h (fast option)
Commit: cd445caa5337d86cf723854cb2ef9e98dd4c5b76
System Level: 18.4 diffs
We changed how we number runs and established a link between the run number and the git history.
Date: 05 Apr 2024
Training time: ~19h (fast option)
Commit: a57ee4c046842c0135adca84f06260cff8af732f
We tried InceptionResnetV2. The training run showed overfitting, and the resulting SER indicates poor results. The model is over three times larger than the ResNetV2 model and might require more work to prevent overfitting.
Date: 02 Apr 2024
Training time: ~24h (fast option)
Commit: 9ddfff8b5782473e8831ca3791d9bef99f726654
System Level: 23.4 diffs
We decreased the vocabulary size and the alpha/beta ratio in the loss function, and made changes to the grandstaff dataset. While it still performs worse than Run 0 in the manual validation, it gets closer now, and in some specific tests it even performs better than Run 0. We will have to backtrack from this point to find out which of the changes led to the improvement.
Date: 01 Apr 2024
Training time: ~48h
Commit: 516093a3f3840cb82922b4d7300d1568455277d568f85ea96fe41235a06ca8de6759f1db6b8fc39a
Date: 24 Mar 2024
Training time: ~24h (fast option)
Commit: 516093a3f3841235a06ca8de6759f1db6b8fc39a
The weights from the original paper.
System Level: 9.3 diffs