homr is Optical Music Recognition (OMR) software that transforms camera pictures of sheet music into machine-readable MusicXML. The resulting MusicXML files can be processed further with tools such as musescore.
- Python 3.11
- Poetry
- Optional: NVIDIA GPU with CUDA 12.1
- Clone the repository
- Install dependencies for:
  - GPU (requires CUDA): `poetry install --only main,gpu`
  - CPU: `poetry install --only main`
  - Development: `poetry install`
- Run the program using `poetry run homr <image>`
- The resulting MusicXML file will be saved in the same directory as the input image
- To combine the MusicXML results from multiple images, you can use relieur
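
To process several pictures in one go, the documented command can be wrapped in a small script. The following is a minimal sketch, not part of homr itself: the helper names are hypothetical, and the `.musicxml` suffix of the output file is an assumption (the text above only states that the result is saved next to the input image).

```python
import subprocess
from pathlib import Path


def homr_command(image: Path) -> list[str]:
    """Build the documented CLI call: poetry run homr <image>."""
    return ["poetry", "run", "homr", str(image)]


def expected_output(image: Path) -> Path:
    # Assumption: the result keeps the image's name and uses a .musicxml suffix.
    return image.with_suffix(".musicxml")


def convert_all(images: list[Path]) -> list[Path]:
    """Run homr once per image and return the paths where results are expected."""
    outputs = []
    for image in images:
        # check=True raises CalledProcessError if homr exits with an error.
        subprocess.run(homr_command(image), check=True)
        outputs.append(expected_output(image))
    return outputs
```

Calling `convert_all([Path("page1.jpg"), Path("page2.jpg")])` from the repository root would run homr twice; the returned paths can then be handed to relieur.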
The example below provides an overview of the current performance of the implementation. While some errors are present in the output, the overall structure remains accurate.
| Original Image | homr Result |
|---|---|
| (image not available) | (image not available) |
The homr result is obtained by processing the homr output and rendering it with musescore.
The current implementation focuses on pitch and rhythm information on the bass or treble clef, neglecting dynamics, articulation, double sharps/flats, and other musical symbols.
homr uses a two-stage pipeline: segmentation for structural analysis followed by semantic symbol recognition via transformer models.
homr employs UNet-based segmentation models (adapted from oemer) to extract structural components from the sheet music image:
- Staff lines and symbols: Detected via trained segmentation networks that identify:
  - Staff line fragments
  - Note heads
  - Stems and rests
  - Bar lines
  - Clefs and key signatures
The segmentation process generates bounding boxes for each detected element. These predictions serve as inputs for the staff detection algorithm.
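
To illustrate how a segmentation mask can be turned into bounding boxes, here is a minimal sketch using 4-connected component labeling on a binary mask. This is an assumption made for illustration: the text does not say which method homr uses, and the function name `bounding_boxes` is hypothetical.

```python
from collections import deque


def bounding_boxes(mask):
    """Return (x_min, y_min, x_max, y_max) for each 4-connected blob of 1s."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over one blob, tracking its extent.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes
```

Each box uses inclusive pixel coordinates, so a single-pixel blob yields a box whose minimum and maximum coincide.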
Using the segmentation outputs, homr constructs staffs through the following steps:
- Staff Anchor Detection: The algorithm identifies "staff anchors" (clefs and bar lines) that serve as reference points for accurate staff localization, even when symbols partially obscure staff lines.
- Unit Size Estimation: For each staff, the algorithm calculates the "unit size" (distance between staff lines). This accommodates camera perspective variations and non-uniform staff spacing.
- Staff Reconstruction: Around each anchor, five staff lines are located and the remaining staff structure is reconstructed using the estimated unit size.
- Grand Staff Merging: Braces and brackets are identified to merge related staffs, supporting:
  - Grand staffs (piano, organ)
  - Multiple voices on a single staff
  - Mixed instrument groups
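
The unit size estimation and staff reconstruction steps can be sketched as follows. This is a simplified illustration under stated assumptions, not homr's actual algorithm: it takes the median gap between detected line positions as the unit size, and it assumes the anchor's vertical center lies on the middle staff line. Both function names are hypothetical.

```python
from statistics import median


def estimate_unit_size(line_ys):
    """Estimate the distance between staff lines as the median neighbor gap."""
    ys = sorted(line_ys)
    gaps = [lower - upper for upper, lower in zip(ys, ys[1:])]
    if not gaps:
        raise ValueError("need at least two staff line positions")
    return median(gaps)


def reconstruct_staff(center_y, unit_size):
    """Place five staff lines around an anchor's vertical center.

    Assumption: center_y lies on the middle line of the staff.
    """
    return [center_y + k * unit_size for k in range(-2, 3)]


# Slightly uneven detections still give a stable unit size.
unit = estimate_unit_size([100, 110.5, 120, 130, 140])
lines = reconstruct_staff(120, unit)
```

The median is used here because a single badly detected line fragment shifts it far less than it would shift the mean.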
Each staff is dewarped (perspective-corrected) and passed through a transformer-based model (based on Polyphonic-TrOMR) that performs end-to-end symbol sequence recognition. The model outputs:
- Rhythm symbols: Note durations, rests, and tuplet information
- Pitch information: Absolute pitch values with accidentals (sharps, flats, naturals)
- Articulation marks: Accents, staccato, tenuto, and slur markers
- Performance annotations: Dynamic expressions and other musical notation
The transformer model generates these predictions in sequence, processing the dewarped staff image to understand the spatial and temporal relationships between musical symbols.
Note: The transformer output provides the sequence of symbols but does not include explicit positional information (horizontal or vertical coordinates). However, the model computes the center of attention as a byproduct of the attention mechanism, which can be used to estimate the focus point on the staff image.
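
One way to derive such a focus point is the weighted centroid of an attention map. The sketch below assumes the attention weights for one output symbol are available as a 2D grid over the staff image; the function name and the grid layout are hypothetical, and the text does not confirm that homr uses this exact formula.

```python
def center_of_attention(weights):
    """Return the (x, y) centroid of a 2D grid of non-negative weights."""
    total = sum(sum(row) for row in weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    # Weight each column index and each row index by its attention value.
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y
```

The result is in grid cells; mapping it to pixels would require scaling by the size of the image patch that each cell covers.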
The symbol sequence is converted into MusicXML format and saved to disk. The resulting file can be processed with tools like musescore or relieur (for multi-image combinations).
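
As an illustration of the conversion step, the following sketch builds a minimal single-measure MusicXML document with Python's standard library. The input format, a list of `(step, octave, type)` tuples, and the function name are assumptions made for this example and do not reflect homr's internal representation.

```python
import xml.etree.ElementTree as ET

# Durations in divisions, with one division per quarter note.
DURATIONS = {"whole": 4, "half": 2, "quarter": 1}


def notes_to_musicxml(notes):
    """Serialize (step, octave, type) tuples as one treble-clef measure."""
    score = ET.Element("score-partwise", version="4.0")
    part_list = ET.SubElement(score, "part-list")
    score_part = ET.SubElement(part_list, "score-part", id="P1")
    ET.SubElement(score_part, "part-name").text = "Music"

    part = ET.SubElement(score, "part", id="P1")
    measure = ET.SubElement(part, "measure", number="1")
    attributes = ET.SubElement(measure, "attributes")
    ET.SubElement(attributes, "divisions").text = "1"
    clef = ET.SubElement(attributes, "clef")
    ET.SubElement(clef, "sign").text = "G"
    ET.SubElement(clef, "line").text = "2"

    for step, octave, note_type in notes:
        note = ET.SubElement(measure, "note")
        pitch = ET.SubElement(note, "pitch")
        ET.SubElement(pitch, "step").text = step
        ET.SubElement(pitch, "octave").text = str(octave)
        ET.SubElement(note, "duration").text = str(DURATIONS[note_type])
        ET.SubElement(note, "type").text = note_type
    return ET.tostring(score, encoding="unicode")
```

For example, `notes_to_musicxml([("C", 4, "quarter"), ("E", 4, "half")])` returns a string that can be written to a `.musicxml` file. A full converter would also need to handle key and time signatures, rests, accidentals, and measure boundaries.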
If you use this code in your research work, please cite oemer and Polyphonic-TrOMR.
The name "homr" stands for Homer's Optical Music Recognition (OMR), leaving the interpretation of "Homer" to the user's discretion, whether referring to the ancient poet Homer or the iconic character from The Simpsons.
This project builds upon previous work, including:
- The segmentation models of oemer
- The transformer model of Polyphonic-TrOMR
- The starter template provided by Benjamin Roland
