Skip to content

Extraction of title from "section header" pages#102

Open
christianabbet wants to merge 29 commits intodevelopfrom
feat/issue-98/title-section-header
Open

Extraction of title from "section header" pages#102
christianabbet wants to merge 29 commits intodevelopfrom
feat/issue-98/title-section-header

Conversation

@christianabbet
Copy link
Collaborator

@christianabbet christianabbet commented Feb 24, 2026

Close #98

Description

In this PR we include a title estimation for pages that are labeled as section headers. The ground truth in data/gt_single_pages.json is updated with title entries. To extract the title GT, we used the Pixtral model and a predefined prompt. The Pixtral interface was refactored so that it can be used in the future for other tasks. For title detection we send the image along with the extracted text. Pixtral seems to have poor OCR capability.

Moreover, we started some refactoring of the evaluation process. We opened a new issue for future cleanup (#103)

Performance

The title is estimated based on multiple criteria and achieves 29.05% F1-score. For each criterion, the cumulative score is indicated.

  • Font size: Titles are more likely to have a larger font (18.99%)
  • Horizontality: The title should begin in the left half of the page (20.11%)
  • Verticality: The title should begin within the top 75% of the page vertically (23.46%)
  • Length: The title should be at least 5 characters long (24.58%)
  • Alpha: The title should consist mostly of alphabetic characters, with few numbers (25.70%)
  • Highness: The title is more likely to appear near the top of the page (29.05%)

API response

// API V2
{
  "has_finished": true,
  "data": {
    "filename": "35016.pdf",
    "page_count": 12,
    "languages": [
      "de"
    ],
    "entities": [
      {
        "classification": "title_page",
        "language": "de",
        "page_start": 1,
        "page_end": 1,
        "title": null
      },
      {
        "classification": "text",
        "language": "de",
        "page_start": 2,
        "page_end": 5,
        "title": "Boden Wasser Luft"
      },
      {
        "classification": "map",
        "language": "de",
        "page_start": 6,
        "page_end": 6,
        "title": null
      },
      {
        "classification": "boreprofile",
        "language": "de",
        "page_start": 7,
        "page_end": 7,
        "title": "RB 1"
      },
      {
        "classification": "boreprofile",
        "language": "de",
        "page_start": 8,
        "page_end": 8,
        "title": "RB 10"
      },
      {
        "classification": "diagram",
        "language": "de",
        "page_start": 9,
        "page_end": 12,
        "title": null
      }
    ]
  }
}
// API V1
{
  "has_finished": true,
  "data": [
    {
      "filename": "35016.pdf",
      "metadata": {
        "page_count": 12,
        "languages": [
          "de"
        ]
      },
      "pages": [
        {
          "predicted_class": "TitlePage",
          "page_number": 1,
          "page_metadata": {
            "language": "de",
            "is_frontpage": true
          }
        },
        {
          "predicted_class": "Text",
          "page_number": 2,
          "page_metadata": {
            "language": "de",
            "is_frontpage": false
          }
        },
        {
          "predicted_class": "Text",
          "page_number": 3,
          "page_metadata": {
            "language": "de",
            "is_frontpage": false
          }
        },
        {
          "predicted_class": "Text",
          "page_number": 4,
          "page_metadata": {
            "language": "de",
            "is_frontpage": false
          }
        },
        {
          "predicted_class": "Text",
          "page_number": 5,
          "page_metadata": {
            "language": "de",
            "is_frontpage": false
          }
        },
        {
          "predicted_class": "Map",
          "page_number": 6,
          "page_metadata": {
            "language": "de",
            "is_frontpage": false
          }
        },
        {
          "predicted_class": "Boreprofile",
          "page_number": 7,
          "page_metadata": {
            "language": "de",
            "is_frontpage": false
          }
        },
        {
          "predicted_class": "Boreprofile",
          "page_number": 8,
          "page_metadata": {
            "language": "de",
            "is_frontpage": false
          }
        },
        {
          "predicted_class": "Diagram",
          "page_number": 9,
          "page_metadata": {
            "language": "de",
            "is_frontpage": false
          }
        },
        {
          "predicted_class": "Diagram",
          "page_number": 10,
          "page_metadata": {
            "language": "de",
            "is_frontpage": false
          }
        },
        {
          "predicted_class": "Diagram",
          "page_number": 11,
          "page_metadata": {
            "language": "de",
            "is_frontpage": false
          }
        },
        {
          "predicted_class": "Diagram",
          "page_number": 12,
          "page_metadata": {
            "language": "de",
            "is_frontpage": false
          }
        }
      ]
    }
  ]
}

@christianabbet christianabbet changed the title Feat/issue 98/title section header Extraction of title from "section header" pages Mar 4, 2026
@christianabbet christianabbet marked this pull request as ready for review March 5, 2026 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extraction of title from "section header" pages

2 participants