emoji detected PictureItem and colons not found #2831

@Choi-YoungHyun

Question

Thank you for developing the Docling library.
I have two questions.

The test file is a PDF exported from Notion.

When converting this file to Markdown, I encountered two issues and would like some advice.

Issue 1. Colons are not recognized correctly; they come out as \ue092 and \ue09d, which are code points in the Unicode Private Use Area.
Some of these cases may be caused by text that I intentionally wrote without spaces for testing.

Issue 2. When emojis are used, the text tends to be recognized as an image (PictureItem) instead of text.

[NOTE]

  • This file was exported from a Notion page.

[TEST-FILE]
notion_to.pdf

[TEST CODE]

from docling.datamodel.pipeline_options import PdfPipelineOptions, TableStructureOptions, EasyOcrOptions
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
# DocumentConverter and PdfFormatOption are used below but were not imported
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import TableItem, TextItem, RefItem
from docling_core.types.doc import ImageRefMode

accelerator_options = AcceleratorOptions(
    num_threads=5, 
    device=AcceleratorDevice.CPU
)

pipeline_options = PdfPipelineOptions(
    generate_parsed_pages=True,
    generate_picture_images=True,
    do_table_structure=True,
    do_ocr=True,
    ocr_options=EasyOcrOptions(lang=['ko', 'en']),
    accelerator_options=accelerator_options,
    images_scale=2.0,  # increase resolution
    generate_page_images=True,  # added!
    table_structure_options=TableStructureOptions(
        do_cell_matching=True,
        mode="accurate"
    )
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)

result = converter.convert("이모지테스트_노션_DPF.pdf")  # file name

for item, _ in result.document.iterate_items():
    print(item)

# 4. Now output the Markdown
markdown = result.document.export_to_markdown(
    image_mode=ImageRefMode.EMBEDDED
)

print(markdown)

[RESULT PRINT ISSUE ITEM]

self_ref='#/texts/2' parent=RefItem(cref='#/body') children=[] content_layer=<ContentLayer.BODY: 'body'> meta=None label=<DocItemLabel.TEXT: 'text'> prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=72.0, t=675.0450029101562, r=121.107, b=636.5250029101562, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 14))] orig='SERVER\ue092 FRONT\ue09d' text='SERVER\ue092 FRONT\ue09d' formatting=None hyperlink=None
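
To narrow down Issue 1, the sketch below checks which characters in the extracted string are Private Use code points, using only the Python standard library. The helper names `find_private_use` and `replace_private_use` are hypothetical, and the mapping of both glyphs to ":" is an assumption based on what the Notion page shows, not something Docling reports.

```python
import unicodedata


def find_private_use(text):
    """Return (index, 'U+XXXX') for every Private Use character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" = Other, Private Use
    ]


def replace_private_use(text, mapping):
    """Replace characters found in mapping; leave everything else untouched."""
    return "".join(mapping.get(ch, ch) for ch in text)


# The string taken from the printed item above.
sample = "SERVER\ue092 FRONT\ue09d"
hits = find_private_use(sample)

# ASSUMPTION: both glyphs are rendered as ":" in the source page.
assumed_mapping = {"\ue092": ":", "\ue09d": ":"}
cleaned = replace_private_use(sample, assumed_mapping)

print(hits)
print(cleaned)
```

If the characters are reported as Private Use, the PDF font is probably mapping its own glyphs to those code points, so a post-processing replacement like this would only be a workaround for this particular export.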

[RESULT]
Image
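
For Issue 2, a rough diagnostic is to list the picture items whose bounding box is about the size of a single glyph, since those are likely emojis rather than real figures. This is only a sketch: `small_pictures` is a hypothetical helper, the 24-point threshold is a guess, and the attribute names (`label`, `prov`, `page_no`, `bbox` with `l`, `t`, `r`, `b`) are taken from the printed item above. The label value "picture" is assumed to be what Docling uses for PictureItem. Stub objects stand in for real items so the block runs without Docling.

```python
from types import SimpleNamespace


def small_pictures(items, max_side=24.0):
    """Return (page_no, width, height) for picture items that fit in max_side."""
    found = []
    for item, _level in items:
        label = getattr(item, "label", None)
        # Works for both a plain string and an enum with a .value attribute.
        if str(getattr(label, "value", label)) != "picture":
            continue
        for prov in getattr(item, "prov", None) or []:
            box = prov.bbox
            width = abs(box.r - box.l)
            height = abs(box.t - box.b)
            if width <= max_side and height <= max_side:
                found.append((prov.page_no, width, height))
    return found


def _stub(label, l, t, r, b, page_no=1):
    """Build a stand-in for a document item (illustration only)."""
    bbox = SimpleNamespace(l=l, t=t, r=r, b=b)
    prov = SimpleNamespace(page_no=page_no, bbox=bbox)
    return SimpleNamespace(label=label, prov=[prov])


demo_items = [
    (_stub("picture", 72.0, 700.0, 86.0, 686.0), 1),   # 14 x 14, emoji-sized
    (_stub("picture", 72.0, 600.0, 372.0, 400.0), 1),  # 300 x 200, a real figure
    (_stub("text", 72.0, 675.0, 121.1, 636.5), 1),     # not a picture
]
suspects = small_pictures(demo_items)
print(suspects)
```

With a real conversion, the same function could be called as `small_pictures(result.document.iterate_items())` to see how many of the detected pictures are emoji-sized.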
