Question
Thank you for developing the Docling library.
I have two questions.
The test file is a PDF exported from Notion.
When converting this file to Markdown, I encountered two issues and would like some advice.
Issue 1. Colons are not recognized correctly and are instead interpreted as \ue092 and \ue09d.
Some of these cases may be due to text that was intentionally written without spaces for testing.
Issue 2. When a line contains emojis, the text tends to be recognized as an image.
[NOTE]
- This file was exported from a Notion page.
[TEST-FILE]
notion_to.pdf
[TEST CODE]
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PdfPipelineOptions,
    TableStructureOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import ImageRefMode, RefItem, TableItem, TextItem

# Run on CPU with 5 threads
accelerator_options = AcceleratorOptions(
    num_threads=5,
    device=AcceleratorDevice.CPU,
)

pipeline_options = PdfPipelineOptions(
    generate_parsed_pages=True,
    generate_picture_images=True,
    do_table_structure=True,
    do_ocr=True,
    ocr_options=EasyOcrOptions(lang=['ko', 'en']),
    accelerator_options=accelerator_options,
    images_scale=2.0,  # increase resolution
    generate_page_images=True,  # added
    table_structure_options=TableStructureOptions(
        do_cell_matching=True,
        mode="accurate",
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)

result = converter.convert("이모지테스트_노션_DPF.pdf")  # file name

# Print every item in the converted document
for item, _ in result.document.iterate_items():
    print(item)

# Now export to Markdown
markdown = result.document.export_to_markdown(
    image_mode=ImageRefMode.EMBEDDED
)
print(markdown)
[RESULT PRINT ISSUE ITEM]
self_ref='#/texts/2' parent=RefItem(cref='#/body') children=[] content_layer=<ContentLayer.BODY: 'body'> meta=None label=<DocItemLabel.TEXT: 'text'> prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=72.0, t=675.0450029101562, r=121.107, b=636.5250029101562, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 14))] orig='SERVER\ue092 FRONT\ue09d' text='SERVER\ue092 FRONT\ue09d' formatting=None hyperlink=None
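For reference, both \ue092 and \ue09d fall in the Unicode Private Use Area (U+E000 to U+F8FF), so they are not standard colon characters; this may point to a custom glyph mapping in the PDF's font, though I have not confirmed that. Below is a minimal sketch for locating such characters in an extracted string. `find_private_use` is a hypothetical helper written for this report, not part of Docling.

```python
def find_private_use(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for each character in the BMP Private Use Area."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if 0xE000 <= ord(ch) <= 0xF8FF  # BMP Private Use Area range
    ]


# The text value from the item printed above
sample = "SERVER\ue092 FRONT\ue09d"
hits = find_private_use(sample)
print(hits)
```

Running this on the `text` field of each item would show where the unexpected code points appear.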
