Document splitting based on bookmarks #17465

JOHKRE277 · 2024-02-13T15:48:51Z

Checked

I searched existing ideas and did not find a similar one
I added a very descriptive title
I've clearly described the feature request and motivation for it

Feature request

The idea is to split a PDF or Word file based on the file's bookmarks. The chunk size would have to be set dynamically, based on an index function or the like. This way each chunk would contain only the context of its chapter, and no overlap would be needed.

I tested extracting bookmarks from PDFs, and it works with a function like this one:

import PyPDF2

pdf_path = "PATH OF THE PDF FILE"


def get_bookmarks_from_pdf(pdf_path):
    """Return the titles of all bookmarks (outline entries) in a PDF."""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        bookmarks = []

        def _get_bookmarks(outline_items):
            for item in outline_items:
                if isinstance(item, list):
                    # Recursive call for nested bookmarks
                    _get_bookmarks(item)
                elif isinstance(item, PyPDF2.generic.Destination):
                    bookmarks.append(item['/Title'])

        # Note: newer PyPDF2 releases (3.x) name this attribute `outline`
        if reader.outlines:
            _get_bookmarks(reader.outlines)
    return bookmarks


bookmarks = get_bookmarks_from_pdf(pdf_path)
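
To illustrate how the bookmarks could drive the splitting itself, here is a minimal sketch. It is not an existing LangChain splitter. The names `Chunk` and `split_by_bookmarks` are hypothetical, and it assumes the page texts and the start page of each bookmark have already been extracted (for example with `page.extract_text()` and `reader.get_destination_page_number(item)` in PyPDF2). It works at page granularity only, so a chapter that starts mid-page would need a finer-grained approach.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    title: str  # bookmark title, kept as chunk metadata
    text: str   # text of all pages belonging to this chapter


def split_by_bookmarks(page_texts, bookmarks):
    """Group page texts into one chunk per bookmark.

    page_texts: list of strings, one per page.
    bookmarks:  list of (title, start_page_index) tuples, 0-based.
    """
    # Sort by start page so each chapter ends where the next one begins.
    ordered = sorted(bookmarks, key=lambda b: b[1])
    chunks = []
    for i, (title, start) in enumerate(ordered):
        # The last chapter runs to the end of the document.
        end = ordered[i + 1][1] if i + 1 < len(ordered) else len(page_texts)
        chunks.append(Chunk(title=title, text="\n".join(page_texts[start:end])))
    return chunks


# Made-up example data: four pages and three bookmarks.
pages = ["Intro text", "Methods p1", "Methods p2", "Results"]
marks = [("Introduction", 0), ("Methods", 1), ("Results", 3)]
for chunk in split_by_bookmarks(pages, marks):
    print(chunk.title, "->", len(chunk.text), "characters")
```

The chunk size follows from the distance between consecutive bookmarks, so no fixed size or overlap is configured. Two bookmarks that point to the same page would leave the first chunk empty in this sketch.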

Motivation

I didn't find a document splitter in LangChain that works with bookmarks and fills each chunk with the chapter's context.

Proposal (If applicable)

No response

Replies: 0 comments
