For non-page-based formats like epub, how to get the text by chapters? #2503
-
You probably just have to iterate over the EPUB chapters:

for i in range(doc.chapter_count):
    chapter_page_count = doc.chapter_page_count(i)
    chapter_text = ""
    for j in range(chapter_page_count):
        # Address each page by its (chapter, page-in-chapter) location.
        page = doc[(i, j)]
        chapter_text += page.get_text()
    print(f"Text of chapter {i}")
    print(chapter_text)
-
Thank you for the example, @JorjMcKie. I had overlooked that the I tested this with two EPUB files. Unfortunately, it does not work as expected.
-
Judging from my tests, I think I made an incorrect assumption about the relationship between "chapters" and TOC entries. I thought I could get the chapter text as in your example and take the chapter name from the TOC. As I currently understand it, though, the two do not map to one another (i.e. chapter 1 is not necessarily level-1 TOC entry 1).
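One possible workaround, offered only as a sketch: ignore the chapter structure and split the text at the page numbers recorded in the TOC. The helper name group_pages_by_toc and the wrapper chapters_from_epub are made up for illustration. The sketch assumes the TOC has the [level, title, page] shape returned by Document.get_toc(), that its 1-based page numbers are resolved for the file at hand, and that entries appear in document order. I have not checked this against the two EPUB files mentioned above.

```python
def group_pages_by_toc(toc, page_texts, level=1):
    """Group page texts under the TOC entries of one level (hypothetical helper).

    toc: entries shaped like [level, title, page, ...] with 1-based page numbers.
    page_texts: page text strings in document order.
    Returns a list of single-item dicts: [{title: text}, ...].
    """
    # Keep entries of the requested level that point at an existing page.
    starts = [
        (page - 1, title)
        for lvl, title, page, *_ in toc
        if lvl == level and 1 <= page <= len(page_texts)
    ]
    result = []
    for idx, (start, title) in enumerate(starts):
        # A section runs up to the page on which the next entry starts.
        end = starts[idx + 1][0] if idx + 1 < len(starts) else len(page_texts)
        result.append({title: "".join(page_texts[start:end])})
    return result


def chapters_from_epub(path):
    """Apply the helper to a real file (assumes PyMuPDF is installed)."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency

    doc = fitz.open(path)
    page_texts = [page.get_text() for page in doc]
    return group_pages_by_toc(doc.get_toc(), page_texts)
```

Limitations of this approach: splitting is page-granular, so text before the first TOC entry is dropped, an entry that starts mid-page also receives the tail of the previous section, and two entries on the same page leave the first one empty. Entries with the same title produce separate dicts, which is one reason a list of dicts is used here rather than a single dict.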
-
I tried to get the text of an EPUB book by chapter, but could not find any information on how to do that.
PyMuPDF splits chapters into pages according to a rule that is not obvious to me.
What I would like to get in the end is a list of dicts like [{'Chaptername': chaptertext}].
Is this possible?
Thanks!