For non-page-based formats like epub, how to get the text by chapters? #2503

workflowsguy · 2023-06-28T18:19:25Z

workflowsguy
Jun 28, 2023

I tried to get the text of an EPUB book by chapter, but could not find any information on how to do that.
PyMuPDF splits chapters into pages according to a rule that is not obvious to me.

What I would like to get in the end is a list of dicts like [{'Chaptername': chaptertext}].

Is this possible?

Thanks!

JorjMcKie · 2023-06-28T18:48:27Z

JorjMcKie
Jun 28, 2023
Maintainer

You probably just have to iterate over the EPUB chapters:

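# assumes `doc` is an already opened document, e.g. doc = fitz.open(...)
for i in range(doc.chapter_count):
    # number of pages in chapter i
    chapter_page_count = doc.chapter_page_count(i)
    chapter_text = ""
    for j in range(chapter_page_count):
        # address a page by its (chapter, page-in-chapter) location
        page = doc[(i, j)]
        chapter_text += page.get_text()
    print(f"Text of chapter {i}")
    print(chapter_text)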
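
To get closer to the list of dicts asked for in the question, the same loop could be wrapped in a function. This is only a sketch: the function names `text_by_chapter` and `chapters_from_file` are made up for illustration, and the chapter index is used as the key because the chapter API does not supply a chapter name.

```python
def text_by_chapter(doc):
    """Return [{chapter_index: chapter_text}, ...] for a chaptered document.

    `doc` is expected to behave like a PyMuPDF Document: it must offer
    `chapter_count`, `chapter_page_count(i)` and page access via
    `doc[(chapter, page)]`.
    """
    result = []
    for i in range(doc.chapter_count):
        parts = []
        for j in range(doc.chapter_page_count(i)):
            page = doc[(i, j)]  # page j of chapter i
            parts.append(page.get_text())
        # join once per chapter instead of growing a string page by page
        result.append({i: "".join(parts)})
    return result


def chapters_from_file(path):
    """Hypothetical convenience wrapper: open `path` and collect chapter texts."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency

    doc = fitz.open(path)
    try:
        return text_by_chapter(doc)
    finally:
        doc.close()
```

Calling `chapters_from_file` with the path of an EPUB file would then return one single-key dict per chapter, in chapter order.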

0 replies

workflowsguy · 2023-06-30T11:41:29Z

workflowsguy
Jun 30, 2023
Author

Thank you for the example, @JorjMcKie. I had overlooked that the Document class provides support for chapters.

I tested this with two epub files. Unfortunately, this does not work as expected.
In PyMuPDF 1.21.1, chapter_page_count returns an incorrect value (it seems that for chapter "n", it actually gets the page count for chapter "n+1").

2 replies

JorjMcKie Jun 30, 2023
Maintainer

If you were right, then the following test would have to fail, right?

chapter_sum = 0
for i in range(doc.chapter_count):
    chapter_sum += doc.chapter_page_count(i)
assert chapter_sum == doc.page_count

JorjMcKie Jun 30, 2023
Maintainer

It doesn't in my test cases.

workflowsguy · 2023-06-30T12:43:15Z

workflowsguy
Jun 30, 2023
Author

I think that, based on my tests, I made an incorrect assumption about the relationship between "chapters" and TOC entries.

I thought I could get the chapter text as in your example and the name of the chapter from the TOC. But as I currently understand it, those two items do not map to one another (i.e. chapter 1 is not necessarily TOC entry 1 of level 1).

1 reply

JorjMcKie Jun 30, 2023
Maintainer

You are right. "Chapter" is a purely technical term in the EPUB's file structure. There is no reason to assume or rely on any relationship with the items in the TOC. The latter may be created independently, following other considerations.
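
If named sections are what is needed, one possible alternative is to ignore the technical chapters and split the text at the pages that the TOC entries point to. The sketch below assumes the TOC has the shape returned by `doc.get_toc()`, i.e. a list of `[level, title, page]` items with 1-based page numbers. The function names `text_by_toc_entry` and `sections_from_file` are hypothetical. The split is only page-accurate: a section that starts in the middle of a page gets the whole page, two entries pointing to the same page leave the first one empty, text before the first entry is dropped, and entries with the same title each get their own dict.

```python
def text_by_toc_entry(toc, page_texts, level=1):
    """Return [{title: text}, ...] for the TOC entries of the given level.

    toc        : list of [level, title, page, ...] items, page is 1-based
    page_texts : list with the text of every page, in document order
    """
    # keep entries of the wanted level that point to a real page
    starts = [
        (title, page)
        for lvl, title, page, *_ in toc
        if lvl == level and page >= 1
    ]
    sections = []
    for idx, (title, page) in enumerate(starts):
        begin = page - 1  # convert to 0-based page index
        if idx + 1 < len(starts):
            end = starts[idx + 1][1] - 1  # stop before the next entry's page
        else:
            end = len(page_texts)  # last entry runs to the end
        sections.append({title: "".join(page_texts[begin:end])})
    return sections


def sections_from_file(path, level=1):
    """Hypothetical wrapper: read TOC and page texts with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency

    doc = fitz.open(path)
    try:
        toc = doc.get_toc()
        page_texts = [page.get_text() for page in doc]
    finally:
        doc.close()
    return text_by_toc_entry(toc, page_texts, level)
```

Because EPUB is reflowable, the page numbers in the TOC depend on the layout in effect when the file is opened, so the section boundaries found this way are approximate.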
