Commit 114525d

Improve robustness
1 parent f6993a1 commit 114525d

9 files changed: +290 -39 lines changed

.gitignore

Lines changed: 0 additions & 1 deletion
@@ -11,7 +11,6 @@ env/

 # Content Files
 *.epub
-!/test-data/*.epub
 *.txt
 !requirements.txt

README.md

Lines changed: 3 additions & 0 deletions
@@ -20,6 +20,7 @@
 * **Batch Processing**: Convert multiple files or top-level folders in one run (non-recursive).
 * **Formatting**: Adds blank lines between paragraphs.
 * **Text Extraction**: Strips images, styles, scripts, and metadata—keeps only text from `.html/.htm/.xhtml` files.
+* **Smart List Handling**: Converts ordered and unordered lists into clean, indented bullet points, preserving nested structures.
 * **Interactive Mode**: Run without arguments to enter interactive mode. Supports dragging multiple files and folders.
 * **Output Handling**: If `-o` points to a new folder, it will be created; `-o` is for single inputs only.

@@ -115,6 +116,7 @@ If you prefer to run it manually or don't want to use the helper scripts:
 * **一括処理**: 複数のファイルやトップレベルのフォルダを一度に変換します (再帰的ではありません)。
 * **整形**: 段落間に空行を追加します。
 * **テキスト抽出**: 画像、スタイル、スクリプト、メタデータを削除し、`.html/.htm/.xhtml` ファイルからテキストのみを保持します。
+* **リストの整形**: 箇条書き・番号付きリストを、階層を保ったまま読みやすく整形します。
 * **インタラクティブモード**: 引数なしで実行するとインタラクティブモードに入ります。複数のファイルやフォルダのドラッグ&ドロップに対応しています。
 * **出力処理**: `-o` で存在しないフォルダを指定した場合は作成します。`-o` は単一入力専用です。

@@ -217,6 +219,7 @@ If you prefer to run it manually or don't want to use the helper scripts:
 * **批量處理**: 一次轉換多個檔案或資料夾 (僅掃描第一層)。
 * **段落排版**: 在段落之間添加空行。
 * **文字提取**: 移除圖片、樣式、腳本與中繼資料,只保留 `.html/.htm/.xhtml` 文字。
+* **列表格式優化**: 自動保留列表的層級結構,並轉換為整齊易讀的縮排格式。
 * **互動模式**: 無參數執行即可進入互動模式,支援拖放多個檔案與資料夾。
 * **輸出處理**: 若 `-o` 指向的新資料夾不存在會自動建立;`-o` 只適用單檔輸入。

app/epub2txt.js

Lines changed: 108 additions & 27 deletions
@@ -5,28 +5,28 @@ document.addEventListener('DOMContentLoaded', () => {

     const ERRORS = {
         en: {
-            tooLarge: (size) => `File too large. Please use an EPUB under ${size}MB.`,
-            tooManyFiles: "EPUB has too many content files to process safely.",
-            noContent: "No readable HTML/XHTML content found in EPUB.",
-            missingOpf: "Invalid EPUB: OPF file declared in container.xml not found.",
-            invalidEpub: "Invalid EPUB/ZIP file.",
-            invalidOpf: "Invalid EPUB: OPF file is missing required sections."
+            tooLarge: (size) => `File too large. Please use an EPUB under ${size}MB`,
+            tooManyFiles: "EPUB has too many content files to process safely",
+            noContent: "No readable HTML/XHTML content found in EPUB",
+            missingOpf: "Invalid EPUB: OPF file declared in container.xml not found",
+            invalidEpub: "Invalid EPUB/ZIP file",
+            invalidOpf: "Invalid EPUB: OPF file is missing required sections"
         },
         ja: {
-            tooLarge: (size) => `ファイルサイズが大きすぎます。${size}MB未満のEPUBを使用してください`,
-            tooManyFiles: "EPUBに含まれるコンテンツファイルが多すぎるため、安全に処理できません",
-            noContent: "EPUB内に読み取り可能なHTML/XHTMLコンテンツが見つかりません",
-            missingOpf: "無効なEPUBです: container.xmlで指定されたOPFファイルが見つかりません",
-            invalidEpub: "無効なEPUB/ZIPファイルです",
-            invalidOpf: "無効なEPUBです: OPFファイルに必要なセクションが欠落しています"
+            tooLarge: (size) => `ファイルサイズが大きすぎます。${size}MB未満のEPUBを使用してください`,
+            tooManyFiles: "EPUBに含まれるコンテンツファイルが多すぎるため、安全に処理できません",
+            noContent: "EPUB内に読み取り可能なHTML/XHTMLコンテンツが見つかりません",
+            missingOpf: "無効なEPUBです: container.xmlで指定されたOPFファイルが見つかりません",
+            invalidEpub: "無効なEPUB/ZIPファイルです",
+            invalidOpf: "無効なEPUBです: OPFファイルに必要なセクションが欠落しています"
         },
         zh: {
-            tooLarge: (size) => `檔案過大,請使用小於 ${size}MB 的 EPUB`,
-            tooManyFiles: "此 EPUB 內容檔案過多,無法安全處理",
-            noContent: "EPUB 中沒有可讀的 HTML/XHTML 內容",
-            missingOpf: "無效的 EPUB: container.xml 指定的 OPF 檔案不存在",
-            invalidEpub: "無效的 EPUB/ZIP 檔案",
-            invalidOpf: "無效的 EPUB: OPF 檔案缺少必要的區段"
+            tooLarge: (size) => `檔案過大,請使用小於 ${size}MB 的 EPUB`,
+            tooManyFiles: "此 EPUB 內容檔案過多,無法安全處理",
+            noContent: "EPUB 中沒有可讀的 HTML/XHTML 內容",
+            missingOpf: "無效的 EPUB: container.xml 指定的 OPF 檔案不存在",
+            invalidEpub: "無效的 EPUB/ZIP 檔案",
+            invalidOpf: "無效的 EPUB: OPF 檔案缺少必要的區段"
         }
     };

@@ -80,8 +80,8 @@ document.addEventListener('DOMContentLoaded', () => {
             packaging: "Packaging ZIP...",
             processingFile: (current, total) => `File ${current}/${total}:`,
             errorPrefix: "Error: ",
-            onlyEpub: "Only .epub files are supported.",
-            genericError: "An unexpected error occurred.",
+            onlyEpub: "Only .epub files are supported",
+            genericError: "An unexpected error occurred",
             convertAnother: "Drag other .epub files to convert",
             selectFile: "select file(s)",
             downloadTxt: "Download TXT",
@@ -98,8 +98,8 @@ document.addEventListener('DOMContentLoaded', () => {
             packaging: "ZIPを作成中...",
             processingFile: (current, total) => `ファイル ${current}/${total}:`,
             errorPrefix: "エラー: ",
-            onlyEpub: ".epubファイルのみ対応しています",
-            genericError: "予期しないエラーが発生しました",
+            onlyEpub: ".epubファイルのみ対応しています",
+            genericError: "予期しないエラーが発生しました",
             convertAnother: "他の .epub ファイルをドラッグして変換",
             selectFile: "ファイルを選択",
             downloadTxt: "TXTをダウンロード",
@@ -116,8 +116,8 @@ document.addEventListener('DOMContentLoaded', () => {
             packaging: "正在打包 ZIP...",
             processingFile: (current, total) => `檔案 ${current}/${total}:`,
             errorPrefix: "錯誤: ",
-            onlyEpub: "請選擇 .epub 檔案",
-            genericError: "發生未預期的錯誤",
+            onlyEpub: "僅支援 .epub 檔案",
+            genericError: "發生未預期的錯誤",
             convertAnother: "拖放其他 .epub 檔案以轉換",
             selectFile: "選擇檔案",
             downloadTxt: "下載 TXT",
@@ -279,7 +279,13 @@ document.addEventListener('DOMContentLoaded', () => {
             // Handle missing files in spine gracefully
             let content;
             try {
-                content = await zip.file(path).async("string");
+                const entry = zip.file(path);
+                if (!entry) {
+                    console.warn("Could not read file:", path);
+                    continue;
+                }
+                const bytes = await entry.async("uint8array");
+                content = decodeBytesToString(bytes);
             } catch (e) {
                 console.warn("Could not read file:", path);
                 continue;
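
The hunk above skips a spine entry when the archive has no member at that path, instead of relying on the `catch` to absorb a null dereference. The same guard can be sketched for the Python side with the standard `zipfile` module. This is a minimal sketch, not code from this commit: `read_spine_entry` and `build_demo_archive` are hypothetical names, and the archive contents are illustrative.

```python
import io
import zipfile
from typing import Optional


def read_spine_entry(zip_ref: zipfile.ZipFile, path: str) -> Optional[bytes]:
    """Return the raw bytes of `path`, or None if the archive has no such member.

    Hypothetical helper; it mirrors the explicit missing-entry check in the
    JavaScript hunk rather than any function in epub2txt.py.
    """
    try:
        info = zip_ref.getinfo(path)  # raises KeyError when the member is missing
    except KeyError:
        print(f"Could not read file: {path}")
        return None
    return zip_ref.read(info)


def build_demo_archive() -> zipfile.ZipFile:
    """Build a tiny in-memory archive with one chapter (illustrative data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("OEBPS/ch1.xhtml", "<p>Hello</p>")
    buffer.seek(0)
    return zipfile.ZipFile(buffer)
```

Returning raw bytes rather than a decoded string keeps the encoding decision in a separate step, which is the split the commit makes between `entry.async("uint8array")` and `decodeBytesToString`.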
@@ -342,6 +348,44 @@ document.addEventListener('DOMContentLoaded', () => {
         return stack.join('/');
     }

+    function decodeBytesToString(bytes) {
+        const encoding = sniffEncoding(bytes) || 'utf-8';
+        try {
+            return new TextDecoder(encoding).decode(bytes);
+        } catch (e) {
+            return new TextDecoder('utf-8').decode(bytes);
+        }
+    }
+
+    function sniffEncoding(bytes) {
+        if (!bytes || !bytes.length) return null;
+        const headerBytes = bytes.subarray(0, 2048);
+        let headerText = '';
+        try {
+            headerText = new TextDecoder('utf-8').decode(headerBytes);
+        } catch (e) {
+            return null;
+        }
+
+        const xmlMatch = headerText.match(/<\?xml[^>]*encoding=["']([^"']+)["']/i);
+        if (xmlMatch) return normalizeEncodingName(xmlMatch[1]);
+
+        const metaCharsetMatch = headerText.match(/<meta[^>]*charset=["']?\s*([^"'\s/>]+)/i);
+        if (metaCharsetMatch) return normalizeEncodingName(metaCharsetMatch[1]);
+
+        const metaHttpEquivMatch = headerText.match(/<meta[^>]*http-equiv=["']content-type["'][^>]*content=["'][^"']*charset=([^"']+)["']/i);
+        if (metaHttpEquivMatch) return normalizeEncodingName(metaHttpEquivMatch[1]);
+
+        return null;
+    }
+
+    function normalizeEncodingName(name) {
+        if (!name) return null;
+        const cleaned = String(name).trim().toLowerCase().replace(/_/g, '-');
+        if (cleaned === 'utf8') return 'utf-8';
+        return cleaned;
+    }
+
     function resolveZipPath(opfDir, href) {
         const cleaned = href.split('#')[0];
         if (!cleaned) return null;
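
The new helpers read the first 2048 bytes, look for an XML declaration or a `<meta charset>` hint, and fall back to UTF-8 when the declared label is unknown. A rough Python equivalent is sketched below, assuming the same two hints are enough. `sniff_encoding` and `decode_bytes` are hypothetical names, and unlike the JavaScript version this sketch matches the patterns directly on bytes instead of decoding the header first.

```python
import codecs
import re
from typing import Optional

# Same two hints the JavaScript helper checks first, applied to raw bytes.
XML_DECL = re.compile(rb"<\?xml[^>]*encoding=[\"']([^\"']+)[\"']", re.IGNORECASE)
META_CHARSET = re.compile(rb"<meta[^>]*charset=[\"']?\s*([^\"'\s/>]+)", re.IGNORECASE)


def sniff_encoding(data: bytes) -> Optional[str]:
    """Return the declared encoding label from the first 2048 bytes, if any."""
    header = data[:2048]
    for pattern in (XML_DECL, META_CHARSET):
        match = pattern.search(header)
        if match:
            return match.group(1).decode("ascii", "ignore").strip().lower()
    return None


def decode_bytes(data: bytes) -> str:
    """Decode with the sniffed encoding, falling back to UTF-8 for unknown labels."""
    encoding = sniff_encoding(data) or "utf-8"
    try:
        codecs.lookup(encoding)  # raises LookupError for labels Python does not know
    except LookupError:
        encoding = "utf-8"
    # errors="replace" is a choice made for this sketch so decoding never raises.
    return data.decode(encoding, errors="replace")
```

Because the `<meta charset>` pattern allows any attributes before `charset=`, it also matches the `http-equiv` form, which is why the sketch does not need a third pattern.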
@@ -610,7 +654,17 @@ document.addEventListener('DOMContentLoaded', () => {
         return combined;
     }

-    function collectTextSegments(element, inPre = false, segments = [], state = null) {
+    /**
+     * Recursive function to traverse the DOM and collect text segments.
+     * Mirrors the logic in the Python script's `get_clean_text`.
+     *
+     * @param {Node} element - The DOM node to traverse.
+     * @param {boolean} inPre - Whether the current node is inside a <pre> tag.
+     * @param {Array} segments - Accumulator for text segments.
+     * @param {Object} state - Tracks state across recursion (e.g., hasContent, lastWasSeparator).
+     * @param {number} listDepth - Current nesting level of lists (for indentation).
+     */
+    function collectTextSegments(element, inPre = false, segments = [], state = null, listDepth = 0) {
         if (!element) return segments;
         if (!state) {
             state = { hasContent: false, lastWasSeparator: false };
@@ -645,6 +699,25 @@ document.addEventListener('DOMContentLoaded', () => {
                 return;
             }

+            // Handle Lists
+            if (tagName === 'UL' || tagName === 'OL') {
+                if (!inPre) pushSegment("\n", false);
+                collectTextSegments(node, inPre, segments, state, listDepth + 1);
+                if (!inPre) pushSegment("\n", false);
+                return;
+            }
+
+            if (tagName === 'LI') {
+                if (!inPre) {
+                    pushSegment("\n", false);
+                    const indent = "  ".repeat(Math.max(0, listDepth - 1));
+                    pushSegment(indent + "- ", true);
+                }
+                collectTextSegments(node, inPre, segments, state, listDepth);
+                if (!inPre) pushSegment("\n", false);
+                return;
+            }
+
             const headingLevel = HEADING_TAGS[tagName];
             if (headingLevel && !inPre) {
                 const headingText = node.textContent.replace(/\s+/g, ' ').trim();
@@ -664,7 +737,7 @@ document.addEventListener('DOMContentLoaded', () => {
                 pushSegment("\n", false);
             }

-            collectTextSegments(node, nextPre, segments, state);
+            collectTextSegments(node, nextPre, segments, state, listDepth);

             if (isBlock && !inPre) {
                 pushSegment("\n", false);
@@ -698,6 +771,10 @@ document.addEventListener('DOMContentLoaded', () => {
         return elements.length ? elements[0] : null;
     }

+    /**
+     * Handles the creation of a temporary Object URL for downloading.
+     * Revokes any existing URL to prevent memory leaks before creating a new one.
+     */
     function prepareBlobDownload(blob, filename, downloadType) {
         safeRevokeBlob();
         currentBlobUrl = URL.createObjectURL(blob);
@@ -769,6 +846,10 @@ document.addEventListener('DOMContentLoaded', () => {
         }
     }

+    /**
+     * Generates a unique filename by appending a counter if the name already exists.
+     * e.g., "book.txt" -> "book (2).txt" -> "book (3).txt"
+     */
     function makeUniqueFilename(name, usedNames) {
         if (!usedNames.has(name)) return name;
         const dotIndex = name.lastIndexOf('.');
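
The doc comment describes the naming scheme, but the hunk cuts off before the loop that applies it. One way to implement that scheme is sketched here in Python. `make_unique_filename` is a hypothetical stand-in, and recording each returned name in `used_names` is an assumption of the sketch; the JavaScript caller may track used names itself.

```python
import os
from typing import Set


def make_unique_filename(name: str, used_names: Set[str]) -> str:
    """Return `name`, or `stem (N).ext` with the smallest free N >= 2.

    Assumption: this sketch adds the returned name to `used_names` itself.
    """
    if name not in used_names:
        used_names.add(name)
        return name
    stem, ext = os.path.splitext(name)  # "book.txt" -> ("book", ".txt")
    counter = 2
    while f"{stem} ({counter}){ext}" in used_names:
        counter += 1
    candidate = f"{stem} ({counter}){ext}"
    used_names.add(candidate)
    return candidate
```

Called three times with `"book.txt"` and the same set, it yields `book.txt`, `book (2).txt`, and `book (3).txt`, matching the sequence in the comment.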

epub2txt.py

Lines changed: 43 additions & 8 deletions
@@ -126,15 +126,20 @@ def parse_opf(zip_ref: zipfile.ZipFile, opf_path: str):
     root = ET.fromstring(opf_content)

     # Create the OPF namespace map dynamically to handle varying versions (2.0 vs 3.0).
-    # Grabs the namespace from the root tag itself.
-    ns = {'pkg': root.tag.split('}')[0].strip('{')}
+    # Some OPFs are un-namespaced; handle both cases.
+    has_namespace = '}' in root.tag
+    ns = {'pkg': root.tag.split('}')[0].strip('{')} if has_namespace else {}

     # 1. Parse Manifest: Map ID -> Href (File Path)
     # Creates a dictionary where valid IDs point to their actual file locations.
     manifest_items = {}
     nav_href = None
     ncx_href = None
-    for item in root.findall(".//pkg:manifest/pkg:item", ns):
+    if has_namespace:
+        manifest_items_iter = root.findall(".//pkg:manifest/pkg:item", ns)
+    else:
+        manifest_items_iter = root.findall(".//manifest/item")
+    for item in manifest_items_iter:
         item_id = item.attrib.get('id')
         href = item.attrib.get('href')
         if not item_id or not href:
@@ -150,12 +155,16 @@ def parse_opf(zip_ref: zipfile.ZipFile, opf_path: str):
     # 2. Parse Spine: Get linear reading order
     # The spine tells the parser the order in which to display the items found in the manifest.
     spine_hrefs = []
-    spine = root.find(".//pkg:spine", ns)
+    spine = root.find(".//pkg:spine", ns) if has_namespace else root.find(".//spine")
     if spine is not None:
         toc_id = spine.attrib.get('toc')
         if toc_id and toc_id in manifest_items:
             ncx_href = manifest_items[toc_id]
-        for itemref in spine.findall(".//pkg:itemref", ns):
+        if has_namespace:
+            spine_items = spine.findall(".//pkg:itemref", ns)
+        else:
+            spine_items = spine.findall(".//itemref")
+        for itemref in spine_items:
             item_id = itemref.attrib.get('idref')
             if item_id in manifest_items:
                 spine_hrefs.append(manifest_items[item_id])
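
The two `parse_opf` hunks repeat the same namespaced/un-namespaced branch at every lookup. A compact way to exercise that behaviour in isolation is sketched below: it builds the optional `pkg:` prefix once and reuses it in each path. This is an illustrative variant, not the code in the commit, and `spine_hrefs_from_opf` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import List


def spine_hrefs_from_opf(opf_xml: str) -> List[str]:
    """Resolve spine itemrefs to manifest hrefs for namespaced and plain OPFs."""
    root = ET.fromstring(opf_xml)
    has_namespace = "}" in root.tag
    ns = {"pkg": root.tag.split("}")[0].strip("{")} if has_namespace else {}
    prefix = "pkg:" if has_namespace else ""  # empty prefix for un-namespaced OPFs

    # Manifest: id -> href, skipping items that lack either attribute.
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in root.findall(f".//{prefix}manifest/{prefix}item", ns)
        if "id" in item.attrib and "href" in item.attrib
    }
    # Spine: keep only itemrefs whose idref is present in the manifest.
    return [
        manifest[ref.attrib["idref"]]
        for ref in root.findall(f".//{prefix}spine/{prefix}itemref", ns)
        if ref.attrib.get("idref") in manifest
    ]
```

Feeding it the same package once with `xmlns="http://www.idpf.org/2007/opf"` on the root and once without should give the same list of hrefs, which is the case the commit is guarding.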
@@ -396,7 +405,7 @@ def epub_to_text(epub_path: str, output_txt_path: str) -> None:
                     element.decompose()

                 # Step 4: Extract text
-                # Use our custom function to handle spacing intelligently
+                # Use helper function to handle spacing intelligently
                 normalized_path = posixpath.normpath(file_path)
                 anchor_ids = chapter_anchors.get(normalized_path, [])
                 insert_anchor_markers(soup, anchor_ids)
@@ -469,6 +478,11 @@ def get_clean_text(soup: BeautifulSoup) -> str:
     """
     Extract text from BeautifulSoup object with intelligent whitespace handling.
     Preserves sentence structure for LLMs while maintaining paragraph separation.
+
+    This function traverses the DOM tree recursively:
+    - Block elements (p, div, etc.) trigger line breaks.
+    - Lists are flattened with indentation to preserve hierarchy.
+    - Script/Style/Meta tags are ignored.
     """
     root = soup.body or soup
     if not root:
@@ -490,7 +504,7 @@ def add_separator():
         parts.append(("\n\n---\n\n", False))
         state['last_sep'] = True

-    def walk(node, in_pre: bool = False):
+    def walk(node, in_pre: bool = False, list_depth: int = 0):
         for child in node.children:
             if isinstance(child, NavigableString):
                 text = str(child)
@@ -513,6 +527,27 @@ def walk(node, in_pre: bool = False):
             if name == 'br':
                 add_text("\n", in_pre)
                 continue
+
+            # Handle Lists
+            if name in ('ul', 'ol'):
+                if not in_pre:
+                    add_text("\n", False)
+                walk(child, in_pre, list_depth + 1)
+                if not in_pre:
+                    add_text("\n", False)
+                continue
+
+            if name == 'li':
+                if not in_pre:
+                    add_text("\n", False)
+                    # Indent based on depth (depth 1 = no indent, depth 2 = 2 spaces, etc.)
+                    indent = "  " * max(0, list_depth - 1)
+                    add_text(indent + "- ", True)
+                walk(child, in_pre, list_depth)
+                if not in_pre:
+                    add_text("\n", False)
+                continue
+
             heading_level = HEADING_TAGS.get(name)
             if heading_level and not in_pre:
                 heading_text = child.get_text(" ", strip=True)
@@ -528,7 +563,7 @@ def walk(node, in_pre: bool = False):
             if is_block and not in_pre:
                 add_text("\n", False)

-            walk(child, next_pre)
+            walk(child, next_pre, list_depth)

             if is_block and not in_pre:
                 add_text("\n", False)
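
The list handling threaded through `walk` comes down to three rules: entering `ul`/`ol` increases the depth, each `li` starts a new line with a `- ` marker, and the indent is two spaces per level beyond the first. The standalone sketch below shows those rules using the standard library's `html.parser` rather than BeautifulSoup, so it runs without third-party packages. `ListFlattener` and `flatten_lists` are hypothetical names, and the sketch ignores text outside lists and the `<pre>` handling.

```python
from html.parser import HTMLParser
from typing import List


class ListFlattener(HTMLParser):
    """Collect list items as indented '- ' lines (simplified illustration)."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0              # current ul/ol nesting level
        self.lines: List[str] = []  # one entry per <li>

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.depth += 1
        elif tag == "li":
            # depth 1 = no indent, depth 2 = 2 spaces, and so on
            indent = "  " * max(0, self.depth - 1)
            self.lines.append(indent + "- ")

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self.depth > 0 and self.lines:
            if not self.lines[-1].endswith("- "):
                self.lines[-1] += " "  # separate consecutive text chunks
            self.lines[-1] += text


def flatten_lists(markup: str) -> str:
    parser = ListFlattener()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.lines)
```

For `<ul><li>A<ul><li>B</li></ul></li><li>C</li></ul>` this produces three lines: `- A`, then `  - B` indented by two spaces, then `- C`. Ordered lists get the same `- ` marker, consistent with the README's description of converting both list types into bullet points.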

index.html

Lines changed: 2 additions & 1 deletion
@@ -10,6 +10,7 @@
     <link rel="icon" type="image/png" sizes="16x16" href="assets/favicon-16x16.png">
     <link rel="shortcut icon" href="assets/favicon.ico">
     <title>epub2txt - Convert EPUB to Text</title>
+    <link rel="canonical" href="https://spacesoda.github.io/epub2txt/" />
     <link rel="alternate" hreflang="en" href="https://spacesoda.github.io/epub2txt/" />
     <link rel="alternate" hreflang="ja" href="https://spacesoda.github.io/epub2txt/ja/" />
     <link rel="alternate" hreflang="zh-TW" href="https://spacesoda.github.io/epub2txt/zh/" />
@@ -130,7 +131,7 @@ <h2 id="success-filename">book.txt</h2>
             </div>
             <div id="error-state" class="hidden">
                 <div class="icon error">⚠️</div>
-                <p id="error-msg">Only .epub files are supported.</p>
+                <p id="error-msg">An unexpected error occurred</p>
                 <button id="retry-btn" class="cta-button secondary">Try Again</button>
             </div>
         </div>
