Commit e1dabff

Merge branch 'devel'
2 parents 2a92485 + de73180 commit e1dabff

9 files changed, +429 -170 lines changed

CHANGELOG.md

Lines changed: 32 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -1,10 +1,10 @@
11
## Changes
22

3-
![Downloads](https://img.shields.io/github/downloads/Ljzd-PRO/KToolBox/v0.21.0/total)
3+
![Downloads](https://img.shields.io/github/downloads/Ljzd-PRO/KToolBox/v0.21.1/total)
44

55
### ✨ Features
66

7-
- Improved **download progress display**, providing a more **elegant and intuitive** progress bar
7+
- Improved **download progress display**, providing a more **elegant and intuitive** progress bar (v0.21.0)
88
```log
99
2025-08-19 13:42:07 | INFO | ktoolbox.cli - Got creator information - name: Ljzd-PRO, id: 12345678
1010
@@ -20,7 +20,7 @@
2020
2121
2025-08-19 13:44:01 | SUCCESS | ktoolbox.job.runner - All jobs in queue finished
2222
```
23-
- Automatically check for updates at program startup and notify the user if a new version is available
23+
- Automatically check for updates at program startup and notify the user if a new version is available (v0.21.0)
2424
```log
2525
2025-08-19 13:41:23 | INFO | ktoolbox.utils - Update available: 0.21.0 (current: 0.20.0)
2626
@@ -29,14 +29,26 @@
2929
3030
### 🪲 Fixes
3131
32+
- Fixed the issue where the feature "**support downloading images embedded in the post HTML content**", announced in [v0.17.0](https://github.com/Ljzd-PRO/KToolBox/releases/tag/v0.17.0), was never actually implemented - #332
33+
- This is a serious issue: the relevant branch was apparently never merged into the main branch, so the feature was **not implemented in v0.17.0**.
34+
- Related configuration:
35+
- `job.extract_content_images`: Whether to parse and download images embedded in the post HTML content, disabled by default (`False`)
36+
- This feature is disabled by default because when using the `sync-creator` command to download all posts from a creator, the content of each post must be fetched individually, which can easily trigger the site's DDoS protection and get your requests blocked.
37+
- You can edit this configuration via `ktoolbox config-editor` (`Job -> extract_content_images`)
38+
- Or manually edit it in the `.env` file or environment variables:
39+
```dotenv
40+
# Enable parsing and downloading images embedded in the post HTML content
41+
KTOOLBOX_JOB__EXTRACT_CONTENT_IMAGES=True
42+
```
43+
- 📖 More info: [Configuration Reference - JobConfiguration](https://ktoolbox.readthedocs.io/latest/configuration/reference/#ktoolbox.configuration.JobConfiguration)
3244
- Fixed the issue where the `download-post` command would **still not generate the text content file** (`content.txt`)
33-
and external links file (`external_links.txt`) even when the `job.extract_content` and `job.extract_external_links` options were enabled - #332
45+
and external links file (`external_links.txt`) even when the `job.extract_content` and `job.extract_external_links` options were enabled - #332 (v0.21.0)
3446
3547
- - -
3648
3749
### ✨ 新特性
3850
39-
- 改进了**下载进度显示**,提供更加**优美和直观**的进度条显示
51+
- 改进了**下载进度显示**,提供更加**优美和直观**的进度条显示 (v0.21.0)
4052
```log
4153
2025-08-19 13:42:07 | INFO | ktoolbox.cli - Got creator information - name: Ljzd-PRO, id: 12345678
4254
@@ -52,7 +64,7 @@ and external links file (`external_links.txt`) even when the `job.extract_conten
5264
5365
2025-08-19 13:44:01 | SUCCESS | ktoolbox.job.runner - All jobs in queue finished
5466
```
55-
- 程序启动时自动检查更新,并在有新版本时提示用户
67+
- 程序启动时自动检查更新,并在有新版本时提示用户 (v0.21.0)
5668
```log
5769
2025-08-19 13:41:23 | INFO | ktoolbox.utils - Update available: 0.21.0 (current: 0.20.0)
5870
@@ -61,8 +73,20 @@ and external links file (`external_links.txt`) even when the `job.extract_conten
6173
6274
### 🪲 修复
6375
76+
- 修复了 [v0.17.0](https://github.com/Ljzd-PRO/KToolBox/releases/tag/v0.17.0) 新增的“**支持下载帖子 HTML 内容中嵌入的图片**”实际上并不存在的问题 - #332
77+
- 这确实是个严重的问题,似乎相关分支没有被合并到主分支,导致该功能**在 v0.17.0 版本中并未实现**
78+
- 相关配置项:
79+
- `job.extract_content_images`:是否解析并下载帖子 HTML 内容中嵌入的图片,默认关闭(`False`)
80+
- 该功能默认关闭,因为当使用 `sync-creator` 命令下载作者全部帖子时,只能逐个获取帖子内容(content),这容易导致触发 DDoS 防御机制而被阻断
81+
- 可通过运行 `ktoolbox config-editor` 编辑这些配置(`Job -> extract_content_images`)
82+
- 或手动在 `.env` 文件或环境变量中编辑:
83+
```dotenv
84+
# 开启解析并下载帖子 HTML 内容中嵌入的图片
85+
KTOOLBOX_JOB__EXTRACT_CONTENT_IMAGES=True
86+
```
87+
- 📖更多信息:[配置参考-JobConfiguration](https://ktoolbox.readthedocs.io/latest/zh/configuration/reference/#ktoolbox._configuration_zh.JobConfiguration)
6488
- 修复即使启用了 `job.extract_content` 和 `job.extract_external_links` 配置项,`download-post` 命令
65-
仍然**不会生成文本内容文件**(`content.txt`)和外部链接文件(`external_links.txt`)的问题 - #332
89+
仍然**不会生成文本内容文件**(`content.txt`)和外部链接文件(`external_links.txt`)的问题 - #332 (v0.21.0)
6690
6791
## Upgrade
6892
@@ -71,4 +95,4 @@ Use this command to upgrade if you are using **pipx**:
7195
pipx upgrade ktoolbox
7296
```
7397

74-
**Full Changelog**: https://github.com/Ljzd-PRO/KToolBox/compare/v0.20.0...v0.21.0
98+
**Full Changelog**: https://github.com/Ljzd-PRO/KToolBox/compare/v0.21.0...v0.21.1
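
The variable name `KTOOLBOX_JOB__EXTRACT_CONTENT_IMAGES` in the changelog suggests a `KTOOLBOX_` prefix followed by `section__option`. The sketch below only illustrates that naming convention with the standard library; it is not KToolBox's actual settings loader, and `parse_env`, `as_bool`, and the two constants are hypothetical names introduced for this example.

```python
from typing import Dict, Mapping

ENV_PREFIX = "KTOOLBOX_"   # assumed from the variable name shown in the changelog
NESTED_DELIMITER = "__"    # assumed: separates the config section from the option


def parse_env(environ: Mapping[str, str]) -> Dict[str, Dict[str, str]]:
    """Group prefixed variables into {section: {option: raw_value}}."""
    parsed: Dict[str, Dict[str, str]] = {}
    for name, value in environ.items():
        if not name.startswith(ENV_PREFIX):
            continue  # unrelated variable
        section, sep, option = name[len(ENV_PREFIX):].partition(NESTED_DELIMITER)
        if not sep or not option:
            continue  # not a nested option; ignored in this sketch
        parsed.setdefault(section.lower(), {})[option.lower()] = value
    return parsed


def as_bool(raw: str) -> bool:
    """Treat common truthy spellings as True; everything else is False."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


settings = parse_env({
    "KTOOLBOX_JOB__EXTRACT_CONTENT_IMAGES": "True",
    "PATH": "/usr/bin",
})
extract_content_images = as_bool(settings["job"]["extract_content_images"])
```

In practice you would pass `os.environ` instead of the literal mapping used here.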

ktoolbox/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
11
__title__ = "KToolBox"
22
# noinspection SpellCheckingInspection
33
__description__ = "A useful CLI tool for downloading posts in Kemono.cr / .su / .party"
4-
__version__ = "v0.21.0"
4+
__version__ = "v0.21.1"

ktoolbox/action/job.py

Lines changed: 70 additions & 6 deletions
@@ -12,7 +12,7 @@
1212
from ktoolbox._enum import PostFileTypeEnum, DataStorageNameEnum
1313
from ktoolbox.action import ActionRet, fetch_creator_posts, FetchInterruptError
1414
from ktoolbox.action.utils import generate_post_path_name, filter_posts_by_date, generate_filename, \
15-
filter_posts_by_keywords, filter_posts_by_keywords_exclude, generate_grouped_post_path
15+
filter_posts_by_keywords, filter_posts_by_keywords_exclude, generate_grouped_post_path, extract_content_images
1616
from ktoolbox.api.model import Post, Attachment, Revision
1717
from ktoolbox.api.posts import get_post_revisions as get_post_revisions_api, get_post as get_post_api
1818
from ktoolbox.configuration import config
@@ -125,7 +125,9 @@ async def create_job_from_post(
125125
)
126126
)
127127
# ``post.substring`` is used to determine if the post has content, but it's only partial
128-
if (post.content or post.substring) and post_dir and (config.job.extract_content or config.job.extract_external_links):
128+
if (post.content or post.substring) and post_dir and (
129+
config.job.extract_content or config.job.extract_external_links or config.job.extract_content_images
130+
):
129131
# If post has no content, fetch it from get_post API
130132
if not post.content:
131133
get_post_ret = await get_post_api(
@@ -164,6 +166,67 @@ async def create_job_from_post(
164166
for link in sorted(external_links):
165167
await f.write(f"{link}\n")
166168

169+
# Extract content images
170+
if config.job.extract_content_images:
171+
content_image_sources = extract_content_images(post.content)
172+
for image_src in content_image_sources:
173+
if not image_src or not image_src.strip():
174+
continue
175+
176+
# Normalize the image source to a server path
177+
# noinspection HttpUrlsUsage
178+
if image_src.startswith('/') and not image_src.startswith('//'):
179+
# Root-relative path - use it as the server path as-is
180+
image_path = image_src
181+
elif image_src.startswith('http://') or image_src.startswith('https://'):
182+
# Absolute URL - extract path
183+
image_path = urlparse(image_src).path
184+
else:
185+
# Skip data URLs, protocol-relative URLs, or other non-path sources
186+
continue
187+
188+
if not image_path or not image_path.strip():
189+
continue
190+
191+
# Generate filename from the image path
192+
image_file_path = Path(image_path)
193+
194+
# Apply "allow/block list" filtering first (before incrementing counter)
195+
if config.job.sequential_filename:
196+
basic_filename = f"{sequential_counter + 1}{image_file_path.suffix}"
197+
else:
198+
basic_filename = image_file_path.name
199+
200+
alt_filename = generate_filename(post, basic_filename, config.job.filename_format)
201+
202+
if (not config.job.allow_list or any(
203+
map(
204+
lambda x: fnmatch(alt_filename, x),
205+
config.job.allow_list
206+
)
207+
)) and not any(
208+
map(
209+
lambda x: fnmatch(alt_filename, x),
210+
config.job.block_list
211+
)
212+
):
213+
# Regenerate filename with correct counter
214+
should_use_sequential = (config.job.sequential_filename and
215+
image_file_path.suffix.lower() not in config.job.sequential_filename_excludes)
216+
if should_use_sequential:
217+
basic_filename = f"{sequential_counter}{image_file_path.suffix}"
218+
alt_filename = generate_filename(post, basic_filename, config.job.filename_format)
219+
sequential_counter += 1
220+
221+
jobs.append(
222+
Job(
223+
path=attachments_path,
224+
alt_filename=alt_filename,
225+
server_path=image_path,
226+
type=PostFileTypeEnum.Attachment
227+
)
228+
)
229+
167230
return jobs
168231

169232

@@ -263,10 +326,11 @@ async def create_job_from_creator(
263326

264327
if config.job.include_revisions:
265328
logger.warning("`job.include_revisions` is enabled and will fetch post revisions, "
266-
"which may take time. Disable if not needed.")
267-
if config.job.extract_content or config.job.extract_external_links:
268-
logger.warning("`job.extract_content` or `job.extract_external_links` is enabled and will fetch post content one by one, "
269-
"which may take time. Disable if not needed.")
329+
"which may take time. Disable if not needed.")
330+
if config.job.extract_content or config.job.extract_external_links or config.job.extract_content_images:
331+
logger.warning(
332+
"`job.extract_content` or `job.extract_external_links` or `job.extract_content_images` is enabled "
333+
"and will fetch post content one by one, which may take time. Disable if not needed.")
270334

271335
job_list: List[Job] = []
272336
for post in post_list:
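
The source-normalization step added to `create_job_from_post` above can be read in isolation as follows. This is a minimal standalone sketch of the same branching, assuming the goal is a server path for the `Job`; `image_src_to_server_path` is a hypothetical helper name and does not exist in the commit.

```python
from typing import Optional
from urllib.parse import urlparse


def image_src_to_server_path(image_src: Optional[str]) -> Optional[str]:
    """Return the server path for an <img> src, or None if it should be skipped."""
    if not image_src or not image_src.strip():
        return None
    if image_src.startswith("/") and not image_src.startswith("//"):
        # Root-relative path: already a server path
        path = image_src
    elif image_src.startswith(("http://", "https://")):
        # Absolute URL: keep only the path component (query and fragment are dropped)
        path = urlparse(image_src).path
    else:
        # data: URLs, protocol-relative URLs and other non-path sources are skipped
        return None
    return path if path and path.strip() else None
```

Note that an absolute URL pointing at a different host is reduced to its path as well, so the download is attempted against the configured file server rather than the original host.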

ktoolbox/action/utils.py

Lines changed: 46 additions & 10 deletions
@@ -1,4 +1,5 @@
11
from datetime import datetime
2+
from html.parser import HTMLParser
23
from pathlib import Path
34
from typing import Optional, List, Generator, Any, Tuple, Set
45

@@ -19,12 +20,27 @@
1920
"filter_posts_by_indices",
2021
"match_post_keywords",
2122
"filter_posts_by_keywords",
22-
"filter_posts_by_keywords_exclude"
23+
"filter_posts_by_keywords_exclude",
24+
"extract_content_images"
2325
]
2426

2527
TIME_FORMAT = "%Y-%m-%d"
2628

2729

30+
class _ContentImageParser(HTMLParser):
31+
"""HTML parser to extract image sources from content"""
32+
33+
def __init__(self):
34+
super().__init__()
35+
self.image_sources = []
36+
37+
def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]):
38+
if tag.lower() == 'img':
39+
for attr_name, attr_value in attrs:
40+
if attr_name.lower() == 'src' and attr_value:
41+
self.image_sources.append(attr_value)
42+
43+
2844
def generate_post_path_name(post: Post) -> str:
2945
"""Generate directory name for post to save."""
3046
if not post.title:
@@ -53,7 +69,7 @@ def generate_year_dirname(post: Post) -> str:
5369
post_date = post.published or post.added
5470
if not post_date:
5571
return "unknown"
56-
72+
5773
try:
5874
return sanitize_filename(
5975
config.job.year_dirname_format.format(
@@ -71,7 +87,7 @@ def generate_month_dirname(post: Post) -> str:
7187
post_date = post.published or post.added
7288
if not post_date:
7389
return "unknown"
74-
90+
7591
try:
7692
return sanitize_filename(
7793
config.job.month_dirname_format.format(
@@ -93,15 +109,15 @@ def generate_grouped_post_path(post: Post, base_path: Path) -> Path:
93109
:return: Full path where the post should be saved
94110
"""
95111
result_path = base_path
96-
112+
97113
if config.job.group_by_year:
98114
year_dirname = generate_year_dirname(post)
99115
result_path = result_path / year_dirname
100-
116+
101117
if config.job.group_by_month:
102118
month_dirname = generate_month_dirname(post)
103119
result_path = result_path / month_dirname
104-
120+
105121
return result_path
106122

107123

@@ -196,12 +212,12 @@ def match_post_keywords(post: Post, keywords: Set[str]) -> bool:
196212
"""
197213
if not keywords:
198214
return True
199-
215+
200216
# Only search in post title
201217
searchable_text = ""
202218
if post.title:
203219
searchable_text = post.title.lower()
204-
220+
205221
# Check if any keyword is found in the title
206222
return any(keyword.lower() in searchable_text for keyword in keywords)
207223

@@ -219,7 +235,7 @@ def filter_posts_by_keywords(
219235
if not keywords:
220236
yield from post_list
221237
return
222-
238+
223239
post_filter = filter(lambda x: match_post_keywords(x, keywords), post_list)
224240
yield from post_filter
225241

@@ -237,7 +253,27 @@ def filter_posts_by_keywords_exclude(
237253
if not keywords_exclude:
238254
yield from post_list
239255
return
240-
256+
241257
# Exclude posts that match any of the exclude keywords
242258
post_filter = filter(lambda x: not match_post_keywords(x, keywords_exclude), post_list)
243259
yield from post_filter
260+
261+
262+
def extract_content_images(content: str) -> List[str]:
263+
"""
264+
Extract image sources from HTML content
265+
266+
:param content: HTML content string
267+
:return: List of image source URLs/paths
268+
"""
269+
if not content:
270+
return []
271+
272+
parser = _ContentImageParser()
273+
try:
274+
parser.feed(content)
275+
except Exception as e:
276+
logger.warning(f"Failed to parse HTML content for images: {e}")
277+
return []
278+
279+
return parser.image_sources
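
For reference, the `HTMLParser`-based extraction added above can be exercised on its own. The sketch below uses only the standard library; `ImgSrcCollector` and the sample markup are made up for illustration and are not part of the commit. `HTMLParser` already lower-cases tag and attribute names before calling `handle_starttag`, so no extra `.lower()` is needed here.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                self.sources.append(value)


sample = (
    '<p>Hello</p>'
    '<img src="/data/a.png">'
    '<IMG SRC="https://example.com/b.jpg" alt="b">'
    '<img alt="no source">'
    '<img src="/c.gif"/>'
)
collector = ImgSrcCollector()
collector.feed(sample)
collector.close()
found = collector.sources
```

Self-closing tags such as `<img ... />` are covered too, because the default `handle_startendtag` delegates to `handle_starttag`.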

ktoolbox/configuration.py

Lines changed: 2 additions & 0 deletions
@@ -198,6 +198,7 @@ class JobConfiguration(BaseModel):
198198
:ivar allow_list: Download files which match these patterns (Unix shell-style), e.g. ``["*.png"]``
199199
:ivar block_list: Do not download files that match these patterns (Unix shell-style), e.g. ``["*.psd","*.zip"]``
200200
:ivar extract_content: Extract post content and save to separate file (filename is defined in ``config.job.post_structure.content``)
201+
:ivar extract_content_images: Extract images from post content and download them.
201202
:ivar extract_external_links: Extract external file sharing links from post content and save to separate file \
202203
(filename is defined in ``config.job.post_structure.external_links``)
203204
:ivar external_link_patterns: Regex patterns for extracting external links.
@@ -223,6 +224,7 @@ class JobConfiguration(BaseModel):
223224
# noinspection PyDataclass
224225
block_list: Set[str] = Field(default_factory=set)
225226
extract_content: bool = False
227+
extract_content_images: bool = False
226228
extract_external_links: bool = False
227229
# noinspection SpellCheckingInspection
228230
external_link_patterns: List[str] = [
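
The `allow_list` / `block_list` options documented in `JobConfiguration` are applied to content images with Unix shell-style matching. A minimal sketch of that check, assuming an empty allow list means "allow everything" as the condition in `create_job_from_post` implies; `passes_filters` is a hypothetical helper name, not a function in the codebase.

```python
from fnmatch import fnmatch
from typing import Iterable


def passes_filters(filename: str, allow_list: Iterable[str], block_list: Iterable[str]) -> bool:
    """True if the filename matches the allow list (or it is empty) and no block pattern."""
    allow = list(allow_list)
    allowed = not allow or any(fnmatch(filename, pattern) for pattern in allow)
    blocked = any(fnmatch(filename, pattern) for pattern in block_list)
    return allowed and not blocked
```

A filename that matches both lists is rejected, since the block list is checked regardless of the allow list result.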
