
Commit 099eef6

Merge pull request #322 from Ljzd-PRO/devel
Bump to v0.20.0
2 parents: df65915 + bfa7953


12 files changed: +206 -111 lines changed


CHANGELOG.md

Lines changed: 69 additions & 13 deletions
````diff
@@ -1,28 +1,84 @@
 ## Changes
 
-![Downloads](https://img.shields.io/github/downloads/Ljzd-PRO/KToolBox/v0.19.2/total)
+![Downloads](https://img.shields.io/github/downloads/Ljzd-PRO/KToolBox/v0.20.0/total)
 
-### 💡 Feature
+### ✨ Features
 
-> - Improved the error log format to make it easier to read and understand (v0.19.1)
+- Added options to control whether to extract `content` and `external_links` for greater flexibility - #317
+  - Related configuration items:
+    - `job.extract_content`: Whether to extract post text content as a separate file, default is disabled (`False`)
+    - `job.extract_external_links`: Whether to extract external links in post text content as a separate file, default is disabled (`False`)
+  - You can edit these settings via `ktoolbox config-editor` (`Job -> ...`)
+  - Or manually edit them in the `.env` file or environment variables
+    ```dotenv
+    # Whether to extract post text content as a separate file
+    KTOOLBOX_JOB__EXTRACT_CONTENT=True
+    # Whether to extract external links in post text content as a separate file
+    KTOOLBOX_JOB__EXTRACT_EXTERNAL_LINKS=True
 
-### 🪲 Fix
+    # Change the default file names for content.txt and external_links.txt
+    KTOOLBOX_JOB__POST_STRUCTURE__CONTENT="content.html"
+    KTOOLBOX_JOB__POST_STRUCTURE__EXTERNAL_LINKS="link.txt"
+    ```
+  - 📖 More info: [Configuration Reference - JobConfiguration](https://ktoolbox.readthedocs.io/latest/configuration/reference/#ktoolbox.configuration.JobConfiguration)
+- Support controlling whether to preserve the metadata (such as modification date) of downloaded files - #321
+  - If you usually browse images by download date, or want to use post images as Windows folder preview covers, you can disable this option
+  - Related configuration item:
+    - `downloader.keep_metadata`: Whether to preserve the metadata (such as modification date) of downloaded files, enabled by default (`True`)
+  - You can edit this setting via `ktoolbox config-editor` (`Downloader -> keep_metadata`)
+  - Or manually edit it in the `.env` file or environment variables
+    ```dotenv
+    # Whether to preserve the metadata (such as modification date) of downloaded files
+    KTOOLBOX_DOWNLOADER__KEEP_METADATA=False
+    ```
+  - 📖 More info: [Configuration Reference - DownloaderConfiguration](https://ktoolbox.readthedocs.io/latest/configuration/reference/#ktoolbox.configuration.DownloaderConfiguration)
 
-- Fixed the issue where **`content`** data of works could not be obtained due to **Kemono API changes**, resulting in missing **`content.txt`** and `external_links.txt` - #316
-  > - Fixed the issue where author information and work data could not be retrieved due to **Kemono API changes** - #315 (v0.19.1)
-  > - Error messages: `Kemono API call failed: ...`, `404 Not Found`, `403 Forbidden`, ...
+### 🪲 Fixes
+
+- Due to changes in the Kemono API, extraction of **post text content and external links** (content and external_links) can now only be performed one post at a time.
+  Therefore, post text content and external links are extracted **only when** the default-disabled `job.extract_content` and `job.extract_external_links` are set to `True` (as mentioned above)
+  and the post **actually contains text content**, to avoid frequent API calls that may trigger **server DDoS protection**.
+- Output `SUCCESS` level logs to help users better understand download status
 
 - - -
 
-### 💡 New Features
+### New Features
 
-> - Improved the error log format to make it easier to read and understand (v0.19.1)
+- Support controlling whether to extract content and external_links for greater flexibility - #317
+  - Related configuration items:
+    - `job.extract_content`: Whether to extract post text content as a separate file, disabled by default (`False`)
+    - `job.extract_external_links`: Whether to extract external links in post text content as a separate file, disabled by default (`False`)
+  - You can edit these settings by running `ktoolbox config-editor` (`Job -> ...`)
+  - Or edit them manually in the `.env` file or environment variables
+    ```dotenv
+    # Whether to extract post text content as a separate file
+    KTOOLBOX_JOB__EXTRACT_CONTENT=True
+    # Whether to extract external links in post text content as a separate file
+    KTOOLBOX_JOB__EXTRACT_EXTERNAL_LINKS=True
+
+    # Change the default content.txt and external_links.txt file names
+    KTOOLBOX_JOB__POST_STRUCTURE__CONTENT="content.html"
+    KTOOLBOX_JOB__POST_STRUCTURE__EXTERNAL_LINKS="link.txt"
+    ```
+  - 📖 More info: [Configuration Reference - JobConfiguration](https://ktoolbox.readthedocs.io/latest/zh/configuration/reference/#ktoolbox._configuration_zh.JobConfiguration)
+- Support controlling whether to preserve the metadata (such as modification date) of downloaded files - #321
+  - If you usually browse images sorted by download date, or want to use post images as Windows folder preview covers, you can disable this option
+  - Related configuration item:
+    - `downloader.keep_metadata`: Whether to preserve the metadata (such as modification date) of downloaded files, enabled by default (`True`)
+  - You can edit this setting by running `ktoolbox config-editor` (`Downloader -> keep_metadata`)
+  - Or edit it manually in the `.env` file or environment variables
+    ```dotenv
+    # Whether to preserve the metadata (such as modification date) of downloaded files
+    KTOOLBOX_DOWNLOADER__KEEP_METADATA=False
+    ```
+  - 📖 More info: [Configuration Reference - DownloaderConfiguration](https://ktoolbox.readthedocs.io/latest/zh/configuration/reference/#ktoolbox._configuration_zh.DownloaderConfiguration)
 
 ### 🪲 Fixes
 
-- Fixed the issue where **`content`** data of posts could not be obtained due to **Kemono API changes**, resulting in missing **`content.txt`** and **`external_links.txt`** - #316
-  > - Fixed the issue where creator information and post data could not be retrieved due to Kemono **API changes** - #315 (v0.19.1)
-  > - Error messages: `Kemono API call failed: ...`, `404 Not Found`, `403 Forbidden`, ...
+- Due to changes in the Kemono API, the extraction of **post text content and external links** (content and external_links) can now only be performed one post at a time,
+  so post text content and external links are extracted **only when** the default-disabled `job.extract_content` and `job.extract_external_links` are set to `True` (as mentioned in the features above)
+  and the post **actually contains text content**, to avoid frequent API calls that may trigger the **server's DDoS protection**
+- Output `SUCCESS` level logs so users can understand the download status more clearly
 
 ## Upgrade
 
@@ -31,4 +87,4 @@ Use this command to upgrade if you are using **pipx**:
 pipx upgrade ktoolbox
 ```
 
-**Full Changelog**: https://github.com/Ljzd-PRO/KToolBox/compare/v0.19.1...v0.19.2
+**Full Changelog**: https://github.com/Ljzd-PRO/KToolBox/compare/v0.19.2...v0.20.0
````
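The `downloader.keep_metadata` option above preserves file metadata such as the modification date. The downloader implementation is not part of this diff, so the following is only a minimal sketch of the general idea: after writing a file, set its timestamps from the server's `Last-Modified` header. The file name and header value here are invented for the demo and are not KToolBox's actual code.

```python
import os
import time
from email.utils import parsedate_to_datetime


def apply_last_modified(path: str, last_modified: str) -> None:
    """Set a file's access/modification time from an HTTP Last-Modified header."""
    mtime = parsedate_to_datetime(last_modified).timestamp()
    os.utime(path, (mtime, mtime))


# Hypothetical usage after a download finishes
with open("demo.bin", "wb") as f:
    f.write(b"data")
apply_last_modified("demo.bin", "Wed, 01 May 2024 10:00:00 GMT")
print(time.gmtime(os.path.getmtime("demo.bin")).tm_year)  # 2024
```

Disabling `keep_metadata` simply skips this step, leaving the download time as the file's modification date.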

ktoolbox/__init__.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,4 +1,4 @@
 __title__ = "KToolBox"
 # noinspection SpellCheckingInspection
 __description__ = "A useful CLI tool for downloading posts in Kemono.cr / .su / .party"
-__version__ = "v0.19.2"
+__version__ = "v0.20.0"
```

ktoolbox/_configuration_zh.py

Lines changed: 9 additions & 4 deletions
```diff
@@ -25,7 +25,7 @@ class DownloaderConfiguration(ktoolbox.configuration.DownloaderConfiguration):
 
     :ivar scheme: Downloader URL scheme
     :ivar timeout: Downloader request timeout
-    :ivar encoding: Character set used for filename parsing and for saving post content text
+    :ivar encoding: Character set used for filename parsing and for saving post ``content`` and ``external_links``
     :ivar buffer_size: File I/O buffer size in bytes for each downloaded file
     :ivar chunk_size: Chunk size in bytes of the downloader stream
     :ivar temp_suffix: Temporary filename suffix of downloading files
@@ -36,6 +36,7 @@ class DownloaderConfiguration(ktoolbox.configuration.DownloaderConfiguration):
     :ivar use_bucket: Enable local bucket mode
     :ivar bucket_path: Local bucket path
     :ivar reverse_proxy: Reverse proxy format for download URLs. Customize the format by inserting an empty ``{}`` representing the original URL. For example: ``https://example.com/{}`` becomes ``https://example.com/https://n1.kemono.su/data/66/83/xxxxx.jpg``; ``https://example.com/?url={}`` becomes ``https://example.com/?url=https://n1.kemono.su/data/66/83/xxxxx.jpg``
+    :ivar keep_metadata: Preserve file metadata (e.g. last modified time) when downloading files
     """
     ...
 
@@ -118,14 +119,15 @@ class JobConfiguration(ktoolbox.configuration.JobConfiguration):
     :ivar filename_format: Customize the filename format by inserting an empty ``{}`` to represent the basic filename. [Properties][ktoolbox.configuration.JobConfiguration] can be used. For example: ``{title}_{}`` may generate ``TheTitle_b4b41de2-8736-480d-b5c3-ebf0d917561b``, ``TheTitle_af349b25-ac08-46d7-98fb-6ce99a237b90``, etc. Can also be combined with ``sequential_filename``, e.g. ``[{published}]_{}`` may generate ``[2024-1-1]_1.png``, ``[2024-1-1]_2.png``, etc.
     :ivar allow_list: Download only files matching these patterns (Unix shell style), e.g. ``["*.png"]``
     :ivar block_list: Skip files matching these patterns (Unix shell style), e.g. ``["*.psd","*.zip"]``
-    :ivar extract_external_links: Extract external file-sharing links from post content and save them to a separate file
+    :ivar extract_content: Extract post content and save it to a separate file (filename defined by ``config.job.post_structure.content``)
+    :ivar extract_external_links: Extract external file-sharing links from post content and save them to a separate file (filename defined by ``config.job.post_structure.external_links``)
     :ivar external_link_patterns: Regular expression patterns used to extract external links
     :ivar group_by_year: Group posts into different directories by publish year
     :ivar group_by_month: Group posts into different directories by publish month (requires group_by_year to be enabled)
     :ivar year_dirname_format: Customize the year directory name format. Available property: ``year``. For example: ``{year}`` > ``2024``, ``Year_{year}`` > ``Year_2024``
     :ivar month_dirname_format: Customize the month directory name format. Available properties: ``year``, ``month``. For example: ``{year}-{month}`` > ``2024-01``, ``{year}_{month}`` > ``2024_01``
     """
-    ...
+    post_structure: PostStructureConfiguration = PostStructureConfiguration()
 
 
 class LoggerConfiguration(ktoolbox.configuration.LoggerConfiguration):
@@ -155,4 +157,7 @@ class Configuration(ktoolbox.configuration.Configuration):
     Install winloop on Windows: `pip install ktoolbox[winloop]` \
     Install uvloop on Unix: `pip install ktoolbox[uvloop]`
     """
-    ...
+    api: APIConfiguration = APIConfiguration()
+    downloader: DownloaderConfiguration = DownloaderConfiguration()
+    job: JobConfiguration = JobConfiguration()
+    logger: LoggerConfiguration = LoggerConfiguration()
```
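The `KTOOLBOX_JOB__EXTRACT_CONTENT`-style variables shown in the changelog map onto these nested configuration classes via a `KTOOLBOX_` prefix and a `__` delimiter between section and field. KToolBox's configuration layer (built on pydantic models) handles that mapping itself; the stdlib sketch below only illustrates the naming convention, not the real implementation.

```python
def parse_nested_env(environ: dict, prefix: str = "KTOOLBOX_", delimiter: str = "__") -> dict:
    """Fold PREFIX_SECTION__FIELD=value pairs into a nested dict."""
    config: dict = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue  # unrelated environment variable
        path = key[len(prefix):].lower().split(delimiter)
        node = config
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return config


env = {
    "KTOOLBOX_JOB__EXTRACT_CONTENT": "True",
    "KTOOLBOX_DOWNLOADER__KEEP_METADATA": "False",
    "HOME": "/home/user",  # ignored: no KTOOLBOX_ prefix
}
print(parse_nested_env(env))
# {'job': {'extract_content': 'True'}, 'downloader': {'keep_metadata': 'False'}}
```

So `KTOOLBOX_JOB__EXTRACT_CONTENT` targets `job.extract_content`, and `KTOOLBOX_DOWNLOADER__KEEP_METADATA` targets `downloader.keep_metadata`.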

ktoolbox/action/job.py

Lines changed: 73 additions & 52 deletions
```diff
@@ -15,9 +15,9 @@
     filter_posts_by_keywords, filter_posts_by_keywords_exclude, generate_grouped_post_path
 from ktoolbox.api.model import Post, Attachment, Revision
 from ktoolbox.api.posts import get_post_revisions as get_post_revisions_api, get_post as get_post_api
-from ktoolbox.configuration import config, PostStructureConfiguration
+from ktoolbox.configuration import config
 from ktoolbox.job import Job, CreatorIndices
-from ktoolbox.utils import extract_external_links
+from ktoolbox.utils import extract_external_links, generate_msg
 
 __all__ = ["create_job_from_post", "create_job_from_creator"]
 
@@ -26,35 +26,39 @@ async def create_job_from_post(
         post: Union[Post, Revision],
         post_path: Path,
         *,
-        post_structure: Union[PostStructureConfiguration, bool] = None,
+        post_dir: bool = True,
         dump_post_data: bool = True
 ) -> List[Job]:
     """
     Create a list of download job from a post data
 
     :param post: post data
     :param post_path: Path of the post directory, which needs to be sanitized
-    :param post_structure: post path structure, ``False`` -> disable, \
-        ``True`` & ``None`` -> ``config.job.post_structure``
+    :param post_dir: Whether to create post directory
     :param dump_post_data: Whether to dump post data (post.json) in post directory
+    :raise FetchInterruptError: If fetching post content fails
     """
     post_path.mkdir(parents=True, exist_ok=True)
 
     # Load ``PostStructureConfiguration``
-    if post_structure in [True, None]:
-        post_structure = config.job.post_structure
-    if post_structure:
-        attachments_path = post_path / post_structure.attachments  # attachments
+    if post_dir:
+        attachments_path = post_path / config.job.post_structure.attachments  # attachments
         attachments_path.mkdir(exist_ok=True)
-        content_path = post_path / post_structure.content  # content
+        content_path = post_path / config.job.post_structure.content  # content
         content_path.parent.mkdir(exist_ok=True)
-        external_links_path = post_path / post_structure.external_links  # external_links
+        external_links_path = post_path / config.job.post_structure.external_links  # external_links
         external_links_path.parent.mkdir(exist_ok=True)
     else:
         attachments_path = post_path
         content_path = None
         external_links_path = None
 
+    if dump_post_data:
+        async with aiofiles.open(str(post_path / DataStorageNameEnum.PostData.value), "w", encoding="utf-8") as f:
+            await f.write(
+                post.model_dump_json(indent=config.json_dump_indent)
+            )
+
     # Filter and create jobs for ``Post.attachment``
     jobs: List[Job] = []
     sequential_counter = 1  # Counter for sequential filenames
@@ -120,37 +124,45 @@ async def create_job_from_post(
                 post=post
             )
         )
+    # ``post.substring`` is used to determine if the post has content, but it's only partial
+    if post.substring and post_dir and (config.job.extract_content or config.job.extract_external_links):
+        # If post has no content, fetch it from get_post API
+        if not post.content:
+            get_post_ret = await get_post_api(
+                service=post.service,
+                creator_id=post.user,
+                post_id=post.id,
+                revision_id=post.revision_id if isinstance(post, Revision) else None
+            )
+            if get_post_ret:
+                post = get_post_ret.data.post
+            else:
+                logger.error(
+                    generate_msg(
+                        "Failed to fetch post content",
+                        post_name=post.title or "Unknown",
+                        post_id=post.id,
+                        creator_id=post.user,
+                        service=post.service
+                    )
+                )
+                raise FetchInterruptError(ret=get_post_ret)
 
-    # If post has no content, fetch it from get_post API
-    if not post.content:
-        get_post_ret = await get_post_api(
-            service=post.service,
-            creator_id=post.user,
-            post_id=post.id,
-            revision_id=post.revision_id if isinstance(post, Revision) else None
-        )
-        if get_post_ret:
-            post = get_post_ret.data.post
-
-    # Write content file
-    if content_path and post.content:
-        async with aiofiles.open(content_path, "w", encoding=config.downloader.encoding) as f:
-            await f.write(post.content)
-
-    # Extract and write external links file
-    if config.job.extract_external_links and external_links_path and post.content:
-        external_links = extract_external_links(post.content, config.job.external_link_patterns)
-        if external_links:
-            async with aiofiles.open(external_links_path, "w", encoding=config.downloader.encoding) as f:
-                # Write each link on a separate line
-                for link in sorted(external_links):
-                    await f.write(f"{link}\n")
+        # If post content is still empty, skip content extraction
+        if post.content:
+            # Write content file
+            if config.job.extract_content:
+                async with aiofiles.open(content_path, "w", encoding=config.downloader.encoding) as f:
+                    await f.write(post.content)
 
-    if dump_post_data:
-        async with aiofiles.open(str(post_path / DataStorageNameEnum.PostData.value), "w", encoding="utf-8") as f:
-            await f.write(
-                post.model_dump_json(indent=config.json_dump_indent)
-            )
+            # Extract and write external links file
+            if config.job.extract_external_links:
+                external_links = extract_external_links(post.content, config.job.external_link_patterns)
+                if external_links:
+                    async with aiofiles.open(external_links_path, "w", encoding=config.downloader.encoding) as f:
+                        # Write each link on a separate line
+                        for link in sorted(external_links):
+                            await f.write(f"{link}\n")
 
     return jobs
 
@@ -250,7 +262,10 @@ async def create_job_from_creator(
         await f.write(indices.model_dump_json(indent=config.json_dump_indent))
 
     if config.job.include_revisions:
-        logger.info("`job.include_revisions` is enabled and will fetch post revisions, "
+        logger.warning("`job.include_revisions` is enabled and will fetch post revisions, "
+                       "which may take time. Disable if not needed.")
+    if config.job.extract_content or config.job.extract_external_links:
+        logger.warning("`job.extract_content` or `job.extract_external_links` is enabled and will fetch post content one by one, "
                        "which may take time. Disable if not needed.")
 
     job_list: List[Job] = []
@@ -264,12 +279,15 @@ async def create_job_from_creator(
         post_path = grouped_base_path / generate_post_path_name(post)
 
         # Generate jobs for the main post
-        job_list += await create_job_from_post(
-            post=post,
-            post_path=post_path,
-            post_structure=False if mix_posts else None,
-            dump_post_data=not mix_posts
-        )
+        try:
+            job_list += await create_job_from_post(
+                post=post,
+                post_path=post_path,
+                post_dir=not mix_posts,
+                dump_post_data=not mix_posts
+            )
+        except FetchInterruptError as e:
+            return ActionRet(**e.ret.model_dump(mode="python"))
 
         # If include_revisions is enabled, fetch and download revisions for this post
         if config.job.include_revisions and not mix_posts:
@@ -284,11 +302,14 @@ async def create_job_from_creator(
                     if revision.revision_id:  # Only process actual revisions
                         revision_path = post_path / config.job.post_structure.revisions / generate_post_path_name(
                             revision)
-                        revision_jobs = await create_job_from_post(
-                            post=revision,
-                            post_path=revision_path,
-                            dump_post_data=True
-                        )
+                        try:
+                            revision_jobs = await create_job_from_post(
+                                post=revision,
+                                post_path=revision_path,
+                                dump_post_data=True
+                            )
+                        except FetchInterruptError as e:
+                            return ActionRet(**e.ret.model_dump(mode="python"))
                         job_list += revision_jobs
             except Exception as e:
                 logger.warning(f"Failed to fetch revisions for post {post.id}: {e}")
```
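The new warning in `create_job_from_creator` exists because content extraction now issues one `get_post` call per post. The underlying pattern, awaiting requests strictly one at a time instead of firing a concurrent burst, is what keeps the request rate low enough to avoid tripping the server's DDoS protection. A generic sketch of that pattern; the `fetch` callable and `delay` value are placeholders, not KToolBox's actual API:

```python
import asyncio
from typing import Awaitable, Callable, List


async def fetch_one_by_one(post_ids: List[str],
                           fetch: Callable[[str], Awaitable[str]],
                           delay: float = 0.5) -> List[str]:
    """Fetch posts strictly one at a time, pausing between calls."""
    results = []
    for post_id in post_ids:
        results.append(await fetch(post_id))  # no concurrency: one request in flight
        await asyncio.sleep(delay)
    return results


async def fake_fetch(post_id: str) -> str:
    """Stand-in for a real get_post API call."""
    return f"content-{post_id}"


print(asyncio.run(fetch_one_by_one(["a1", "b2"], fake_fetch, delay=0)))
# ['content-a1', 'content-b2']
```

The trade-off is the one the warning describes: sequential fetching is slower, which is why both extraction options default to disabled.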

ktoolbox/api/model/post.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -24,6 +24,7 @@ class Post(BaseModel):
     service: Optional[str] = None
     title: Optional[str] = None
     content: Optional[str] = None
+    substring: Optional[str] = None
     embed: Optional[Dict[str, Any]] = None
     shared_file: Optional[bool] = None
     added: Optional[datetime] = None
```
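The new `substring` field lets list endpoints advertise that a post has text without shipping the full `content`. A simplified stand-in (a dataclass instead of the real pydantic `BaseModel`, and without the `post_dir` check) showing how `create_job_from_post` uses it to decide whether a full `get_post` fetch is needed:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PostStub:
    """Simplified stand-in for ktoolbox.api.model.Post."""
    id: str
    content: Optional[str] = None    # full text; often absent in list responses
    substring: Optional[str] = None  # partial text; signals the post has content


def needs_full_fetch(post: PostStub, extract_enabled: bool) -> bool:
    """Only call the get_post API when the post advertises text and extraction is on."""
    return bool(post.substring) and extract_enabled and not post.content


print(needs_full_fetch(PostStub(id="1", substring="Hello"), True))   # True
print(needs_full_fetch(PostStub(id="2"), True))                      # False
print(needs_full_fetch(PostStub(id="3", substring="Hi"), False))     # False
```

Because `substring` is only partial, the full `content` must still be fetched per post when extraction is enabled, which is exactly what the changed `create_job_from_post` does.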
