Commit c6999e4

Merge pull request #286 from Ljzd-PRO/devel
Bump to v0.17.0
2 parents 1ca4b4e + 9394d3c commit c6999e4

File tree

15 files changed

+235
-514
lines changed

CHANGELOG.md

Lines changed: 51 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -1,38 +1,69 @@
11
## Changes
22

3-
![Downloads](https://img.shields.io/github/downloads/Ljzd-PRO/KToolBox/v0.16.0/total)
3+
![Downloads](https://img.shields.io/github/downloads/Ljzd-PRO/KToolBox/v0.17.0/total)
44

55
### 💡 Feature
66

7-
- Add auto-managed cookies to bypass **DDoS Guard** - #269 (@CanglanXYA)
8-
- Add comprehensive **revision post** support with enhanced API and configuration - #240, #241
9-
- For posts like: `https://kemono.cr/{service}/user/{user_id}/post/{post_id}/revision/{revision_id}`
10-
- This feature is disabled by default
11-
- Run `ktoolbox config-editor` to edit this configuration (`Job -> include_revisions`)
12-
- Or manually edit it in `.env` file or environment variables
7+
- Support downloading **images embedded in post HTML content** - #218
8+
- Add external links extraction feature for **cloud storage URLs** - #232 (@xxkzn)
9+
- New configurations:
10+
- `job.extract_external_links`: Whether to extract external file sharing links from post content and save to separate file
11+
- `job.external_link_patterns`: Regex patterns for extracting external links
12+
- These configurations are **optional**, and the feature is enabled by default. The default regular expressions cover the following services:
13+
- Google Drive
14+
- MEGA
15+
- Dropbox
16+
- OneDrive
17+
- MediaFire
18+
- And other common file hosting services
19+
- Run `ktoolbox config-editor` to edit these configurations (`Job -> extract_external_links`, `Job -> external_link_patterns`)
20+
- Or manually edit them in the `.env` file or environment variables
1321
```dotenv
14-
# Set this to `True` to enable revisions download
15-
KTOOLBOX_JOB__INCLUDE_REVISIONS=True
22+
# This feature is enabled by default
23+
KTOOLBOX_JOB__EXTRACT_EXTERNAL_LINKS=True
24+
# Setting up lists and regular expressions in dotenv is relatively complex and cumbersome. It is recommended to use the aforementioned graphical configuration editor for these settings.
25+
KTOOLBOX_JOB__EXTERNAL_LINK_PATTERNS='["https?://drive\\.google\\.com/[^\\s]+", "https?://docs\\.google\\.com/[^\\s]+", "https?://mega\\.nz/[^\\s]+", "https?://mega\\.co\\.nz/[^\\s]+", "https?://(?:www\\.)?dropbox\\.com/[^\\s]+", "https?://db\\.tt/[^\\s]+", "https?://onedrive\\.live\\.com/[^\\s]+", "https?://1drv\\.ms/[^\\s]+", "https?://(?:www\\.)?mediafire\\.com/[^\\s]+", "https?://(?:www\\.)?wetransfer\\.com/[^\\s]+", "https?://we\\.tl/[^\\s]+", "https?://(?:www\\.)?sendspace\\.com/[^\\s]+", "https?://(?:www\\.)?4shared\\.com/[^\\s]+", "https?://(?:www\\.)?zippyshare\\.com/[^\\s]+", "https?://(?:www\\.)?uploadfiles\\.io/[^\\s]+", "https?://(?:www\\.)?box\\.com/[^\\s]+", "https?://(?:www\\.)?pcloud\\.com/[^\\s]+", "https?://disk\\.yandex\\.[a-z]+/[^\\s]+", "https?://[^\\s]*(?:file|upload|share|download|drive|storage)[^\\s]*\\.[a-z]{2,4}/[^\\s]+"]'
1626
```
27+
- 📖More information: [Configuration-Reference-JobConfiguration](https://ktoolbox.readthedocs.io/latest/configuration/reference/#ktoolbox.configuration.JobConfiguration)
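As an illustration of how a JSON-encoded pattern list such as the `KTOOLBOX_JOB__EXTERNAL_LINK_PATTERNS` value could be decoded and applied, here is a minimal Python sketch. It is not KToolBox's actual implementation: `find_external_links` is a hypothetical helper, only two of the listed patterns are used, and case-insensitive matching is an assumption.

```python
import json
import re
from typing import List, Set

# Two of the patterns from the changelog, JSON-encoded as in the dotenv value.
RAW_PATTERNS = r'["https?://drive\\.google\\.com/[^\\s]+", "https?://mega\\.nz/[^\\s]+"]'


def find_external_links(content: str, patterns: List[str]) -> Set[str]:
    """Hypothetical helper: collect every substring that matches any pattern."""
    links: Set[str] = set()
    for pattern in patterns:
        # Case-insensitive matching is an assumption of this sketch.
        links.update(re.findall(pattern, content, flags=re.IGNORECASE))
    return links


# JSON decoding turns each `\\.` into the regex escape `\.`.
patterns = json.loads(RAW_PATTERNS)
sample = "Files: https://mega.nz/file/abc123 and https://drive.google.com/file/d/xyz"
links = find_external_links(sample, patterns)
```

Because each pattern ends in `[^\s]+`, a match runs until the next whitespace character, so in real HTML content a link may pick up trailing markup unless the patterns are tightened.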
1728
18-
[//]: # (### 🪲 Fix)
29+
### 🪲 Fix
30+
31+
- Removed the deprecated configuration `job.post_structure.content_filepath`, use `job.post_structure.content` instead
32+
- Fixed an issue where the `sync-creator` command lacked handling for 404 responses when fetching post revisions
33+
(i\.e\. no revision version exists), which caused **slow task creation** - #294
34+
- Fixed the issue of **duplicate Cookies** in DDoS Guard management (manual management is no longer performed)
1935
2036
- - -
2137
2238
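### 💡 New Features
2339
24-
- Add auto-managed cookies to bypass **DDoS Guard** - #269 (@CanglanXYA)
25-
- Add comprehensive **revision post** support with enhanced API and configuration - #240, #241
26-
- For posts like: `https://kemono.cr/{service}/user/{user_id}/post/{post_id}/revision/{revision_id}`
27-
- This feature is disabled by default
28-
- Run `ktoolbox config-editor` to edit this configuration item (`Job -> include_revisions`)
29-
- Or manually edit it in the `.env` file or environment variables
40+
- Support downloading **images embedded in post HTML content** - #218
41+
- Add **external link extraction for cloud storage URLs** - #232 (@xxkzn)
42+
- New configuration items:
43+
- `job.extract_external_links`: Whether to extract external file-sharing links from post content and save them to a separate file
44+
- `job.external_link_patterns`: Regex patterns used to extract external links
45+
- These configuration items are **optional**, and the feature is enabled by default. The regular expressions already cover the following services:
46+
- Google Drive
47+
- MEGA
48+
- Dropbox
49+
- OneDrive
50+
- MediaFire
51+
- And other common file hosting services
52+
- Run `ktoolbox config-editor` to edit these configurations (`Job -> extract_external_links`, `Job -> external_link_patterns`)
53+
- Or manually edit the `.env` file or environment variables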
3054
```dotenv
31-
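# Set this to `True` to enable revision downloads
32-
KTOOLBOX_JOB__INCLUDE_REVISIONS=True
55+
# This feature is enabled by default
56+
KTOOLBOX_JOB__EXTRACT_EXTERNAL_LINKS=True
57+
# Setting lists and regular expressions in dotenv is fairly complex. The graphical configuration editor mentioned above is recommended for these settings.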
58+
KTOOLBOX_JOB__EXTERNAL_LINK_PATTERNS='["https?://drive\\.google\\.com/[^\\s]+", "https?://docs\\.google\\.com/[^\\s]+", "https?://mega\\.nz/[^\\s]+", "https?://mega\\.co\\.nz/[^\\s]+", "https?://(?:www\\.)?dropbox\\.com/[^\\s]+", "https?://db\\.tt/[^\\s]+", "https?://onedrive\\.live\\.com/[^\\s]+", "https?://1drv\\.ms/[^\\s]+", "https?://(?:www\\.)?mediafire\\.com/[^\\s]+", "https?://(?:www\\.)?wetransfer\\.com/[^\\s]+", "https?://we\\.tl/[^\\s]+", "https?://(?:www\\.)?sendspace\\.com/[^\\s]+", "https?://(?:www\\.)?4shared\\.com/[^\\s]+", "https?://(?:www\\.)?zippyshare\\.com/[^\\s]+", "https?://(?:www\\.)?uploadfiles\\.io/[^\\s]+", "https?://(?:www\\.)?box\\.com/[^\\s]+", "https?://(?:www\\.)?pcloud\\.com/[^\\s]+", "https?://disk\\.yandex\\.[a-z]+/[^\\s]+", "https?://[^\\s]*(?:file|upload|share|download|drive|storage)[^\\s]*\\.[a-z]{2,4}/[^\\s]+"]'
3359
```
60+
- 📖 More information: [Configuration Reference - JobConfiguration](https://ktoolbox.readthedocs.io/latest/configuration/reference/#ktoolbox.configuration.JobConfiguration)
3461
35-
[//]: # (### 🪲 Fix)
62+
### 🪲 Fix
63+
64+
- Removed the deprecated configuration `job.post_structure.content_filepath`; use `job.post_structure.content` instead
65+
- Fixed the `sync-creator` command not handling 404 responses when **fetching post revisions** (i.e. the post has no revisions), which caused **slow task creation** - #294
66+
- Fixed **duplicate cookies** in DDoS Guard cookie management (manual management is no longer performed)
3667
3768
## Upgrade
3869
@@ -41,4 +72,4 @@ Use this command to upgrade if you are using **pipx**:
4172
pipx upgrade ktoolbox
4273
```
4374

44-
**Full Changelog**: https://github.com/Ljzd-PRO/KToolBox/compare/v0.15.1...v0.16.0
75+
**Full Changelog**: https://github.com/Ljzd-PRO/KToolBox/compare/v0.16.0...v0.17.0

docs/en/configuration/guide.md

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,15 @@
2828
## `.env` / `prod.env` file example
2929

3030
```dotenv
31+
##############################################################################
32+
# It is recommended to use the graphical configuration editor for editing. #
33+
# Run `ktoolbox config-editor` to launch it. #
34+
##############################################################################
35+
36+
# (Optional) Session key that can be found in cookies after a successful login
37+
# Use this when requests fail with a 403 error
38+
#KTOOLBOX_API__SESSION_KEY=xxxxx
39+
3140
# Download 10 files at the same time.
3241
KTOOLBOX_JOB__COUNT=10
3342
@@ -43,7 +52,7 @@ KTOOLBOX_JOB__SEQUENTIAL_FILENAME=True
4352
# For example: `{title}_{}` > `HelloWorld_b4b41de2-8736-480d-b5c3-ebf0d917561b`, etc.
4453
# You can also use it with `sequential_filename`. For instance,
4554
# `[{published}]_{}` > `[2024-1-1]_1.png`, `[2024-1-1]_2.png`, etc.
46-
KTOOLBOX_JOB__FILENAME_FORMAT=[{published}]_{}
55+
KTOOLBOX_JOB__FILENAME_FORMAT=[{published}]_{title}_{id}_{}
4756
4857
# Prefix the post directory name with its release/publish date, e.g. `[2024-1-1]HelloWorld`
4958
KTOOLBOX_JOB__POST_DIRNAME_FORMAT=[{published}]{title}

docs/zh/configuration/guide.md

Lines changed: 20 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -27,23 +27,32 @@
2727
## `.env` / `prod.env` file example
2828

2929
```dotenv
30-
# Download up to 10 files at the same time
30+
##############################################################################
31+
# It is recommended to use the graphical configuration editor for editing. #
32+
# Run `ktoolbox config-editor` to launch it. #
33+
##############################################################################
34+
35+
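# (Optional) Session key, which can be found in cookies after a successful login
36+
# Use this when requests fail with a 403 error
37+
#KTOOLBOX_API__SESSION_KEY=xxxxx
38+
39+
# Download 10 files at the same time.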
3140
KTOOLBOX_JOB__COUNT=10
3241
33-
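# Set the post attachments directory to `./`, meaning all attachments are saved directly in the post directory
34-
# rather than in a separate subdirectory
42+
# Set the post attachments directory path to `./`, meaning all attachment files are saved in the post directory
43+
# No separate subdirectory is created for attachments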
3544
KTOOLBOX_JOB__POST_STRUCTURE__ATTACHMENTS=./
3645
37-
# Rename attachments in numerical order, e.g. `1.png`, `2.png`, ...
46+
# Rename attachments in numerical order, e.g. `1.png`, `2.png`, ...
3847
KTOOLBOX_JOB__SEQUENTIAL_FILENAME=True
3948
40-
# Customize the filename format by inserting an empty `{}` that stands for the base filename
41-
# Similar to `post_dirname_format`, you can use some properties of the `Post` class
42-
# For example: `{title}_{}` > `HelloWorld_b4b41de2-8736-480d-b5c3-ebf0d917561b`
43-
# You can also use it together with `sequential_filename`
44-
# For example: `[{published}]_{}` > `[2024-1-1]_1.png`, `[2024-1-1]_2.png`
45-
KTOOLBOX_JOB__FILENAME_FORMAT=[{published}]_{}
49+
# Customize the filename format by inserting an empty `{}` to represent the base filename.
50+
# Similar to `post_dirname_format`, you can use some properties in `Post`.
51+
# For example: `{title}_{}` > `HelloWorld_b4b41de2-8736-480d-b5c3-ebf0d917561b`, etc.
52+
# You can also use it with `sequential_filename`. For instance,
53+
# `[{published}]_{}` > `[2024-1-1]_1.png`, `[2024-1-1]_2.png`, etc.
54+
KTOOLBOX_JOB__FILENAME_FORMAT=[{published}]_{title}_{id}_{}
4655
47-
# Start the post directory name with the publish date, e.g. `[2024-1-1]HelloWorld`
56+
# Prefix the post directory name with its publish date, e.g. `[2024-1-1]HelloWorld`
4857
KTOOLBOX_JOB__POST_DIRNAME_FORMAT=[{published}]{title}
4958
```
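The new `KTOOLBOX_JOB__FILENAME_FORMAT` value mixes named fields with an empty `{}` that stands for the base filename. One way such a template could be expanded is with Python's `str.format`, as sketched below. This is an assumption about the mechanism rather than KToolBox's confirmed implementation; `render_filename` and the sample field values are hypothetical.

```python
def render_filename(fmt: str, basename: str, **post_fields: str) -> str:
    """Hypothetical helper: the empty `{}` receives the base filename,
    while the named fields are filled from post attributes."""
    return fmt.format(basename, **post_fields)


name = render_filename(
    "[{published}]_{title}_{id}_{}",
    "1.png",                 # base filename, e.g. produced by `sequential_filename`
    published="2024-1-1",    # sample values, not real post data
    title="HelloWorld",
    id="123456",
)
print(name)  # [2024-1-1]_HelloWorld_123456_1.png
```

Including `{title}` and `{id}` in the format keeps filenames distinguishable when files from several posts end up in the same directory.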

ktoolbox/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
11
__title__ = "KToolBox"
22
# noinspection SpellCheckingInspection
33
__description__ = "A useful CLI tool for downloading posts in Kemono.cr / .su / .party"
4-
__version__ = "v0.16.0"
4+
__version__ = "v0.17.0"

ktoolbox/action/job.py

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@
1616
from ktoolbox.api.posts import get_post_revisions as get_post_revisions_api
1717
from ktoolbox.configuration import config, PostStructureConfiguration
1818
from ktoolbox.job import Job, CreatorIndices
19+
from ktoolbox.utils import extract_external_links
1920

2021
__all__ = ["create_job_from_post", "create_job_from_creator"]
2122

@@ -46,9 +47,12 @@ async def create_job_from_post(
4647
attachments_path.mkdir(exist_ok=True)
4748
content_path = post_path / post_structure.content # content
4849
content_path.parent.mkdir(exist_ok=True)
50+
external_links_path = post_path / post_structure.external_links # external_links
51+
external_links_path.parent.mkdir(exist_ok=True)
4952
else:
5053
attachments_path = post_path
5154
content_path = None
55+
external_links_path = None
5256

5357
# Filter and create jobs for ``Post.attachment``
5458
jobs: List[Job] = []
@@ -110,6 +114,16 @@ async def create_job_from_post(
110114
if content_path and post.content:
111115
async with aiofiles.open(content_path, "w", encoding=config.downloader.encoding) as f:
112116
await f.write(post.content)
117+
118+
# Extract and write external links file
119+
if config.job.extract_external_links and external_links_path and post.content:
120+
external_links = extract_external_links(post.content, config.job.external_link_patterns)
121+
if external_links:
122+
async with aiofiles.open(external_links_path, "w", encoding=config.downloader.encoding) as f:
123+
# Write each link on a separate line
124+
for link in sorted(external_links):
125+
await f.write(f"{link}\n")
126+
113127
if dump_post_data:
114128
async with aiofiles.open(str(post_path / DataStorageNameEnum.PostData.value), "w", encoding="utf-8") as f:
115129
await f.write(
@@ -194,6 +208,9 @@ async def create_job_from_creator(
194208
) as f:
195209
await f.write(indices.model_dump_json(indent=config.json_dump_indent))
196210

211+
logger.info("`job.include_revisions` is enabled and will fetch post revisions, "
212+
"which may take time. Disable if not needed.")
213+
197214
job_list: List[Job] = []
198215
for post in post_list:
199216
# Get post path
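The external-links step in `create_job_from_post` writes each extracted link on its own line, in sorted order, and only creates the file when at least one link was found. The following is a simplified, synchronous sketch of that behavior using only the standard library. The real code uses `aiofiles`, and the helper name `write_links` and the file name `external_links.txt` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_links(path: Path, links: Iterable[str], encoding: str = "utf-8") -> bool:
    """Write de-duplicated links in sorted order, one per line.

    Returns False without creating the file when there are no links.
    """
    unique = sorted(set(links))
    if not unique:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join(f"{link}\n" for link in unique), encoding=encoding)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "post" / "external_links.txt"  # file name is an assumption
    written = write_links(target, ["https://mega.nz/b", "https://mega.nz/a", "https://mega.nz/b"])
    text = target.read_text(encoding="utf-8")
    skipped = write_links(Path(tmp) / "empty" / "external_links.txt", [])
```

Sorting makes the output deterministic across runs, which keeps the file stable when a post is synced again.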

ktoolbox/api/base.py

Lines changed: 2 additions & 35 deletions
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,6 @@
1212
from ktoolbox._enum import RetCodeEnum
1313
from ktoolbox.configuration import config
1414
from ktoolbox.utils import BaseRet, generate_msg
15-
from ktoolbox.ddos_guard import DDoSGuardCookieManager, merge_cookies
1615

1716
__all__ = ["APITenacityStop", "APIRet", "BaseAPI"]
1817

@@ -69,30 +68,9 @@ class BaseAPI(ABC, Generic[_T]):
6968
path: str = "/"
7069
method: Literal["get", "post"]
7170
extra_validator: Optional[Callable[[str], BaseModel]] = None
72-
73-
# Initialize DDoS Guard cookie manager
74-
_ddos_cookie_manager: Optional[DDoSGuardCookieManager] = None
75-
76-
@classmethod
77-
def _get_ddos_cookie_manager(cls) -> DDoSGuardCookieManager:
78-
"""Get or create DDoS Guard cookie manager"""
79-
if cls._ddos_cookie_manager is None:
80-
cls._ddos_cookie_manager = DDoSGuardCookieManager()
81-
logger.debug("Initialized DDoS Guard cookie manager")
82-
return cls._ddos_cookie_manager
83-
84-
@classmethod
85-
def _build_cookies(cls) -> Optional[dict]:
86-
"""Build cookies including session and DDoS Guard cookies"""
87-
session_cookies = {"session": config.api.session_key} if config.api.session_key else None
88-
ddos_manager = cls._get_ddos_cookie_manager()
89-
ddos_cookies = ddos_manager.cookies
90-
91-
return merge_cookies(session_cookies, ddos_cookies)
92-
9371
client = httpx.AsyncClient(
9472
verify=config.ssl_verify,
95-
cookies=None # We'll set cookies dynamically
73+
cookies={"session": config.api.session_key} if config.api.session_key else None
9674
)
9775

9876
Response = BaseModel
@@ -113,7 +91,7 @@ def handle_res(cls, res: httpx.Response) -> APIRet[_T]:
11391
message=str(e),
11492
exception=e
11593
)
116-
elif isinstance(e, ValidationError):
94+
else:
11795
return APIRet(
11896
code=RetCodeEnum.ValidationError,
11997
message=str(e),
@@ -135,12 +113,6 @@ async def request(cls, path: str = None, **kwargs) -> APIRet[_T]:
135113
path = cls.path
136114
url_parts = [config.api.scheme, config.api.netloc, f"{config.api.path}{path}", '', '', '']
137115
url = str(urlunparse(url_parts))
138-
139-
# Build cookies dynamically to include any updates
140-
cookies = cls._build_cookies()
141-
if cookies:
142-
kwargs.setdefault('cookies', {}).update(cookies)
143-
144116
try:
145117
res = await cls.client.request(
146118
method=cls.method,
@@ -149,11 +121,6 @@ async def request(cls, path: str = None, **kwargs) -> APIRet[_T]:
149121
follow_redirects=True,
150122
**kwargs
151123
)
152-
153-
# Update DDoS Guard cookies from response
154-
ddos_manager = cls._get_ddos_cookie_manager()
155-
ddos_manager.update_from_response(res)
156-
157124
except Exception as e:
158125
return APIRet(
159126
code=RetCodeEnum.NetWorkError,

ktoolbox/api/posts/get_post_revisions.py

Lines changed: 12 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,11 +1,15 @@
1-
from pydantic import BaseModel, RootModel
2-
from typing import List
1+
from typing import List, TypeVar
2+
3+
import httpx
4+
from pydantic import RootModel
35

46
from ktoolbox.api import BaseAPI, APIRet
57
from ktoolbox.api.model import Revision
68

79
__all__ = ["GetPostRevisions", "get_post_revisions"]
810

11+
_T = TypeVar('_T')
12+
913

1014
class GetPostRevisions(BaseAPI):
1115
path = "/{service}/user/{creator_id}/post/{post_id}/revisions"
@@ -28,8 +32,12 @@ async def __call__(cls, service: str, creator_id: str, post_id: str) -> APIRet[R
2832
creator_id=creator_id,
2933
post_id=post_id
3034
)
31-
35+
3236
return await cls.request(path=path)
3337

38+
@classmethod
39+
def handle_res(cls, res: httpx.Response) -> APIRet[_T]:
40+
return APIRet(data=[]) if res.status_code == 404 else super().handle_res(res)
41+
3442

35-
get_post_revisions = GetPostRevisions.__call__
43+
get_post_revisions = GetPostRevisions.__call__
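The `handle_res` override above treats a 404 as "this post has no revisions" and falls back to the base handling for everything else. The pattern can be shown in isolation with stand-in classes. In this sketch `FakeResponse`, `BaseHandler`, and `RevisionsHandler` are hypothetical and do not mirror the real `BaseAPI` or `APIRet` types.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class FakeResponse:
    """Stand-in for an HTTP response with only the fields this sketch needs."""
    status_code: int
    payload: List[Any] = field(default_factory=list)


class BaseHandler:
    @classmethod
    def handle_res(cls, res: FakeResponse) -> List[Any]:
        # Generic handling: any status other than 200 is treated as an error.
        if res.status_code != 200:
            raise RuntimeError(f"unexpected status {res.status_code}")
        return res.payload


class RevisionsHandler(BaseHandler):
    @classmethod
    def handle_res(cls, res: FakeResponse) -> List[Any]:
        # A 404 here means "no revisions", so return an empty list instead of failing.
        return [] if res.status_code == 404 else super().handle_res(res)


no_revisions = RevisionsHandler.handle_res(FakeResponse(404))
some_revisions = RevisionsHandler.handle_res(FakeResponse(200, [{"revision_id": 1}]))
```

Keeping the special case in the subclass leaves the generic error handling untouched for every other endpoint.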

ktoolbox/cli.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,7 @@
1212
from ktoolbox.action import create_job_from_post, create_job_from_creator, generate_post_path_name
1313
from ktoolbox.action import search_creator as search_creator_action, search_creator_post as search_creator_post_action
1414
from ktoolbox.api.misc import get_app_version
15-
from ktoolbox.api.posts import get_post as get_post_api, get_post_revisions as get_post_revisions_api
15+
from ktoolbox.api.posts import get_post as get_post_api
1616
from ktoolbox.configuration import config
1717
from ktoolbox.job import JobRunner
1818
from ktoolbox.utils import dump_search, parse_webpage_url, generate_msg
@@ -228,7 +228,7 @@ async def download_post(
228228

229229
for revision_order, revision_data in ret.data.props.revisions:
230230
if revision_data.revision_id: # Only process actual revisions, not the main post
231-
revision_path = post_path / "revision" / str(revision_data.revision_id)
231+
revision_path = post_path / config.job.post_structure.revisions / generate_post_path_name(revision_data)
232232
revision_jobs = await create_job_from_post(
233233
post=revision_data,
234234
post_path=revision_path,
