
Commit 099eef6

Merge pull request #322 from Ljzd-PRO/devel
Bump to v0.20.0
2 parents: df65915 + bfa7953


12 files changed: +206 -111 lines changed


CHANGELOG.md

Lines changed: 69 additions & 13 deletions
````diff
@@ -1,28 +1,84 @@
 ## Changes
 
-![Downloads](https://img.shields.io/github/downloads/Ljzd-PRO/KToolBox/v0.19.2/total)
+![Downloads](https://img.shields.io/github/downloads/Ljzd-PRO/KToolBox/v0.20.0/total)
 
-### 💡 Feature
+### ✨ Features
 
-> - Improved the error log format to make it easier to read and understand (v0.19.1)
+- Added options to control whether to extract `content` and `external_links` for greater flexibility - #317
+  - Related configuration items:
+    - `job.extract_content`: Whether to extract post text content as a separate file, default is disabled (`False`)
+    - `job.extract_external_links`: Whether to extract external links in post text content as a separate file, default is disabled (`False`)
+  - You can edit these settings via `ktoolbox config-editor` (`Job -> ...`)
+  - Or manually edit them in the `.env` file or environment variables
+    ```dotenv
+    # Whether to extract post text content as a separate file
+    KTOOLBOX_JOB__EXTRACT_CONTENT=True
+    # Whether to extract external links in post text content as a separate file
+    KTOOLBOX_JOB__EXTRACT_EXTERNAL_LINKS=True
 
-### 🪲 Fix
+    # Change the default file names for content.txt and external_links.txt
+    KTOOLBOX_JOB__POST_STRUCTURE__CONTENT="content.html"
+    KTOOLBOX_JOB__POST_STRUCTURE__EXTERNAL_LINKS="link.txt"
+    ```
+  - 📖 More info: [Configuration Reference - JobConfiguration](https://ktoolbox.readthedocs.io/latest/configuration/reference/#ktoolbox.configuration.JobConfiguration)
+- Support controlling whether to preserve the metadata (such as modification date) of downloaded files - #321
+  - If you usually browse images by download date, or want to use post images as Windows folder preview covers, you can disable this option
+  - Related configuration item:
+    - `downloader.keep_metadata`: Whether to preserve the metadata (such as modification date) of downloaded files, enabled by default (`True`)
+  - You can edit this setting via `ktoolbox config-editor` (`Downloader -> keep_metadata`)
+  - Or manually edit it in the `.env` file or environment variables
+    ```dotenv
+    # Whether to preserve the metadata (such as modification date) of downloaded files
+    KTOOLBOX_DOWNLOADER__KEEP_METADATA=False
+    ```
+  - 📖 More info: [Configuration Reference - DownloaderConfiguration](https://ktoolbox.readthedocs.io/latest/configuration/reference/#ktoolbox.configuration.DownloaderConfiguration)
 
-- Fixed the issue where **`content`** data of works could not be obtained due to **Kemono API changes**, resulting in missing **`content.txt`** and `external_links.txt` - #316
-  > - Fixed the issue where author information and work data could not be retrieved due to **Kemono API changes** - #315 (v0.19.1)
-  > - Error messages: `Kemono API call failed: ...`, `404 Not Found`, `403 Forbidden`, ...
+### 🪲 Fixes
+
+- Due to changes in the Kemono API, extraction of **post text content and external links** (content and external_links) can now only be performed one post at a time.
+  Therefore, post text content and external links are extracted **only when** the default-disabled `job.extract_content` and `job.extract_external_links` are set to `True` (as mentioned above)
+  and the post **actually contains text content**, to avoid frequent API calls that may trigger **server DDoS protection**.
+- Output `SUCCESS` level logs to help users better understand download status
 
 - - -
 
-### 💡 New Features
+### New Features
 
-> - Improved the error log format to make it easier to read and understand (v0.19.1)
+- Support controlling whether to extract content and external_links for greater flexibility - #317
+  - Related configuration items:
+    - `job.extract_content`: Whether to extract post text content as a separate file, disabled by default (`False`)
+    - `job.extract_external_links`: Whether to extract external links in post text content as a separate file, disabled by default (`False`)
+  - You can edit these settings by running `ktoolbox config-editor` (`Job -> ...`)
+  - Or edit them manually in the `.env` file or environment variables
+    ```dotenv
+    # Whether to extract post text content as a separate file
+    KTOOLBOX_JOB__EXTRACT_CONTENT=True
+    # Whether to extract external links in post text content as a separate file
+    KTOOLBOX_JOB__EXTRACT_EXTERNAL_LINKS=True
+
+    # Change the default content.txt and external_links.txt file names
+    KTOOLBOX_JOB__POST_STRUCTURE__CONTENT="content.html"
+    KTOOLBOX_JOB__POST_STRUCTURE__EXTERNAL_LINKS="link.txt"
+    ```
+  - 📖 More info: [Configuration Reference - JobConfiguration](https://ktoolbox.readthedocs.io/latest/zh/configuration/reference/#ktoolbox._configuration_zh.JobConfiguration)
+- Support controlling whether to preserve the metadata (such as modification date) of downloaded files - #321
+  - If you usually browse images sorted by download date, or want to use post images as Windows folder preview covers, you can disable this option
+  - Related configuration item:
+    - `downloader.keep_metadata`: Whether to preserve the metadata (such as modification date) of downloaded files, enabled by default (`True`)
+  - You can edit this setting by running `ktoolbox config-editor` (`Downloader -> keep_metadata`)
+  - Or edit it manually in the `.env` file or environment variables
+    ```dotenv
+    # Whether to preserve the metadata (such as modification date) of downloaded files
+    KTOOLBOX_DOWNLOADER__KEEP_METADATA=False
+    ```
+  - 📖 More info: [Configuration Reference - DownloaderConfiguration](https://ktoolbox.readthedocs.io/latest/zh/configuration/reference/#ktoolbox._configuration_zh.DownloaderConfiguration)
 
 ### 🪲 Fixes
 
-- Fixed the issue where **`content`** data of posts could not be obtained due to **Kemono API changes**, resulting in missing **`content.txt`** and **`external_links.txt`** - #316
-  > - Fixed the issue where creator information and post data could not be retrieved due to Kemono **API changes** - #315 (v0.19.1)
-  > - Error messages: `Kemono API call failed: ...`, `404 Not Found`, `403 Forbidden`, ...
+- Due to changes in the Kemono API, the extraction of **post text content and external links** (content and external_links) can now only be performed one post at a time,
+  so post text content and external links are extracted **only when** the default-disabled `job.extract_content` and `job.extract_external_links` are set to `True` (as mentioned in the features above)
+  and the post **actually contains text content**, to avoid frequent API calls that may trigger the **server's DDoS protection**
+- Output `SUCCESS` level logs so users can understand the download status more clearly
 
 ## Upgrade
 
@@ -31,4 +87,4 @@ Use this command to upgrade if you are using **pipx**:
 pipx upgrade ktoolbox
 ```
 
-**Full Changelog**: https://github.com/Ljzd-PRO/KToolBox/compare/v0.19.1...v0.19.2
+**Full Changelog**: https://github.com/Ljzd-PRO/KToolBox/compare/v0.19.2...v0.20.0
````
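The `downloader.keep_metadata` option above preserves file metadata such as the modification date. The downloader implementation is not part of this diff, so the following is only a minimal sketch of the general idea: after writing a file, set its timestamps from the server's `Last-Modified` header. The file name and header value here are invented for the demo and are not KToolBox's actual code.

```python
import os
import time
from email.utils import parsedate_to_datetime


def apply_last_modified(path: str, last_modified: str) -> None:
    """Set a file's access/modification time from an HTTP Last-Modified header."""
    mtime = parsedate_to_datetime(last_modified).timestamp()
    os.utime(path, (mtime, mtime))


# Hypothetical usage after a download finishes
with open("demo.bin", "wb") as f:
    f.write(b"data")
apply_last_modified("demo.bin", "Wed, 01 May 2024 10:00:00 GMT")
print(time.gmtime(os.path.getmtime("demo.bin")).tm_year)  # 2024
```

Disabling `keep_metadata` simply skips this step, leaving the download time as the file's modification date.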

ktoolbox/__init__.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,4 +1,4 @@
 __title__ = "KToolBox"
 # noinspection SpellCheckingInspection
 __description__ = "A useful CLI tool for downloading posts in Kemono.cr / .su / .party"
-__version__ = "v0.19.2"
+__version__ = "v0.20.0"
```

ktoolbox/_configuration_zh.py

Lines changed: 9 additions & 4 deletions
```diff
@@ -25,7 +25,7 @@ class DownloaderConfiguration(ktoolbox.configuration.DownloaderConfiguration):
 
     :ivar scheme: Downloader URL scheme
     :ivar timeout: Downloader request timeout
-    :ivar encoding: Character set used for filename parsing and for saving post content text
+    :ivar encoding: Character set used for filename parsing and for saving post ``content`` and ``external_links``
     :ivar buffer_size: File I/O buffer size in bytes for each downloaded file
     :ivar chunk_size: Chunk size in bytes of the downloader stream
     :ivar temp_suffix: Temporary filename suffix of downloading files
@@ -36,6 +36,7 @@ class DownloaderConfiguration(ktoolbox.configuration.DownloaderConfiguration):
     :ivar use_bucket: Enable local bucket mode
     :ivar bucket_path: Local bucket path
     :ivar reverse_proxy: Reverse proxy format for download URLs. Customize the format by inserting an empty ``{}`` representing the original URL. For example: ``https://example.com/{}`` becomes ``https://example.com/https://n1.kemono.su/data/66/83/xxxxx.jpg``; ``https://example.com/?url={}`` becomes ``https://example.com/?url=https://n1.kemono.su/data/66/83/xxxxx.jpg``
+    :ivar keep_metadata: Preserve file metadata (e.g. last modified time) when downloading files
     """
     ...
 
@@ -118,14 +119,15 @@ class JobConfiguration(ktoolbox.configuration.JobConfiguration):
     :ivar filename_format: Customize the filename format by inserting an empty ``{}`` to represent the basic filename. [Properties][ktoolbox.configuration.JobConfiguration] can be used. For example: ``{title}_{}`` may generate ``TheTitle_b4b41de2-8736-480d-b5c3-ebf0d917561b``, ``TheTitle_af349b25-ac08-46d7-98fb-6ce99a237b90``, etc. Can also be combined with ``sequential_filename``, e.g. ``[{published}]_{}`` may generate ``[2024-1-1]_1.png``, ``[2024-1-1]_2.png``, etc.
     :ivar allow_list: Download only files matching these patterns (Unix shell style), e.g. ``["*.png"]``
     :ivar block_list: Skip files matching these patterns (Unix shell style), e.g. ``["*.psd","*.zip"]``
-    :ivar extract_external_links: Extract external file-sharing links from post content and save them to a separate file
+    :ivar extract_content: Extract post content and save it to a separate file (filename defined by ``config.job.post_structure.content``)
+    :ivar extract_external_links: Extract external file-sharing links from post content and save them to a separate file (filename defined by ``config.job.post_structure.external_links``)
     :ivar external_link_patterns: Regular expression patterns used to extract external links
     :ivar group_by_year: Group posts into different directories by publish year
     :ivar group_by_month: Group posts into different directories by publish month (requires group_by_year to be enabled)
     :ivar year_dirname_format: Customize the year directory name format. Available property: ``year``. For example: ``{year}`` > ``2024``, ``Year_{year}`` > ``Year_2024``
     :ivar month_dirname_format: Customize the month directory name format. Available properties: ``year``, ``month``. For example: ``{year}-{month}`` > ``2024-01``, ``{year}_{month}`` > ``2024_01``
     """
-    ...
+    post_structure: PostStructureConfiguration = PostStructureConfiguration()
 
 
 class LoggerConfiguration(ktoolbox.configuration.LoggerConfiguration):
@@ -155,4 +157,7 @@ class Configuration(ktoolbox.configuration.Configuration):
     Install winloop on Windows: `pip install ktoolbox[winloop]` \
     Install uvloop on Unix: `pip install ktoolbox[uvloop]`
     """
-    ...
+    api: APIConfiguration = APIConfiguration()
+    downloader: DownloaderConfiguration = DownloaderConfiguration()
+    job: JobConfiguration = JobConfiguration()
+    logger: LoggerConfiguration = LoggerConfiguration()
```
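The `KTOOLBOX_JOB__EXTRACT_CONTENT`-style variables shown in the changelog map onto these nested configuration classes via a `KTOOLBOX_` prefix and a `__` delimiter between section and field. KToolBox's configuration layer (built on pydantic models) handles that mapping itself; the stdlib sketch below only illustrates the naming convention, not the real implementation.

```python
def parse_nested_env(environ: dict, prefix: str = "KTOOLBOX_", delimiter: str = "__") -> dict:
    """Fold PREFIX_SECTION__FIELD=value pairs into a nested dict."""
    config: dict = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue  # unrelated environment variable
        path = key[len(prefix):].lower().split(delimiter)
        node = config
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return config


env = {
    "KTOOLBOX_JOB__EXTRACT_CONTENT": "True",
    "KTOOLBOX_DOWNLOADER__KEEP_METADATA": "False",
    "HOME": "/home/user",  # ignored: no KTOOLBOX_ prefix
}
print(parse_nested_env(env))
# {'job': {'extract_content': 'True'}, 'downloader': {'keep_metadata': 'False'}}
```

So `KTOOLBOX_JOB__EXTRACT_CONTENT` targets `job.extract_content`, and `KTOOLBOX_DOWNLOADER__KEEP_METADATA` targets `downloader.keep_metadata`.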

ktoolbox/action/job.py

Lines changed: 73 additions & 52 deletions
```diff
@@ -15,9 +15,9 @@
     filter_posts_by_keywords, filter_posts_by_keywords_exclude, generate_grouped_post_path
 from ktoolbox.api.model import Post, Attachment, Revision
 from ktoolbox.api.posts import get_post_revisions as get_post_revisions_api, get_post as get_post_api
-from ktoolbox.configuration import config, PostStructureConfiguration
+from ktoolbox.configuration import config
 from ktoolbox.job import Job, CreatorIndices
-from ktoolbox.utils import extract_external_links
+from ktoolbox.utils import extract_external_links, generate_msg
 
 __all__ = ["create_job_from_post", "create_job_from_creator"]
 
@@ -26,35 +26,39 @@ async def create_job_from_post(
         post: Union[Post, Revision],
         post_path: Path,
         *,
-        post_structure: Union[PostStructureConfiguration, bool] = None,
+        post_dir: bool = True,
         dump_post_data: bool = True
 ) -> List[Job]:
     """
     Create a list of download job from a post data
 
     :param post: post data
     :param post_path: Path of the post directory, which needs to be sanitized
-    :param post_structure: post path structure, ``False`` -> disable, \
-        ``True`` & ``None`` -> ``config.job.post_structure``
+    :param post_dir: Whether to create post directory
     :param dump_post_data: Whether to dump post data (post.json) in post directory
+    :raise FetchInterruptError: If fetching post content fails
     """
     post_path.mkdir(parents=True, exist_ok=True)
 
     # Load ``PostStructureConfiguration``
-    if post_structure in [True, None]:
-        post_structure = config.job.post_structure
-    if post_structure:
-        attachments_path = post_path / post_structure.attachments  # attachments
+    if post_dir:
+        attachments_path = post_path / config.job.post_structure.attachments  # attachments
         attachments_path.mkdir(exist_ok=True)
-        content_path = post_path / post_structure.content  # content
+        content_path = post_path / config.job.post_structure.content  # content
         content_path.parent.mkdir(exist_ok=True)
-        external_links_path = post_path / post_structure.external_links  # external_links
+        external_links_path = post_path / config.job.post_structure.external_links  # external_links
         external_links_path.parent.mkdir(exist_ok=True)
     else:
         attachments_path = post_path
         content_path = None
         external_links_path = None
 
+    if dump_post_data:
+        async with aiofiles.open(str(post_path / DataStorageNameEnum.PostData.value), "w", encoding="utf-8") as f:
+            await f.write(
+                post.model_dump_json(indent=config.json_dump_indent)
+            )
+
     # Filter and create jobs for ``Post.attachment``
     jobs: List[Job] = []
     sequential_counter = 1  # Counter for sequential filenames
@@ -120,37 +124,45 @@ async def create_job_from_post(
                 post=post
             )
         )
+    # ``post.substring`` is used to determine if the post has content, but it's only partial
+    if post.substring and post_dir and (config.job.extract_content or config.job.extract_external_links):
+        # If post has no content, fetch it from get_post API
+        if not post.content:
+            get_post_ret = await get_post_api(
+                service=post.service,
+                creator_id=post.user,
+                post_id=post.id,
+                revision_id=post.revision_id if isinstance(post, Revision) else None
+            )
+            if get_post_ret:
+                post = get_post_ret.data.post
+            else:
+                logger.error(
+                    generate_msg(
+                        "Failed to fetch post content",
+                        post_name=post.title or "Unknown",
+                        post_id=post.id,
+                        creator_id=post.user,
+                        service=post.service
+                    )
+                )
+                raise FetchInterruptError(ret=get_post_ret)
 
-    # If post has no content, fetch it from get_post API
-    if not post.content:
-        get_post_ret = await get_post_api(
-            service=post.service,
-            creator_id=post.user,
-            post_id=post.id,
-            revision_id=post.revision_id if isinstance(post, Revision) else None
-        )
-        if get_post_ret:
-            post = get_post_ret.data.post
-
-    # Write content file
-    if content_path and post.content:
-        async with aiofiles.open(content_path, "w", encoding=config.downloader.encoding) as f:
-            await f.write(post.content)
-
-    # Extract and write external links file
-    if config.job.extract_external_links and external_links_path and post.content:
-        external_links = extract_external_links(post.content, config.job.external_link_patterns)
-        if external_links:
-            async with aiofiles.open(external_links_path, "w", encoding=config.downloader.encoding) as f:
-                # Write each link on a separate line
-                for link in sorted(external_links):
-                    await f.write(f"{link}\n")
+        # If post content is still empty, skip content extraction
+        if post.content:
+            # Write content file
+            if config.job.extract_content:
+                async with aiofiles.open(content_path, "w", encoding=config.downloader.encoding) as f:
+                    await f.write(post.content)
 
-    if dump_post_data:
-        async with aiofiles.open(str(post_path / DataStorageNameEnum.PostData.value), "w", encoding="utf-8") as f:
-            await f.write(
-                post.model_dump_json(indent=config.json_dump_indent)
-            )
+            # Extract and write external links file
+            if config.job.extract_external_links:
+                external_links = extract_external_links(post.content, config.job.external_link_patterns)
+                if external_links:
+                    async with aiofiles.open(external_links_path, "w", encoding=config.downloader.encoding) as f:
+                        # Write each link on a separate line
+                        for link in sorted(external_links):
+                            await f.write(f"{link}\n")
 
     return jobs
 
@@ -250,7 +262,10 @@ async def create_job_from_creator(
         await f.write(indices.model_dump_json(indent=config.json_dump_indent))
 
     if config.job.include_revisions:
-        logger.info("`job.include_revisions` is enabled and will fetch post revisions, "
+        logger.warning("`job.include_revisions` is enabled and will fetch post revisions, "
+                       "which may take time. Disable if not needed.")
+    if config.job.extract_content or config.job.extract_external_links:
+        logger.warning("`job.extract_content` or `job.extract_external_links` is enabled and will fetch post content one by one, "
                        "which may take time. Disable if not needed.")
 
     job_list: List[Job] = []
@@ -264,12 +279,15 @@ async def create_job_from_creator(
         post_path = grouped_base_path / generate_post_path_name(post)
 
         # Generate jobs for the main post
-        job_list += await create_job_from_post(
-            post=post,
-            post_path=post_path,
-            post_structure=False if mix_posts else None,
-            dump_post_data=not mix_posts
-        )
+        try:
+            job_list += await create_job_from_post(
+                post=post,
+                post_path=post_path,
+                post_dir=not mix_posts,
+                dump_post_data=not mix_posts
+            )
+        except FetchInterruptError as e:
+            return ActionRet(**e.ret.model_dump(mode="python"))
 
         # If include_revisions is enabled, fetch and download revisions for this post
         if config.job.include_revisions and not mix_posts:
@@ -284,11 +302,14 @@ async def create_job_from_creator(
                     if revision.revision_id:  # Only process actual revisions
                         revision_path = post_path / config.job.post_structure.revisions / generate_post_path_name(
                             revision)
-                        revision_jobs = await create_job_from_post(
-                            post=revision,
-                            post_path=revision_path,
-                            dump_post_data=True
-                        )
+                        try:
+                            revision_jobs = await create_job_from_post(
+                                post=revision,
+                                post_path=revision_path,
+                                dump_post_data=True
+                            )
+                        except FetchInterruptError as e:
+                            return ActionRet(**e.ret.model_dump(mode="python"))
                         job_list += revision_jobs
             except Exception as e:
                 logger.warning(f"Failed to fetch revisions for post {post.id}: {e}")
```
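The new warning in `create_job_from_creator` exists because content extraction now issues one `get_post` call per post. The underlying pattern, awaiting requests strictly one at a time instead of firing a concurrent burst, is what keeps the request rate low enough to avoid tripping the server's DDoS protection. A generic sketch of that pattern; the `fetch` callable and `delay` value are placeholders, not KToolBox's actual API:

```python
import asyncio
from typing import Awaitable, Callable, List


async def fetch_one_by_one(post_ids: List[str],
                           fetch: Callable[[str], Awaitable[str]],
                           delay: float = 0.5) -> List[str]:
    """Fetch posts strictly one at a time, pausing between calls."""
    results = []
    for post_id in post_ids:
        results.append(await fetch(post_id))  # no concurrency: one request in flight
        await asyncio.sleep(delay)
    return results


async def fake_fetch(post_id: str) -> str:
    """Stand-in for a real get_post API call."""
    return f"content-{post_id}"


print(asyncio.run(fetch_one_by_one(["a1", "b2"], fake_fetch, delay=0)))
# ['content-a1', 'content-b2']
```

The trade-off is the one the warning describes: sequential fetching is slower, which is why both extraction options default to disabled.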

ktoolbox/api/model/post.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -24,6 +24,7 @@ class Post(BaseModel):
     service: Optional[str] = None
     title: Optional[str] = None
     content: Optional[str] = None
+    substring: Optional[str] = None
     embed: Optional[Dict[str, Any]] = None
     shared_file: Optional[bool] = None
     added: Optional[datetime] = None
```
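The new `substring` field lets list endpoints advertise that a post has text without shipping the full `content`. A simplified stand-in (a dataclass instead of the real pydantic `BaseModel`, and without the `post_dir` check) showing how `create_job_from_post` uses it to decide whether a full `get_post` fetch is needed:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PostStub:
    """Simplified stand-in for ktoolbox.api.model.Post."""
    id: str
    content: Optional[str] = None    # full text; often absent in list responses
    substring: Optional[str] = None  # partial text; signals the post has content


def needs_full_fetch(post: PostStub, extract_enabled: bool) -> bool:
    """Only call the get_post API when the post advertises text and extraction is on."""
    return bool(post.substring) and extract_enabled and not post.content


print(needs_full_fetch(PostStub(id="1", substring="Hello"), True))   # True
print(needs_full_fetch(PostStub(id="2"), True))                      # False
print(needs_full_fetch(PostStub(id="3", substring="Hi"), False))     # False
```

Because `substring` is only partial, the full `content` must still be fetched per post when extraction is enabled, which is exactly what the changed `create_job_from_post` does.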
