Ljzd-PRO
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 32 additions & 8 deletions
diff --git a/ktoolbox/__init__.py b/ktoolbox/__init__.py
Lines changed: 1 addition & 1 deletion
diff --git a/ktoolbox/action/job.py b/ktoolbox/action/job.py
Lines changed: 70 additions & 6 deletions
diff --git a/ktoolbox/action/utils.py b/ktoolbox/action/utils.py
Lines changed: 46 additions & 10 deletions
diff --git a/ktoolbox/configuration.py b/ktoolbox/configuration.py
Lines changed: 2 additions & 0 deletions
@@ -1,10 +1,10 @@
 ## Changes
 
-![Downloads](https://img.shields.io/github/downloads/Ljzd-PRO/KToolBox/v0.21.0/total)
+![Downloads](https://img.shields.io/github/downloads/Ljzd-PRO/KToolBox/v0.21.1/total)
 
 ### ✨ Features
 
-- Improved **download progress display**, providing a more **elegant and intuitive** progress bar
+- Improved **download progress display**, providing a more **elegant and intuitive** progress bar (v0.21.0)
     ```log
     2025-08-19 13:42:07 | INFO     | ktoolbox.cli - Got creator information - name: Ljzd-PRO, id: 12345678
 
@@ -20,7 +20,7 @@
 
     2025-08-19 13:44:01 | SUCCESS  | ktoolbox.job.runner - All jobs in queue finished
     ```
-- Automatically check for updates at program startup and notify the user if a new version is available
+- Automatically check for updates at program startup and notify the user if a new version is available (v0.21.0)
     ```log
     2025-08-19 13:41:23 | INFO     | ktoolbox.utils - Update available: 0.21.0 (current: 0.20.0)
 
@@ -29,14 +29,26 @@
 
 ### 🪲 Fixes
 
+- Fixed the feature "**support downloading images embedded in the post HTML content**", announced in [v0.17.0](https://github.com/Ljzd-PRO/KToolBox/releases/tag/v0.17.0), being missing from the released code - #332
+  - This is a serious issue: the relevant branch was apparently never merged into the main branch, so the feature was **not implemented in v0.17.0**.
+  - Related configuration:
+    - `job.extract_content_images`: Whether to parse and download images embedded in the post HTML content, disabled by default (`False`)
+  - This feature is disabled by default because the `sync-creator` command must fetch the content of each post individually when downloading all posts from a creator, which can easily trigger DDoS protection and get the client blocked.
+  - You can edit this configuration via `ktoolbox config-editor` (`Job -> extract_content_images`)
+  - Or manually edit it in the `.env` file or environment variables:
+    ```dotenv
+    # Enable parsing and downloading images embedded in the post HTML content
+    KTOOLBOX_JOB__EXTRACT_CONTENT_IMAGES=True
+    ```
+  - 📖 More info: [Configuration Reference - JobConfiguration](https://ktoolbox.readthedocs.io/latest/configuration/reference/#ktoolbox.configuration.JobConfiguration)
 - Fixed the issue where the `download-post` command would **still not generate the text content file** (`content.txt`) 
-and external links file (`external_links.txt`) even when the `job.extract_content` and `job.extract_external_links` options were enabled - #332
+and external links file (`external_links.txt`) even when the `job.extract_content` and `job.extract_external_links` options were enabled - #332 (v0.21.0)
 
 - - -
 
 ### ✨ 新特性
 
-- 改进了**下载进度显示**，提供更加**优美和直观**的进度条显示
+- 改进了**下载进度显示**，提供更加**优美和直观**的进度条显示 (v0.21.0)
     ```log
     2025-08-19 13:42:07 | INFO     | ktoolbox.cli - Got creator information - name: Ljzd-PRO, id: 12345678
     
@@ -52,7 +64,7 @@ and external links file (`external_links.txt`) even when the `job.extract_conten
     
     2025-08-19 13:44:01 | SUCCESS  | ktoolbox.job.runner - All jobs in queue finished
     ```
-- 程序启动时自动检查更新，并在有新版本时提示用户
+- 程序启动时自动检查更新，并在有新版本时提示用户 (v0.21.0)
     ```log
     2025-08-19 13:41:23 | INFO     | ktoolbox.utils - Update available: 0.21.0 (current: 0.20.0)
     
@@ -61,8 +73,20 @@ and external links file (`external_links.txt`) even when the `job.extract_conten
 
 ### 🪲 修复
 
+- 修复了 [v0.17.0](https://github.com/Ljzd-PRO/KToolBox/releases/tag/v0.17.0) 新增的“**支持下载帖子 HTML 内容中嵌入的图片**”实际上并不存在的问题 - #332
+  - 这确实是个严重的问题，似乎相关分支没有被合并到主分支，导致该功能**在 v0.17.0 版本中并未实现**
+  - 相关配置项：
+    - `job.extract_content_images`：是否解析并下载帖子 HTML 内容中嵌入的图片，默认关闭（`False`）
+  - 该功能默认关闭，因为当使用 `sync-creator` 命令下载作者全部帖子时，只能逐个获取帖子内容（content），这容易导致触发 DDoS 防御机制而被阻断
+  - 可通过运行 `ktoolbox config-editor` 编辑这些配置（`Job -> extract_content_images`）
+  - 或手动在 `.env` 文件或环境变量中编辑：
+    ```dotenv
+    # 开启解析并下载帖子 HTML 内容中嵌入的图片
+    KTOOLBOX_JOB__EXTRACT_CONTENT_IMAGES=True
+    ```
+  - 📖更多信息：[配置参考-JobConfiguration](https://ktoolbox.readthedocs.io/latest/zh/configuration/reference/#ktoolbox._configuration_zh.JobConfiguration)
 - 修复即使启用了 `job.extract_content` 和 `job.extract_external_links` 配置项，`download-post` 命令
-仍然**不会生成文本内容文件**（`content.txt`）和外部链接文件（`external_links.txt`）的问题 - #332
+仍然**不会生成文本内容文件**（`content.txt`）和外部链接文件（`external_links.txt`）的问题 - #332 (v0.21.0)
 
 ## Upgrade
 
@@ -71,4 +95,4 @@ Use this command to upgrade if you are using **pipx**:
 pipx upgrade ktoolbox
 ```
 
-**Full Changelog**: https://github.com/Ljzd-PRO/KToolBox/compare/v0.20.0...v0.21.0
+**Full Changelog**: https://github.com/Ljzd-PRO/KToolBox/compare/v0.21.0...v0.21.1
@@ -1,4 +1,4 @@
 __title__ = "KToolBox"
 # noinspection SpellCheckingInspection
 __description__ = "A useful CLI tool for downloading posts in Kemono.cr / .su / .party"
-__version__ = "v0.21.0"
+__version__ = "v0.21.1"
@@ -12,7 +12,7 @@
 from ktoolbox._enum import PostFileTypeEnum, DataStorageNameEnum
 from ktoolbox.action import ActionRet, fetch_creator_posts, FetchInterruptError
 from ktoolbox.action.utils import generate_post_path_name, filter_posts_by_date, generate_filename, \
-    filter_posts_by_keywords, filter_posts_by_keywords_exclude, generate_grouped_post_path
+    filter_posts_by_keywords, filter_posts_by_keywords_exclude, generate_grouped_post_path, extract_content_images
 from ktoolbox.api.model import Post, Attachment, Revision
 from ktoolbox.api.posts import get_post_revisions as get_post_revisions_api, get_post as get_post_api
 from ktoolbox.configuration import config
@@ -125,7 +125,9 @@ async def create_job_from_post(
                 )
             )
     # ``post.substring`` is used to determine if the post has content, but it's only partial
-    if (post.content or post.substring) and post_dir and (config.job.extract_content or config.job.extract_external_links):
+    if (post.content or post.substring) and post_dir and (
+            config.job.extract_content or config.job.extract_external_links or config.job.extract_content_images
+    ):
         # If post has no content, fetch it from get_post API
         if not post.content:
             get_post_ret = await get_post_api(
@@ -164,6 +166,67 @@ async def create_job_from_post(
                         for link in sorted(external_links):
                             await f.write(f"{link}\n")
 
+            # Extract content images
+            if config.job.extract_content_images:
+                content_image_sources = extract_content_images(post.content)
+                for image_src in content_image_sources:
+                    if not image_src or not image_src.strip():
+                        continue
+
+                    # Reduce each image source to a server path
+                    # noinspection HttpUrlsUsage
+                    if image_src.startswith('/') and not image_src.startswith('//'):
+                        # Root-relative path - already a server path, use it as-is
+                        image_path = image_src
+                    elif image_src.startswith('http://') or image_src.startswith('https://'):
+                        # Absolute URL - keep only its path component
+                        image_path = urlparse(image_src).path
+                    else:
+                        # Skip data URLs, protocol-relative URLs, or other non-path sources
+                        continue
+
+                    if not image_path or not image_path.strip():
+                        continue
+
+                    # Generate filename from the image path
+                    image_file_path = Path(image_path)
+
+                    # Apply "allow/block list" filtering first (before incrementing counter)
+                    if config.job.sequential_filename:
+                        basic_filename = f"{sequential_counter + 1}{image_file_path.suffix}"
+                    else:
+                        basic_filename = image_file_path.name
+
+                    alt_filename = generate_filename(post, basic_filename, config.job.filename_format)
+
+                    if (not config.job.allow_list or any(
+                            map(
+                                lambda x: fnmatch(alt_filename, x),
+                                config.job.allow_list
+                            )
+                    )) and not any(
+                        map(
+                            lambda x: fnmatch(alt_filename, x),
+                            config.job.block_list
+                        )
+                    ):
+                        # Regenerate filename with correct counter
+                        should_use_sequential = (config.job.sequential_filename and
+                                                 image_file_path.suffix.lower() not in config.job.sequential_filename_excludes)
+                        if should_use_sequential:
+                            basic_filename = f"{sequential_counter}{image_file_path.suffix}"
+                            alt_filename = generate_filename(post, basic_filename, config.job.filename_format)
+                            sequential_counter += 1
+
+                        jobs.append(
+                            Job(
+                                path=attachments_path,
+                                alt_filename=alt_filename,
+                                server_path=image_path,
+                                type=PostFileTypeEnum.Attachment
+                            )
+                        )
+
     return jobs
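
The branch in this hunk that turns an `<img>` source into a server path can be read in isolation. The sketch below restates that decision as a standalone function so the three cases (root-relative path, absolute URL, everything else) are easy to check. The function name `image_src_to_server_path` is hypothetical and does not exist in KToolBox; the rules themselves are copied from the diff.

```python
from typing import Optional
from urllib.parse import urlparse


def image_src_to_server_path(image_src: Optional[str]) -> Optional[str]:
    """Return the server path for an <img> src, or None if the source is skipped."""
    if not image_src or not image_src.strip():
        return None
    # Root-relative path ("/data/..."): already a server path, kept unchanged.
    if image_src.startswith("/") and not image_src.startswith("//"):
        return image_src
    # Absolute http(s) URL: keep only the path component (host and query are dropped).
    if image_src.startswith(("http://", "https://")):
        return urlparse(image_src).path or None
    # data: URLs, protocol-relative URLs ("//host/..."), and bare relative paths are skipped.
    return None


for src in ("/data/ab/cd/image.png",
            "https://example.com/data/x.jpg?f=name.jpg",
            "//cdn.example.com/x.png",
            "data:image/png;base64,AAAA"):
    print(repr(src), "->", repr(image_src_to_server_path(src)))
```

One consequence worth noting: a root-relative source is passed through verbatim, so a query string attached to it stays in the path, whereas `urlparse(...).path` strips the query from absolute URLs.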
 
 
@@ -263,10 +326,11 @@ async def create_job_from_creator(
 
     if config.job.include_revisions:
         logger.warning("`job.include_revisions` is enabled and will fetch post revisions, "
-                    "which may take time. Disable if not needed.")
-    if config.job.extract_content or config.job.extract_external_links:
-        logger.warning("`job.extract_content` or `job.extract_external_links` is enabled and will fetch post content one by one, "
-                    "which may take time. Disable if not needed.")
+                       "which may take time. Disable if not needed.")
+    if config.job.extract_content or config.job.extract_external_links or config.job.extract_content_images:
+        logger.warning(
+            "`job.extract_content` or `job.extract_external_links` or `job.extract_content_images` is enabled "
+            "and will fetch post content one by one, which may take time. Disable if not needed.")
 
     job_list: List[Job] = []
     for post in post_list:
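
Earlier in this file's diff, `create_job_from_post` only queues a content image when its filename passes the `allow_list` / `block_list` check. A minimal sketch of that check, assuming the same semantics as the nested `any(map(...))` expression in the diff (an empty allow list allows everything, and the block list always wins), is shown below. `passes_filters` is a hypothetical helper name, not part of KToolBox.

```python
from fnmatch import fnmatch
from typing import Iterable


def passes_filters(filename: str, allow_list: Iterable[str], block_list: Iterable[str]) -> bool:
    """True if filename matches the allow list (or it is empty) and no block pattern."""
    allow = list(allow_list)
    allowed = not allow or any(fnmatch(filename, pattern) for pattern in allow)
    blocked = any(fnmatch(filename, pattern) for pattern in block_list)
    return allowed and not blocked


print(passes_filters("1.png", ["*.png"], []))            # allow list matches
print(passes_filters("1.psd", [], ["*.psd", "*.zip"]))   # block list matches
print(passes_filters("a.png", ["*.png"], ["a.*"]))       # block list overrides allow list
```

`fnmatch` normalizes case according to the operating system, so patterns such as `*.PNG` may behave differently on Windows than on Linux or macOS.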
 
@@ -1,4 +1,5 @@
 from datetime import datetime
+from html.parser import HTMLParser
 from pathlib import Path
 from typing import Optional, List, Generator, Any, Tuple, Set
 
@@ -19,12 +20,27 @@
     "filter_posts_by_indices",
     "match_post_keywords",
     "filter_posts_by_keywords",
-    "filter_posts_by_keywords_exclude"
+    "filter_posts_by_keywords_exclude",
+    "extract_content_images"
 ]
 
 TIME_FORMAT = "%Y-%m-%d"
 
 
+class _ContentImageParser(HTMLParser):
+    """HTML parser to extract image sources from content"""
+
+    def __init__(self):
+        super().__init__()
+        self.image_sources = []
+
+    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]):
+        if tag.lower() == 'img':
+            for attr_name, attr_value in attrs:
+                if attr_name.lower() == 'src' and attr_value:
+                    self.image_sources.append(attr_value)
+
+
 def generate_post_path_name(post: Post) -> str:
     """Generate directory name for post to save."""
     if not post.title:
@@ -53,7 +69,7 @@ def generate_year_dirname(post: Post) -> str:
     post_date = post.published or post.added
     if not post_date:
         return "unknown"
-    
+
     try:
         return sanitize_filename(
             config.job.year_dirname_format.format(
@@ -71,7 +87,7 @@ def generate_month_dirname(post: Post) -> str:
     post_date = post.published or post.added
     if not post_date:
         return "unknown"
-    
+
     try:
         return sanitize_filename(
             config.job.month_dirname_format.format(
@@ -93,15 +109,15 @@ def generate_grouped_post_path(post: Post, base_path: Path) -> Path:
     :return: Full path where the post should be saved
     """
     result_path = base_path
-    
+
     if config.job.group_by_year:
         year_dirname = generate_year_dirname(post)
         result_path = result_path / year_dirname
-        
+
         if config.job.group_by_month:
             month_dirname = generate_month_dirname(post)
             result_path = result_path / month_dirname
-    
+
     return result_path
 
 
@@ -196,12 +212,12 @@ def match_post_keywords(post: Post, keywords: Set[str]) -> bool:
     """
     if not keywords:
         return True
-    
+
     # Only search in post title
     searchable_text = ""
     if post.title:
         searchable_text = post.title.lower()
-    
+
     # Check if any keyword is found in the title
     return any(keyword.lower() in searchable_text for keyword in keywords)
 
@@ -219,7 +235,7 @@ def filter_posts_by_keywords(
     if not keywords:
         yield from post_list
         return
-    
+
     post_filter = filter(lambda x: match_post_keywords(x, keywords), post_list)
     yield from post_filter
 
@@ -237,7 +253,27 @@ def filter_posts_by_keywords_exclude(
     if not keywords_exclude:
         yield from post_list
         return
-    
+
     # Exclude posts that match any of the exclude keywords
     post_filter = filter(lambda x: not match_post_keywords(x, keywords_exclude), post_list)
     yield from post_filter
+
+
+def extract_content_images(content: str) -> List[str]:
+    """
+    Extract image sources from HTML content
+
+    :param content: HTML content string
+    :return: List of image source URLs/paths
+    """
+    if not content:
+        return []
+
+    parser = _ContentImageParser()
+    try:
+        parser.feed(content)
+    except Exception as e:
+        logger.warning(f"Failed to parse HTML content for images: {e}")
+        return []
+
+    return parser.image_sources
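
To see what `extract_content_images` returns for typical post content, the parsing approach from this diff can be tried on a small HTML string. The sketch below is a standalone, stdlib-only restatement of it; the class name `ImgSrcCollector`, the function `collect_img_sources`, and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class ImgSrcCollector(HTMLParser):
    """Collect the non-empty src attribute of every <img> tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        # HTMLParser already lowercases tag and attribute names.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def collect_img_sources(content: str) -> List[str]:
    """Return all image sources found in an HTML fragment."""
    if not content:
        return []
    parser = ImgSrcCollector()
    parser.feed(content)
    parser.close()
    return parser.sources


sample = (
    '<p>Hello</p>'
    '<img src="/data/aa/bb/one.png">'
    '<IMG SRC="https://example.com/two.jpg" alt="x"/>'
    '<img alt="no source">'
    '<img src="">'
)
print(collect_img_sources(sample))
```

For this sample the result is `['/data/aa/bb/one.png', 'https://example.com/two.jpg']`: the uppercase tag and the self-closing form are both picked up, while tags with a missing or empty `src` are ignored.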
@@ -198,6 +198,7 @@ class JobConfiguration(BaseModel):
     :ivar allow_list: Download files which match these patterns (Unix shell-style), e.g. ``["*.png"]``
     :ivar block_list: Not to download files which match these patterns (Unix shell-style), e.g. ``["*.psd","*.zip"]``
     :ivar extract_content: Extract post content and save to separate file (filename was defined in ``config.job.post_structure.content``)
+    :ivar extract_content_images: Extract images from post content and download them.
     :ivar extract_external_links: Extract external file sharing links from post content and save to separate file \
     (filename was defined in ``config.job.post_structure.external_links``) \
     :ivar external_link_patterns: Regex patterns for extracting external links.
@@ -223,6 +224,7 @@ class JobConfiguration(BaseModel):
     # noinspection PyDataclass
     block_list: Set[str] = Field(default_factory=set)
     extract_content: bool = False
+    extract_content_images: bool = False
     extract_external_links: bool = False
     # noinspection SpellCheckingInspection
     external_link_patterns: List[str] = [