fix: The image uploaded from the workflow knowledge base zip file cannot be parsed #4503

shaohuzhang1 · 2025-12-12T03:57:18Z

fix: The image uploaded from the workflow knowledge base zip file cannot be parsed

…not be parsed

f2c-ci-robot · 2025-12-12T03:57:22Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

f2c-ci-robot · 2025-12-12T03:57:27Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

shaohuzhang1 · 2025-12-12T03:57:51Z

apps/common/handle/impl/text/zip_split_handle.py

-
        return '\n\n'.join(content_parts)
-
-


The get_image_list_by_content function handles both relative and absolute paths in images within Markdown content, but it only constructs the path from oss/file/, assuming that all referenced images are stored in a specific subdirectory. This might lead to incorrect behavior if images are distributed across directories.

Optimization Suggestion: Consider using an argument or configuration option that allows specifying the exact subdirectory where these images should be looked for (oss/file/). Additionally, ensure that this directory exists before attempting to retrieve files from it.

There's a small inconsistency between checking whether a file starts with 'oss/file/' or 'oss/image/'. Although they are likely synonymous based on context, explicit differentiation could improve readability:

# Original code if image_path.startswith('oss/files/') or image_path.startswith('oss/images/'): # Revised code if image_path.startswith(('oss,file/', 'oss,image/')):

Some variable names like _image_list and content_parts can be confusing without clear contexts or explanations of their purpose. It would help clarify them by prefixing them with meaningful prefixes such as "result_" or "processed_":

result_image_list = [] processed_content_parts = []

The use of hardcoded UUID strings ("./oss/file/" + new_image_id) in place of actual UUIDs (e.g., "./oss/file/{}/{}") when logging details about replacements could potentially lead to misleading information:

# Incorrectly hard-coded log message print(f"Replaced {source_image_path} with ./oss/file/{new_image_id}") # Correctly formatted log message print(f"Replaced {source_image_path} with ./oss/file/{image_id}")

While your handling of ZIP archives is comprehensive, there’s room for improvement regarding error management. For instance, exceptions related to decoding text data or invalid file types might need more attention:

try: enc = detect(raw).get('encoding') or 'utf-8' md_text = raw.decode(enc, errors='ignore') except Exception as e: print(f"Failed to decode text: {e}. Content was ignored.") md_text = raw.decode('utf-8', errors='ignore')

You are creating File instances in two separate operations – once directly after extracting each file and another time when you collect all relevant images into one final list. Consolidating these steps into one pass could simplify the logic and make future maintenance easier:

for name, zf in zip(files, zip_ref): if name.endswith('/') or name.startswith('__MACOSX'): continue # Collect relevant file contents content_parts.append({'content': get_buffer(zf), 'name': name}) # Once all valid contents have been collected, create `File` instances. for file_content in content_parts: rows, updated_content = get_image_list_by_content(file_content.get('name'), file_content.get("content"), files) update_content_with_images(updated_content) def update_content_with_images(new_content): # Append all found images for image in image_mode_list: new_content = new_content.replace(old_source_file, f'./oss/file/{image.id}') return new_content

7. If you need to maintain compatibility with certain older Python versions or libraries (like `urljoin` or `posixpath`), ensure that those functions are imported correctly and supported in your environment. In newer Python versions or environments, these functions may already be available without needing imports: ```python from urllib.parse import urljoin import os import posixpath # Only needed if used directly instead of built-in urlparse.urljoin() # Example usage: joined = urljoin(base_dir, source_image_path) if not posixpath.startswith(joined, '/'): joined = posixpath.normpath(os.path.join(base_dir, source_image_path)))

These points aim to enhance the robustness, clarity, efficiency, maintainability and compliance of your code while avoiding potential issues common in its current implementation.

shaohuzhang1 · 2025-12-12T03:58:05Z

apps/application/flow/step_node/document_extract_node/impl/base_document_extract_node.py

+                    new_file.save(file_bytes)

        document_list = []
        for doc in document:


Your code looks generally clear and functional, but there are a few areas where improvements can be made:

Empty File Check: The line new_file.save(file_bytes) might not work correctly if file_bytes is empty or null because saving an empty file would result in incorrect data being written to the database.

Duplicate Entry Prevention: Although checking if a record with the same ID already exists before saving is good practice to prevent duplicate entries, it should also handle scenarios where such duplicates need to be updated instead of created (depending on your application logic).

Semicolon Usage: There seems to be a semicolon at the end of the first line inside the loop that's iterating over document. This semicolon has no effect on the functionality, so you could remove it to improve readability.

Comments: It'd be helpful to add comments to explain why certain lines are included, especially those related to data validation or handling.

Here's a refined version of your code with these suggestions applied:

from django.db import models class Document(models.Model): # Define fields here def save_image(image_list): for img in image_list: metadata = img.get('metadata') # Extracting metadata from each item new_file = NewFileType.objects.create( # Assuming this class exists name=img['name'], # Replace 'name' key with actual field name content_type=img['content_type'], source_id=metadata['application_id'] if metadata['application_id'] else metadata['knowledge_id'], meta=metadata ) file_bytes = get_file_contents(img) # Function to retrieve file bytes # Save only if the file doesn't exist in the database if not QuerySet(File).filter(id=new_file.id).exists(): new_file.save(file_bytes) def save_document(document_data): for doc_data in document_data: document_instance = Document(...) # Initialize Document instance with appropriate arguments document_instance.save() # Save the document instance

Make sure to adjust method calls, variable names, and assumptions according to your specific implementation details.

…not be parsed (#4503)

fix: The image uploaded from the workflow knowledge base zip file can…

62e2266

…not be parsed

f2c-ci-robot bot added the do-not-merge/release-note-label-needed label Dec 12, 2025

shaohuzhang1 merged commit e529514 into v2 Dec 12, 2025
3 of 5 checks passed

shaohuzhang1 deleted the pr@v2@fix_knowledge_workflow branch December 12, 2025 03:57

shaohuzhang1 commented Dec 12, 2025

View reviewed changes

liuruibin pushed a commit that referenced this pull request Dec 12, 2025

fix: The image uploaded from the workflow knowledge base zip file can…

fd400f2

…not be parsed (#4503)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: The image uploaded from the workflow knowledge base zip file cannot be parsed #4503

fix: The image uploaded from the workflow knowledge base zip file cannot be parsed #4503

Uh oh!

shaohuzhang1 commented Dec 12, 2025

Uh oh!

f2c-ci-robot bot commented Dec 12, 2025

Uh oh!

f2c-ci-robot bot commented Dec 12, 2025

Uh oh!

Uh oh!

shaohuzhang1 Dec 12, 2025

Uh oh!

shaohuzhang1 Dec 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix: The image uploaded from the workflow knowledge base zip file cannot be parsed #4503

fix: The image uploaded from the workflow knowledge base zip file cannot be parsed #4503

Uh oh!

Conversation

shaohuzhang1 commented Dec 12, 2025

Uh oh!

f2c-ci-robot bot commented Dec 12, 2025

Uh oh!

f2c-ci-robot bot commented Dec 12, 2025

Uh oh!

Uh oh!

shaohuzhang1 Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

shaohuzhang1 Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants