Skip to content

Conversation

@shaohuzhang1
Copy link
Contributor

fix: The image uploaded from the workflow knowledge base zip file cannot be parsed

@f2c-ci-robot
Copy link

f2c-ci-robot bot commented Dec 12, 2025

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@f2c-ci-robot
Copy link

f2c-ci-robot bot commented Dec 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shaohuzhang1 shaohuzhang1 merged commit e529514 into v2 Dec 12, 2025
3 of 5 checks passed
@shaohuzhang1 shaohuzhang1 deleted the pr@v2@fix_knowledge_workflow branch December 12, 2025 03:57

return '\n\n'.join(content_parts)


Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The get_image_list_by_content function handles both relative and absolute paths in images within Markdown content, but it only constructs the path from oss/file/, assuming that all referenced images are stored in a specific subdirectory. This might lead to incorrect behavior if images are distributed across directories.

    Optimization Suggestion: Consider using an argument or configuration option that allows specifying the exact subdirectory where these images should be looked for (oss/file/). Additionally, ensure that this directory exists before attempting to retrieve files from it.

  2. There's a small inconsistency between checking whether a file starts with 'oss/file/' or 'oss/image/'. Although they are likely synonymous based on context, explicit differentiation could improve readability:

    # Original code
    if image_path.startswith('oss/files/') or image_path.startswith('oss/images/'):
    
    # Revised code
    if image_path.startswith(('oss,file/', 'oss,image/')):
  3. Some variable names like _image_list and content_parts can be confusing without clear contexts or explanations of their purpose. It would help clarify them by prefixing them with meaningful prefixes such as "result_" or "processed_":

    result_image_list = []
    processed_content_parts = []
  4. The use of hardcoded UUID strings ("./oss/file/" + new_image_id) in place of actual UUIDs (e.g., "./oss/file/{}/{}") when logging details about replacements could potentially lead to misleading information:

    # Incorrectly hard-coded log message
    print(f"Replaced {source_image_path} with ./oss/file/{new_image_id}")
    
    # Correctly formatted log message
    print(f"Replaced {source_image_path} with ./oss/file/{image_id}")
  5. While your handling of ZIP archives is comprehensive, there’s room for improvement regarding error management. For instance, exceptions related to decoding text data or invalid file types might need more attention:

    try:
        enc = detect(raw).get('encoding') or 'utf-8'
        md_text = raw.decode(enc, errors='ignore')
    except Exception as e:
        print(f"Failed to decode text: {e}. Content was ignored.")
        md_text = raw.decode('utf-8', errors='ignore')
  6. You are creating File instances in two separate operations – once directly after extracting each file and another time when you collect all relevant images into one final list. Consolidating these steps into one pass could simplify the logic and make future maintenance easier:

    for name, zf in zip(files, zip_ref):
        if name.endswith('/') or name.startswith('__MACOSX'):
            continue
        # Collect relevant file contents
        content_parts.append({'content': get_buffer(zf), 'name': name})
    
    # Once all valid contents have been collected, create `File` instances.
    for file_content in content_parts:
        rows, updated_content = get_image_list_by_content(file_content.get('name'), file_content.get("content"), files)
        update_content_with_images(updated_content)
    
    def update_content_with_images(new_content):
        # Append all found images
        for image in image_mode_list:
            new_content = new_content.replace(old_source_file, f'./oss/file/{image.id}')
    
        return new_content

7. If you need to maintain compatibility with certain older Python versions or libraries (like `urljoin` or `posixpath`), ensure that those functions are imported correctly and supported in your environment. In newer Python versions or environments, these functions may already be available without needing imports:

```python   
from urllib.parse import urljoin
import os
import posixpath   # Only needed if used directly instead of built-in urlparse.urljoin()

# Example usage:
joined = urljoin(base_dir, source_image_path)
if not posixpath.startswith(joined, '/'):
     joined = posixpath.normpath(os.path.join(base_dir, source_image_path)))

These points aim to enhance the robustness, clarity, efficiency, maintainability and compliance of your code while avoiding potential issues common in its current implementation.

new_file.save(file_bytes)

document_list = []
for doc in document:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code looks generally clear and functional, but there are a few areas where improvements can be made:

  1. Empty File Check: The line new_file.save(file_bytes) might not work correctly if file_bytes is empty or null because saving an empty file would result in incorrect data being written to the database.

  2. Duplicate Entry Prevention: Although checking if a record with the same ID already exists before saving is good practice to prevent duplicate entries, it should also handle scenarios where such duplicates need to be updated instead of created (depending on your application logic).

  3. Semicolon Usage: There seems to be a semicolon at the end of the first line inside the loop that's iterating over document. This semicolon has no effect on the functionality, so you could remove it to improve readability.

  4. Comments: It'd be helpful to add comments to explain why certain lines are included, especially those related to data validation or handling.

Here's a refined version of your code with these suggestions applied:

from django.db import models

class Document(models.Model):
    # Define fields here

def save_image(image_list):
    for img in image_list:
        metadata = img.get('metadata')  # Extracting metadata from each item
        new_file = NewFileType.objects.create(  # Assuming this class exists
            name=img['name'],  # Replace 'name' key with actual field name
            content_type=img['content_type'],
            source_id=metadata['application_id'] if metadata['application_id'] else metadata['knowledge_id'],
            meta=metadata
        )
        
        file_bytes = get_file_contents(img)  # Function to retrieve file bytes
        
        # Save only if the file doesn't exist in the database
        if not QuerySet(File).filter(id=new_file.id).exists():
            new_file.save(file_bytes)

def save_document(document_data):
    for doc_data in document_data:
        document_instance = Document(...)  # Initialize Document instance with appropriate arguments
        document_instance.save()  # Save the document instance
    

Make sure to adjust method calls, variable names, and assumptions according to your specific implementation details.

liuruibin pushed a commit that referenced this pull request Dec 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants