-
Notifications
You must be signed in to change notification settings - Fork 2.6k
fix: The image uploaded from the workflow knowledge base zip file cannot be parsed #4503
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
|
||
| return '\n\n'.join(content_parts) | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
The
get_image_list_by_contentfunction handles both relative and absolute paths in images within Markdown content, but it only constructs the path fromoss/file/, assuming that all referenced images are stored in a specific subdirectory. This might lead to incorrect behavior if images are distributed across directories.Optimization Suggestion: Consider using an argument or configuration option that allows specifying the exact subdirectory where these images should be looked for (
oss/file/). Additionally, ensure that this directory exists before attempting to retrieve files from it. -
There's a small inconsistency between checking whether a file starts with
'oss/file/'or'oss/image/'. Although they are likely synonymous based on context, explicit differentiation could improve readability:# Original code if image_path.startswith('oss/files/') or image_path.startswith('oss/images/'): # Revised code if image_path.startswith(('oss,file/', 'oss,image/')):
-
Some variable names like
_image_listandcontent_partscan be confusing without clear contexts or explanations of their purpose. It would help clarify them by prefixing them with meaningful prefixes such as "result_" or "processed_":result_image_list = [] processed_content_parts = []
-
The use of hardcoded UUID strings ("./oss/file/" + new_image_id) in place of actual UUIDs (e.g.,
"./oss/file/{}/{}") when logging details about replacements could potentially lead to misleading information:# Incorrectly hard-coded log message print(f"Replaced {source_image_path} with ./oss/file/{new_image_id}") # Correctly formatted log message print(f"Replaced {source_image_path} with ./oss/file/{image_id}")
-
While your handling of ZIP archives is comprehensive, there’s room for improvement regarding error management. For instance, exceptions related to decoding text data or invalid file types might need more attention:
try: enc = detect(raw).get('encoding') or 'utf-8' md_text = raw.decode(enc, errors='ignore') except Exception as e: print(f"Failed to decode text: {e}. Content was ignored.") md_text = raw.decode('utf-8', errors='ignore')
-
You are creating
Fileinstances in two separate operations – once directly after extracting each file and another time when you collect all relevant images into one final list. Consolidating these steps into one pass could simplify the logic and make future maintenance easier:for name, zf in zip(files, zip_ref): if name.endswith('/') or name.startswith('__MACOSX'): continue # Collect relevant file contents content_parts.append({'content': get_buffer(zf), 'name': name}) # Once all valid contents have been collected, create `File` instances. for file_content in content_parts: rows, updated_content = get_image_list_by_content(file_content.get('name'), file_content.get("content"), files) update_content_with_images(updated_content) def update_content_with_images(new_content): # Append all found images for image in image_mode_list: new_content = new_content.replace(old_source_file, f'./oss/file/{image.id}') return new_content
7. If you need to maintain compatibility with certain older Python versions or libraries (like `urljoin` or `posixpath`), ensure that those functions are imported correctly and supported in your environment. In newer Python versions or environments, these functions may already be available without needing imports:
```python
from urllib.parse import urljoin
import os
import posixpath # Only needed if used directly instead of built-in urlparse.urljoin()
# Example usage:
joined = urljoin(base_dir, source_image_path)
if not posixpath.startswith(joined, '/'):
joined = posixpath.normpath(os.path.join(base_dir, source_image_path)))
These points aim to enhance the robustness, clarity, efficiency, maintainability and compliance of your code while avoiding potential issues common in its current implementation.
| new_file.save(file_bytes) | ||
|
|
||
| document_list = [] | ||
| for doc in document: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code looks generally clear and functional, but there are a few areas where improvements can be made:
-
Empty File Check: The line
new_file.save(file_bytes)might not work correctly iffile_bytesis empty or null because saving an empty file would result in incorrect data being written to the database. -
Duplicate Entry Prevention: Although checking if a record with the same ID already exists before saving is good practice to prevent duplicate entries, it should also handle scenarios where such duplicates need to be updated instead of created (depending on your application logic).
-
Semicolon Usage: There seems to be a semicolon at the end of the first line inside the loop that's iterating over
document. This semicolon has no effect on the functionality, so you could remove it to improve readability. -
Comments: It'd be helpful to add comments to explain why certain lines are included, especially those related to data validation or handling.
Here's a refined version of your code with these suggestions applied:
from django.db import models
class Document(models.Model):
# Define fields here
def save_image(image_list):
for img in image_list:
metadata = img.get('metadata') # Extracting metadata from each item
new_file = NewFileType.objects.create( # Assuming this class exists
name=img['name'], # Replace 'name' key with actual field name
content_type=img['content_type'],
source_id=metadata['application_id'] if metadata['application_id'] else metadata['knowledge_id'],
meta=metadata
)
file_bytes = get_file_contents(img) # Function to retrieve file bytes
# Save only if the file doesn't exist in the database
if not QuerySet(File).filter(id=new_file.id).exists():
new_file.save(file_bytes)
def save_document(document_data):
for doc_data in document_data:
document_instance = Document(...) # Initialize Document instance with appropriate arguments
document_instance.save() # Save the document instance
Make sure to adjust method calls, variable names, and assumptions according to your specific implementation details.
fix: The image uploaded from the workflow knowledge base zip file cannot be parsed