Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion apps/common/handle/impl/text/text_split_handle.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,11 +28,13 @@

class TextSplitHandle(BaseSplitHandle):
def support(self, file, get_buffer):
buffer = get_buffer(file)
file_name: str = file.name.lower()
if file_name.endswith(".md") or file_name.endswith('.txt') or file_name.endswith('.TXT') or file_name.endswith(
'.MD'):
return True
if file_name.index('.') > 0:
return False
buffer = get_buffer(file)
result = detect(buffer)
if result['encoding'] is not None and result['confidence'] is not None and result['encoding'] != 'ascii' and \
result['confidence'] > 0.5:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the provided code, there are several potential issues that need attention:

  1. Magic Number Usage: The line if file_name.endswith(".md") directly compares strings using .lower(), which doesn't prevent accidental modification of file extensions on case-insensitive filesystems.

  2. Unfiltered File Extension Check: Adding '.TXT' and 'MD' duplicates cases with existing checks (e.g., 'md'). This could lead to redundancy.

  3. Buffer Loading Logic Issues:

    • Potential Buffer Leak: The logic for loading the buffer (get_buffer(file)) should be placed inside a block or method scope where it's actually used.
    • Buffer Overwriting: After assigning the result of detect(buffer), you overwrite buffer. Depending on what detect(buffer) returns, this might not be intentional.
  4. Encoding Detection: There's no further processing done after checking encoding confidence. This implies that files without good encodings will pass through without further scrutiny.

  5. Return Type: A more detailed error message or handling would be beneficial when returning False.

  6. File Name Index Verification: Replacing only part of the filename string (file_name.index('.') > 0) can mask other issues related to different file formats or content.

Possible Improvements:

  • Simplify Encoding Checks:

    supported_encodings = set(['utf-8', 'latin1', 'unicode_escape'])  # Example supported encodings
    
    if result['encoding'] and result['encoding'].lower() in supported_encodings and result['confidence'] > 0.5:
        return True
  • Move Buffer Loading Inside Use Case:

    class TextSplitHandle(BaseSplitHandle):
        def support(self, file, get_buffer):
            buffer = get_buffer(file)
            file_name: str = file.name.lower()
    
            if file_name.endswith(('.md', '.txt')):
                return True
    
            if file_name.startswith('.'):
                # Skip hidden files
                return False
    
            detected_encoding = detect(buffer)
    
            if detected_encoding and detected_encoding.lower() in supported_encodings and detected_encoding['confidence'] > 0.5:
                return True
    
            raise ValueError("Unsupported file format or encoding")

These changes improve robustness and readability while addressing some common coding issues.

Expand Down
6 changes: 2 additions & 4 deletions apps/common/handle/impl/text/zip_split_handle.py
Original file line number Diff line number Diff line change
Expand Up @@ -216,20 +216,18 @@ def get_content(self, file, save_image):
real_name = get_file_name(zf.name)
except Exception:
real_name = zf.name

# 为 split_handle 提供可重复读取的 file-like 对象
zf.name = real_name
get_buffer = FileBufferHandle().get_buffer
for split_handle in split_handles:
# 准备一个简单的 get_buffer 回调,返回当前 raw
get_buffer = FileBufferHandle().get_buffer
if split_handle.support(zf, get_buffer):
row = get_buffer(zf)
md_text = split_handle.get_content(io.BytesIO(row), save_image)
file_content_list.append({'content': md_text, 'name': real_name})
break
for file_content in file_content_list:
_image_list, content = get_image_list_by_content(file_content.get('name'), file_content.get("content"),
files)
files)
content_parts.append(content)
for image in _image_list:
image_list.append(image)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The provided code snippet looks well-structured but contains a few potential issues and areas for optimization:

Issues:

  1. Redundant split_handle.support(zf, get_buffer) Check: The call to if split_handle.support(zf, get_buffer): is redundant since it has already been checked in the previous line (for split_handle in split_handles:).
  2. Variable Shadowing: There is variable shadowing with get_buffer. This can cause confusion and bugs if not handled carefully.

Optimization Suggestions:

  1. Avoid Redundant Calls: Remove the unnecessary assignment of zf.name = real_name before entering the loop.
  2. Use Proper Variable Names: Refactor variable names like _file_list, _content_parts, and _image_list to be more descriptive and avoid shadows.

Here's an optimized version of the code with these improvements:

def get_content(self, file, save_image):
    file_content_list = []
    
    # Prepare a simple get_buffer callback, returning current raw
    get_buffer = FileBufferHandle().get_buffer
    
    for zf in zip_files:
        try:
            real_name = get_file_name(zf.name)
        except Exception:
            real_name = zf.name

        for split_handle in split_handles:
            if split_handle.support(zf, get_buffer):
                row = get_buffer(zf)
                md_text = split_handle.get_content(io.BytesIO(row), save_image)
                file_content_list.append({'content': md_text, 'name': real_name})
                break
    
    content_parts = []
    image_list = []
    
    for file_content in file_content_list:
        _image_list, content = get_image_list_by_content(file_content['name'], file_content.get("content"), files)
        content_parts.append(content)
        for image in _image_list:
            image_list.append(image)

This version ensures clarity by avoiding redundancy and improving readability through proper formatting and naming conventions.

Expand Down
Loading