@shaohuzhang1
Contributor

fix: Part of the docx document is parsed incorrectly

@f2c-ci-robot

f2c-ci-robot bot commented Jan 6, 2025

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@f2c-ci-robot

f2c-ci-robot bot commented Jan 6, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

except BaseException as e:
    traceback.print_exception(e)
    return f'{e}'
Contributor Author

There are a few suggestions for optimizing and fixing this code:

  1. Remove Redundant Characters: The code currently uses replace to strip the different variants of the "标题" ("heading") prefix from the style name. Consider normalizing these in one place so every variant is handled consistently.

  2. Refactor Conditional Logic: Collapse the chained conditions into a single check, for example one startswith call with a tuple of prefixes, to improve readability.

  3. Simplify Error Handling: Use one common error message format in the exception handler.

Here’s an optimized version of the code with these considerations:

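import re
import traceback

class DocSplitHandle(BaseSplitHandle):
    def paragraph_to_md(self, paragraph: Paragraph, doc: Document, images_list, get_image_id):
        try:
            psn = paragraph.style.name
            if psn.startswith(('Heading', 'TOC 标题', '标题')):
                # The heading level is the trailing number of the style name, e.g. "Heading 2".
                # Assumption: a heading style without a number is treated as level 1.
                match = re.search(r'(\d+)\s*$', psn)
                level = int(match.group(1)) if match else 1
                title = self._build_heading(level, paragraph.text)
                # Flatten the per-element image lists into a single list.
                images = [image
                          for e in paragraph._element
                          for image in get_paragraph_element_images(e, doc, images_list, get_image_id)]
            else:
                title = paragraph.text
                images = []

            # Placeholder: how images are rendered to Markdown is not shown in this suggestion.
            image_md = '\n'.join(str(image) for image in images)
            return f"{title}\n\n{image_md}" if image_md else title

        except BaseException as e:
            traceback.print_exception(e)
            return f'Error processing {e}'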

    def _build_heading(self, level, text):
        # Build heading string based on level
        return '#' * level + ' ' + text

    def get_content(self, file_path, save_image):
        try:
            document_manager = load_document(file_path)
            content = ''
            for section in document_manager.sections:
                content += self.paragraph_to_md(section.headings[0], document_manager.document, [], lambda x, y: [])
                
                if not all(img['path'] is None for img in section.images): 
                    content += '\n![]('  # Start image reference
            
    ... (rest of the get_content method remains unchanged)
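
A note on combining several per-element lists into one, as the snippet does for images: the built-in sum starts from 0 by default, so it only concatenates lists when given an explicit list as its start value. The following is a minimal sketch of two ways to flatten a list of lists. The helper name flatten is hypothetical and is not part of the project.

```python
from itertools import chain
from typing import Iterable, List, TypeVar

T = TypeVar('T')


def flatten(list_of_lists: Iterable[Iterable[T]]) -> List[T]:
    """Concatenate an iterable of iterables into one flat list."""
    return list(chain.from_iterable(list_of_lists))


# Hypothetical per-element results; an element without images yields an empty list.
per_element = [['img-a'], [], ['img-b', 'img-c']]

flat = flatten(per_element)

# sum also works, but only with an explicit empty-list start value.
flat_with_sum = sum(per_element, [])
```

chain.from_iterable runs in linear time, while sum with a list start value copies the accumulated list on every step, so the former is usually the better choice for long inputs.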

Changes Made:

  • Normalized condition checking for titles by splitting the logic into _build_heading.
  • Created helper functions for clarity.
  • Unified error handling messages.
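
To make the heading normalization above concrete, here is a minimal, self-contained sketch and not the project's actual implementation. The names HEADING_PREFIXES, heading_level, and build_heading are hypothetical, and two behaviors are assumptions made for illustration: a heading style without a number falls back to level 1, and levels are clamped to Markdown's range of 1 to 6.

```python
import re
from typing import Optional

# Style-name prefixes that mark a heading paragraph.
# "标题" is the Chinese word for "heading" used in localized style names.
HEADING_PREFIXES = ('Heading', 'TOC 标题', '标题')


def heading_level(style_name: str) -> Optional[int]:
    """Return the heading level encoded in a style name, or None for non-headings."""
    name = style_name.strip()
    # str.startswith accepts a tuple, so one call covers every prefix.
    if not name.startswith(HEADING_PREFIXES):
        return None
    # The level is read from the trailing digits, e.g. "Heading 2" or "标题 3".
    match = re.search(r'(\d+)\s*$', name)
    # Assumption: a heading style without a number is treated as level 1.
    return int(match.group(1)) if match else 1


def build_heading(level: int, text: str) -> str:
    """Build a Markdown heading line, clamping the level to 1..6 (an assumption)."""
    level = max(1, min(level, 6))
    return '#' * level + ' ' + text
```

Reading the level with a single regular expression avoids a separate replace call per prefix, so adding a new prefix only means extending the tuple.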

@shaohuzhang1 shaohuzhang1 merged commit d9df013 into main Jan 6, 2025
4 of 5 checks passed
@shaohuzhang1 shaohuzhang1 deleted the pr@main@fix_docx_document branch January 6, 2025 06:37
shaohuzhang1 added a commit that referenced this pull request Jan 7, 2025
2 participants