Feat: Implement ID-Based Paragraph Synchronization for Efficient Loanword Detection#16
Feat: Implement ID-Based Paragraph Synchronization for Efficient Loanword Detection#16google-labs-jules[bot] wants to merge 2 commits intodevelopfrom
Conversation
Refactors the document change detection mechanism to use stable, unique IDs for each paragraph-level element. This resolves synchronization issues between the editor content and the loanword analysis results. The key changes include: - A two-pass system in `DocumentManager` that first ensures all paragraphs have a `data-unique` attribute before proceeding with analysis. - Replaced the inefficient `deep-diff` on plain text with a new ID-based diffing logic that compares maps of paragraph IDs to their text content. - The new logic correctly identifies added, modified, and deleted paragraphs. - Only added or modified paragraphs are sent to the AI service for re-analysis, significantly reducing expensive API calls. - Error states for deleted paragraphs are now properly cleaned up from the repository. - The document's state is now managed as a `Map<string, string>` for more efficient lookups.
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll acknowledge your comments with a 👀 emoji and then get to work. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! I will automatically address your feedback on specific files or sections of files. For any comments you don't want me to act on, just include PS: Responding to comments on the entire PR are not yet supported but coming soon. For security, I will only act on instructions from the user who triggered this task for this pull request. |
Corrects the implementation to use `Prefix.UNIQUE` when generating new `data-unique` attributes for paragraphs, as per user feedback. Removes the unnecessary `PARAGRAPH` prefix that was added. This aligns the code with the intended design for unique ID generation without changing the core ID-based diffing logic.
There was a problem hiding this comment.
Pull Request Overview
This PR implements a robust ID-based paragraph synchronization system to address loanword highlight detachment issues after document edits. The change replaces inefficient array-based diffing with a Map-based approach using unique paragraph IDs.
- Introduces persistent
data-uniqueattributes for paragraph-level elements to enable accurate tracking - Replaces array-based document diffing with efficient Map-based comparison using unique IDs
- Implements targeted processing that only analyzes actually modified paragraphs, reducing AI API calls
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/utils/ai/refineForeign.ts | Adds fallback ID generation for paragraphs missing data-unique attributes |
| src/utils/HtmlProcessor.ts | Introduces createParagraphMap method to build ID-to-content mappings from HTML elements |
| src/shared/stores/DocumentManager.ts | Replaces array-based diffing with ID-based synchronization and adds automatic ID assignment |
| src/shared/services/DocumentService.ts | Updates document storage from arrays to Maps and adds error cleanup for deleted paragraphs |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| const el = doc.body.firstChild as HTMLElement; | ||
| if (el && el.getAttribute) { | ||
| const id = el.getAttribute("data-unique"); |
There was a problem hiding this comment.
Type assertion to HTMLElement is unsafe. The firstChild could be a Text node or other Node type that doesn't have getAttribute method. Use proper type checking: if (el && el.nodeType === Node.ELEMENT_NODE && 'getAttribute' in el)
| const el = doc.body.firstChild as HTMLElement; | |
| if (el && el.getAttribute) { | |
| const id = el.getAttribute("data-unique"); | |
| const el = doc.body.firstChild; | |
| if (el && el.nodeType === Node.ELEMENT_NODE && 'getAttribute' in el) { | |
| const id = (el as HTMLElement).getAttribute("data-unique"); |
| const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`)); | ||
| if (elementHtml) modifiedElements.push(elementHtml); | ||
| } else if (oldText !== undefined && newText === undefined) { | ||
| // Deleted | ||
| DocumentManager.documentService.removeErrorByTargetId(id); | ||
| } else if (oldText !== newText) { | ||
| // Modified | ||
| const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`)); |
There was a problem hiding this comment.
Using string.includes() for HTML attribute matching is fragile and could match IDs that are substrings of other IDs (e.g., 'id-1' would match 'id-10'). Use a more robust approach like DOM parsing or regex with word boundaries.
| const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`)); | |
| if (elementHtml) modifiedElements.push(elementHtml); | |
| } else if (oldText !== undefined && newText === undefined) { | |
| // Deleted | |
| DocumentManager.documentService.removeErrorByTargetId(id); | |
| } else if (oldText !== newText) { | |
| // Modified | |
| const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`)); | |
| const elementHtml = editedDocumentHtml.find((el) => new RegExp(`\\bdata-unique="${id}"\\b`).test(el)); | |
| if (elementHtml) modifiedElements.push(elementHtml); | |
| } else if (oldText !== undefined && newText === undefined) { | |
| // Deleted | |
| DocumentManager.documentService.removeErrorByTargetId(id); | |
| } else if (oldText !== newText) { | |
| // Modified | |
| const elementHtml = editedDocumentHtml.find((el) => new RegExp(`\\bdata-unique="${id}"\\b`).test(el)); |
| const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`)); | ||
| if (elementHtml) modifiedElements.push(elementHtml); | ||
| } else if (oldText !== undefined && newText === undefined) { | ||
| // Deleted | ||
| DocumentManager.documentService.removeErrorByTargetId(id); | ||
| } else if (oldText !== newText) { | ||
| // Modified | ||
| const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`)); |
There was a problem hiding this comment.
Same fragile string matching issue as line 109. Using string.includes() for HTML attribute matching could match IDs that are substrings of other IDs.
| const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`)); | |
| if (elementHtml) modifiedElements.push(elementHtml); | |
| } else if (oldText !== undefined && newText === undefined) { | |
| // Deleted | |
| DocumentManager.documentService.removeErrorByTargetId(id); | |
| } else if (oldText !== newText) { | |
| // Modified | |
| const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`)); | |
| const elementHtml = editedDocumentHtml.find((el) => { | |
| const match = el.match(/data-unique="([^"]*)"/); | |
| return match && match[1] === id; | |
| }); | |
| if (elementHtml) modifiedElements.push(elementHtml); | |
| } else if (oldText !== undefined && newText === undefined) { | |
| // Deleted | |
| DocumentManager.documentService.removeErrorByTargetId(id); | |
| } else if (oldText !== newText) { | |
| // Modified | |
| const elementHtml = editedDocumentHtml.find((el) => { | |
| const match = el.match(/data-unique="([^"]*)"/); | |
| return match && match[1] === id; | |
| }); |
This change addresses a synchronization bug where loanword highlights would become detached from the document content after edits. The previous system re-analyzed large portions of the document inefficiently.
The new implementation introduces a robust, ID-based synchronization strategy:
<p>,<h1>, etc.) in the editor has a persistentdata-uniqueattribute.Mapof these unique IDs. This allows the system to pinpoint exactly which paragraphs were added, modified, or deleted.