Skip to content

Feat: Implement ID-Based Paragraph Synchronization for Efficient Loanword Detection#16

Draft
google-labs-jules[bot] wants to merge 2 commits intodevelopfrom
feature/id-based-sync
Draft

Feat: Implement ID-Based Paragraph Synchronization for Efficient Loanword Detection#16
google-labs-jules[bot] wants to merge 2 commits intodevelopfrom
feature/id-based-sync

Conversation

@google-labs-jules
Copy link

This change addresses a synchronization bug where loanword highlights would become detached from the document content after edits. The previous system re-analyzed large portions of the document inefficiently.

The new implementation introduces a robust, ID-based synchronization strategy:

  1. Stable Unique IDs: A mechanism has been added to ensure every paragraph-level element (<p>, <h1>, etc.) in the editor has a persistent data-unique attribute.
  2. Efficient Diffing: The change detection logic has been rewritten to compare the 'before' and 'after' states of the document using a Map of these unique IDs. This allows the system to pinpoint exactly which paragraphs were added, modified, or deleted.
  3. Minimized API Calls: Only the content of paragraphs that were actually added or modified is now sent to the loanword detection AI service. This dramatically reduces the number of expensive API calls, improving performance and lowering cost.
  4. Correct State Management: The system now correctly handles the deletion of paragraphs by removing any associated error data from the application's state.

Refactors the document change detection mechanism to use stable, unique IDs for each paragraph-level element. This resolves synchronization issues between the editor content and the loanword analysis results.

The key changes include:
- A two-pass system in `DocumentManager` that first ensures all paragraphs have a `data-unique` attribute before proceeding with analysis.
- Replaced the inefficient `deep-diff` on plain text with a new ID-based diffing logic that compares maps of paragraph IDs to their text content.
- The new logic correctly identifies added, modified, and deleted paragraphs.
- Only added or modified paragraphs are sent to the AI service for re-analysis, significantly reducing expensive API calls.
- Error states for deleted paragraphs are now properly cleaned up from the repository.
- The document's state is now managed as a `Map<string, string>` for more efficient lookups.
@google-labs-jules
Copy link
Author

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll acknowledge your comments with a 👀 emoji and then get to work. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

I will automatically address your feedback on specific files or sections of files. For any comments you don't want me to act on, just include `(aside)`.

PS: Responding to comments on the entire PR are not yet supported but coming soon.


For security, I will only act on instructions from the user who triggered this task for this pull request.

Corrects the implementation to use `Prefix.UNIQUE` when generating new `data-unique` attributes for paragraphs, as per user feedback. Removes the unnecessary `PARAGRAPH` prefix that was added.

This aligns the code with the intended design for unique ID generation without changing the core ID-based diffing logic.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a robust ID-based paragraph synchronization system to address loanword highlight detachment issues after document edits. The change replaces inefficient array-based diffing with a Map-based approach using unique paragraph IDs.

  • Introduces persistent data-unique attributes for paragraph-level elements to enable accurate tracking
  • Replaces array-based document diffing with efficient Map-based comparison using unique IDs
  • Implements targeted processing that only analyzes actually modified paragraphs, reducing AI API calls

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
src/utils/ai/refineForeign.ts Adds fallback ID generation for paragraphs missing data-unique attributes
src/utils/HtmlProcessor.ts Introduces createParagraphMap method to build ID-to-content mappings from HTML elements
src/shared/stores/DocumentManager.ts Replaces array-based diffing with ID-based synchronization and adds automatic ID assignment
src/shared/services/DocumentService.ts Updates document storage from arrays to Maps and adds error cleanup for deleted paragraphs

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +108 to +110
const el = doc.body.firstChild as HTMLElement;
if (el && el.getAttribute) {
const id = el.getAttribute("data-unique");
Copy link

Copilot AI Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type assertion to HTMLElement is unsafe. The firstChild could be a Text node or other Node type that doesn't have getAttribute method. Use proper type checking: if (el && el.nodeType === Node.ELEMENT_NODE && 'getAttribute' in el)

Suggested change
const el = doc.body.firstChild as HTMLElement;
if (el && el.getAttribute) {
const id = el.getAttribute("data-unique");
const el = doc.body.firstChild;
if (el && el.nodeType === Node.ELEMENT_NODE && 'getAttribute' in el) {
const id = (el as HTMLElement).getAttribute("data-unique");

Copilot uses AI. Check for mistakes.
Comment on lines +109 to +116
const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`));
if (elementHtml) modifiedElements.push(elementHtml);
} else if (oldText !== undefined && newText === undefined) {
// Deleted
DocumentManager.documentService.removeErrorByTargetId(id);
} else if (oldText !== newText) {
// Modified
const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`));
Copy link

Copilot AI Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using string.includes() for HTML attribute matching is fragile and could match IDs that are substrings of other IDs (e.g., 'id-1' would match 'id-10'). Use a more robust approach like DOM parsing or regex with word boundaries.

Suggested change
const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`));
if (elementHtml) modifiedElements.push(elementHtml);
} else if (oldText !== undefined && newText === undefined) {
// Deleted
DocumentManager.documentService.removeErrorByTargetId(id);
} else if (oldText !== newText) {
// Modified
const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`));
const elementHtml = editedDocumentHtml.find((el) => new RegExp(`\\bdata-unique="${id}"\\b`).test(el));
if (elementHtml) modifiedElements.push(elementHtml);
} else if (oldText !== undefined && newText === undefined) {
// Deleted
DocumentManager.documentService.removeErrorByTargetId(id);
} else if (oldText !== newText) {
// Modified
const elementHtml = editedDocumentHtml.find((el) => new RegExp(`\\bdata-unique="${id}"\\b`).test(el));

Copilot uses AI. Check for mistakes.
Comment on lines +109 to +116
const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`));
if (elementHtml) modifiedElements.push(elementHtml);
} else if (oldText !== undefined && newText === undefined) {
// Deleted
DocumentManager.documentService.removeErrorByTargetId(id);
} else if (oldText !== newText) {
// Modified
const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`));
Copy link

Copilot AI Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same fragile string matching issue as line 109. Using string.includes() for HTML attribute matching could match IDs that are substrings of other IDs.

Suggested change
const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`));
if (elementHtml) modifiedElements.push(elementHtml);
} else if (oldText !== undefined && newText === undefined) {
// Deleted
DocumentManager.documentService.removeErrorByTargetId(id);
} else if (oldText !== newText) {
// Modified
const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`));
const elementHtml = editedDocumentHtml.find((el) => {
const match = el.match(/data-unique="([^"]*)"/);
return match && match[1] === id;
});
if (elementHtml) modifiedElements.push(elementHtml);
} else if (oldText !== undefined && newText === undefined) {
// Deleted
DocumentManager.documentService.removeErrorByTargetId(id);
} else if (oldText !== newText) {
// Modified
const elementHtml = editedDocumentHtml.find((el) => {
const match = el.match(/data-unique="([^"]*)"/);
return match && match[1] === id;
});

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants