Feat: Implement ID-Based Paragraph Synchronization for Efficient Loanword Detection by google-labs-jules[bot] · Pull Request #16 · Team-INSERT/NARATMALSAMI-WEB

google-labs-jules · 2025-09-02T04:51:58Z

This change addresses a synchronization bug where loanword highlights would become detached from the document content after edits. The previous system re-analyzed large portions of the document inefficiently.

The new implementation introduces a robust, ID-based synchronization strategy:

Stable Unique IDs: A mechanism has been added to ensure every paragraph-level element (<p>, <h1>, etc.) in the editor has a persistent data-unique attribute.
Efficient Diffing: The change detection logic has been rewritten to compare the 'before' and 'after' states of the document using a Map of these unique IDs. This allows the system to pinpoint exactly which paragraphs were added, modified, or deleted.
Minimized API Calls: Only the content of paragraphs that were actually added or modified is now sent to the loanword detection AI service. This dramatically reduces the number of expensive API calls, improving performance and lowering cost.
Correct State Management: The system now correctly handles the deletion of paragraphs by removing any associated error data from the application's state.

Refactors the document change detection mechanism to use stable, unique IDs for each paragraph-level element. This resolves synchronization issues between the editor content and the loanword analysis results. The key changes include: - A two-pass system in `DocumentManager` that first ensures all paragraphs have a `data-unique` attribute before proceeding with analysis. - Replaced the inefficient `deep-diff` on plain text with a new ID-based diffing logic that compares maps of paragraph IDs to their text content. - The new logic correctly identifies added, modified, and deleted paragraphs. - Only added or modified paragraphs are sent to the AI service for re-analysis, significantly reducing expensive API calls. - Error states for deleted paragraphs are now properly cleaned up from the repository. - The document's state is now managed as a `Map<string, string>` for more efficient lookups.

google-labs-jules · 2025-09-02T04:51:58Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll acknowledge your comments with a 👀 emoji and then get to work. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

I will automatically address your feedback on specific files or sections of files. For any comments you don't want me to act on, just include `(aside)`.

PS: Responding to comments on the entire PR are not yet supported but coming soon.

For security, I will only act on instructions from the user who triggered this task for this pull request.

Corrects the implementation to use `Prefix.UNIQUE` when generating new `data-unique` attributes for paragraphs, as per user feedback. Removes the unnecessary `PARAGRAPH` prefix that was added. This aligns the code with the intended design for unique ID generation without changing the core ID-based diffing logic.

Copilot

Pull Request Overview

This PR implements a robust ID-based paragraph synchronization system to address loanword highlight detachment issues after document edits. The change replaces inefficient array-based diffing with a Map-based approach using unique paragraph IDs.

Introduces persistent data-unique attributes for paragraph-level elements to enable accurate tracking
Replaces array-based document diffing with efficient Map-based comparison using unique IDs
Implements targeted processing that only analyzes actually modified paragraphs, reducing AI API calls

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
src/utils/ai/refineForeign.ts	Adds fallback ID generation for paragraphs missing `data-unique` attributes
src/utils/HtmlProcessor.ts	Introduces `createParagraphMap` method to build ID-to-content mappings from HTML elements
src/shared/stores/DocumentManager.ts	Replaces array-based diffing with ID-based synchronization and adds automatic ID assignment
src/shared/services/DocumentService.ts	Updates document storage from arrays to Maps and adds error cleanup for deleted paragraphs

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-09-02T06:18:38Z

src/utils/HtmlProcessor.ts

+      const el = doc.body.firstChild as HTMLElement;
+      if (el && el.getAttribute) {
+        const id = el.getAttribute("data-unique");


Type assertion to HTMLElement is unsafe. The firstChild could be a Text node or other Node type that doesn't have getAttribute method. Use proper type checking: if (el && el.nodeType === Node.ELEMENT_NODE && 'getAttribute' in el)

Suggested change

const el = doc.body.firstChild as HTMLElement;

if (el && el.getAttribute) {

const id = el.getAttribute("data-unique");

const el = doc.body.firstChild;

if (el && el.nodeType === Node.ELEMENT_NODE && 'getAttribute' in el) {

const id = (el as HTMLElement).getAttribute("data-unique");

Copilot · 2025-09-02T06:18:39Z

src/shared/stores/DocumentManager.ts

+        const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`));
+        if (elementHtml) modifiedElements.push(elementHtml);
+      } else if (oldText !== undefined && newText === undefined) {
+        // Deleted
+        DocumentManager.documentService.removeErrorByTargetId(id);
+      } else if (oldText !== newText) {
+        // Modified
+        const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`));


Using string.includes() for HTML attribute matching is fragile and could match IDs that are substrings of other IDs (e.g., 'id-1' would match 'id-10'). Use a more robust approach like DOM parsing or regex with word boundaries.

Suggested change

const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`));

if (elementHtml) modifiedElements.push(elementHtml);

} else if (oldText !== undefined && newText === undefined) {

// Deleted

DocumentManager.documentService.removeErrorByTargetId(id);

} else if (oldText !== newText) {

// Modified

const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`));

const elementHtml = editedDocumentHtml.find((el) => new RegExp(`\\bdata-unique="${id}"\\b`).test(el));

if (elementHtml) modifiedElements.push(elementHtml);

} else if (oldText !== undefined && newText === undefined) {

// Deleted

DocumentManager.documentService.removeErrorByTargetId(id);

} else if (oldText !== newText) {

// Modified

const elementHtml = editedDocumentHtml.find((el) => new RegExp(`\\bdata-unique="${id}"\\b`).test(el));

Copilot · 2025-09-02T06:18:39Z

src/shared/stores/DocumentManager.ts

+        const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`));
+        if (elementHtml) modifiedElements.push(elementHtml);
+      } else if (oldText !== undefined && newText === undefined) {
+        // Deleted
+        DocumentManager.documentService.removeErrorByTargetId(id);
+      } else if (oldText !== newText) {
+        // Modified
+        const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`));


Same fragile string matching issue as line 109. Using string.includes() for HTML attribute matching could match IDs that are substrings of other IDs.

Suggested change

const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`));

if (elementHtml) modifiedElements.push(elementHtml);

} else if (oldText !== undefined && newText === undefined) {

// Deleted

DocumentManager.documentService.removeErrorByTargetId(id);

} else if (oldText !== newText) {

// Modified

const elementHtml = editedDocumentHtml.find((el) => el.includes(`data-unique="${id}"`));

const elementHtml = editedDocumentHtml.find((el) => {

const match = el.match(/data-unique="([^"]*)"/);

return match && match[1] === id;

});

if (elementHtml) modifiedElements.push(elementHtml);

} else if (oldText !== undefined && newText === undefined) {

// Deleted

DocumentManager.documentService.removeErrorByTargetId(id);

} else if (oldText !== newText) {

// Modified

const elementHtml = editedDocumentHtml.find((el) => {

const match = el.match(/data-unique="([^"]*)"/);

return match && match[1] === id;

});

obtuse-triangle requested a review from Copilot September 2, 2025 06:17

Copilot AI reviewed Sep 2, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat: Implement ID-Based Paragraph Synchronization for Efficient Loanword Detection#16

Feat: Implement ID-Based Paragraph Synchronization for Efficient Loanword Detection#16
google-labs-jules[bot] wants to merge 2 commits intodevelopfrom
feature/id-based-sync

google-labs-jules bot commented Sep 2, 2025

Uh oh!

google-labs-jules bot commented Sep 2, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Sep 2, 2025

Uh oh!

Copilot AI Sep 2, 2025

Uh oh!

Copilot AI Sep 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

0 participants

Conversation

google-labs-jules bot commented Sep 2, 2025

Uh oh!

google-labs-jules bot commented Sep 2, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Sep 2, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 2, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 2, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

0 participants