Skip to content

feat(address-service): DOMA-12746 implement heuristics-based address deduplication system#7196

Merged
AleX83Xpert merged 53 commits intomainfrom
feat/address-service/doma-12746/improve-internal-keys-to-merge-data-from-dadata-and-pullenti
Mar 10, 2026
Merged

feat(address-service): DOMA-12746 implement heuristics-based address deduplication system#7196
AleX83Xpert merged 53 commits intomainfrom
feat/address-service/doma-12746/improve-internal-keys-to-merge-data-from-dadata-and-pullenti

Conversation

@AleX83Xpert
Copy link
Member

@AleX83Xpert AleX83Xpert commented Feb 12, 2026

Address Heuristics: Provider-Agnostic Deduplication System

Problem

The address-service previously relied on Address.key for deduplication. Since different providers (Dadata, Google, Pullenti) generate different keys for the same physical building, duplicate Address records accumulate over time. There was no mechanism to detect or resolve these cross-provider duplicates.

Solution

Introduce AddressHeuristic — a model that stores structured identifiers extracted from each provider (FIAS ID, coordinates, Google Place ID, fallback key). When a new address is resolved, the system checks existing heuristics before creating a new Address, enabling cross-provider matching.

How It Works

  1. On address search: the provider extracts heuristics from its response (e.g., Dadata yields fias_id + coordinates + fallback)
  2. Heuristic matching: findAddressByHeuristics() checks each heuristic against the DB, sorted by reliability score — first match wins
  3. If match found: the existing Address is returned, new heuristics are upserted for future matching
  4. If no match: a new Address is created, heuristics are stored, and possibleDuplicateOf may be set if a near-match is detected
  5. Admin resolution: admins can merge or dismiss flagged duplicates via the admin UI

Where to Start Reviewing

Start with the data model, then follow the flow:

  1. domains/address/schema/AddressHeuristic.js — the new model (type, value, reliability, provider, lat/lon for coordinate range queries)
  2. domains/address/schema/Address.js — the new possibleDuplicateOf relationship field
  3. domains/common/utils/services/search/heuristicMatcher.js — core matching logic: findAddressByHeuristics(), upsertHeuristics(), coordinate range queries
  4. domains/common/utils/services/search/searchServiceUtils.js — integration point where heuristics enter the search flow
  5. Provider extractHeuristics() methods — see how each provider extracts its identifiers:
    • DadataSearchProvider.js → fias_id, coordinates, fallback
    • GoogleSearchProvider.js → google_place_id, coordinates, fallback
    • PullentiSearchProvider.js → fias_id, fallback
    • InjectionsSeeker.js → fallback only

Then review the resolution/merge side:

  1. domains/address/utils/mergeAddresses.js — shared utility for merging two addresses (moves sources + heuristics, soft-deletes loser)
  2. domains/address/schema/ResolveAddressDuplicateService.js — GraphQL mutation for admin merge/dismiss
  3. domains/address/schema/ActualizeAddressesService.js — updated to use shared merge utility (now handles heuristics during key collisions)
  4. admin-ui/index.js — the UI button for resolving duplicates

Migration scripts (all support --dry-run):

  1. bin/migrate-address-keys-to-heuristics.js — updates Address.key format (fias: → fias_id:, bare → fallback:)
  2. bin/create-address-heuristics.js — backfills AddressHeuristic records from existing Address meta
  3. bin/local/merge-duplicate-addresses.js — bulk auto-merge clear duplicate cases

What to Pay Attention To

  • Coordinate matching tolerance: COORDINATE_TOLERANCE = 0.00001 (~1.1m at equator). Uses latitude/longitude Decimal fields with DB range queries — not in-memory scanning
  • Unique constraint on AddressHeuristic (type, value) where deletedAt IS NULL — prevents duplicate heuristic records
  • Chain prevention: possibleDuplicateOf always points to the root address via findRootAddress()
  • Migration order matters: run schema migration → Script A → Script B → Script C
  • Address.key is still generated but now uses the best heuristic (highest reliability) as the key source. Fallback key remains for providers without strong identifiers
  • Merge logic is shared across three consumers (ResolveAddressDuplicateService, ActualizeAddressesService, merge script) via mergeAddresses() — changes to merge behavior only need to happen in one place

Summary by CodeRabbit

  • New Features

    • Admin UI: review and resolve flagged duplicate addresses (merge or dismiss) with confirmation and navigation.
    • Heuristics-based deduplication: provider and fallback heuristics, automatic heuristic detection/upsert, and safe merge helpers.
    • New CLI tools to build/migrate heuristics and assist bulk merges (dry-run supported).
  • Documentation

    • Detailed migration guide for moving to heuristics-based deduplication and running scripts.
  • Tests

    • Expanded coverage for heuristic matching, resolve flows, search and suggestion integrations.

@AleX83Xpert AleX83Xpert added 🔬 WIP Not intended to be merged right now, it is a work in progress 🚨 Migrations We have a database migrations here! 🐘 BIG No so easy to review changes up to 1300 lines of code labels Feb 12, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds heuristic-based address deduplication: new AddressHeuristic model and DB migration, heuristics extraction/upsert/matching, provider-based addressKey generation, merge/dismiss resolution service and admin UI, merge utility, migration scripts, and extensive tests and tooling.

Changes

Cohort / File(s) Summary
Admin UI
apps/address-service/admin-ui/index.js
Added ResolveAddressDuplicate component, GraphQL queries/mutation, confirmation/error handling, and changed itemHeaderActions layout to vertical Space including the new action.
GraphQL types & schema index
apps/address-service/domains/address/gql.js, apps/address-service/domains/address/schema/index.js
Added AddressHeuristic GQL entity; included Address.possibleDuplicateOf in address fields; exported AddressHeuristic and ResolveAddressDuplicateService.
Keystone models & migration
apps/address-service/domains/address/schema/Address.js, apps/address-service/domains/address/schema/AddressHeuristic.js, apps/address-service/migrations/20260212163711-0008_addressheuristichistoryrecord_and_more.js
Added possibleDuplicateOf relationship; new AddressHeuristic list with uniqueness constraint and history table; migration creates tables, indexes, and updates analytics view.
Access control & resolve service
apps/address-service/domains/address/access/AddressHeuristic.js, apps/address-service/domains/address/access/ResolveAddressDuplicateService.js, apps/address-service/domains/address/schema/ResolveAddressDuplicateService.js
Added canReadAddressHeuristics, canManageAddressHeuristics, canResolveAddressDuplicate; implemented resolveAddressDuplicate mutation supporting merge and dismiss with validations.
Merge utility
apps/address-service/domains/address/utils/mergeAddresses.js
New mergeAddresses(context, winnerId, loserId, dvSender) that moves AddressSource and AddressHeuristic records, soft-deletes loser, clears possibleDuplicateOf, and logs counts.
Heuristic engine & tests
apps/address-service/domains/common/utils/services/search/heuristicMatcher.js, .../heuristicMatcher.spec.js
New heuristic matcher: coordinate parsing/comparison, findAddressByHeuristics, findRootAddress, upsertHeuristics with conflict/race handling; comprehensive tests added.
Provider & suggestion changes
apps/address-service/domains/common/constants/heuristicTypes.js, .../providers/*.js, .../suggest/providers/*.js, .../AbstractSearchProvider.js, .../AbstractSuggestionProvider.js
Introduced heuristic type constants and PROVIDERS; providers now implement extractHeuristics and generateAddressKey; fallback key logic moved into providers.
Search integration & plugins
apps/address-service/domains/common/utils/services/search/searchServiceUtils.js, .../plugins/*.js, .../searchServiceUtils.spec.js
createOrUpdateAddressWithSource signature extended to accept heuristics; added heuristic-first lookup (findAddressByHeuristics) and call to upsertHeuristics; plugin calls updated to pass heuristics and use provider methods.
Removed/relocated utils
apps/address-service/domains/common/utils/addressKeyUtils.js, .../addressKeyUtils.spec.js, apps/address-service/domains/common/utils/services/InjectionsSeeker.js
Removed standalone addressKeyUtils; moved fallback key and heuristic extraction into InjectionsSeeker/providers and adjusted tests.
Actualization flow
apps/address-service/domains/address/schema/ActualizeAddressesService.js, .../ActualizeAddressesService.spec.js
Switched to provider.generateAddressKey/extractHeuristics, integrated upsertHeuristics and mergeAddresses in actualize flow; tests updated.
Scripts & local tooling
apps/address-service/bin/*.js, apps/address-service/bin/local/*, apps/address-service/docs/MIGRATION-heuristics.md
Added migration and helper scripts (migrate-address-keys-to-heuristics, create-address-heuristics, regenerate-keys, local merge-duplicate-addresses), docs and .env example; dry-run support and logging.
Server utils & exports
apps/address-service/domains/address/utils/serverSchema/index.js, apps/address-service/domains/address/gql.js
Exported AddressHeuristic server utils and exposed new GQL queries/entities.
Tests & suites
multiple *.spec.js under search, suggest, schema directories
Extensive tests added/updated for heuristic matching, provider heuristics, resolveAddressDuplicate, search/suggest integration, and utilities.
Package & minor test refactors
apps/address-service/package.json, apps/condo/.../*.test.js
Added workspace deps @open-conda/apollo-server-client, @open-condo/ui; minor test setup refactors in condo domain tests.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant AdminUI as Admin UI
    participant GraphQL as Address Service\nGraphQL API
    participant DB as Database

    User->>AdminUI: Open address detail
    AdminUI->>GraphQL: query Address with possibleDuplicateOf
    GraphQL->>DB: fetch Address + possibleDuplicateOf
    DB-->>GraphQL: return data
    GraphQL-->>AdminUI: response

    alt Duplicate Exists
        AdminUI->>User: show Resolve Duplicate button
        User->>AdminUI: choose Merge/Dismiss
        AdminUI->>AdminUI: confirmation dialog
        User->>AdminUI: confirm
        AdminUI->>GraphQL: resolveAddressDuplicate(action,winnerId)

        alt Merge
            GraphQL->>DB: move AddressSource records
            GraphQL->>DB: move AddressHeuristic records
            GraphQL->>DB: soft-delete loser
            GraphQL->>DB: clear possibleDuplicateOf
            DB-->>GraphQL: success
            GraphQL-->>AdminUI: merged
            AdminUI->>AdminUI: redirect to winner
        else Dismiss
            GraphQL->>DB: clear possibleDuplicateOf
            DB-->>GraphQL: success
            GraphQL-->>AdminUI: dismissed
            AdminUI->>AdminUI: reload
        end

        AdminUI-->>User: show success
    else No Duplicate
        AdminUI-->>User: no duplicate
    end
Loading
sequenceDiagram
    participant Search as Search Plugin
    participant Provider as Search Provider
    participant Matcher as Heuristic Matcher
    participant DB as Database

    Search->>Provider: normalize(searchResult)
    Provider-->>Search: normalizedBuilding
    Search->>Provider: extractHeuristics(normalizedBuilding)
    Provider-->>Search: heuristics
    Search->>Provider: generateAddressKey(normalizedBuilding)
    Provider-->>Search: addressKey
    Search->>Matcher: findAddressByHeuristics(heuristics)
    Matcher->>DB: query AddressHeuristic
    alt Match Found
        DB-->>Matcher: matching heuristic -> addressId
        Matcher-->>Search: addressId
    else No Match
        Matcher-->>Search: null
        Search->>DB: query by addressKey
    end
    Search->>Matcher: upsertHeuristics(addressId, heuristics)
    Matcher->>DB: insert/update heuristics, detect conflicts
    alt Conflict
        Matcher->>DB: find root address
        Matcher->>DB: set possibleDuplicateOf for conflicting address
    end
    DB-->>Matcher: success
    Matcher-->>Search: done
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐰 I hopped through keys and heuristics bright,
Found duplicates hidden out of sight,
I stamped a root and stitched the seams,
Merged addresses, fulfilled the dreams,
A tiny rabbit, tidying data tonight. 🌱✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 34.55% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: implementing a heuristics-based address deduplication system for the address service, with a reference to the ticket (DOMA-12746).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/address-service/doma-12746/improve-internal-keys-to-merge-data-from-dadata-and-pullenti

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a6b8dc49b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🤖 Fix all issues with AI agents
In `@apps/address-service/admin-ui/index.js`:
- Around line 113-125: Convert the click handler to an async function and
replace the resolveDuplicate().then()/catch() chain with an await inside a
try/catch: call await resolveDuplicate({ variables: { data: mutationData } })
and destructure the result to access result.result.status; on success perform
the same redirect logic using action, addressId and possibleDuplicate.id
(window.location.href or window.location.reload), and in the catch block log the
error with console.error('Failed to resolve duplicate', error) and show
alert(error.message); ensure the handler signature (onClick) is updated to be
async so await can be used.

In `@apps/address-service/bin/merge-duplicate-addresses.js`:
- Around line 85-88: The comment says we should pick the winner by checking
which Address.id is referenced in the condo database but the code always does
const winnerId = targetId; update the merge logic in
merge-duplicate-addresses.js: replace the unconditional winnerId = targetId
assignment with a reference-check against the condo DB for targetId and
address.id (use your existing condo access code or add a small query that counts
references), set winnerId to the id that has references, set loserId to the
other id, and if both or neither are referenced treat the case as ambiguous and
skip merging (log and continue). Ensure you only change the block around
winnerId/loserId assignment and preserve existing variables targetId and
address.id.

In `@apps/address-service/docs/MIGRATION-heuristics.md`:
- Around line 34-37: Update the docs to remove the misleading "makemigrations"
step and instruct users to only run the migration that is already committed;
specifically edit MIGRATION-heuristics.md to drop the `yarn workspace
`@app/address-service` run makemigrations` line and leave only `yarn workspace
`@app/address-service` run migrate`, and mention the committed migration file
`20260212163711-0008_addressheuristichistoryrecord_and_more.js` so readers know
they only need to apply the existing migration.

In `@apps/address-service/domains/address/utils/mergeAddresses.js`:
- Around line 29-50: The loop in mergeAddresses.js that moves heuristics
(loserHeuristics, find, AddressHeuristic.update) only checks for duplicates
scoped to the winner and can still violate the global unique constraint on
(type, value); change the duplicate check to search globally by type and value
(omit the address filter) with deletedAt: null before deciding to reconnect or
soft-delete, i.e., if any non-deleted AddressHeuristic exists with the same type
and value then soft-delete the loser's record, otherwise update the loser's
record to connect to winnerId; keep using context and dvSender and preserve the
current soft-delete behavior when a global match is found.
- Around line 18-69: The mergeAddresses function performs multiple writes
(AddressSource.update, AddressHeuristic.update, Address.update) without a
transaction so a mid-way failure leaves data inconsistent; wrap the whole body
of mergeAddresses in a single DB transaction (use your Keystone/knex transaction
helper or the transaction API exposed on the request/context) and pass the
transaction into all find/update/getById calls so they all commit or rollback
atomically; if a transaction API isn't available, at minimum surround the
sequence with try/catch, log the failing step (include winnerId/loserId and the
symbol where it failed like AddressHeuristic.update) and rethrow so callers can
handle or trigger manual recovery.

In
`@apps/address-service/domains/common/utils/services/search/heuristicMatcher.js`:
- Around line 139-154: The non-coordinate branch in upsertHeuristics is missing
the enabled filter; update the find call (the AddressHeuristic lookup in the
else branch of upsertHeuristics) to include enabled: true alongside type, value
and deletedAt: null so it matches findCoordinateHeuristicsInRange's behavior;
ensure you only change the query object passed to find and keep existing
variable names (heuristic, existingRecords, providerName, dvSender) intact.
- Around line 108-123: The findRootAddress function currently follows
possibleDuplicateOf links using getById('Address', currentId) but doesn't
exclude soft-deleted records, so traversal can stop at a deleted node; update
the lookup in findRootAddress to fetch only non-deleted addresses (i.e., ensure
getById or the query checks deletedAt is null) and continue traversal when a
record is deleted, and after loop add a fallback: if the resolved root is
soft-deleted (or no non-deleted root found) return the original addressId
instead of a deleted currentId to avoid pointing possibleDuplicateOf at deleted
records.

In
`@apps/address-service/domains/common/utils/services/search/providers/AbstractSearchProvider.js`:
- Around line 168-174: The two providers return different "no key" sentinel
values; change AbstractSuggestionProvider.generateAddressKey to return null
instead of an empty string so it matches
AbstractSearchProvider.generateAddressKey; update the return path in
AbstractSuggestionProvider.generateAddressKey (the branch that currently returns
'') to return null and run tests to ensure callers of generateAddressKey (across
both AbstractSuggestionProvider and AbstractSearchProvider) continue to work
with the unified null sentinel.
🧹 Nitpick comments (16)
apps/address-service/domains/common/utils/services/search/plugins/SearchByFiasId.spec.js (1)

163-166: Remove commented-out code and prefer mockResolvedValueOnce for async consistency.

Line 163 is dead commented-out code. Additionally, mockImplementationOnce returning a plain value (not a Promise) is inconsistent with line 288 which uses mockResolvedValue. Since the real function is async, prefer mockResolvedValueOnce for clarity and consistency.

♻️ Suggested cleanup
-            // mockCreateOrUpdateAddressWithSource.mockResolvedValue(mockCreatedAddress)
-            mockCreateOrUpdateAddressWithSource.mockImplementationOnce((...args) => {
-                return mockCreatedAddress
-            })
+            mockCreateOrUpdateAddressWithSource.mockResolvedValueOnce(mockCreatedAddress)

Based on learnings: "Use mockReturnValue instead of mockImplementation(() => ({ ... })) for consistent mock objects."

apps/address-service/domains/common/utils/services/InjectionsSeeker.js (2)

10-12: Duplicated constants across multiple files.

JOINER, SPACE_REPLACER, and SPECIAL_SYMBOLS_TO_REMOVE_REGEX are defined identically in at least four files: addressKeyUtils.js, AbstractSearchProvider.js, AbstractSuggestionProvider.js, and here. Consider extracting them into a shared constants module to avoid drift.


216-254: Duplicated fallback key generation logic.

generateFallbackKey replicates the fallback path in addressKeyUtils.js#generateAddressKey (same field list, same sanitization pipeline). Since generateAddressKey in addressKeyUtils.js is now deprecated, this duplication is expected during the transition, but consolidating the shared logic into a single utility would reduce the maintenance surface.

apps/address-service/domains/address/access/AddressHeuristic.js (1)

12-17: Minor: inconsistent boolean coercion with sibling access file.

canManageAddressHeuristics returns user.isAdmin || user.isSupport directly (truthy/falsy), while the neighboring ResolveAddressDuplicateService.js uses !!(user.isAdmin || user.isSupport) for explicit boolean coercion. Not a bug, but worth being consistent across access files.

apps/address-service/domains/common/utils/services/suggest/providers/AbstractSuggestionProvider.js (1)

7-9: Significant code duplication with AbstractSearchProvider.js.

The constants JOINER, SPACE_REPLACER, SPECIAL_SYMBOLS_TO_REMOVE_REGEX and the key-generation logic (lines 110–127) are identical to AbstractSearchProvider.generateFallbackKey (lines 25–27, 123–140 in that file). Consider extracting these into a shared utility (e.g., addressKeyUtils.js already exists in the codebase) to keep a single source of truth.

Also, if every part in normalizedBuilding.data is falsy, generateAddressKey returns '' — an empty string. Callers that use this as a DB key or heuristic value should guard against it. The AbstractSearchProvider variant does guard (if (!fallbackKey) return []), but this method has no such check.

Proposed: add empty-key guard and extract shared logic
     generateAddressKey (normalizedBuilding) {
         const data = normalizedBuilding.data
 
         // ... parts & transform pipeline ...
-            .toLowerCase()
+            .toLowerCase() || null
     }

For the duplication, consider a shared helper:

// in addressKeyUtils.js (or a new shared module)
function buildFallbackKeyFromParts (data) { /* shared implementation */ }

Also applies to: 90-128

apps/address-service/domains/common/utils/services/suggest/providers/PullentiSuggestionProvider.js (1)

84-92: Implementation is identical to DadataSuggestionProvider.generateAddressKey.

Both Pullenti and Dadata suggestion providers share the exact same generateAddressKey logic (check house_fias_id → return fias: prefix, else delegate to super). Consider moving this into AbstractSuggestionProvider as the default behavior (check house_fias_id first, then fall back to parts-based key), or into a small shared mixin/helper.

apps/address-service/domains/common/utils/services/search/heuristicMatcher.spec.js (1)

47-51: Consider adding null/undefined input tests.

The invalid-input tests cover malformed strings but not null or undefined arguments to coordinatesMatch and parseCoordinates. These are likely real-world inputs (e.g., a missing heuristic value). Worth a quick test to ensure no unexpected TypeError.

apps/address-service/bin/migrate-address-keys-to-heuristics.js (1)

62-79: No per-address error handling — a unique-key collision will crash the script mid-run.

If two addresses produce the same migrated key (e.g., duplicate fias:<uuid> entries), the Address.update at Line 75 will throw on the unique constraint and abort the entire batch. Consider wrapping the update in a try/catch to log the failure and continue, especially since the script is meant to be idempotent and re-runnable.

🛡️ Suggested improvement
             if (isDryRun) {
                 console.info(`  [DRY RUN] ${address.id}: ${address.key} → ${newKey}`)
             } else {
-                await Address.update(context, address.id, { dv, sender, key: newKey })
-                console.info(`  ${address.id}: ${address.key} → ${newKey}`)
+                try {
+                    await Address.update(context, address.id, { dv, sender, key: newKey })
+                    console.info(`  ${address.id}: ${address.key} → ${newKey}`)
+                } catch (err) {
+                    console.error(`  ❌ ${address.id}: Failed to migrate key ${address.key} → ${newKey}: ${err.message}`)
+                    totalSkipped++
+                    continue
+                }
             }
apps/address-service/bin/merge-duplicate-addresses.js (2)

64-66: totalSkipped is declared but never incremented.

The summary at Line 108 will always report 0 skipped. Either wire up skip logic (for ambiguous cases per the doc comment) or remove the variable.


79-98: No per-address error handling — one failed merge aborts the entire batch.

Same concern as in migrate-address-keys-to-heuristics.js: if mergeAddresses throws for a single pair (e.g., the target was already soft-deleted by a prior merge in the same run), the script crashes. A try/catch with logging would make the script more resilient for large datasets.

apps/address-service/domains/address/schema/AddressHeuristic.js (1)

45-57: Consider adding a database index on latitude/longitude for coordinate range queries.

The PR summary describes coordinate matching via DB range queries with COORDINATE_TOLERANCE = 0.00001. Without an index on these columns, those range queries will degrade as the AddressHeuristic table grows. A composite index on (latitude, longitude) filtered to deletedAt IS NULL would help.

apps/address-service/admin-ui/index.js (1)

71-78: Consider extracting shared path-parsing logic.

Lines 73-74 duplicate the path extraction at Lines 40-41 in UpdateAddress. A shared helper (e.g., useAddressIdFromPath()) would reduce duplication and prevent drift.

apps/address-service/bin/create-address-heuristics.js (3)

90-104: Inconsistent enabled filtering between coordinate and non-coordinate heuristic existence checks.

findCoordinateHeuristicsInRange (called on Line 94) filters by enabled: true, but the non-coordinate path (Lines 98-103) does not. This means disabled coordinate heuristics are invisible to the existence check while disabled non-coordinate heuristics are not.

In practice this is unlikely to cause a unique-constraint violation (coordinate values are fuzzy-matched so exact duplicates are rare), but the inconsistency could produce confusing results if disabled heuristics exist. Consider adding enabled: true to the non-coordinate query or removing it from the coordinate path, depending on the intended semantics.


106-127: Consider using structured logging for consistency.

The script uses console.info/console.error throughout, but the coding guidelines require getLogger() from @open-condo/keystone/logging with structured { msg, ... } objects for **/*.{js,ts} files.

While console output is common for CLI migration scripts, aligning with the project's logging pattern would keep the output machine-parseable and consistent with the rest of the codebase.

As per coding guidelines, "Use structured logging with standard fields (msg, data, entityId, entity, count, status, err) from @open-condo/keystone/logging" and "Use getLogger() from @open-condo/keystone/logging for logger initialization".


129-193: Sequential await in a tight loop — acceptable for a migration script, but consider batching for large datasets.

Each heuristic check and create is an individual await inside nested loops. For a one-off migration this is fine, but if the address table is large, this could take a long time. If performance becomes a concern, batching heuristic existence checks or creation could help.

apps/address-service/domains/common/utils/services/search/searchServiceUtils.js (1)

69-78: Minor: potential race condition on concurrent heuristic-based address creation.

Between findAddressByHeuristics (Line 70) and upsertHeuristics (Line 91), a concurrent request could create the same heuristic record, hitting the (type, value) unique constraint. The upsert logic does a check-then-create, but under high concurrency this TOCTOU gap could produce transient errors.

This is acceptable for the current workload, but if this path becomes high-throughput, consider wrapping the heuristic upsert in a try/catch to gracefully handle unique constraint violations (retry or skip).

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@apps/address-service/admin-ui/index.js`:
- Around line 130-134: The span used to render the clickable icon (style
ICON_STYLE with onClick and title) is not keyboard-accessible and currently has
a no-op onKeyDown; replace the <span> with a native <button type="button">
(apply ICON_STYLE to the button, keep title and onClick, and render <Link />
inside) so Enter/Space work natively and remove the no-op onKeyDown; if you
cannot use a button for some reason, add role="button", tabIndex={0}, and
implement onKeyDown to handle Enter/Space for the same behavior instead.
🧹 Nitpick comments (1)
apps/address-service/admin-ui/index.js (1)

71-74: Duplicated path-extraction logic could be shared.

The addressId derivation from location.pathname (split → splice → check against TARGET_URL_PART) is repeated verbatim from UpdateAddress (lines 36–41). Consider extracting a small useAddressId() hook to keep them in sync.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@apps/address-service/bin/local/merge-duplicate-addresses.js`:
- Around line 155-161: The loop resets skip to 0 in non-dry-run mode which
causes skipped address pairs to be re-fetched forever; instead introduce and
increment a persistent offset (e.g., totalSkipped) and use that as the query
skip/offset for both dry-run and non-dry-run runs so skipped records are skipped
on subsequent pages — update the loop to increment totalSkipped by
Math.min(pageSize, addresses.length) when records are skipped (or always
increment after each fetch) and replace the reset logic that sets skip = 0 with
using totalSkipped as the skip value passed to the query (keep isDryRun only
controlling whether merges are applied, not whether skip is reset).
- Around line 76-90: The resolveDuplicate function currently assumes
addressClient.executeAuthorizedMutation returns a populated data.result.status
and will throw if data is null/undefined or contains errors; update
resolveDuplicate to validate the mutation response from
addressClient.executeAuthorizedMutation (check for errors, ensure data and
data.result exist) before accessing data.result.status, and if invalid either
throw a descriptive error (including mutation name
RESOLVE_ADDRESS_DUPLICATE_MUTATION and addressId/winnerId) or return a safe
fallback value; reference the resolveDuplicate function, the data variable, and
the call to addressClient.executeAuthorizedMutation when making the changes.
- Around line 117-126: The loop processing addresses can throw if
address.possibleDuplicateOf is null (e.g., soft-deleted target), so add a guard
after extracting const target = address.possibleDuplicateOf to skip or log and
continue when target is null/undefined; update the console.info lines and
subsequent calls to isAddressReferenced(condoClient, target.id) to only run when
target exists (or use target?.id safely), ensuring the loop still increments
totalProcessed and handles skipping gracefully in the for..of over addresses in
merge-duplicate-addresses.js.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@apps/address-service/domains/common/utils/services/search/heuristicMatcher.js`:
- Around line 148-194: The upsertHeuristics loop can overwrite
possibleDuplicateOf when multiple heuristic conflicts occur; in
upsertHeuristics, after detecting a conflict (existingRecords.length > 0),
determine the desired resolution strategy and stop further overwrites — simplest
fix: after resolving the first conflict (call to findRootAddress and
AddressServerUtils.update for possibleDuplicateOf) break or return from the
function so later heuristic iterations don't overwrite the link; alternatively,
if you prefer deterministic selection, collect conflicting existingAddressIds
across the loop, pick the best root (by reliability or other criteria) using
findRootAddress, and call AddressServerUtils.update once with that chosen root
instead of updating per-heuristic.

In
`@apps/address-service/domains/common/utils/services/suggest/providers/DadataSuggestionProvider.js`:
- Around line 430-445: Update the JSDoc for generateAddressKey to accurately
reflect it can return null by changing the `@returns` annotation to
"{string|null}" and ensure references in callers/tests accept null; locate the
method generateAddressKey in DadataSuggestionProvider (and note constants
HEURISTIC_TYPE_FIAS_ID and HEURISTIC_TYPE_FALLBACK and the call to
super.generateAddressKey) and only change the comment/annotation (no runtime
behavior changes).
🧹 Nitpick comments (5)
apps/address-service/domains/common/utils/services/search/providers/AbstractSearchProvider.js (1)

96-143: Implementation is solid. One minor nit: the @returns {string} JSDoc on line 100 should be {string|null} since line 142 explicitly returns null for empty keys.

📝 Proposed fix
-     * `@returns` {string}
+     * `@returns` {string|null}
apps/address-service/admin-ui/index.js (1)

76-79: Duplicated addressId extraction logic.

The location.pathname.split('/').splice(2, 2)addressId pattern is repeated here (lines 78–79) and in UpdateAddress (lines 45–46). Consider extracting a small hook like useCurrentAddressId() to keep things DRY.

function useCurrentAddressId () {
    const location = useLocation()
    const path = location.pathname.split('/').splice(2, 2)
    return (path[0] === TARGET_URL_PART && path[1]) ? path[1] : null
}

Both components can then call const addressId = useCurrentAddressId().

apps/address-service/domains/address/utils/mergeAddresses.js (1)

62-69: getById does not filter soft-deleted records.

getById('Address', winnerId) on Line 63 may return a soft-deleted address. While unlikely in normal flow (the winner should be alive), consider using find with deletedAt: null for consistency, or at minimum checking winner.deletedAt.

apps/address-service/docs/MIGRATION-heuristics.md (1)

72-91: Script C instructions reference an undocumented .env file.

The instructions tell users to source apps/address-service/bin/local/.env without listing the required environment variables. Users unfamiliar with the setup may struggle. Consider briefly listing the expected variables (e.g., ADDRESS_SERVICE_URL, CONDO_URL, auth tokens) or pointing to a .env.example.

apps/address-service/domains/common/utils/services/search/heuristicMatcher.js (1)

21-25: parseCoordinates will throw on null/undefined input.

coordString.split(',') throws a TypeError if coordString is null or undefined. Since this function is publicly exported, consider a guard.

Proposed fix
 function parseCoordinates (coordString) {
+    if (!coordString || typeof coordString !== 'string') return null
     const [lat, lon] = coordString.split(',').map(parseFloat)

@AleX83Xpert AleX83Xpert force-pushed the feat/address-service/doma-12746/improve-internal-keys-to-merge-data-from-dadata-and-pullenti branch from a326f4d to a45c0d2 Compare February 17, 2026 09:14
@AleX83Xpert AleX83Xpert force-pushed the feat/address-service/doma-12746/improve-internal-keys-to-merge-data-from-dadata-and-pullenti branch from a45c0d2 to 94ab4b9 Compare February 19, 2026 05:01
@AleX83Xpert AleX83Xpert force-pushed the feat/address-service/doma-12746/improve-internal-keys-to-merge-data-from-dadata-and-pullenti branch from 94ab4b9 to 60ccfc1 Compare February 19, 2026 11:24
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 19, 2026

Confidence Score: 5/5

  • This PR is safe to merge - well-architected solution with proper constraint handling, race condition mitigation, and comprehensive migration tooling
  • All previously identified issues have been resolved (zero coordinate checks, self-reference guards, actualize service heuristic updates). The implementation demonstrates careful attention to edge cases: unique constraints prevent duplicate heuristics, dependency checks prevent dangling references during merge, two-pass conflict detection handles races, and coordinate tolerance uses efficient DB range queries. Migration scripts provide dry-run mode and detailed logging. No critical issues found.
  • No files require special attention - all implementations are solid and previous review feedback has been addressed

Important Files Changed

Filename Overview
apps/address-service/domains/address/schema/AddressHeuristic.js New model for storing provider-generated heuristic identifiers with proper unique constraint and coordinate fields for range queries
apps/address-service/domains/common/utils/services/search/heuristicMatcher.js Core matching logic with two-pass conflict detection, coordinate range queries, and self-reference guard correctly implemented
apps/address-service/domains/address/utils/mergeAddresses.js Shared merge utility with proper dependency checks, moves sources/heuristics, soft-deletes loser, clears possibleDuplicateOf on winner
apps/address-service/domains/address/schema/ActualizeAddressesService.js Updated to use shared merge utility and now calls upsertHeuristics after updating address data
apps/address-service/bin/create-address-heuristics.js Backfills AddressHeuristic records from Address.key and meta with batch processing, conflict detection, and fias_id: prefix support
apps/address-service/bin/local/merge-duplicate-addresses.js Bulk merge script that auto-merges clear cases based on Property references, skips ambiguous cases with clear logging

Last reviewed commit: 25b5b41

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

47 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@AleX83Xpert
Copy link
Member Author

@greptileai, you are welcome

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 19, 2026

Thank you! I appreciate your patience and collaboration. If you have any more questions about the PR or need clarification on any part of the heuristics-based address deduplication system, feel free to ask. Happy to help!

@AleX83Xpert AleX83Xpert added ✋🙂 Review please Comments are resolved, take a look, please and removed 🔬 WIP Not intended to be merged right now, it is a work in progress labels Feb 19, 2026
…gic from migrate-address-keys-to-heuristics script

Remove fias:<uuid> → fias_id:<uuid> migration logic as it's no longer needed. Update script to only handle fallback key prefix migration. Remove fiasToMigrateCount tracking, fias_id SQL update query, and related statistics. Update documentation to reflect that only fallback key migration is performed.
…veAddressDuplicateService and admin UI

Replace generic Error throws with structured GQLError in ResolveAddressDuplicateService. Add validation for soft-deleted target addresses. Restrict merge action to only allow possibleDuplicateOf as winnerId (remove bidirectional merge support). Add extractErrorMessage helper in admin UI to display user-friendly error messages from GraphQL errors. Update all validation tests to use expectToThrowGQLErrorToResult
…n AddressHeuristicHistoryRecord from 4 to 8 decimal places
…inate candidate query in create-address-heuristics script
…ount in create-address-heuristics script dry-run mode
…marketplace tests

Add createTestBillingIntegration setup in PaymentsFile, MarketPriceScope, and RegisterResidentInvoiceService tests.
…psert to ActualizeAddressesService

Add DadataSearchProvider to extract heuristics from DaData search results in ActualizeAddressesService. Call upsertHeuristics after address update to persist heuristics alongside actualized address data. Add comment explaining dual provider usage (SuggestionProvider for fresh data, SearchProvider for heuristic extraction).
…dataSearchProvider to handle zero values

Replace truthy check with explicit null check for geoLat/geoLon to correctly handle zero coordinates (e.g., locations near equator/prime meridian). Previous implementation would skip valid coordinates with zero values.
…e in upsertHeuristics to end of function

Move possibleDuplicateOf update from first pass to end of function to ensure only one update occurs. Compare conflicts from both passes and select highest reliability match overall.
…ject for structured logging

Wrap all logger call parameters (addressId, winnerId, loserId, etc.) in a data object to follow structured logging conventions. Update logger calls in ResolveAddressDuplicateService, mergeAddresses, and heuristicMatcher.
…hitespace in parseCoordinates helper

Add null/type check for coordString parameter and trim whitespace before parsing to handle malformed coordinate strings. Prevents errors when coordString is null, undefined, or non-string.
…ss is referenced in condo Properties

Skip merge operation when current address (duplicate) is referenced in Properties, as the mutation requires the target to be the winner. Previously attempted to swap winner/loser which would violate merge constraints.
…eate-address-heuristics to handle zero values
…ting possibleDuplicateOf in upsertHeuristics

Add validation to prevent setting possibleDuplicateOf to the same address (self-link). Log warning when rootAddressId equals addressId and skip the update operation to avoid creating invalid duplicate relationships.
…ses script with batch processing and progress tracking

Add batch property reference checking to reduce database queries, implement progress bar visualization, and improve logging format. Replace per-address isAddressReferenced calls with single batch getReferencedAddressIds query. Add total record count display and page-based progress indicators.
@AleX83Xpert AleX83Xpert force-pushed the feat/address-service/doma-12746/improve-internal-keys-to-merge-data-from-dadata-and-pullenti branch from 25b5b41 to 2910d11 Compare March 10, 2026 03:49
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@sonarqubecloud
Copy link

@AleX83Xpert
Copy link
Member Author

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Hey! We are open-source!
image
@greptileai

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@AleX83Xpert AleX83Xpert merged commit a0e6dec into main Mar 10, 2026
169 of 189 checks passed
@AleX83Xpert AleX83Xpert deleted the feat/address-service/doma-12746/improve-internal-keys-to-merge-data-from-dadata-and-pullenti branch March 10, 2026 09:07
@AleX83Xpert
Copy link
Member Author

@paulo-rossy, please pay attention to this merged PR.
You need to transform your database after this code is deployed.
See the docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🐘 BIG No so easy to review changes up to 1300 lines of code 🚨 Migrations We have a database migrations here! ✋🙂 Review please Comments are resolved, take a look, please

Development

Successfully merging this pull request may close these issues.

2 participants