feat(address-service): DOMA-12746 implement heuristics-based address deduplication system #7196
📝 Walkthrough
This PR adds heuristic-based address deduplication: a new AddressHeuristic model and DB migration, heuristics extraction/upsert/matching, provider-based addressKey generation, a merge/dismiss resolution service and admin UI, a merge utility, migration scripts, and extensive tests and tooling.
Sequence Diagram(s)
sequenceDiagram
actor User
participant AdminUI as Admin UI
participant GraphQL as Address Service\nGraphQL API
participant DB as Database
User->>AdminUI: Open address detail
AdminUI->>GraphQL: query Address with possibleDuplicateOf
GraphQL->>DB: fetch Address + possibleDuplicateOf
DB-->>GraphQL: return data
GraphQL-->>AdminUI: response
alt Duplicate Exists
AdminUI->>User: show Resolve Duplicate button
User->>AdminUI: choose Merge/Dismiss
AdminUI->>AdminUI: confirmation dialog
User->>AdminUI: confirm
AdminUI->>GraphQL: resolveAddressDuplicate(action,winnerId)
alt Merge
GraphQL->>DB: move AddressSource records
GraphQL->>DB: move AddressHeuristic records
GraphQL->>DB: soft-delete loser
GraphQL->>DB: clear possibleDuplicateOf
DB-->>GraphQL: success
GraphQL-->>AdminUI: merged
AdminUI->>AdminUI: redirect to winner
else Dismiss
GraphQL->>DB: clear possibleDuplicateOf
DB-->>GraphQL: success
GraphQL-->>AdminUI: dismissed
AdminUI->>AdminUI: reload
end
AdminUI-->>User: show success
else No Duplicate
AdminUI-->>User: no duplicate
end
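The merge/dismiss branches in the diagram above can be sketched in memory as follows. This is only an illustration under stated assumptions: the `createDb` shape, the `resolveAddressDuplicate` argument object, and the status strings are hypothetical and are not taken from the service code. The real service works against Keystone models rather than plain arrays.

```javascript
// Hypothetical in-memory stand-in for the database (assumption, not the real schema).
function createDb () {
    return {
        addresses: new Map(), // id -> { id, possibleDuplicateOf, deletedAt }
        sources: [],          // { id, addressId }
        heuristics: [],       // { id, addressId, type, value, deletedAt }
    }
}

// Sketch of the two branches in the diagram: 'merge' and 'dismiss'.
function resolveAddressDuplicate (db, { addressId, action, winnerId }) {
    const address = db.addresses.get(addressId)
    if (!address || address.deletedAt) throw new Error('Address not found')
    if (!address.possibleDuplicateOf) throw new Error('Address has no possibleDuplicateOf')

    if (action === 'dismiss') {
        // Dismiss: only clear the link, nothing else changes.
        address.possibleDuplicateOf = null
        return { status: 'dismissed' }
    }
    if (action !== 'merge') throw new Error(`Unknown action: ${action}`)

    // Merge: the winner must be the address the duplicate points at.
    if (winnerId !== address.possibleDuplicateOf) throw new Error('winnerId must equal possibleDuplicateOf')
    const winner = db.addresses.get(winnerId)
    if (!winner || winner.deletedAt) throw new Error('Winner is missing or soft-deleted')

    // Move AddressSource records to the winner.
    for (const source of db.sources) {
        if (source.addressId === addressId) source.addressId = winnerId
    }
    // Move live AddressHeuristic records to the winner.
    for (const heuristic of db.heuristics) {
        if (heuristic.addressId === addressId && !heuristic.deletedAt) heuristic.addressId = winnerId
    }
    // Soft-delete the loser and clear its link.
    address.deletedAt = new Date().toISOString()
    address.possibleDuplicateOf = null
    return { status: 'merged', winnerId }
}
```

The sketch is synchronous and has no transaction handling; the review comments further down discuss why the real multi-step write sequence benefits from atomicity.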
sequenceDiagram
participant Search as Search Plugin
participant Provider as Search Provider
participant Matcher as Heuristic Matcher
participant DB as Database
Search->>Provider: normalize(searchResult)
Provider-->>Search: normalizedBuilding
Search->>Provider: extractHeuristics(normalizedBuilding)
Provider-->>Search: heuristics
Search->>Provider: generateAddressKey(normalizedBuilding)
Provider-->>Search: addressKey
Search->>Matcher: findAddressByHeuristics(heuristics)
Matcher->>DB: query AddressHeuristic
alt Match Found
DB-->>Matcher: matching heuristic -> addressId
Matcher-->>Search: addressId
else No Match
Matcher-->>Search: null
Search->>DB: query by addressKey
end
Search->>Matcher: upsertHeuristics(addressId, heuristics)
Matcher->>DB: insert/update heuristics, detect conflicts
alt Conflict
Matcher->>DB: find root address
Matcher->>DB: set possibleDuplicateOf for conflicting address
end
DB-->>Matcher: success
Matcher-->>Search: done
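The lookup half of the diagram above (heuristic match first, addressKey as the fallback) might look roughly like this. It is a sketch under assumptions: `records` stands in for the AddressHeuristic table, `addressesByKey` for a key lookup, and the reliability numbers used by callers are invented for illustration. Coordinate range matching is left out to keep the example short.

```javascript
// Sketch: find an address by exact (type, value) heuristic match, most reliable first.
// `records` is a hypothetical array of { addressId, type, value, enabled, deletedAt }.
function findAddressByHeuristics (records, heuristics) {
    // Copy before sorting so the caller's array is not mutated.
    const sorted = [...heuristics].sort((a, b) => b.reliability - a.reliability)
    for (const heuristic of sorted) {
        const match = records.find(record =>
            !record.deletedAt &&
            record.enabled &&
            record.type === heuristic.type &&
            record.value === heuristic.value
        )
        if (match) return match.addressId // first match wins
    }
    return null
}

// Falls back to the addressKey lookup when no heuristic matches, as in the diagram.
// `addressesByKey` is a hypothetical Map of addressKey -> addressId.
function resolveAddressId (records, addressesByKey, heuristics, addressKey) {
    const byHeuristic = findAddressByHeuristics(records, heuristics)
    if (byHeuristic) return byHeuristic
    return addressesByKey.get(addressKey) ?? null
}
```

Sorting by reliability before matching means a strong identifier such as a FIAS ID decides the result even when a weaker fallback key points at a different address.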
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8a6b8dc49b
Actionable comments posted: 8
🤖 Fix all issues with AI agents
In `@apps/address-service/admin-ui/index.js`:
- Around line 113-125: Convert the click handler to an async function and
replace the resolveDuplicate().then()/catch() chain with an await inside a
try/catch: call await resolveDuplicate({ variables: { data: mutationData } })
and destructure the result to access result.result.status; on success perform
the same redirect logic using action, addressId and possibleDuplicate.id
(window.location.href or window.location.reload), and in the catch block log the
error with console.error('Failed to resolve duplicate', error) and show
alert(error.message); ensure the handler signature (onClick) is updated to be
async so await can be used.
In `@apps/address-service/bin/merge-duplicate-addresses.js`:
- Around line 85-88: The comment says we should pick the winner by checking
which Address.id is referenced in the condo database but the code always does
const winnerId = targetId; update the merge logic in
merge-duplicate-addresses.js: replace the unconditional winnerId = targetId
assignment with a reference-check against the condo DB for targetId and
address.id (use your existing condo access code or add a small query that counts
references), set winnerId to the id that has references, set loserId to the
other id, and if both or neither are referenced treat the case as ambiguous and
skip merging (log and continue). Ensure you only change the block around
winnerId/loserId assignment and preserve existing variables targetId and
address.id.
In `@apps/address-service/docs/MIGRATION-heuristics.md`:
- Around line 34-37: Update the docs to remove the misleading "makemigrations"
step and instruct users to only run the migration that is already committed;
specifically edit MIGRATION-heuristics.md to drop the `yarn workspace
`@app/address-service` run makemigrations` line and leave only `yarn workspace
`@app/address-service` run migrate`, and mention the committed migration file
`20260212163711-0008_addressheuristichistoryrecord_and_more.js` so readers know
they only need to apply the existing migration.
In `@apps/address-service/domains/address/utils/mergeAddresses.js`:
- Around line 29-50: The loop in mergeAddresses.js that moves heuristics
(loserHeuristics, find, AddressHeuristic.update) only checks for duplicates
scoped to the winner and can still violate the global unique constraint on
(type, value); change the duplicate check to search globally by type and value
(omit the address filter) with deletedAt: null before deciding to reconnect or
soft-delete, i.e., if any non-deleted AddressHeuristic exists with the same type
and value then soft-delete the loser's record, otherwise update the loser's
record to connect to winnerId; keep using context and dvSender and preserve the
current soft-delete behavior when a global match is found.
- Around line 18-69: The mergeAddresses function performs multiple writes
(AddressSource.update, AddressHeuristic.update, Address.update) without a
transaction so a mid-way failure leaves data inconsistent; wrap the whole body
of mergeAddresses in a single DB transaction (use your Keystone/knex transaction
helper or the transaction API exposed on the request/context) and pass the
transaction into all find/update/getById calls so they all commit or rollback
atomically; if a transaction API isn't available, at minimum surround the
sequence with try/catch, log the failing step (include winnerId/loserId and the
symbol where it failed like AddressHeuristic.update) and rethrow so callers can
handle or trigger manual recovery.
In
`@apps/address-service/domains/common/utils/services/search/heuristicMatcher.js`:
- Around line 139-154: The non-coordinate branch in upsertHeuristics is missing
the enabled filter; update the find call (the AddressHeuristic lookup in the
else branch of upsertHeuristics) to include enabled: true alongside type, value
and deletedAt: null so it matches findCoordinateHeuristicsInRange's behavior;
ensure you only change the query object passed to find and keep existing
variable names (heuristic, existingRecords, providerName, dvSender) intact.
- Around line 108-123: The findRootAddress function currently follows
possibleDuplicateOf links using getById('Address', currentId) but doesn't
exclude soft-deleted records, so traversal can stop at a deleted node; update
the lookup in findRootAddress to fetch only non-deleted addresses (i.e., ensure
getById or the query checks deletedAt is null) and continue traversal when a
record is deleted, and after loop add a fallback: if the resolved root is
soft-deleted (or no non-deleted root found) return the original addressId
instead of a deleted currentId to avoid pointing possibleDuplicateOf at deleted
records.
In
`@apps/address-service/domains/common/utils/services/search/providers/AbstractSearchProvider.js`:
- Around line 168-174: The two providers return different "no key" sentinel
values; change AbstractSuggestionProvider.generateAddressKey to return null
instead of an empty string so it matches
AbstractSearchProvider.generateAddressKey; update the return path in
AbstractSuggestionProvider.generateAddressKey (the branch that currently returns
'') to return null and run tests to ensure callers of generateAddressKey (across
both AbstractSuggestionProvider and AbstractSearchProvider) continue to work
with the unified null sentinel.
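The `findRootAddress` suggestion above (do not stop at, or return, a soft-deleted record) could be approached as in the sketch below. It is one possible reading of the comment, not the code that was merged: `addresses` is a hypothetical Map standing in for the Address table, `maxDepth` is an assumed safety limit, and this variant returns the deepest live address on the chain, falling back to the original `addressId` when no other live address is found.

```javascript
// Sketch: follow possibleDuplicateOf links without ever returning a soft-deleted address.
// `addresses` is a hypothetical Map of id -> { possibleDuplicateOf, deletedAt }.
function findRootAddress (addresses, addressId, maxDepth = 10) {
    const visited = new Set()
    let currentId = addressId
    let lastAliveId = addressId // fallback: the address we started from
    for (let depth = 0; depth < maxDepth; depth++) {
        if (visited.has(currentId)) break // cycle guard
        visited.add(currentId)
        const current = addresses.get(currentId)
        if (!current) break
        // Remember live addresses only; deleted ones are traversed but never returned.
        if (!current.deletedAt) lastAliveId = currentId
        if (!current.possibleDuplicateOf) break
        currentId = current.possibleDuplicateOf
    }
    return lastAliveId
}
```

Whether to return the deepest live address or the original one when a deleted node is encountered is a design choice; the review comment asks for the latter as the minimum safe behavior.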
🧹 Nitpick comments (16)
apps/address-service/domains/common/utils/services/search/plugins/SearchByFiasId.spec.js (1)
163-166: Remove commented-out code and prefer mockResolvedValueOnce for async consistency.
Line 163 is dead commented-out code. Additionally, mockImplementationOnce returning a plain value (not a Promise) is inconsistent with line 288, which uses mockResolvedValue. Since the real function is async, prefer mockResolvedValueOnce for clarity and consistency.
♻️ Suggested cleanup
- // mockCreateOrUpdateAddressWithSource.mockResolvedValue(mockCreatedAddress)
- mockCreateOrUpdateAddressWithSource.mockImplementationOnce((...args) => {
-     return mockCreatedAddress
- })
+ mockCreateOrUpdateAddressWithSource.mockResolvedValueOnce(mockCreatedAddress)
Based on learnings: "Use mockReturnValue instead of mockImplementation(() => ({ ... })) for consistent mock objects."
apps/address-service/domains/common/utils/services/InjectionsSeeker.js (2)
10-12: Duplicated constants across multiple files.
JOINER, SPACE_REPLACER, and SPECIAL_SYMBOLS_TO_REMOVE_REGEX are defined identically in at least four files: addressKeyUtils.js, AbstractSearchProvider.js, AbstractSuggestionProvider.js, and here. Consider extracting them into a shared constants module to avoid drift.
216-254: Duplicated fallback key generation logic.
generateFallbackKey replicates the fallback path in addressKeyUtils.js#generateAddressKey (same field list, same sanitization pipeline). Since generateAddressKey in addressKeyUtils.js is now deprecated, this duplication is expected during the transition, but consolidating the shared logic into a single utility would reduce the maintenance surface.
apps/address-service/domains/address/access/AddressHeuristic.js (1)
12-17: Minor: inconsistent boolean coercion with sibling access file.
canManageAddressHeuristics returns user.isAdmin || user.isSupport directly (truthy/falsy), while the neighboring ResolveAddressDuplicateService.js uses !!(user.isAdmin || user.isSupport) for explicit boolean coercion. Not a bug, but worth being consistent across access files.
apps/address-service/domains/common/utils/services/suggest/providers/AbstractSuggestionProvider.js (1)
7-9: Significant code duplication with AbstractSearchProvider.js.
The constants JOINER, SPACE_REPLACER, SPECIAL_SYMBOLS_TO_REMOVE_REGEX and the key-generation logic (lines 110–127) are identical to AbstractSearchProvider.generateFallbackKey (lines 25–27, 123–140 in that file). Consider extracting these into a shared utility (e.g., addressKeyUtils.js already exists in the codebase) to keep a single source of truth.
Also, if every part in normalizedBuilding.data is falsy, generateAddressKey returns '' (an empty string). Callers that use this as a DB key or heuristic value should guard against it. The AbstractSearchProvider variant does guard (if (!fallbackKey) return []), but this method has no such check.
Proposed: add empty-key guard and extract shared logic
generateAddressKey (normalizedBuilding) {
    const data = normalizedBuilding.data
    // ... parts & transform pipeline ...
-        .toLowerCase()
+        .toLowerCase() || null
}
For the duplication, consider a shared helper:
// in addressKeyUtils.js (or a new shared module)
function buildFallbackKeyFromParts (data) { /* shared implementation */ }
Also applies to: 90-128
apps/address-service/domains/common/utils/services/suggest/providers/PullentiSuggestionProvider.js (1)
84-92: Implementation is identical to DadataSuggestionProvider.generateAddressKey.
Both Pullenti and Dadata suggestion providers share the exact same generateAddressKey logic (check house_fias_id → return fias: prefix, else delegate to super). Consider moving this into AbstractSuggestionProvider as the default behavior (check house_fias_id first, then fall back to parts-based key), or into a small shared mixin/helper.
apps/address-service/domains/common/utils/services/search/heuristicMatcher.spec.js (1)
47-51: Consider adding null/undefined input tests.
The invalid-input tests cover malformed strings but not null or undefined arguments to coordinatesMatch and parseCoordinates. These are likely real-world inputs (e.g., a missing heuristic value). Worth a quick test to ensure no unexpected TypeError.
apps/address-service/bin/migrate-address-keys-to-heuristics.js (1)
62-79: No per-address error handling: a unique-key collision will crash the script mid-run.
If two addresses produce the same migrated key (e.g., duplicate fias:<uuid> entries), the Address.update at Line 75 will throw on the unique constraint and abort the entire batch. Consider wrapping the update in a try/catch to log the failure and continue, especially since the script is meant to be idempotent and re-runnable.
🛡️ Suggested improvement
if (isDryRun) {
    console.info(` [DRY RUN] ${address.id}: ${address.key} → ${newKey}`)
} else {
-    await Address.update(context, address.id, { dv, sender, key: newKey })
-    console.info(` ${address.id}: ${address.key} → ${newKey}`)
+    try {
+        await Address.update(context, address.id, { dv, sender, key: newKey })
+        console.info(` ${address.id}: ${address.key} → ${newKey}`)
+    } catch (err) {
+        console.error(` ❌ ${address.id}: Failed to migrate key ${address.key} → ${newKey}: ${err.message}`)
+        totalSkipped++
+        continue
+    }
}
apps/address-service/bin/merge-duplicate-addresses.js (2)
64-66: totalSkipped is declared but never incremented.
The summary at Line 108 will always report 0 skipped. Either wire up skip logic (for ambiguous cases per the doc comment) or remove the variable.
79-98: No per-address error handling: one failed merge aborts the entire batch.
Same concern as in migrate-address-keys-to-heuristics.js: if mergeAddresses throws for a single pair (e.g., the target was already soft-deleted by a prior merge in the same run), the script crashes. A try/catch with logging would make the script more resilient for large datasets.
apps/address-service/domains/address/schema/AddressHeuristic.js (1)
45-57: Consider adding a database index on latitude/longitude for coordinate range queries.
The PR summary describes coordinate matching via DB range queries with COORDINATE_TOLERANCE = 0.00001. Without an index on these columns, those range queries will degrade as the AddressHeuristic table grows. A composite index on (latitude, longitude) filtered to deletedAt IS NULL would help.
apps/address-service/admin-ui/index.js (1)
71-78: Consider extracting shared path-parsing logic.
Lines 73-74 duplicate the path extraction at Lines 40-41 in UpdateAddress. A shared helper (e.g., useAddressIdFromPath()) would reduce duplication and prevent drift.
apps/address-service/bin/create-address-heuristics.js (3)
90-104: Inconsistent enabled filtering between coordinate and non-coordinate heuristic existence checks.
findCoordinateHeuristicsInRange (called on Line 94) filters by enabled: true, but the non-coordinate path (Lines 98-103) does not. This means disabled coordinate heuristics are invisible to the existence check while disabled non-coordinate heuristics are not.
In practice this is unlikely to cause a unique-constraint violation (coordinate values are fuzzy-matched so exact duplicates are rare), but the inconsistency could produce confusing results if disabled heuristics exist. Consider adding enabled: true to the non-coordinate query or removing it from the coordinate path, depending on the intended semantics.
106-127: Consider using structured logging for consistency.
The script uses console.info/console.error throughout, but the coding guidelines require getLogger() from @open-condo/keystone/logging with structured { msg, ... } objects for **/*.{js,ts} files.
While console output is common for CLI migration scripts, aligning with the project's logging pattern would keep the output machine-parseable and consistent with the rest of the codebase.
As per coding guidelines, "Use structured logging with standard fields (msg, data, entityId, entity, count, status, err) from @open-condo/keystone/logging" and "Use getLogger() from @open-condo/keystone/logging for logger initialization".
129-193: Sequential await in a tight loop: acceptable for a migration script, but consider batching for large datasets.
Each heuristic check and create is an individual await inside nested loops. For a one-off migration this is fine, but if the address table is large, this could take a long time. If performance becomes a concern, batching heuristic existence checks or creation could help.
apps/address-service/domains/common/utils/services/search/searchServiceUtils.js (1)
69-78: Minor: potential race condition on concurrent heuristic-based address creation.
Between findAddressByHeuristics (Line 70) and upsertHeuristics (Line 91), a concurrent request could create the same heuristic record, hitting the (type, value) unique constraint. The upsert logic does a check-then-create, but under high concurrency this TOCTOU gap could produce transient errors.
This is acceptable for the current workload, but if this path becomes high-throughput, consider wrapping the heuristic upsert in a try/catch to gracefully handle unique constraint violations (retry or skip).
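The "try/catch around the upsert" idea from the race-condition note could be sketched as below. Everything here is hypothetical: `HeuristicStore` and `UniqueViolationError` are stand-ins for the database and its constraint error, and the sketch is synchronous for brevity, whereas the real calls are async.

```javascript
// Hypothetical error type raised when the (type, value) unique constraint is hit.
class UniqueViolationError extends Error {}

// Hypothetical store enforcing uniqueness on (type, value).
class HeuristicStore {
    constructor () {
        this.rows = new Map()
    }

    find (type, value) {
        return this.rows.get(`${type}:${value}`) || null
    }

    insert (row) {
        const key = `${row.type}:${row.value}`
        if (this.rows.has(key)) throw new UniqueViolationError(`duplicate ${key}`)
        this.rows.set(key, row)
        return row
    }
}

// Check-then-create that tolerates losing the race to a concurrent writer.
function upsertHeuristic (store, row) {
    const existing = store.find(row.type, row.value)
    if (existing) return { record: existing, created: false }
    try {
        return { record: store.insert(row), created: true }
    } catch (err) {
        if (!(err instanceof UniqueViolationError)) throw err
        // Another writer inserted the same (type, value) between our check and insert:
        // re-read and treat the existing row as the result instead of failing.
        return { record: store.find(row.type, row.value), created: false }
    }
}
```

Re-reading after the violation lets the caller run its normal conflict handling (for example, comparing `record.addressId` with its own address) instead of surfacing a transient error.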
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@apps/address-service/admin-ui/index.js`:
- Around line 130-134: The span used to render the clickable icon (style
ICON_STYLE with onClick and title) is not keyboard-accessible and currently has
a no-op onKeyDown; replace the <span> with a native <button type="button">
(apply ICON_STYLE to the button, keep title and onClick, and render <Link />
inside) so Enter/Space work natively and remove the no-op onKeyDown; if you
cannot use a button for some reason, add role="button", tabIndex={0}, and
implement onKeyDown to handle Enter/Space for the same behavior instead.
🧹 Nitpick comments (1)
apps/address-service/admin-ui/index.js (1)
71-74: Duplicated path-extraction logic could be shared.
The addressId derivation from location.pathname (split → splice → check against TARGET_URL_PART) is repeated verbatim from UpdateAddress (lines 36–41). Consider extracting a small useAddressId() hook to keep them in sync.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@apps/address-service/bin/local/merge-duplicate-addresses.js`:
- Around line 155-161: The loop resets skip to 0 in non-dry-run mode which
causes skipped address pairs to be re-fetched forever; instead introduce and
increment a persistent offset (e.g., totalSkipped) and use that as the query
skip/offset for both dry-run and non-dry-run runs so skipped records are skipped
on subsequent pages — update the loop to increment totalSkipped by
Math.min(pageSize, addresses.length) when records are skipped (or always
increment after each fetch) and replace the reset logic that sets skip = 0 with
using totalSkipped as the skip value passed to the query (keep isDryRun only
controlling whether merges are applied, not whether skip is reset).
- Around line 76-90: The resolveDuplicate function currently assumes
addressClient.executeAuthorizedMutation returns a populated data.result.status
and will throw if data is null/undefined or contains errors; update
resolveDuplicate to validate the mutation response from
addressClient.executeAuthorizedMutation (check for errors, ensure data and
data.result exist) before accessing data.result.status, and if invalid either
throw a descriptive error (including mutation name
RESOLVE_ADDRESS_DUPLICATE_MUTATION and addressId/winnerId) or return a safe
fallback value; reference the resolveDuplicate function, the data variable, and
the call to addressClient.executeAuthorizedMutation when making the changes.
- Around line 117-126: The loop processing addresses can throw if
address.possibleDuplicateOf is null (e.g., soft-deleted target), so add a guard
after extracting const target = address.possibleDuplicateOf to skip or log and
continue when target is null/undefined; update the console.info lines and
subsequent calls to isAddressReferenced(condoClient, target.id) to only run when
target exists (or use target?.id safely), ensuring the loop still increments
totalProcessed and handles skipping gracefully in the for..of over addresses in
merge-duplicate-addresses.js.
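The paging concern in the first comment above can be illustrated with a small loop. This is a sketch under assumptions: `fetchPage`, `shouldSkip`, and `merge` are hypothetical callbacks rather than the script's real helpers, and the query is assumed to return only addresses that still have possibleDuplicateOf set, so merged records drop out of later pages while skipped ones remain.

```javascript
// Sketch of a paging loop that keeps a persistent offset for records left in the result set.
function processDuplicates ({ fetchPage, shouldSkip, merge, pageSize, isDryRun }) {
    let offset = 0 // counts records that stay in the result set after processing
    let merged = 0
    let skipped = 0
    for (;;) {
        const page = fetchPage({ skip: offset, first: pageSize })
        if (page.length === 0) break
        for (const address of page) {
            if (shouldSkip(address)) {
                skipped++
                offset++ // still matches the query, so step past it next time
            } else if (isDryRun) {
                merged++
                offset++ // nothing was changed, so it also still matches the query
            } else {
                merge(address) // no longer matches the query; offset must not advance
                merged++
            }
        }
    }
    return { merged, skipped }
}
```

Advancing the offset only for records that remain in the result set avoids both failure modes: re-fetching skipped pairs forever, and jumping over unprocessed pairs after a merge shrinks the set.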
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In
`@apps/address-service/domains/common/utils/services/search/heuristicMatcher.js`:
- Around line 148-194: The upsertHeuristics loop can overwrite
possibleDuplicateOf when multiple heuristic conflicts occur; in
upsertHeuristics, after detecting a conflict (existingRecords.length > 0),
determine the desired resolution strategy and stop further overwrites — simplest
fix: after resolving the first conflict (call to findRootAddress and
AddressServerUtils.update for possibleDuplicateOf) break or return from the
function so later heuristic iterations don't overwrite the link; alternatively,
if you prefer deterministic selection, collect conflicting existingAddressIds
across the loop, pick the best root (by reliability or other criteria) using
findRootAddress, and call AddressServerUtils.update once with that chosen root
instead of updating per-heuristic.
In
`@apps/address-service/domains/common/utils/services/suggest/providers/DadataSuggestionProvider.js`:
- Around line 430-445: Update the JSDoc for generateAddressKey to accurately
reflect it can return null by changing the `@returns` annotation to
"{string|null}" and ensure references in callers/tests accept null; locate the
method generateAddressKey in DadataSuggestionProvider (and note constants
HEURISTIC_TYPE_FIAS_ID and HEURISTIC_TYPE_FALLBACK and the call to
super.generateAddressKey) and only change the comment/annotation (no runtime
behavior changes).
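The "collect conflicts, then update once" alternative from the first comment above could look like the following. It is a sketch with assumed names: `pickDuplicateTarget`, the `conflicts` shape, and the injected `findRoot` callback are hypothetical, and the reliability values are whatever the heuristics carry.

```javascript
// Sketch: choose a single possibleDuplicateOf target from all conflicts found during an upsert.
// conflicts: [{ existingAddressId, reliability }]; findRoot: (addressId) => rootAddressId
function pickDuplicateTarget (addressId, conflicts, findRoot) {
    // A heuristic that already belongs to this address is not a conflict.
    const candidates = conflicts.filter(conflict => conflict.existingAddressId !== addressId)
    if (candidates.length === 0) return null

    // Deterministic selection: the conflict backed by the most reliable heuristic wins.
    const best = candidates.reduce((a, b) => (b.reliability > a.reliability ? b : a))

    // Always link to the root of the chain, and never to the address itself.
    const rootId = findRoot(best.existingAddressId)
    return rootId === addressId ? null : rootId
}
```

The caller would then issue at most one update of possibleDuplicateOf per upsert, using the returned id, so later heuristics in the loop cannot overwrite an earlier link.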
🧹 Nitpick comments (5)
apps/address-service/domains/common/utils/services/search/providers/AbstractSearchProvider.js (1)
96-143: Implementation is solid. One minor nit: the @returns {string} JSDoc on line 100 should be {string|null} since line 142 explicitly returns null for empty keys.
📝 Proposed fix
- * `@returns` {string}
+ * `@returns` {string|null}
apps/address-service/admin-ui/index.js (1)
76-79: Duplicated addressId extraction logic.
The location.pathname.split('/').splice(2, 2) → addressId pattern is repeated here (lines 78–79) and in UpdateAddress (lines 45–46). Consider extracting a small hook like useCurrentAddressId() to keep things DRY.
function useCurrentAddressId () {
    const location = useLocation()
    const path = location.pathname.split('/').splice(2, 2)
    return (path[0] === TARGET_URL_PART && path[1]) ? path[1] : null
}
Both components can then call const addressId = useCurrentAddressId().
apps/address-service/domains/address/utils/mergeAddresses.js (1)
62-69: getById does not filter soft-deleted records.
getById('Address', winnerId) on Line 63 may return a soft-deleted address. While unlikely in normal flow (the winner should be alive), consider using find with deletedAt: null for consistency, or at minimum checking winner.deletedAt.
apps/address-service/docs/MIGRATION-heuristics.md (1)
72-91: Script C instructions reference an undocumented .env file.
The instructions tell users to source apps/address-service/bin/local/.env without listing the required environment variables. Users unfamiliar with the setup may struggle. Consider briefly listing the expected variables (e.g., ADDRESS_SERVICE_URL, CONDO_URL, auth tokens) or pointing to a .env.example.
apps/address-service/domains/common/utils/services/search/heuristicMatcher.js (1)
21-25: parseCoordinates will throw on null/undefined input.
coordString.split(',') throws a TypeError if coordString is null or undefined. Since this function is publicly exported, consider a guard.
Proposed fix
function parseCoordinates (coordString) {
+    if (!coordString || typeof coordString !== 'string') return null
    const [lat, lon] = coordString.split(',').map(parseFloat)
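A self-contained version of the guarded parser, together with a tolerance comparison, might look like this. It is a sketch rather than the module's actual code: the `{ latitude, longitude }` return shape and the `coordinatesMatch` signature are assumptions, and only the `COORDINATE_TOLERANCE = 0.00001` value comes from the PR description.

```javascript
// Tolerance from the PR description (about 1.1 m of latitude).
const COORDINATE_TOLERANCE = 0.00001

// Parses "lat,lon" into numbers; returns null for anything that is not a valid pair.
function parseCoordinates (coordString) {
    if (!coordString || typeof coordString !== 'string') return null
    const parts = coordString.trim().split(',')
    if (parts.length !== 2) return null
    const [latitude, longitude] = parts.map(part => Number.parseFloat(part))
    // Number.isFinite keeps 0 (a valid coordinate) and rejects NaN.
    if (!Number.isFinite(latitude) || !Number.isFinite(longitude)) return null
    return { latitude, longitude }
}

// Two coordinate strings match when both axes differ by no more than the tolerance.
function coordinatesMatch (a, b, tolerance = COORDINATE_TOLERANCE) {
    const first = parseCoordinates(a)
    const second = parseCoordinates(b)
    if (!first || !second) return false
    return Math.abs(first.latitude - second.latitude) <= tolerance &&
        Math.abs(first.longitude - second.longitude) <= tolerance
}
```

Using `Number.isFinite` instead of a truthiness check matters for points on the equator or prime meridian, where a coordinate of `0` is valid.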
Confidence Score: 5/5
Important Files Changed
Last reviewed commit: 25b5b41
@greptileai, you are welcome
Thank you! I appreciate your patience and collaboration. If you have any more questions about the PR or need clarification on any part of the heuristics-based address deduplication system, feel free to ask. Happy to help!
…gic from migrate-address-keys-to-heuristics script Remove fias:<uuid> → fias_id:<uuid> migration logic as it's no longer needed. Update script to only handle fallback key prefix migration. Remove fiasToMigrateCount tracking, fias_id SQL update query, and related statistics. Update documentation to reflect that only fallback key migration is performed.
… create-address-heuristics script
…licate-addresses script for dry-run mode
…veAddressDuplicateService and admin UI Replace generic Error throws with structured GQLError in ResolveAddressDuplicateService. Add validation for soft-deleted target addresses. Restrict merge action to only allow possibleDuplicateOf as winnerId (remove bidirectional merge support). Add extractErrorMessage helper in admin UI to display user-friendly error messages from GraphQL errors. Update all validation tests to use expectToThrowGQLErrorToResult
…n AddressHeuristicHistoryRecord from 4 to 8 decimal places
…inate candidate query in create-address-heuristics script
…ount in create-address-heuristics script dry-run mode
…marketplace tests Add createTestBillingIntegration setup in PaymentsFile, MarketPriceScope, and RegisterResidentInvoiceService tests.
…psert to ActualizeAddressesService Add DadataSearchProvider to extract heuristics from DaData search results in ActualizeAddressesService. Call upsertHeuristics after address update to persist heuristics alongside actualized address data. Add comment explaining dual provider usage (SuggestionProvider for fresh data, SearchProvider for heuristic extraction).
…dataSearchProvider to handle zero values Replace truthy check with explicit null check for geoLat/geoLon to correctly handle zero coordinates (e.g., locations near equator/prime meridian). Previous implementation would skip valid coordinates with zero values.
…e in upsertHeuristics to end of function Move possibleDuplicateOf update from first pass to end of function to ensure only one update occurs. Compare conflicts from both passes and select highest reliability match overall.
…ss in ResolveAddressDuplicateService
…ject for structured logging Wrap all logger call parameters (addressId, winnerId, loserId, etc.) in a data object to follow structured logging conventions. Update logger calls in ResolveAddressDuplicateService, mergeAddresses, and heuristicMatcher.
…hitespace in parseCoordinates helper Add null/type check for coordString parameter and trim whitespace before parsing to handle malformed coordinate strings. Prevents errors when coordString is null, undefined, or non-string.
…ogleSearchProvider to handle zero values
…ss is referenced in condo Properties Skip merge operation when current address (duplicate) is referenced in Properties, as the mutation requires the target to be the winner. Previously attempted to swap winner/loser which would violate merge constraints.
… to create-address-heuristics script
…eate-address-heuristics to handle zero values
…ting possibleDuplicateOf in upsertHeuristics Add validation to prevent setting possibleDuplicateOf to the same address (self-link). Log warning when rootAddressId equals addressId and skip the update operation to avoid creating invalid duplicate relationships.
…ses script with batch processing and progress tracking Add batch property reference checking to reduce database queries, implement progress bar visualization, and improve logging format. Replace per-address isAddressReferenced calls with single batch getReferencedAddressIds query. Add total record count display and page-based progress indicators.
Hey! We are open-source!
@paulo-rossy, please pay attention to this merged PR. |
Address Heuristics: Provider-Agnostic Deduplication System
Problem
The address-service previously relied on Address.key for deduplication. Since different providers (Dadata, Google, Pullenti) generate different keys for the same physical building, duplicate Address records accumulate over time. There was no mechanism to detect or resolve these cross-provider duplicates.
Solution
Introduce AddressHeuristic, a model that stores structured identifiers extracted from each provider (FIAS ID, coordinates, Google Place ID, fallback key). When a new address is resolved, the system checks existing heuristics before creating a new Address, enabling cross-provider matching.
How It Works
- … fias_id + coordinates + fallback)
- findAddressByHeuristics() checks each heuristic against the DB, sorted by reliability score; the first match wins
- possibleDuplicateOf may be set if a near-match is detected
Where to Start Reviewing
Start with the data model, then follow the flow:
- domains/address/schema/AddressHeuristic.js: the new model (type, value, reliability, provider, lat/lon for coordinate range queries)
- domains/address/schema/Address.js: the new possibleDuplicateOf relationship field
- domains/common/utils/services/search/heuristicMatcher.js: core matching logic (findAddressByHeuristics(), upsertHeuristics(), coordinate range queries)
- domains/common/utils/services/search/searchServiceUtils.js: integration point where heuristics enter the search flow
- extractHeuristics() methods: see how each provider extracts its identifiers:
  - DadataSearchProvider.js → fias_id, coordinates, fallback
  - GoogleSearchProvider.js → google_place_id, coordinates, fallback
  - PullentiSearchProvider.js → fias_id, fallback
  - InjectionsSeeker.js → fallback only
Then review the resolution/merge side:
- domains/address/utils/mergeAddresses.js: shared utility for merging two addresses (moves sources + heuristics, soft-deletes loser)
- domains/address/schema/ResolveAddressDuplicateService.js: GraphQL mutation for admin merge/dismiss
- domains/address/schema/ActualizeAddressesService.js: updated to use shared merge utility (now handles heuristics during key collisions)
- admin-ui/index.js: the UI button for resolving duplicates
Migration scripts (all support --dry-run):
- bin/migrate-address-keys-to-heuristics.js: updates Address.key format (fias: → fias_id:, bare → fallback:)
- bin/create-address-heuristics.js: backfills AddressHeuristic records from existing Address meta
- bin/local/merge-duplicate-addresses.js: bulk auto-merge of clear duplicate cases
What to Pay Attention To
- COORDINATE_TOLERANCE = 0.00001 (~1.1m at equator). Uses latitude/longitude Decimal fields with DB range queries, not in-memory scanning
- (type, value) where deletedAt IS NULL: prevents duplicate heuristic records
- possibleDuplicateOf always points to the root address via findRootAddress()
- Address.key is still generated but now uses the best heuristic (highest reliability) as the key source. Fallback key remains for providers without strong identifiers
- mergeAddresses(): changes to merge behavior only need to happen in one place