Formalise naming conventions across the codebase
Following on from #181 and recent discussions, we should look to standardise naming conventions across the codebase before 1.0. This ticket captures outstanding inconsistencies beyond the items already discussed.
Address table parameters
Current parameter names are inconsistent:
| Location | Current names | Proposed |
|---|---|---|
| run_deterministic_match_pass | df_addresses_to_match / df_addresses_to_search_within | df_messy / df_canonical |
| get_linker | df_addresses_to_match / df_addresses_to_search_within | df_messy / df_canonical |
| select_top_match_candidates | df_exact_matches / df_splink_matches | df_high_precision_matches / df_probabilistic_matches |
The verbose names are descriptive but inconsistent with the messy/canonical terminology we want to adopt.
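One way to roll out the rename without breaking existing callers immediately is a small keyword-argument shim. The sketch below is only an illustration: the decorator `renamed_kwargs` and the stand-in function `example_match_pass` are hypothetical and not part of the codebase; only the old and new parameter names come from the table above.

```python
import functools
import warnings


def renamed_kwargs(**old_to_new):
    """Accept deprecated keyword names and forward them under their new names."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for old, new in old_to_new.items():
                if old in kwargs:
                    if new in kwargs:
                        raise TypeError(f"Pass either {old!r} or {new!r}, not both")
                    warnings.warn(
                        f"{old!r} is deprecated; use {new!r} instead",
                        DeprecationWarning,
                        stacklevel=2,
                    )
                    # Move the value across to the new keyword name.
                    kwargs[new] = kwargs.pop(old)
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical stand-in for a matching function; the body is a placeholder.
@renamed_kwargs(
    df_addresses_to_match="df_messy",
    df_addresses_to_search_within="df_canonical",
)
def example_match_pass(*, df_messy, df_canonical):
    return (df_messy, df_canonical)
```

Callers using the old names would keep working but see a `DeprecationWarning`, which gives a release or two of overlap before the old names are dropped.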
Address tables
Relatively uncontroversial, but we currently have two primary address tables, and our matching examples name them inconsistently because of the shape of the current API.
I would suggest that we use the following naming conventions for our primary tables:
- messy_addresses (instead of fuzzy or any other variation). This makes it clear which records are to be matched.
- canonical_addresses - for the canonical list of addresses the user is to provide.

Then we will have a third "address table", which contains our matches:
- __ukam_results - as discussed and suggested in Data API: Formalising how data passes through the system #181 (comment).
Stages
We currently have two primary matching "phases" or "stages" - deterministic (ensemble of techniques with high precision) and probabilistic (Splink).
For cleaning stages, I think we are relatively content with our current naming schemes. We have pre-tf and tf cleaning, depending on the user's needs. These were refactored in:
I would propose that going forward we name our linkage stages as such:
- run_high_precision_match_pass instead of "deterministic". This makes it clear what our intent is with this stage.
- run_probabilistic_match_pass instead of the Linker magic we currently have. This is the second phase, which introduces the probabilistic linking techniques to our ensemble.
This is all partially captured within:
Match reason enum values
Current values in match_reasons.py mix description styles:
- "exact: full match"
- "splink: probabilistic match"
- "unique_trigram: unique trigram match"
Proposed format: "{pass}:{stage}" for consistency:
- "high_precision:exact"
- "high_precision:trigram"
- "probabilistic:splink"
This would be a breaking change for anyone parsing match reasons.
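As a rough sketch of what the proposed format could look like in code, the enum below uses the three proposed values and exposes the pass and stage parts separately. The class name `MatchReason`, its member names, the two properties, and the `LEGACY_TO_NEW` lookup are all assumptions for illustration; in particular, the pairing of each current value with a proposed value is inferred from the two lists above rather than confirmed.

```python
from enum import Enum


class MatchReason(str, Enum):
    """Hypothetical enum using the proposed "{pass}:{stage}" value format."""

    HIGH_PRECISION_EXACT = "high_precision:exact"
    HIGH_PRECISION_TRIGRAM = "high_precision:trigram"
    PROBABILISTIC_SPLINK = "probabilistic:splink"

    @property
    def match_pass(self) -> str:
        # Text before the colon, e.g. "high_precision".
        return self.value.split(":", 1)[0]

    @property
    def stage(self) -> str:
        # Text after the colon, e.g. "exact".
        return self.value.split(":", 1)[1]


# Assumed mapping from the current values to the proposed ones, which could
# help anyone parsing match reasons to migrate.
LEGACY_TO_NEW = {
    "exact: full match": MatchReason.HIGH_PRECISION_EXACT,
    "unique_trigram: unique trigram match": MatchReason.HIGH_PRECISION_TRIGRAM,
    "splink: probabilistic match": MatchReason.PROBABILISTIC_SPLINK,
}
```

Publishing a lookup like this alongside the change would soften the break for downstream code that filters on the old strings.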
Terminology: stage vs pass vs phase
The codebase uses all three interchangeably. Proposed hierarchy:
- Phase - top level matching phase (high precision pass, probabilistic pass)
- Stage - sub-steps within a pass (exact stage, trigram stage within high precision pass)
Column suffix documentation
Splink outputs use _r and _l suffixes. We should document the mapping clearly:
- _r = messy (right dataset in Splink)
- _l = canonical (left dataset in Splink)
This is set in splink_model.py via input_table_aliases=["m_", "c_"] but the relationship is not obvious to users.
If possible, it would be helpful to adjust the suffixes directly so it's easier for users to differentiate between messy and canonical addresses in the probabilistic output table.
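If the suffixes cannot be changed at source, one fallback is to relabel the output columns after the fact. The helper below is a minimal sketch of that idea in plain Python, operating on column names only; `relabel_column` and the `_messy` / `_canonical` target suffixes are hypothetical choices, not existing behaviour, and this is not a Splink setting.

```python
# Assumed target suffixes; the _l / _r meanings follow the mapping above.
SUFFIX_MAP = (("_l", "_canonical"), ("_r", "_messy"))


def relabel_column(name, suffix_map=SUFFIX_MAP):
    """Swap a trailing _l or _r for a more descriptive suffix."""
    for old, new in suffix_map:
        if name.endswith(old):
            return name[: -len(old)] + new
    # Columns without either suffix are returned unchanged.
    return name


columns = ["postcode_l", "postcode_r", "match_weight"]
relabelled = [relabel_column(c) for c in columns]
# relabelled == ["postcode_canonical", "postcode_messy", "match_weight"]
```

The resulting old-to-new mapping could then be applied to the probabilistic output table with whatever rename facility the chosen backend provides.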
Post-linkage function naming
Functions like best_matches_with_distinguishability, improve_predictions_using_distinguishing_tokens, and select_top_match_candidates have no common prefix and are confusing for the user. I think these should really be hidden behind the primary API as we proceed, but if we do keep them public for users, perhaps look to give them sensible prefixes? For example, prefix with postprocess_*.
Related: