docs: Formalising naming conventions #185

@ThomasHepworth

Formalise naming conventions across the codebase

Following on from #181 and recent discussions, we should look to standardise naming conventions across the codebase before 1.0. This ticket captures outstanding inconsistencies beyond the items already discussed.

Address table parameters

Current parameter names are inconsistent:

| Location | Current names | Proposed |
| --- | --- | --- |
| run_deterministic_match_pass | df_addresses_to_match / df_addresses_to_search_within | df_messy / df_canonical |
| get_linker | df_addresses_to_match / df_addresses_to_search_within | df_messy / df_canonical |
| select_top_match_candidates | df_exact_matches / df_splink_matches | df_high_precision_matches / df_probabilistic_matches |

The verbose names are descriptive but inconsistent with the messy/canonical terminology we want to adopt.
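One way to soften the rename for existing callers is to accept the old keyword names for a release and emit a deprecation warning. The sketch below is illustrative only: `accept_legacy_kwargs` and `example_match_pass` are hypothetical names, and the stand-in function does not reflect the real signatures of `run_deterministic_match_pass` or `get_linker`. Only the old/new keyword pairs come from the table above.

```python
import functools
import warnings

# Old keyword -> proposed keyword, taken from the table above.
RENAMED_KWARGS = {
    "df_addresses_to_match": "df_messy",
    "df_addresses_to_search_within": "df_canonical",
}


def accept_legacy_kwargs(renames):
    """Hypothetical decorator: translate old keyword names to the new ones."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for old, new in renames.items():
                if old in kwargs:
                    if new in kwargs:
                        raise TypeError(f"Pass either {old!r} or {new!r}, not both")
                    warnings.warn(
                        f"{old!r} is deprecated; use {new!r} instead",
                        DeprecationWarning,
                        stacklevel=2,
                    )
                    # Move the value across to the new keyword name.
                    kwargs[new] = kwargs.pop(old)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@accept_legacy_kwargs(RENAMED_KWARGS)
def example_match_pass(df_messy, df_canonical):
    # Stand-in body (assumption): the real matching logic is not shown here.
    return df_messy, df_canonical
```

With this in place, a call using the old names still works but warns, which gives users a window to migrate before the old names are removed.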

Address tables

This should be relatively uncontroversial. We currently have two primary address tables, which are named inconsistently in our example matching because of the shape of the current API.

I would suggest that we use the following naming conventions for our primary tables:

  • messy_addresses (instead of fuzzy or any other variation). This makes it clear which records are to be matched
  • canonical_addresses - for the canonical list of addresses the user is to provide

Then we will have a third "address table" which contains our matches:

Stages

We currently have two primary matching "phases" or "stages" - deterministic (ensemble of techniques with high precision) and probabilistic (Splink).

For cleaning stages, I think we are relatively content with our current naming schemes. We have pre-tf and tf cleaning, depending on the user's needs. These were refactored in:

I would propose that going forward we name our linkage stages as such:

  • run_high_precision_match_pass instead of "deterministic". This makes it clear what our intent is with this stage
  • run_probabilistic_match_pass instead of the Linker magic we currently have. This is the second phase which introduces the probabilistic linking techniques to our ensemble.

This is all partially captured within:

Match reason enum values

Current values in match_reasons.py mix description styles:

  • "exact: full match"
  • "splink: probabilistic match"
  • "unique_trigram: unique trigram match"

Proposed format: "{pass}:{stage}" for consistency:

  • "high_precision:exact"
  • "high_precision:trigram"
  • "probabilistic:splink"

This would be a breaking change for anyone parsing match reasons.
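To make the proposed format concrete, and to show one way of easing the breaking change, here is a minimal sketch of an enum using the "{pass}:{stage}" values plus a lookup from the current strings. `MatchReason`, `LEGACY_MATCH_REASONS` and `parse_match_reason` are hypothetical names, not the actual contents of match_reasons.py, and mapping "unique_trigram: unique trigram match" to "high_precision:trigram" is an assumption.

```python
from enum import Enum


class MatchReason(str, Enum):
    """Hypothetical enum using the proposed "{pass}:{stage}" format."""

    HIGH_PRECISION_EXACT = "high_precision:exact"
    HIGH_PRECISION_TRIGRAM = "high_precision:trigram"
    PROBABILISTIC_SPLINK = "probabilistic:splink"

    @property
    def match_pass(self) -> str:
        # Everything before the first colon is the pass.
        return self.value.split(":", 1)[0]

    @property
    def stage(self) -> str:
        # Everything after the first colon is the stage.
        return self.value.split(":", 1)[1]


# Current value -> proposed member (the trigram mapping is an assumption).
LEGACY_MATCH_REASONS = {
    "exact: full match": MatchReason.HIGH_PRECISION_EXACT,
    "unique_trigram: unique trigram match": MatchReason.HIGH_PRECISION_TRIGRAM,
    "splink: probabilistic match": MatchReason.PROBABILISTIC_SPLINK,
}


def parse_match_reason(raw: str) -> MatchReason:
    """Accept either a current (legacy) string or a proposed-format string."""
    if raw in LEGACY_MATCH_REASONS:
        return LEGACY_MATCH_REASONS[raw]
    return MatchReason(raw)  # raises ValueError for unknown values
```

Anyone currently parsing match reasons could route stored values through a helper like `parse_match_reason` during the transition, then read `match_pass` and `stage` directly instead of splitting strings themselves.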

Terminology: stage vs pass vs phase

The codebase uses all three interchangeably. Proposed hierarchy:

  • Pass - top-level matching pass (high precision pass, probabilistic pass)
  • Stage - sub-steps within a pass (exact stage, trigram stage within high precision pass)

Column suffix documentation

Splink outputs use _r and _l suffixes. We should document the mapping clearly:

  • _r = messy (right dataset in Splink)
  • _l = canonical (left dataset in Splink)

This is set in splink_model.py via input_table_aliases=["m_", "c_"] but the relationship is not obvious to users.

If possible, it would be helpful to adjust the suffixes directly so it's easier for users to differentiate between messy and canonical addresses in the probabilistic output table.
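If the suffixes cannot be changed at source, a post-hoc rename is one fallback. The sketch below builds a rename mapping from a list of column names, following the _r = messy / _l = canonical mapping stated above. `SUFFIX_MAP` and `rename_splink_suffixes` are hypothetical names, and the `_messy` / `_canonical` replacement suffixes are only a suggestion.

```python
# Splink suffix -> proposed suffix, per the mapping described above.
SUFFIX_MAP = {"_l": "_canonical", "_r": "_messy"}


def rename_splink_suffixes(columns):
    """Return {old_name: new_name} for columns ending in _l or _r.

    Columns without either suffix are left out of the mapping, so they
    keep their existing names when the mapping is applied.
    """
    renamed = {}
    for col in columns:
        for old, new in SUFFIX_MAP.items():
            if col.endswith(old):
                renamed[col] = col[: -len(old)] + new
                break
    return renamed
```

The resulting dictionary could then be applied with whatever rename facility the output table offers, for example `DataFrame.rename(columns=...)` if the predictions have been pulled into pandas. Note that this matches purely on the suffix, so any unrelated column that happens to end in `_l` or `_r` would also be renamed.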

Post-linkage function naming

Functions like best_matches_with_distinguishability, improve_predictions_using_distinguishing_tokens, and select_top_match_candidates share no common prefix, which makes them confusing for users. I think these should really be hidden behind the primary API as we proceed, but if we do keep them public, we could give them a common prefix, for example postprocess_*.

Related:
