docs: Formalising naming conventions #185

## Formalise naming conventions across the codebase

Following on from #181 and recent discussions, we should look to standardise naming conventions across the codebase before 1.0. This ticket captures outstanding inconsistencies beyond the items already discussed.

### Address table parameters

Current parameter names are inconsistent:

| Location | Current names | Proposed |
|----------|---------------|----------|
| `run_deterministic_match_pass` | `df_addresses_to_match` / `df_addresses_to_search_within` | `df_messy` / `df_canonical` |
| `get_linker` | `df_addresses_to_match` / `df_addresses_to_search_within` | `df_messy` / `df_canonical` |
| `select_top_match_candidates` | `df_exact_matches` / `df_splink_matches` | `df_high_precision_matches` / `df_probabilistic_matches` |

The verbose names are descriptive but inconsistent with the `messy`/`canonical` terminology we want to adopt.
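To avoid breaking existing callers the moment the parameters are renamed, one option is a small shim that accepts the old keyword names for a release or two. The following is only a sketch: `renamed_kwargs` is a hypothetical helper that does not exist in the codebase, and the function body is a placeholder rather than the real matching logic.

```python
import functools
import warnings


def renamed_kwargs(**old_to_new):
    """Hypothetical helper: accept deprecated keyword names and forward them."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for old, new in old_to_new.items():
                if old in kwargs:
                    if new in kwargs:
                        raise TypeError(f"Pass either {old!r} or {new!r}, not both")
                    warnings.warn(
                        f"{old!r} is deprecated; use {new!r} instead",
                        DeprecationWarning,
                        stacklevel=2,
                    )
                    # Move the value across to the new keyword name.
                    kwargs[new] = kwargs.pop(old)
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Placeholder body: the real function performs the matching pass.
@renamed_kwargs(
    df_addresses_to_match="df_messy",
    df_addresses_to_search_within="df_canonical",
)
def run_deterministic_match_pass(*, df_messy, df_canonical):
    return {"messy": df_messy, "canonical": df_canonical}
```

Calls using the old names would keep working but emit a `DeprecationWarning`, which gives us a window to update the examples and docs before removing the old names at 1.0.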

### Address tables

This should be relatively uncontroversial. We currently have two primary address tables, and our example matching gives them [inconsistent names](https://github.com/moj-analytical-services/uk_address_matcher/blob/main/examples/example_matching.py#L75) because of the shape of the current API.

I would suggest that we use the following naming conventions for our primary tables:
- `messy_addresses` (instead of `fuzzy` or any other variation). This makes it clear which records are to be matched
- `canonical_addresses` - for the canonical list of addresses the user is to provide

We will then have a third "address table", which contains our matches:
- `__ukam_results` - as discussed and suggested in https://github.com/moj-analytical-services/uk_address_matcher/issues/181#issuecomment-3846633217.

### Stages

We currently have two primary matching "phases" or "stages" - deterministic (ensemble of techniques with high precision) and probabilistic (Splink).

For cleaning stages, I think we are relatively content with our current naming schemes. We have pre-tf and tf cleaning, depending on the user's needs. These were refactored in:
- https://github.com/moj-analytical-services/uk_address_matcher/commit/7eba78752655581311ede7e9e0468605bfab6810

I would propose that going forward we name our linkage stages as such:
- `run_high_precision_match_pass` instead of "deterministic". This makes it clear what our intent is with this stage
- `run_probabilistic_match_pass` instead of the Linker magic we currently have. This is the second phase which introduces the probabilistic linking techniques to our ensemble.

This is all partially captured within:
- https://github.com/moj-analytical-services/uk_address_matcher/pull/144
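If we rename the function itself, the old name could stay around briefly as a thin alias. This is a rough sketch only: the signature and the returned tuple are placeholders I have made up for illustration, not the real implementation.

```python
import warnings


def run_high_precision_match_pass(df_messy, df_canonical):
    # Placeholder body standing in for the real high precision pass.
    return ("high_precision", df_messy, df_canonical)


def run_deterministic_match_pass(*args, **kwargs):
    """Deprecated alias that forwards to the renamed function."""
    warnings.warn(
        "run_deterministic_match_pass is deprecated; "
        "use run_high_precision_match_pass instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return run_high_precision_match_pass(*args, **kwargs)
```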

### Match reason enum values

Current values in `match_reasons.py` mix description styles:
- `"exact: full match"`
- `"splink: probabilistic match"`
- `"unique_trigram: unique trigram match"`

Proposed format: `"{pass}:{stage}"` for consistency:
- `"high_precision:exact"`
- `"high_precision:trigram"`
- `"probabilistic:splink"`

This would be a breaking change for anyone parsing match reasons.
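To make the proposal concrete, and to soften the breaking change, something like the following could work. It is a sketch under assumptions: the class name `MatchReason`, the member names, `LEGACY_TO_NEW` and `upgrade_match_reason` are all hypothetical and may not match what is in `match_reasons.py` today. Only the string values are taken from this issue.

```python
from enum import Enum


class MatchReason(str, Enum):
    """Proposed "{pass}:{stage}" values (member names are illustrative)."""

    HIGH_PRECISION_EXACT = "high_precision:exact"
    HIGH_PRECISION_TRIGRAM = "high_precision:trigram"
    PROBABILISTIC_SPLINK = "probabilistic:splink"

    @property
    def match_pass(self):
        # Text before the colon, e.g. "high_precision".
        return self.value.split(":", 1)[0]

    @property
    def stage(self):
        # Text after the colon, e.g. "exact".
        return self.value.split(":", 1)[1]


# Mapping from the current strings to the proposed ones, for migration.
LEGACY_TO_NEW = {
    "exact: full match": MatchReason.HIGH_PRECISION_EXACT,
    "unique_trigram: unique trigram match": MatchReason.HIGH_PRECISION_TRIGRAM,
    "splink: probabilistic match": MatchReason.PROBABILISTIC_SPLINK,
}


def upgrade_match_reason(value):
    """Accept either an old or a new string and return the enum member."""
    if value in LEGACY_TO_NEW:
        return LEGACY_TO_NEW[value]
    return MatchReason(value)
```

A helper along these lines would let anyone parsing stored match reasons convert old results rather than having to rerun their matching.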

### Terminology: stage vs pass vs phase

The codebase uses all three interchangeably. Proposed hierarchy:
- **Pass** - top-level matching pass (high precision pass, probabilistic pass)
- **Stage** - sub-steps within a pass (exact stage, trigram stage within the high precision pass)

### Column suffix documentation

Splink outputs use `_r` and `_l` suffixes. We should document the mapping clearly:
- `_r` = messy (right dataset in Splink)
- `_l` = canonical (left dataset in Splink)

This is set in `splink_model.py` via `input_table_aliases=["m_", "c_"]` but the relationship is not obvious to users.

If possible, it would be helpful to adjust the suffixes directly so it's easier for users to differentiate between messy and canonical addresses in the probabilistic output table.
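If changing the suffixes at source turns out to be awkward, a post-hoc rename of the output columns might be enough. The sketch below only builds a rename mapping from column names, so it makes no assumptions about the backend. The target suffixes `_canonical` and `_messy` and the function name are my own suggestions, not anything agreed.

```python
# Assumed mapping, following the convention described above:
# _l = canonical (left), _r = messy (right).
SUFFIX_MAP = {"_l": "_canonical", "_r": "_messy"}


def rename_suffixed_columns(columns):
    """Return {old_name: new_name} for columns ending in _l or _r."""
    renamed = {}
    for col in columns:
        for old, new in SUFFIX_MAP.items():
            if col.endswith(old):
                # Swap only the trailing suffix, keep the rest of the name.
                renamed[col] = col[: -len(old)] + new
                break
    return renamed
```

The resulting dict could be passed to something like pandas `DataFrame.rename(columns=...)`. Note that this would also catch any unrelated column that happens to end in `_l` or `_r`, so a real version would probably want an explicit list of columns.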

### Post-linkage function naming

Functions like `best_matches_with_distinguishability`, `improve_predictions_using_distinguishing_tokens`, and `select_top_match_candidates` have no common prefix and are confusing for the user. I think these should really be hidden behind the primary API as we proceed, but if we do keep them public for users, perhaps look to give them sensible prefixes? For example, prefix with `postprocess_*`.


Related:
- #181
- #144

