-
Notifications
You must be signed in to change notification settings - Fork 49
Description
Summary
It would be useful to have a function that automatically infers primary and foreign key relationships by finding columns with identical names across tables.
Motivation
Datasets that follow naming conventions like ADaM (Analysis Data Model) in pharmaceutical research often have consistent column naming. For example, USUBJID (subject ID) appears across multiple tables and is unique in the subject-level table (ADSL) but not in other tables.
Currently, users must manually specify all PK/FK relationships. An automatic inference function would simplify workflows significantly.
Proposed behavior
A function (e.g., dm_add_keys_by_name()) that:
- Finds columns that appear in multiple tables
- For each shared column, checks which tables have unique values (potential PK)
- If exactly one table has unique values and others don't, adds:
- A primary key to the table with unique values
- Foreign keys from tables with non-unique values
Example
orders <- tibble::tibble(
order_id = 1:5,
customer_id = c(1L, 2L, 1L, 3L, 2L),
amount = c(100, 200, 150, 300, 250)
)
customers <- tibble::tibble(
customer_id = 1:3,
name = c("Alice", "Bob", "Charlie")
)
# Create dm without keys
my_dm <- dm(orders, customers)
# Automatically infer and add keys based on column names
my_dm_with_keys <- dm_add_keys_by_name(my_dm)
# -> PK: customers$customer_id
# -> FK: orders$customer_id -> customers$customer_idEdge cases to consider
- Columns where all tables have unique values (ambiguous PK) should be skipped
- Columns where no table has unique values (no valid PK candidate) should be skipped
- Existing keys should be preserved
Related
This was previously explored in PR #2350 but not merged. The core idea remains valuable for datasets following consistent naming conventions.