Skip to content

Feature request: Infer keys by column name equality #2357

@christophsax

Description

@christophsax

Summary

It would be useful to have a function that automatically infers primary and foreign key relationships by finding columns with identical names across tables.

Motivation

Datasets that follow naming conventions like ADaM (Analysis Data Model) in pharmaceutical research often have consistent column naming. For example, USUBJID (subject ID) appears across multiple tables and is unique in the subject-level table (ADSL) but not in other tables.

Currently, users must manually specify all PK/FK relationships. An automatic inference function would simplify workflows significantly.

Proposed behavior

A function (e.g., dm_add_keys_by_name()) that:

  1. Finds columns that appear in multiple tables
  2. For each shared column, checks which tables have unique values (potential PK)
  3. If exactly one table has unique values and others don't, adds:
    • A primary key to the table with unique values
    • Foreign keys from tables with non-unique values

Example

orders <- tibble::tibble(
  order_id = 1:5,
  customer_id = c(1L, 2L, 1L, 3L, 2L),
  amount = c(100, 200, 150, 300, 250)
)

customers <- tibble::tibble(
  customer_id = 1:3,
  name = c("Alice", "Bob", "Charlie")
)

# Create dm without keys
my_dm <- dm(orders, customers)

# Automatically infer and add keys based on column names
my_dm_with_keys <- dm_add_keys_by_name(my_dm)
# -> PK: customers$customer_id
# -> FK: orders$customer_id -> customers$customer_id

Edge cases to consider

  • Columns where all tables have unique values (ambiguous PK) should be skipped
  • Columns where no table has unique values (no valid PK candidate) should be skipped
  • Existing keys should be preserved

Related

This was previously explored in PR #2350 but not merged. The core idea remains valuable for datasets following consistent naming conventions.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions