Skip to content

Latest commit

 

History

History
135 lines (110 loc) · 6.21 KB

File metadata and controls

135 lines (110 loc) · 6.21 KB

Multi-Valued Ordinals Implementation - Session Notes

Status: Core Implementation Complete ✅

What's Working

  • Multi-valued ordinals fully functional: All 33 tests passing in test-Entity-multivalued-ordinals.R
  • STF roundtrip: Multi-valued ordinals correctly import/export through STF format
  • Validation: New validator validate_entity_multivalued_ordinal_levels() ensures expanded values match ordinal_levels
  • Vocabulary fix: Critical bug fixed in get_hydrated_variable_and_category_metadata() - ordinal variables now get vocabulary from ordinal_levels metadata (canonical source) rather than from data

Code Changes Made

1. Enable Multi-Valued Ordinals

  • R/Entity-methods.R:1454-1457: Removed blocking code that prevented ordinal + multi-valued combination
  • R/Entity-methods.R:1589-1593: sync_ordinal_data() now filters out multi-valued ordinals (they stay as character strings, not factors)
  • R/Entity-methods.R:1256-1293: set_variable_ordinal_levels() now handles multi-valued by expanding values and validating against levels

2. Validation

  • R/entity-validators-data-types.R:211-220: validate_entity_ordinal_factors() skips multi-valued ordinals
  • R/entity-validators-data-types.R:305-371: New validate_entity_multivalued_ordinal_levels() validator
  • R/validators.R:169-170: Registered new validator in baseline profile

3. Vocabulary Fix (Important!)

  • R/Entity-methods.R:1174-1196: Split categorical/binary and ordinal processing
    • Categorical/binary: vocabulary from as.factor(data) %>% levels() (line 1157)
    • Ordinal: vocabulary from ordinal_levels metadata (line 1178) - this ensures correct order and includes user-defined levels not in data

4. VDI Export Multi-Valued Expansion

  • R/Study-export-VDI.R:344-399: String variables - separate single/multi-valued, use pmap for multi-valued
  • R/Study-export-VDI.R:405-462: Number variables - same pattern
  • R/Study-export-VDI.R:466-525: Date variables - same pattern
  • Key: Filters use is_multi_valued == 1 (it's converted to integer at line 275)
  • Performance: Single-valued variables processed in batch (critical for gene expression with 1000s of variables)
  • Multi-valued expansion: Each variable independently expanded with ID replication using rowwise/unnest pattern

Test Files Created

  1. inst/extdata/toy_example/householdsMultiValuedOrdinals.tsv: Test data with multi-valued ordinals
  2. tests/testthat/test-Entity-multivalued-ordinals.R: 33 tests, all passing ✅
  3. tests/testthat/test-Study-export-VDI-multivalued.R: Tests written but need fixture fixes

Remaining Issues

VDI Export Tests Not Passing

Problem: Test entities from householdsMultiValuedOrdinals.tsv have ID detection issues

  • Distances.to.well gets detected as ID column (all unique values in small dataset)
  • Tests fail validation before VDI export even runs

Why This Happened:

  • Unlucky test data - "Distances to well" ended up with all unique values
  • Real-world data wouldn't have this issue
  • Existing VDI tests use make_study() which creates properly structured test data

Solutions for Next Session:

Option A: Fix Test Data (Recommended)

Create better test data or use existing make_study() pattern:

# Instead of entity_from_file, consider:
study <- make_study(name = 'multivalued test')
households <- study %>% get_root_entity() %>%
  set_variables_multivalued('SomeVar' = ';')

Option B: Fix Existing Tests

Each test needs proper setup. Pattern that works:

households <- entity_from_file(file_path, name='household') %>%
  quiet() %>%
  set_variables_multivalued('Construction.materials' = ';') %>%
  redetect_columns_as_variables('Distances.to.well')  # Fix ID detection

Current State of VDI Tests:

  • Lines 1-80: test 1 (string) - partially fixed, needs validation
  • Lines 83-127: test 2 (number) - partially fixed
  • Lines 140-192: test 3 (date) - needs redetect fix
  • Lines 194-246: test 4 (integer) - needs redetect fix
  • Lines 248-326: test 5 (ordinals) - needs redetect fix (also line ~260)
  • Lines 332-407: test 6 (mixed) - fixed syntax, needs validation

Next Steps

  1. Fix VDI test fixtures (30 min)

    • Option A: Rewrite to use make_study() pattern
    • Option B: Add redetect_columns_as_variables() to remaining tests (lines 140, 194, 248-260)
  2. Run full test suite (10 min)

    docker exec --user rstudio study-wrangler-dev R -e "setwd('/study.wrangler'); library(devtools); test()"
    • Check for regressions in existing tests
    • Verify ordinal tests still pass
    • Verify existing VDI tests still pass
  3. Test existing multi-valued functionality (5 min)

    • Run test-Entity-multivalued.R to ensure we didn't break existing multi-valued (non-ordinal) behavior
    • Especially check line 18-25 which previously tested that ordinals can't be multi-valued
  4. Documentation (10 min)

    • Update CLAUDE.md if needed with multi-valued ordinal guidance
    • Consider adding example to vignettes

Technical Notes

Multi-Valued Storage

  • Character strings with delimiters: "Low;High;Medium"
  • NOT converted to factors: Multi-valued ordinals must remain character type
  • Validation: Expanded values must all exist in ordinal_levels

VDI Export Pattern

# For each multi-valued variable
data %>%
  select(id = {{id_col}}, value = all_of(var)) %>%
  rowwise() %>%
  mutate(values = list(strsplit(value, delim)[[1]])) %>%
  ungroup() %>%
  tidyr::unnest(values) %>%
  mutate("{id_col_vdi}" := id,
         attribute_stable_id = stable_ids[var],
         string_value = values,  # or number_value, date_value
         .keep = "none")

Key Files Modified

  • R/Entity-methods.R: ~100 lines changed (ordinals, hydrated metadata)
  • R/entity-validators-data-types.R: ~70 lines added (new validator)
  • R/validators.R: 2 lines added (register validator)
  • R/Study-export-VDI.R: ~180 lines changed (multi-valued expansion)
  • Tests: 2 new files, ~400 lines

Questions for User

  1. Do you want to use make_study() pattern for VDI tests, or fix existing entity_from_file() approach?
  2. Should we add multi-valued ordinal examples to vignettes?
  3. Any specific edge cases to test before considering this complete?