Make as_epi_archive() correct or balk at data.tables with corrupted metadata from dplyr operations #684

@brookslogan

See discussion here. We (devs and users) should never use dplyr verbs on data.tables: likely every verb violates data.table's memory model, and some of them produce corrupt data.tables. E.g., arrange can output a data.table with incorrect metadata. We should make as_epi_archive do one of the following:

  • Make something valid out of this invalid input (see comments A and B).
  • Detect invalid metadata somehow and balk, forcing the user to correct their dplyr usage. This might be hard to do or involve peeking into data.table internals. Perhaps we could go with the first approach but also detect specific violations, like the data not actually being sorted by its claimed key, and balk at them. Or just give up on this idea.
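One of the "specific violations" mentioned above, a key that claims a sort order the rows do not have, can probably be checked without touching data.table internals. A rough sketch follows; `key_is_truthful` is a hypothetical helper name, not an existing epiprocess or data.table function, and it assumes that base `order()` with `method = "radix"` (C-locale, stable, NAs first) agrees with data.table's key ordering for the column types involved.

```r
library(data.table)

# Hypothetical helper: TRUE if the rows of `x` really are sorted by the
# columns named in its key attribute (or if `x` has no key at all).
key_is_truthful <- function(x) {
  stopifnot(is.data.table(x))
  key_cols <- key(x)
  if (is.null(key_cols)) {
    return(TRUE)
  }
  cols <- lapply(key_cols, function(col) x[[col]])
  # Stable radix ordering: rows that are already sorted map to 1..n.
  ord <- do.call(order, c(cols, list(method = "radix", na.last = FALSE)))
  identical(ord, seq_len(nrow(x)))
}

# A properly keyed table.
good <- data.table(geo_value = c("a", "b"), time_value = 1:2)
setkeyv(good, c("geo_value", "time_value"))

# Simulate corrupted metadata: a key attribute on rows that are not sorted.
bad <- data.table(geo_value = c("b", "a"), time_value = 2:1)
setattr(bad, "sorted", c("geo_value", "time_value"))

key_is_truthful(good)
key_is_truthful(bad)
```

This only catches the stale-key case. It says nothing about memory model violations such as columns shared with another data.table.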

The memory model violations mean that input columns could clobber columns "owned" by another data.table. If we want to address these, then the first approach may be just: if x is a data.table, convert it to a plain data.frame with as.data.frame (which should dupe the columns), then setDT it back to a data.table with the appropriate key. This should also fix the metadata-based issues, since it should nuke the stale data.table metadata.
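A minimal sketch of that round trip, for illustration only: `rebuild_dt` is a hypothetical name, and the key columns `geo_value` and `time_value` are assumed here rather than taken from the actual as_epi_archive implementation.

```r
library(data.table)

# Hypothetical helper: rebuild a data.table from scratch so that neither its
# columns nor its metadata are shared with, or inherited from, the input.
rebuild_dt <- function(x, key_cols) {
  stopifnot(is.data.table(x))
  # as.data.frame() on a data.table returns a copy and drops the key
  # ("sorted") attribute, so any stale metadata is discarded here.
  df <- as.data.frame(x)
  # setDT() converts the local copy by reference; `key` re-sorts the rows
  # and sets a fresh key.
  setDT(df, key = key_cols)
  df
}

# Simulate corrupted metadata: a key attribute on rows that are not sorted.
bad <- data.table(geo_value = c("b", "a"), time_value = 2:1, value = c(10, 20))
setattr(bad, "sorted", c("geo_value", "time_value"))

fixed <- rebuild_dt(bad, c("geo_value", "time_value"))
```

The input `bad` is left untouched, and `fixed` is sorted by its key. The cost is a full copy of the data, which may matter for large archives.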
