Attribute-value extension and standardisation #62

@ialarmedalien

Some notes on extensions and standardisation for attribute-value set/pairs/whatever-you-want-to-call-them.

Add a type field to disambiguate between different types of value that may be stored and to allow extra validation.

The type should come from an enum so that we can control what's going in the field. Ideally the value in the attribute would dictate what appears in the type field, and we can use existing ontologies/controlled vocabularies to automatically populate it.

Examples:

properties:
  - attribute:
      id: MIXS:0000117
      label: total phosphorus
    raw_value: 2.2 ppm
    unit: ppm
    numeric_value: 2.2
    type: float

  - attribute:
      id: MIXS:0000011
      label: collection date
    raw_value: 12 Jun 2025
    value: 2025-06-12
    type: iso_datetime

  - attribute:
      label: n people on railway track
    value: 5
    type: integer

  - attribute:
      id: MIXS:0000012
      label: env_broad_scale
    value: terrestrial biome
    value_cv_id: ENVO:00000446
    type: cv_term   # controlled vocab term

  - attribute:
      label: smell
    value: completely disgusting
    type: text

  - attribute:
      label: size_of_bear
    value: big
    type: BearSizeEnum

  - attribute:
      label: random json data
    value: '{"this": "that", "the other": [1,3,5]}'
    type: json

Eventually, BERtron itself could perform this validation when data is integrated; in these early stages, we would have to rely on data providers to add the appropriate type fields.
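To make the idea concrete, here is a minimal Python sketch of a type enum plus a per-type check, assuming each property has already been loaded into a plain dict. The names ValueType and validate_property are hypothetical and not part of BERtron; the enum members are taken from the examples above, and custom enums such as BearSizeEnum are left out and would be reported as unknown.

```python
import json
import re
from datetime import datetime
from enum import Enum


class ValueType(Enum):
    """Hypothetical enum of allowed `type` values, taken from the examples."""
    FLOAT = "float"
    INTEGER = "integer"
    ISO_DATETIME = "iso_datetime"
    CV_TERM = "cv_term"
    TEXT = "text"
    JSON = "json"


# Loose CURIE pattern (prefix:local_id); an assumption, not an official rule.
CURIE = re.compile(r"^[A-Za-z][A-Za-z0-9_.]*:\S+$")


def _is_number(value, allow_float):
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return True
    return allow_float and isinstance(value, float)


def validate_property(prop):
    """Return a list of problems found in one attribute-value entry."""
    type_name = prop.get("type")
    try:
        value_type = ValueType(type_name)
    except ValueError:
        return [f"unknown type: {type_name!r}"]

    # The first example stores its number under `numeric_value`; the rest use `value`.
    value = prop.get("numeric_value", prop.get("value"))
    errors = []

    if value_type is ValueType.FLOAT and not _is_number(value, allow_float=True):
        errors.append("float value must be a number")
    elif value_type is ValueType.INTEGER and not _is_number(value, allow_float=False):
        errors.append("integer value must be a whole number")
    elif value_type is ValueType.ISO_DATETIME:
        try:
            datetime.fromisoformat(str(value))
        except ValueError:
            errors.append("value is not an ISO 8601 date or datetime")
    elif value_type is ValueType.CV_TERM:
        if not CURIE.match(str(prop.get("value_cv_id", ""))):
            errors.append("cv_term needs a value_cv_id such as ENVO:00000446")
    elif value_type is ValueType.TEXT and not isinstance(value, str):
        errors.append("text value must be a string")
    elif value_type is ValueType.JSON:
        try:
            json.loads(value)
        except (TypeError, ValueError):
            errors.append("value is not valid JSON")
    return errors
```

Returning a list of messages rather than raising means a whole batch of properties can be checked and reported in one pass, which seems the friendlier option while data providers are still filling in type fields by hand.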

Capture units as text and as ontology IDs

Not everyone has an encyclopaedic knowledge of the unit ontology (more's the pity), so whilst it is useful to be able to use a controlled vocab to express units, it is not very user-friendly. Providing both the controlled vocab ID for the unit and the text string might be a good compromise.

For example:

value: 2.2
unit: UO:0000008

would become

value: 2.2
unit: meter
unit_cv_id: UO:0000008

In future, it would be good if the labels were populated on ingest into BERtron from preloaded reference ontologies. For now, we will have to rely on data providers to have the correct term names and term IDs.
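As a rough illustration of what label population on ingest could look like, the Python sketch below moves an ontology ID found in unit into unit_cv_id and fills in the text label. UNIT_LABELS and add_unit_label are hypothetical names; the lookup table stands in for a preloaded reference ontology and contains only the one term used in the example above.

```python
# Stand-in for a preloaded reference ontology; only the term from the example.
UNIT_LABELS = {"UO:0000008": "meter"}


def add_unit_label(prop, labels=UNIT_LABELS):
    """Return a copy of `prop` with `unit` as text and `unit_cv_id` as the ontology ID."""
    result = dict(prop)
    unit = result.get("unit")
    if "unit_cv_id" not in result and unit in labels:
        # Old style: `unit` holds the ontology ID, so move it and look up the label.
        result["unit_cv_id"] = unit
        result["unit"] = labels[unit]
    return result


converted = add_unit_label({"value": 2.2, "unit": "UO:0000008"})
```

Entries whose unit is not a known ontology ID (for instance a free-text unit such as ppm) are returned unchanged in this sketch, so nothing is lost if the lookup fails.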

Standardise representation of ontology IDs and their labels

attribute is split into label and id, whilst value has an accompanying value_cv_id sibling. Choose one representation or the other.

  - attribute:
      id: MIXS:0000097
      label: depth
    value:
      raw_value: 2.2m
      value: 2.2  # value.label / value.id doesn't make sense here but 'value.value' is not ideal!
    unit:
      id: UO:0000008
      label: meter
    type: float

  - attribute:
      id: MIXS:0000012
      label: env_broad_scale
    value:
      id: ENVO:00000446
      label: terrestrial biome
    type: cv_term   # controlled vocab term - value is expressed as value.label and value.id

Another possibility:

  - attribute_label: depth
    attribute_cv_id: MIXS:0000097
    value: 2.2
    unit_label: meter
    unit_cv_id: UO:0000008
    type: float

  - attribute_label: env_broad_scale
    attribute_cv_id: MIXS:0000012
    value_label: terrestrial biome
    value_cv_id: ENVO:00000446
    type: cv_term

A simpler version:

  - attribute: depth
    attribute_cv_id: MIXS:0000097
    value: 2.2
    unit: meter
    unit_cv_id: UO:0000008
    type: float

  - attribute: env_broad_scale
    attribute_cv_id: MIXS:0000012
    value: terrestrial biome
    value_cv_id: ENVO:00000446
    type: cv_term
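
The representations above carry the same information, so converting between them is mechanical. The Python sketch below turns the nested id/label form into the simpler flat form; flatten is a hypothetical helper, and keeping raw_value as a top-level key is an assumption, since the flat examples do not show it.

```python
def flatten(prop):
    """Convert the nested id/label form into the flat `*_cv_id` form."""
    flat = {}
    for key in ("attribute", "value", "unit"):
        field = prop.get(key)
        if field is None:
            continue
        if isinstance(field, dict) and ("id" in field or "label" in field):
            # An ontology term: label goes under the bare key, ID under `<key>_cv_id`.
            if "label" in field:
                flat[key] = field["label"]
            if "id" in field:
                flat[f"{key}_cv_id"] = field["id"]
        elif isinstance(field, dict):
            # A plain value wrapped as value.value, with raw_value kept alongside.
            flat[key] = field.get("value")
            if "raw_value" in field:
                flat["raw_value"] = field["raw_value"]
        else:
            flat[key] = field
    if "type" in prop:
        flat["type"] = prop["type"]
    return flat


nested_depth = {
    "attribute": {"id": "MIXS:0000097", "label": "depth"},
    "value": {"raw_value": "2.2m", "value": 2.2},
    "unit": {"id": "UO:0000008", "label": "meter"},
    "type": "float",
}
flat_depth = flatten(nested_depth)
```

The awkward value.value case needs its own branch here, which is one practical argument for the flat form: every key maps to a scalar and no special cases are required.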
