BacDive: _extract_value_from_json_path loses ~22% of phenotype data when intermediate nodes are arrays #474

@turbomam

Summary

BacDive JSON data has inconsistent shapes where the same path can be either a single object or an array of objects. The current _extract_value_from_json_path() function in bacdive.py silently loses data when intermediate nodes are arrays, because it only handles dict traversal.

Estimated data loss: ~22% of METPO phenotype extractions (19,246 of 86,549 record-path combinations)

The Bug

In kg_microbe/transform_utils/bacdive/bacdive.py lines 314-320:

def _extract_value_from_json_path(self, record: dict, json_path: str):
    # ...
    for part in parts:
        if isinstance(current, dict):
            current = current.get(part)
            if current is None:
                return []
        else:
            return []  # ← BUG: If intermediate node is array, returns empty!

When traversing a path like "Morphology.cell morphology.cell shape":

  • If cell morphology is a dict: traversal continues, value extracted ✓
  • If cell morphology is an array: returns [], ALL values lost ✗
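The behaviour can be reproduced outside the transform with a minimal sketch. `extract_dict_only` is a hypothetical standalone copy of the dict-only loop quoted above; its final-value handling is simplified and is an assumption, not the code in bacdive.py.

```python
def extract_dict_only(record: dict, json_path: str) -> list:
    """Hypothetical standalone copy of the dict-only traversal quoted above."""
    current = record
    for part in json_path.split("."):
        if isinstance(current, dict):
            current = current.get(part)
            if current is None:
                return []
        else:
            # A list (or scalar) at an intermediate node ends the traversal.
            return []
    # Simplified leaf handling (assumption): stringify whatever was reached.
    return [str(current).strip()]


as_object = {"Morphology": {"cell morphology": {"cell shape": "coccus-shaped"}}}
as_array = {"Morphology": {"cell morphology": [{"cell shape": "coccus-shaped"}]}}

path = "Morphology.cell morphology.cell shape"
object_result = extract_dict_only(as_object, path)
array_result = extract_dict_only(as_array, path)
print(object_result)  # ['coccus-shaped']
print(array_result)   # []
```

Both records carry the same value; only the shape of the intermediate node differs.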

Data Loss Analysis

Analysis of 99,392 BacDive strains shows these METPO phenotype extraction paths are affected:

Phenotype          Intermediate Path                              Total   Object    Array   % Lost
--------------------------------------------------------------------------------------------------
halophily          Physiology and metabolism.halophily            9,687    3,209    6,478    66.9%
oxygen preference  Physiology and metabolism.oxygen tolerance    23,255   18,655    4,600    19.8%
cell shape         Morphology.cell morphology                    16,032   13,251    2,781    17.3%
gram stain         Morphology.cell morphology                    16,032   13,251    2,781    17.3%
motility           Morphology.cell morphology                    16,032   13,251    2,781    17.3%
biosafety level    Safety information.risk assessment            31,642   26,678    4,964    15.7%
sporulation        Physiology and metabolism.spore formation      5,443    5,050      393     7.2%
trophic type       Physiology and metabolism.nutrition type         490      460       30     6.1%
--------------------------------------------------------------------------------------------------
TOTAL                                                            86,549   67,303   19,246    22.2%

TOTAL counts the shared Morphology.cell morphology path once, so it sums six distinct intermediate paths rather than eight rows.
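The totals can be re-derived from the per-path counts. The sketch below copies the Object and Array columns from the table and assumes each distinct intermediate path contributes once (the three phenotypes under Morphology.cell morphology share one path).

```python
# (object, array) counts per distinct intermediate path, copied from the table above.
paths = {
    "Physiology and metabolism.halophily": (3209, 6478),
    "Physiology and metabolism.oxygen tolerance": (18655, 4600),
    "Morphology.cell morphology": (13251, 2781),  # cell shape, gram stain, motility
    "Safety information.risk assessment": (26678, 4964),
    "Physiology and metabolism.spore formation": (5050, 393),
    "Physiology and metabolism.nutrition type": (460, 30),
}

total_object = sum(obj for obj, _ in paths.values())
total_array = sum(arr for _, arr in paths.values())
total = total_object + total_array
pct_lost = round(100 * total_array / total, 1)

print(total_object, total_array, total, pct_lost)  # 67303 19246 86549 22.2
```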

Verified Example

Document 98 - cell morphology is object (works):

{
  "Morphology": {
    "cell morphology": {
      "@ref": 119306,
      "gram stain": "negative",
      "cell shape": "coccus-shaped",
      "motility": "no"
    }
  }
}

Document 99 - cell morphology is array (data lost):

{
  "Morphology": {
    "cell morphology": [
      {"@ref": 22965, "gram stain": "negative", "cell shape": "coccus-shaped", "motility": "no"},
      {"@ref": 67771, "cell shape": "coccus-shaped"},
      {"@ref": 67771, "gram stain": "negative"},
      {"@ref": 120258, "gram stain": "negative", "cell shape": "coccus-shaped", "motility": "no"}
    ]
  }
}

For document 99, _extract_value_from_json_path(record, "Morphology.cell morphology.cell shape") returns [] even though three of the four array entries carry a valid cell shape value.

Proposed Fix

Modify _extract_value_from_json_path() to handle arrays at intermediate nodes:

def _extract_value_from_json_path(self, record: dict, json_path: str):
    parts = json_path.split(".")
    current = record

    for i, part in enumerate(parts[:-1]):  # Traverse all but the last part
        if isinstance(current, dict):
            current = current.get(part)
            if current is None:
                return []
        elif isinstance(current, list):
            # Flatten: recurse into each item with the remaining path.
            # Slice by position; parts.index(part) would pick the wrong
            # offset if the same key name appears twice in the path.
            remaining = ".".join(parts[i:])
            results = []
            for item in current:
                if isinstance(item, dict):
                    results.extend(self._extract_value_from_json_path(item, remaining))
            return results
        else:
            return []

    # Handle the final value (existing logic)
    last_key = parts[-1]
    if isinstance(current, list):
        result = []
        for item in current:
            if isinstance(item, dict):
                value = item.get(last_key)
                if value:
                    result.append(str(value).strip())
            elif item:
                result.append(str(item).strip())
        return result
    elif isinstance(current, dict):
        value = current.get(last_key)
        if value:
            return [str(value).strip()]
        return []
    elif current is not None:
        return [str(current).strip()]
    else:
        return []

Or simpler - normalize arrays to be processed element-by-element:

def _extract_value_from_json_path(self, record: dict, json_path: str):
    parts = json_path.split(".")
    
    def traverse(current, remaining_parts):
        if not remaining_parts:
            if current is None:
                return []
            if isinstance(current, list):
                return [str(v).strip() for v in current if v]
            return [str(current).strip()] if current else []
        
        part = remaining_parts[0]
        rest = remaining_parts[1:]
        
        if isinstance(current, dict):
            return traverse(current.get(part), rest)
        elif isinstance(current, list):
            results = []
            for item in current:
                if isinstance(item, dict):
                    results.extend(traverse(item.get(part), rest))
            return results
        else:
            return []
    
    return traverse(record, parts)
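As a quick check of the recursive approach, here is a sketch run against the two verified documents. `extract_values` is a hypothetical free-function variant of the traversal (no `self`), and the sample records are trimmed to the Morphology section shown earlier.

```python
def extract_values(record, json_path):
    """Hypothetical free-function variant of the recursive traversal."""
    def traverse(current, remaining):
        if not remaining:
            if isinstance(current, list):
                return [str(v).strip() for v in current if v]
            return [str(current).strip()] if current else []
        part, rest = remaining[0], remaining[1:]
        if isinstance(current, dict):
            return traverse(current.get(part), rest)
        if isinstance(current, list):
            # Descend into every dict element and concatenate the results.
            results = []
            for item in current:
                if isinstance(item, dict):
                    results.extend(traverse(item.get(part), rest))
            return results
        return []

    return traverse(record, json_path.split("."))


doc_98 = {"Morphology": {"cell morphology": {
    "@ref": 119306, "gram stain": "negative",
    "cell shape": "coccus-shaped", "motility": "no"}}}
doc_99 = {"Morphology": {"cell morphology": [
    {"@ref": 22965, "gram stain": "negative", "cell shape": "coccus-shaped", "motility": "no"},
    {"@ref": 67771, "cell shape": "coccus-shaped"},
    {"@ref": 67771, "gram stain": "negative"},
    {"@ref": 120258, "gram stain": "negative", "cell shape": "coccus-shaped", "motility": "no"},
]}}

shapes_98 = extract_values(doc_98, "Morphology.cell morphology.cell shape")
shapes_99 = extract_values(doc_99, "Morphology.cell morphology.cell shape")
motility_99 = extract_values(doc_99, "Morphology.cell morphology.motility")
print(shapes_98)    # ['coccus-shaped']
print(shapes_99)    # ['coccus-shaped', 'coccus-shaped', 'coccus-shaped']
print(motility_99)  # ['no', 'no']
```

Duplicates are kept, one per array entry that carries the key. Whether to deduplicate is left to the caller in this sketch.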

BacDive Shape Patterns

Full intermediate node shape analysis
INTERMEDIATE NODES: object | array<object>
These nodes can be either a single object OR an array of objects

Path                                                             Object    Array    Total   %Array
----------------------------------------------------------------------------------------------
Culture and growth conditions.culture medium                      19587    21299    40886    52.1%
Culture and growth conditions.culture pH                           1147     5649     6796    83.1%
Culture and growth conditions.culture temp                        32617    16889    49506    34.1%
External links.literature                                         10636     9062    19698    46.0%
External links.phages                                               134       78      212    36.8%
External links.straininfo link                                    45914      171    46085     0.4%
General.NCBI tax id                                               94863     3711    98574     3.8%
General.strain history                                            32204    11317    43521    26.0%
Isolation, sampling and environmental information.isolation       44689    14399    59088    24.4%
Isolation, sampling and environmental information.isolation source categories    11768    31611    43379    72.9%
Morphology.cell morphology                                        13251     2781    16032    17.3%
Morphology.colony morphology                                       8122     2451    10573    23.2%
Morphology.multicellular morphology                                6528     1400     7928    17.7%
Morphology.multimedia                                              3022      788     3810    20.7%
Morphology.pigmentation                                            4090      444     4534     9.8%
Name and taxonomic classification.LPSN.synonyms                   24090    37801    61891    61.1%
Physiology and metabolism.antibiogram                               107      112      219    51.1%
Physiology and metabolism.antibiotic resistance                    4163     2397     6560    36.5%
Physiology and metabolism.compound production                      1615      893     2508    35.6%
Physiology and metabolism.enzymes                                   659    28660    29319    97.8%
Physiology and metabolism.fatty acid profile                       5420      214     5634     3.8%
Physiology and metabolism.halophily                                3209     6478     9687    66.9%
Physiology and metabolism.metabolite production                   15933     7587    23520    32.3%
Physiology and metabolism.metabolite tests                        12093     8106    20199    40.1%
Physiology and metabolism.metabolite utilization                    298    29901    30199    99.0%
Physiology and metabolism.murein                                   1307       12     1319     0.9%
Physiology and metabolism.nutrition type                            460       30      490     6.1%
Physiology and metabolism.observation                              7136     3911    11047    35.4%
Physiology and metabolism.oxygen tolerance                        18655     4600    23255    19.8%
Physiology and metabolism.spore formation                          5050      393     5443     7.2%
Safety information.risk assessment                                26678     4964    31642    15.7%
Sequence information.16S sequences                                21353     5691    27044    21.0%
Sequence information.GC content                                   10426     5390    15816    34.1%
Sequence information.Genome sequences                              4465    13140    17605    74.6%
----------------------------------------------------------------------------------------------
TOTAL                                                            491689   282330   774019    36.5%

Notes

  • The code already handles list vs dict at leaf nodes correctly (e.g., metabolite_utilization, enzymes) using isinstance checks
  • The bug is specifically in intermediate node traversal in _extract_value_from_json_path()
  • Other code paths that directly access fields (like culture medium processing at line 1241) already normalize with if not isinstance(media, list): media = [media]
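The normalization pattern from the last note could be factored into a small helper. This is only a sketch: `as_list` is a hypothetical name, and mapping None to an empty list is an added assumption that the quoted one-liner does not make.

```python
def as_list(value):
    """Hypothetical helper: wrap a lone object in a list, pass lists through."""
    if value is None:
        return []  # assumption: treat a missing node as "no entries"
    return value if isinstance(value, list) else [value]


single = {"@ref": 119306, "cell shape": "coccus-shaped"}
many = [
    {"@ref": 22965, "cell shape": "coccus-shaped"},
    {"@ref": 67771, "gram stain": "negative"},
]

# The same loop now works for both shapes of the intermediate node.
shapes_single = [m["cell shape"] for m in as_list(single) if "cell shape" in m]
shapes_many = [m["cell shape"] for m in as_list(many) if "cell shape" in m]
print(shapes_single)  # ['coccus-shaped']
print(shapes_many)    # ['coccus-shaped']
```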

Analysis Method

Shape analysis was performed on a local MongoDB copy of 99,392 BacDive strains using aggregation queries to count type distribution at each path. The bacdive_meta.property_schemas collection contains inferred JSON schemas that document the anyOf patterns.
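For readers without the MongoDB copy, the same kind of count can be approximated in plain Python over already-loaded records. This is a stand-in sketch, not the aggregation queries used for the analysis; `shape_at` and `count_shapes` are hypothetical names.

```python
from collections import Counter


def shape_at(record, path):
    """Classify the node at a dotted path as object, array, other, or missing."""
    current = record
    for part in path.split("."):
        # Only dict parents are walked here; an array parent counts as missing.
        if not isinstance(current, dict) or part not in current:
            return "missing"
        current = current[part]
    if isinstance(current, dict):
        return "object"
    if isinstance(current, list):
        return "array"
    return "other"


def count_shapes(records, path):
    """Tally the node shape at `path` across an iterable of records."""
    return Counter(shape_at(record, path) for record in records)


records = [
    {"Morphology": {"cell morphology": {"cell shape": "coccus-shaped"}}},
    {"Morphology": {"cell morphology": [{"cell shape": "coccus-shaped"}]}},
    {"Morphology": {}},
]
counts = count_shapes(records, "Morphology.cell morphology")
print(counts["object"], counts["array"], counts["missing"])  # 1 1 1
```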
