Summary
BacDive JSON data has inconsistent shapes where the same path can be either a single object or an array of objects. The current _extract_value_from_json_path() function in bacdive.py silently loses data when intermediate nodes are arrays, because it only handles dict traversal.
Estimated data loss: ~22% of METPO phenotype extractions (19,246 of 86,549 record-path combinations)
The Bug
In kg_microbe/transform_utils/bacdive/bacdive.py lines 314-320:

    def _extract_value_from_json_path(self, record: dict, json_path: str):
        # ...
        for part in parts:
            if isinstance(current, dict):
                current = current.get(part)
                if current is None:
                    return []
            else:
                return []  # ← BUG: If intermediate node is array, returns empty!

When traversing a path like "Morphology.cell morphology.cell shape":
- If `cell morphology` is a dict: traversal continues, value extracted ✓
- If `cell morphology` is an array: returns `[]`, ALL values lost ✗
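
The failure can be reproduced outside the class with a minimal sketch. The function below is a standalone stand-in (hypothetical name `extract_dict_only`) that mirrors only the dict-only loop; it assumes the elided setup splits the path on `.` and starts from the record, and it returns the raw value instead of applying the real method's leaf handling.

```python
def extract_dict_only(record: dict, json_path: str):
    """Standalone stand-in for the dict-only traversal (hypothetical helper)."""
    current = record
    for part in json_path.split("."):  # assumption: the path is split on "."
        if isinstance(current, dict):
            current = current.get(part)
            if current is None:
                return []
        else:
            return []  # an array (or scalar) at an intermediate node ends traversal
    return [current]


path = "Morphology.cell morphology.cell shape"
as_object = {"Morphology": {"cell morphology": {"cell shape": "coccus-shaped"}}}
as_array = {"Morphology": {"cell morphology": [{"cell shape": "coccus-shaped"}]}}

print(extract_dict_only(as_object, path))  # ['coccus-shaped']
print(extract_dict_only(as_array, path))   # []
```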
Data Loss Analysis
Analysis of 99,392 BacDive strains shows these METPO phenotype extraction paths are affected:
| Phenotype | Intermediate Path | Total | Object | Array | % Lost |
|---|---|---|---|---|---|
| halophily | Physiology and metabolism.halophily | 9,687 | 3,209 | 6,478 | 66.9% |
| oxygen preference | Physiology and metabolism.oxygen tolerance | 23,255 | 18,655 | 4,600 | 19.8% |
| cell shape | Morphology.cell morphology | 16,032 | 13,251 | 2,781 | 17.3% |
| gram stain | Morphology.cell morphology | 16,032 | 13,251 | 2,781 | 17.3% |
| motility | Morphology.cell morphology | 16,032 | 13,251 | 2,781 | 17.3% |
| biosafety level | Safety information.risk assessment | 31,642 | 26,678 | 4,964 | 15.7% |
| sporulation | Physiology and metabolism.spore formation | 5,443 | 5,050 | 393 | 7.2% |
| trophic type | Physiology and metabolism.nutrition type | 490 | 460 | 30 | 6.1% |
| TOTAL | (each intermediate path counted once) | 86,549 | 67,303 | 19,246 | 22.2% |
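
The totals can be re-derived from the per-path counts. The sketch below is only an arithmetic check on the table's numbers; note that it counts `Morphology.cell morphology` once even though three phenotypes read from it, which is what reproduces the 86,549 figure.

```python
# (object_count, array_count) per intermediate path, copied from the table.
counts = {
    "Physiology and metabolism.halophily": (3209, 6478),
    "Physiology and metabolism.oxygen tolerance": (18655, 4600),
    "Morphology.cell morphology": (13251, 2781),  # shared by three phenotypes
    "Safety information.risk assessment": (26678, 4964),
    "Physiology and metabolism.spore formation": (5050, 393),
    "Physiology and metabolism.nutrition type": (460, 30),
}

objects = sum(obj for obj, _ in counts.values())
arrays = sum(arr for _, arr in counts.values())
total = objects + arrays

print(objects, arrays, total)          # 67303 19246 86549
print(f"{100 * arrays / total:.1f}%")  # 22.2%
```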
Verified Example
Document 98 - cell morphology is object (works):
{
"Morphology": {
"cell morphology": {
"@ref": 119306,
"gram stain": "negative",
"cell shape": "coccus-shaped",
"motility": "no"
}
}
}
Document 99 - cell morphology is array (data lost):
{
"Morphology": {
"cell morphology": [
{"@ref": 22965, "gram stain": "negative", "cell shape": "coccus-shaped", "motility": "no"},
{"@ref": 67771, "cell shape": "coccus-shaped"},
{"@ref": 67771, "gram stain": "negative"},
{"@ref": 120258, "gram stain": "negative", "cell shape": "coccus-shaped", "motility": "no"}
]
}
}
For document 99, `_extract_value_from_json_path(record, "Morphology.cell morphology.cell shape")` returns `[]` even though 3 of the 4 array entries carry a valid cell shape value.
Proposed Fix
Modify _extract_value_from_json_path() to handle arrays at intermediate nodes:

    def _extract_value_from_json_path(self, record: dict, json_path: str):
        parts = json_path.split(".")
        current = record
        for i, part in enumerate(parts[:-1]):  # Traverse all but the last part
            if isinstance(current, dict):
                current = current.get(part)
                if current is None:
                    return []
            elif isinstance(current, list):
                # Flatten: resolve the rest of the path against every dict in the array.
                # Slice by loop index; parts.index(part) breaks if a key name repeats.
                remaining = ".".join(parts[i:])
                results = []
                for item in current:
                    if isinstance(item, dict):
                        results.extend(self._extract_value_from_json_path(item, remaining))
                return results
            else:
                return []
        # Handle the final value (existing logic)
        last_key = parts[-1]
        if isinstance(current, list):
            result = []
            for item in current:
                if isinstance(item, dict):
                    value = item.get(last_key)
                    if value:
                        result.append(str(value).strip())
                elif item:
                    result.append(str(item).strip())
            return result
        elif isinstance(current, dict):
            value = current.get(last_key)
            if value:
                return [str(value).strip()]
            return []
        elif current is not None:
            return [str(current).strip()]
        else:
            return []

Or simpler - normalize arrays to be processed element-by-element:

    def _extract_value_from_json_path(self, record: dict, json_path: str):
        parts = json_path.split(".")

        def traverse(current, remaining_parts):
            if not remaining_parts:
                if current is None:
                    return []
                if isinstance(current, list):
                    return [str(v).strip() for v in current if v]
                return [str(current).strip()] if current else []
            part = remaining_parts[0]
            rest = remaining_parts[1:]
            if isinstance(current, dict):
                return traverse(current.get(part), rest)
            elif isinstance(current, list):
                results = []
                for item in current:
                    if isinstance(item, dict):
                        results.extend(traverse(item.get(part), rest))
                return results
            else:
                return []

        return traverse(record, parts)

BacDive Shape Patterns
Full intermediate node shape analysis
INTERMEDIATE NODES: object | array<object>
These nodes can be either a single object OR an array of objects
| Path | Object | Array | Total | % Array |
|---|---|---|---|---|
| Culture and growth conditions.culture medium | 19587 | 21299 | 40886 | 52.1% |
| Culture and growth conditions.culture pH | 1147 | 5649 | 6796 | 83.1% |
| Culture and growth conditions.culture temp | 32617 | 16889 | 49506 | 34.1% |
| External links.literature | 10636 | 9062 | 19698 | 46.0% |
| External links.phages | 134 | 78 | 212 | 36.8% |
| External links.straininfo link | 45914 | 171 | 46085 | 0.4% |
| General.NCBI tax id | 94863 | 3711 | 98574 | 3.8% |
| General.strain history | 32204 | 11317 | 43521 | 26.0% |
| Isolation, sampling and environmental information.isolation | 44689 | 14399 | 59088 | 24.4% |
| Isolation, sampling and environmental information.isolation source categories | 11768 | 31611 | 43379 | 72.9% |
| Morphology.cell morphology | 13251 | 2781 | 16032 | 17.3% |
| Morphology.colony morphology | 8122 | 2451 | 10573 | 23.2% |
| Morphology.multicellular morphology | 6528 | 1400 | 7928 | 17.7% |
| Morphology.multimedia | 3022 | 788 | 3810 | 20.7% |
| Morphology.pigmentation | 4090 | 444 | 4534 | 9.8% |
| Name and taxonomic classification.LPSN.synonyms | 24090 | 37801 | 61891 | 61.1% |
| Physiology and metabolism.antibiogram | 107 | 112 | 219 | 51.1% |
| Physiology and metabolism.antibiotic resistance | 4163 | 2397 | 6560 | 36.5% |
| Physiology and metabolism.compound production | 1615 | 893 | 2508 | 35.6% |
| Physiology and metabolism.enzymes | 659 | 28660 | 29319 | 97.8% |
| Physiology and metabolism.fatty acid profile | 5420 | 214 | 5634 | 3.8% |
| Physiology and metabolism.halophily | 3209 | 6478 | 9687 | 66.9% |
| Physiology and metabolism.metabolite production | 15933 | 7587 | 23520 | 32.3% |
| Physiology and metabolism.metabolite tests | 12093 | 8106 | 20199 | 40.1% |
| Physiology and metabolism.metabolite utilization | 298 | 29901 | 30199 | 99.0% |
| Physiology and metabolism.murein | 1307 | 12 | 1319 | 0.9% |
| Physiology and metabolism.nutrition type | 460 | 30 | 490 | 6.1% |
| Physiology and metabolism.observation | 7136 | 3911 | 11047 | 35.4% |
| Physiology and metabolism.oxygen tolerance | 18655 | 4600 | 23255 | 19.8% |
| Physiology and metabolism.spore formation | 5050 | 393 | 5443 | 7.2% |
| Safety information.risk assessment | 26678 | 4964 | 31642 | 15.7% |
| Sequence information.16S sequences | 21353 | 5691 | 27044 | 21.0% |
| Sequence information.GC content | 10426 | 5390 | 15816 | 34.1% |
| Sequence information.Genome sequences | 4465 | 13140 | 17605 | 74.6% |
| TOTAL | 491689 | 282330 | 774019 | 36.5% |
Notes
- The code already handles list vs dict at leaf nodes correctly (e.g., `metabolite_utilization`, `enzymes`) using `isinstance` checks
- The bug is specifically in intermediate node traversal in `_extract_value_from_json_path()`
- Other code paths that directly access fields (like culture medium processing at line 1241) already normalize with `if not isinstance(media, list): media = [media]`
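
The same wrap-in-a-list normalization could in principle be applied at every step of the path walk. The sketch below is one possible shape for that, not the project's implementation: `as_list` and `extract_all` are hypothetical standalone helpers, the leaf handling (stringify and strip truthy values) is an assumption modeled on the snippets in this issue, and the two records are trimmed versions of documents 98 and 99.

```python
def as_list(node):
    """Wrap a single object so objects and arrays are handled uniformly."""
    if node is None:
        return []
    return node if isinstance(node, list) else [node]


def extract_all(record: dict, json_path: str) -> list:
    """Collect every value at json_path, fanning out across arrays at any level."""
    frontier = [record]
    for part in json_path.split("."):
        next_frontier = []
        for node in frontier:
            if isinstance(node, dict):
                next_frontier.extend(as_list(node.get(part)))
        frontier = next_frontier
    # Assumption: leaves are stringified and stripped, falsy values dropped.
    return [str(value).strip() for value in frontier if value]


doc_98 = {"Morphology": {"cell morphology": {"@ref": 119306, "cell shape": "coccus-shaped"}}}
doc_99 = {"Morphology": {"cell morphology": [
    {"@ref": 22965, "cell shape": "coccus-shaped"},
    {"@ref": 67771, "gram stain": "negative"},
    {"@ref": 120258, "cell shape": "coccus-shaped"},
]}}

path = "Morphology.cell morphology.cell shape"
print(extract_all(doc_98, path))  # ['coccus-shaped']
print(extract_all(doc_99, path))  # ['coccus-shaped', 'coccus-shaped']
```

Because the walk keeps a frontier of nodes instead of a single `current`, an array at any depth simply widens the frontier, and entries that lack the requested key drop out without aborting the whole extraction.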
Analysis Method
Shape analysis was performed on a local MongoDB copy of 99,392 BacDive strains using aggregation queries to count type distribution at each path. The bacdive_meta.property_schemas collection contains inferred JSON schemas that document the anyOf patterns.
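
The counting step itself is simple to sketch in plain Python. The example below is not the MongoDB aggregation used for the analysis; it assumes the records are already loaded as a list of dicts, and `count_shapes` is a hypothetical helper that classifies the node found at a dotted path, resolving only through dicts.

```python
from collections import Counter


def count_shapes(records, json_path: str) -> Counter:
    """Tally whether the node at json_path is an object, an array, or absent."""
    tally = Counter()
    for record in records:
        node = record
        for part in json_path.split("."):
            node = node.get(part) if isinstance(node, dict) else None
            if node is None:
                break
        if node is None:
            tally["missing"] += 1
        elif isinstance(node, list):
            tally["array"] += 1
        elif isinstance(node, dict):
            tally["object"] += 1
        else:
            tally["scalar"] += 1
    return tally


records = [
    {"Morphology": {"cell morphology": {"cell shape": "coccus-shaped"}}},
    {"Morphology": {"cell morphology": [{"cell shape": "coccus-shaped"}]}},
    {"Morphology": {}},
]
print(count_shapes(records, "Morphology.cell morphology"))
```

On these three toy records the tally has one `object`, one `array`, and one `missing` entry; run over the full collection, the same per-path tally would yield the Object and Array columns reported in this issue.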