Open
Labels: bug (Something isn't working)
Description
When a SQL on FHIR ViewDefinition includes a resource-level column accessing <element>.text (e.g. code.text) alongside a forEach on <element>.coding (e.g. forEach: "code.coding"), the resulting Spark DataFrame produces incorrect results when converted to Pandas via toPandas().
Behaviour
- With Arrow enabled (spark.sql.execution.arrow.pyspark.enabled = true): toPandas() returns a DataFrame with misaligned column data; values are shifted by one column position.
- With Arrow disabled: toPandas() returns an empty DataFrame (0 rows) while df.count() returns the correct row count.
- df.show() and df.count() work correctly in both cases.
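To help confirm the misalignment, here is a rough diagnostic sketch. It compares rows taken directly from Spark against rows from the Pandas result and reports a consistent column offset if one exists. The helper name `detect_column_shift` and its `max_shift` parameter are made up for illustration and are not part of Pathling or PySpark. It also assumes both sides return rows in the same order and that cell values compare equal with `==`, which may not hold for nulls or NaN.

```python
from typing import Optional, Sequence


def detect_column_shift(
    expected: Sequence[Sequence[object]],
    actual: Sequence[Sequence[object]],
    max_shift: int = 2,
) -> Optional[int]:
    """Return k where actual[r][c + k] == expected[r][c] on all overlapping cells.

    Returns 0 when the rows are aligned, and None when the row counts differ,
    the input is empty, or no single offset explains the data.
    """
    if not expected or len(expected) != len(actual):
        return None
    width = len(expected[0])

    # Try "no shift" first, then small shifts in either direction.
    candidates = [0]
    for k in range(1, max_shift + 1):
        candidates += [k, -k]

    for k in candidates:
        if abs(k) >= width:
            continue  # no overlapping columns left to compare
        if all(
            len(exp) == width
            and len(act) == width
            and all(
                act[c + k] == exp[c]
                for c in range(width)
                if 0 <= c + k < width
            )
            for exp, act in zip(expected, actual)
        ):
            return k
    return None


# Hypothetical usage against the objects from the reproduction:
#   expected = [tuple(row) for row in df.collect()]
#   actual = list(pdf.itertuples(index=False, name=None))
#   print(detect_column_shift(expected, actual))
```

A result of 1 or -1 would be consistent with the one-column shift described above. A result of None with Arrow disabled would simply reflect the differing row counts.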
Minimal reproduction
from pathling import PathlingContext

pc = PathlingContext.create(enable_extensions=True)
# Set to "false" to reproduce the empty-DataFrame variant instead.
pc.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
ds = pc.read.ndjson("/path/to/ndjson")

view_json = """
{
  "resourceType": "ViewDefinition",
  "resource": "Condition",
  "status": "draft",
  "name": "test",
  "select": [
    {
      "column": [
        {"name": "id", "path": "getResourceKey()", "type": "string"},
        {"name": "patient_id", "path": "subject.getReferenceKey(Patient)", "type": "string"},
        {"name": "source_code_text", "path": "code.text", "type": "string"}
      ]
    },
    {
      "forEach": "code.coding",
      "column": [
        {"name": "code_system", "path": "system", "type": "uri"},
        {"name": "code_value", "path": "code", "type": "code"},
        {"name": "code_display", "path": "display", "type": "string"}
      ]
    }
  ]
}
"""

df = ds.view(json=view_json)
print(df.count())    # Correct count
df.show()            # Correct data
pdf = df.toPandas()  # Empty or misaligned
print(pdf.shape)     # (0, 6) with Arrow disabled

Removing the source_code_text column (which accesses code.text) resolves the issue. The problem appears specific to accessing <element>.text at the resource level while also iterating over <element>.coding with forEach: both paths reference the same parent FHIR element (code).
Tested with MIMIC-IV-on-FHIR demo data against Condition resources.
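For anyone wanting to check the "removing the column resolves it" observation without hand-editing the JSON, a small sketch along these lines may help. The helper name `drop_view_column` is made up for illustration and is not a Pathling API. It only looks at top-level select entries and does not descend into nested selects.

```python
import json


def drop_view_column(view_json: str, column_name: str) -> str:
    """Return a copy of the ViewDefinition JSON without the named column.

    Only top-level "select" entries are inspected; nested selects are left as-is.
    """
    view = json.loads(view_json)
    for select in view.get("select", []):
        columns = select.get("column")
        if columns is not None:
            select["column"] = [
                col for col in columns if col.get("name") != column_name
            ]
    return json.dumps(view, indent=2)


# Hypothetical usage with the names from the reproduction:
#   reduced = drop_view_column(view_json, "source_code_text")
#   pdf = ds.view(json=reduced).toPandas()
#   print(pdf.shape)
```

If the reduced view converts cleanly while the original does not, that narrows the problem to the interaction between code.text and the forEach on code.coding.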
Environment
- pathling 9.4.0
- PySpark 4.0.2
- Python 3.14.2
- macOS (Darwin 25.3.0, arm64)
Project status: Planned