toPandas() returns corrupted or empty data when element.text column coexists with forEach on element.coding #2568

@johngrimes

Description

When a SQL on FHIR ViewDefinition includes a resource-level column accessing <element>.text (e.g. code.text) alongside a forEach on <element>.coding (e.g. forEach: "code.coding"), the resulting Spark DataFrame produces incorrect results when converted to Pandas via toPandas().

Behaviour

  • With Arrow enabled (spark.sql.execution.arrow.pyspark.enabled = true): toPandas() returns a DataFrame with column data misaligned — values are shifted by one column position.
  • With Arrow disabled: toPandas() returns an empty DataFrame (0 rows) while df.count() returns the correct row count.
  • df.show() and df.count() work correctly in both cases.
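
To tell the two failure modes apart programmatically, a small helper along the following lines may be useful. This is only a sketch: `classify_mismatch` is a hypothetical name, not part of Pathling or PySpark. It assumes the reference rows and the rows taken from the Pandas DataFrame are plain tuples in the same order, and how the reference rows are obtained from a trusted source is left open.

```python
from typing import Sequence


def classify_mismatch(expected: Sequence[tuple], actual: Sequence[tuple]) -> str:
    """Compare trusted reference rows against rows taken from toPandas().

    Returns one of "ok", "empty", "shifted" or "different".
    Hypothetical helper: assumes both inputs are in the same row order.
    """
    if list(expected) == list(actual):
        return "ok"
    if expected and not actual:
        # Matches the Arrow-disabled symptom: rows exist but none arrive.
        return "empty"

    def shifted(a: tuple, b: tuple) -> bool:
        # True when b holds a's values moved one column to the right.
        return len(a) == len(b) and len(a) > 1 and tuple(a[:-1]) == tuple(b[1:])

    if len(expected) == len(actual) and all(
        shifted(e, a) or shifted(a, e) for e, a in zip(expected, actual)
    ):
        # Matches the Arrow-enabled symptom: values off by one column.
        return "shifted"
    return "different"
```

The rows from the Pandas side could be produced with `list(pdf.itertuples(index=False, name=None))`.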

Minimal reproduction

from pathling import PathlingContext

pc = PathlingContext.create(enable_extensions=True)
pc.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
ds = pc.read.ndjson("/path/to/ndjson")

view_json = """
{
  "resourceType": "ViewDefinition",
  "resource": "Condition",
  "status": "draft",
  "name": "test",
  "select": [
    {
      "column": [
        {"name": "id", "path": "getResourceKey()", "type": "string"},
        {"name": "patient_id", "path": "subject.getReferenceKey(Patient)", "type": "string"},
        {"name": "source_code_text", "path": "code.text", "type": "string"}
      ]
    },
    {
      "forEach": "code.coding",
      "column": [
        {"name": "code_system", "path": "system", "type": "uri"},
        {"name": "code_value", "path": "code", "type": "code"},
        {"name": "code_display", "path": "display", "type": "string"}
      ]
    }
  ]
}
"""

df = ds.view(json=view_json)
print(df.count())      # Correct count
df.show()              # Correct data
pdf = df.toPandas()    # Misaligned with Arrow enabled, empty with Arrow disabled
print(pdf.shape)       # (0, 6) when Arrow is disabled

Removing the source_code_text column (which accesses code.text) resolves the issue. The problem appears specific to accessing <element>.text at the resource level while also iterating over <element>.coding with forEach — both paths reference the same parent FHIR element (code).
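
To check whether other ViewDefinitions contain the same pattern, a rough scan like the one below might help. It is a sketch under stated assumptions: `find_text_coding_overlaps` is a hypothetical helper, it does plain string matching rather than FHIRPath parsing, and it only inspects the top-level `select` entries (nested selects are ignored).

```python
def find_text_coding_overlaps(view: dict) -> list[str]:
    """Return parent elements read as `<parent>.text` in a column and also
    iterated as `<parent>.coding` by a forEach in the same ViewDefinition.

    Hypothetical helper: simple suffix matching on top-level selects only.
    """
    text_parents = set()
    coding_parents = set()

    for select in view.get("select") or []:
        for_each = select.get("forEach") or ""
        if for_each.endswith(".coding"):
            coding_parents.add(for_each[: -len(".coding")])
        if for_each:
            # Column paths here are relative to the forEach context,
            # so they are not resource-level paths; skip them.
            continue
        for column in select.get("column") or []:
            path = column.get("path") or ""
            if path.endswith(".text"):
                text_parents.add(path[: -len(".text")])

    return sorted(text_parents & coding_parents)
```

For the view in the reproduction above, `find_text_coding_overlaps(json.loads(view_json))` would report the `code` element, since it is the parent of both `code.text` and `code.coding`.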

Tested with MIMIC-IV-on-FHIR demo data against Condition resources.

Environment

  • pathling 9.4.0
  • PySpark 4.0.2
  • Python 3.14.2
  • macOS (Darwin 25.3.0, arm64)

Metadata

  • Assignees: none
  • Labels: bug
  • Status: Planned
  • Milestone: none