Skip to content

feat(spark): support Struct type and literal, and include names for struct fields#342

Merged
vbarua merged 9 commits intosubstrait-io:mainfrom
Blizzara:avo/support-structs-and-names
Mar 28, 2025
Merged

feat(spark): support Struct type and literal, and include names for struct fields#342
vbarua merged 9 commits intosubstrait-io:mainfrom
Blizzara:avo/support-structs-and-names

Conversation

@Blizzara
Copy link
Contributor

BREAKING CHANGE: Plan root's "names" field now includes nested names in addition to the top field names. The new behavior is how Substrait spec defines the names, but this change breaks compatibility of existing plans containing struct fields since the names list will not match.

@Blizzara Blizzara changed the title feat(spark: support Struct type and literal, and include names for struct fields feat(spark): support Struct type and literal, and include names for struct fields Mar 12, 2025
ImmutableRoot
.builder()
.input(rel)
.addAllNames(ToSubstraitType.toNamedStruct(p.schema).names())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the produced plans to include DFS names in the root


private def buildNamedScan(schema: StructType, tableNames: List[String]): relation.NamedScan = {
val namedStruct = toNamedStruct(schema)
val namedStruct = ToSubstraitType.toNamedStruct(schema)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just being more explicit here

NamedStruct.of(names, struct)
}

def toStructType(namedStruct: NamedStruct): StructType = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these were previously inside ToSubstraitType, which is confusing given they return Spark types. Now they're moved into the ToSparkType object above, but also rewritten to use the newly-available struct type code.

(We should also split this file, but I'd rather do that as FLUP to keep the diff clean)

@Blizzara Blizzara force-pushed the avo/support-structs-and-names branch 2 times, most recently from 96b721c to eec9d21 Compare March 12, 2025 16:46
typeExpression.accept(new ToSparkType(Seq.empty))
}

def toStructType(namedStruct: NamedStruct): StructType = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved from below, see comment below

.map { case ((t, d), name) => StructField(name, d, t.nullable()) }
)
def toNamedStruct(schema: StructType): NamedStruct = {
val dfsNames = JavaConverters.seqAsJavaList(fieldNamesDfs(schema))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this changes anything using toNamedStruct to produce the full list of names rather than only top-level

def seqToOptionHelper(s: Seq[Option[T]], accum: Seq[T] = Seq[T]()): Option[Seq[T]] = {
s match {
case Some(head) :: Nil =>
case Seq(Some(head)) =>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why but the earlier code just wasn't working

@Blizzara Blizzara force-pushed the avo/support-structs-and-names branch from eec9d21 to 36d3b20 Compare March 12, 2025 16:53
@Blizzara Blizzara marked this pull request as ready for review March 12, 2025 16:53
@Blizzara
Copy link
Contributor Author

@andrew-coleman @vbarua looking for your review for this :) Note that this is a breaking change since the currently produced plans aren't valid Substrait - they lack the inner names, this adds those, but that changes the meaning of the names-fields.

For Andrew: I saw you had added a bit of struct handling in your PR, this is compatible with that but adds the rest of the handling as well.

@vbarua vbarua self-requested a review March 12, 2025 16:58
Comment on lines +93 to +96
require(nameIdx < dfsNames.size)
val n = dfsNames(nameIdx)
nameIdx += 1
n
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
require(nameIdx < dfsNames.size)
val n = dfsNames(nameIdx)
nameIdx += 1
n
dfsNames(i)

Won't nameIdx always be the same i? If so, no need to increment a separate counter.
Incrementing an object-scoped counter assumes that the .map is behaving like a foreach - i.e. guaranteed order of execution, but I think (in theory, at least) the map could execute its functions concurrently (not sure it does, in practice), just assembling the results back into the original sequence.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this further, I guess it's doing this because it's doing a depth-first traversal of a potentially nested struct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, exactly

Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a brief skim, and overall looks reasonable to me.

Thanks for looking at this as well @andrew-coleman, I think you're more familiar with the Spark side as well.

@vbarua vbarua merged commit f27004a into substrait-io:main Mar 28, 2025
13 checks passed
@Blizzara Blizzara deleted the avo/support-structs-and-names branch April 2, 2025 07:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants