Conversation
… in pom.xml and SparkDataset.java
NicoLaval
left a comment
Great job, thanks!
Just one small issue with a spanish comment.
I also added a CI step to inseefr/develop, please, rebase and push, I will fire the new CI after.
}
return casted;

// Se construye una lista de expresiones para castear en una sola transformación
// (English: a list of expressions is built so the cast happens in a single transformation)
hadrienk
left a comment
Looks good. The comment should be in english.
List<Column> castedColumns =
    Arrays.stream(schema.fields())
        .map(
            field -> {
              DataType type = field.dataType();
              Column col = sparkDataset.col(field.name());
              if (type instanceof IntegerType
                  || type instanceof FloatType
                  || type instanceof DecimalType) {
                return col.cast(
                        type instanceof IntegerType ? DataTypes.LongType : DataTypes.DoubleType)
                    .alias(field.name());
              }
              return col;
            })
        .collect(Collectors.toList());

return sparkDataset.select(castedColumns.toArray(new Column[0]));
Consider toArray();
- List<Column> castedColumns =
-     Arrays.stream(schema.fields())
-         .map(
-             field -> {
-               DataType type = field.dataType();
-               Column col = sparkDataset.col(field.name());
-               if (type instanceof IntegerType
-                   || type instanceof FloatType
-                   || type instanceof DecimalType) {
-                 return col.cast(
-                         type instanceof IntegerType ? DataTypes.LongType : DataTypes.DoubleType)
-                     .alias(field.name());
-               }
-               return col;
-             })
-         .collect(Collectors.toList());
- return sparkDataset.select(castedColumns.toArray(new Column[0]));
+ var castColumns =
+     Arrays.stream(schema.fields())
+         .map(
+             field -> {
+               DataType type = field.dataType();
+               Column col = sparkDataset.col(field.name());
+               if (type instanceof IntegerType
+                   || type instanceof FloatType
+                   || type instanceof DecimalType) {
+                 return col.cast(
+                         type instanceof IntegerType ? DataTypes.LongType : DataTypes.DoubleType)
+                     .alias(field.name());
+               }
+               return col;
+             })
+         .toArray(Column[]::new);
+ return sparkDataset.select(castColumns);
I see the tests are failing. Is this because it is a fork, @NicoLaval?
Yes, see #420, I added a simpler test step to exclude SDMX.
Right, @MiguelRosaTauroni could you pull/rebase your branch?
…given in the pr Merge remote-tracking branch 'upstream/develop' into fix-catalyst-error
We're going to release soon. Do you have any emergency on your side for the next release?
The PR addresses a critical issue identified during the execution of VTL scripts using the Trevas engine v1.8.0/v1.9.0 on large datasets within AWS Glue. The problem manifests as a StackOverflowError triggered by Spark Catalyst during the logical plan construction phase.
Root Cause
The error is caused by the addMetadata method in SparkDataset, which iterates over all columns using repeated .withColumn() calls. Each .withColumn() call wraps the current logical plan in a new projection, so the plan depth grows with the number of columns, and Catalyst overflows the stack while traversing the plan on wide datasets.
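The nesting described above can be illustrated without Spark. The sketch below uses hypothetical `Plan`/`Project` classes as stand-ins for Catalyst's logical-plan nodes (they are not Spark APIs): the withColumn-style loop produces one node of plan depth per column, while a select-style rewrite keeps the depth constant regardless of column count.

```java
public class NestedPlanDemo {
    // Hypothetical stand-ins for Catalyst's LogicalPlan / Project nodes,
    // used only to illustrate how plan depth grows.
    interface Plan { int depth(); }

    record Project(Plan child) implements Plan {
        // Catalyst walks its tree recursively, so plan depth ~ recursion depth.
        public int depth() { return 1 + child.depth(); }
    }

    record Source() implements Plan {
        public int depth() { return 1; }
    }

    public static void main(String[] args) {
        int columns = 1_000;

        // withColumn-style: each call wraps the previous plan in a new Project.
        Plan nested = new Source();
        for (int i = 0; i < columns; i++) {
            nested = new Project(nested);
        }

        // select-style: a single Project holds all column expressions at once.
        Plan flat = new Project(new Source());

        System.out.println("nested depth = " + nested.depth()); // grows with column count
        System.out.println("flat depth   = " + flat.depth());   // constant
    }
}
```

With enough columns, the recursive traversal of the nested variant is exactly the kind of call chain that ends in a StackOverflowError, which is why the fix builds all cast expressions first and applies them in a single select.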
More details - #413