Skip to content

Fix catalyst error#419

Merged
NicoLaval merged 3 commits intoInseeFr:developfrom
MiguelRosaTauroni:fix-catalyst-error
Jul 9, 2025
Merged

Fix catalyst error#419
NicoLaval merged 3 commits intoInseeFr:developfrom
MiguelRosaTauroni:fix-catalyst-error

Conversation

@MiguelRosaTauroni
Copy link
Collaborator

The PR addresses a critical issue identified during the execution of VTL scripts using the Trevas engine v1.8.0/v1.9.0 on large datasets within AWS Glue. The problem manifests as a StackOverflowError triggered by Spark Catalyst during the logical plan construction phase.

Root Cause
The error is caused by the addMetadata method in SparkDataset, which iterates over all columns using repeated .withColumn() calls. In Spark, this creates a deeply nested logical plan, causing Catalyst to overflow the stack when handling datasets with many columns.

More details - #413

Copy link
Collaborator

@NicoLaval NicoLaval left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job, thanks!
Just one small issue with a spanish comment.

I also added a CI step to inseefr/develop, please, rebase and push, I will fire the new CI after.

}
return casted;

// Se construye una lista de expresiones para castear en una sola transformación
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In english please :)

Copy link
Collaborator

@hadrienk hadrienk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. The comment should be in english.

Comment on lines +97 to +114
List<Column> castedColumns =
Arrays.stream(schema.fields())
.map(
field -> {
DataType type = field.dataType();
Column col = sparkDataset.col(field.name());
if (type instanceof IntegerType
|| type instanceof FloatType
|| type instanceof DecimalType) {
return col.cast(
type instanceof IntegerType ? DataTypes.LongType : DataTypes.DoubleType)
.alias(field.name());
}
return col;
})
.collect(Collectors.toList());

return sparkDataset.select(castedColumns.toArray(new Column[0]));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider toArray();

Suggested change
List<Column> castedColumns =
Arrays.stream(schema.fields())
.map(
field -> {
DataType type = field.dataType();
Column col = sparkDataset.col(field.name());
if (type instanceof IntegerType
|| type instanceof FloatType
|| type instanceof DecimalType) {
return col.cast(
type instanceof IntegerType ? DataTypes.LongType : DataTypes.DoubleType)
.alias(field.name());
}
return col;
})
.collect(Collectors.toList());
return sparkDataset.select(castedColumns.toArray(new Column[0]));
var castColumns =
Arrays.stream(schema.fields())
.map(
field -> {
DataType type = field.dataType();
Column col = sparkDataset.col(field.name());
if (type instanceof IntegerType
|| type instanceof FloatType
|| type instanceof DecimalType) {
return col.cast(
type instanceof IntegerType ? DataTypes.LongType : DataTypes.DoubleType)
.alias(field.name());
}
return col;
})
.toArray(Column[]::new));
return sparkDataset.select(castColumns);

@hadrienk
Copy link
Collaborator

hadrienk commented Jun 3, 2025

I see the tests are failing. Is this because it if a fork @NicoLaval ?

@NicoLaval
Copy link
Collaborator

I see the tests are failing. Is this because it if a fork @NicoLaval ?

Yes, see #420, I added a simpler test step to exclude SDMX

@hadrienk
Copy link
Collaborator

hadrienk commented Jun 3, 2025

I see the tests are failing. Is this because it if a fork @NicoLaval ?

Yes, see #420, I added a simpler test step to exclude SDMX

Right, @MiguelRosaTauroni could you pull/rebase your branch?

…given in the pr

Merge remote-tracking branch 'upstream/develop' into fix-catalyst-error
@NicoLaval
Copy link
Collaborator

Hi @MiguelRosaTauroni,

Trevas tests excluding vtl-smdx have passed.

We're going to release soon, do you have any emergency on your side for the next release?

@NicoLaval NicoLaval merged commit b2772db into InseeFr:develop Jul 9, 2025
5 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants