Conversation
… in pom.xml and SparkDataset.java
NicoLaval
left a comment
Great job, thanks!
Just one small issue with a spanish comment.
I also added a CI step to inseefr/develop, please, rebase and push, I will fire the new CI after.
}
return casted;

// Se construye una lista de expresiones para castear en una sola transformación
// (English: a list of expressions is built so the cast happens in a single transformation)
hadrienk
left a comment
Looks good. The comment should be in english.
List<Column> castedColumns =
    Arrays.stream(schema.fields())
        .map(
            field -> {
              DataType type = field.dataType();
              Column col = sparkDataset.col(field.name());
              if (type instanceof IntegerType
                  || type instanceof FloatType
                  || type instanceof DecimalType) {
                return col.cast(
                        type instanceof IntegerType ? DataTypes.LongType : DataTypes.DoubleType)
                    .alias(field.name());
              }
              return col;
            })
        .collect(Collectors.toList());

return sparkDataset.select(castedColumns.toArray(new Column[0]));
Consider toArray();
- List<Column> castedColumns =
-     Arrays.stream(schema.fields())
-         .map(
-             field -> {
-               DataType type = field.dataType();
-               Column col = sparkDataset.col(field.name());
-               if (type instanceof IntegerType
-                   || type instanceof FloatType
-                   || type instanceof DecimalType) {
-                 return col.cast(
-                         type instanceof IntegerType ? DataTypes.LongType : DataTypes.DoubleType)
-                     .alias(field.name());
-               }
-               return col;
-             })
-         .collect(Collectors.toList());
- return sparkDataset.select(castedColumns.toArray(new Column[0]));
+ var castColumns =
+     Arrays.stream(schema.fields())
+         .map(
+             field -> {
+               DataType type = field.dataType();
+               Column col = sparkDataset.col(field.name());
+               if (type instanceof IntegerType
+                   || type instanceof FloatType
+                   || type instanceof DecimalType) {
+                 return col.cast(
+                         type instanceof IntegerType ? DataTypes.LongType : DataTypes.DoubleType)
+                     .alias(field.name());
+               }
+               return col;
+             })
+         .toArray(Column[]::new);
+ return sparkDataset.select(castColumns);
I see the tests are failing. Is this because it is a fork, @NicoLaval?
Yes, see #420, I added a simpler test step to exclude SDMX.
Right, @MiguelRosaTauroni could you pull/rebase your branch?
…given in the pr Merge remote-tracking branch 'upstream/develop' into fix-catalyst-error
We're going to release soon. Do you have any emergency on your side for the next release?
The PR addresses a critical issue identified during the execution of VTL scripts using the Trevas engine v1.8.0/v1.9.0 on large datasets within AWS Glue. The problem manifests as a StackOverflowError triggered by Spark Catalyst during the logical plan construction phase.
Root Cause
The error is caused by the addMetadata method in SparkDataset, which iterates over all columns using repeated .withColumn() calls. Each .withColumn() call wraps the current logical plan in a new projection, so the plan depth grows with the number of columns, and Catalyst overflows the stack while traversing the plan on wide datasets.
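The nesting described above can be illustrated without Spark. The sketch below uses hypothetical `Plan`/`Project` classes as stand-ins for Catalyst's logical-plan nodes (they are not Spark APIs): the withColumn-style loop produces one node of plan depth per column, while a select-style rewrite keeps the depth constant regardless of column count.

```java
public class NestedPlanDemo {
    // Hypothetical stand-ins for Catalyst's LogicalPlan / Project nodes,
    // used only to illustrate how plan depth grows.
    interface Plan { int depth(); }

    record Project(Plan child) implements Plan {
        // Catalyst walks its tree recursively, so plan depth ~ recursion depth.
        public int depth() { return 1 + child.depth(); }
    }

    record Source() implements Plan {
        public int depth() { return 1; }
    }

    public static void main(String[] args) {
        int columns = 1_000;

        // withColumn-style: each call wraps the previous plan in a new Project.
        Plan nested = new Source();
        for (int i = 0; i < columns; i++) {
            nested = new Project(nested);
        }

        // select-style: a single Project holds all column expressions at once.
        Plan flat = new Project(new Source());

        System.out.println("nested depth = " + nested.depth()); // grows with column count
        System.out.println("flat depth   = " + flat.depth());   // constant
    }
}
```

With enough columns, the recursive traversal of the nested variant is exactly the kind of call chain that ends in a StackOverflowError, which is why the fix builds all cast expressions first and applies them in a single select.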
More details - #413