All except simplification: Option 1 #1036
Closed
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #761
This PR rewrites
exceptto disallow pointing to nested columns. The columns selection DSL should only be allowed to "select" columns, not restructure it; we havemove,remove, etc. for that.When a user now calls, say
df.select { all() except userData.age }they will receive an
IllegalArgumentException:Cannot exclude the nested columns '[userData/age]' from the column set '[userData/{age, name}, otherCol1, otherCol2]' with the except-functions. Use the `DataFrame<*>.remove { }` operation instead.because
userData.ageis not directly selected byall(), only its parent,userdata, is.TODO:
Some internal struggle: It seems like
exceptwas used internally in a couple places and it relied on the old behavior.Rewriting this now pushes me to this horrible notation:
Before:
df.groupBy { allExcept(columnSetWithNestedCols) }...After:
df.groupBy { (this as DataFrame<*>) .remove(columnSetWithNestedCols) .columns() .toColumnSet() }...(something like
df.remove {}.groupBy {}wouldn't not work, because you still want the columns inside the group)Having to write something like this makes me rethink what we should do about this. Clearly there's a need for selecting the top-level columns of a version of the current dataframe where a (set of) column(s) and their parents are removed. Though, it seems out of the scope for
except(and the 0.15 version ofexcepteven breaks in some cases). We could:exceptand keep it as isremoveon it. Probably also needs adf.toColumnSet()function.@zaleslaw @koperagen @nikitinas @belovrv WDYT?