Consistent Replacement of List Column with NULL #6167

Merged
tdhock merged 31 commits into master from consistentcolreplacement
Jan 18, 2025

Conversation

@joshhwuu
Member

@joshhwuu joshhwuu commented Jun 4, 2024

Closes #5558

Previous behavior

From @tdhock:

In base R data.frame we can replace an element of a list column with NULL via:

> DF1=data.frame(L=I(list("A")),i=1)
> DF1$L=list(NULL)
> DF1
     L i
1 NULL 1

However in data.table, doing that results in deleting the list column entirely:

> DT1=data.table(L=list("A"),i=1)
> DT1$L=list(NULL)
> DT1
       i
   <num>
1:     1

This was reported to be inconsistent with column replacement with more than one row, see:

# old replacement of multiple rows, correct but inconsistent
> DT2=data.table(L=list("B","C"),i=1)
> DT2$L <- list(NULL,NULL)
> DT2
        L     i
   <list> <num>
1:            1
2:            1

Additionally, there was this inconsistency as well:

I can do this:

library(data.table)
DT1=data.table(L=list("A"),i=1)
DT1[, `:=`(L = list(NULL))]

that works, but oddly not

DT1[, L := list(NULL)]

which should be identical per data.table documentation.

Changes

Request: can we make the above code do a replacement (like base R data.frame) instead of deleting the column?

In assign.c, add a new check for whether the passed-in value is list(NULL). If so, replace the list column with a list of NULLs of the same length.

This is the new behavior:

DT = data.table(L = list("A"), i = 1L)
DT$L = list(NULL)
#       L i
# 1: NULL 1

# SAME
DT[, L := list(NULL)]
#       L i
# 1: NULL 1

# SAME
DT[, `:=`(L = list(NULL))]
#       L i
# 1: NULL 1

We no longer delete the column; instead, we replace every row of the column with NULL.

This PR also changes behavior when doing more than one row, to be more consistent with data.frame replacement:

# data.frame replacement:
DF = data.frame(L = I(list("B", "C")), i = 1L)
DF$L = list(NULL)
#      L i
# 1 NULL 1
# 2 NULL 1

# old
DT = data.table(L = list("B", "C"), i = 1L)
DT$L = list(NULL)
#        i
# 1:     1
# 2:     1

# new
DT = data.table(L = list("B", "C"), i = 1L)
DT$L = list(NULL)
#       L i
# 1: NULL 1
# 2: NULL 1

Of course, this works with the other assignment methods.

I also had to change one existing test, test(2058.20), to reflect the new behavior.
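
For reference, the base R semantics that this PR aims to match can be checked without data.table. The sketch below is only an illustration (the variable names `kept` and `dropped` are made up here): in a plain list, assigning `list(NULL)` with single brackets keeps the slot and stores NULL in it, while assigning `NULL` with double brackets removes the slot. The data.frame case is the one quoted in the description above.

```r
# Plain list: the two spellings of "assign NULL" behave differently.
kept <- list(L = "A", i = 1)
kept["L"] <- list(NULL)      # keeps the slot and stores NULL in it

dropped <- list(L = "A", i = 1)
dropped[["L"]] <- NULL       # removes the slot entirely

# data.frame list column, as in the PR description.
DF <- data.frame(L = I(list("B", "C")), i = 1L)
DF$L <- list(NULL)           # the length-1 list is recycled over both rows

print(names(kept))
print(names(dropped))
print(names(DF))
```

The PR's intent, as described above, is for a data.table list column to follow the "kept" pattern when the RHS is `list(NULL)`.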

@github-actions

github-actions bot commented Jun 4, 2024

Comparison Plot

Generated via commit e14d93c

Download link for the artifact containing the test results: ↓ atime-results.zip

Task                                     Duration
R setup and installing dependencies      4 minutes and 38 seconds
Installing different package versions    8 minutes and 11 seconds
Running and plotting the test cases      2 minutes and 20 seconds

Member

@tdhock tdhock left a comment

these changes and tests look good, can you please add a NEWS item.

Member

@Anirban166 Anirban166 left a comment

Thanks for adding the NEWS entry and the changes look good to me as well!

The tests are good too and they work (just tested), but I think it would be neat to comment them or separate them out a bit so one can quickly see what each test does and how it differs from the others. For example, going through your tests from top to bottom, the comments could convey that you: replaced a list column with standard assignment to NULL; did the same using the := syntax (modifying in place); compared with another data.table; replaced multiple elements with NULL; and then followed up with tests similar to the single-element replacement case.

@ben-schwen
Member

I know I'm quite late to the party, but in my opinion, ideally, we would bring the assign.c parts of set and := closer together. Ultimately, this would let us scrub the newcolnames argument from SEXP assign().

@joshhwuu
Member Author

joshhwuu commented Jun 4, 2024

Hm.. not sure I understand what you mean here, I thought the simplest fix would be in assign.c's internal behavior. Could you elaborate on what you mean by bringing the assign.c parts of set and := closer together?

@tdhock
Member

tdhock commented Jun 5, 2024

this may be a breaking change (revdep checks could fail as a result)
but probably good / worth making the change for consistency.
I wonder if the documentation needs updating? @avimallu wrote "should be identical as per data.table documentation" -- what part of the documentation was that, and should it be clarified to reflect this change?

@ben-schwen
Member

> Hm.. not sure I understand what you mean here, I thought the simplest fix would be in assign.c's internal behavior. Could you elaborate on what you mean by bringing the assign.c parts of set and := closer together?

Currently, there are multiple ways to alter/add new columns to a data.table, e.g. via := or set.
However, set and := both call SEXP assign but use different parts of the underlying C code. I think the goal should be for both to use the same code, reducing the complexity and the amount of code to maintain in our code base.

@joshhwuu
Member Author

joshhwuu commented Jun 5, 2024

Oh I see. Do you propose that we try to include the changes in this PR, or is it worth filing a separate issue?

@ben-schwen
Member

There are already multiple issues about the divergence of set and :=. It does not have to be this PR, I just thought that this might be an interesting topic to work on in GSOC (maybe as stacked PR)

@joshhwuu
Member Author

joshhwuu commented Jun 6, 2024

> I wonder if the documentation needs updating? @avimallu wrote "should be identical as per data.table documentation" -- what part of the documentation was that, and should it be clarified to reflect this change?

@tdhock
While I can't say for sure which part of the documentation @avimallu was referring to, here are a few parts of the data.table documentation talking about the two forms:

Reference Semantics Vignette

b) The := operator

It can be used in j in two ways:

(a) The LHS := RHS form

DT[, c("colA", "colB", ...) := list(valA, valB, ...)]

# when you have only one column to assign to you
# can drop the quotes and list(), for convenience
DT[, colA := valA]

(b) The functional form

DT[, `:=`(colA = valA, # valA is assigned to colA
          colB = valB, # valB is assigned to colB
          ...
)]
  • In (a), LHS takes a character vector of column names and RHS a list of values. RHS just needs to be a list, irrespective of how its generated (e.g., using lapply(), list(), mget(), mapply() etc.). This form is usually easy to program with and is particularly useful when you don't know the columns to assign values to in advance.

  • On the other hand, (b) is handy if you would like to jot some comments down for later.

This vignette implies that the two forms differ mainly in syntax, with the functional form being more verbose. IMO, it implies(?) that the two are the same.

Assignment by reference doc

# 1. LHS := RHS form
DT[i, LHS := RHS, by = ...]
DT[i, c("LHS1", "LHS2") := list(RHS1, RHS2), by = ...]

# 2a. Functional form with `:=`
DT[i, `:=`(LHS1 = RHS1,
           LHS2 = RHS2,
           ...), by = ...]

# 2b. Functional form with let
DT[i, let(LHS1 = RHS1,
           LHS2 = RHS2,
           ...), by = ...]

I think this documentation also implies that the different usages both work. It does state that let and functional form are equivalent. So, I will add some tests in this PR to check that using let works the same as well, for thoroughness :)

Although these two documents imply that the results of either form are largely the same, I haven't found anywhere in the documentation that says they are always guaranteed to be the same. While searching for this on Google, I found this Stack Overflow thread about different results when using the functional form and assigning by reference: https://stackoverflow.com/questions/44067091/different-results-for-standard-form-and-functional-form-of-data-table-assigne-by

Jan explained here that there are slight differences in how RHS is handled causing a difference in output between the two forms depending on whether the data we are assigning is a vector or a list:

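library(data.table)
v <- c('A', 'B', 'C')  # not defined in the original snippet; inferred from the output below
dt <- data.table(a = c('a','b','c'))
l <- list(v)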

print(copy(dt)[, new := l])
print(copy(dt)[, `:=` (new = l)])
        a    new
   <char> <char>
1:      a      A
2:      b      B
3:      c      C
        a    new
   <char> <list>
1:      a  A,B,C
2:      b  A,B,C
3:      c  A,B,C

This is still true as of current master (just tested), so I believe we shouldn't explicitly state that the results will be the exact same. But we should note that in most cases, the two forms are the same, which I believe the current documentation implies.
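
One way to picture the difference is the base R sketch below. It only models the list-wrapping described in this thread (the functional form behaving like a call to list()); it is not data.table's actual internals. The names `rhs_standard` and `rhs_functional` are made up for illustration, and `v` is assumed to be c('A', 'B', 'C') to match the output above.

```r
v <- c("A", "B", "C")   # assumed value, inferred from the printed output
l <- list(v)

# Standard form, new := l: the RHS is already a list, so it is read as
# "one list element per target column".
rhs_standard <- l

# Functional form, `:=`(new = l), modeled as list(new = l): the RHS list
# gets wrapped once more, so the column value is the list itself.
rhs_functional <- list(new = l)

column_standard   <- rhs_standard[[1L]]
column_functional <- rhs_functional[[1L]]

print(typeof(column_standard))
print(typeof(column_functional))
```

Under this model the standard form yields a character column, while the functional form yields a list column whose single element is recycled across rows, which is consistent with the two printed tables.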

@avimallu
Contributor

avimallu commented Jun 6, 2024

> This vignette implies that the two forms differ mainly in syntax, with the functional form being more verbose. IMO, it implies(?) that the two are the same.

This was my interpretation when I commented on the issue!

@tdhock
Member

tdhock commented Jun 7, 2024

would be good to clarify the docs: explicitly write that they should be the same, and when they are expected to be different

@joshhwuu
Member Author

joshhwuu commented Jun 7, 2024

TBH @tdhock I'm still a little confused about the exact differences between the standard and functional forms of assigning by reference. I'd like to ask for input from Jan (and others) to help me understand it. Since this PR wasn't intended to fix documentation and the topic is only loosely related, WDYT about filing a separate issue for that? It would also keep the log on this PR a little clearer.

Otherwise, if you think the vignette update is clear enough then we could keep it in this PR, however I'm having trouble reasoning why exactly the above behavior happens. My line of thinking at the moment is that because := is like an alias to list in functional form, then when we do:

dt[, `:=`(new = list(1:3))]

it is essentially equivalent to:

dt[, new := list(new = list(1:3))]

Since this is true (just tried), I wonder how the wrapping of RHS by list in standard form vs not wrapping in functional form is relevant

@joshhwuu
Member Author

Hmm.. It seems that there's been an oversight on my end. While revisiting the code/documentation change again, I realized that since we know that the functional form wraps RHS in a list, SEXP assign will interpret this as a null replacement instead of deletion. I tested and it seems that I was right:

> DT = data.table(L = list('A'), i = 1)
> DT[, `:=`(L = NULL)]
> DT
#         L     i
#    <list> <num>
# 1: [NULL]     1

I think this can be fixed, but I'll need some time to think of a good solution, suggestions are welcome. I'll be reorganizing the unit tests to be more comprehensive and use all forms of assignment to thoroughly test. Thanks for everyone's patience!

@joshhwuu
Member Author

joshhwuu commented Jun 20, 2024

Organized and added some tests. I also changed the list-wrapping behavior of the RHS in the functional form: a singular NULL is no longer wrapped, which allows us to remove columns by assigning NULL with the functional form, as listed in the table above.

@tdhock
Member

tdhock commented Jun 20, 2024

looks good to me, thanks for the extensive tests

@MichaelChirico
Member

MichaelChirico commented Jan 18, 2025

This is a truly excellent PR, sorry it took so long to review!

I tidied things up very slightly, and added one more set of tests to cover one more situation: sub-assignment (i.e., cases where only some but not all rows are edited).

@MichaelChirico
Member

I suspect this will cause some revdep breakages. We should consider whether it's possible to retain the old behavior. I think some inconsistency here is impossible to avoid.

We've gone from apparent inconsistency:

DT[, list_col := list(NULL)] # delete
DT[, char_col := 2L]         # overwrite

To apparent inconsistency:

DT[, one_col    := list(NULL)]       # overwrite
DT[, (two_cols) := list(NULL, NULL)] # delete

All of this is ultimately a consequence of the convenience to have "naked" RHS of := instead of always requiring it to be list()-wrapped. Which I think is a good choice the large majority of the time (since list columns are not the norm).

@codecov

codecov bot commented Jan 18, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.62%. Comparing base (6641ca0) to head (e14d93c).
Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6167   +/-   ##
=======================================
  Coverage   98.62%   98.62%           
=======================================
  Files          79       79           
  Lines       14642    14645    +3     
=======================================
+ Hits        14441    14444    +3     
  Misses        201      201           


@joshhwuu
Member Author

> This is a truly excellent PR, sorry it took so long to review!

No worries! This one is quite a lot of reading so I appreciate the time you took to review it.

I am always leaning towards the side that we should make users' experience as smooth as possible, and if revdeps are a concern then I'm always happy to stick with current behavior.

With that being said, IMO this behavior:

DT[, one_col    := list(NULL)]       # overwrite
DT[, (two_cols) := list(NULL, NULL)] # delete

doesn't look too inconsistent to me. This is mainly because when I see multiple entries on the LHS then I'd expect that each column on the LHS be assigned a corresponding value from the RHS. In this case both columns specified in the LHS would be assigned a value of NULL each, hence deleting both and that makes sense to me.

@tdhock
Member

tdhock commented Jan 18, 2025

this definitely fixes my original issue thanks!
let's merge and check revdeps

names(jsub)=""
jsub[[1L]]=as.name("list")
# dont wrap the RHS in list if it is a singular NULL and if not creating a new column
if (length(jsub[-1L]) == 1L && as.character(jsub[-1L]) == 'NULL' && all(lhs %chin% names_x)) jsub[[1L]]=as.name("identity") else jsub[[1L]]=as.name("list")
Member

@joshhwuu reading a bit more carefully here, part of the issue is relying on literal NULL being used (as opposed to using null_variable where null_variable=NULL). At a minimum, we should check the actual value of j[[1L]], see markfairbanks/tidytable#831.
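
The concern can be illustrated in base R alone. This is a sketch under stated assumptions, not the fix that was applied in data.table: `is_literal_null` mirrors the textual condition from the diff above, and `is_null_value` is a hypothetical value-based alternative. Both helper names are invented for this example.

```r
null_variable <- NULL

# Unevaluated j expressions, as data.table would see them via substitute().
jsub_literal  <- quote(`:=`(L = NULL))
jsub_variable <- quote(`:=`(L = null_variable))

# Textual check, as in the diff: it only inspects how the RHS is spelled.
is_literal_null <- function(jsub) {
  length(jsub[-1L]) == 1L && as.character(jsub[-1L]) == "NULL"
}

# Hypothetical value-based check: evaluate the single RHS expression and
# test the resulting value instead of its spelling.
is_null_value <- function(jsub, env = parent.frame()) {
  length(jsub) == 2L && is.null(eval(jsub[[2L]], env))
}

print(is_literal_null(jsub_literal))
print(is_literal_null(jsub_variable))
print(is_null_value(jsub_literal))
print(is_null_value(jsub_variable))
```

The textual check treats the two calls differently even though both right-hand sides evaluate to NULL, which is the kind of mismatch the comment above points at.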

MichaelChirico added a commit that referenced this pull request Jan 30, 2025
* revert #6167 (new rules on list(NULL) assignment)

* restore last missing line

* restore tests to working state on current master

@MichaelChirico deleted the consistentcolreplacement branch July 8, 2025 17:47

Development

Successfully merging this pull request may close these issues.

Inconsistent replacement of list column element with NULL in table with 1 row

7 participants