Consistent Replacement of List Column with NULL by joshhwuu · Pull Request #6167 · Rdatatable/data.table

joshhwuu · 2024-06-04T00:59:36Z

Previous behavior

In base R data.frame we can replace an element of a list column with NULL via:

> DF1=data.frame(L=I(list("A")),i=1)
> DF1$L=list(NULL)
> DF1
     L i
1 NULL 1

However in data.table, doing that results in deleting the list column entirely:

> DT1=data.table(L=list("A"),i=1)
> DT1$L=list(NULL)
> DT1
       i
   <num>
1:     1

This was reported to be inconsistent with column replacement with more than one row, see:

# old replacement of multiple rows, correct but inconsistent
> DT2=data.table(L=list("B","C"),i=1)
> DT2$L <- list(NULL,NULL)
> DT2
        L     i
   <list> <num>
1:            1
2:            1

Additionally, there was this inconsistency as well:

I can do this:
library(data.table)
DT1=data.table(L=list("A"),i=1)
DT1[, `:=`(L = list(NULL))]
that works, but oddly not
DT1[, L := list(NULL)]
which should be identical per data.table documentation.

Changes

Request: can we make the above code do a replacement (like base R data.frame) instead of deleting the column?

In assign.c, add a new check to see whether the passed-in value is list(NULL). If so, replace the list column with a list of NULL(s) of the same length.
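
The check described above can be sketched in plain R. This is only an illustration of the rule, not the C code in assign.c; both helper names (`is_list_of_single_null`, `expand_null_replacement`) are hypothetical and do not exist in data.table.

```r
# Hypothetical helper: TRUE only when `value` is exactly list(NULL).
is_list_of_single_null <- function(value) {
  is.list(value) && length(value) == 1L && is.null(value[[1L]])
}

# Hypothetical helper: expand list(NULL) to one NULL per row,
# and leave every other value untouched.
expand_null_replacement <- function(value, nrow) {
  if (is_list_of_single_null(value)) vector("list", nrow) else value
}

expand_null_replacement(list(NULL), 2L)  # a list of two NULLs
expand_null_replacement(list("A"), 2L)   # unchanged: list("A")
```

A bare `NULL` is not a list, so it does not match the check and would still mean "delete the column" under this rule.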

This is the new behavior:

DT = data.table(L = list("A"), i = 1L)
DT$L = list(NULL)
#       L i
# 1: NULL 1

# SAME
DT[, L := list(NULL)]
#       L i
# 1: NULL 1

# SAME
DT[, `:=`(L = list(NULL))]
#       L i
# 1: NULL 1

We no longer delete the column; instead, we replace its rows with NULLs.

This PR also changes behavior when doing more than one row, to be more consistent with data.frame replacement:

# data.frame replacement:
DF = data.frame(L = I(list("B", "C")), i = 1L)
DF$L = list(NULL)
#      L i
# 1 NULL 1
# 2 NULL 1

# old
DT = data.table(L = list("B", "C"), i = 1L)
DT$L = list(NULL)
#        i
# 1:     1
# 2:     1

# new
DT = data.table(L = list("B", "C"), i = 1L)
DT$L = list(NULL)
#       L i
# 1: NULL 1
# 2: NULL 1

Of course, this works with the other assignment methods.

Had to change one old test, test(2058.20) to reflect the new behavior as well.

github-actions · 2024-06-04T01:17:11Z

Generated via commit e14d93c

Download link for the artifact containing the test results: ↓ atime-results.zip

Task	Duration
R setup and installing dependencies	4 minutes and 38 seconds
Installing different package versions	8 minutes and 11 seconds
Running and plotting the test cases	2 minutes and 20 seconds

tdhock

these changes and tests look good, can you please add a NEWS item.

Anirban166

Thanks for adding the NEWS entry and the changes look good to me as well!

As for the tests, they are good too and work (just tested), but I think it might be neat to comment or separate them out a bit so that one can quickly see what each test is doing and how it differs from the others. For example, for your tests from top to bottom, the comments could convey that you replaced a list column with standard assignment to NULL, did the same using the := syntax (modifying in place), compared with another data.table, replaced multiple elements with NULL, and then followed up with tests similar to the single-element replacement case.

ben-schwen · 2024-06-04T21:12:56Z

I know I'm quite late to the party, but in my opinion we would ideally bring the assign.c parts of set and := closer together. Ultimately, this would let us scrub the newcolnames argument from SEXP assign().

joshhwuu · 2024-06-04T22:17:43Z

Hm.. not sure I understand what you mean here, I thought the simplest fix would be in assign.c's internal behavior. Could you elaborate on what you mean by bringing the assign.c parts of set and := closer together?

tdhock · 2024-06-05T00:59:35Z

this may be a breaking change (revdep checks could fail as a result)
but probably good / worth making the change for consistency.
I wonder if the documentation needs updating? @avimallu wrote "should be identical as per data.table documentation" -- what part of the documentation was that, and should it be clarified to reflect this change?

ben-schwen · 2024-06-05T14:57:41Z

Hm.. not sure I understand what you mean here, I thought the simplest fix would be in assign.c's internal behavior. Could you elaborate on what you mean by bringing the assign.c parts of set and := closer together?

Currently, there are multiple ways to alter/add new columns to a data.table, e.g. via := or set.
However, set and := both call SEXP assign but use different parts in the underlying C code. I think the goal would be that both use the same code reducing the complexity and the code to maintain in our code base.
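
As a small illustration of the two user-facing routes mentioned here, the sketch below adds a column once with `:=` and once with `set()`. It assumes data.table is installed; the column names `b` and `c` are made up for the example, and it says nothing about which internal branches of the C code each route takes.

```r
library(data.table)

DT <- data.table(a = 1:3)

# Route 1: `:=` inside `[.data.table`
DT[, b := a * 2L]

# Route 2: set(), called directly without going through `[.data.table`
set(DT, j = "c", value = DT$a * 2L)

# Both routes should produce the same integer column
identical(DT$b, DT$c)
```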

joshhwuu · 2024-06-05T18:54:50Z

Oh I see. Do you propose that we try to include the changes in this PR, or is it worth filing a separate issue?

ben-schwen · 2024-06-05T19:16:52Z

There are already multiple issues about the divergence of set and :=. It does not have to be this PR, I just thought that this might be an interesting topic to work on in GSOC (maybe as stacked PR)

joshhwuu · 2024-06-06T19:57:16Z

I wonder if the documentation needs updating? @avimallu wrote "should be identical as per data.table documentation" -- what part of the documentation was that, and should it be clarified to reflect this change?

@tdhock
While I can't say for sure which part of the documentation @avimallu was referring to, here are a few parts of the data.table documentation talking about the two forms:

Reference Semantics Vignette

b) The := operator

It can be used in j in two ways:

(a) The LHS := RHS form
DT[, c("colA", "colB", ...) := list(valA, valB, ...)]

# when you have only one column to assign to you
# can drop the quotes and list(), for convenience
DT[, colA := valA]
(b) The functional form
DT[, `:=`(colA = valA, # valA is assigned to colA
          colB = valB, # valB is assigned to colB
          ...
)]
In (a), LHS takes a character vector of column names and RHS a list of values. RHS just needs to be a list, irrespective of how its generated (e.g., using lapply(), list(), mget(), mapply() etc.). This form is usually easy to program with and is particularly useful when you don't know the columns to assign values to in advance.

On the other hand, (b) is handy if you would like to jot some comments down for later.

This vignette implies that the two forms exist side by side, with the primary difference being syntax and the functional form being more chatty. IMO, it implies(?) that the two are the same.

Assignment by reference doc

# 1. LHS := RHS form
DT[i, LHS := RHS, by = ...]
DT[i, c("LHS1", "LHS2") := list(RHS1, RHS2), by = ...]

# 2a. Functional form with `:=`
DT[i, `:=`(LHS1 = RHS1,
           LHS2 = RHS2,
           ...), by = ...]

# 2b. Functional form with let
DT[i, let(LHS1 = RHS1,
           LHS2 = RHS2,
           ...), by = ...]

I think this documentation also implies that the different usages both work. It does state that let and functional form are equivalent. So, I will add some tests in this PR to check that using let works the same as well, for thoroughness :)

Although these two pieces of documentation imply that the results of either form are largely the same, I haven't found anywhere in the documentation that says they are always guaranteed to be the same. While searching for this on Google, I found this Stack Overflow thread about different results when using the functional form and the standard form of assignment by reference: https://stackoverflow.com/questions/44067091/different-results-for-standard-form-and-functional-form-of-data-table-assigne-by

Jan explained here that there are slight differences in how RHS is handled causing a difference in output between the two forms depending on whether the data we are assigning is a vector or a list:

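library(data.table)
dt <- data.table(a = c('a','b','c'))
v <- c('A','B','C')  # v was not defined in the original comment; inferred from the output below
l <- list(v)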

print(copy(dt)[, new := l])
print(copy(dt)[, `:=` (new = l)])
        a    new
   <char> <char>
1:      a      A
2:      b      B
3:      c      C
        a    new
   <char> <list>
1:      a  A,B,C
2:      b  A,B,C
3:      c  A,B,C

This is still true as of current master (just tested), so I believe we shouldn't explicitly state that the results will be the exact same. But we should note that in most cases, the two forms are the same, which I believe the current documentation implies.
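
One way to picture the difference, using base R only: this sketch assumes the functional form is rewritten into a `list(...)` call before evaluation (as the `jsub[[1L]]=as.name("list")` line in R/data.table.R suggests), while the standard form uses a list RHS as it is. The names `rhs_standard` and `rhs_functional` are illustrative, and `v` is assumed to be `c("A", "B", "C")` based on the printed output.

```r
v <- c("A", "B", "C")   # assumed value, inferred from the printed output
l <- list(v)

# Standard form `new := l`: the list is taken as "one entry per column",
# so the single column receives the character vector itself.
rhs_standard <- l

# Functional form `:=`(new = l): the arguments are wrapped in list(),
# so the single column receives a list, i.e. it becomes a list column.
rhs_functional <- list(new = l)

class(rhs_standard[[1L]])    # "character"
class(rhs_functional[[1L]])  # "list"
```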

avimallu · 2024-06-06T21:57:21Z

This vignette implies that the two forms exist side by side, with the primary difference being syntax and the functional form being more chatty. IMO, it implies(?) that the two are the same.

This was my interpretation when I commented on the issue!

tdhock · 2024-06-07T12:35:52Z

would be good to clarify the docs, explicitly write that they should be the same, and when they are expected to be different

joshhwuu · 2024-06-07T21:16:52Z

TBH @tdhock I'm still a little confused on the exact differences between standard and functional form of assigning by reference. I want to ask for some of Jan's (and others) input to help me understand it. Plus it'll keep the logs on this PR a little clearer, as this PR didn't intend to fix documentation but is only slightly related, WDYT about filing a separate issue for that?

Otherwise, if you think the vignette update is clear enough then we could keep it in this PR, however I'm having trouble reasoning why exactly the above behavior happens. My line of thinking at the moment is that because := is like an alias to list in functional form, then when we do:

dt[, `:=`(new = list(1:3))]

it is essentially equivalent to:

dt[, new := list(new = list(1:3))]

Since this is true (just tried), I wonder how the wrapping of RHS by list in standard form vs not wrapping in functional form is relevant

joshhwuu · 2024-06-19T01:29:20Z

Hmm.. It seems that there's been an oversight on my end. While revisiting the code/documentation change again, I realized that since we know that the functional form wraps RHS in a list, SEXP assign will interpret this as a null replacement instead of deletion. I tested and it seems that I was right:

> DT = data.table(L = list('A'), i = 1)
> DT[, `:=`(L = NULL)]
> DT
#         L     i
#    <list> <num>
# 1: [NULL]     1

I think this can be fixed, but I'll need some time to think of a good solution, suggestions are welcome. I'll be reorganizing the unit tests to be more comprehensive and use all forms of assignment to thoroughly test. Thanks for everyone's patience!

joshhwuu · 2024-06-20T00:19:52Z

Organized and added some tests, changed list wrapping behavior of rhs to not wrap (functional form only) when rhs is a singular NULL, thus allowing us to remove columns by assigning to NULL with functional form, as listed in the table above.

tdhock · 2024-06-20T00:46:19Z

looks good to me, thanks for the extensive tests

MichaelChirico · 2025-01-18T00:34:48Z

This is a truly excellent PR, sorry it took so long to review!

I tidied things up very slightly, and added one more set of tests to cover one more situation: sub-assignment (i.e., cases where only some but not all rows are edited).

MichaelChirico · 2025-01-18T00:57:13Z

I suspect this will cause some revdep breakages. We should consider whether it's possible to retain the old behavior. I think some inconsistency here is impossible to avoid.

We've gone from apparent inconsistency:

DT[, list_col := list(NULL)] # delete
DT[, char_col := 2L]         # overwrite

To apparent inconsistency:

DT[, one_col    := list(NULL)]       # overwrite
DT[, (two_cols) := list(NULL, NULL)] # delete

All of this is ultimately a consequence of the convenience to have "naked" RHS of := instead of always requiring it to be list()-wrapped. Which I think is a good choice the large majority of the time (since list columns are not the norm).

codecov · 2025-01-18T01:05:55Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.62%. Comparing base (6641ca0) to head (e14d93c).
Report is 1 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #6167   +/-   ##
=======================================
  Coverage   98.62%   98.62%           
=======================================
  Files          79       79           
  Lines       14642    14645    +3     
=======================================
+ Hits        14441    14444    +3     
  Misses        201      201

joshhwuu · 2025-01-18T02:20:24Z

This is a truly excellent PR, sorry it took so long to review!

No worries! This one is quite a lot of reading so I appreciate the time you took to review it.

I am always leaning towards the side that we should make users' experience as smooth as possible, and if revdeps are a concern then I'm always happy to stick with current behavior.

With that being said, IMO this behavior:

DT[, one_col    := list(NULL)]       # overwrite
DT[, (two_cols) := list(NULL, NULL)] # delete

doesn't look too inconsistent to me. This is mainly because when I see multiple entries on the LHS then I'd expect that each column on the LHS be assigned a corresponding value from the RHS. In this case both columns specified in the LHS would be assigned a value of NULL each, hence deleting both and that makes sense to me.

tdhock · 2025-01-18T09:23:14Z

this definitely fixes my original issue thanks!
let's merge and check revdeps

MichaelChirico · 2025-01-23T17:37:03Z

R/data.table.R

          names(jsub)=""
-          jsub[[1L]]=as.name("list")
+          # dont wrap the RHS in list if it is a singular NULL and if not creating a new column
+          if (length(jsub[-1L]) == 1L && as.character(jsub[-1L]) == 'NULL' && all(lhs %chin% names_x)) jsub[[1L]]=as.name("identity") else jsub[[1L]]=as.name("list")


@joshhwuu reading a bit more carefully here, part of the issue is relying on literal NULL being used (as opposed to using null_variable where null_variable=NULL). At a minimum, we should check the actual value of j[[1L]], see markfairbanks/tidytable#831.
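
The distinction between a literal NULL and a variable that holds NULL can be sketched with base R language objects. This is a rough illustration, not the check used in `[.data.table`; the helpers `is_literal_null` and `is_value_null` and the environment `env` are hypothetical.

```r
# Environment standing in for the caller's frame, where null_variable is NULL.
env <- new.env()
assign("null_variable", NULL, envir = env)

literal  <- quote(`:=`(L = NULL))           # NULL written out in the call
indirect <- quote(`:=`(L = null_variable))  # NULL reached through a variable

# Inspects the unevaluated argument: only sees a NULL that is typed literally.
is_literal_null <- function(jsub) {
  length(jsub) == 2L && is.null(jsub[[2L]])
}

# Evaluates the argument first: sees NULL in both cases.
is_value_null <- function(jsub, env) {
  length(jsub) == 2L && is.null(eval(jsub[[2L]], env))
}

is_literal_null(literal)      # TRUE
is_literal_null(indirect)     # FALSE
is_value_null(literal, env)   # TRUE
is_value_null(indirect, env)  # TRUE
```

A check on the unevaluated expression treats the two calls differently even though both assign the same value, which is the pitfall raised in this review comment.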

joshhwuu added 2 commits June 3, 2024 17:14

Changed replacement behavior of list columns

a1a2a4d

use allocNAVector() instead

0a5e1b7

joshhwuu requested review from Anirban166 and tdhock June 4, 2024 00:59

joshhwuu requested review from HughParsonage and MichaelChirico as code owners June 4, 2024 00:59

joshhwuu mentioned this pull request Jun 4, 2024

Master List of data.table Issues for GSoC '24 (Josh) joshhwuu/gsoc-2024#1

Open

11 tasks

HughParsonage approved these changes Jun 4, 2024

tdhock requested changes Jun 4, 2024

add news entry

7fc9d31

Anirban166 approved these changes Jun 4, 2024

comments on tests

4469520

let tests for thoroughness

e0c050c

joshhwuu added 4 commits June 7, 2024 12:31

updated vignette

b04aad9

better

e1a8d78

more

37c4ccc

typo

c69bf06

joshhwuu added 2 commits June 7, 2024 14:33

new line of thinking

810c62f

typo

3583912

new tests, slight change in behavior

ee0a462

Merge remote-tracking branch 'origin/master' into consistentcolreplac…

ac8ce38

…ement

joshhwuu mentioned this pull request Jul 31, 2024

error with set() or := NULL to a list column item #5526

Closed

tdhock added this to the 1.17.0 milestone Aug 7, 2024

MichaelChirico added 4 commits January 17, 2025 15:58

style

089e126

Add some new equivalence tests, spacing

07176c3

Add tests of sub-assignment

05fc065

Extend NEWS with some examples

3ab49af

MichaelChirico added 3 commits January 18, 2025 00:36

Merge branch 'master' into consistentcolreplacement

51666c4

NEWS position

42a80b3

Correct NEWS, and emphasize the breaking change

294f2bf

incorrect tests fixed

e14d93c

tdhock merged commit 1aa92bc into master Jan 18, 2025
11 checks passed

This was referenced Jan 18, 2025

consistent treatment of nested lists in replacement #6732

Open

Fix NA support in pivot_longer markfairbanks/tidytable#831

Open

revdep tidytable test failure after new NULL replacement #6740

Closed

MichaelChirico mentioned this pull request Jan 20, 2025

Inconsistent replacement of list column element with NULL in table with 1 row #5558

Open

MichaelChirico reviewed Jan 23, 2025

MichaelChirico added a commit that referenced this pull request Jan 23, 2025

revert #6167 (new rules on list(NULL) assignment)

b9436f4

MichaelChirico mentioned this pull request Jan 23, 2025

Revert new rules on list(NULL) assignment #6759

Merged

MichaelChirico added a commit that referenced this pull request Jan 30, 2025

Revert new rules on list(NULL) assignment (#6759)

bdbe15a

* revert #6167 (new rules on list(NULL) assignment) * restore last missing line * restore tests to working state on current master

MichaelChirico deleted the consistentcolreplacement branch July 8, 2025 17:47
