* Implement the hash table
* memrecycle(): replace TRUELENGTH marks with a hash
* rbindlist(): replace 1/2 TRUELENGTH with hashing
Also avoid crashing when creating a 0-size hash.
* rbindlist(): replace 2/2 TRUELENGTH with hashing
This is likely to require a dynamically growing hash of TRUELENGTHs
instead of the current pre-allocation approach with its very conservative
over-estimate.
* chmatchMain(): replace TRUELENGTH marks with hash
* copySharedColumns(): hash instead of TRUELENGTH
* combineFactorLevels(): hash instead of TRUELENGTH
* anySpecialStatic(): hash instead of TRUELENGTH
* forder(): hash instead of TRUELENGTH
The hash needs O(n) memory (actually 2*n/load_factor entries), which
isn't great.
* Remove savetl()
* Add codecov suppressions
* Dynamically grow the hash table with bound unknown
In forder() and rbindlist(), no good upper bound on the number of hash
elements is known ahead of time, so grow the hash table dynamically.
Since R/W locks are far too slow and OpenMP atomics are too limited,
rely on strategically placed flushes, which isn't really a solution.
* Minor hash improvements
Use only 28 bits of the pointer (the lower 32 bits, discarding the lowest 4).
Inline the linear search by advancing the slot pointer instead of
recomputing the hash value and taking the modulo on every probe.
Average improvement of 10%.
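As an aside, here is a minimal sketch of the pointer-hashing scheme just described (my own illustration with an assumed mixing constant, not data.table's exact code):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: hash a (CHARSXP) pointer into a power-of-two table with
 * mask+1 slots. The lowest 4 bits of the pointer are dropped because heap
 * alignment makes them zero; of the rest, only the low 28 bits are kept. */
static inline size_t ptr_hash(const void *p, size_t mask) {
  uint32_t bits = (uint32_t)((uintptr_t)p >> 4) & 0x0FFFFFFFu; /* 28 useful bits */
  bits *= 2654435761u;        /* multiplicative mix; constant is illustrative */
  return (size_t)bits & mask; /* mask instead of modulo: table size is 2^k */
}
```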
* dhash: no need to keep previous table
The hash can only be enlarged either from a single-threaded context or
under a critical section, so there is no need to worry about other
threads hitting a use-after-free due to a reallocation. This should
halve the memory used by the hash table.
* use double hashing instead of linear probing (#7418)
* add lookup or insert
* use lookup or insert
* use lookup_or_insert
* really use lookup or insert
* use cuckoo hashing
* add rehash
* use power of 2 and mask instead of modulo
* mix instead of multiplication
* use different mixes
* change multipliers
* use double hashing
* remove xor folding
* Fix allocation non-overflow precondition
* Set the default load factor
* Inline hash_rehash()
* update comments
* Leave overflow checking to R_alloc
* internal_error() is not covered
---------
Co-authored-by: Ivan K <[email protected]>
* replace dhashtab with hashtab in rbindlist
* Use hashtab in forder()
Since range_str() runs a parallel OpenMP loop that may update the hash
table in a critical section, use a special form of hash_set that returns
the newly reallocated hash table instead of overwriting it in place.
* Drop dhashtab
* rbindlist: initial hash size = upperBoundUniqueNames
* Don't bother cleaning the hash before an error
* Avoid setting same key&value twice
* chmatch: hash x instead of table (#7454)
* add hash x branch in chmatch
* adapt kick-in threshold
* make chin branch more explicit
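A hedged sketch of the "hash x instead of table" branch from #7454: when the needle vector x is much shorter than table, hash x once and stream over table in a single pass. The hash_set()/hash_lookup() names mirror those used elsewhere in this PR, but the signatures and the helper itself are illustrative assumptions:

```c
#include <Rinternals.h>

/* Assumed hashtab API, for illustration only. */
typedef struct hashtab hashtab;
void hash_set(hashtab *h, SEXP key, int value);
int  hash_lookup(const hashtab *h, SEXP key, int absent);

/* Sketch: hash x (the needles), then scan table once, recording the first
 * 1-based position at which each needle appears; table elements that are
 * not needles are skipped. */
static void chmatch_hash_x(hashtab *h, const SEXP *xd, int nx,
                           const SEXP *tabled, int ntable,
                           int *ansd, int nomatch) {
  for (int i = 0; i < nx; ++i)
    hash_set(h, xd[i], 0);                   /* needle present, no match yet */
  for (int j = 0; j < ntable; ++j)
    if (hash_lookup(h, tabled[j], -1) == 0)  /* a needle seen for the first time */
      hash_set(h, tabled[j], j + 1);
  for (int i = 0; i < nx; ++i) {
    const int m = hash_lookup(h, xd[i], 0);
    ansd[i] = m ? m : nomatch;
  }
}
```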
* Use linear probing instead of double hashing (#7455)
* use linear probing instead of double hashing
* remove mask from struct
* fix comment
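For reference, a minimal sketch of lookup-or-insert with linear probing over a power-of-two table, the design the bullets above converge on; the slot layout and mixing constant are assumptions, not the actual hashtab code:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { const void *key; int value; } hslot;  /* illustrative slot layout */

/* Sketch: the table has mask+1 slots (a power of two) and the caller keeps
 * the load factor below 1, so the probe loop always stops at an empty slot
 * or a matching key. */
static int lookup_or_insert(hslot *slots, size_t mask,
                            const void *key, int value_if_new) {
  size_t i = ((uint32_t)((uintptr_t)key >> 4) * 2654435761u) & mask;
  while (slots[i].key != NULL && slots[i].key != key)
    i = (i + 1) & mask;                 /* linear probe, wrap around */
  if (slots[i].key == NULL) {           /* not found: insert */
    slots[i].key = key;
    slots[i].value = value_if_new;
  }
  return slots[i].value;
}
```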
* rbindlist(): better initial allocation size
Also adjust the minimum hash table size to avoid a shift overflow when
size=0 and a table with no free slots when size=1.
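A small illustrative sketch of such a sizing rule, with an assumed minimum size and target load factor rather than data.table's actual constants:

```c
#include <stddef.h>

/* Sketch: round the requested capacity up to a power of two, aiming for a
 * load factor of at most 0.5. Starting from a minimum of 4 avoids the
 * degenerate sizes noted above: size 0 (shift by the full word width is
 * undefined) and size 1 (no free slot at which probing can stop). */
static size_t table_size_for(size_t n_items) {
  size_t size = 4;
  while (size < 2 * n_items)
    size <<= 1;
  return size;
}
```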
* dogroups: replace remaining uses of SETLENGTH
* hashtab: switch to C allocator to avoid longjumps
PROTECT() the corresponding EXTPTRSXP while it is in use.
Introduce a separate hash_set_shared() operation that avoids long jumps.
Deallocate the previous hash table when growing a non-shared hash table.
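A hedged sketch of that pattern using the standard R external-pointer API; the struct layout and helper names are illustrative, not data.table's actual code:

```c
#include <Rinternals.h>
#include <stdlib.h>

typedef struct { size_t size, used; void **slots; } hashtab_sketch;  /* illustrative */

static void hashtab_finalize(SEXP xp) {
  hashtab_sketch *h = (hashtab_sketch *) R_ExternalPtrAddr(xp);
  if (h) {
    free(h->slots);
    free(h);
    R_ClearExternalPtr(xp);  /* avoid a double free if the finalizer runs again */
  }
}

/* Wrap a C-heap hash table in an EXTPTRSXP whose finalizer frees it, so the
 * memory is reclaimed even if an R error (longjmp) unwinds past the caller.
 * The caller PROTECT()s the returned SEXP while the table is in use. */
static SEXP hashtab_wrap(hashtab_sketch *h) {
  SEXP xp = PROTECT(R_MakeExternalPtr(h, R_NilValue, R_NilValue));
  R_RegisterCFinalizerEx(xp, hashtab_finalize, TRUE);
  UNPROTECT(1);
  return xp;
}
```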
* range_str: use hash_set_shared in OpenMP region
Explicitly check the return value and update the shared pointer when
necessary. If a reallocation attempt fails, signal an error when
possible.
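A minimal sketch of the calling convention described above; the signature of hash_set_shared() and the helper around it are assumptions:

```c
#include <stdbool.h>
#include <Rinternals.h>

typedef struct hashtab hashtab;                            /* opaque, illustrative */
hashtab *hash_set_shared(hashtab *h, SEXP key, int value); /* assumed signature */

/* Sketch of one insertion from inside the parallel region: the setter never
 * long-jumps; it returns the table pointer (possibly a fresh allocation after
 * growth) or NULL when that allocation fails, and the caller updates the
 * shared pointer, or records the failure, while inside the critical section. */
static bool mark_shared(hashtab **shared, SEXP key, int value) {
  bool ok = true;
  #pragma omp critical(hash_update)
  {
    hashtab *h = hash_set_shared(*shared, key, value);
    if (h != NULL)
      *shared = h;   /* publish the (possibly reallocated) table */
    else
      ok = false;    /* allocation failed; signal an error outside the region */
  }
  return ok;
}
```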
* range_str: allocate temporaries on the R heap
Avoid memory leaks from potential long jumps in hash_set().
* memrecycle: allocate temporary on the R heap
This avoids a memory leak in case growVector or hash_set fails.
* chmatchdup: allocate temporaries on the R heap
Prevent memory leak in case hash_set() causes a long jump.
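A hedged sketch of the R-heap temporaries used by the three items above: buffers from R_alloc() are released automatically when an error long-jumps, unlike calloc()'d buffers, which would leak if hash_set() signalled an error before free() was reached. The helper and sizes here are illustrative:

```c
#include <string.h>
#include <R.h>

/* Sketch only: R_alloc() memory needs no explicit free and is reclaimed when
 * the enclosing .Call returns or an error unwinds; vmaxget()/vmaxset() can
 * bracket the allocation so it is also released early on the success path. */
static void with_r_heap_temporaries(int nuniq, int mapsize) {
  const void *vmax = vmaxget();
  int *counts = (int *) R_alloc(nuniq, sizeof(int));
  int *map    = (int *) R_alloc(mapsize, sizeof(int));
  memset(counts, 0, (size_t)nuniq * sizeof(int));    /* R_alloc does not zero */
  memset(map,    0, (size_t)mapsize * sizeof(int));
  /* ... work that may call hash_set() and hence error() ... */
  vmaxset(vmax);  /* release the temporaries on the non-error path */
}
```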
* rbindlist: drop 'uniq'
Turns out it wasn't used.
* rbindlist: use hash_set_shared
Where C-heap allocations also exist, catch and handle potential
allocation failures from the hash table.
* range_str: propagate rehashed marks to caller
* fix stale comment
* NEWS item
Co-authored-by: Benjamin Schwendinger <[email protected]>
Co-authored-by: HughParsonage <[email protected]>
Co-authored-by: Jan Gorecki <[email protected]>
* glci: update status expectations for R-devel
* copyAsGrowable: use the correct number of arguments
---------
Co-authored-by: Benjamin Schwendinger <[email protected]>
Co-authored-by: Benjamin Schwendinger <[email protected]>
Co-authored-by: HughParsonage <[email protected]>
Co-authored-by: Jan Gorecki <[email protected]>
.gitlab-ci.yml:

-Rscript -e 'l=tail(readLines("data.table.Rcheck/00check.log"), 1L); notes<-"Status: 2 NOTEs"; if (!identical(l, notes)) stop("Last line of ", shQuote("00check.log"), " is not ", shQuote(notes), " (non-API calls, V8 package) but ", shQuote(l)) else q("no")'
+Rscript -e 'l=tail(readLines("data.table.Rcheck/00check.log"), 1L); notes<-"Status: 1 NOTE"; if (!identical(l, notes)) stop("Last line of ", shQuote("00check.log"), " is not ", shQuote(notes), " (V8 package) but ", shQuote(l)) else q("no")'
 
 ## R-devel on Linux clang
 # R compiled with clang, flags removed: -flto=auto -fopenmp
@@ -206,7 +206,7 @@ test-lin-dev-clang-cran:
 - R CMD check --as-cran $(ls -1t data.table_*.tar.gz | head -n 1)
-Rscript -e 'l=tail(readLines("data.table.Rcheck/00check.log"), 1L); notes<-"Status: 2 NOTEs"; if (!identical(l, notes)) stop("Last line of ", shQuote("00check.log"), " is not ", shQuote(notes), " (non-API calls, V8 package) but ", shQuote(l)) else q("no")'
+Rscript -e 'l=tail(readLines("data.table.Rcheck/00check.log"), 1L); notes<-"Status: 1 NOTE"; if (!identical(l, notes)) stop("Last line of ", shQuote("00check.log"), " is not ", shQuote(notes), " (V8 package) but ", shQuote(l)) else q("no")'
NEWS.md (2 additions, 0 deletions):
@@ -379,6 +379,8 @@ See [#2611](https://github.com/Rdatatable/data.table/issues/2611) for details. T
 
 8. Retain important information in the error message about the source of the error when `i=` fails, e.g. pointing to `charToDate()` failing in `DT[date_col == "20250101"]`, [#7444](https://github.com/Rdatatable/data.table/issues/7444). Thanks @jan-swissre for the report and @MichaelChirico for the fix.
 
+9. Internal use of declared non-API R functions `SETLENGTH`, `TRUELENGTH`, `SET_TRUELENGTH`, and `SET_GROWABLE_BIT` has been eliminated. Most usages have been migrated to R's experimental resizable vectors API (thanks to @ltierney, introduced in R 4.6.0, backported for older R versions), [#7451](https://github.com/Rdatatable/data.table/pull/7451). Uses of `TRUELENGTH` for marking seen items during grouping and binding operations (aka free hash table trick) have been replaced with proper hash tables, [#6694](https://github.com/Rdatatable/data.table/pull/6694). The new hash table implementation uses linear probing with power of 2 tables and automatic resizing. Additionally, `chmatch()` now hashes the needle (`x`) instead of the haystack (`table`) when `length(table) >> length(x)`, significantly improving performance for lookups into large tables. We've benchmarked the refactored code and find the performance satisfactory, but please do report any edge case performance regressions we may have missed. Thanks to @aitap, @ben-schwen, @jangorecki and @HughParsonage for implementation and reviews.
+
 ## data.table [v1.17.8](https://github.com/Rdatatable/data.table/milestone/41) (6 July 2025)
 
 1. Internal functions used to signal errors are now marked as non-returning, silencing a compiler warning about potentially unchecked allocation failure. Thanks to Prof. Brian D. Ripley for the report and @aitap for the fix, [#7070](https://github.com/Rdatatable/data.table/pull/7070).
R/data.table.R:

 # TODO add: if (max(len__)==nrow) stopf("There is no need to deep copy x in this case")
 # TODO move down to dogroup.c, too.
-SDenv$.SDall = .Call(CsubsetDT, x, if (length(len__)) seq_len(max(len__)) else 0L, xcols)  # must be deep copy when largest group is a subset
+SDenv$.SDall = .Call(CcopyAsGrowable, .Call(CsubsetDT, x, if (length(len__)) seq_len(max(len__)) else 0L, xcols))  # must be deep copy when largest group is a subset
 if (!is.data.table(SDenv$.SDall)) setattr(SDenv$.SDall, "class", c("data.table","data.frame"))  # DF |> DT(,.SD[...],by=grp) needs .SD to be data.table, test 2022.012
 if (xdotcols) setattr(SDenv$.SDall, 'names', ansvars[xcolsAns])  # now that we allow 'x.' prefix in 'j', #2313 bug fix - [xcolsAns]
src/assign.c (memrecycle):

-        for (int j=0; j<k; ++j) SET_TRUELENGTH(s, 0); // wipe our negative usage and restore 0
-        savetl_end(); // then restore R's own usage (if any)
-        internal_error(__func__, "levels of target are either not unique or have truelength<0"); // # nocov
-        // # nocov end
-      }
-      SET_TRUELENGTH(s, -k-1);
+      hash_set(marks, s, -k-1);
     }
     int nAdd=0;
     for (int k=0; k<nSourceLevels; ++k) {
       const SEXP s = sourceLevelsD[k];
-      const int tl = TRUELENGTH(s);
+      const int tl = hash_lookup(marks, s, 0);
       if (tl>=0) {
         if (!sourceIsFactor && s==NA_STRING) continue; // don't create NA factor level when assigning character to factor; test 2117
-        if (tl>0) savetl(s);
-        SET_TRUELENGTH(s, -nTargetLevels-(++nAdd));
+        hash_set(marks, s, -nTargetLevels-(++nAdd));
       } // else, when sourceIsString, it's normal for there to be duplicates here
     }
     const int nSource = length(source);
@@ -793,45 +783,36 @@ const char *memrecycle(const SEXP target, const SEXP where, const int start, con
     const int *sourceD = INTEGER(source);
     for (int i=0; i<nSource; ++i) { // convert source integers to refer to target levels
       const int val = sourceD[i];
-      newSourceD[i] = val==NA_INTEGER ? NA_INTEGER : -TRUELENGTH(sourceLevelsD[val-1]); // retains NA factor levels here via TL(NA_STRING); e.g. ordered factor
+      newSourceD[i] = val==NA_INTEGER ? NA_INTEGER : -hash_lookup(marks, sourceLevelsD[val-1], 0); // retains NA factor levels here via TL(NA_STRING); e.g. ordered factor
     }
   } else {
     const SEXP *sourceD = STRING_PTR_RO(source);
     for (int i=0; i<nSource; ++i) { // convert source integers to refer to target levels
src/chmatch.c (chmatchdup):

 unsigned int mapsize = tablelen+nuniq; // lto compilation warning #5760 // +nuniq to store a 0 at the end of each group
-int *counts = calloc(nuniq, sizeof(*counts));
-int *map = calloc(mapsize, sizeof(*map));
-if (!counts || !map) {
-  // # nocov start
-  free(counts); free(map);
-  for (int i=0; i<tablelen; i++) SET_TRUELENGTH(td[i], 0);
-  savetl_end();
-  error(_("Failed to allocate %"PRIu64" bytes working memory in chmatchdup: length(table)=%d length(unique(table))=%d"), ((uint64_t)tablelen*2+nuniq)*sizeof(int), tablelen, nuniq);
-  // # nocov end
-}
-for (int i=0; i<tablelen; ++i) counts[-TRUELENGTH(td[i])-1]++;