The initialization method for the parameters.
Configuration for the dynamic embedding table, including initializer args.
Common choices include "uniform", "normal", etc. Defaults to "uniform".
"""

use_dynamicemb: Optional[bool] = False

## DynamicEmbScoreStrategy

The storage space is limited, but the value range of sparse features is relatively large,
so dynamicemb introduces the concept of a score to perform customized eviction of sparse features within the limited storage space.
dynamicemb provides the following strategies to set the score.

```python
#How to import
...
    CUSTOMIZED = 2
```

Users can specify the `DynamicEmbScoreStrategy` using `score_strategy` in `DynamicEmbTableOptions` per table.
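
For illustration only, the following is a minimal sketch of selecting a score strategy for one table. It assumes `DynamicEmbTableOptions` and `DynamicEmbScoreStrategy` are importable from `dynamicemb` as in the listings above, and that the remaining constructor arguments are filled in elsewhere:

```python
from dynamicemb import DynamicEmbScoreStrategy, DynamicEmbTableOptions

# Sketch: choose the score strategy for one table. All ranks that shard this
# table should use the same score_strategy, since they hold the same table.
table_options = DynamicEmbTableOptions(
    score_strategy=DynamicEmbScoreStrategy.CUSTOMIZED,  # or another enum member
)
```
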
## DynamicEmbPoolingMode

DynamicEmb supports three pooling modes that determine how embedding lookups are aggregated. These modes correspond to how `EmbeddingCollection` (sequence) and `EmbeddingBagCollection` (pooled) work in TorchREC.

All pooling modes use fused CUDA kernels for both forward and backward passes. Tables with different embedding dimensions (mixed-D) are fully supported in `SUM` and `MEAN` modes.

```python
#How to import
from dynamicemb import DynamicEmbPoolingMode

#API arguments
class DynamicEmbPoolingMode(enum.IntEnum):
    """
    Enumeration for pooling modes in dynamic embedding lookup.

    Attributes
    ----------
    SUM : int
        Sum pooling. For each sample, the embeddings of all indices in the bag
        are summed. Output shape: (batch_size, total_D) where total_D is the
        sum of embedding dimensions across all features.
    MEAN : int
        Mean pooling. For each sample, the embeddings of all indices in the bag
        are averaged. Output shape: same as SUM.
    NONE : int
        No pooling (sequence mode). Each index produces its own embedding row.
        Output shape: (total_indices, D).
    """
    SUM = 0
    MEAN = 1
    NONE = 2
```
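
To make the output shapes concrete, here is a minimal pure-PyTorch sketch of what the three modes compute for a single feature. It is only a reference for the semantics, not the fused CUDA path dynamicemb actually uses, and the offsets layout is just an illustrative jagged-batch convention.

```python
import torch

# Reference semantics only (not the dynamicemb kernels).
embeddings = torch.arange(12.0).reshape(6, 2)  # 6 looked-up rows, D = 2
offsets = [0, 2, 5, 6]                         # 3 bags: [0,2), [2,5), [5,6)

# NONE (sequence mode): every index keeps its own row -> (total_indices, D)
seq_out = embeddings

# SUM pooling: reduce each bag -> (batch_size, D)
sum_out = torch.stack([embeddings[s:e].sum(dim=0)
                       for s, e in zip(offsets[:-1], offsets[1:])])

# MEAN pooling: average each bag -> same shape as SUM
mean_out = torch.stack([embeddings[s:e].mean(dim=0)
                        for s, e in zip(offsets[:-1], offsets[1:])])

print(seq_out.shape, sum_out.shape, mean_out.shape)
# torch.Size([6, 2]) torch.Size([3, 2]) torch.Size([3, 2])
```

With multiple features of different dimensions (mixed-D), the pooled outputs per feature are concatenated along the last axis, giving the `(batch_size, total_D)` shape described in the docstring above.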

## DynamicEmbTableOptions
caching: bool
    Flag to indicate that the dynamic embedding table is working in caching mode; defaults to `False`.
    When the device memory on a single GPU is insufficient to accommodate a single shard of the dynamic embedding table,
    dynamicemb supports the mixed use of device memory and host memory (pinned memory).
    But by default, the values of the entire table are concatenated across device memory and host memory.
    This means that the storage location of one embedding is determined by `hash_function(key)`, and mapping to device memory will bring better lookup performance.
    However, sparse features in training often exhibit temporal locality.
    In order to store hot keys in device memory, dynamicemb creates two table instances,
    whose values are stored in device memory and host memory respectively, and stores hot keys in the GPU table with priority.
    If the GPU table is full, the evicted keys will be inserted into the host table.
    If the host table is also full, the key will be evicted (all eviction is based on the score per key).
    The original intention of eviction is based on this insight: features that only appear once should not occupy memory (even host memory) for a long time.
    In short:
    setting **`caching=True`** will create a GPU table and a host table, and make the GPU table serve as a cache;
    setting **`caching=False`** will create a hybrid table which uses GPU and host memory in a concatenated way to store values.
    All keys and other metadata are always stored on the GPU in both cases.
init_capacity : Optional[int], optional
    The initial capacity of the table. If not set, it defaults to max_capacity after sharding.
    For the multi-GPU scenario of model parallelism, every rank's score_strategy should be the same for one table,
    as they are the same table, just stored on different ranks.
bucket_capacity : int
    Capacity of each bucket in the hash table; the default is 128 (1024 when the table serves as a cache).
    A key will only be mapped to one bucket.
    When the bucket is full, the key with the smallest score in the bucket will be evicted, and its slot will be used to store a new key.
    The larger the bucket capacity, the more accurate the score-based eviction will be, but it will also result in some performance loss.
    When `caching=True`, it decides the table capacity of the GPU table.
external_storage: Storage
    The external storage/ParameterServer, which inherits the `Storage` interface and can be configured per table.
    If not provided, `DynamicEmbeddingTable` will be used as the Storage.
index_type : Optional[torch.dtype], optional
    Index type of sparse features; set to DEFAULT_INDEX_TYPE (torch.int64) by default.

Notes
-----
For detailed descriptions and additional context on each parameter, please refer to the documentation in this repository.
"""

training: bool = True
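
As a hedged illustration of the parameters above (argument names are taken from this listing; required arguments not shown here, such as dimensions, capacities, and initializer args, are omitted), enabling caching mode might look like the following sketch:

```python
import torch
from dynamicemb import DynamicEmbTableOptions

# Sketch only: other constructor arguments (dimensions, capacities,
# initializer args, ...) are omitted.
table_options = DynamicEmbTableOptions(
    caching=True,            # GPU table acts as a cache in front of the host table
    bucket_capacity=1024,    # larger buckets make score-based eviction more accurate
    index_type=torch.int64,  # DEFAULT_INDEX_TYPE
    training=True,           # allocate optimizer states when the table is built
)
```
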
Once the model containing `EmbeddingCollection` is built and initialized through `DistributedModelParallel`, it can be trained and evaluated on each GPU like a single GPU, with torchrec completing the communication between different GPUs.

The switching between training and evaluation modes should be consistent with `nn.Module`, while `training` in [DynamicEmbTableOptions](./dynamicemb/dynamicemb_config.py) is used to guide whether to allocate memory for optimizer states when building the table.

Due to limited resources, the dynamic embedding table does not pre-allocate memory for all keys. If a key appears for the first time during training, it will be initialized immediately during the training process. Please see `initializer_args` and `eval_initializer_args` in `DynamicEmbTableOptions` for more information.

## Caching and prefetch

dynamicemb supports caching hot embeddings in GPU memory, and you can prefetch keys from host to device as in torchrec. Caching and prefetch work for both sequence mode (`NONE`) and pooling modes (`SUM`/`MEAN`). See `test_prefetch_flush_in_cache` in [test prefetch](./test/test_batched_dynamic_embedding_tables_v2.py) for usage examples.

## External storage

dynamicemb supports external storage as long as the `external_storage` in `DynamicEmbTableOptions` inherits the `Storage` interface under [types.py](./dynamicemb/types.py).
Refer to the demo `PyDictStorage` in the [unit test](./test/test_batched_dynamic_embedding_tables_v2.py) for detailed usage.

## Table expansion

Dump/Load and incremental dump are different from a general module in PyTorch, because …

So dynamicemb provides dedicated interfaces to load/save model states, and provides conditional dump to support online training.

Please see `DynamicEmbDump`, `DynamicEmbLoad`, and `incremental_dump` in the [APIs Doc](./DynamicEmb_APIs.md) for more information.

`corelib/dynamicemb/README.md` (10 additions, 9 deletions):

# DynamicEmb

DynamicEmb is a Python package that provides model-parallel dynamic embedding tables and embedding lookup functionalities for TorchREC, specifically targeting the sparse training aspects of recommendation systems. DynamicEmb uses a GPU-optimized scored hash table backend to store key-value (feature-embedding) pairs in the high-bandwidth memory (HBM) of GPUs as well as in host memory.

The lookup kernel algorithms implemented in DynamicEmb primarily leverage portions of the algorithms from the [EMBark](https://dl.acm.org/doi/abs/10.1145/3640457.3688111) paper (Embedding Optimization for Training Large-scale Deep Learning Recommendation Systems with EMBark).

- Support for creating dynamic embedding tables within `EmbeddingBagCollection` and `EmbeddingCollection` in TorchREC, allowing for embedding storage and lookup, and enabling coexistence with native Torch embedding tables within Torch models.

- **Pooling Mode Support**: DynamicEmb supports `SUM`, `MEAN`, and `NONE` (sequence) pooling modes with fused CUDA kernels for both forward and backward passes. Tables with different embedding dimensions (mixed-D) are fully supported in pooling mode.

- Support for optimizer types: `EXACT_SGD`, `ADAM`, `EXACT_ADAGRAD`, `EXACT_ROWWISE_ADAGRAD`.

- Support for automatic parallel `dump`/`load` of embedding weights in dynamic embedding tables.

3. The allocated memory for dynamic embedding tables may differ slightly from the specified `num_embeddings`, because each dynamic embedding table must set its capacity to a power of 2. This is calculated automatically by the code, so please ensure that `num_embeddings` is aligned to a power of 2 when requesting a table.
4. The lookup process for each dynamic embedding table incurs additional overhead from unique or radix-sort operations. Therefore, if you request a large number of small dynamic embedding tables for lookup, the performance will be poor. Since the lookup range of dynamic embedding tables is particularly large (the entire range of `int64_t`), it is recommended to create one large embedding table and perform a fused lookup for multiple features (see the sketch after this list).
5. Although dynamic embedding tables can be trained together with TorchREC tables, they cannot be fused together for embedding lookup. Therefore, it is recommended to select dynamic embedding tables for all model-parallel tables during training.
6. DynamicEmb supports training with TorchREC's `EmbeddingBagCollection` (pooling mode: SUM/MEAN) and `EmbeddingCollection` (sequence mode). Both modes use fused CUDA kernels for embedding lookup and gradient reduction. Tables with different embedding dimensions are supported in pooling mode.
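
Building on notes 3 and 4, here is a hedged sketch using plain TorchREC configs (no dynamicemb-specific arguments shown; feature names and sizes are made up for illustration) of rounding the requested row count up to a power of two and mapping several sparse features onto one large table so their lookups can be fused:

```python
import torch
from torchrec.modules.embedding_configs import EmbeddingBagConfig
from torchrec.modules.embedding_modules import EmbeddingBagCollection

def next_pow2(n: int) -> int:
    """Round n up to the next power of two, as note 3 recommends."""
    return 1 << (n - 1).bit_length()

# One large table shared by several sparse features (note 4); the
# dynamicemb-specific table options are configured elsewhere.
fused_table = EmbeddingBagConfig(
    name="fused_table",
    embedding_dim=128,
    num_embeddings=next_pow2(6_000_000),  # -> 8388608 = 2**23
    feature_names=["user_id", "item_id", "combo_id"],
)
ebc = EmbeddingBagCollection(tables=[fused_table], device=torch.device("meta"))
```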

### DynamicEmb Insertion Behavior Checking Modes

# Configure the DynamicEmbTableOptions with safe check mode enabled
table_options = DynamicEmbTableOptions(

## Future Plans

1. Support the latest version of TorchREC and continuously follow TorchREC's version updates.
2. Support the separation of backward and optimizer update (required by certain large language model frameworks like Megatron), to better support large-scale GR training.
3. Add more shard types for dynamic embedding tables, including `table-wise`, `table-row-wise` and `column-wise`.

## Acknowledgements

We would like to thank the Meta team, and especially [Huanyu He](https://github.com/TroyGarden), for their support in [TorchRec](https://github.com/pytorch/torchrec).

We also acknowledge the [HierarchicalKV](https://github.com/NVIDIA-Merlin/HierarchicalKV) project, which inspired the scored hash table design used in DynamicEmb.