Commit 02ce8a2

shijieliu and jiashuy authored
Fea unify pooling to dynamic embedding table (#301)
* DynamicEmbeddingFunctionV2 support pooling and mixed_D pooling; clean hkv & related code; update doc; minor fix
* clean
* fix
* minor fix
* minor fix
* Update benchmark results
* remove device_sm_count in python
* fix pooling mode enum
* fix
* add check
* fix ci
* fix ci
* clean v2 suffix

---------

Co-authored-by: Jiashu Yao <jiashu.yao.cn@gmail.com>
1 parent 0010954 commit 02ce8a2

62 files changed: +1063 / -7257 lines

Note: large commits have some content hidden by default, so only a subset of the 62 changed files appears below.

.gitmodules

Lines changed: 0 additions & 3 deletions
@@ -1,6 +1,3 @@
-[submodule "third_party/HierarchicalKV"]
-    path = third_party/HierarchicalKV
-    url = https://github.com/NVIDIA-Merlin/HierarchicalKV.git
 [submodule "third_party/cutlass"]
     path = third_party/cutlass
     url = https://github.com/NVIDIA/cutlass.git

corelib/dynamicemb/DynamicEmb_APIs.md

Lines changed: 54 additions & 21 deletions
@@ -11,6 +11,7 @@ This document consists of two parts, one is the introduction to the API, which c
 - [DynamicEmbCheckMode](#dynamicembcheckmode)
 - [DynamicEmbInitializerMode](#dynamicembinitializermode)
 - [DynamicEmbInitializerArgs](#dynamicembinitializerargs)
+- [DynamicEmbPoolingMode](#dynamicembpoolingmode)
 - [DynamicEmbTableOptions](#dynamicembtableoptions)
 - [DynamicEmbDump](#dynamicembdump)
 - [DynamicEmbLoad](#dynamicembload)
@@ -39,7 +40,7 @@ The `DynamicEmbParameterConstraints` function inherits from TorchREC's `Paramete
     use_dynamicemb : Optional[bool]
         A flag indicating whether to use DynamicEmb storage. Defaults to False.
     dynamicemb_options : Optional[DynamicEmbTableOptions]
-        Including HKV Configs and Initializer Args. The initialization method for the parameters.
+        Configuration for the dynamic embedding table, including initializer args.
         Common choices include "uniform", "normal", etc. Defaults to "uniform".
     """
     use_dynamicemb: Optional[bool] = False
@@ -273,8 +274,8 @@ Parameters for each random initialization method in DynamicEmbInitializerMode.
 ## DynamicEmbScoreStrategy
 
 The storage space is limited, but the value range of sparse features is relatively large,
-so HKV introduces the concept of score to perform customized evcition of sparse features within the limited storage space.
-Based on the score of HKV, dynamicemb provides the following strategies to set the score.
+so dynamicemb introduces the concept of score to perform customized eviction of sparse features within the limited storage space.
+dynamicemb provides the following strategies to set the score.
 
 ```python
 #How to import
@@ -309,7 +310,40 @@ Based on the score of HKV, dynamicemb provides the following strategies to set t
     CUSTOMIZED = 2
 ```
 
-Users can specify the `DynamicEmbScoreStrategy` using `score_strategy` in `DynamicEmbTableOptions` per table.
+Users can specify the `DynamicEmbScoreStrategy` using `score_strategy` in `DynamicEmbTableOptions` per table.
+
+## DynamicEmbPoolingMode
+
+DynamicEmb supports three pooling modes that determine how embedding lookups are aggregated. These modes correspond to how `EmbeddingCollection` (sequence) and `EmbeddingBagCollection` (pooled) work in TorchREC.
+
+All pooling modes use fused CUDA kernels for both forward and backward passes. Tables with different embedding dimensions (mixed-D) are fully supported in `SUM` and `MEAN` modes.
+
+```python
+#How to import
+from dynamicemb import DynamicEmbPoolingMode
+
+#API arguments
+class DynamicEmbPoolingMode(enum.IntEnum):
+    """
+    Enumeration for pooling modes in dynamic embedding lookup.
+
+    Attributes
+    ----------
+    SUM : int
+        Sum pooling. For each sample, the embeddings of all indices in the bag
+        are summed. Output shape: (batch_size, total_D) where total_D is the
+        sum of embedding dimensions across all features.
+    MEAN : int
+        Mean pooling. For each sample, the embeddings of all indices in the bag
+        are averaged. Output shape: same as SUM.
+    NONE : int
+        No pooling (sequence mode). Each index produces its own embedding row.
+        Output shape: (total_indices, D).
+    """
+    SUM = 0
+    MEAN = 1
+    NONE = 2
+```
 
 ## DynamicEmbTableOptions
 
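Editor's note: the shapes described in the enum above can be reproduced with plain PyTorch ops. The snippet below is a standalone illustration of the `SUM`/`MEAN`/`NONE` semantics only; it does not call any dynamicemb API, and the single-feature layout (one `lengths` tensor per batch) is an assumption made for brevity. With multiple features, `total_D` is the sum of the per-feature embedding dimensions.

```python
# Standalone illustration of the three pooling semantics, plain PyTorch only.
import torch

D = 4                                # embedding dim of a single feature
batch_size = 3
lengths = torch.tensor([2, 1, 3])    # ragged bag sizes per sample
total_indices = int(lengths.sum())   # 6
emb = torch.randn(total_indices, D)  # one looked-up embedding row per index

# NONE (sequence mode): every index keeps its own row -> (total_indices, D)
out_none = emb

# SUM: reduce the rows of each bag -> (batch_size, D); with several features the
# per-feature outputs are concatenated, giving (batch_size, total_D)
segment_ids = torch.repeat_interleave(torch.arange(batch_size), lengths)
out_sum = torch.zeros(batch_size, D).index_add_(0, segment_ids, emb)

# MEAN: SUM divided by the bag length (clamped to avoid division by zero)
out_mean = out_sum / lengths.clamp(min=1).unsqueeze(1).to(out_sum.dtype)

print(out_none.shape, out_sum.shape, out_mean.shape)
# torch.Size([6, 4]) torch.Size([3, 4]) torch.Size([3, 4])
```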
@@ -345,18 +379,18 @@ Dynamic embedding table parameter class, used to configure the parameters for ea
     caching: bool
         Flag to indicate dynamic embedding tables is working on caching mode, default to `False`.
         When the device memory on a single GPU is insufficient to accommodate a single shard of the dynamic embedding table,
-        HKV supports the mixed use of device memory and host memory(pinned memory).
+        dynamicemb supports the mixed use of device memory and host memory(pinned memory).
         But by default, the values of the entire table are concatenated with device memory and host memory.
-        This means that the storage location of one embeddng is determined by `hash_function(key)`, and mapping to device memory will bring better lookup performance.
+        This means that the storage location of one embedding is determined by `hash_function(key)`, and mapping to device memory will bring better lookup performance.
         However, sparse features in training are often with temporal locality.
-        In order to store hot keys in device memory, dynamicemb creates two HKV instances,
-        whose values are stored in device memory and memory respectively, and store hot keys on the GPU table priorily.
-        If the GPU table is full, the evicted keys will be inserted into the CPU table.
-        If the CPU table is also full, the key granularity will be evicted(all the eviction is based on the score per key).
+        In order to store hot keys in device memory, dynamicemb creates two table instances,
+        whose values are stored in device memory and host memory respectively, and store hot keys on the GPU table priorily.
+        If the GPU table is full, the evicted keys will be inserted into the host table.
+        If the host table is also full, the key will be evicted(all the eviction is based on the score per key).
         The original intention of eviction is based on this insight: features that only appear once should not occupy memory(even host memory) for a long time.
         In short:
-        set **`caching=True`** will create a GPU table and a CPU table, and make GPU table serves as a cache;
-        set **`caching=False`** will create a hybrid table which use GPU and CPU memory in a concated way to store value.
+        set **`caching=True`** will create a GPU table and a host table, and make GPU table serves as a cache;
+        set **`caching=False`** will create a hybrid table which use GPU and host memory in a concatenated way to store value.
         All keys and other meta data are always stored on GPU for both cases.
     init_capacity : Optional[int], optional
         The initial capacity of the table. If not set, it defaults to max_capacity after sharding.
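Editor's note: to make the two layouts described in the `caching` docstring concrete, here is a minimal, hedged sketch of the corresponding table options. Only fields that appear in this diff (`index_type`, `caching`, `bucket_capacity`) are set; every other constructor argument (capacities, embedding dims, initializers, and so on) is omitted and should be taken from `dynamicemb_config.py`, so treat this as a configuration sketch rather than a complete setup.

```python
import torch
from dynamicemb import DynamicEmbTableOptions

# caching=False: one hybrid table whose values are concatenated across GPU and host memory.
hybrid_options = DynamicEmbTableOptions(
    index_type=torch.int64,
    caching=False,
    bucket_capacity=128,   # documented default for the non-caching layout
    # ... capacity, dims, initializer, and other required fields omitted ...
)

# caching=True: a GPU table acting as a cache in front of a host table; hot keys stay
# on the GPU, evicted keys spill to the host table, and are finally dropped by score.
cached_options = DynamicEmbTableOptions(
    index_type=torch.int64,
    caching=True,
    bucket_capacity=1024,  # documented default when the GPU table serves as a cache
    # ... capacity, dims, initializer, and other required fields omitted ...
)
```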
@@ -375,7 +409,7 @@ Dynamic embedding table parameter class, used to configure the parameters for ea
         For the multi-GPUs scenario of model parallelism, every rank's score_strategy should keep the same for one table,
         as they are the same table, but stored on different ranks.
     bucket_capacity : int
-        Capacity of each bucket in HKV, and default is 128(using 1024 when HKV serves as cache).
+        Capacity of each bucket in the hash table, and default is 128 (using 1024 when the table serves as cache).
         A key will only be mapped to one bucket.
         When the bucket is full, the key with the smallest score in the bucket will be evicted, and its slot will be used to store a new key.
         The larger the bucket capacity, the more accurate the score based eviction will be, but it will also result in performance loss.
@@ -390,7 +424,7 @@ Dynamic embedding table parameter class, used to configure the parameters for ea
         When `caching=True`, it decides the table capacity of the GPU table.
     external_storage: Storage
         The external storage/ParamterServer which inherits the interface of Storage, and can be configured per table.
-        If not provided, will using KeyValueTable as the Storage.
+        If not provided, will using DynamicEmbeddingTable as the Storage.
     index_type : Optional[torch.dtype], optional
         Index type of sparse features, will be set to DEFAULT_INDEX_TYPE(torch.int64) by default.
     admit_strategy : Optional[AdmissionStrategy], optional
@@ -405,8 +439,7 @@ Dynamic embedding table parameter class, used to configure the parameters for ea
 
     Notes
     -----
-    For detailed descriptions and additional context on each parameter, please refer to the documentation at
-    https://github.com/NVIDIA-Merlin/HierarchicalKV.
+    For detailed descriptions and additional context on each parameter, please refer to the documentation in this repository.
     """
 
     training: bool = True
@@ -762,7 +795,7 @@ class FrequencyAdmissionStrategy(AdmissionStrategy):
 
 Once the model containing `EmbeddingCollection` is built and initialized through `DistributedModelParallel`, it can be trained and evaluated on each GPU like a single GPU, with torchrec completing communication between different GPUs.
 
-The switching between training and evaluation modes should be consistent with `nn.Module`, while `training` in [DynamicEmbTableOptions](../dynamicemb/dynamicemb_config.py) is used to guide whether to allocate memory to optimizer states when builds the table.
+The switching between training and evaluation modes should be consistent with `nn.Module`, while `training` in [DynamicEmbTableOptions](./dynamicemb/dynamicemb_config.py) is used to guide whether to allocate memory to optimizer states when builds the table.
 
 Due to limited resources, the dynamic embedding table does not pre allocate memory for all keys. If a key appears for the first time during training, it will be initialized immediately during the training process. Please see `initializer_args` and `eval_initializer_args` in `DynamicEmbTableOptions` for more information.
 
@@ -772,12 +805,12 @@ The size of the table is finite, but the set of keys during training may be infi
 
 ## Caching and prefetch
 
-dynamicemb supports caching hot embeddings on GPU memory, and you can prefetch keys from host to device like torchrec(document and example is waiting to append, and now please see `test_prefetch_flush_in_cache` in [test prefetch](./test/test_batched_dynamic_embedding_tables_v2.py)).
+dynamicemb supports caching hot embeddings on GPU memory, and you can prefetch keys from host to device like torchrec. Caching and prefetch work for both sequence mode (`NONE`) and pooling modes (`SUM`/`MEAN`). See `test_prefetch_flush_in_cache` in [test prefetch](./test/test_batched_dynamic_embedding_tables_v2.py) for usage examples.
 
 ## External storage
 
-dynamicemb supports external storage once `external_storage` in `DynamicEmbTableOptions` inherits the `Storage` interface under [types.py](../dynamicemb/types.py).
-Refer to demo `PyDictStorage` in [uint test](../test/test_batched_dynamic_embedding_tables_v2.py) for detailed usage.
+dynamicemb supports external storage once `external_storage` in `DynamicEmbTableOptions` inherits the `Storage` interface under [types.py](./dynamicemb/types.py).
+Refer to demo `PyDictStorage` in [unit test](./test/test_batched_dynamic_embedding_tables_v2.py) for detailed usage.
 
 
 ## Table expansion
@@ -791,7 +824,7 @@ Dump/Load and incremental dump is different from general module in PyTorch, beca
 
 So dynamicemb provides dedicated interface to load/save models' states, and provide conditional dump to support online training.
 
-Please see `DynamicEmbDump`, `DynamicEmbLoad`, `incremental_dump` in [APIs Doc](../DynamicEmb_APIs.md) for more information.
+Please see `DynamicEmbDump`, `DynamicEmbLoad`, `incremental_dump` in [APIs Doc](./DynamicEmb_APIs.md) for more information.
 
 ## Deterministic mode
 
corelib/dynamicemb/README.md

Lines changed: 10 additions & 9 deletions
@@ -1,6 +1,6 @@
 # DynamicEmb
 
-DynamicEmb is a Python package that provides model-parallel dynamic embedding tables and embedding lookup functionalities for TorchREC, specifically targeting the sparse training aspects of recommendation systems. Currently, DynamicEmb utilizes the [HierarchicalKV](https://github.com/NVIDIA-Merlin/HierarchicalKV) hash table backend, which is designed to store key-value (feature-embedding) pairs in the high-bandwidth memory (HBM) of GPUs as well as in host memory.
+DynamicEmb is a Python package that provides model-parallel dynamic embedding tables and embedding lookup functionalities for TorchREC, specifically targeting the sparse training aspects of recommendation systems. DynamicEmb uses a GPU-optimized scored hash table backend to store key-value (feature-embedding) pairs in the high-bandwidth memory (HBM) of GPUs as well as in host memory.
 
 The lookup kernel algorithms implemented in DynamicEmb primarily leverage portions of the algorithms from the [EMBark](https://dl.acm.org/doi/abs/10.1145/3640457.3688111) paper (Embedding Optimization for Training Large-scale Deep Learning Recommendation Systems with EMBark).
 
@@ -29,6 +29,8 @@ The lookup kernel algorithms implemented in DynamicEmb primarily leverage portio
 
 - Support for creating dynamic embedding tables within `EmbeddingBagCollection` and `EmbeddingCollection` in TorchREC, allowing for embedding storage and lookup, and enabling coexistence with native Torch embedding tables within Torch models.
 
+- **Pooling Mode Support**: DynamicEmb supports `SUM`, `MEAN`, and `NONE` (sequence) pooling modes with fused CUDA kernels for both forward and backward passes. Tables with different embedding dimensions (mixed-D) are fully supported in pooling mode.
+
 - Support for optimizer types: `EXACT_SGD`,`ADAM`,`EXACT_ADAGRAD`,`EXACT_ROWWISE_ADAGRAD`.
 
 - Support for automatically parallel `dump`/`load` of embedding weights in dynamic embedding tables.
@@ -93,7 +95,7 @@ Regarding how to use the DynamicEmb APIs and their parameters, please refer to t
 3. The allocated memory for dynamic embedding tables may have slight differences from the specified `num_embeddings` because each dynamic embedding table must set a capacity as a power of 2. This will be automatically calculated by the code, so please ensure that `num_embeddings` is aligned to a power of 2 when applying.
 4. The lookup process for each dynamic embedding table incurs additional overhead from unique or radix sort operations. Therefore, if you request a large number of small dynamic embedding tables for lookup, the performance will be poor. Since the lookup range of dynamic embedding tables is particularly large (using the entire range of `int64_t`), it is recommended to create one large embedding table and perform a fused lookup for multiple features.
 5. Although dynamic embedding tables can be trained together with TorchREC tables, they cannot be fused together for embedding lookup. Therefore, it is recommended to select dynamic embedding tables for all model-parallel tables during training.
-6. Currently, DynamicEmb supports training with TorchREC's `EmbeddingBagCollection` and `EmbeddingCollection`. However, in version v0.1, the main lookup process of `EmbeddingBagCollection` is implemented using torch's ops, not fuse a lot of cuda kernels, which may result in some performance issues. Will fix this performance problem in future versions.
+6. DynamicEmb supports training with TorchREC's `EmbeddingBagCollection` (pooling mode: SUM/MEAN) and `EmbeddingCollection` (sequence mode). Both modes use fused CUDA kernels for embedding lookup and gradient reduction. Tables with different embedding dimensions are supported in pooling mode.
 
 ### DynamicEmb Insertion Behavior Checking Modes
 
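Editor's note on point 6 above: the pooling mode follows from which TorchREC module a table lives in, with `EmbeddingBagCollection` pooling bags (SUM/MEAN) and `EmbeddingCollection` returning per-index rows (sequence). The sketch below is plain TorchREC on CPU with no dynamicemb or sharding involved; making these tables dynamic additionally requires the DynamicEmb constraints described in DynamicEmb_APIs.md.

```python
import torch
from torchrec.modules.embedding_configs import EmbeddingBagConfig, PoolingType
from torchrec.modules.embedding_modules import EmbeddingBagCollection
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

# One table serving one feature, pooled with SUM (use PoolingType.MEAN for mean pooling).
ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(
            name="t1",
            embedding_dim=8,
            num_embeddings=1024,
            feature_names=["f1"],
            pooling=PoolingType.SUM,
        )
    ],
    device=torch.device("cpu"),
)

# Batch of two samples: bag sizes 2 and 1.
kjt = KeyedJaggedTensor.from_lengths_sync(
    keys=["f1"],
    values=torch.tensor([1, 5, 9]),
    lengths=torch.tensor([2, 1]),
)
pooled = ebc(kjt)          # KeyedTensor of pooled embeddings
print(pooled["f1"].shape)  # torch.Size([2, 8]) -> (batch_size, D)
```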
@@ -106,7 +108,7 @@ To prevent this behavior from affecting training without user awareness, Dynamic
 #### Example
 
 ```python
-from dynamic_emb import DynamicEmbTableOptions, DynamicEmbCheckMode
+from dynamicemb import DynamicEmbTableOptions, DynamicEmbCheckMode
 
 # Configure the DynamicEmbTableOptions with safe check mode enabled
 table_options = DynamicEmbTableOptions(
@@ -126,12 +128,11 @@ To get started with DynamicEmb, we highly recommend checking out the [example.py
 ## Future Plans
 
 1. Support the latest version of TorchREC and continuously follow TorchREC's version updates.
-2. Continuously optimize the performance of embedding lookup and embedding bag lookup.
-3. Support multiple optimizer types, aligning with the optimizer types supported by TorchREC.
-4. Support more configurations for dynamic embedding table eviction mechanisms.
-5. Support the separation of backward and optimizer update (required by certain large language model frameworks like Megatron), to better support large-scale GR training.
-6. Add more shard types for dynamic embedding tables, including `table-wise`, `table-row-wise` and `column-wise`.
+2. Support the separation of backward and optimizer update (required by certain large language model frameworks like Megatron), to better support large-scale GR training.
+3. Add more shard types for dynamic embedding tables, including `table-wise`, `table-row-wise` and `column-wise`.
 
 ## Acknowledgements
 
-We would like to thank the Meta team and specially [Huanyu He](https://github.com/TroyGarden) for their support in [TorchRec](https://github.com/pytorch/torchrec).
+We would like to thank the Meta team and specially [Huanyu He](https://github.com/TroyGarden) for their support in [TorchRec](https://github.com/pytorch/torchrec).
+
+We also acknowledge the [HierarchicalKV](https://github.com/NVIDIA-Merlin/HierarchicalKV) project, which inspired the scored hash table design used in DynamicEmb.

corelib/dynamicemb/benchmark/benchmark_batched_dynamicemb_tables.py

Lines changed: 1 addition & 40 deletions
@@ -16,7 +16,6 @@
 import argparse
 import json
 import os
-from typing import cast
 
 import numpy as np
 import torch
@@ -32,8 +31,6 @@
     EmbOptimType,
 )
 from dynamicemb.batched_dynamicemb_tables import BatchedDynamicEmbeddingTablesV2
-from dynamicemb.key_value_table import KeyValueTable
-from dynamicemb_extensions import DynamicEmbTable, insert_or_assign
 from fbgemm_gpu.runtime_monitor import StdLogStatsReporterConfig
 from fbgemm_gpu.split_embedding_configs import EmbOptimType as OptimType
 from fbgemm_gpu.split_embedding_configs import SparseType
@@ -307,46 +304,10 @@ def generate_sequence_sparse_feature(args, device):
     )
 
 
-class TableShim:
-    def __init__(self, table):
-        if isinstance(table, DynamicEmbTable):
-            self.table = cast(DynamicEmbTable, table)
-        elif isinstance(table, KeyValueTable):
-            self.table = table
-        else:
-            raise ValueError("Not support table type")
-
-    def optim_states_dim(self) -> int:
-        if isinstance(self.table, DynamicEmbTable):
-            return self.table.optstate_dim()
-        else:
-            return self.table.value_dim() - self.table.embedding_dim()
-
-    def init_optim_state(self) -> float:
-        if isinstance(self.table, DynamicEmbTable):
-            return self.table.get_initial_optstate()
-        else:
-            return self.table.init_optimizer_state()
-
-    def insert(
-        self,
-        n,
-        unique_indices,
-        unique_values,
-        scores,
-    ) -> None:
-        if isinstance(self.table, DynamicEmbTable):
-            insert_or_assign(self.table, n, unique_indices, unique_values, scores)
-        else:
-            # self.table.set_score(scores[0].item())
-            self.table.insert(unique_indices, unique_values, scores)
-
-
 def create_dynamic_embedding_tables(args, device):
     table_options = []
     table_num = args.num_embedding_table
     for i in range(table_num):
-        TableModule = BatchedDynamicEmbeddingTablesV2
         table_options.append(
             DynamicEmbTableOptions(
                 index_type=torch.int64,
@@ -365,7 +326,7 @@ def create_dynamic_embedding_tables(args, device):
             )
         )
 
-    var = TableModule(
+    var = BatchedDynamicEmbeddingTablesV2(
         table_options=table_options,
         table_names=[table_idx_to_name(i) for i in range(table_num)],
         use_index_dedup=args.use_index_dedup,