Fea remove dynamicemb hybrid mode support#306
Conversation
afd10a7 to
37fee97
Compare
Greptile OverviewGreptile SummaryThis PR removes the explicit Key ChangesArchitecture Simplification:
Code Simplification:
Test Updates:
ImpactThis refactoring maintains the same functional behavior while significantly simplifying the codebase (247 net line reduction). The memory management logic becomes more intuitive - users simply specify their HBM budget, and the system automatically configures the appropriate memory layout. Confidence Score: 5/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant API as DynamicEmbTableOptions
participant BatchedTables as BatchedDynamicEmbeddingTablesV2
participant Table as DynamicEmbeddingTable
participant Optimizer as Optimizer (Python)
participant OptimizerCU as Optimizer (CUDA)
participant Kernel as CUDA Kernels
Note over API,Kernel: Before: Hybrid Mode with caching parameter
API->>BatchedTables: caching=True/False
alt caching=True
BatchedTables->>Table: Create GPU cache table (dev_table)
BatchedTables->>Table: Create host storage table (uvm_table)
else caching=False
BatchedTables->>Table: Create single hybrid table (dev_table + uvm_table)
end
Optimizer->>OptimizerCU: fused_update(grads, indices, dev_table, uvm_table)
OptimizerCU->>Kernel: launch with split_index (determines dev vs uvm)
Note over API,Kernel: After: Unified table model
API->>BatchedTables: local_hbm_for_values (budget)
alt budget == 0
BatchedTables->>Table: Create host-only table (_table on UVM)
else budget >= total_table_bytes
BatchedTables->>Table: Create GPU-only table (_table on device)
Note over BatchedTables: Same table serves as cache & storage
else budget < total_table_bytes
BatchedTables->>Table: Create GPU cache (_table on device)
BatchedTables->>Table: Create host storage (_table on UVM)
end
Optimizer->>OptimizerCU: fused_update(grads, indices, table)
OptimizerCU->>Kernel: launch with single table pointer (no split_index)
|
|
What’s the motivation for removing the hybrid mode? Would storing the entire table on the host consume more memory than using the hybrid mode? |
|
hi @tiankongdeguiji we’ve added support for caching in HBM and now store the entire table on the host. This approach delivers significantly better performance compared to the hybrid mode. Moreover, with advanced prefetching techniques, we can effectively hide host memory latency and achieve additional performance gains. As for host memory consumption, it may not be a major concern since HBM holds only a small portion of the embeddings in hybrid mode. So we plan to remove hybrid mode and make caching as default when using HBM + Host |
How much performance improvement did you see in your benchmarks? And how do we enable/use the prefetch? |
|
https://github.com/NVIDIA/recsys-examples/tree/main/corelib/dynamicemb/benchmark#test-results-1 you can reference those number for caching performance. As for prefetch, we are working on some optimization and may release official benchmark results on late March. |
Hi @shijieliu, My understanding is that hybrid mode is similar to the UVM kernel in TorchRec, while caching mode is similar to UVMCaching. I suggest we keep both modes and make caching the default. In some scenarios (e.g., online learning), host memory may be insufficient. Scaling out to multiple machines just to meet host memory requirements can be prohibitively expensive. Instead, we could use a planner that takes HBM and host memory limits as inputs and decides when to use hybrid versus caching mode. Also, removing hybrid mode outright could cause existing users to hit OOM after upgrading DynamicEmb. |
|
I see. thanks for letting us know. We will keep hybrid mode and make caching=True as default. FYI @jiashuy |
|
|
||
| auto kernel = update_with_index_kernel<GradType, WeightType, IndexType, | ||
| OptimizerType>; | ||
| kernel<<<grid_size, block_size, smem_size_f(block_size), stream>>>( |
There was a problem hiding this comment.
Hi @shijieliu , here we should configure the external shared memory usage.
you know we already done in the main branch, but the commit has some conflicts with main branch
Description
Checklist