
Fea remove dynamicemb hybrid mode support #306

Closed

shijieliu wants to merge 2 commits into NVIDIA:main from shijieliu:fea-remove_dynamicemb_hybrid_mode_support

Conversation

@shijieliu
Collaborator

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shijieliu shijieliu force-pushed the fea-remove_dynamicemb_hybrid_mode_support branch from afd10a7 to 37fee97 on February 11, 2026 07:48
@greptile-apps

greptile-apps bot commented Feb 11, 2026

Greptile Overview

Greptile Summary

This PR removes the explicit caching parameter from the dynamic embedding table configuration and simplifies the memory management architecture by replacing the dual-table model (dev_table + uvm_table) with a unified single-table design.

Key Changes

Architecture Simplification:

  • Removed the caching boolean flag from DynamicEmbTableOptions that explicitly controlled whether to use cache+storage mode vs hybrid concatenated mode
  • The system now automatically determines the appropriate mode based on comparing local_hbm_for_values (HBM budget) against the total table memory requirements
  • Three memory modes are now automatically selected:
    • budget == 0: Host-only storage (table on UVM)
    • budget >= total_table_bytes: GPU-only (single table serves as both cache and storage)
    • budget < total_table_bytes: Cache+storage mode (GPU cache + host storage)
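
A minimal Python sketch of this selection rule, for illustration only: MemoryMode and select_memory_mode are hypothetical names rather than DynamicEmb's actual API, while local_hbm_for_values and total_table_bytes follow the summary above.

```python
from enum import Enum, auto


class MemoryMode(Enum):
    HOST_ONLY = auto()           # whole table lives in host/UVM memory
    GPU_ONLY = auto()            # one device table serves as both cache and storage
    CACHE_PLUS_STORAGE = auto()  # GPU cache backed by host storage


def select_memory_mode(local_hbm_for_values: int, total_table_bytes: int) -> MemoryMode:
    """Mirror the three branches described in the list above."""
    if local_hbm_for_values == 0:
        return MemoryMode.HOST_ONLY
    if local_hbm_for_values >= total_table_bytes:
        return MemoryMode.GPU_ONLY
    return MemoryMode.CACHE_PLUS_STORAGE
```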

Code Simplification:

  • Python layer: Replaced dev_table/uvm_table dual parameters with single _table in DynamicEmbeddingTable
  • Optimizer interfaces: Changed from fused_update(grads, indices, dev_table, uvm_table) to fused_update(grads, indices, table)
  • C++/CUDA layer: Removed split_index logic that determined whether to access dev_table or uvm_table based on index ranges
  • Function renamings: *_for_combined_table → *_for_table, load_from_combined_table → load_from_table, etc.
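
The interface change can be illustrated with a toy Python sketch; the real fused_update is a C++/CUDA extension op, these dict-backed tables are stand-ins, and the exact split_index routing rule shown is an assumption.

```python
def fused_update_old(grads, indices, dev_table, uvm_table, split_index):
    # Old dual-table form: every index is routed to either the device or the UVM table.
    for g, i in zip(grads, indices):
        table = dev_table if i < split_index else uvm_table  # routing rule assumed
        table[i] = table.get(i, 0.0) - 0.01 * g  # toy SGD step


def fused_update_new(grads, indices, table):
    # New unified form: a single table, no index partitioning.
    for g, i in zip(grads, indices):
        table[i] = table.get(i, 0.0) - 0.01 * g
```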

Test Updates:

  • Removed all caching parameter usage from tests and examples
  • Changed default cache_capacity_ratio from 0.1 to 1.0 when not in caching mode (now means "use full HBM budget")
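
A small worked example of how the ratio maps to an HBM budget, assuming the budget is simply cache_capacity_ratio times the table's total value bytes (names and the exact formula are assumptions, not the test code itself):

```python
def hbm_budget_bytes(num_embeddings: int, embedding_dim: int,
                     bytes_per_value: int = 4,
                     cache_capacity_ratio: float = 1.0) -> int:
    """Assumed relationship: HBM budget = ratio * total bytes of embedding values."""
    total_table_bytes = num_embeddings * embedding_dim * bytes_per_value
    return int(cache_capacity_ratio * total_table_bytes)


# A 1M x 128 fp32 table: ratio 1.0 fits the whole table in HBM (GPU-only mode),
# ratio 0.1 leaves 90% of the values on the host (cache+storage mode).
print(hbm_budget_bytes(1_000_000, 128, cache_capacity_ratio=1.0))  # 512000000
print(hbm_budget_bytes(1_000_000, 128, cache_capacity_ratio=0.1))  # 51200000
```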

Impact

This refactoring maintains the same functional behavior while significantly simplifying the codebase (247 net line reduction). The memory management logic becomes more intuitive: users simply specify their HBM budget, and the system automatically configures the appropriate memory layout.

Confidence Score: 5/5

  • This PR is safe to merge - it's a well-executed refactoring that simplifies the codebase while maintaining functional equivalence
  • The changes demonstrate excellent software engineering practices: systematic removal of an abstraction (the caching boolean) in favor of automatic mode selection based on memory constraints. The refactoring is comprehensive across all layers (Python, C++, CUDA kernels, tests, examples) with consistent naming changes. The logic in batched_dynamicemb_tables.py correctly maps the old behavior to the new model. All test updates are appropriate, and the documentation updates accurately reflect the new mental model.
  • No files require special attention

Important Files Changed

  • corelib/dynamicemb/dynamicemb/dynamicemb_config.py: Removed caching parameter from DynamicEmbTableOptions, updated documentation to reflect unified memory model, added helper functions for memory calculations
  • corelib/dynamicemb/dynamicemb/key_value_table.py: Replaced dual dev_table/uvm_table design with single _table, simplified memory allocation logic, updated all load/store operations to use unified table interface
  • corelib/dynamicemb/dynamicemb/batched_dynamicemb_tables.py: Refactored cache/storage creation logic to automatically determine mode based on HBM budget vs total table size, maintaining same functionality with cleaner logic
  • corelib/dynamicemb/src/optimizer.cu: Removed split_index logic and dual table handling, simplified all optimizer update functions to work with single unified table
  • corelib/dynamicemb/src/dynamic_emb_op.cu: Renamed load_from_combined_table/store_to_combined_table to load_from_table/store_to_table, removed dual table handling and split logic from kernels
  • corelib/dynamicemb/test/unit_tests/test_embedding_dump_load.py: Removed caching parameter, updated HBM budget calculation to use cache_capacity_ratio directly, changed default non-caching ratio from 0.1 to 1.0

Sequence Diagram

sequenceDiagram
    participant API as DynamicEmbTableOptions
    participant BatchedTables as BatchedDynamicEmbeddingTablesV2
    participant Table as DynamicEmbeddingTable
    participant Optimizer as Optimizer (Python)
    participant OptimizerCU as Optimizer (CUDA)
    participant Kernel as CUDA Kernels

    Note over API,Kernel: Before: Hybrid Mode with caching parameter
    API->>BatchedTables: caching=True/False
    alt caching=True
        BatchedTables->>Table: Create GPU cache table (dev_table)
        BatchedTables->>Table: Create host storage table (uvm_table)
    else caching=False
        BatchedTables->>Table: Create single hybrid table (dev_table + uvm_table)
    end
    Optimizer->>OptimizerCU: fused_update(grads, indices, dev_table, uvm_table)
    OptimizerCU->>Kernel: launch with split_index (determines dev vs uvm)
    
    Note over API,Kernel: After: Unified table model
    API->>BatchedTables: local_hbm_for_values (budget)
    alt budget == 0
        BatchedTables->>Table: Create host-only table (_table on UVM)
    else budget >= total_table_bytes
        BatchedTables->>Table: Create GPU-only table (_table on device)
        Note over BatchedTables: Same table serves as cache & storage
    else budget < total_table_bytes
        BatchedTables->>Table: Create GPU cache (_table on device)
        BatchedTables->>Table: Create host storage (_table on UVM)
    end
    Optimizer->>OptimizerCU: fused_update(grads, indices, table)
    OptimizerCU->>Kernel: launch with single table pointer (no split_index)


@greptile-apps greptile-apps bot left a comment


20 files reviewed, no comments


@tiankongdeguiji

What’s the motivation for removing the hybrid mode? Would storing the entire table on the host consume more memory than using the hybrid mode?

@shijieliu
Collaborator Author

Hi @tiankongdeguiji, we’ve added support for caching in HBM and now store the entire table on the host. This approach delivers significantly better performance compared to the hybrid mode. Moreover, with advanced prefetching techniques we can effectively hide host memory latency and achieve additional performance gains. As for host memory consumption, it may not be a major concern, since HBM holds only a small portion of the embeddings in hybrid mode anyway. So we plan to remove hybrid mode and make caching the default when using HBM + host memory.

@tiankongdeguiji


How much performance improvement did you see in your benchmarks? And how do we enable and use prefetching?

@shijieliu
Collaborator Author

You can refer to the numbers at https://github.com/NVIDIA/recsys-examples/tree/main/corelib/dynamicemb/benchmark#test-results-1 for caching performance. As for prefetch, we are working on some optimizations and may release official benchmark results in late March.

@tiankongdeguiji


Hi @shijieliu, my understanding is that hybrid mode is similar to the UVM kernel in TorchRec, while caching mode is similar to UVMCaching. I suggest we keep both modes and make caching the default.

In some scenarios (e.g., online learning), host memory may be insufficient. Scaling out to multiple machines just to meet host memory requirements can be prohibitively expensive. Instead, we could use a planner that takes HBM and host memory limits as inputs and decides when to use hybrid versus caching mode.

Also, removing hybrid mode outright could cause existing users to hit OOM after upgrading DynamicEmb.
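
A minimal sketch of the planner rule proposed above; the function name, mode strings, and thresholds are illustrative assumptions, not anything that exists in DynamicEmb or TorchRec today.

```python
def plan_table_mode(total_table_bytes: int, hbm_limit: int, host_limit: int) -> str:
    """Pick a memory layout from the available HBM and host budgets."""
    if total_table_bytes <= hbm_limit:
        return "gpu_only"   # whole table fits in HBM
    if total_table_bytes <= host_limit:
        return "caching"    # full table on host, hot rows cached in HBM
    if total_table_bytes <= hbm_limit + host_limit:
        return "hybrid"     # table split across HBM and host memory
    raise MemoryError("table does not fit in HBM + host memory")
```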

@shijieliu
Collaborator Author

I see, thanks for letting us know. We will keep hybrid mode and make caching=True the default. FYI @jiashuy

@shijieliu shijieliu closed this Feb 13, 2026

auto kernel = update_with_index_kernel<GradType, WeightType, IndexType,
OptimizerType>;
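// Note (review suggestion, sketch only): if smem_size_f(block_size) can exceed the default
// 48 KB dynamic shared-memory limit, the kernel attribute needs to be raised before launch, e.g.
// cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size_f(block_size));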
kernel<<<grid_size, block_size, smem_size_f(block_size), stream>>>(
Collaborator

@jiashuy jiashuy Feb 13, 2026


Hi @shijieliu, here we should configure the external shared memory usage. We have already done this on the main branch, but this commit has some conflicts with main.

