
Fea remove dynamicemb hybrid mode support #306

Closed

shijieliu wants to merge 2 commits into NVIDIA:main from shijieliu:fea-remove_dynamicemb_hybrid_mode_support

Conversation

@shijieliu
Collaborator

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shijieliu shijieliu force-pushed the fea-remove_dynamicemb_hybrid_mode_support branch from afd10a7 to 37fee97 on February 11, 2026 07:48
@greptile-apps

greptile-apps bot commented Feb 11, 2026

Greptile Overview

Greptile Summary

This PR removes the explicit caching parameter from the dynamic embedding table configuration and simplifies the memory management architecture by replacing the dual-table model (dev_table + uvm_table) with a unified single-table design.

Key Changes

Architecture Simplification:

  • Removed the caching boolean flag from DynamicEmbTableOptions that explicitly controlled whether to use cache+storage mode vs hybrid concatenated mode
  • The system now automatically determines the appropriate mode based on comparing local_hbm_for_values (HBM budget) against the total table memory requirements
  • Three memory modes are now automatically selected:
    • budget == 0: Host-only storage (table on UVM)
    • budget >= total_table_bytes: GPU-only (single table serves as both cache and storage)
    • budget < total_table_bytes: Cache+storage mode (GPU cache + host storage)
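
A minimal Python sketch of this selection rule, for illustration only: MemoryMode and select_memory_mode are hypothetical names rather than DynamicEmb's actual API, while local_hbm_for_values and total_table_bytes follow the summary above.

```python
from enum import Enum, auto


class MemoryMode(Enum):
    HOST_ONLY = auto()           # whole table lives in host/UVM memory
    GPU_ONLY = auto()            # one device table serves as both cache and storage
    CACHE_PLUS_STORAGE = auto()  # GPU cache backed by host storage


def select_memory_mode(local_hbm_for_values: int, total_table_bytes: int) -> MemoryMode:
    """Mirror the three branches described in the list above."""
    if local_hbm_for_values == 0:
        return MemoryMode.HOST_ONLY
    if local_hbm_for_values >= total_table_bytes:
        return MemoryMode.GPU_ONLY
    return MemoryMode.CACHE_PLUS_STORAGE
```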

Code Simplification:

  • Python layer: Replaced dev_table/uvm_table dual parameters with single _table in DynamicEmbeddingTable
  • Optimizer interfaces: Changed from fused_update(grads, indices, dev_table, uvm_table) to fused_update(grads, indices, table)
  • C++/CUDA layer: Removed split_index logic that determined whether to access dev_table or uvm_table based on index ranges
  • Function renamings: *_for_combined_table → *_for_table, load_from_combined_table → load_from_table, etc.
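
The interface change can be illustrated with a toy Python sketch; the real fused_update is a C++/CUDA extension op, these dict-backed tables are stand-ins, and the exact split_index routing rule shown is an assumption.

```python
def fused_update_old(grads, indices, dev_table, uvm_table, split_index):
    # Old dual-table form: every index is routed to either the device or the UVM table.
    for g, i in zip(grads, indices):
        table = dev_table if i < split_index else uvm_table  # routing rule assumed
        table[i] = table.get(i, 0.0) - 0.01 * g  # toy SGD step


def fused_update_new(grads, indices, table):
    # New unified form: a single table, no index partitioning.
    for g, i in zip(grads, indices):
        table[i] = table.get(i, 0.0) - 0.01 * g
```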

Test Updates:

  • Removed all caching parameter usage from tests and examples
  • Changed default cache_capacity_ratio from 0.1 to 1.0 when not in caching mode (now means "use full HBM budget")
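
A small worked example of how the ratio maps to an HBM budget, assuming the budget is simply cache_capacity_ratio times the table's total value bytes (names and the exact formula are assumptions, not the test code itself):

```python
def hbm_budget_bytes(num_embeddings: int, embedding_dim: int,
                     bytes_per_value: int = 4,
                     cache_capacity_ratio: float = 1.0) -> int:
    """Assumed relationship: HBM budget = ratio * total bytes of embedding values."""
    total_table_bytes = num_embeddings * embedding_dim * bytes_per_value
    return int(cache_capacity_ratio * total_table_bytes)


# A 1M x 128 fp32 table: ratio 1.0 fits the whole table in HBM (GPU-only mode),
# ratio 0.1 leaves 90% of the values on the host (cache+storage mode).
print(hbm_budget_bytes(1_000_000, 128, cache_capacity_ratio=1.0))  # 512000000
print(hbm_budget_bytes(1_000_000, 128, cache_capacity_ratio=0.1))  # 51200000
```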

Impact

This refactoring maintains the same functional behavior while significantly simplifying the codebase (247 net line reduction). The memory management logic becomes more intuitive: users simply specify their HBM budget, and the system automatically configures the appropriate memory layout.

Confidence Score: 5/5

  • This PR is safe to merge - it's a well-executed refactoring that simplifies the codebase while maintaining functional equivalence
  • The changes demonstrate excellent software engineering practices: systematic removal of an abstraction (the caching boolean) in favor of automatic mode selection based on memory constraints. The refactoring is comprehensive across all layers (Python, C++, CUDA kernels, tests, examples) with consistent naming changes. The logic in batched_dynamicemb_tables.py correctly maps the old behavior to the new model. All test updates are appropriate, and the documentation updates accurately reflect the new mental model.
  • No files require special attention

Important Files Changed

  • corelib/dynamicemb/dynamicemb/dynamicemb_config.py: Removed caching parameter from DynamicEmbTableOptions, updated documentation to reflect unified memory model, added helper functions for memory calculations
  • corelib/dynamicemb/dynamicemb/key_value_table.py: Replaced dual dev_table/uvm_table design with single _table, simplified memory allocation logic, updated all load/store operations to use unified table interface
  • corelib/dynamicemb/dynamicemb/batched_dynamicemb_tables.py: Refactored cache/storage creation logic to automatically determine mode based on HBM budget vs total table size, maintaining same functionality with cleaner logic
  • corelib/dynamicemb/src/optimizer.cu: Removed split_index logic and dual table handling, simplified all optimizer update functions to work with single unified table
  • corelib/dynamicemb/src/dynamic_emb_op.cu: Renamed load_from_combined_table/store_to_combined_table to load_from_table/store_to_table, removed dual table handling and split logic from kernels
  • corelib/dynamicemb/test/unit_tests/test_embedding_dump_load.py: Removed caching parameter, updated HBM budget calculation to use cache_capacity_ratio directly, changed default non-caching ratio from 0.1 to 1.0

Sequence Diagram

sequenceDiagram
    participant API as DynamicEmbTableOptions
    participant BatchedTables as BatchedDynamicEmbeddingTablesV2
    participant Table as DynamicEmbeddingTable
    participant Optimizer as Optimizer (Python)
    participant OptimizerCU as Optimizer (CUDA)
    participant Kernel as CUDA Kernels

    Note over API,Kernel: Before: Hybrid Mode with caching parameter
    API->>BatchedTables: caching=True/False
    alt caching=True
        BatchedTables->>Table: Create GPU cache table (dev_table)
        BatchedTables->>Table: Create host storage table (uvm_table)
    else caching=False
        BatchedTables->>Table: Create single hybrid table (dev_table + uvm_table)
    end
    Optimizer->>OptimizerCU: fused_update(grads, indices, dev_table, uvm_table)
    OptimizerCU->>Kernel: launch with split_index (determines dev vs uvm)
    
    Note over API,Kernel: After: Unified table model
    API->>BatchedTables: local_hbm_for_values (budget)
    alt budget == 0
        BatchedTables->>Table: Create host-only table (_table on UVM)
    else budget >= total_table_bytes
        BatchedTables->>Table: Create GPU-only table (_table on device)
        Note over BatchedTables: Same table serves as cache & storage
    else budget < total_table_bytes
        BatchedTables->>Table: Create GPU cache (_table on device)
        BatchedTables->>Table: Create host storage (_table on UVM)
    end
    Optimizer->>OptimizerCU: fused_update(grads, indices, table)
    OptimizerCU->>Kernel: launch with single table pointer (no split_index)


@greptile-apps greptile-apps bot left a comment


20 files reviewed, no comments


@tiankongdeguiji

What’s the motivation for removing the hybrid mode? Would storing the entire table on the host consume more memory than using the hybrid mode?

@shijieliu
Collaborator Author

Hi @tiankongdeguiji, we’ve added support for caching in HBM and now store the entire table on the host. This approach delivers significantly better performance compared to the hybrid mode. Moreover, with advanced prefetching techniques we can effectively hide host memory latency and achieve additional performance gains. As for host memory consumption, it may not be a major concern, since HBM holds only a small portion of the embeddings in hybrid mode anyway. So we plan to remove hybrid mode and make caching the default when using HBM + host memory.

@tiankongdeguiji


How much performance improvement did you see in your benchmarks? And how do we enable and use prefetching?

@shijieliu
Collaborator Author

You can refer to the numbers at https://github.com/NVIDIA/recsys-examples/tree/main/corelib/dynamicemb/benchmark#test-results-1 for caching performance. As for prefetch, we are working on some optimizations and may release official benchmark results in late March.

@tiankongdeguiji


Hi @shijieliu, my understanding is that hybrid mode is similar to the UVM kernel in TorchRec, while caching mode is similar to UVMCaching. I suggest we keep both modes and make caching the default.

In some scenarios (e.g., online learning), host memory may be insufficient. Scaling out to multiple machines just to meet host memory requirements can be prohibitively expensive. Instead, we could use a planner that takes HBM and host memory limits as inputs and decides when to use hybrid versus caching mode.

Also, removing hybrid mode outright could cause existing users to hit OOM after upgrading DynamicEmb.
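
A minimal sketch of the planner rule proposed above; the function name, mode strings, and thresholds are illustrative assumptions, not anything that exists in DynamicEmb or TorchRec today.

```python
def plan_table_mode(total_table_bytes: int, hbm_limit: int, host_limit: int) -> str:
    """Pick a memory layout from the available HBM and host budgets."""
    if total_table_bytes <= hbm_limit:
        return "gpu_only"   # whole table fits in HBM
    if total_table_bytes <= host_limit:
        return "caching"    # full table on host, hot rows cached in HBM
    if total_table_bytes <= hbm_limit + host_limit:
        return "hybrid"     # table split across HBM and host memory
    raise MemoryError("table does not fit in HBM + host memory")
```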

@shijieliu
Collaborator Author

I see, thanks for letting us know. We will keep hybrid mode and make caching=True the default. FYI @jiashuy

@shijieliu shijieliu closed this Feb 13, 2026

auto kernel = update_with_index_kernel<GradType, WeightType, IndexType,
OptimizerType>;
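// Note (review suggestion, sketch only): if smem_size_f(block_size) can exceed the default
// 48 KB dynamic shared-memory limit, the kernel attribute needs to be raised before launch, e.g.
// cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size_f(block_size));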
kernel<<<grid_size, block_size, smem_size_f(block_size), stream>>>(
Collaborator

@jiashuy jiashuy Feb 13, 2026


Hi @shijieliu, here we should configure the external shared memory usage. We have already done this on the main branch, but this commit has some conflicts with main.

