
Commit 094eeb2

yupadhyay authored and facebook-github-bot committed
FAQ file for TorchRec (#3222)
Summary:
Pull Request resolved: #3222

docs: Add FAQ for TorchRec

This commit introduces a new FAQ.md file to address common questions regarding TorchRec for large model and embedding training. The FAQ covers:

- General concepts and use cases for TorchRec and FSDP.
- Sharding strategies and distributed training in TorchRec.
- Memory management and performance optimization for large embedding tables.
- Integration with existing systems.
- Common technical challenges encountered by users.
- Best practices for model design and evaluation.

The goal is to provide a comprehensive resource for users facing challenges with large-scale recommendation systems and distributed training, improving clarity and reducing common pain points.

Reviewed By: kausv

Differential Revision: D78769752

fbshipit-source-id: b44e09b7ff7de3d62883337eae0bf562bfaf86ad
1 parent e3d5e36 commit 094eeb2


docs/FAQ.md

Lines changed: 112 additions & 0 deletions
# TorchRec FAQ

Frequently asked questions about TorchRec.

## Table of Contents

- [General Concepts](#general-concepts)
- [Sharding and Distributed Training](#sharding-and-distributed-training)
- [Memory Management and Performance](#memory-management-and-performance)
- [Integrating with Existing Systems](#integrating-with-existing-systems)
- [Technical Challenges](#technical-challenges)
- [Model Design and Evaluation](#model-design-and-evaluation)
## General Concepts

### What are TorchRec and FSDP, and when should they be used?

**TorchRec** is a PyTorch domain library with primitives for large-scale distributed embeddings, typically used in recommendation systems. Use it when dealing with models containing massive embedding tables that exceed single-GPU memory.

**FSDP (Fully Sharded Data Parallel)** is a PyTorch distributed training technique that shards dense model parameters, gradients, and optimizer states across GPUs, reducing the memory footprint of large models. Use it for training large language models or other general deep learning architectures that need to scale across multiple GPUs.
### Can TorchRec do everything FSDP can do for sparse embeddings, and vice versa?

- **TorchRec** offers specialized sharding strategies and optimized kernels designed for sparse embeddings, making it more efficient for this specific task.
- **FSDP** can work with models containing sparse embeddings, but it might not be as optimized or feature-rich as TorchRec for this specific task. For recommendation systems, TorchRec's methods are often more memory efficient due to their focus on sparse data characteristics.
- For optimal results in recommendation systems with large sparse embeddings, combine TorchRec for the embeddings and FSDP for the dense parts of the model (see the sketch below).
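A minimal sketch of this split, assuming a two-tower style model whose sparse part is an `EmbeddingBagCollection` and whose dense part is a small MLP; the table name, feature name, and sizes are illustrative:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torchrec.distributed.model_parallel import DistributedModelParallel
from torchrec.modules.embedding_configs import EmbeddingBagConfig
from torchrec.modules.embedding_modules import EmbeddingBagCollection

dist.init_process_group(backend="nccl")
device = torch.device("cuda", torch.cuda.current_device())

# Sparse part: one large embedding table, created on the meta device so that
# only the local shards are materialized after sharding.
sparse_arch = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(
            name="t_user",
            embedding_dim=64,
            num_embeddings=1_000_000,
            feature_names=["user_id"],
        )
    ],
    device=torch.device("meta"),
)

# Dense part: an ordinary MLP.
dense_arch = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())

# TorchRec shards the embedding tables across ranks (model parallel) ...
sharded_sparse = DistributedModelParallel(module=sparse_arch, device=device)
# ... while FSDP shards the dense parameters, gradients, and optimizer state.
sharded_dense = FSDP(dense_arch.to(device))
```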
### Does TorchRec support DTensor?

Yes, TorchRec models can benefit from DTensor support in PyTorch distributed components, like FSDP2. This improves distributed training performance, efficiency, and interoperability between TorchRec and other DTensor-based components.
## Sharding and Distributed Training

### How do you choose the best sharding strategy for embedding tables?

TorchRec offers multiple sharding strategies:

- Table-Wise (TW)
- Row-Wise (RW)
- Column-Wise (CW)
- Table-Wise-Row-Wise (TWRW)
- Grid-Shard (GS)
- Data Parallel (DP)

Consider factors like embedding table size, memory constraints, communication patterns, and load balancing when selecting a strategy.

The TorchRec Planner can automatically find an optimal sharding plan based on your hardware and settings.
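For example, a minimal sketch of generating a plan with the planner and applying it, assuming the process group is already initialized and `ebc` is an `EmbeddingBagCollection`; the topology numbers are illustrative:

```python
import torch
import torch.distributed as dist
from torchrec.distributed.embeddingbag import EmbeddingBagCollectionSharder
from torchrec.distributed.model_parallel import DistributedModelParallel
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology

planner = EmbeddingShardingPlanner(
    topology=Topology(world_size=8, compute_device="cuda"),
)

# Agree on a single plan across all ranks, then hand it to DistributedModelParallel.
plan = planner.collective_plan(ebc, [EmbeddingBagCollectionSharder()], dist.group.WORLD)
model = DistributedModelParallel(
    module=ebc,
    device=torch.device("cuda", torch.cuda.current_device()),
    plan=plan,
)
```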
### How does the TorchRec planner work, and can it be customized?
49+
50+
The Planner aims to balance memory and computation across devices. You can influence the planner using ParameterConstraints, providing information like pooling factors. TorchRec also features automated sharding based on cost modeling and deep reinforcement learning called AutoShard.
51+
52+
53+
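A minimal sketch of constraining the planner for a single table; the table name, sharding type, and pooling factor are illustrative:

```python
from torchrec.distributed.planner import (
    EmbeddingShardingPlanner,
    ParameterConstraints,
    Topology,
)
from torchrec.distributed.types import ShardingType

constraints = {
    "t_user": ParameterConstraints(
        sharding_types=[ShardingType.ROW_WISE.value],  # only consider row-wise
        pooling_factors=[15.0],  # hint: average number of ids per sample
    ),
}

planner = EmbeddingShardingPlanner(
    topology=Topology(world_size=8, compute_device="cuda"),
    constraints=constraints,
)
```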
## Memory Management and Performance

### How do you manage the memory footprint of large embedding tables?

- Choose an optimal sharding strategy
- If GPU memory is not sufficient, TorchRec provides options to offload embeddings to CPU memory (UVM) and to SSD (see the sketch below)
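A minimal sketch of asking the planner for a UVM-caching compute kernel on a table that does not fit in GPU HBM; the table name is illustrative:

```python
from torchrec.distributed.embedding_types import EmbeddingComputeKernel
from torchrec.distributed.planner import ParameterConstraints

constraints = {
    "t_item": ParameterConstraints(
        # Keep the full table in unified (CPU/GPU) virtual memory and cache the
        # hottest rows in HBM.
        compute_kernels=[EmbeddingComputeKernel.FUSED_UVM_CACHING.value],
    ),
}
```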
## Integrating with Existing Systems

### Can TorchRec modules be easily converted to TorchScript for deployment and inference in C++ environments?

Yes, TorchRec modules can be traced and scripted for TorchScript inference in C++ environments. However, it is recommended to script only the non-embedding layers for better performance and to handle potential limitations with sharded embedding modules in TorchScript.
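For example, a minimal sketch of scripting just a dense tower for C++ deployment; the module and file name are illustrative:

```python
import torch

dense_arch = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU())

scripted_dense = torch.jit.script(dense_arch)
scripted_dense.save("dense_arch.pt")  # loadable from C++ via torch::jit::load
```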
## Technical Challenges

### Why are you getting row-wise alltoall errors when combining different pooling types?

This can occur due to incompatible sharding and pooling types, resulting in communication mismatches during data aggregation. Ensure your sharding and pooling choices align with the communication patterns required.
### How do you handle floating point exceptions when using quantized embeddings with float32 data types?

- Implement gradient clipping (see the sketch below)
- Monitor gradients and weights for numerical issues
- Consider using different scaling strategies, such as AMP (automatic mixed precision)
- Accumulate gradients over mini-batches
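A minimal sketch of gradient clipping combined with AMP loss scaling in a standard PyTorch training loop; the clipping threshold, `model`, `optimizer`, and `loader` are illustrative:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for batch, labels in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.binary_cross_entropy_with_logits(
            model(batch), labels
        )
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale before clipping so the norm is correct
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
```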
### What are best practices for handling scenarios with empty batches for EmbeddingCollection?

Handle empty batches by filtering them out, skipping lookups, using default values, or padding and masking them accordingly.
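For reference, a minimal sketch of a `KeyedJaggedTensor` in which one feature is empty for the whole batch; the feature names and ids are illustrative, and zero lengths like these are what the strategies above need to account for:

```python
import torch
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

kjt = KeyedJaggedTensor(
    keys=["f1", "f2"],
    values=torch.tensor([10, 11, 12]),   # all ids belong to "f1"
    lengths=torch.tensor([2, 1, 0, 0]),  # batch_size=2; "f2" has no ids at all
)
```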
### What are common causes of issues during the forward() graph and optimizer step()?

- Incorrect input data format, type, or device
- Invalid embedding lookups (out-of-range indices, mismatched names); see the sanity check sketched below
- Issues in the computational graph preventing gradient flow
- Incorrect optimizer setup, learning rate, or fusion settings
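A minimal sketch of such a sanity check for out-of-range indices before the lookup; `kjt` is a `KeyedJaggedTensor` and the per-feature table sizes are illustrative:

```python
num_embeddings_per_feature = {"f1": 1_000_000, "f2": 50_000}

for name, jt in kjt.to_dict().items():
    ids = jt.values()
    if ids.numel() > 0 and int(ids.max()) >= num_embeddings_per_feature[name]:
        raise ValueError(f"feature {name} contains indices outside its table")
```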
### What is the role of fused optimizers in TorchRec?

TorchRec uses fused optimizers, often with DistributedModelParallel, where the optimizer update is integrated into the backward pass. This prevents the materialization of embedding gradients, leading to significant memory savings. You can also opt for a dense optimizer for more control.
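A minimal sketch of fusing the embedding optimizer into the backward pass before wrapping with `DistributedModelParallel`, assuming `ebc`, `device`, and an initialized process group from the earlier sketches; the optimizer choice and learning rate are illustrative:

```python
import torch
from torchrec.distributed.model_parallel import DistributedModelParallel
from torchrec.optim.apply_optimizer_in_backward import apply_optimizer_in_backward

# Attach the optimizer to the embedding parameters so the update runs inside
# backward and embedding gradients are never materialized.
apply_optimizer_in_backward(torch.optim.SGD, ebc.parameters(), {"lr": 0.02})

model = DistributedModelParallel(module=ebc, device=device)
# model.fused_optimizer exposes the fused state; dense parameters would still
# use a regular torch.optim optimizer.
```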
## Model Design and Evaluation

### What are best practices for designing recommendation models with TorchRec?

- Carefully select and preprocess features
- Choose suitable model architectures for your recommendation task
- Leverage TorchRec components like EmbeddingBagCollection and optimized kernels (see the sketch below)
- Design the model with distributed training in mind, considering sharding and communication patterns
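A minimal sketch of wiring pooled EmbeddingBagCollection outputs into a dense layer; the table sizes, feature names, and the interaction are illustrative:

```python
import torch
from torchrec.modules.embedding_configs import EmbeddingBagConfig
from torchrec.modules.embedding_modules import EmbeddingBagCollection
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(name="t_user", embedding_dim=16,
                           num_embeddings=100, feature_names=["user_id"]),
        EmbeddingBagConfig(name="t_item", embedding_dim=16,
                           num_embeddings=200, feature_names=["item_id"]),
    ]
)
dense = torch.nn.Linear(32, 1)

# Two samples, each with one user id and one item id.
kjt = KeyedJaggedTensor(
    keys=["user_id", "item_id"],
    values=torch.tensor([1, 2, 3, 4]),
    lengths=torch.tensor([1, 1, 1, 1]),
)

pooled = ebc(kjt)  # KeyedTensor of pooled embeddings, one entry per feature
logits = dense(torch.cat([pooled["user_id"], pooled["item_id"]], dim=1))
```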
### What are the most effective methods for evaluating recommendation systems built with TorchRec?

**Offline Evaluation**:
- Use metrics like AUC, Recall@K, Precision@K, and NDCG@K
- Employ train-test splits, cross-validation, and negative sampling
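A minimal sketch of computing Recall@K for a single user from model scores; the candidate count, relevant-item ids, and K are illustrative:

```python
import torch

def recall_at_k(scores: torch.Tensor, relevant: set, k: int = 10) -> float:
    """scores: [num_items] predicted relevance for every candidate item."""
    top_items = torch.topk(scores, k).indices.tolist()
    return len(set(top_items) & relevant) / max(len(relevant), 1)

scores = torch.rand(1000)  # stand-in for model outputs
print(recall_at_k(scores, relevant={3, 17, 256}, k=10))
```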
**Online Evaluation**:
- Conduct A/B tests in production
- Measure metrics like click-through rate, conversion rate, and user engagement
- Gather user feedback
