@@ -146,13 +146,103 @@ commResult_t ctranAlltoallvDynamicSplit(
     CtranComm* comm,
     cudaStream_t stream);
 
-// Note: we support the combine and dispatch APIs by keeping both recvbuffs (for
-// dispatch) and recvbuff (for combine), for implementation simplicity, as we
-// now only have two moving variables for the two APIs: recvbuffs and
-// recvAllSplitLengths.
-// In the future, if the moving varibles increase and it requires more
-// extendiblity, we should implement a new class/struct like Hints, to support
-// various metadata type.
+/**
+ * Note: we support the combine and dispatch APIs by keeping both recvbuffs (for
+ * dispatch) and recvbuff (for combine) for implementation simplicity, as we
+ * now only have two moving variables for the two APIs: recvbuffs and
+ * recvAllSplitLengths.
+ * In the future, if the number of moving variables grows and more
+ * extensibility is required, we should implement a new class/struct like
+ * Hints to support various metadata types.
+ *
+ * All-to-all communication with dynamic split lengths and non-contiguous
+ * receive buffers. Designed for expert-parallel workloads (e.g., Mixture of
+ * Experts) where data needs to be routed to different experts across ranks.
+ *
+ * EXAMPLE SCENARIO:
+ * ----------------
+ * 4 ranks.
+ *
+ * Rank 0 wants to send a different amount of data to each rank:
+ *
+ * SEND-SIDE (What Rank 0 sends):
+ * ------------------------------
+ * sendbuff:  [100 ints | 200 ints | 150 ints | 50 ints | 300 ints | 100 ints]
+ * chunk-id:       0          1          2         3          4          5
+ *
+ * sendSplitLengths = [100, 200, 150, 50, 300, 100]
+ *   Meaning: how many elements are in each chunk
+ * numSendSplitLengths = 6 (total number of chunks)
+ *
+ * sendIndices = [0, 1, 4, 3, 5]
+ *   Meaning: a list of chunk ids to send
+ *   (NOTE: chunk 2 is not sent at all)
+ *
+ * sendIndicesBlockLengths = [2, 0, 1, 2]  (#entries = #ranks)
+ *   Meaning: chunks 0,1 go to rank 0
+ *            NOTHING   goes to rank 1
+ *            chunk 4   goes to rank 2
+ *            chunks 3,5 go to rank 3
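+ *
+ * Putting the two arrays together, the per-destination element counts for
+ * Rank 0 in this example work out to:
+ *   rank 0: chunks 0,1  -> 100 + 200 = 300 elements
+ *   rank 1: no chunks   -> 0 elements
+ *   rank 2: chunk 4     -> 300 elements
+ *   rank 3: chunks 3,5  -> 50 + 100 = 150 elements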
+ *
+ * RECEIVE-SIDE (What Rank 0 receives):
+ * ------------------------------------
+ * recvbuffs[0]: Data from Rank 0
+ * recvbuffs[1]: Data from Rank 1
+ * recvbuffs[2]: Data from Rank 2
+ * recvbuffs[3]: Data from Rank 3
+ *
+ * recvAllSplitLengths (output): allgathered sendSplitLengths from all ranks,
+ * e.g., [sendSplitLengths-rank0, sendSplitLengths-rank1,
+ * sendSplitLengths-rank2, sendSplitLengths-rank3].
+ * This is used in Dispatch(), but not in Combine().
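+ * In the example above, recvAllSplitLengths holds
+ * 4 ranks * 6 split lengths = 24 entries.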
+ *
+ *
+ * PARAMETERS:
+ * -----------
+ * @param sendbuff                 GPU buffer containing all data to send
+ *                                 Layout: concatenated chunks in order of
+ *                                 sendSplitLengths
+ *
+ * @param sendSplitLengths         GPU array of size numSendSplitLengths
+ *                                 Specifies the number of elements in each
+ *                                 chunk
+ *
+ * @param numSendSplitLengths      Total number of chunks
+ *
+ * @param sendIndices              GPU array of chunk ids to send
+ *
+ * @param sendIndicesBlockLengths  GPU array of size numRanks
+ *                                 Number of chunks sent to each rank;
+ *                                 partitions the sendIndices array by
+ *                                 destination
+ *
+ * @param recvbuffs                Array of numRanks GPU buffer pointers
+ *                                 recvbuffs[i] receives all data from rank i
+ *                                 Non-contiguous: separate buffer per sender
+ *
+ * @param maxSendcount             Maximum total elements that can be sent
+ *                                 (buffer capacity)
+ *
+ * @param maxRecvcount             Maximum total elements that can be received
+ *                                 (buffer capacity)
+ *
+ * @param hints                    Communication hints for optimization
+ *
+ * @param datatype                 Data type of elements (e.g., commInt32)
+ *
+ * @param comm                     Ctran communicator
+ *
+ * @param stream                   CUDA stream for async operations
+ *
+ * @param combine                  false = dispatch mode (scatter TO experts)
+ *                                 true = combine mode (gather FROM experts)
+ *
+ * @param recvAllSplitLengths      Optional GPU output buffer of size
+ *                                 (numRanks * numSendSplitLengths)
+ *                                 Stores the actual received sizes for each
+ *                                 chunk from each sender
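+ *
+ * EXAMPLE USAGE (a minimal, hypothetical sketch, not a verbatim snippet from
+ * the tests: the d_-prefixed names are assumed device allocations matching
+ * the scenario above, and the argument order follows the parameter list):
+ *
+ *   // Metadata mirrors the example scenario:
+ *   //   d_sendSplitLengths        = {100, 200, 150, 50, 300, 100}
+ *   //   d_sendIndices             = {0, 1, 4, 3, 5}
+ *   //   d_sendIndicesBlockLengths = {2, 0, 1, 2}
+ *   commResult_t res = ctranAlltoallvDynamicSplitNonContig(
+ *       d_sendbuff,
+ *       d_sendSplitLengths,
+ *       6,                         // numSendSplitLengths
+ *       d_sendIndices,
+ *       d_sendIndicesBlockLengths,
+ *       d_recvbuffs,               // numRanks GPU buffer pointers
+ *       maxSendcount,
+ *       maxRecvcount,
+ *       hints,
+ *       commInt32,
+ *       comm,
+ *       stream,
+ *       false,                     // dispatch mode
+ *       d_recvAllSplitLengths);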
+ */
 commResult_t ctranAlltoallvDynamicSplitNonContig(
     const void * sendbuff,
     const size_t * sendSplitLengths,