prototype_source/context_parallel.rst (1 addition, 1 deletion)
@@ -32,7 +32,7 @@ Two Ring Attention variants have been implemented: `the all-gather based pass-KV
 1. The all-gather based pass-KV algorithm is used in Llama3 training, which initially performs an all-gather on the key and value tensors, followed by computing the attention output for the
    local query tensor chunk. Our modified all-gather based pass-KV algorithm concurrently all-gathers KV shards and computes attention output for the local query tensor chunk
    using local key and value tensor chunks, followed by a final computation of attention output for the local query tensor and remaining KV shards. This allows some degree of
-   overlap between the attention computation and the all-gather collective.
+   overlap between the attention computation and the all-gather collective. For example, in the case of Llama3 training, we also shard ``freq_cis`` over the sequence dimension.
 2. The all-to-all approach uses interleaved all-to-all collectives to ring shuffle KV shards to overlap the SDPA (Scaled Dot Product Attention) computation and the all-to-all communication
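
To make the overlap in variant 1 concrete, below is a minimal, hypothetical sketch, not the actual implementation inside PyTorch. It launches an asynchronous all-gather of the KV shards, computes SDPA for the local query chunk against the local KV chunk while the collective is in flight, and then folds in the remaining shards with log-sum-exp rescaling. The helper names ``_attn_partial``, ``_merge``, and ``allgather_pass_kv_sdpa`` are invented for illustration; it assumes ``torch.distributed`` is already initialized and that ``q``, ``k``, ``v`` are this rank's sequence-sharded chunks of shape ``(batch, heads, seq_len / world_size, head_dim)``. Causal masking and load balancing are omitted for brevity.

.. code-block:: python

   import torch
   import torch.distributed as dist

   def _attn_partial(q, k, v, scale):
       # Hypothetical helper: partial attention over one KV chunk, returning
       # the merge ingredients: output and per-row log-sum-exp of the scores.
       s = (q @ k.transpose(-2, -1)) * scale
       lse = s.logsumexp(dim=-1, keepdim=True)
       return (s - lse).exp() @ v, lse

   def _merge(out1, lse1, out2, lse2):
       # Combine two partials over disjoint KV chunks via log-sum-exp rescaling.
       lse = torch.logaddexp(lse1, lse2)
       return (lse1 - lse).exp() * out1 + (lse2 - lse).exp() * out2

   def allgather_pass_kv_sdpa(q, k, v, group=None):
       rank, world = dist.get_rank(group), dist.get_world_size(group)
       scale = q.shape[-1] ** -0.5

       # One async all-gather carries both the K and V shards.
       kv = torch.stack([k, v]).contiguous()
       gathered = torch.empty((world, *kv.shape), dtype=kv.dtype, device=kv.device)
       work = dist.all_gather_into_tensor(gathered, kv, group=group, async_op=True)

       # Overlap: attend to the local KV chunk while the collective runs.
       out, lse = _attn_partial(q, k, v, scale)

       work.wait()
       # Final step: attention for the local query against the remaining KV shards.
       for r in range(world):
           if r == rank:
               continue
           out_r, lse_r = _attn_partial(q, gathered[r][0], gathered[r][1], scale)
           out = _merge(out, lse, out_r, lse_r)
           lse = torch.logaddexp(lse, lse_r)
       return out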
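A similarly hypothetical sketch of variant 2, reusing ``_attn_partial`` and ``_merge`` from above, is shown next. Each iteration expresses one ring hop as an ``all_to_all_single`` that sends the currently held KV chunk to the next rank while SDPA runs on that same chunk, so the shuffle for step ``i + 1`` overlaps the computation of step ``i``. The name ``alltoall_pass_kv_sdpa`` and the exact interleaving are assumptions for illustration only.

.. code-block:: python

   def alltoall_pass_kv_sdpa(q, k, v, group=None):
       rank, world = dist.get_rank(group), dist.get_world_size(group)
       scale = q.shape[-1] ** -0.5

       def ring_shift(t):
           # One ring hop as an all-to-all: every rank sends its whole chunk
           # to rank + 1 and receives rank - 1's chunk, asynchronously.
           flat, n = t.flatten(), t.numel()
           recv = torch.empty_like(flat)
           send_splits = [n if dst == (rank + 1) % world else 0 for dst in range(world)]
           recv_splits = [n if src == (rank - 1) % world else 0 for src in range(world)]
           work = dist.all_to_all_single(recv, flat, recv_splits, send_splits,
                                         group=group, async_op=True)
           return recv.view_as(t), work

       kv = torch.stack([k, v]).contiguous()
       out = lse = None
       for step in range(world):
           if step < world - 1:
               # Start shuffling the held KV chunk onward before using it, so
               # the collective overlaps the SDPA computation below.
               next_kv, work = ring_shift(kv)
           out_r, lse_r = _attn_partial(q, kv[0], kv[1], scale)
           if out is None:
               out, lse = out_r, lse_r
           else:
               out = _merge(out, lse, out_r, lse_r)
               lse = torch.logaddexp(lse, lse_r)
           if step < world - 1:
               work.wait()
               kv = next_kv
       return out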