| Multi Token Attention | `liger_kernel.transformers.LigerMultiTokenAttention` |
| Softmax | `liger_kernel.transformers.LigerSoftmax` |
| Sparsemax | `liger_kernel.transformers.LigerSparsemax` |
| mHC (Hyper-Connections) | `liger_kernel.transformers.LigerMHC` |
### RMS Norm
### Sparsemax

Sparsemax is a sparse alternative to softmax that produces sparse probability distributions.

The implementation achieves significant speed improvements and memory savings compared to standard PyTorch implementations, particularly for large input tensors.

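For reference, sparsemax (Martins & Astudillo, 2016) is the Euclidean projection of the logits onto the probability simplex. A minimal plain-PyTorch sketch of the operation (an illustration only, not the fused Triton kernel) looks like:

```python
import torch

def sparsemax_reference(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Plain-PyTorch sparsemax: project logits onto the probability simplex.

    Unlike softmax, entries below the data-dependent threshold `tau`
    are set to exactly zero, yielding a sparse distribution.
    """
    z_sorted, _ = torch.sort(z, dim=dim, descending=True)
    k = torch.arange(1, z.size(dim) + 1, device=z.device, dtype=z.dtype)
    shape = [1] * z.dim()
    shape[dim] = -1
    k = k.view(shape)
    cssv = z_sorted.cumsum(dim) - 1                    # cumulative sums minus 1
    support = (k * z_sorted > cssv).to(z.dtype)        # entries in the support
    k_max = support.sum(dim=dim, keepdim=True)         # support size
    tau = cssv.gather(dim, k_max.long() - 1) / k_max   # threshold
    return torch.clamp(z - tau, min=0)
```

The output sums to 1 along `dim` but, unlike softmax, contains exact zeros for low-scoring entries.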
### mHC (Manifold-Constrained Hyper-Connections)

mHC implements fused Triton kernels for Manifold-Constrained Hyper-Connections ([arXiv:2512.24880](https://arxiv.org/abs/2512.24880)). It wraps an arbitrary layer `F: [..., C] -> [..., C]` with multiple residual streams, constraining the residual routing matrix `H_res` onto the Birkhoff polytope (the set of doubly-stochastic matrices) via Sinkhorn-Knopp iterations to stabilize training.

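The Birkhoff-polytope constraint can be illustrated with a plain-PyTorch Sinkhorn-Knopp sketch (an illustration of the doubly-stochastic projection, not the fused kernel's exact implementation):

```python
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Map square logits to an (approximately) doubly-stochastic matrix
    by alternately normalizing the rows and columns of exp(logits)."""
    m = torch.exp(logits)  # strictly positive entries
    for _ in range(n_iters):
        m = m / m.sum(dim=-1, keepdim=True)  # rows sum to 1
        m = m / m.sum(dim=-2, keepdim=True)  # columns sum to 1
    return m
```

Because every row and column sums to 1, mixing residual streams with such a matrix neither amplifies nor attenuates the total residual signal.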
The `LigerMHC` module takes input of shape `[..., HC, C]`, where `HC` is the number of residual streams, and performs:

1. **Coefficients** -- compute data-dependent routing coefficients (`h_pre`, `h_post`, `h_res`) via fused matmul + RMS normalization + Sinkhorn-Knopp iterations.
2. **Pre-aggregate** -- `x_in = sum_i h_pre[i] * x[i]` (shape: `[..., C]`).
3. **Layer** -- `f_out = layer(x_in)` (shape: `[..., C]`).
4. **Post + residual** -- `x_out[o] = sum_i h_res[o,i] * x[i] + h_post[o] * f_out` (shape: `[..., HC, C]`).

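Steps 2-4 can be written as an unfused plain-PyTorch reference (for illustration; the actual kernels fuse these operations):

```python
import torch

def mhc_apply_reference(x, h_pre, h_post, h_res, layer):
    """Unfused reference of the pre-aggregate / layer / post + residual steps.

    Shapes: x: [..., HC, C]; h_pre, h_post: [..., HC]; h_res: [..., HC, HC].
    """
    x_in = torch.einsum("...i,...ic->...c", h_pre, x)       # step 2: pre-aggregate
    f_out = layer(x_in)                                      # step 3: wrapped layer
    x_out = torch.einsum("...oi,...ic->...oc", h_res, x)     # step 4: residual mix
    x_out = x_out + h_post.unsqueeze(-1) * f_out.unsqueeze(-2)  # + post-scaled output
    return x_out
```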
Usage:

```python
import torch
import torch.nn as nn

from liger_kernel.transformers import LigerMHC

# Wrap a linear layer with 4 residual streams of dimension 256
layer = nn.Linear(256, 256, bias=False, device="cuda", dtype=torch.bfloat16)
mhc = LigerMHC(layer, hc=4, c=256, phi_dtype=torch.bfloat16).cuda()

# Input: [batch, seq_len, num_streams, channels] in BF16/FP16
x = torch.randn(2, 128, 4, 256, device="cuda", dtype=torch.bfloat16)
out = mhc(x)  # shape: [2, 128, 4, 256]
```

Functional APIs are also available:

- `liger_kernel.transformers.functional.liger_mhc_coeffs` -- compute routing coefficients
- `liger_kernel.transformers.functional.liger_mhc_pre` -- pre-aggregation
- `liger_kernel.transformers.functional.liger_mhc_post_res` -- post-aggregation + residual
- `liger_kernel.transformers.functional.liger_mhc_apply` -- combined pre + post_res
- `liger_kernel.transformers.functional.liger_mhc_forward` -- full forward pass (coeffs + pre + layer + post_res)

## Alignment Kernels

| **Kernel** | **API** |