Commit 2c2f109

Enhance explanation of transformer benefits and TabPFN design
Clarify the advantages of transformers for tabular datasets and explain the role of the MLP in TabPFN.
Parent: cd5fdb8

File tree

1 file changed: +2 -2 lines changed


annotated/annotated nanoTabPFN.md

Lines changed: 2 additions & 2 deletions
@@ -164,8 +164,8 @@ Through in-context learning, transformers can:
 
 ### **Scalable Parallelization**
 Unlike sequential models, transformers offer:
-- Parallelization across sequence length during training
-- Efficient batch processing on modern hardware (GPUs)
+- One-step computation across the full sequence: self-attention eliminates sequential dependencies, enabling all token–token interactions to be computed in a single set of matrix multiplications
+- Full exploitation of GPU parallelism: large batched matrix multiplications allow efficient use of modern hardware, with parallelism across batches and attention heads
 
 These advantages make transformers well-suited as foundation models for diverse tabular datasets without task-specific modifications. While challenges remain, particularly the quadratic complexity for large datasets, their flexibility and expressiveness make transformers the architecture of choice for tabular foundation models. It is important to note that TabPFN uses only the transformer encoder, because tabular prediction requires classifying or regressing all test samples simultaneously based on the provided context rather than generating outputs sequentially as in language generation. The "decoder" in TabPFN is simply an MLP that maps the enriched target embeddings from the transformer encoder to final predictions; it is not a transformer decoder at all. This design mirrors architectures where transformer encoders extract rich representations that are then passed through task-specific heads, rather than GPT-style decoders that generate tokens autoregressively.
 