The rules are simpler than they look. Compare shapes from right to left. At each position, two dimensions are compatible if they are equal or if either of them is 1.

| A shape | B shape | Result shape | Valid? |
|---|---|---|---|
|`(3, 4)`|`(3,)`| Error | ✗ (3 ≠ 4) |
|`(2, 3, 4)`|`(3, 4)`|`(2, 3, 4)`| ✓ |
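The right-to-left rule in the table above can be sketched as a small helper. This is a hypothetical function for illustration, not part of any library API:

```python
def broadcast_shape(a, b):
    """Apply the broadcasting rule: walk both shapes from the right;
    a missing dimension counts as 1; dims must match or one must be 1."""
    result = []
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"incompatible dims {da} vs {db}")
        result.append(max(da, db))
    return tuple(reversed(result))

print(broadcast_shape((2, 3, 4), (3, 4)))  # (2, 3, 4)
try:
    broadcast_shape((3, 4), (3,))
except ValueError as e:
    print(e)  # incompatible dims 4 vs 3
```

Note that the error for `(3, 4)` vs `(3,)` comes from the rightmost position (4 vs 3), exactly as in the table.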
The memory savings are dramatic. Adding a `(768,)` vector to a `(32, 512, 768)` tensor would require copying the vector 32×512 times without broadcasting, allocating {glue:text}`bcast_mb` of redundant data ({glue:text}`bcast_elements` float32 numbers). With broadcasting, you store just the original {glue:text}`bcast_vec_kb` vector.
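The zero-copy behavior is easy to verify with NumPy: `broadcast_to` returns a strided view, so the repeated dimensions occupy no new memory.

```python
import numpy as np

vec = np.zeros(768, dtype=np.float32)        # the original small vector
big = np.broadcast_to(vec, (32, 512, 768))   # no copy: just a strided view

print(vec.nbytes)                  # 3072 bytes (~3 KB)
print(big.strides)                 # (0, 0, 4): repeated dims have stride 0
print(np.shares_memory(vec, big))  # True
```

The stride of 0 in the first two dimensions is the whole trick: every "copy" of the vector points back at the same 3 KB of storage.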
### Views vs. Copies
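A quick NumPy sketch of the distinction: basic slicing yields a view that shares memory with the original, while `.copy()` allocates fresh storage.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)
view = x[:, 1]         # basic slicing returns a view
copy = x[:, 1].copy()  # explicit copy owns its own data

view[0] = 99           # writes through to the original array
print(x[0, 1])                    # 99
print(np.shares_memory(x, view))  # True
print(np.shares_memory(x, copy))  # False
```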
Broadcasting rules, shape semantics, and API design patterns. When you debug PyTorch shape errors, you'll understand exactly what's happening because you implemented the same logic.
### Why Tensors Matter at Scale
```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# NOTE: cell body reconstructed; values match the prose below.
# LLM parameter storage (fp16 = 2 bytes per param)
llm_bytes = 175e9 * 2
glue("llm_gb", f"{llm_bytes / 1e9:.0f} GB")

# A batch of 128 RGB images (224x224, float32)
batch128_bytes = 128 * 3 * 224 * 224 * 4
glue("batch128_mb", f"{batch128_bytes / 1e6:.0f} MB")
```
To appreciate why tensor operations matter, consider the scale of modern ML systems:
- **Large language models**: 175 billion numbers stored as tensors = **{glue:text}`llm_gb`** (like storing 70,000 full-resolution photos)
- **Image processing**: A batch of 128 images = **{glue:text}`batch128_mb`** of tensor data
- **Self-driving cars**: Process tensor operations at **36 FPS** across multiple cameras (each frame = millions of operations in 28 milliseconds)
A single matrix multiplication can consume **90% of computation time** in neural networks. Understanding tensor operations isn't just academic; it's essential for building and debugging real ML systems.
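A rough cost model makes that claim concrete: multiplying an `(m, k)` matrix by a `(k, n)` matrix takes about `2*m*k*n` floating-point operations, while an element-wise op on the `(m, n)` output takes only `m*n`. The sizes below are assumed for illustration:

```python
# Assumed transformer-ish layer sizes (illustrative only)
m, k, n = 512, 768, 768

matmul_flops = 2 * m * k * n      # multiply-accumulate for each output cell
elementwise_flops = m * n         # one op per output element

print(matmul_flops / elementwise_flops)  # 1536.0: the matmul does ~1500x more work
```

The ratio is just `2*k`, which is why the matmul's share of runtime grows with the inner dimension.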
Test yourself with these systems thinking questions. They're designed to build intuition for the performance characteristics you'll encounter in production ML.
**Q1: Memory Calculation**
A batch of 32 RGB images (224×224 pixels) stored as float32. How much memory?
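One way to check your answer (a worked computation, assuming the standard batch × channels × height × width layout with 4 bytes per float32):

```python
# 32 images x 3 channels x 224 x 224 pixels x 4 bytes (float32)
nbytes = 32 * 3 * 224 * 224 * 4
print(nbytes)                  # 19267584 bytes
print(round(nbytes / 1e6, 1))  # 19.3 (MB)
```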
Let's walk through the key similarities and differences:
Mathematical functions, numerical stability techniques (max subtraction in softmax), and the concept of element-wise transformations. When you debug PyTorch activation issues, you'll understand exactly what's happening because you implemented the same logic.
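The max-subtraction trick mentioned above can be sketched in a few lines. This is a minimal NumPy version for illustration, not the module's actual implementation:

```python
import numpy as np

def softmax(x):
    # Subtracting the row max keeps exp() from overflowing; the result is
    # mathematically identical because the constant cancels in the ratio.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Naive exp(1000) would overflow to inf; the shifted version is finite
print(softmax(np.array([1000.0, 1001.0, 1002.0])))  # sums to 1, no overflow
```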
To appreciate why activation choice matters, consider the scale of modern ML systems:
- **Large language models**: GPT-3 has 96 transformer layers, each with 2 GELU activations. That's **{glue:text}`prose_gelu_ops` GELU operations per forward pass** on billions of parameters.
- **Image classification**: ResNet-50 has 49 convolutional layers, each followed by ReLU. Processing a batch of 256 images at 224×224 resolution means **12 billion ReLU operations** per batch.
- **Production serving**: A model serving 1000 requests per second performs **{glue:text}`prose_daily_activations` activation computations per day**. A 20% speedup from ReLU vs GELU saves hours of compute time.
Activation functions account for **5-15% of total training time** in typical networks (the rest is matrix multiplication). But in transformer models with many layers and small matrix sizes, activations can account for **20-30% of compute time**. This is why GELU vs ReLU is a real trade-off: slower computation but potentially better accuracy.
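A rough micro-benchmark conveys the trade-off. This sketch uses NumPy and the tanh approximation of GELU; absolute timings depend heavily on hardware and array size, so treat the numbers as illustrative:

```python
import time
import numpy as np

x = np.random.randn(1024, 1024).astype(np.float32)

def relu(z):
    return np.maximum(z, 0)

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

for fn in (relu, gelu):
    start = time.perf_counter()
    for _ in range(100):
        fn(x)
    print(fn.__name__, time.perf_counter() - start)
```

GELU's extra multiplies and the `tanh` make it several times more expensive per element than a single `maximum`, which is the source of the speed/accuracy trade-off discussed above.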
This is the activation memory for ONE layer. A 100-layer network needs {glue:text}`q1_100layer_mb` just to store activations for one forward pass. This is why activation memory dominates training memory usage — activations must be cached for backpropagation.
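The arithmetic behind that figure, assuming a hypothetical per-layer activation of shape `(128, 1024)` stored as float32:

```python
# Hypothetical per-layer activation: (batch=128, features=1024), float32
per_layer_bytes = 128 * 1024 * 4       # 4 bytes per float32
total_bytes = per_layer_bytes * 100    # cached for all 100 layers

print(per_layer_bytes)        # 524288 (~0.5 MB per layer)
print(total_bytes / 1e6)      # 52.4288 (~50 MB for one forward pass)
```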
**Q2: Computational Cost**
For a standard normal distribution N(0, 1), approximately **50% of values are negative**.
ReLU zeros all negative values, so approximately **50% of outputs will be exactly zero**.
Total elements: 128 × 1024 = {glue:text}`q4_total`

Zeros: {glue:text}`q4_zeros`
This sparsity has major implications:
- **Speed**: Multiplying by zero is free, so downstream computations can skip ~50% of operations
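You can verify the ~50% figure empirically (a NumPy sketch with a fixed seed):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 1024)).astype(np.float32)
out = np.maximum(x, 0)       # ReLU: zeros out every negative value

print(out.size)              # 131072 total elements
print((out == 0).mean())     # ~0.5: about half the outputs are exactly zero
```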
Implement Linear layers that combine your Tensor operations with your activation functions.
```{tip} Interactive Options
- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/02_activations/activations.ipynb)** - Run interactively in browser, no setup required
- **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/02_activations/02_activations.py)** - Browse the implementation code
```