- Dot Product: a·b = Σ aᵢbᵢ = |a||b|cos(θ)
- L2 Norm: ||v|| = √(Σ vᵢ²)
- L1 Norm: ||v||₁ = Σ |vᵢ|
- Cosine Similarity: cos(θ) = (a·b)/(||a|| ||b||)
- Euclidean Distance: d(a,b) = ||a - b||
- Matrix Multiplication: (AB)ᵢⱼ = Σₖ Aᵢₖ Bₖⱼ
- Dimensions: (m×n)(n×p) = (m×p)
- Transpose: (Aᵀ)ᵢⱼ = Aⱼᵢ
- Trace: tr(A) = Σᵢ Aᵢᵢ
- Determinant: det(A) - measures volume scaling
- Inverse: AA⁻¹ = I (only for square, non-singular matrices)
- (AB)ᵀ = BᵀAᵀ
- (AB)⁻¹ = B⁻¹A⁻¹
- tr(AB) = tr(BA)
- det(AB) = det(A)det(B)
- Definition: Av = λv
- v: eigenvector (direction)
- λ: eigenvalue (scaling)
- Decomposition: A = VΛVᵀ (for symmetric A)
- V: eigenvectors (columns)
- Λ: diagonal matrix of eigenvalues
- Formula: A = UΣVᵀ
- U (m×m): left singular vectors
- Σ (m×n): singular values (diagonal)
- V (n×n): right singular vectors
- Rank-k Approximation: Aₖ = Σᵢ₌₁ᵏ σᵢ uᵢvᵢᵀ
- Connection to Eigendecomposition:
- AᵀA = VΣ²Vᵀ (right singular vectors = eigenvectors of AᵀA)
- AAᵀ = UΣ²Uᵀ (left singular vectors = eigenvectors of AAᵀ)
- Center data: X̃ = X - mean(X)
- Compute covariance: C = (X̃ᵀX̃)/(n-1)
- Eigendecomposition: C = VΛVᵀ
- Project: Z = X̃V
- Variance explained: λᵢ/Σλⱼ
Python:
from sklearn.decomposition import PCA
pca = PCA(n_components=k)
Z = pca.fit_transform(X)
variance_ratio = pca.explained_variance_ratio_- Power Rule: d/dx(xⁿ) = nxⁿ⁻¹
- Exponential: d/dx(eˣ) = eˣ
- Logarithm: d/dx(ln x) = 1/x
- Product Rule: (fg)' = f'g + fg'
- Chain Rule: d/dx f(g(x)) = f'(g(x))·g'(x)
- d/dx(sin x) = cos x
- d/dx(cos x) = -sin x
- d/dx(1/(1+e⁻ˣ)) = e⁻ˣ/(1+e⁻ˣ)² = σ(x)(1-σ(x))
- d/dx(tanh x) = 1 - tanh²(x)
- ∂f/∂x: derivative w.r.t. x (hold other variables constant)
- Gradient: ∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]
- Points in direction of steepest ascent
- Perpendicular to level curves
If z = f(y) and y = g(x):
- dz/dx = (∂f/∂y)(dy/dx)
For multiple variables:
- ∂z/∂x = Σⱼ (∂z/∂yⱼ)(∂yⱼ/∂x)
Backpropagation is just chain rule!
- ∇ₓ(aᵀx) = a
- ∇ₓ(xᵀAx) = (A + Aᵀ)x
- ∇ₓ(||Ax - b||²) = 2Aᵀ(Ax - b)
- First-order (necessary): ∇f(x*) = 0
- Second-order (sufficient): ∇²f(x*) ≻ 0 (Hessian positive definite)
H = [∂²f/∂xᵢ∂xⱼ]
- Positive definite → local minimum
- Negative definite → local maximum
- Indefinite → saddle point
- Sample Space (Ω): All possible outcomes
- Event (A): Subset of sample space
- Probability: P(A) ∈ [0,1], P(Ω) = 1
- Addition Rule: P(A∪B) = P(A) + P(B) - P(A∩B)
- Multiplication Rule: P(A∩B) = P(A)P(B|A)
- Independence: P(A∩B) = P(A)P(B)
- Conditional: P(A|B) = P(A∩B)/P(B)
- Bayes Theorem: P(A|B) = P(B|A)P(A)/P(B)
- P(A|B): posterior
- P(B|A): likelihood
- P(A): prior
- P(B): evidence/normalization
Python:
def bayes(prior, likelihood, evidence):
return (likelihood * prior) / evidence- Expectation: E[X] = Σ x·P(X=x) (discrete) or ∫ x·f(x)dx (continuous)
- Variance: Var[X] = E[(X-μ)²] = E[X²] - (E[X])²
- Standard Deviation: σ = √Var[X]
- Covariance: Cov[X,Y] = E[(X-μₓ)(Y-μᵧ)]
- Correlation: ρ = Cov[X,Y]/(σₓσᵧ) ∈ [-1,1]
Bernoulli (single trial):
- P(X=1) = p, P(X=0) = 1-p
- E[X] = p, Var[X] = p(1-p)
Binomial (n trials):
- P(X=k) = C(n,k)pᵏ(1-p)ⁿ⁻ᵏ
- E[X] = np, Var[X] = np(1-p)
Poisson (count of rare events):
- P(X=k) = (λᵏe⁻λ)/k!
- E[X] = Var[X] = λ
Normal/Gaussian:
- f(x) = (1/√(2πσ²))exp(-(x-μ)²/(2σ²))
- E[X] = μ, Var[X] = σ²
- 68% within 1σ, 95% within 2σ, 99.7% within 3σ
Exponential (waiting time):
- f(x) = λe⁻λˣ
- E[X] = 1/λ, Var[X] = 1/λ²
Python:
from scipy import stats
# Binomial
stats.binom.pmf(k, n, p) # P(X=k)
stats.binom.cdf(k, n, p) # P(X≤k)
# Normal
stats.norm.pdf(x, mu, sigma) # Probability density
stats.norm.cdf(x, mu, sigma) # Cumulative probability
# Poisson
stats.poisson.pmf(k, lambda) # P(X=k)Goal: Find θ that maximizes P(data|θ)
Steps:
- Write likelihood: L(θ) = P(x₁,x₂,...,xₙ|θ)
- Take log: ℓ(θ) = log L(θ) = Σᵢ log P(xᵢ|θ)
- Differentiate: ∂ℓ/∂θ = 0
- Solve for θ
Examples:
- Normal: μ̂ = (1/n)Σxᵢ, σ̂² = (1/n)Σ(xᵢ-μ̂)²
- Bernoulli: p̂ = k/n
- Poisson: λ̂ = (1/n)Σxᵢ
Z-score: z = (x - μ)/σ t-statistic: t = (x̄ - μ)/(s/√n) Chi-square: χ² = Σ(Observed - Expected)²/Expected p-value: P(observing data at least as extreme | H₀ true)
Update Rule: θₜ₊₁ = θₜ - α∇L(θₜ)
- α: learning rate
- ∇L: gradient of loss
Variants:
- Batch GD: Use all data
- Stochastic GD: Use one sample
- Mini-batch GD: Use small batch
vₜ₊₁ = βvₜ - α∇L(θₜ) θₜ₊₁ = θₜ + vₜ₊₁
- β ∈ [0,1): momentum coefficient (typically 0.9)
mₜ = β₁mₜ₋₁ + (1-β₁)∇L(θₜ) # First moment (mean)
vₜ = β₂vₜ₋₁ + (1-β₂)(∇L(θₜ))² # Second moment (variance)
m̂ₜ = mₜ/(1-β₁ᵗ) # Bias correction
v̂ₜ = vₜ/(1-β₂ᵗ)
θₜ₊₁ = θₜ - α·m̂ₜ/(√v̂ₜ + ε)
- β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸ (typical)
- Constant: α(t) = α₀
- Step Decay: α(t) = α₀·γᵏ (every k epochs)
- Exponential: α(t) = α₀e⁻ᵏᵗ
- 1/t: α(t) = α₀/(1+kt)
- Cosine Annealing: α(t) = α_min + 0.5(α_max-α_min)(1+cos(πt/T))
Function f is convex if:
- f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y) for all λ∈[0,1]
- Equivalently: Hessian ∇²f ⪰ 0
Properties:
- Local minimum = global minimum
- Gradient descent converges to global minimum
Mean Squared Error (MSE):
- L = (1/n)Σ(yᵢ - ŷᵢ)²
- Gradient: ∂L/∂ŷᵢ = -2(yᵢ - ŷᵢ)/n
Mean Absolute Error (MAE):
- L = (1/n)Σ|yᵢ - ŷᵢ|
- More robust to outliers
Huber Loss (smooth MAE):
- L = {½(y-ŷ)² if |y-ŷ| ≤ δ {δ|y-ŷ|-½δ² otherwise
Binary Cross-Entropy:
- L = -(1/n)Σ[yᵢlog(ŷᵢ) + (1-yᵢ)log(1-ŷᵢ)]
- Gradient: ∂L/∂ŷᵢ = -(yᵢ/ŷᵢ - (1-yᵢ)/(1-ŷᵢ))/n
Categorical Cross-Entropy (multi-class):
- L = -(1/n)Σᵢ Σⱼ yᵢⱼlog(ŷᵢⱼ)
- yᵢⱼ: one-hot encoded labels
Hinge Loss (SVM):
- L = (1/n)Σ max(0, 1 - yᵢŷᵢ)
- yᵢ ∈ {-1, +1}
Shannon Entropy: H(X) = -Σᵢ p(xᵢ)log₂(p(xᵢ))
- Measures uncertainty/information content
- Units: bits (log₂) or nats (ln)
- Maximum for uniform distribution: log₂(n)
H(p,q) = -Σᵢ p(xᵢ)log(q(xᵢ))
- Cost of encoding p using code for q
- Used as loss function in classification
D_KL(p||q) = Σᵢ p(xᵢ)log(p(xᵢ)/q(xᵢ))
- Measures difference between distributions
- D_KL(p||q) ≥ 0, equals 0 iff p = q
- Not symmetric: D_KL(p||q) ≠ D_KL(q||p)
- Relationship: D_KL(p||q) = H(p,q) - H(p)
I(X;Y) = H(X) + H(Y) - H(X,Y)
- Measures dependency between X and Y
- I(X;Y) = 0 ⟺ X and Y independent
- Used in feature selection
Python:
from sklearn.metrics import mutual_info_score
mi = mutual_info_score(X, Y)| Function | Formula | Derivative | Range |
|---|---|---|---|
| Sigmoid | σ(x) = 1/(1+e⁻ˣ) | σ(x)(1-σ(x)) | (0,1) |
| Tanh | tanh(x) = (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | 1-tanh²(x) | (-1,1) |
| ReLU | max(0,x) | {0 if x<0, 1 if x≥0} | [0,∞) |
| Leaky ReLU | max(αx,x) | {α if x<0, 1 if x≥0} | (-∞,∞) |
| Softmax | eˣⁱ/Σⱼeˣʲ | σᵢ(1-σᵢ) for i=j | (0,1), Σσᵢ=1 |
When to use:
- Sigmoid: Binary classification (output layer)
- Tanh: Hidden layers (centered at 0)
- ReLU: Hidden layers (most common, fast)
- Softmax: Multi-class classification (output layer)
- Graph: G = (V,E) - vertices and edges
- Degree: Number of edges connected to a node
- Path: Sequence of vertices connected by edges
- Connected: Path exists between all vertex pairs
- Cycle: Path that starts and ends at same vertex
Adjacency Matrix: A[i,j] = 1 if edge (i,j) exists Adjacency List: dict of neighbors for each node
Degree Centrality: Number of connections
- C_D(v) = degree(v)/(n-1)
Betweenness Centrality: Fraction of shortest paths through node
- C_B(v) = Σₛ≠ᵥ≠ₜ σₛₜ(v)/σₛₜ
Eigenvector Centrality: Importance of neighbors
- Av = λv where A is adjacency matrix
PageRank: Iterative importance
- PR(v) = (1-d)/n + d·Σᵤ PR(u)/outdegree(u)
Message Passing: h_v^(l+1) = σ(W^(l)·AGG({h_u^(l) : u∈N(v)}))
- Aggregate information from neighbors
- Transform with learned weights
- Apply non-linearity
Layer i: zᵢ = Wᵢaᵢ₋₁ + bᵢ
aᵢ = σ(zᵢ)
Output layer: δᴸ = ∂L/∂zᴸ
Hidden layer: δˡ = (Wˡ⁺¹)ᵀδˡ⁺¹ ⊙ σ'(zˡ)
Gradients: ∂L/∂Wˡ = δˡ(aˡ⁻¹)ᵀ
∂L/∂bˡ = δˡ
(⊙ = element-wise multiplication)
- Zero: Bad (symmetry problem)
- Random: Normal(0, σ²)
- Xavier: σ = √(2/(n_in + n_out))
- He: σ = √(2/n_in) - for ReLU
L1: L_total = L_data + λ·Σ|wᵢ| L2: L_total = L_data + λ·Σwᵢ² Dropout: Randomly set activations to 0 (p probability) Batch Norm: Normalize layer inputs
❌ Vanishing Gradients: Gradients → 0 in deep networks ✓ Fix: ReLU, Batch Norm, Residual connections
❌ Exploding Gradients: Gradients → ∞ ✓ Fix: Gradient clipping, lower learning rate
❌ Local Minima: Stuck in suboptimal solution ✓ Fix: Momentum, Adam, multiple random restarts
❌ Learning Rate too high: Divergence ✓ Fix: Reduce α, use learning rate schedule
❌ Learning Rate too low: Slow convergence ✓ Fix: Increase α, use adaptive methods (Adam)
❌ Unscaled Features: Different ranges ✓ Fix: StandardScaler, MinMaxScaler
❌ Imbalanced Classes: Model ignores minority ✓ Fix: Resampling, class weights, SMOTE
❌ Overfitting: High train acc, low test acc ✓ Fix: Regularization, more data, dropout, early stopping
❌ Underfitting: Low train and test acc ✓ Fix: More complex model, more features, less regularization
- Always check shapes: print(X.shape)
- Visualize loss curve
- Start with small dataset
- Test gradient computation (finite differences)
- Monitor gradient norms
- Check for NaN/Inf values
import numpy as np
from scipy import stats, linalg
# Linear Algebra
np.dot(A, B) # Matrix multiplication
np.linalg.inv(A) # Inverse
np.linalg.eig(A) # Eigenvalues/vectors
np.linalg.svd(A) # SVD
np.linalg.norm(v) # L2 norm
np.linalg.norm(v, ord=1) # L1 norm
# Statistics
np.mean(X, axis=0) # Mean along axis
np.std(X, axis=0) # Standard deviation
np.cov(X.T) # Covariance matrix
np.corrcoef(X.T) # Correlation matrix
# Probability
stats.norm.pdf(x, mu, sigma) # Normal PDF
stats.norm.cdf(x, mu, sigma) # Normal CDF
stats.binom.pmf(k, n, p) # Binomial PMF
stats.poisson.pmf(k, lam) # Poisson PMF
# Optimization
from scipy.optimize import minimize
result = minimize(fun, x0, method='BFGS', jac=gradient)Cauchy-Schwarz: |⟨x,y⟩| ≤ ||x|| ||y|| Triangle: ||x + y|| ≤ ||x|| + ||y|| Jensen: For convex f: f(E[X]) ≤ E[f(X)] Markov: P(X ≥ a) ≤ E[X]/a Chebyshev: P(|X-μ| ≥ kσ) ≤ 1/k² Hoeffding: Concentration inequality for bounded variables
| Problem | Math Concepts | Common Methods |
|---|---|---|
| Gene Expression Analysis | PCA, SVD, Clustering | Dimensionality reduction, Differential expression |
| Sequence Alignment | Dynamic programming, Scoring matrices | Smith-Waterman, BLAST |
| Protein Structure | Optimization, Energy minimization | Molecular dynamics, Homology modeling |
| Population Genetics | Probability, Hardy-Weinberg | Coalescent theory, Selection models |
| Phylogenetics | Graph theory, Maximum likelihood | Neighbor-joining, Maximum parsimony |
| Drug Discovery | Regression, Classification, GNNs | QSAR, Virtual screening |
| Single-cell RNA-seq | Manifold learning, Clustering | t-SNE, UMAP, Louvain clustering |
| Protein-Protein Interactions | Graph theory, Centrality | Network analysis, Community detection |
| Variant Calling | Bayesian inference, Hypothesis testing | GATK, FreeBayes |
| Image Analysis | CNNs, Computer vision | Cell segmentation, Tissue classification |
Keep this reference handy! Update with your own notes and insights as you learn.