Commit f61ae02

done probability model

1 parent 995ae78 commit f61ae02

10 files changed, +842 -0 lines changed

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
---
title: "Basics of Probability"
sidebar_label: Probability Basics
description: "An intuitive introduction to probability theory, sample spaces, events, and the fundamental axioms that govern uncertainty in Machine Learning."
tags: [probability, mathematics-for-ml, sample-space, axioms, statistics]
---

In Machine Learning, we never have perfect information. Data is noisy, sensors are imperfect, and the future is uncertain. **Probability** is the mathematical framework we use to quantify this uncertainty.

## 1. Key Terminology

Before we calculate anything, we must define the "world" we are looking at.

```mermaid
mindmap
  root((Probability Experiment))
    Sample Space
      All possible outcomes
      Denoted by S or Omega
    Event
      A subset of the Sample Space
      The outcome we care about
    Random Variable
      Mapping outcomes to numbers
```

* **Experiment:** An action with an uncertain outcome (e.g., classifying an image).
* **Sample Space ($S$):** The set of all possible outcomes. For a coin flip, $S = \{\text{Heads}, \text{Tails}\}$.
* **Event ($A$):** A specific outcome or set of outcomes. For a die roll, an event could be "rolling an even number" ($A = \{2, 4, 6\}$).

## 2. The Three Axioms of Probability

To ensure our probability system is consistent, it must follow these three rules defined by Kolmogorov:

1. **Non-negativity:** The probability of any event $A$ is at least 0.
   $P(A) \ge 0$
2. **Certainty:** The probability of the entire sample space $S$ is exactly 1.
   $P(S) = 1$
3. **Additivity:** For mutually exclusive events (events that cannot happen at the same time), the probability of their union is the sum of their probabilities.
   $P(A \cup B) = P(A) + P(B)$
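
To see the axioms in action, here is a minimal Python sketch (the fair-die distribution is just an illustrative choice) that checks all three rules for a discrete distribution:

```python
# Minimal sketch: checking Kolmogorov's axioms for a fair six-sided die.
die = {face: 1 / 6 for face in range(1, 7)}  # P(X = face) for each outcome

# Axiom 1: non-negativity
assert all(p >= 0 for p in die.values())

# Axiom 2: the whole sample space has probability 1
assert abs(sum(die.values()) - 1.0) < 1e-12

# Axiom 3: additivity for mutually exclusive events, e.g. {1, 2} and {5, 6}
p_union = sum(die[f] for f in {1, 2} | {5, 6})
p_separate = sum(die[f] for f in {1, 2}) + sum(die[f] for f in {5, 6})
assert abs(p_union - p_separate) < 1e-12

print("All three axioms hold for the fair-die distribution.")
```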

## 3. Calculating Probability

In the simplest case (where every outcome is equally likely), probability is a simple counting ratio:

$$
P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes in } S}
$$

### Complement Rule

The probability that an event **does not** occur is 1 minus the probability that it does.

$$
P(A^c) = 1 - P(A)
$$
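
Both rules take only a couple of lines of Python; this sketch reuses the die example from above (event: "roll an even number"):

```python
# Classical (counting) probability for a fair die.
sample_space = {1, 2, 3, 4, 5, 6}
event_even = {2, 4, 6}

p_even = len(event_even) / len(sample_space)   # favorable / total = 0.5
p_not_even = 1 - p_even                        # complement rule

print(f"P(even) = {p_even}, P(not even) = {p_not_even}")
```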

## 4. Types of Probability

<br />

```mermaid
sankey-beta
%% source,target,value
Probability,Joint Probability,20
Probability,Marginal Probability,20
Probability,Conditional Probability,40
Joint Probability,P(A and B),20
Marginal Probability,P(A),20
Conditional Probability,P(A | B),40
```

<br />

* **Marginal Probability:** The probability of an event occurring ($P(A)$), regardless of other variables.
* **Joint Probability:** The probability of two events occurring at the same time ($P(A \cap B)$).
* **Conditional Probability:** The probability of event $A$ occurring **given** that $B$ has already occurred ($P(A|B)$).
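
A short Python sketch that computes all three from a toy joint table; the rain/traffic numbers below are invented purely for illustration:

```python
# Joint distribution over two binary events (illustrative numbers only):
# A = "it rains", B = "traffic jam". Values are P(A, B) and sum to 1.
joint = {
    ("rain", "jam"): 0.20,
    ("rain", "no jam"): 0.10,
    ("no rain", "jam"): 0.15,
    ("no rain", "no jam"): 0.55,
}

p_rain = sum(p for (a, b), p in joint.items() if a == "rain")   # marginal P(A)
p_rain_and_jam = joint[("rain", "jam")]                         # joint P(A and B)
p_jam = sum(p for (a, b), p in joint.items() if b == "jam")
p_rain_given_jam = p_rain_and_jam / p_jam                       # conditional P(A | B)

print(p_rain, p_rain_and_jam, round(p_rain_given_jam, 3))       # 0.3 0.2 0.571
```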

## 5. Why Probability is the "Heart" of ML

Machine Learning models are essentially **probabilistic estimators**.

* **Classification:** When a model says an image is a "cat," it is actually saying: $P(\text{Class} = \text{Cat} \mid \text{Pixels}) = 0.94$.
* **Generative AI:** Large Language Models (LLMs) like GPT predict the "next token" by calculating the probability distribution of all possible words.
* **Anomaly Detection:** We flag data points that have a very low probability of occurring based on the training distribution.

---

Knowing the basics is just the start. In ML, we often need to update our beliefs as new data comes in. This brings us to one of the most famous formulas in all of mathematics.
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
---
title: "Bayes' Theorem"
sidebar_label: "Bayes' Theorem"
description: "A deep dive into Bayes' Theorem: the formula for updating probabilities based on new evidence, and its massive impact on Machine Learning."
tags: [probability, bayes-theorem, inference, mathematics-for-ml, naive-bayes]
---

**Bayes' Theorem** is more than just a formula; it is a philosophy of how to learn. It describes the probability of an event based on prior knowledge of conditions that might be related to the event. In Machine Learning, it is the engine behind **Bayesian Inference** and the **Naive Bayes** classifier.

## 1. The Formula

Bayes' Theorem allows us to find $P(A|B)$ if we already know $P(B|A)$.

$$
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
$$

### Breaking Down the Terms

* **$P(A|B)$ (Posterior):** The probability of our hypothesis $A$ *after* seeing the evidence $B$.
* **$P(B|A)$ (Likelihood):** The probability of the evidence $B$ appearing *given* that hypothesis $A$ is true.
* **$P(A)$ (Prior):** Our initial belief about hypothesis $A$ *before* seeing any evidence.
* **$P(B)$ (Evidence/Marginal Likelihood):** The total probability of seeing evidence $B$ under all possible hypotheses.

## 2. The Logic of Bayesian Updating

Bayesian logic is iterative. Today's **Posterior** becomes tomorrow's **Prior**.

<br />

```mermaid
flowchart LR
    A[Initial Belief: Prior] --> B[New Evidence: Likelihood]
    B --> C[Updated Belief: Posterior]
    C -->|New Data Arrives| A
```

<br />

```mermaid
sankey-beta
%% source,target,value
Prior_Knowledge,Posterior_Probability,50
New_Evidence,Posterior_Probability,50
Posterior_Probability,Final_Prediction,100
```

<br />
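
To make this loop concrete, here is a minimal Python sketch of Bayesian updating for a coin's bias, assuming a Beta prior (conjugate to the Bernoulli likelihood); the flip counts are made up for illustration:

```python
# Minimal sketch: Bayesian updating of a coin's bias with a Beta prior.
# Beta(a, b) is conjugate to the Bernoulli likelihood, so each update just
# adds the observed counts -- yesterday's posterior becomes today's prior.

a, b = 1.0, 1.0  # Beta(1, 1) = uniform prior over the bias ("no initial opinion")

batches = [(7, 3), (4, 6), (9, 1)]  # (heads, tails) from three sessions (made-up data)

for heads, tails in batches:
    a, b = a + heads, b + tails        # posterior parameters...
    posterior_mean = a / (a + b)       # ...which act as the prior for the next batch
    print(f"After {heads}H/{tails}T: posterior mean bias = {posterior_mean:.3f}")
```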

## 3. A Practical Example: Medical Testing

Suppose a disease affects **1%** of the population (Prior). A test for this disease is **99%** accurate (Likelihood). If a patient tests positive, what is the probability they actually have the disease?

1. $P(\text{Disease}) = 0.01$
2. $P(\text{Pos} | \text{Disease}) = 0.99$
3. $P(\text{Pos} | \text{No Disease}) = 0.01$ (false positive rate)

### Using Bayes' Theorem:
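
Expanding the denominator $P(\text{Pos})$ over both hypotheses and plugging in the numbers:

$$
P(\text{Disease} \mid \text{Pos}) = \frac{P(\text{Pos} \mid \text{Disease}) \, P(\text{Disease})}{P(\text{Pos})} = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} = \frac{0.0099}{0.0198} = 0.5
$$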

Even with a 99% accurate test, the probability of having the disease given a positive result is only **50%**. This is because the disease is so rare (low Prior) that the number of false positives equals the number of true positives.

## 4. Bayes' Theorem in Machine Learning

### A. Naive Bayes Classifier

Naive Bayes is a popular algorithm for text classification (like spam detection). It assumes that every feature (word) is independent of every other feature (the "Naive" part) and uses Bayes' Theorem to calculate the probability of a category:

$$
P(\text{Spam} | \text{Words}) \propto P(\text{Words} | \text{Spam}) P(\text{Spam})
$$
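
A minimal sketch with scikit-learn (assuming `scikit-learn` is installed; the four toy messages are invented for illustration, not a real dataset):

```python
# Tiny spam-vs-ham Naive Bayes sketch (illustrative toy data only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "win a free prize now",       # spam
    "claim your free reward",     # spam
    "meeting moved to tuesday",   # ham
    "lunch at noon tomorrow",     # ham
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

# predict_proba returns the model's posterior estimate P(class | words)
print(model.classes_)
print(model.predict_proba(["free prize tomorrow"]))
```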

### B. Bayesian Neural Networks

Unlike standard neural networks that have fixed weights, Bayesian Neural Networks represent weights as **probability distributions**. This allows the model to express **uncertainty**: it can say "I think this is a cat, but I'm only 60% sure."

### C. Hyperparameter Optimization

**Bayesian Optimization** is a strategy used to find the best hyperparameters for a model. It builds a probability model of the objective function and uses it to select the most promising hyperparameters to evaluate next.
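
As a rough sketch of the idea (the library choice here is an assumption, not something used elsewhere in this tutorial), scikit-optimize's `gp_minimize` can tune a single hyperparameter against a stand-in objective:

```python
# Bayesian optimization sketch with scikit-optimize (pip install scikit-optimize).
from skopt import gp_minimize

def validation_loss(params):
    """Stand-in objective: pretend the best learning rate is 0.1."""
    (learning_rate,) = params
    return (learning_rate - 0.1) ** 2

result = gp_minimize(
    validation_loss,
    dimensions=[(0.001, 1.0)],  # search range for the learning rate
    n_calls=20,                 # total evaluations of the objective
    random_state=0,
)
print("best learning rate:", result.x[0], "loss:", result.fun)
```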

## 5. Summary Table

| Concept | Traditional (Frequentist) | Bayesian |
| --- | --- | --- |
| **View of Probability** | Long-run frequency of events. | Measure of "degree of belief." |
| **Parameters** | Fixed, unknown constants. | Random variables with distributions. |
| **New Data** | Used to refine the estimate. | Used to update the entire belief (Prior → Posterior). |

---

Now that we can update our beliefs using Bayes' Theorem, we need to understand how these probabilities are spread across different outcomes. This brings us to Random Variables and Probability Distributions.
Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
---
title: "Conditional Probability"
sidebar_label: Conditional Probability
description: "Understanding how the probability of an event changes given the occurrence of another event, and its role in predictive modeling."
tags: [probability, conditional-probability, dependency, mathematics-for-ml, bayes-rule]
---

In the real world, events are rarely isolated. The probability of it raining is higher **given** that it is cloudy. The probability of a user clicking an ad is higher **given** their past search history. This "given" is the essence of **Conditional Probability**.

## 1. The Definition

Conditional probability is the probability of an event $A$ occurring, given that another event $B$ has already occurred. It is denoted as $P(A|B)$.

The formula is:

$$
P(A|B) = \frac{P(A \cap B)}{P(B)}
$$

Where:
* $P(A \cap B)$ is the **Joint Probability** (both $A$ and $B$ happen).
* $P(B)$ is the probability of the condition (the "new universe").
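
For a quick worked example with a fair die, let $A$ be "roll a 6" and $B$ be "roll an even number":

$$
P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{1/6}{1/2} = \frac{1}{3}
$$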

## 2. Intuition: Shrinking the Universe

Think of probability as a "Universe" of possibilities. When we say "given $B$," we are throwing away every part of the universe where $B$ did not happen. Our new total area is just $B$.

<br />

```mermaid
sankey-beta
%% source,target,value
OriginalUniverse,EventB_Happens,60
OriginalUniverse,EventB_DoesNotHappen,40
EventB_Happens,EventA_Happens_GivenB,20
EventB_Happens,EventA_DoesNotHappen_GivenB,40
```

<br />

## 3. Independent vs. Dependent Events

How do we know if one event affects another? We look at their conditional probabilities.

### A. Independent Events

Events $A$ and $B$ are independent if the occurrence of $B$ provides **zero** new information about $A$.

* **Mathematical Check:** $P(A|B) = P(A)$
* **Example:** Rolling a 6 on a die given that you ate an apple for breakfast.

### B. Dependent Events

Events $A$ and $B$ are dependent if knowing $B$ happened changes the likelihood of $A$.

* **Mathematical Check:** $P(A|B) \neq P(A)$
* **Example:** Having a cough ($A$) given that you have a cold ($B$).
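
Both checks are easy to run by brute force. The sketch below enumerates the 36 equally likely outcomes of two fair dice (the particular events are chosen only for illustration):

```python
# Checking (in)dependence by enumerating the 36 equally likely outcomes of two dice.
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # the full sample space (36 outcomes)

def prob(event):
    return len(event) / len(outcomes)

def cond_prob(event_a, event_b):
    return len(event_a & event_b) / len(event_b)

first_is_6 = {o for o in outcomes if o[0] == 6}
second_is_3 = {o for o in outcomes if o[1] == 3}
sum_at_least_10 = {o for o in outcomes if sum(o) >= 10}

# Independent: the second die tells us nothing about the first.
print(prob(first_is_6), cond_prob(first_is_6, second_is_3))      # 0.167 and 0.167
# Dependent: a high total makes a 6 on the first die much more likely.
print(prob(first_is_6), cond_prob(first_is_6, sum_at_least_10))  # 0.167 and 0.5
```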

## 4. The Multiplication Rule

We can rearrange the conditional probability formula to find the probability of both events happening:

$$
P(A \cap B) = P(A|B) \cdot P(B)
$$

This is the foundation for the **Chain Rule of Probability**, which allows ML models to calculate the probability of a long sequence of events (like a sentence in an LLM).
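
Applied repeatedly, the multiplication rule yields the chain rule that lets a language model score an entire token sequence $w_1, \dots, w_n$:

$$
P(w_1, w_2, \dots, w_n) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \dots, w_{n-1})
$$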

## 5. Application: Predictive Modeling

In Machine Learning, almost every prediction is a conditional probability.

```mermaid
flowchart LR
    Input[Data Features X] --> Model[ML Model]
    Model --> Output["P(Y | X)"]
    style Output fill:#f9f,stroke:#333,color:#333,stroke-width:2px
```

* **Medical Diagnosis:** $P(\text{Disease} \mid \text{Symptoms})$
* **Spam Filter:** $P(\text{Spam} \mid \text{Words in Email})$
* **Self-Driving Cars:** $P(\text{Pedestrian crosses} \mid \text{Camera Image})$

---

If we flip the question, knowing $P(A|B)$ but needing $P(B|A)$, we turn to the most powerful tool in probability theory.
Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
---
title: "PMF vs. PDF"
sidebar_label: PMF & PDF
description: "A deep dive into Probability Mass Functions (PMF) for discrete data and Probability Density Functions (PDF) for continuous data."
tags: [probability, pmf, pdf, statistics, mathematics-for-ml, distributions]
---

To work with data in Machine Learning, we need a mathematical way to describe how likely different values are to occur. Depending on whether our data is **Discrete** (countable) or **Continuous** (measurable), we use either a **PMF** or a **PDF**.

## 1. Probability Mass Function (PMF)

The **PMF** is used for discrete random variables. It gives the probability that a discrete random variable is exactly equal to some value.

### Key Mathematical Properties:
1. **Direct Probability:** $P(X = x) = f(x)$. The "height" of the bar is the actual probability.
2. **Summation:** All individual probabilities must sum to 1.
   $$
   \sum_{i} P(X = x_i) = 1
   $$
3. **Range:** $0 \le P(X = x) \le 1$.

<img className="rounded p-4" src="/tutorial/img/tutorials/ml/probability-mass-function.jpg" alt="Probability Mass Function plot for a Binomial Distribution" />

**Example:** If you roll a fair die, the PMF is $1/6$ for each value $\{1, 2, 3, 4, 5, 6\}$. There is no "1.5" or "2.7"; the probability exists only at specific points.
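
A minimal Python sketch of this die PMF (a plain dictionary is enough, no libraries required):

```python
# PMF of a fair six-sided die: probability lives only at the six integer outcomes.
pmf = {face: 1 / 6 for face in range(1, 7)}

print(pmf[3])                          # P(X = 3) -- the "height" of the bar at 3
print(pmf.get(2.7, 0))                 # no mass between the integers
print(round(sum(pmf.values()), 10))    # all heights sum to 1.0
```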

## 2. Probability Density Function (PDF)

The **PDF** is used for continuous random variables. Unlike the PMF, the "height" of a PDF curve does **not** represent probability; it represents **density**.

### The "Zero Probability" Paradox
In a continuous world (like height or time), the probability of a variable being *exactly* a specific number (e.g., exactly $175.00000...$ cm) is **0**.

Instead, we find the probability over an **interval** by calculating the **area under the curve**.

### Key Mathematical Properties:
1. **Area is Probability:** The probability that $X$ falls between $a$ and $b$ is the integral of the PDF:
   $$
   P(a \le X \le b) = \int_{a}^{b} f(x) \, dx
   $$
2. **Total Area:** The total area under the entire curve must equal 1.
   $$
   \int_{-\infty}^{\infty} f(x) \, dx = 1
   $$
3. **Density vs. Probability:** $f(x)$ can be greater than 1, as long as the total area remains 1.
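
A short sketch with `scipy.stats` (assuming SciPy is available) that demonstrates both points: density values can exceed 1, while probabilities always come from areas:

```python
# Density vs. probability for a narrow Normal distribution (mean 0, std 0.1).
from scipy.stats import norm

narrow = norm(loc=0.0, scale=0.1)

print(narrow.pdf(0.0))                      # ~3.99 -- a *density*, happily greater than 1
print(narrow.cdf(0.1) - narrow.cdf(-0.1))   # ~0.683 -- area between -0.1 and 0.1, a probability
print(narrow.cdf(float("inf")))             # 1.0 -- total area under the curve
```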

## 3. Comparison at a Glance

```mermaid
graph LR
    Data[Data Type] --> Disc[Discrete]
    Data --> Cont[Continuous]

    Disc --> PMF["PMF: $$P(X=x)$$"]
    Cont --> PDF["PDF: $$f(x)$$"]

    PMF --> P_Sum["$$\sum P(x) = 1$$"]
    PDF --> P_Int["$$\int f(x)dx = 1$$"]

    PMF --> P_Val["Height = Probability"]
    PDF --> P_Area["Area = Probability"]
```

| Feature | PMF (Discrete) | PDF (Continuous) |
| --- | --- | --- |
| **Variable Type** | Countable (Integers) | Measurable (Real Numbers) |
| **Probability at a point** | $P(X=x) = \text{Height}$ | $P(X=x) = 0$ |
| **Probability over range** | Sum of heights | Area under the curve (Integral) |
| **Visualization** | Bar chart / Stem plot | Smooth curve |

---

## 4. The Bridge: Cumulative Distribution Function (CDF)

The **CDF** is the "running total" of probability. It tells you the probability that a variable is **less than or equal to** $x$.

* **For PMF:** It is a step function (it jumps at every discrete value).
* **For PDF:** It is a smooth S-shaped curve.

$$
F(x) = P(X \le x)
$$

```mermaid
graph LR
    PDF["PDF (Density) <br/> $$f(x)$$"] -- " Integrate: <br/> $$\int_{-\infty}^{x} f(t) dt$$ " --> CDF["CDF (Cumulative) <br/> $$F(x)$$"]
    CDF -- " Differentiate: <br/> $$\frac{d}{dx} F(x)$$ " --> PDF

    style PDF fill:#fdf,stroke:#333,color:#333
    style CDF fill:#def,stroke:#333,color:#333
```
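
The same bridge in code, again assuming SciPy: interval probabilities come from differences of the CDF, and numerically differentiating the CDF recovers the density:

```python
# The CDF as a "running total": interval probability and the CDF/PDF bridge.
from scipy.stats import norm

X = norm(loc=0.0, scale=1.0)          # standard Normal

a, b = -1.0, 1.0
print(X.cdf(b) - X.cdf(a))            # P(a <= X <= b) ~ 0.683, via F(b) - F(a)

h = 1e-5
numeric_pdf = (X.cdf(0.5 + h) - X.cdf(0.5 - h)) / (2 * h)
print(numeric_pdf, X.pdf(0.5))        # d/dx F(x) ~ f(x): both ~ 0.352
```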

## 5. Why this matters in Machine Learning

1. **Likelihood Functions:** When training models (like Logistic Regression), we maximize the **Likelihood**. For discrete labels, this uses the PMF; for continuous targets, it uses the PDF.
2. **Anomaly Detection:** We often flag a data point as an outlier if its PDF value (density) is below a certain threshold.
3. **Generative Models:** VAEs and GANs attempt to learn the underlying **PDF** of a dataset so they can sample new points from high-density regions (creating realistic images or text).

---

Now that you understand how we describe probability at a point or over an area, it's time to meet the most important distribution in all of data science.
