
Commit e81cc96

Merge pull request #141 from codeharborhub/dev-1
done statistics model
2 parents 90c7bd9 + 995ae78 commit e81cc96

File tree

7 files changed: +317 −1 lines changed

docs/index.mdx

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ Select a technology below to dive into our structured tutorials. Each path is de
   <p>Learn NoSQL database concepts with MongoDB. Store, query, and manage data efficiently for modern applications.</p>
 </DocsCard>

-<DocsCard header="AI & Machine Learning" href="#" icon="/icons/ai-chat.svg">
+<DocsCard header="AI & Machine Learning" href="/tutorial/machine-learning" icon="/icons/ai-chat.svg">
   <p>Explore artificial intelligence, machine learning, and neural networks with beginner-friendly examples.</p>
 </DocsCard>

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
---
title: "Basic Statistical Concepts"
sidebar_label: Basic Concepts
description: "Introduction to the fundamental pillars of statistics in ML: Populations vs. Samples, Descriptive vs. Inferential statistics, and Data Types."
tags: [statistics, mathematics-for-ml, data-types, population, sample, descriptive-statistics]
---

Statistics is the science of collecting, analyzing, and interpreting data. In Machine Learning, statistics provides the tools to handle uncertainty, validate models, and understand whether the patterns we find are "real" or just random noise.

## 1. Population vs. Sample

The most fundamental distinction in statistics is between the group we want to know about and the group we actually observe.

* **Population:** The entire group of individuals or instances about which we want to draw conclusions.
  * *Example:* All people who use a specific social media app.
* **Sample:** A subset of the population that we actually collect data from.
  * *Example:* 1,000 users who responded to a survey.

:::important The Goal of ML
In Machine Learning, our training data is a **sample**. Our goal is to build a model that generalizes well to the entire **population** (unseen data).
:::
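To make the distinction concrete, here is a minimal sketch (NumPy, with a purely hypothetical population of app users) showing how a statistic computed on a sample approximates the population value:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical population: daily minutes spent in the app by 1,000,000 users.
population = rng.normal(loc=45.0, scale=12.0, size=1_000_000)

# We can rarely observe everyone, so we survey a random sample of 1,000 users.
sample = rng.choice(population, size=1_000, replace=False)

print(f"Population mean (parameter, usually unknown): {population.mean():.2f}")
print(f"Sample mean (statistic we actually compute):  {sample.mean():.2f}")
```

With a reasonably sized random sample, the sample mean lands close to the population mean; the rest of this series is largely about quantifying *how* close.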
## 2. Descriptive vs. Inferential Statistics

Statistics is generally divided into two main branches:

### A. Descriptive Statistics
This branch focuses on summarizing and describing the characteristics of a dataset. We use numbers and graphs to tell the story of the data we have in hand.
* **Tools:** Mean, Median, Mode, Standard Deviation, Histograms.

### B. Inferential Statistics
This branch focuses on making predictions or generalizations about a population based on a sample.
* **Tools:** Hypothesis testing, P-values, Confidence Intervals, Regression.

## 3. Types of Data

Not all data is created equal. The way we process features in ML depends entirely on their statistical type.

| Data Type | Sub-type | Description | Example |
| :--- | :--- | :--- | :--- |
| **Qualitative** (Categorical) | **Nominal** | Categories with no inherent order. | Eye color, Gender, Zip Code. |
| | **Ordinal** | Categories with a meaningful order. | Education level (Bachelors, Masters, PhD). |
| **Quantitative** (Numerical) | **Discrete** | Values that can be counted (integers). | Number of rooms in a house, number of clicks. |
| | **Continuous** | Values that can be measured (real numbers). | Temperature, Weight, Stock price. |
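Because the statistical type dictates preprocessing, here is a small sketch (pandas, on a made-up toy DataFrame) of how Nominal and Ordinal features are typically encoded:

```python
import pandas as pd

# Toy dataset (hypothetical values, for illustration only).
df = pd.DataFrame({
    "eye_color": ["brown", "blue", "green", "blue"],      # Nominal
    "education": ["Bachelors", "PhD", "Masters", "PhD"],  # Ordinal
})

# Nominal: no order exists, so use One-Hot Encoding.
one_hot = pd.get_dummies(df["eye_color"], prefix="eye")

# Ordinal: the order is meaningful, so map categories to ranked integers.
edu_rank = {"Bachelors": 0, "Masters": 1, "PhD": 2}
df["education_encoded"] = df["education"].map(edu_rank)

print(pd.concat([df, one_hot], axis=1))
```

Encoding an ordinal feature with One-Hot would throw away its order; encoding a nominal feature with integers would invent an order that isn't there.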
## 4. Parameters vs. Statistics

* **Parameter:** A numerical value that describes a characteristic of the entire **population**. (Usually denoted by Greek letters like $\mu$ for mean.)
* **Statistic:** A numerical value that describes a characteristic of a **sample**. (Usually denoted by Roman letters like $\bar{x}$ for mean.)

In ML, we use **Sample Statistics** (like the error on our training set) to estimate the true **Population Parameters** (the true error the model would make on all possible data).

## 5. Why Statistics Matters in the ML Pipeline

1. **Exploratory Data Analysis (EDA):** Before building a model, we use descriptive statistics to find outliers, understand distributions, and identify correlations.
2. **Feature Engineering:** Understanding data types helps us decide how to encode variables (e.g., One-Hot Encoding for Nominal data).
3. **Model Validation:** We use inferential statistics to determine if a model's performance improvement is statistically significant or just due to a lucky split of the data.

---

## References for More Details

* **StatQuest with Josh Starmer - Statistics Fundamentals:**
  * [YouTube Link](https://www.youtube.com/playlist?list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9)
  * *Best for:* Highly visual and intuitive explanations of population vs. sample and other core concepts.
* **Khan Academy - Summarizing Quantitative Data:**
  * [Website Link](https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data)
  * *Best for:* Interactive practice with mean, median, and variance.

---

Now that we have the vocabulary, let's look at the specific numerical tools we use to describe the center and spread of our data.
Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
---
title: "Data Visualization in Statistics"
sidebar_label: Data Visualization
description: "Exploring the essential plots and charts used in statistical analysis to identify patterns, distributions, and outliers in Machine Learning datasets."
tags: [statistics, data-visualization, eda, histograms, box-plots, scatter-plots, mathematics-for-ml]
---

Numerical summaries like the Mean or Standard Deviation only tell half the story. **Data Visualization** allows us to see the shape, spread, and anomalies in our data that numbers might hide. In Machine Learning, visualization is the primary tool used during **Exploratory Data Analysis (EDA)**.

## 1. Visualizing Distributions (Univariate Analysis)

To understand a single feature, we look at its distribution.

### A. Histograms
A histogram groups continuous data into "bins" and shows the frequency of data points in each bin. It is the best tool for identifying the **shape** of the data (Normal, Skewed, Bimodal).

### B. Box Plots (Whisker Plots)
Box plots excel at identifying **outliers** and showing the quartiles of your data.
* **The Box:** Represents the Interquartile Range (IQR), containing the middle 50% of the data.
* **The Line:** The Median.
* **The Whiskers:** Usually extend to the most extreme points within $1.5 \times \text{IQR}$ of the box.
* **Dots:** Data points outside the whiskers are considered outliers.
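Both plots take only a few lines to produce. Here is a sketch (Matplotlib and NumPy on synthetic right-skewed data, purely for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)
# Synthetic right-skewed feature (shaped like income or page-view counts).
data = rng.lognormal(mean=3.0, sigma=0.5, size=1_000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.hist(data, bins=30)        # shape of the distribution
ax1.set_title("Histogram")

ax2.boxplot(data, vert=False)  # median, IQR box, whiskers, outlier dots
ax2.set_title("Box Plot")

plt.tight_layout()
plt.show()
```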
## 2. Visualizing Relationships (Bivariate Analysis)

To understand how two features interact, we use relational plots.

### A. Scatter Plots

<img className="rounded" src="/tutorial/img/tutorials/ml/scatter-plots.jpg" alt="Scatter plots showing positive correlation, negative correlation, and no correlation" />

Scatter plots display individual data points on an XY plane. They are the first step in identifying **Correlation**.
* **Linear Relationship:** Points form a straight line.
* **Non-linear Relationship:** Points form a curve.
* **No Relationship:** Points look like a random cloud.

### B. Bar Charts vs. Pie Charts
* **Bar Charts:** Best for comparing a numerical value across different categories.
* **Pie Charts:** Best for showing parts of a whole (though bar charts are often preferred for readability).
---

## 3. Visualizing Multiple Variables (Multivariate)

### A. Heatmaps (Correlation Matrices)
In ML, we often have dozens of features. A heatmap uses color to represent the correlation coefficient between every pair of features. This helps in **Feature Selection** by identifying redundant variables.

### B. Pair Plots
A grid of scatter plots for every pair of features in a dataset. It allows you to see relationships across the entire dataset at once.
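As a minimal sketch of both plots (using seaborn's built-in `iris` dataset as a stand-in for your own features):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Small example dataset with four numeric features.
iris = sns.load_dataset("iris")
numeric = iris.select_dtypes("number")

# Heatmap of the pairwise correlation matrix.
sns.heatmap(numeric.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Matrix")
plt.show()

# Pair plot: a scatter plot for every pair of features.
sns.pairplot(iris, hue="species")
plt.show()
```

Two features with a correlation near ±1 in the heatmap are candidates for dropping one of the pair.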
---

## 4. Anscombe's Quartet: Why Visualization Matters
The most famous example of why we visualize is **Anscombe's Quartet**. It consists of four datasets that have nearly identical descriptive statistics (mean, variance, correlation), yet look completely different when graphed.
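You can verify this yourself: seaborn ships the quartet as a built-in dataset, so a short sketch like the following prints the near-identical summaries and then plots the four very different shapes:

```python
import seaborn as sns

# The four Anscombe datasets, stacked with a "dataset" label column.
df = sns.load_dataset("anscombe")

for name, group in df.groupby("dataset"):
    corr = group["x"].corr(group["y"])
    print(f"Dataset {name}: mean_y={group['y'].mean():.2f}, "
          f"var_y={group['y'].var():.2f}, corr_xy={corr:.2f}")

# Plotting reveals how different the four datasets really are.
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None)
```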
:::tip ML Best Practice
Never start training a model before visualizing your data. Plots often reveal data quality issues (like sensors being stuck at a maximum value) that summary statistics would miss.
:::

---

Visualizing our data often reveals a specific "bell-shaped" curve that appears everywhere in nature and math. Understanding this curve is our next major step.
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
---
title: "Descriptive Statistics"
sidebar_label: Descriptive Statistics
description: "Mastering measures of central tendency (mean, median, mode) and dispersion (variance, standard deviation, range) to summarize and understand data distributions."
tags: [statistics, mean, median, variance, standard-deviation, descriptive-statistics, mathematics-for-ml]
---

Descriptive statistics allow us to summarize large volumes of raw data into a few meaningful numbers. In Machine Learning, we use these to understand the "center" and the "spread" of our features, which is essential for data cleaning and feature scaling.

## 1. Measures of Central Tendency

These measures tell us where the "middle" of the data lies.

<img className="rounded" src="/tutorial/img/tutorials/ml/measures-central-tendency.jpg" alt="Mean, Median, and Mode in normal and skewed distributions" />

### A. Mean (Average)
The sum of all values divided by the total number of values. It is highly sensitive to **outliers**.

$$ \mu = \frac{\sum x_i}{N} $$

### B. Median
The middle value when the data is sorted. It is **robust** to outliers, making it better for skewed distributions (like house prices or salaries).

### C. Mode
The value that appears most frequently. Useful for categorical data (e.g., finding the most common car color).
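A small sketch (Python's standard-library `statistics` module, on a made-up list of salaries) makes the outlier sensitivity of the mean obvious:

```python
import statistics

# Made-up salaries; the last value is an extreme outlier.
salaries = [40_000, 45_000, 45_000, 50_000, 52_000, 1_000_000]

print(f"Mean:   {statistics.mean(salaries):,.0f}")    # dragged upward by the outlier
print(f"Median: {statistics.median(salaries):,.0f}")  # robust: stays near 47,500
print(f"Mode:   {statistics.mode(salaries):,}")       # most frequent value: 45,000
```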
---

## 2. Measures of Dispersion (Spread)

Knowing the center isn't enough; we need to know how "spread out" the data is.

### A. Range
The difference between the maximum and minimum values. Simple, but very sensitive to extreme outliers.

### B. Variance ($\sigma^2$)
The average of the squared differences from the Mean. It measures how far each number in the set is from the mean.

$$ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} $$

### C. Standard Deviation ($\sigma$)
The square root of the variance. It is the most common measure of spread because it is in the **same units** as the original data.

* **Low $\sigma$:** Data points are close to the mean.
* **High $\sigma$:** Data points are spread out over a wide range.
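Continuing the sketch in NumPy (note that `np.var`/`np.std` default to the population formulas above, dividing by $N$; pass `ddof=1` for the sample versions):

```python
import numpy as np

data = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 10.0])

data_range = data.max() - data.min()
variance = data.var()   # population variance (divides by N)
std_dev = data.std()    # population standard deviation

print(f"Range:    {data_range:.2f}")
print(f"Variance: {variance:.2f}")
print(f"Std Dev:  {std_dev:.2f}  (same units as the data)")
print(f"Sample variance (divides by N-1): {data.var(ddof=1):.2f}")
```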
---

## 3. Measures of Shape

Beyond center and spread, we look at the symmetry and "peakedness" of the data.

### A. Skewness
Measures the asymmetry of the distribution.
* **Positive (Right) Skew:** Long tail on the right side.
* **Negative (Left) Skew:** Long tail on the left side.

### B. Kurtosis
Measures how "fat" or "thin" the tails of the distribution are compared to a normal distribution. High kurtosis indicates the presence of frequent outliers.

---

## 4. Why this matters for ML

1. **Handling Outliers:** If the Mean and Median are far apart, you likely have outliers that could skew your model's training.
2. **Missing Value Imputation:** When filling in missing data, we often choose the **Mean** (for normal data), **Median** (for skewed data), or **Mode** (for categorical data).
3. **Feature Scaling:** Techniques like **Z-Score Normalization** (Standardization) directly use the Mean and Standard Deviation to rescale features, as sketched after this list:

$$ z = \frac{x - \mu}{\sigma} $$
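A minimal sketch of Z-score standardization in plain NumPy (scikit-learn's `StandardScaler` does the equivalent inside a pipeline):

```python
import numpy as np

feature = np.array([120.0, 135.0, 150.0, 160.0, 185.0])

# Z-score: center at 0, rescale to unit standard deviation.
z = (feature - feature.mean()) / feature.std()

print(z)                                    # standardized values
print(f"mean={z.mean():.2f}, std={z.std():.2f}")  # ~0.00 and 1.00
```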
---

Visualizing these numbers is often more intuitive than reading a table. Next, we’ll explore the most important probability distribution in all of science and ML.
Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
---
title: "Inferential Statistics"
sidebar_label: Inferential Statistics
description: "Understanding how to make predictions and inferences about populations using samples, hypothesis testing, and p-values."
tags: [statistics, inference, hypothesis-testing, p-value, confidence-intervals, mathematics-for-ml]
---

In Descriptive Statistics, we describe the data we have. In **Inferential Statistics**, we use that data to make "educated guesses" or predictions about data we *don't* have. This is the foundation of scientific discovery and model validation in Machine Learning.

## 1. The Core Workflow

Inferential statistics allows us to take a small sample and project those findings onto a larger population.

```mermaid
sankey-beta
%% source,target,value
Population,Sample,30
Sample,Analysis,30
Analysis,Point Estimates,15
Analysis,Confidence Intervals,15
Point Estimates,Population Inference,15
Confidence Intervals,Population Inference,15
```
## 2. Point Estimation

A **Point Estimate** is a single value (a statistic) used to estimate a population parameter.

* **Sample Mean ($\bar{x}$)** estimates the **Population Mean ($\mu$)**.
* **Sample Variance ($s^2$)** estimates the **Population Variance ($\sigma^2$)**.

However, because samples are smaller than populations, point estimates are rarely 100% accurate. We use **Confidence Intervals** to express our uncertainty.
## 3. Hypothesis Testing

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics.

### The Two Hypotheses

1. **Null Hypothesis ($H_0$):** The "status quo." It assumes there is no effect or no difference. (e.g., "This new feature does not improve model accuracy.")
2. **Alternative Hypothesis ($H_a$):** What we want to prove. (e.g., "This new feature improves model accuracy.")

### The Decision Process

We use the **P-value** to decide whether to reject the Null Hypothesis.

```mermaid
flowchart TD
    Start["State Hypotheses H0 and Ha"] --> Alpha["Set Significance Level α (usually 0.05)"]
    Alpha --> Test["Perform Statistical Test (t-test, Z-test)"]
    Test --> PVal{"Calculate P-value"}
    PVal -- "P < α" --> Reject["Reject H0: Results are Statistically Significant"]
    PVal -- "P ≥ α" --> Fail["Fail to Reject H0: No significant effect found"]
```
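As a sketch of this workflow in code (SciPy's independent two-sample t-test, on made-up per-fold accuracy scores for two model variants):

```python
from scipy import stats

# Hypothetical accuracy per cross-validation fold for two model variants.
model_a = [0.89, 0.91, 0.90, 0.88, 0.92]
model_b = [0.91, 0.93, 0.92, 0.90, 0.94]

# H0: the two models have the same mean accuracy.
t_stat, p_value = stats.ttest_ind(model_a, model_b)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the difference is statistically significant.")
else:
    print("Fail to reject H0: no significant difference found.")
```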
## 4. Confidence Intervals

A **Confidence Interval (CI)** provides a range of values that is likely to contain the population parameter.

$$
\text{CI} = \text{Point Estimate} \pm (\text{Critical Value} \times \text{Standard Error})
$$

:::note Example
We are 95% confident that the true accuracy of our model on all future data is between 88% and 92%.
:::
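A minimal sketch of that formula (a 95% t-based interval for a sample mean, using SciPy on made-up accuracy data):

```python
import numpy as np
from scipy import stats

# Hypothetical sample of model accuracies.
sample = np.array([0.88, 0.90, 0.91, 0.89, 0.92, 0.90])

point_estimate = sample.mean()
standard_error = stats.sem(sample)                      # s / sqrt(n)
critical_value = stats.t.ppf(0.975, df=len(sample) - 1)  # two-sided 95%

margin = critical_value * standard_error
print(f"95% CI: {point_estimate - margin:.3f} to {point_estimate + margin:.3f}")
```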
## 5. Common Statistical Tests in ML

| Test | Use Case | Example in ML |
| --- | --- | --- |
| **Z-Test** | Comparing means with a large sample size (n > 30). | Comparing the average spend of two large user groups. |
| **T-Test** | Comparing means with a small sample size (n < 30). | Comparing performance of two model architectures on a small dataset. |
| **Chi-Square Test** | Testing relationships between categorical variables. | Is the "Click" rate independent of the "Device Type"? |
| **ANOVA** | Comparing means across 3 or more groups. | Does the choice of optimizer (Adam, SGD, RMSprop) significantly change accuracy? |
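For instance, the Chi-Square row of the table might look like this in SciPy (with a made-up click/device contingency table):

```python
from scipy import stats

# Hypothetical contingency table: rows = device type, columns = click / no click.
#                 click  no_click
observed = [[200, 800],   # mobile
            [150, 850]]   # desktop

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
# A small p-value suggests the click rate is NOT independent of device type.
```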
## 6. Type I and Type II Errors

When making inferences, we can be wrong in two ways:

```mermaid
quadrantChart
    title Statistical Decision Matrix
    x-axis "Null Hypothesis is True" --> "Null Hypothesis is False"
    y-axis "Reject Null" --> "Fail to Reject"
    quadrant-1 "Type II Error (False Negative)"
    quadrant-2 "Correct Decision (True Negative)"
    quadrant-3 "Type I Error (False Positive)"
    quadrant-4 "Correct Decision (True Positive)"
```

<br />

1. **Type I Error ($\alpha$):** You claim there is an effect when there isn't. (False Positive.)
2. **Type II Error ($\beta$):** You fail to detect an effect that actually exists. (False Negative.)
## 7. Why this matters for ML Engineers

* **A/B Testing:** Inferential statistics is the engine behind A/B testing new model versions in production.
* **Feature Selection:** We use tests like Chi-Square to see if a feature actually has a relationship with the target variable.
* **Model Comparison:** If Model A has 91% accuracy and Model B has 91.5%, is that difference "real" or just luck? Inferential stats tells you if the improvement is **statistically significant**.

---

Understanding inference allows us to trust our model's results. Now, we dive into the specific probability distributions that model the randomness we see in the real world.
2 binary image files changed (98.6 KB and 103 KB; previews not rendered)
