Commit 0fe9aa3: added description
# Learn: Gini Impurity and Best Split in Decision Trees
## Overview

A core concept in decision trees (and, by extension, Random Forests) is how the model chooses where to split the data at each node. One popular splitting criterion is **Gini impurity**.

In this task, you will implement:

- Gini impurity computation
- Finding the best feature and threshold to split on, based on impurity reduction

This builds the foundation for how trees grow in a Random Forest.
---
## Gini Impurity

For a set of samples with class labels $y$, the Gini impurity is defined as

$$
G(y) = 1 - \sum_{i=1}^{k} p_i^2
$$

where $p_i$ is the proportion of samples belonging to class $i$ and $k$ is the number of classes.

A pure node (all samples of one class) has $G = 0$; higher values indicate more class mixing.
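As a quick sanity check, the formula can be evaluated directly with NumPy (a minimal sketch; the function name `gini` is illustrative, not part of the required API):

```python
import numpy as np

def gini(y: np.ndarray) -> float:
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()      # class proportions p_i
    return 1.0 - np.sum(p ** 2)    # G(y) = 1 - sum_i p_i^2

print(gini(np.array([0, 0, 0, 0])))  # pure node -> 0.0
print(gini(np.array([0, 0, 1, 1])))  # 50/50 mix -> 0.5
```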
---

## Gini Gain for a Split

Given a feature and a threshold that split the dataset into left and right subsets, the weighted impurity of the split is

$$
G_{\text{split}} = \frac{n_{\text{left}}}{n} G(y_{\text{left}}) + \frac{n_{\text{right}}}{n} G(y_{\text{right}})
$$

where $n_{\text{left}}$ and $n_{\text{right}}$ are the sizes of the two subsets and $n = n_{\text{left}} + n_{\text{right}}$. We choose the split that **minimizes** $G_{\text{split}}$; equivalently, the one that maximizes the Gini gain $G(y) - G_{\text{split}}$.
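For concreteness, the weighted impurity of one candidate split can be computed like this (a sketch; `gini` and `gini_split` are illustrative helper names, not the required interface):

```python
import numpy as np

def gini(y: np.ndarray) -> float:
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(y_left: np.ndarray, y_right: np.ndarray) -> float:
    """Weighted average impurity of the two child nodes."""
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * gini(y_left) + (len(y_right) / n) * gini(y_right)

# A perfect split of a 50/50 parent drops the impurity from 0.5 to 0.0.
print(gini_split(np.array([0, 0]), np.array([1, 1])))  # -> 0.0
```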
---

## Problem Statement

You are given a dataset $X \in \mathbb{R}^{n \times d}$ and labels $y \in \{0, 1\}^n$. Implement the following functions:

### Functions to Implement
```python
import numpy as np
from typing import Tuple

def find_best_split(X: np.ndarray, y: np.ndarray) -> Tuple[int, float]:
    """Return the (feature_index, threshold) pair minimizing G_split."""
    ...
```
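One possible approach is a brute-force scan over every feature and every midpoint between consecutive sorted values. The sketch below follows that idea under the definitions above; it is not the only valid solution, and the helper name `_gini` is illustrative:

```python
import numpy as np
from typing import Tuple

def _gini(y: np.ndarray) -> float:
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def find_best_split(X: np.ndarray, y: np.ndarray) -> Tuple[int, float]:
    """Return the (feature_index, threshold) pair minimizing G_split."""
    n, d = X.shape
    best_feature, best_threshold = 0, 0.0
    best_impurity = np.inf
    for feature in range(d):
        values = np.unique(X[:, feature])
        # Candidate thresholds: midpoints between consecutive unique values.
        thresholds = (values[:-1] + values[1:]) / 2.0
        for t in thresholds:
            mask = X[:, feature] <= t
            y_left, y_right = y[mask], y[~mask]
            # Weighted child impurity: (n_left * G_left + n_right * G_right) / n
            impurity = (len(y_left) * _gini(y_left)
                        + len(y_right) * _gini(y_right)) / n
            if impurity < best_impurity:
                best_impurity = impurity
                best_feature, best_threshold = feature, t
    return best_feature, best_threshold
```

On a toy dataset where feature 0 cleanly separates the classes (values 1, 2 vs. 10, 11), the best split is found at the midpoint 6.0 with zero child impurity.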
