12 changes: 6 additions & 6 deletions questions/149_adadelta-optimizer/learn.md
@@ -18,18 +18,18 @@ $v_t = \rho v_{t-1} + (1-\rho)g_t^2$ (Exponential moving average of squared gradients)

The exponential average above behaves approximately like an average over a window of the last $w \approx \dfrac{1}{1-\rho}$ gradients.

$\Delta\theta_t = -\dfrac{\sqrt{u_{t-1} + \epsilon}}{\sqrt{v_t + \epsilon}} \cdot g_t$ (Parameter update with unit correction)

$u_t = \rho u_{t-1} + (1-\rho)\Delta\theta_t^2$ (Exponential moving average of squared parameter updates)

Where:
- $v_t$ is the exponential moving average of squared gradients (decay rate ρ)
- $u_t$ is the exponential moving average of squared parameter updates (decay rate ρ)
- $\rho$ is the decay rate (typically 0.9) that controls the effective window size w ≈ 1/(1-ρ)
- $\epsilon$ is a small constant for numerical stability
- $g_t$ is the gradient at time step t

The ratio $\dfrac{\sqrt{u_{t-1} + \epsilon}}{\sqrt{v_t + \epsilon}}$ acts as an adaptive learning rate that keeps the update in the same units as the parameters, making the algorithm more robust to different parameter scales. Unlike Adagrad, Adadelta does not require a manually set learning rate, which makes it especially useful when hyperparameter tuning is difficult. This automatic adaptation comes from taking the ratio of the root mean square (RMS) of past parameter updates to the RMS of recent gradients.
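
For concreteness, here is one worked step from a cold start, under assumed values $\rho = 0.9$, $\epsilon = 10^{-8}$, $v_0 = u_0 = 0$, and $g_1 = 1$:

$v_1 = 0.9 \cdot 0 + 0.1 \cdot 1^2 = 0.1$

$\Delta\theta_1 = -\dfrac{\sqrt{0 + 10^{-8}}}{\sqrt{0.1 + 10^{-8}}} \cdot 1 \approx -3.16 \times 10^{-4}$

$u_1 = 0.9 \cdot 0 + 0.1 \cdot \Delta\theta_1^2 \approx 10^{-8}$

Because $u_0 = 0$, the first step has magnitude on the order of $\sqrt{\epsilon/(1-\rho)}$; Adadelta therefore starts with very small updates and speeds up as $u_t$ accumulates.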

Read more at:

@@ -43,8 +43,8 @@ Implement the Adadelta optimizer update step function. Your function should take
The function should accept:
- parameter: Current parameter value
- grad: Current gradient
- v: Exponentially decaying average of squared gradients
- u: Exponentially decaying average of squared parameter updates
- rho: Decay rate (default=0.9)
- epsilon: Small constant for numerical stability (default=1e-8)
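
To make the expected behavior concrete, here is a minimal NumPy sketch of such an update step (the function name `adadelta_optimizer` and the convention of returning the updated parameter together with the new averages are assumptions for illustration, not requirements taken from the task statement):

```python
import numpy as np

def adadelta_optimizer(parameter, grad, v, u, rho=0.9, epsilon=1e-8):
    # Illustrative sketch; the exact name/signature expected by the task may differ.
    v = rho * v + (1 - rho) * grad**2                             # v_t: EMA of squared gradients
    update = -np.sqrt(u + epsilon) / np.sqrt(v + epsilon) * grad  # uses u_{t-1} in the numerator
    u = rho * u + (1 - rho) * update**2                           # u_t: EMA of squared updates
    return parameter + update, v, u

# Example: a few scalar steps starting from v = u = 0.
p, v, u = 1.0, 0.0, 0.0
for g in [0.5, 0.4, 0.3]:
    p, v, u = adadelta_optimizer(p, g, v, u)
```

Because the numerator uses the running average of past updates rather than a fixed learning rate, step sizes grow or shrink automatically with the scale of recent updates.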
