@@ -319,9 +319,7 @@ network has fewer parameters than the teacher network.
 
 [@Distill] proposed KD, which makes the classification result of the
 student network more closely resemble the ground truth as well as the
-classification result of the teacher network, that is, Equation
-[\[c2Fcn:distill\]](#c2Fcn:distill){reference-type="ref"
-reference="c2Fcn:distill"}.
+classification result of the teacher network, that is, Equation :eqref:`c2Fcn:distill`.
 
 $$\mathcal{L}_{KD}(\theta_S) = \mathcal{H}(o_S,\mathbf{y}) + \lambda\mathcal{H}(\tau(o_S),\tau(o_T)),$$
 where $\mathcal{H}(\cdot,\cdot)$ is the cross-entropy function, $o_S$
 and $o_T$ are outputs of the student network and the teacher network,
 respectively, and $\mathbf{y}$ is the label. The first term in
-Equation [\[c2Fcn:distill\]](#c2Fcn:distill){reference-type="ref"
-reference="c2Fcn:distill"} makes the classification result of the
+Equation :eqref:`c2Fcn:distill` makes the classification result of the
 student network resemble the expected ground truth, and the second term
 aims to extract useful information from the teacher network and transfer
 it to the student network. $\lambda$ is a weight parameter that
 balances the two objectives, and $\tau(\cdot)$ is a softening
 function that smooths the network output.
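
As a minimal sketch of the loss above, assuming the network outputs $o_S$ and $o_T$ are logits and taking $\tau(\cdot)$ to be a temperature-softened softmax (the usual choice in KD, though the text does not fix one); the function names, the temperature value, and the $\lambda$ default here are illustrative, and `cross_entropy` follows the text's argument order $\mathcal{H}(\text{prediction}, \text{target})$:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax, playing the role of tau(.)."""
    z = logits / temperature
    z = z - z.max()          # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(pred, target, eps=1e-12):
    """H(pred, target) = -sum_i target_i * log pred_i."""
    return float(-np.sum(target * np.log(pred + eps)))

def kd_loss(o_S, o_T, y, lam=0.5, temperature=4.0):
    """L_KD = H(o_S, y) + lambda * H(tau(o_S), tau(o_T))."""
    hard = cross_entropy(softmax(o_S), y)            # fit the ground truth
    soft = cross_entropy(softmax(o_S, temperature),  # mimic the softened
                         softmax(o_T, temperature))  # teacher output
    return hard + lam * soft

# Example with 3-class logits and a one-hot label:
o_S = np.array([2.0, 0.5, -1.0])   # student output
o_T = np.array([1.8, 0.7, -0.8])   # teacher output
y   = np.array([1.0, 0.0, 0.0])    # ground-truth label
loss = kd_loss(o_S, o_T, y)
```

With `lam=0` the second term vanishes and the loss reduces to ordinary cross-entropy training of the student alone.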
 
-Equation [\[c2Fcn:distill\]](#c2Fcn:distill){reference-type="ref"
-reference="c2Fcn:distill"} only extracts useful information from the
+Equation :eqref:`c2Fcn:distill` only extracts useful information from the
 output of the teacher network classifier --- it does not mine
 information from other intermediate layers of the teacher network.
 Romero et al. [@FitNet] proposed an algorithm for transferring useful