@@ -229,43 +229,44 @@ by cycling through it several times, saving the final
$ python sim_harmonium.py
```

- which will fit/adapt your harmonium to MNIST. This should produce per-training iteration output, printed to I/O,
- similar to the following:
+ which will fit/adapt your harmonium to MNIST. Note that the model exhibit code that you will run uses a special
+ extension of CD learning known as persistent CD (PCD); our PCD implementation[^3] essentially obtains negative-phase
+ statistics by maintaining a set of Gibbs sampling chains that are never reset but are instead sampled from each time the
+ model parameters are to be updated <b>[6]</b> (this extension improves the quality of the samples produced by the RBM).
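
To make the negative-phase bookkeeping concrete, below is a minimal, illustrative sketch of a PCD-style parameter update for a Bernoulli-Bernoulli RBM. It is not the exhibit's actual training code; the helper names (`sample_h_given_v`, `sample_v_given_h`, `pcd_update`) and shapes are placeholders chosen only for this sketch:

```python
# Minimal PCD sketch (illustrative only; not the exhibit's implementation).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v, W, b):
    p_h = sigmoid(v @ W + b)                      # hidden-unit activation probabilities
    return (rng.random(p_h.shape) < p_h), p_h

def sample_v_given_h(h, W, c):
    p_v = sigmoid(h @ W.T + c)                    # visible-unit activation probabilities
    return (rng.random(p_v.shape) < p_v), p_v

def pcd_update(v_data, W, b, c, chains, lr=0.01):
    # Positive phase: clamp the data batch and measure its correlations.
    _, ph_data = sample_h_given_v(v_data, W, b)
    # Negative phase: advance the *persistent* fantasy chains by one Gibbs step;
    # the chains are never reset to the data, only carried forward across updates.
    h_chain, _ = sample_h_given_v(chains, W, b)
    chains, _ = sample_v_given_h(h_chain, W, c)
    _, ph_model = sample_h_given_v(chains, W, b)
    # Contrastive parameter updates (positive statistics minus negative statistics).
    W += lr * (v_data.T @ ph_data - chains.T @ ph_model) / v_data.shape[0]
    b += lr * (ph_data.mean(0) - ph_model.mean(0))
    c += lr * (v_data.mean(0) - chains.mean(0))
    return W, b, c, chains
```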
+ Running the training script should produce per-training-iteration output, printed to I/O, similar to the following:

```console
--- Initial RBM Synaptic Stats ---
W1: min -0.0494 ; max 0.0445 mu -0.0000 ; norm 4.4734
b1: min -4.0000 ; max -4.0000 mu -4.0000 ; norm 64.0000
- c0: min -11.6114 ; max 0.0635 mu -3.8398 ; norm 135.2238
- -1| Test: err(X) = 54.3889
- 0| Test : |d.E(X)| = 16.8070 err(X) = 46.8236; Train: err(X) = 52.7418
- 1| Test : |d.E(X)| = 27.1183 err(X) = 36.8690; Train: err(X) = 41.3630
- 2| Test : |d.E(X)| = 13.7855 err(X) = 31.8582; Train: err(X) = 34.5511
- 3| Test : |d.E(X)| = 9.0927 err(X) = 28.6253; Train: err(X) = 30.4615
- 4| Test : |d.E(X)| = 5.8375 err(X) = 26.2317; Train: err(X) = 27.6882
- 5| Test : |d.E(X)| = 5.3187 err(X) = 24.3207; Train: err(X) = 25.5485
- 6| Test : |d.E(X)| = 3.7614 err(X) = 22.8012; Train: err(X) = 23.8361
- 7| Test : |d.E(X)| = 2.2589 err(X) = 21.6163; Train: err(X) = 22.4523
- 8| Test : |d.E(X)| = 3.2040 err(X) = 20.5934; Train: err(X) = 21.3355
- 9| Test : |d.E(X)| = 2.4215 err(X) = 19.7679; Train: err(X) = 20.4297
- 10| Test : |d.E(X)| = 1.5725 err(X) = 19.0672; Train: err(X) = 19.6835
- 11| Test : |d.E(X)| = 0.5418 err(X) = 18.4881; Train: err(X) = 19.0372
+ c0: min -15.2663 ; max 0.1887 mu -4.0560 ; norm 148.4289
+ -1| Test: err(X) = 66.7563
+ 0| Dev : |d.E(X)| = 10.0093 err(X) = 64.7762
+ 1| Dev : |d.E(X)| = 2.5509 err(X) = 57.7121
+ 2| Dev : |d.E(X)| = 5.0427 err(X) = 53.9887
+ 3| Dev : |d.E(X)| = 5.1724 err(X) = 52.6923
+ 4| Dev : |d.E(X)| = 5.0167 err(X) = 51.1648
+ 5| Dev : |d.E(X)| = 3.4010 err(X) = 49.9060
+ 6| Dev : |d.E(X)| = 1.2844 err(X) = 48.9477
+ 7| Dev : |d.E(X)| = 3.8469 err(X) = 48.2278
+ 8| Dev : |d.E(X)| = 3.2666 err(X) = 47.3158
+ 9| Dev : |d.E(X)| = 0.7140 err(X) = 46.4883
+ 10| Dev : |d.E(X)| = 3.5822 err(X) = 45.7021
+ 11| Dev : |d.E(X)| = 1.9054 err(X) = 45.2206
...
<shortened for brevity>
...
- 91| Test: |d.E(X)| = 0.4870 err(X) = 11.0443; Train: err(X) = 10.9832
- 92| Test: |d.E(X)| = 0.0390 err(X) = 11.0118; Train: err(X) = 10.9820
- 93| Test: |d.E(X)| = 0.5127 err(X) = 11.0013; Train: err(X) = 10.9586
- 94| Test: |d.E(X)| = 1.9180 err(X) = 10.9874; Train: err(X) = 10.9312
- 95| Test: |d.E(X)| = 0.0258 err(X) = 10.9906; Train: err(X) = 10.9274
- 96| Test: |d.E(X)| = 0.4760 err(X) = 10.9712; Train: err(X) = 10.8940
- 97| Test: |d.E(X)| = 0.6038 err(X) = 10.9589; Train: err(X) = 10.8960
- 98| Test: |d.E(X)| = 0.2870 err(X) = 10.9563; Train: err(X) = 10.8727
- 99| Test: |d.E(X)| = 1.6622 err(X) = 10.9347; Train: err(X) = 10.8671
+ 93| Dev: |d.E(X)| = 0.3789 err(X) = 27.3184
+ 94| Dev: |d.E(X)| = 0.5906 err(X) = 27.2172
+ 95| Dev: |d.E(X)| = 0.0461 err(X) = 27.2518
+ 96| Dev: |d.E(X)| = 1.9164 err(X) = 27.1477
+ 97| Dev: |d.E(X)| = 2.3997 err(X) = 27.0035
+ 98| Dev: |d.E(X)| = 2.9253 err(X) = 27.1244
+ 99| Dev: |d.E(X)| = 1.2569 err(X) = 26.9761
--- Final RBM Synaptic Stats ---
- W1: min -1.8648 ; max 1.3757 mu -0.0012 ; norm 70.6230
- b1: min -7.5815 ; max 0.2337 mu -2.3395 ; norm 53.3993
- c0: min -11.6316 ; max -2.4227 mu -5.3259 ; norm 161.5646
+ W1: min -1.1823 ; max 0.7636 mu -0.0087 ; norm 57.4068
+ b1: min -4.0943 ; max -2.8031 mu -3.5501 ; norm 56.9961
+ c0: min -16.0370 ; max -0.8244 mu -4.6686 ; norm 158.2293
```

You will find, after the training script has finished executing, several outputs in the `exp/filters/` model
@@ -282,7 +283,7 @@ if not). In particular, we remark notice that the filters that our harmonium has
to the fact our exhibit employs some weight decay (specifically, Gaussian/L2 decay -- with intensity
`l2_lambda=0.01` -- applied to the `W1` synaptic matrix of our RBM).
Weight decay of this form is particularly useful not only to mitigate overfitting of the harmonium to its training
- data but also to ensure that the Markov chain inherent to its negative-phase mixes more effectively [5] (which ensures
+ data but also to ensure that the Markov chain inherent to its negative-phase mixes more effectively <b>[5]</b> (which ensures
better-quality samples from the block Gibbs sampler, which we will use next).
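
As a rough illustration of what such decay does to the learning rule, the fragment below folds an L2 penalty of strength `l2_lambda` into an already-computed contrastive gradient for `W1`. This is a sketch under stated assumptions; the names `dW_cd` and `apply_weight_decay` are invented for this example and are not part of the exhibit's code:

```python
# Illustrative only: folding Gaussian/L2 weight decay into the W1 update.
import numpy as np

def apply_weight_decay(W1, dW_cd, lr=0.01, l2_lambda=0.01):
    # dW_cd: the contrastive (positive-minus-negative phase) gradient estimate for W1.
    # The L2 term pulls every synapse toward zero in proportion to its magnitude,
    # keeping the learned filters small/smooth and helping the negative-phase chain mix.
    dW = dW_cd - l2_lambda * W1
    return W1 + lr * dW
```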

Finally, you will also find in the `exp/filters/` model sub-folder another grid-plot containing some (about `100`) of
@@ -330,8 +331,9 @@ Gibbs sampling process.
reading the plot follows the ordering of samples extracted from the specific Markov chain sequence.)
Note that, although each chain is run for many total steps, the `sample_harmonium.py` script "thins" out each Markov
chain by only pulling out a fantasized pattern every `20` steps (further "burning in" each chain before collecting
- samples). Each chain is merely initialized with random Bernoulli noise. Note that higher-quality samples can be
- obtained if one modifies the earlier harmonium to learn with persistent CD or parallel tempering.
+ samples). <!-- Each chain is merely initialized with random Bernoulli noise. -->
+ We remark that higher-quality samples can be
+ obtained if one modifies the earlier harmonium to learn with more advanced forms of CD learning, such as parallel tempering.
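
To make the "thinning" idea concrete, here is a minimal, self-contained sketch of drawing fantasized samples from a binary RBM by block Gibbs sampling while keeping only every `thin`-th state. It is not the exhibit's `sample_harmonium.py`; all names, shapes, and the zero-initialized parameters are placeholders used purely for illustration:

```python
# Illustrative "thinned" block Gibbs sampler for a binary RBM (not the exhibit's code).
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def thinned_gibbs_samples(v, W, b, c, n_samples=10, thin=20):
    # v: initial visible state of the chain, e.g. random Bernoulli noise.
    samples = []
    for step in range(n_samples * thin):
        p_h = sigmoid(v @ W + b)                  # block-sample all hidden units
        h = rng.random(p_h.shape) < p_h
        p_v = sigmoid(h @ W.T + c)                # block-sample all visible units
        v = rng.random(p_v.shape) < p_v
        if (step + 1) % thin == 0:                # keep only every `thin`-th state
            samples.append(v.copy())
    return np.stack(samples)

# Toy usage: 784 visible units (28x28 MNIST), 100 hidden units, chain started from noise.
W, b, c = np.zeros((784, 100)), np.zeros(100), np.zeros(784)
fantasies = thinned_gibbs_samples(rng.random(784) < 0.5, W, b, c)   # shape (10, 784)
```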

### Final Notes

@@ -343,19 +345,25 @@ Boltzmann machine (GRBM).

<!-- references -->
## References
- [1] Smolensky, P. "Information Processing in Dynamical Systems: Foundations of Harmony Theory" (Chapter 6). Parallel
+ <b>[1]</b> Smolensky, P. "Information Processing in Dynamical Systems: Foundations of Harmony Theory" (Chapter 6). Parallel
distributed processing: explorations in the microstructure of cognition 1 (1986). <br>
- [2] Hinton, Geoffrey. Products of Experts. International conference on artificial neural networks (1999). <br>
- [3] Hinton, Geoffrey E. "Training products of experts by maximizing contrastive likelihood." Technical Report, Gatsby
+ <b>[2]</b> Hinton, Geoffrey. Products of Experts. International conference on artificial neural networks (1999). <br>
+ <b>[3]</b> Hinton, Geoffrey E. "Training products of experts by maximizing contrastive likelihood." Technical Report, Gatsby
computational neuroscience unit (1999). <br>
- [4] Movellan, Javier R. "Contrastive Hebbian learning in the continuous Hopfield model." Connectionist models. Morgan
+ <b>[4]</b> Movellan, Javier R. "Contrastive Hebbian learning in the continuous Hopfield model." Connectionist models. Morgan
Kaufmann, 1991. 10-17. <br>
- [5] Hinton, Geoffrey E. "A practical guide to training restricted Boltzmann machines." Neural networks: Tricks of the
- trade. Springer, Berlin, Heidelberg, 2012. 599-619.
+ <b>[5]</b> Hinton, Geoffrey E. "A practical guide to training restricted Boltzmann machines." Neural networks: Tricks of the
+ trade. Springer, Berlin, Heidelberg. 599-619 (2012). <br>
+ <b>[6]</b> Tieleman, Tijmen. "Training restricted Boltzmann machines using approximations to the likelihood gradient." Proceedings of the 25th International Conference on Machine Learning (2008).

<!-- footnotes -->
[^1]: In fact, it is intractable to compute the partition function $Z$ for any reasonably-sized harmonium; fortunately,
we will not need to calculate $Z$ in order to learn and sample from a harmonium.
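Concretely, for a harmonium with $J$ binary visible units and $K$ binary hidden units (symbols chosen here just for illustration), the partition function is the sum over every joint configuration,
$$Z = \sum_{\mathbf{x} \in \{0,1\}^J} \sum_{\mathbf{h} \in \{0,1\}^K} \exp\big(-E(\mathbf{x}, \mathbf{h})\big),$$
which contains $2^{J+K}$ terms and thus becomes infeasible to evaluate exactly beyond very small models.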
[^2]: In general, CD-1 means contrastive divergence where the negative phase is only run for a single step, i.e.,
`K=1`. The more general form of CD is known as CD-K, the K-step CD algorithm where `K > 1`. (Sometimes, CD-1 is
referred to as just "CD".)
+ [^3]: Note that we have slightly modified the PCD algorithm to include a "chain-swapping" mechanism taken from the
+ statistical approach known as "parallel tempering". In our implementation, we randomly swap the states of the
+ set of Gibbs chains we maintain under a fixed mixing probability `p_mix`; we found that this yielded somewhat more
+ consistent improvements in the quality of our model's confabulations.
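
For illustration, a swap step of this kind might look like the following sketch; `chains` and `p_mix` follow the footnote's naming, while `maybe_swap_chains` and the array shapes are assumptions made for this example rather than the exhibit's actual implementation:

```python
# Illustrative chain-swapping step for a bank of persistent Gibbs chains.
import numpy as np

def maybe_swap_chains(chains, p_mix=0.1, rng=None):
    # chains: array of shape (n_chains, n_visible) holding each chain's current visible state.
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < p_mix:
        i, j = rng.choice(chains.shape[0], size=2, replace=False)
        chains[[i, j]] = chains[[j, i]]   # exchange the two chains' states in place
    return chains
```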