% Hierarchical Models
Note: This chapter has not been revised for the new format and Church engine. Some content may be incomplete! Some examples may not work!
Human knowledge is organized hierarchically into levels of abstraction. For instance, the most common or basic-level categories (e.g. dog, car) can be thought of as abstractions across individuals, or more often across subordinate categories (e.g., poodle, Dalmatian, Labrador, and so on). Multiple basic-level categories in turn can be organized under superordinate categories: e.g., dog, cat, horse are all animals; car, truck, bus are all vehicles. Some of the deepest questions of cognitive development are: How does abstract knowledge influence learning of specific knowledge? How can abstract knowledge be learned? In this section we will see how such hierarchical knowledge can be modeled with hierarchical generative models: generative models with uncertainty at several levels, where lower levels depend on choices at higher levels.
Hierarchical models allow us to capture the shared latent structure underlying observations of multiple related concepts, processes, or systems -- to abstract out the elements in common to the different sub-concepts, and to filter away uninteresting or irrelevant differences. Perhaps the most familiar example of this problem occurs in learning about categories. Consider a child learning about a basic-level kind, such as dog or car. Each of these kinds has a prototype or set of characteristic features, and our question here is simply how that prototype is acquired.
The task is challenging because real-world categories are not homogeneous. A basic-level category like dog or car actually spans many different subtypes: e.g., poodle, Dalmatian, Labrador, and so on, or sedan, coupe, convertible, wagon, and so on. The child observes examples of these sub-kinds or subordinate-level categories: a few poodles, one Dalmatian, three Labradors, etc. From this data she must infer what it means to be a dog in general, in addition to what each of these different kinds of dog is like. Knowledge at the prototype level includes understanding what it means to be a prototypical dog and what it means to be a non-prototypical but still genuine dog. This involves understanding that dogs come in different breeds that share features with one another but also differ systematically.
As a simplification of this situation consider the following generative process. We will draw marbles out of several different bags. There are five marble colors. Each bag has a certain "prototypical" mixture of colors. This generative process is represented in the following Church example using the Dirichlet distribution (the Dirichlet is the higher-dimensional analogue of the Beta distribution).
(define colors '(black blue green orange red))

(define bag->prototype
  (mem (lambda (bag) (dirichlet '(1 1 1 1 1)))))

(define (draw-marbles bag num-draws)
  (repeat num-draws
          (lambda () (multinomial colors (bag->prototype bag)))))

(hist (draw-marbles 'bag 50) "first sample")
(hist (draw-marbles 'bag 50) "second sample")
(hist (draw-marbles 'bag 50) "third sample")
(hist (draw-marbles 'bag 50) "fourth sample")
'done
Note that we are using the operator mem that we introduced in the first part of the tutorial. mem is particularly useful when writing hierarchical models because it allows us to associate arbitrary random draws with categories across entire runs of the program. In this case it allows us to associate a particular mixture of marble colors with each bag. The mixture is drawn once, and then remains the same thereafter for that bag.
Run the code above multiple times. Each run creates a single bag of marbles with its characteristic distribution of marble colors, and then draws four samples of 50 marbles each. Intuitively, you can see how each sample is sufficient to learn a lot about what that bag is like; there is typically a fair amount of similarity between the empirical color distributions in each of the four samples from a given bag. In contrast, you should see a lot more variation across different runs of the code -- samples from different bags.
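For readers who want to experiment outside the Church engine, here is a rough Python sketch of the same generative process (an assumed translation, not part of the original chapter; all names are ours). A Dirichlet sample is built by normalizing independent Gamma draws, and `functools.lru_cache` plays the role of `mem`, fixing one persistent color mixture per bag:

```python
import random
from functools import lru_cache
from collections import Counter

COLORS = ['black', 'blue', 'green', 'orange', 'red']

def dirichlet(alphas):
    """Sample from a Dirichlet by normalizing independent Gamma draws."""
    gammas = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

@lru_cache(maxsize=None)
def bag_prototype(bag):
    """Like (mem (lambda (bag) ...)): the mixture is drawn once per bag
    and then reused for every subsequent call with the same bag name."""
    return tuple(dirichlet([1, 1, 1, 1, 1]))

def draw_marbles(bag, num_draws):
    """Draw marbles according to the bag's memoized prototype mixture."""
    return random.choices(COLORS, weights=bag_prototype(bag), k=num_draws)

# Four samples of 50 marbles from the same bag: similar empirical
# distributions within a run, very different ones across runs.
for i in range(4):
    print(Counter(draw_marbles('bag', 50)))
```

Running the script several times mirrors the exercise in the text: the four `Counter` summaries within one run resemble each other, while fresh runs (fresh caches, hence fresh prototypes) look quite different.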
Now let's add a few twists: we will generate three different bags, and try to learn about their respective color prototypes by conditioning on observations. We represent the results of learning in terms of the posterior predictive distribution for each bag: a single hypothetical draw from the bag, using the expression (draw-marbles 'bag 1). We will also draw a sample from the posterior predictive distribution on a new bag, for which we have had no observations.
(define colors '(black blue green orange red))

(define samples
  (mh-query
   200 100

   (define bag->prototype
     (mem (lambda (bag) (dirichlet '(1 1 1 1 1)))))

   (define (draw-marbles bag num-draws)
     (repeat num-draws
             (lambda () (multinomial colors (bag->prototype bag)))))

   ;;predict the next sample from each observed bag, and a new one:
   (list (draw-marbles 'bag-1 1)
         (draw-marbles 'bag-2 1)
         (draw-marbles 'bag-3 1)
         (draw-marbles 'bag-n 1))

   ;;condition on observations from three bags:
   (and
    (equal? (draw-marbles 'bag-1 6) '(blue blue black blue blue blue))
    (equal? (draw-marbles 'bag-2 6) '(blue green blue blue blue red))
    (equal? (draw-marbles 'bag-3 6) '(blue blue blue blue blue orange)))))

(hist (map first samples) "bag one posterior predictive")
(hist (map second samples) "bag two posterior predictive")
(hist (map third samples) "bag three posterior predictive")
(hist (map fourth samples) "bag n posterior predictive")
'done
This generative model describes the prototype mixtures in each bag, but it does not attempt to learn a common higher-order prototype. It is like learning separate prototypes for the subordinate classes poodle, Dalmatian, and Labrador, without learning a prototype for the higher-level kind dog, or learning about any features that are shared across the different lower-level classes or bags. Specifically, inference suggests that each bag is predominantly blue, but with a fair amount of residual uncertainty about what other colors might be seen. No information is shared across bags, and nothing significant is learned about bag-n, since it has no observations and no structure shared with the bags that have been observed.
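Because each bag's Dirichlet(1,1,1,1,1) prior is independent of the others and conjugate to the multinomial, the posterior predictive for each bag can be worked out exactly: P(color) = (1 + count) / (5 + n), where n is the number of marbles observed from that bag. A small Python check (our own sketch, not part of the chapter) makes the point about bag-n concrete: with no observations, its predictive never moves off the uniform prior.

```python
from collections import Counter

COLORS = ['black', 'blue', 'green', 'orange', 'red']

def predictive(observations, alpha=1.0):
    """Exact Dirichlet-multinomial posterior predictive for a single bag
    with a symmetric Dirichlet(alpha, ..., alpha) prior."""
    counts = Counter(observations)
    n = len(observations)
    k = len(COLORS)
    return {c: (alpha + counts[c]) / (alpha * k + n) for c in COLORS}

bag1 = predictive(['blue', 'blue', 'black', 'blue', 'blue', 'blue'])
bag_n = predictive([])  # the bag with no observations

print(bag1['blue'])   # (1 + 5) / (5 + 6) = 6/11 ≈ 0.545
print(bag_n['blue'])  # 1/5 = 0.2: uniform, same as every other color
```

The MCMC estimates above should approximate exactly these numbers: substantial residual probability on unseen colors for the observed bags, and a flat predictive for bag-n.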
Now let us introduce another level of abstraction: a global prototype that provides a prior on the specific prototype mixtures of each bag.
(define colors '(black blue green orange red))

(define samples
  (mh-query
   200 100

   ;;we make a global prototype which is a dirichlet sample scaled to total 5.
   (define prototype (map (lambda (x) (* 5 x)) (dirichlet '(1 1 1 1 1))))

   ;;the prototype for each bag uses the global prototype as parameters.
   (define bag->prototype
     (mem (lambda (bag) (dirichlet prototype))))

   (define (draw-marbles bag num-draws)
     (repeat num-draws
             (lambda () (multinomial colors (bag->prototype bag)))))

   (list (draw-marbles 'bag-1 1)
         (draw-marbles 'bag-2 1)
         (draw-marbles 'bag-3 1)
         (draw-marbles 'bag-n 1))

   (and
    (equal? (draw-marbles 'bag-1 6) '(blue blue black blue blue blue))
    (equal? (draw-marbles 'bag-2 6) '(blue green blue blue blue red))
    (equal? (draw-marbles 'bag-3 6) '(blue blue blue blue blue orange)))))

(hist (map first samples) "bag one posterior predictive")
(hist (map second samples) "bag two posterior predictive")
(hist (map third samples) "bag three posterior predictive")
(hist (map fourth samples) "bag n posterior predictive")
'done
Compared with inferences in the previous example, this extra level of abstraction enables faster learning: more confidence in what each bag is like based on the same observed sample. This is because all of the observed samples suggest a common prototype structure, with most of its weight on blue and the rest of the weight spread uniformly among the remaining colors. Statisticians sometimes refer to this phenomenon of inference in hierarchical models as "sharing of statistical strength": it is as if the sample we observe for each bag also provides a weaker indirect sample relevant to the other bags. In machine learning and cognitive science this phenomenon is often called learning to learn or transfer learning. Intuitively, knowing something about bags in general allows the learner to transfer knowledge gained from draws from one bag to other bags. This example is analogous to seeing several examples of different subtypes of dogs and learning what features are in common to the more abstract basic-level dog prototype, independent of the more idiosyncratic features of particular dog subtypes.
A particularly striking example of "sharing statistical strength" or "learning to learn" can be seen if we change the observed sample for bag 3 to have only two examples, one blue and one orange. Replace the line (equal? (draw-marbles 'bag-3 6) '(blue blue blue blue blue orange)) with (equal? (draw-marbles 'bag-3 2) '(blue orange)) in each program above. In a situation where we have no shared higher-order prototype structure, inference for bag-3 from these observations suggests that blue and orange are equally likely. However, when we have inferred a shared higher-order prototype, then the inferences we make for bag 3 look much more like those we made before, with six observations (five blue, one orange), because the learned higher-order prototype tells us that blue is most likely to be highly represented in any bag regardless of which other colors (here, orange) may be seen with lower probability.
Learning about shared structure at a higher level of abstraction also supports inferences about new bags without observing any examples from that bag: a hypothetical new bag could produce any color, but is likely to have more blue marbles than any other color. We can imagine hypothetical, previously unseen, new subtypes of dogs that share the basic features of dogs with more familiar kinds but may differ in some idiosyncratic ways.
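The effect of the learned higher-order prototype can be approximated with the same conjugate arithmetic (a hand-constructed sketch, not the chapter's model; the pseudo-counts standing in for the inferred global prototype are made up for illustration). With a flat prior, the two observations (blue, orange) leave blue and orange exactly tied; with pseudo-counts concentrated on blue, as the shared observations from the other bags suggest, blue dominates bag 3's predictive even after the same two observations:

```python
from collections import Counter

COLORS = ['black', 'blue', 'green', 'orange', 'red']

def predictive(observations, alphas):
    """Dirichlet-multinomial posterior predictive with per-color pseudo-counts."""
    counts = Counter(observations)
    total = sum(alphas.values()) + len(observations)
    return {c: (alphas[c] + counts[c]) / total for c in COLORS}

flat = {c: 1.0 for c in COLORS}
# Hypothetical stand-in for the inferred global prototype: mostly blue.
learned = {'black': 0.3, 'blue': 3.8, 'green': 0.3, 'orange': 0.3, 'red': 0.3}

obs = ['blue', 'orange']
f = predictive(obs, flat)
g = predictive(obs, learned)
print(f['blue'], f['orange'])  # tied: 2/7 each
print(g['blue'], g['orange'])  # blue dominates despite identical observations
```

Under the flat prior both colors get 2/7; under the blue-heavy pseudo-counts blue gets about 0.69 and orange about 0.19, qualitatively matching the hierarchical model's inference.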
Now let's investigate the relative learning speeds at different levels of abstraction. Suppose that we have a number of bags that all have identical prototypes: they mix red and blue in proportion 2:1. But the learner doesn't know this. She observes only one ball from each of N bags. What can she learn about an individual bag versus the population as a whole as the number of bags changes? (Note: this example is too slow to run on Churchserv. Results are shown below.)
(define colors '(red blue))

(define (sample-bags obs-draws)
  (mh-query
   300 100

   ;;we make a global prototype which is a dirichlet sample scaled to total 2:
   (define phi (dirichlet '(1 1)))
   (define global-prototype (map (lambda (x) (* 2 x)) phi))

   ;;the prototype for each bag uses the global prototype as parameters.
   (define bag->prototype
     (mem (lambda (bag) (dirichlet global-prototype))))

   (define (draw-marbles bag num-draws)
     (repeat num-draws
             (lambda () (multinomial colors (bag->prototype bag)))))

   ;;query the inferred bag1 and global prototype:
   (list (first (bag->prototype (first (first obs-draws))))
         (first phi))

   ;;condition on getting the right observations from each bag.
   ;;obs-draws is a list of lists of draws from each bag (first is bag name).
   (all (map (lambda (bag) (equal? (rest bag)
                                   (draw-marbles (first bag) (length (rest bag)))))
             obs-draws))))

;;;;;;;;;end of the model, below is code to make plots of learning speed for this model.

;;compute the mean squared deviation of samples from truth:
(define (mean-dev true samples)
  (mean (map (lambda (s) (expt (- true s) 2)) samples)))

;;now we generate learning curves! we take a single sample from each bag.
;;plot the mean-squared error normalized by the no-observations error.
(define samples (sample-bags '((bag1))))
(define initial-specific (mean-dev 0.66 (map first samples)))
(define initial-global (mean-dev 0.66 (map second samples)))
(lineplot-value (pair 0 1) "specific learning")
(lineplot-value (pair 0 1) "general learning")

(define samples (sample-bags '((bag1 red))))
(lineplot-value (pair 1 (/ (mean-dev 0.66 (map first samples)) initial-specific))
                "specific learning")
(lineplot-value (pair 1 (/ (mean-dev 0.66 (map second samples)) initial-global))
                "general learning")

(define samples (sample-bags '((bag1 red) (bag2 red) (bag3 blue))))
(lineplot-value (pair 3 (/ (mean-dev 0.66 (map first samples)) initial-specific))
                "specific learning")
(lineplot-value (pair 3 (/ (mean-dev 0.66 (map second samples)) initial-global))
                "general learning")

(define samples (sample-bags '((bag1 red) (bag2 red) (bag3 blue) (bag4 red) (bag5 red) (bag6 blue))))
(lineplot-value (pair 6 (/ (mean-dev 0.66 (map first samples)) initial-specific))
                "specific learning")
(lineplot-value (pair 6 (/ (mean-dev 0.66 (map second samples)) initial-global))
                "general learning")

(define samples (sample-bags '((bag1 red) (bag2 red) (bag3 blue) (bag4 red) (bag5 red) (bag6 blue) (bag7 red) (bag8 red) (bag9 blue))))
(lineplot-value (pair 9 (/ (mean-dev 0.66 (map first samples)) initial-specific))
                "specific learning")
(lineplot-value (pair 9 (/ (mean-dev 0.66 (map second samples)) initial-global))
                "general learning")

(define samples (sample-bags '((bag1 red) (bag2 red) (bag3 blue) (bag4 red) (bag5 red) (bag6 blue) (bag7 red) (bag8 red) (bag9 blue) (bag10 red) (bag11 red) (bag12 blue))))
(lineplot-value (pair 12 (/ (mean-dev 0.66 (map first samples)) initial-specific))
                "specific learning")
(lineplot-value (pair 12 (/ (mean-dev 0.66 (map second samples)) initial-global))
                "general learning")

'done
We are plotting learning curves: the mean squared error of the estimated prototype from the true prototype, at both the specific level (the first bag) and the general (global prototype) level, as a function of the number of observed data points. Note that these quantities are directly comparable because they are each samples from a Dirichlet distribution of the same size (this is often not the case in hierarchical models). What we see is that learning is faster at the general level than at the specific level: the error in the estimated prototype drops faster in the general curve than in the specific curve. We also see continued learning at the specific level, even though we see no additional samples from the first bag after the first observation; this is because the evolving knowledge at the general level further constrains the inferences at the specific level. Going back to our familiar categorization example, this suggests that a child could be quite confident in the prototype of "dog" while having little idea of the prototype for any specific kind of dog: learning more quickly at the abstract level than at the specific level, but then using this abstract knowledge to constrain expectations about the specific level. This dynamic depends crucially on the fact that we get very diverse evidence: try changing the above example to observe the same N examples, but coming from a single bag (instead of N bags). You should now see that learning for this bag is quick, while global learning (and transfer) is slow.
The first of the plots above shows learning curves when there is one observation per bag. The second plot shows learning curves when all the observations come from the same bag.
In machine learning one often talks of the curse of dimensionality. The curse of dimensionality refers to the fact that as the number of parameters of a model increases (i.e. the dimensionality of the model increases), the size of the hypothesis space increases exponentially. This increase in the size of the hypothesis space leads to two related problems. The first is that the amount of data required to estimate model parameters (called the "sample complexity") increases rapidly as the dimensionality of the hypothesis space increases. The second is that the amount of computational work needed to search the hypothesis space also rapidly increases. Thus, increasing model complexity by adding parameters can result in serious problems for inference.
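The exponential growth behind the curse of dimensionality is easy to see concretely. Discretizing each parameter into b candidate values gives b^d grid hypotheses over d parameters (a back-of-the-envelope illustration of the claim, not from the chapter):

```python
def hypothesis_count(bins, dims):
    """Number of grid hypotheses when each of `dims` parameters is
    discretized into `bins` candidate values: bins ** dims."""
    return bins ** dims

# Ten candidate values per parameter: the hypothesis space explodes.
for d in [1, 2, 5, 10]:
    print(d, hypothesis_count(10, d))  # 10, 100, 100000, 10000000000
```

With only ten values per parameter, a ten-parameter model already has ten billion grid hypotheses, which is why both sample complexity and search cost grow so quickly.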
In contrast, we have seen that adding additional levels of abstraction (and hence additional parameters) in a probabilistic model can sometimes make it possible to learn more with fewer observations. This happens because learning at the abstract level can be quicker than learning at the specific level. Because this ameliorates the curse of dimensionality, we refer to these effects as the blessing of abstraction.
In general, the blessing of abstraction can be surprising because our intuitions often suggest that adding more hierarchical levels to a model increases the model's complexity, and more complex models should make learning harder, rather than easier. On the other hand, it has long been understood in cognitive science that learning is made easier by the addition of constraints on possible hypotheses. For instance, proponents of universal grammar have long argued for a highly constrained linguistic system on the basis of learnability. Their theories often have an explicitly hierarchical flavor. Hierarchical Bayesian models can be seen as a way of introducing soft, probabilistic constraints on hypotheses that allow for the transfer of knowledge between different kinds of observations.

