206 CHAPTER 9: Bernoulli Likelihood with Hierarchical Prior

The plots of the posterior distribution, in the lower rows of Figure 9.6, reveal interesting results. Because the biases and the hyperparameter are being simultaneously estimated, and the biases are strongly dependent on the hyperparameter, the posterior estimates are fairly tightly constrained, especially in comparison with Figure 9.5. Essentially, because the prior emphasizes a relatively narrow spindle within the 3D box, the posterior is restricted to a zone within that spindle. Not only does this cause the posterior to be relatively peaked on all the parameters, it also pulls all the estimates in toward the focal zone. Notice, in particular, that the posterior on θ₂ is peaked around 0.4, far from the proportion 4/5 = 0.8 in its coin's data! This shift away from the data proportion is caused by the fact that the other coin had a larger sample size, so it has more influence in deciding which part of the prior's spindle is focussed upon.

One of the desirable aspects of using grid approximation to determine the posterior is that we do not rely on any formal analysis of the posterior. Instead, our computer simply keeps track of the values of the prior and likelihood at a large number of grid points and sums over them to determine the denominator of Bayes' rule. Grid approximation can use mathematical formulas for the prior as a convenience for determining the prior values at all those thousands of grid points. What's nice is that we can use, for the prior, any (non-negative) mathematical function we want, without knowing how to formally normalize it, because it will be normalized by the grid approximation. My choice of the priors for this example, summarized in Figure 9.4, was motivated merely by the pedagogical goal of using functions that you are familiar with, not by any formal restriction.
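The bookkeeping described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used to produce the figures. Several specifics are assumptions made for the example: the beta(2, 2) hyperprior on µ, the dependence setting K = 75, the comb of interval midpoints, the function names, and the first coin's data (3 heads in 15 flips). Only the second coin's 4 heads in 5 flips comes from the text. The joint weights are divided by their grid sum at the end, so the joint prior never has to be normalized analytically; the conditional beta density of each bias is still written with its own normalizer, because that normalizer changes with µ.

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def bernoulli_lik(theta, heads, flips):
    """Likelihood of `heads` heads in `flips` independent flips with bias theta."""
    return theta ** heads * (1 - theta) ** (flips - heads)

def grid_posterior(data, n_comb=50, K=75.0, a_mu=2.0, b_mu=2.0):
    """Grid-approximate the posterior over (mu, theta1, theta2).

    `data` is ((heads1, flips1), (heads2, flips2)). K, a_mu and b_mu are
    assumed prior settings chosen for illustration. Returns (comb, post),
    where post[i][j][k] is the posterior mass at
    (mu=comb[i], theta1=comb[j], theta2=comb[k]).
    """
    # Comb of interval midpoints, which avoids the endpoints 0 and 1.
    comb = [(i + 0.5) / n_comb for i in range(n_comb)]
    (z1, n1), (z2, n2) = data
    lik1 = [bernoulli_lik(t, z1, n1) for t in comb]
    lik2 = [bernoulli_lik(t, z2, n2) for t in comb]

    post, total = [], 0.0
    for mu in comb:
        p_mu = beta_pdf(mu, a_mu, b_mu)
        # Prior on each bias given mu: beta(mu*K, (1-mu)*K).
        p_theta = [beta_pdf(t, mu * K, (1 - mu) * K) for t in comb]
        plane = [[p_mu * p_theta[j] * lik1[j] * p_theta[k] * lik2[k]
                  for k in range(n_comb)]
                 for j in range(n_comb)]
        total += sum(map(sum, plane))
        post.append(plane)

    # Dividing by the grid sum plays the role of the denominator of Bayes' rule.
    post = [[[v / total for v in row] for row in plane] for plane in post]
    return comb, post

def marginal_mean(comb, post, axis):
    """Posterior mean of one parameter: axis 0 is mu, 1 is theta1, 2 is theta2."""
    n = len(comb)
    mean = 0.0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                mean += comb[(i, j, k)[axis]] * post[i][j][k]
    return mean

# Hypothetical data for coin 1; coin 2's 4 heads in 5 flips is from the text.
comb, post = grid_posterior(((3, 15), (4, 5)))
print("posterior mean of theta2:", marginal_mean(comb, post, axis=2))
```

Under these assumed settings, the posterior mean of θ₂ falls well below the data proportion of 0.8, which is the same kind of pull toward the focal zone described above.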
The grid approximation displayed in Figures 9.5 and 9.6 used combs of only 50 points on each parameter (µ, θ₁, and θ₂). This means that the 3D grid had 50³ = 125,000 points, which is a size that can be handled easily on an ordinary desktop computer of the early 21st century. It is interesting to remind ourselves that the grid approximation displayed in Figures 9.5 and 9.6 would have been on the edge of computability 50 years ago and would have been impossible 100 years ago. The number of points in a grid approximation can get rather hefty in a hurry. If we were to expand the example by including a third coin, with its parameter θ₃, then the grid would have 50⁴ = 6,250,000 points, which already strains small computers. Include a fourth coin, and the grid contains more than 312 million points. Grid approximation is not a viable approach to even modestly large problems, which we encounter next.

9.2.2 Posterior via Monte Carlo Sampling

The previous sections have used a simplified model (believe it or not) for the purpose of being able to graphically display the parameter space and gain clear intuitions about how Bayesian inference works. In this section, the first thing