Bayesian Estimation of Multinomial Parameter Ratios: A Derivation

A while back, I wrote a blog post about the Bayesian estimation of multinomial parameter ratios. That post used simulation to show that the ratio of posterior means for multinomial parameters is not the same as the posterior mean of the ratio. Symbolically,

\[ \frac{E(\pi_1)}{E(\pi_2)} \neq E\left( \frac{\pi_1}{\pi_2} \right), \]

where \(\pi_1\) and \(\pi_2\) are selected multinomial parameters, and the expectation is taken over the posterior distribution.
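
The inequality is easy to check by simulation, much as in the original post. Here is a minimal sketch in Python; the Dirichlet parameters \((3, 3, 4)\) are arbitrary illustrative choices standing in for the posterior parameters \(a_i + n_i\):

```python
import numpy as np

# Draw samples from a Dirichlet posterior with illustrative parameters
# (a_i + n_i) = (3, 3, 4) over three categories.
rng = np.random.default_rng(0)
samples = rng.dirichlet([3.0, 3.0, 4.0], size=1_000_000)

pi1, pi2 = samples[:, 0], samples[:, 1]

ratio_of_means = pi1.mean() / pi2.mean()   # E(pi1) / E(pi2)
mean_of_ratio = (pi1 / pi2).mean()         # E(pi1 / pi2)

print(ratio_of_means)  # close to 3/3 = 1
print(mean_of_ratio)   # close to 3/(3 - 1) = 1.5
```

The two quantities differ by a factor of 1.5 here, so this is not a small-sample artifact.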

In that post, I provided a formula for the posterior mean of the ratio:

\[ E\left( \frac{\pi_1}{\pi_2} \right) = \frac{n_i + a_i}{n_j + a_j -1},\]

where \(a_i\) and \(a_j\) are parameters in the Dirichlet prior distribution, and \(n_i\) and \(n_j\) are the number of observations in categories \(i\) and \(j\). I then wrote: “The proof of this formula is a nice calculus exercise, which I might write up as a future blog post.”

Someone recently wrote to me and asked about the derivation of this formula. I looked back at my notes and realized that I had not written it down. I might have used Mathematica to do it, but my Mathematica license has since expired because I left academia, and I haven’t bothered to buy a new one.

It is an interesting problem, so I decided to go ahead and work through it again, and write it up in case it would be useful to someone else. This won’t be a rigorous proof, but I’ll point to references from which you should be able to construct your desired level of rigor.

As a starting point, I’ll take it as given that since we are only concerned with two multinomial parameters, we can neglect the rest of them, and thus simplify things down to the binomial distribution. An argument for this simplification dealing with a similar case is outlined in Exercise 1 of Chapter 3 in Bayesian Data Analysis, 3rd edition, by Gelman et al.

In the binomial case, the Dirichlet prior reduces to a Beta prior,

\[ \text{Beta}(a_1, a_2 ) = \frac{\Gamma(a_1 + a_2)}{\Gamma(a_1)\Gamma(a_2)} \pi_1^{a_1 - 1} \pi_2^{a_2 - 1}, \]

which is commonly written in terms of \(\pi_1\) and \(1 - \pi_1\), where \(\pi_1\) is the probability of “success.” The posterior distribution is then

\[ p(\pi | n) = \text{Beta}(a_1 + n_1, a_2 + n_2) = \frac{\Gamma(a_1 + n_1 + a_2 + n_2)}{\Gamma(a_1 + n_1) \Gamma(a_2 + n_2)} \pi_1^{a_1 + n_1 - 1} \pi_2^{a_2 + n_2 - 1}, \]

where \(n_1\) and \(n_2\) are the numbers of observations in categories 1 and 2 (or number of “successes” and “failures” since this is just the binomial case). Remember that this is really a single-variable distribution, since \(\pi_1 + \pi_2 = 1\). I’ll take the “free” variable to be \(\pi_1\).
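
As a quick sanity check on the conjugate update, prior times likelihood should be proportional to the stated Beta posterior, so their pointwise ratio to the posterior density must be a constant. A sketch with arbitrary illustrative numbers:

```python
import numpy as np
from scipy.stats import beta, binom

a1, a2 = 2.0, 3.0   # illustrative prior parameters
n1, n2 = 5, 7       # illustrative observation counts

x = np.linspace(0.05, 0.95, 19)

# Beta(a1, a2) prior times binomial likelihood, unnormalized.
unnorm = beta.pdf(x, a1, a2) * binom.pmf(n1, n1 + n2, x)
posterior = beta.pdf(x, a1 + n1, a2 + n2)

ratio = unnorm / posterior
print(ratio.min(), ratio.max())  # equal: the ratio is constant in x
```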

The next step is to rewrite this as a distribution for the ratio \(\phi = \pi_1 / \pi_2\). In terms of just \(\pi_1\), this ratio is

\[ \phi = \frac{\pi_1}{1 - \pi_1},\]

so \(\pi_1 = \phi/(1 + \phi)\), \(\pi_2 = 1 / (1 + \phi)\), and the Jacobian factor is \(1 / (1 + \phi)^2\). We end up with

\[ p(\phi | n) = \frac{\Gamma(a_1 + n_1 + a_2 + n_2)}{\Gamma(a_1 + n_1) \Gamma(a_2 + n_2)} \frac{\phi^{a_1 + n_1 - 1}}{(1 + \phi)^{a_1 + n_1 + a_2 + n_2}} .\]
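
Incidentally, this is exactly the density of a beta prime distribution with parameters \(a_1 + n_1\) and \(a_2 + n_2\), which gives an easy numerical check against `scipy.stats.betaprime` (the values of \(a_1 + n_1\) and \(a_2 + n_2\) below are illustrative stand-ins):

```python
import numpy as np
from scipy.stats import betaprime
from scipy.special import gammaln

a, b = 7.0, 10.0   # illustrative stand-ins for a1 + n1 and a2 + n2

phi = np.linspace(0.1, 5.0, 50)

# The derived density p(phi | n), computed in log space for stability.
log_p = (gammaln(a + b) - gammaln(a) - gammaln(b)
         + (a - 1) * np.log(phi) - (a + b) * np.log1p(phi))

print(np.max(np.abs(np.exp(log_p) - betaprime.pdf(phi, a, b))))  # ~0
```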

Now, to get the posterior mean of \(\phi\), we just have to calculate an expectation (multiply by \(\phi\) and integrate):

\[ E(\phi | n) = \frac{\Gamma(a_1 + n_1 + a_2 + n_2)}{\Gamma(a_1 + n_1) \Gamma(a_2 + n_2)} \int_0^\infty \frac{\phi^{a_1 + n_1} \text{d}\phi}{(1 + \phi)^{a_1 + n_1 + a_2 + n_2}} . \]

The integration range comes from the fact that \(\pi_1 \in [0, 1]\), so the change of variables gives \(\phi \in [0, \infty)\). Of course the ratio is not defined when \(\pi_1 = 1\), so I’m playing a bit fast and loose with the limits (in my defense, I was trained as a physicist).
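
Before evaluating this integral in closed form, it is worth checking that the change of variables was done correctly: numerical quadrature of the expectation above should match a Monte Carlo average of \(\pi_1/(1 - \pi_1)\) under the Beta posterior. A sketch with illustrative stand-in parameters:

```python
import numpy as np
from scipy import integrate
from scipy.special import gammaln

a, b = 7.0, 10.0  # illustrative stand-ins for a1 + n1 and a2 + n2

def integrand(phi):
    # phi * p(phi | n), written via logs for numerical stability
    log_val = (gammaln(a + b) - gammaln(a) - gammaln(b)
               + a * np.log(phi) - (a + b) * np.log1p(phi))
    return np.exp(log_val)

quad_mean, _ = integrate.quad(integrand, 0, np.inf)

rng = np.random.default_rng(1)
pi1 = rng.beta(a, b, size=1_000_000)
mc_mean = (pi1 / (1 - pi1)).mean()

print(quad_mean, mc_mean)  # the two agree
```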

In simpler terms, we now need to evaluate the integral

\[ \int_0^\infty \frac{x^a \text{d}x}{(1 + x)^b},\]

where \(a\) and \(b\) are constants. But this integral is equivalent to a Beta function! I’ll give two references for this. The first is a short book titled The Gamma Function, by Emil Artin, page 28, which shows how to derive the formula

\[ \int_0^\infty \frac{t^{x - 1}}{(1 + t)^{x + y}} \text{d}t = \frac{\Gamma(x) \Gamma(y) }{\Gamma(x + y)}. \]

The second reference is Gradshteyn and Ryzhik, 7th edition, formula 8.380.3, on page 908, which is identical.
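
If you don’t have either reference handy, the identity is also easy to verify numerically for a few values of \(x\) and \(y\) (chosen arbitrarily here):

```python
import numpy as np
from scipy import integrate
from scipy.special import beta as beta_fn

# Check the integral identity against the Beta function for a few (x, y) pairs.
for x, y in [(1.5, 2.5), (3.0, 4.0), (2.5, 0.8)]:
    val, _ = integrate.quad(lambda t: t**(x - 1) / (1 + t)**(x + y), 0, np.inf)
    print(f"x={x}, y={y}: integral={val:.6f}, Beta(x, y)={beta_fn(x, y):.6f}")
```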

In terms of our original variables, we have

\[\int_0^\infty \frac{\phi^{a_1 + n_1} \text{d}\phi}{(1 + \phi)^{a_1 + n_1 + a_2 + n_2}} = \frac{\Gamma(a_1 + n_1 + 1) \Gamma(a_2 + n_2 - 1)}{\Gamma(a_1 + n_1 + a_2 + n_2)} .\]

Multiplying by the prefactor, and canceling the factors of \(\Gamma(a_1 + n_1 + a_2 + n_2)\), we are left with

\[ E(\phi | n) = \frac{\Gamma(a_1 + n_1 + 1)\Gamma(a_2 + n_2 -1)}{\Gamma(a_1 + n_1) \Gamma(a_2 + n_2)}. \]

Now apply the recursion \(\Gamma(z + 1) = z \Gamma(z)\) twice: once with \(z = a_1 + n_1\) in the numerator, and once with \(z = a_2 + n_2 - 1\) in the denominator (so that \(\Gamma(a_2 + n_2) = (a_2 + n_2 - 1) \Gamma(a_2 + n_2 - 1)\)). The remaining gamma functions cancel, and we get

\[ E(\phi | n) = E \left( \frac{\pi_1}{\pi_2} \right) = \frac{a_1 + n_1}{a_2 + n_2 - 1}. \]
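
The two applications of the recursion can be checked directly with `scipy.special.gamma`; the parameter values below are arbitrary illustrations:

```python
from scipy.special import gamma

a1, n1, a2, n2 = 2.0, 5, 3.0, 7  # illustrative values

lhs = gamma(a1 + n1 + 1) * gamma(a2 + n2 - 1) / (gamma(a1 + n1) * gamma(a2 + n2))
rhs = (a1 + n1) / (a2 + n2 - 1)

print(lhs, rhs)  # both equal 7/9
```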

By contrast,

\[ E(\pi_1 | n) = \frac{a_1 + n_1}{a_1 + a_2 + n_1 + n_2}, \]


\[ E(\pi_2 | n) = \frac{a_2 + n_2}{a_1 + a_2 + n_1 + n_2}, \]

so the ratio of the posterior means is

\[ \frac{E(\pi_1 | n )}{E(\pi_2 | n)} = \frac{a_1 + n_1}{a_2 + n_2}.\]
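
Putting the two closed forms side by side makes the gap concrete; note that the posterior mean of the ratio always exceeds the ratio of the posterior means, since its denominator is smaller by one. A quick arithmetic sketch with illustrative parameter values:

```python
# Compare ratio of posterior means vs posterior mean of the ratio
# for a few illustrative (a1 + n1, a2 + n2) pairs.
for num, den in [(7.0, 10.0), (3.0, 3.0), (20.0, 2.0)]:
    ratio_of_means = num / den
    mean_of_ratio = num / (den - 1)
    print(num, den, ratio_of_means, mean_of_ratio)
```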

Something has snuck by unnoticed here (at least I didn’t notice it at first). It seems that we can choose parameters such that the formula gives a negative value for the expected ratio! For example, take \(a_2 = 0.5\) and assume that there are no observations in this category (\(n_2 = 0\)), so that the denominator \(a_2 + n_2 - 1 = -0.5\). A closer look at the derivation reveals restrictions on the parameters in the integral: in terms of the formulas given in the references, \(x\) and \(y\) must both be positive. Translating to our variables, this gives

\[x = a_1 + n_1 + 1 > 0 , \]


\[ y = a_2 + n_2 - 1 > 0 . \]

The first inequality always holds, since the parameters of a beta distribution are positive and \(n_1 \geq 0\). The second, however, requires \(a_2 + n_2 > 1\), and that can cause some trouble. As long as there is at least one observation in category 2, things are fine (since \(n_2 \geq 1\) and \(a_2 > 0\)). If there are no observations in this category, then you could adjust the prior so that \(a_2 > 1\) to avoid trouble. But choosing the prior to dodge a problem with the data is contrary to the spirit of Bayesian inference!
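
When \(a_2 + n_2 \leq 1\), the expectation isn’t really negative; the integral genuinely diverges. One way to see this numerically is that the partial integrals keep growing as the upper cutoff increases. A sketch with \(a_1 + n_1 = 7\) (illustrative) and \(a_2 + n_2 = 0.5\) (the troublesome case), where the integrand falls off only like \(\phi^{-1/2}\):

```python
import numpy as np
from scipy import integrate
from scipy.special import gammaln

a, b = 7.0, 0.5  # a1 + n1 = 7 (illustrative); a2 + n2 = 0.5 (the bad case)

def integrand(phi):
    # phi * p(phi | n); the normalizing constant still exists since b > 0
    log_val = (gammaln(a + b) - gammaln(a) - gammaln(b)
               + a * np.log(phi) - (a + b) * np.log1p(phi))
    return np.exp(log_val)

for cutoff in [1e2, 1e4, 1e6]:
    partial, _ = integrate.quad(integrand, 0, cutoff, limit=500)
    print(cutoff, partial)  # grows roughly like sqrt(cutoff)
```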

The resolution of this problem will have to wait for another blog post (in other words, I don’t know how to work things out in this case)! If you have an idea, let me know!

Landon Lehman