Bayesian statistics: Common sense formalized

Blogger picture

In the last posts (first, second), I outlined a number of common errors in the usage and interpretation of P-values. Due to the base-rate fallacy or the multiple comparisons problem, the significance level $\alpha$ of a null-hypothesis significance test can easily be an order of magnitude lower than the true false positive rate. For example, under $p<0.05$, we could easily have a 50% error rate. These issues are one of the primary causes of the replication crises currently shaking psychology and medical science, where replication studies have found that the majority of significant results examined are insignificant upon replication. Fisheries and marine policy have many of the same risk factors driving the unreliability of scientific findings in those other fields, but no one has actually attempted to do any replication studies yet.

Other common techniques such as confidence intervals and likelihood-ratio tests ameliorate some of the shortcomings of P-values and null-hypothesis significance tests. However, these techniques still suffer from most of the really severe problems, like the base-rate fallacy. The reason behind these deficiencies is that all of these methods follow the frequentist statistical paradigm. In the frequentist paradigm, the probability of some event is considered to be the proportion of times, after repeating the experiment an infinite number of times, that the event occurs.

Suppose we wish to estimate the mean rockfish length in an MPA. In typical frequentist inference, the results of an experiment (here, the observed fish lengths) are assumed to be random. The probability of observing some data is the proportion of times that data occurs after an infinite number of trials. However, the hypotheses (e.g., the true mean fish length being a particular value) that we are trying to infer with our experiment are considered unknown but not random. This is why frequentist methods cannot give you the probabilities of competing hypotheses, since the hypotheses aren't considered to be random in the first place. For the same reason, frequentist techniques make it very hard to use prior information, which is the underlying cause behind the multiple-comparisons problem and the base-rate fallacy.

Fortunately, Bayesian statistical methods offer not only powerful solution to these problems, but also a different, more intuitive way of thinking about probability and evidence than the frequentist paradigm. Bayesian methods are the only way of coherently combining past information with new information, which allows it to more easily avoid issues like this one:

(Don't take this cartoon too seriously. It is a hilariously bad example of the base-rate fallacy, not an indictment of frequentist inference in general.; Randall Munroe, CC Attribution-NonCommercial License 2.5)

In the Bayesian philosophy, probability represents degrees of belief. While this is ultimately just a philosophical difference, the two paradigms do tend to result in very different methods and interpretation in practice.

Bayesian statistics considers the possible hypotheses (here, the true mean fish length) to be random because we don't know which one is true. Before the experiment, we have prior knowledge about the mean fish length (maybe that we know that mean fish lengths in most areas are between 75 and 85 cm), encoded in a prior probability distribution of the mean. We then observe a sample of fish. This gives us more information, which we combine with our already-known prior distribution using Baye's theorem (more about that in a second) to form the posterior distribution encompassing all of our knowledge after doing the experiment. So, after doing our experiment, we don't have a single point estimate of the mean, but a probability distribution (since we still don't know it exactly).

Baye's theorem is the following: Suppose that we wish to find the probability of some event or hypothesis $A$ (for example, the mean length of rockfish in an MPA being a certain value) based on prior information giving the prior probability $P(A)$ of $A$ along with new information: We observed some event $B$ (say, the sample mean length of a sample of fish), and we know the likelihood $P(B|A)$ of observing $B$ given that $A$ is true. This likelihood is what frequentist inference gives us, but with Baye's theorem, we can reverse $P(B|A)$ to get the probability of $A$ given that $B$ occurred (this is what we actually want):

$\mathrm{P}(A|B) = \frac{P(B|A)P(A)}{P(B)}$.

The quantity $\mathrm{P}(B)$ is the marginal probability of observing $B$ independent of $A$. It just acts as a normalizing constant so the probabilities all sum to one. It can be found by summing $P(B|A)P(A)$ over all possible $A$ (or integrating if there are infinite $A$, like if it is a population mean that could be any positive value). The important part is the numerator: We are simply multiplying the prior probability $\mathrm{P}(A)$ by the likelihood $\mathrm{P}(B|A)$ of observing the data $B$ given that the hypothesis $A$ is true. So if we consider the event $A$ to be a priori improbable, we need a high likelihood $\mathrm{P}(B|A)$ from our experiment to overcome our prior beliefs about $A$. For example, if we knew that the average adult rockfish in most areas is nearly always around 90 cm, it might take a very large sample to convince us that it is more like around 110 cm.

Baye's theorem is actually just a generalization of common-sense logical rule of contrapositivity (if $A$ implies $B$, then "not $B$" implies "not $A$") to when we are uncertain about $A$ and $B$. Baye's theorem is something you qualitatively employ every day without realizing it. Bayesian statistics just formalizes it.

A simple example

Let's go back to rockfish lengths. Suppose we wish to estimate the mean rockfish length in an MPA. With Bayesian inference, we start with a model that describes our problem. Let's do that.

First, suppose we know that adult rockfish lengths in any population are approximately normally distributed (this isn’t true, but it makes for a good simple example). Given this, we know that the population inside the MPA is normal with some unknown mean $\mu$ and standard deviation $\sigma$. For simplicity in this example, let's assume that we actually know the standard deviation $\sigma$ of rockfish lengths in all populations is exactly 30 cm, but we don't know the mean.

Unlike the frequentist hypothesis testing, we are not framing it as a choice between a null hypothesis and an alternative hypothesis. Hypothesis testing (whether frequentist or Bayesian) doesn't make very much sense for this problem, since there are more than two alternatives.

The first task is to quantify our prior knowledge of the distribution of fish lengths in the MPA. We also know the standard deviation exactly (30 cm), so we just need to figure out how to represent our uncertainty of the mean fish length. We represent this uncertainty as a prior distribution over the mean. I emphasize that the prior is a distribution of the mean parameter $\mu$ of fish lengths, not a distribution of the fish lengths themselves.

The normal distribution is actually a good choice for the prior distribution of the mean, because of the central limit theorem. So, we need a prior mean $\mu_0$ and a prior standard deviation $\sigma_0$ to encode our prior knowledge of the mean in a normal distribution. Let’s say we know that similar MPAs tend to have average lengths around 80 cm, and have a standard deviation of around 2 cm. In other words, about ⅔ of similar MPAs have average lengths between 78 and 82 cm. From this, we can represent our prior beliefs of mean fish length as being approximately normal with mean $\mu_0 = 80$ cm and standard deviation $\sigma_0 = 2$ cm. The mean fish lengths can't of course be exactly normal, because then fish could have negative lengths. But, it doesn't matter too much here, since there is about a 95% chance that the average fish length $\mu$ in the MPA is between $80 \pm 2 \times 2 = 80 \pm 4$.

Next, we collect our fish length data, which we know is normally distributed. Suppose that we took a sample of 100 fish, yielding a sample mean length of $\bar{x}=90$ cm. We already know the standard deviation $\sigma$ of fish lengths (as opposed to the mean) is exactly 30 cm, so we don’t have to worry about the sample standard deviation. To summarize, we have the following model,

$\mathrm{P}(\mu) \sim \mathrm{Normal}(80,2)$

$\mathrm{P}($length of fish $i = x | \mu) \sim \mathrm{Normal}(\mu,\sigma=30)$

The first line is our prior distribution of mean length $\mu$. The second is the likelihood of observing a fish in our sample with length $x$, given that the mean is some $\mu$ taken from the prior distribution. For our sample of 100 fish, there are 100 such likelihood terms for each fish $i$. Let's just write the total likelihood of observing all the fish that we did, given $\mu$, as $\mathrm{P}(D|\mu)$.

Now, we find the posterior distribution of mean fish lengths using Baye's theorem:

$$\mathrm{P}(\mu|D) =\frac{P(D|\mu)P(\mu)}{\mathrm{P}(D)},$$

Happily, in this case, the normalizing constant $\mathrm{P}(D)$ can be integrated out analytically. It turns out that the posterior is also a normal distribution:

$$\mathrm{P}(\mu|D) = \mathrm{Normal}(\mu_1,\sigma_1),$$


$$\sigma _1^2 = \frac{1}{n/\sigma^2 + 1/\sigma_0^2} = \frac{1}{100/30^2 + 1/2^2}\approx 2.8$$


$$ \mu_1 = \frac{1}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}\left(\frac{\mu_0}{\sigma_0^2} + \frac{n \bar{x}}{\sigma^2}\right) $$ $$ = \frac{1}{\frac{1}{2^2} + \frac{100}{30^2}}\left(\frac{80}{2^2} + \frac{100 \times 90}{30^2} \right) \approx 83$$

The posterior mean of the fish length mean is an average of the prior mean and the sample mean from the data, weighted by the number of observations and standard deviations $\sigma_0=2$ of the prior, and $\sigma=30$ of the fish lengths. So with this analysis, we now think that the average fish length in the MPA is about 83 cm. Our prior information was that mean fish lengths are close to 80 cm the vast majority of the time. Thus, our observed sample mean of 90 cm was most probably due to happening to observe some large fish by chance from an MPA that probably has somewhat above average fish lengths. The corresponding frequentist estimate is 90 cm (the sample mean). This does not take into account prior information. Naively interpreted, the frequentist estimate jumps to the conclusion that the MPA has an incredibly large average fish length, 90 cm. That is 5 standard deviations above the mean, which should essentially never happen. We would come to a completely wrong conclusion about the effectiveness of this MPA with the frequentist estimate. The Bayesian result avoids the base-rate fallacy with prior information, following the adage "extraordinary claims require extraordinary evidence."

That was a very simple example. In practice, we would also have a prior distribution over the standard deviation of fish lengths, since we don't actually know the standard deviation (there is still an analytical solution in that case). In the previous example, the normal prior for the mean fish length is a conjugate prior to the normal likelihood $\mathrm{P}($length of fish $i = x | \mu)$ of observing a particular fish length given the mean. When using a prior conjugate to the likelihood, the posterior can be expressed analytically in the same distributional form as the prior. Here, the prior was normal, so the posterior was also normal. The likelihood was also normal, but that only holds in this particular case. For example, the conjugate prior for variance $\sigma^2$ of fish lengths (not the variance of the mean) is the inverse gamma distribution, and the posterior is also inverse gamma, rather than normal. In the last example, we assumed we knew that $\sigma^2$ was exactly $30 \times 30=900$. If we weren't sure about the $\sigma^2$, we could have just encoded our less than exact knowledge by using an inverse gamma prior.

For many problems, there are no analytical exact solutions for the posterior distribution as we had here. The bugaboo is the normalization constant $P(A)$, which is often impossible to calculate exactly. Fortunately, there are a number of free and open-source software packages that offer push-button fast and accurate numerical approximations of posterior distributions. Popular ones include Stan (standalone modeling language with interfaces in in R, Python, and Julia), PyMC3 (my favorite, in Python), and Edward (Python).

Notice that we did this example by thinking about our problem domain and our prior knowledge, and specifying a model that that follows our knowledge of the problem. We represent anything we don't know exactly as a probability distribution, which we just choose to look like our state of knowledge.

These characteristics make the Bayesian model intuitive than typical frequentist null-hypothesis statistical testing, where instead of specifying a model that directly represents our prior knowledge and the experiment, we have to choose from a zoo of statistical tests and techniques with confusing assumptions and unclear relationships to the problem at hand. Due to this unintuitive nature, frequentist inference is often done on the basis of lore and tradition than sound mathematics.


Bayesian methods offer the following advantages over frequentist techniques:

  • They provide a logically consistent way of combining previously known information with new information. Thus, Bayesian statistics are a rigorous way of avoiding the base-rate fallacy. The base-rate fallacy only occurs with frequentist methods because they cannot use prior information in a straightforward way.

  • Bayesian models are more intuitive to correctly specify than frequentist tests.

  • Bayesian inference tells us what we want to know. Using Baye's theorem, we get actual probabilities of competing hypotheses. This is what we need to make rational decisions under uncertainty. Frequentist methods (including P-values and confidence intervals) do not give us this.

  • People think in a Bayesian way. This is perhaps a reason why frequentist techniques like P-values and confidence intervals are so consistently misinterpreted in a Bayesian way. For example, the inverse-probability fallacy from the previous post is the misinterpretation of a P-value as a Bayesian posterior probability.

  • Bayesian statistics completely solves the multiple comparisons problem. The MCP is sidestepped if we use prior information of the probability of hypotheses.

  • Bayesian statistics avoids the HARKing problem. In the frequentist paradigm, we should only test hypotheses that we have a prior reason for suspecting might be true. This knowledge is implicit and ill-defined. Bayesian statistics makes it explicit and rigorous by encoding the knowledge in the prior distribution. A hypothesis formed after observation of the data can be assigned a prior reflecting the fact that we have no a-priori reason to believe it might be true.

Bayesian methods are often criticized for being subjective due to their use of priors. However, all statistical methods are subjective. Anyone who thinks they can banish subjectivity from their data analysis is dangerously fooling themselves. All statistical methods require some kind of prior knowledge about the data. Frequentist inference methods require this prior information to be taken into account in ad-hoc and often qualitative ways, either in experimental design, or in interpretation of the results. Bayesian methods just make this more explicit and intuitive, by encoding prior information in the prior distribution. In addition, the use of a prior distribution is actually conservative. A prior distribution reduces the amount that the data informs our belief. A noisy or small sample will change our beliefs less than a large sample.

One may wonder why frequentist methods are dominant in science despite their shortcomings. The reasons behind this have mostly to do with inertia. Bayesian methods tend to be more computationally intensive. Before the advent of fast computers, frequentist methods were the only tractable ways to perform statistical inference in many cases. But now, computers are very fast and they are very cheap, so this advantage of frequentist statistics is largely moot. Apart from computational reasons, scientists are usually only trained in frequentist methods. Most scientists are loathe to spend time learning abstract things not directly related to their work. Lastly, reviewers are often suspicious of methodologies unfamiliar to them, which further discourages the use of Bayesian statistics.

However, this is changing, albeit slowly. In geophysics, Bayesian methods are seeing increased use for inverse problems such as imaging the Earth's interior with seismic waves, due to their resistance to overfitting with noisy data, as well as the well-calibrated uncertainty estimates they provide. For the same reasons, Bayesian methods (or various approximations) are the norm in artificial intelligence and machine learning. Bayesian methods have long been used for weather forecasting: The parameters of the equations of weather physics are considered random, and their distributions are updated using Baye's theorem on a nearly continuous basis with new meteorological data to produce weather forecasts.

Frequentist statistics are a valid tool in many situations. Most of the problems with P-values that I've laid out can be avoided by properly applying and interpreting frequentist methods. However, doing so tends to be complex, error prone, and highly unintuitive. Bayesian methods and their equivalent frequentist techniques also give very similar answers if the data are highly informative compared to the prior. Many frequentist methods do in fact have some kind of Bayesian interpretation. Often things like confidence intervals are equivalent to Bayesian techniques with certain choices of prior distributions, often uninformative priors which attempt to encode a lack of prior information. However, even these priors contain information, thus subjectivity still remains.

Bayesian statistics are a large subject that won't fit into this post. In future posts I will give in-depth examples of Bayesian inference, including how it can be done on a computer.