Practical Bayesianism: The Sunrise Problem and Bernoulli Distributions

Hey guys. I’m going to try something a little different, and I would really appreciate feedback. This series will cover a few simple formulas that Bayesian epistemologists can use to estimate probabilities in real life. This first post starts with the case of estimating Bernoulli random variables.

Disclaimer: I’m not a statistician and may not be able to answer every question on this topic. I am a math major, though, and I will do my best. Also, while I write some slightly non-rigorous computations involving probability density functions, they are easy to make rigorous with cumulative distribution functions.

Note of advice: This post uses LaTeX! In order to view it properly, you should click over to my blog.

Prerequisites: This post assumes knowledge of Bayes’ Theorem and integration.

TL;DR: If you observe an event with a fixed but unknown probability happen $k$ times over $n$ trials, then (starting from a uniform prior) the probability that it will happen in the next trial is $\frac{k+1}{n+2}$.

The Problem: Suppose a child wakes up and notices that every day the sun rises. Over the $N$ days he’s been alive, the sun has risen $N$ times. A natural question for the child is “What is the probability that the sun rises tomorrow?” The naïve answer, I’d expect, is $1$: the sun will almost surely rise. Those who have experience with probability theory may reject this simple answer for one reason: in a sense, $1$ and $0$ are not truly valid probabilities. See this LessWrong post for more details as to why.

A less naïve answer is that we can calculate it using our certainties in the laws of physics, probability distributions over the sun’s place in the sky, etc. In an everyday sense, however, this is somewhat impractical. While the laws of physics may be well known and allow us to bound our probabilities, they are nevertheless computationally complex. In the long run, we’d like to generalize insights from this problem to other problems, perhaps regarding human psychology, a famously tricky subject. In those problems the underlying patterns may not be so well known, and a calculation that is at least possible in principle for the sun, if not feasible, becomes impossible with our current level of technology.

The Model: Thus, let us try to calculate the probability that the sun will rise purely from the measurements we’ve taken: the sun has risen $N$ times on $N$ days. We will model the sun rising as a Bernoulli random variable. Let the sun rise with probability $p$, so that the probability that the sun does not rise is $1-p$. Formally, the Bernoulli distribution is defined on $\{ 0, 1\}$, where $0$ has probability $p$ and $1$ has probability $1-p$. In this post, then, $0$ stands for “the sun rises”. Note that $p$ characterizes the entire Bernoulli distribution, which makes this nearly the simplest probability distribution we can work with.

Let us also consider the problem in slightly greater generality. Suppose $n$ samples are independently drawn from a Bernoulli distribution and $k$ of them are $0$ (which is defined to have probability $p$). What is the probability that the next sample is also $0$? Let’s start with what we know: the probability that exactly $k$ of $n$ independent samples from a Bernoulli distribution with parameter $p$ are $0$ is given by:

$$P(k \text{ samples out of } n = 0| p) = {n \choose k} p^k (1-p)^{n-k}$$
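As a sanity check on this formula, here is a minimal Python sketch that compares it against a brute-force sum over every possible sequence of samples. The function names and the values of `n` and `p` are my own illustrative choices, not part of the original post, and the code follows the post's convention that $0$ is the outcome with probability $p$.

```python
from itertools import product
from math import comb, isclose


def binomial_likelihood(k, n, p):
    """P(exactly k of n samples are 0 | p), where each sample is 0 with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def brute_force_likelihood(k, n, p):
    """Add up the probability of every length-n sequence that contains exactly k zeros."""
    total = 0.0
    for seq in product((0, 1), repeat=n):
        zeros = seq.count(0)
        if zeros == k:
            # A fixed sequence with `zeros` zeros has probability p^zeros * (1-p)^(n-zeros).
            total += p**zeros * (1 - p) ** (n - zeros)
    return total


# Illustrative values only (assumptions, not taken from the post).
n, p = 6, 0.3
for k in range(n + 1):
    assert isclose(binomial_likelihood(k, n, p), brute_force_likelihood(k, n, p))
```

The brute-force version is exponential in `n`, so it is only useful as a check for small cases.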

The formula holds because every fixed sequence with $k$ $0$s and $n-k$ $1$s has the same likelihood, $p^k (1-p)^{n-k}$, and there are ${n \choose k}$ such sequences. The reader is invited to check the details.

In order to invert this and find a posterior distribution for $p$ given the information we have, we can use Bayes’ theorem. To use it, we must first choose a prior on $p$, the parameter we defined as part of the Bernoulli distribution. Here the post becomes a little complicated intuitively: $p$ should be thought of as a parameter that characterizes the Bernoulli distribution. We can’t get at $p$ directly; we can only estimate it via the trials we’ve measured. So our prior is actually a probability distribution for $p$, which will eventually give us our Bernoulli distribution, which we will use to get our probability. This is a bit of a roundabout way to go and I’ve summarized it in the diagram below:

We’ll talk more about prior selection in a later post. For now, because we know nothing about $p$ except that it is a probability in $(0,1)$, we pick a prior distribution for $p$ that assumes the least information – the uniform distribution on $(0,1)$ denoted $U(0,1)$.

The Math:

Now consider Bayes’ theorem:

$$P(p | k \text{ samples out of } n = 0) = \left(\frac{ P(k \text{ samples out of } n = 0| p)}{ P(k \text{ samples out of } n = 0)}\right) P(p)$$

By $P(p)$ in this case, we mean the probability that the parameter which characterizes the Bernoulli distribution is equal to $p$. Because that parameter is continuous, this only makes sense in terms of densities. To simplify notation, let the prior probability density function of $p$ be $\sigma(p)$ and the posterior probability density function of $p$ be $\sigma'(p)$. Using differentials, $P(p)$ is $\sigma(p) dp$. Likewise, $P(p | k \text{ samples out of } n = 0) = \sigma'(p) dp$. So:

$$\sigma'(p) dp = \left(\frac{ P(k \text{ samples out of } n = 0| p)}{ P(k \text{ samples out of } n = 0)}\right) \sigma(p) dp$$

We have $P(k \text{ samples out of } n = 0| p)$ from above, so we just need to find $P(k \text{ samples out of } n = 0)$. This is going to depend on the prior distribution and will come out to:

$$ P(k \text{ samples out of } n = 0) = \int_{0}^{1} P(k \text{ samples out of } n = 0| p) \sigma(p) dp$$

$$ P(k \text{ samples out of } n = 0) = \int_{0}^{1} {n \choose k} p^k (1-p)^{n-k} \sigma(p) dp$$

For the uniform prior we have $\sigma(p) = 1$ and so:

$$ P(k \text{ samples out of } n = 0) = \int_{0}^{1} {n \choose k} p^k (1-p)^{n-k} dp$$

This is a tough integral to evaluate. There are a number of ways to go about it, but here is a probabilistic way. The integral corresponds to the probability that if we pick $n+1$ random variables $X_0, X_1, \ldots, X_n$ where each $X_i \sim U(0,1)$ (each variable is uniformly distributed), $X_0$ will be the $(k+1)$th smallest. You can see this by considering how you might write a computer simulation of the situation described: first you pick your $p$ uniformly (this is $X_0$), then you generate $n$ numbers uniformly from $(0,1)$, marking $0$ if the number is less than $p$ and $1$ if it’s greater. Getting exactly $k$ $0$s means exactly $k$ of $X_1, \ldots, X_n$ fall below $X_0$. By symmetry, $X_0$ is equally likely to land in any of the $n+1$ positions in the sorted order, so the probability is $\frac{1}{n+1}$.

So in summary:

$$\sigma'(p) dp = \left(\frac{{n \choose k} p^k (1-p)^{n-k}}{\frac{1}{n+1}}\right) \sigma(p) dp$$

$$\sigma'(p) dp = (n+1) {n \choose k} p^k (1-p)^{n-k} dp$$
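The simulation described in the symmetry argument can be written out directly. The following is a rough Monte Carlo sketch in Python, not a proof: the function name, the seed, and the choice of `n = 4` with 200,000 trials are my own assumptions for illustration. If the argument is right, every count $k$ from $0$ to $n$ should show up with frequency close to $\frac{1}{n+1}$.

```python
import random
from collections import Counter


def simulate_counts(n, trials, seed=0):
    """Estimate P(k of n samples are 0) for each k when p is drawn from U(0,1)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = Counter()
    for _ in range(trials):
        p = rng.random()  # pick p uniformly; this plays the role of X_0
        # Draw n uniform numbers and mark a 0 whenever the number is less than p.
        k = sum(1 for _ in range(n) if rng.random() < p)
        counts[k] += 1
    return {k: counts[k] / trials for k in range(n + 1)}


# Illustrative run (n and trials are arbitrary choices).
freqs = simulate_counts(n=4, trials=200_000)
for k, freq in freqs.items():
    print(k, round(freq, 3))  # each frequency should be near 1/(n+1) = 0.2
```

The estimates fluctuate from run to run, so treat them as evidence for the $\frac{1}{n+1}$ value rather than as a derivation of it.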

Cancelling out the differentials and rearranging using the identity $(n+1) {n \choose k} = (k+1) {n+1 \choose k+1}$ (both sides equal $\frac{(n+1)!}{k!\,(n-k)!}$), we have:

$$\sigma'(p) = (k+1) {n+1 \choose k+1} p^k (1-p)^{n-k}$$

But wait! This gives us $\sigma'(p)$, our posterior distribution for $p$. What we actually wanted was the probability that the next number picked would be $0$. We can compute this from $\sigma'(p)$:

$$P(\text{next number is }0) = \int_{0}^{1} p \sigma'(p) dp$$

$$ P(\text{next number is }0) = \int_{0}^{1} (k+1) {n+1 \choose k+1} p^{k+1} (1-p)^{n-k} dp$$

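$$ P(\text{next number is }0) = (k+1) \left(\int_{0}^{1} {n+1 \choose k+1} p^{k+1} (1-p)^{n-k} dp \right)$$

Wait a second… we just computed this integral! It is the integral from before with $n$ replaced by $n+1$ and $k$ replaced by $k+1$ (note that $(n+1)-(k+1) = n-k$), so it equals $\frac{1}{(n+1)+1}$. So we have:

$$P(\text{next number is }0) = (k+1)\left(\frac{1}{(n+1)+1}\right) = \frac{k+1}{n+2}$$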
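For readers who would like to double-check the algebra, here is a small exact check in Python using rational arithmetic. It relies on the standard closed form $\int_0^1 p^a (1-p)^b \, dp = \frac{a!\, b!}{(a+b+1)!}$ for non-negative integers $a$ and $b$, which is a known result that this post does not derive (the post uses the symmetry argument instead). The function names are my own.

```python
from fractions import Fraction
from math import comb, factorial


def beta_integral(a, b):
    """Exact integral of p^a (1-p)^b over [0, 1] for non-negative integers a, b.

    Uses the standard closed form a! b! / (a+b+1)!, assumed here rather than derived.
    """
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def marginal(k, n):
    """P(k of n samples are 0) under the uniform prior."""
    return comb(n, k) * beta_integral(k, n - k)


def predictive(k, n):
    """P(next sample is 0): the integral of p * sigma'(p) with
    sigma'(p) = (n+1) C(n,k) p^k (1-p)^(n-k)."""
    return (n + 1) * comb(n, k) * beta_integral(k + 1, n - k)


# Check both results for all small cases.
for n in range(8):
    for k in range(n + 1):
        assert marginal(k, n) == Fraction(1, n + 1)
        assert predictive(k, n) == Fraction(k + 1, n + 2)
```

Because everything is done with `Fraction`, the comparisons are exact rather than approximate, although they only cover the small cases in the loop.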

The Result: So if we want to calculate the probability the sun will rise tomorrow, we set $k = n = N$ and get:

$$P(\text{the sun will rise tomorrow}) = \frac{N+1}{N+2}$$

More generally we have $(k+1)/(n+2)$, a neat formula that is easy to memorize. Keep this in mind whenever you come across a situation in real life where you want to estimate probabilities. For example:

My friend tends to lie when his super hot girlfriend from Canada is involved. Four out of the five times I’ve asked him about her, he deflected. What’s the probability he’ll deflect the next time I ask?

When I open up a conversation on OkCupid with a question, I’ve been ignored 17 out of 30 times and been given a date 4 out of those 30 times. What’s the probability that the next time I lead with a question I’ll get a date?

When I’ve asked Bob to communicate a message to Alice without Eve finding out, he’s done it correctly 9 out of 15 times. If he succeeds, the net utility change will be 6 utils according to my utility aggregation function. If he fails, or Eve finds out after I’ve told Bob my message, the net utility change will be -10 utils. What is the net expected utility of asking Bob to undergo this mission, and is it worth it to benefit the world?
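One way to work these three examples is sketched below in Python. It assumes each situation really is a sequence of independent trials with a fixed unknown probability and a uniform prior, which is a strong assumption for human behavior. For the OkCupid example it treats "got a date" versus "did not" as the two outcomes and ignores the separate count of being ignored. For Bob it assumes exactly two outcomes, success (6 utils) and failure (-10 utils). The helper name `rule_of_succession` is my own label; the formula is commonly known as Laplace's rule of succession.

```python
from fractions import Fraction


def rule_of_succession(k, n):
    """Probability the event happens next time, after k occurrences in n trials."""
    return Fraction(k + 1, n + 2)


# Friend deflected 4 out of 5 times.
deflect = rule_of_succession(4, 5)
print(deflect)  # 5/7

# Got a date 4 out of 30 times.
date = rule_of_succession(4, 30)
print(date)  # 5/32

# Bob succeeded 9 out of 15 times; success is worth 6 utils, failure -10 utils.
bob = rule_of_succession(9, 15)
expected_utility = bob * 6 + (1 - bob) * (-10)
print(bob, expected_utility)  # 10/17 -10/17
```

Under these assumptions the expected utility of asking Bob is slightly negative (about -0.59 utils), so the mission would not be worth it as stated.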

Credits: Laplace for coming up with the original problem. Bayes for making a kickass theorem. Alison and evolution-is-just-a-theorem for encouragement.

Tags since this is a sideblog: @sinesalvatorem, @evolution-is-just-a-theorem, @proofsaretalk
