Simulation

Random Variate Generation





Random Variate Generation

Inverse Transform Method

In this lesson, we'll go into some additional detail regarding the inverse transform method.

Inverse Transform Method

The inverse transform method states that, if $X$ is a continuous random variable with cdf $F(x)$, then $F(X) \sim \mathcal{U}(0,1)$. In other words, if we plug a random variable, from any distribution, into its own cdf, we get a Unif(0,1) random variable.

We can prove the inverse transform method. Let $Y = F(X)$. Since $Y$ is a random variable, it has a cdf, which we can denote $G(y)$. By definition:

$$G(y) = P(Y \leq y)$$

Since $Y = F(X)$:

$$G(y) = P(F(X) \leq y)$$

Since $X$ is a continuous random variable, its cdf is continuous (and we assume it is strictly increasing, so that $F^{-1}$ is well defined). Therefore, we can apply the inverse, $F^{-1}$, to both sides of the inequality:

$$G(y) = P(F^{-1}(F(X)) \leq F^{-1}(y))$$

What is $F^{-1}(F(X))$? Simply, $X$:

$$G(y) = P(X \leq F^{-1}(y))$$

Notice that we have an expression of the form $P(X \leq x)$, where $x = F^{-1}(y)$. We know, by definition, $F(x) = P(X \leq x)$, so:

$$G(y) = F(F^{-1}(y)) = y$$

In summary, the cdf of $Y$ is $G(y) = y$. If we take the derivative of the cdf to get the pdf, we see that $g(y) = 1$. Let's remember the pdf for a uniform random variable:

$$f(x) = \left\{ \begin{matrix} \frac{1}{b-a} & x \in [a,b] \\ 0 & \text{otherwise} \end{matrix} \right.$$

If $a = 0, b = 1$, then $f(x) = 1 = g(y)$. Therefore, $Y \sim \mathcal{U}(0,1)$.

How Do We Use This Result?

Let $U \sim \mathcal{U}(0,1)$. Since $F(X) = U$, then $F^{-1}(U) = X$. In other words, if we set $F(X) = U$ and then solve for $X$, we get a random variable from the same distribution as $X$, but expressed in terms of $U$.

Here is how we use the inverse transform method for generating a random variable $X$ having cdf $F(x)$:

1. Sample $U$ from $\mathcal{U}(0,1)$.
2. Return $X = F^{-1}(U)$.

Illustration

Consider the following cdf, $F(x)$.

Let's think about the range of $F(x)$. Since we are dealing with a cdf, we know that $0 \leq F(x) \leq 1$. Note that, correspondingly, $0 \leq U < 1$. This means that, for a given Unif(0,1) random variable, $U$, we can generate an $x$, such that $F(x) = U$, by taking the inverse of $F$ at $U$: $x = F^{-1}(U)$.

Let's look at an example. Suppose we generate $U = 0.9$. Let's draw a horizontal line from $(0, 0.9)$ to the graph of $F(x)$. From there, if we draw a vertical line down to the $x$-axis, the $x$-coordinate of the intersection is $F^{-1}(U)$. In this case, $F^{-1}(0.9) = 3$.

Uniform Example

Consider the $\mathcal{U}(a,b)$ distribution, which has the following cdf:

$$F(x) = \frac{x - a}{b - a}, \quad a \leq x \leq b$$

Let's set $F(X) = U$ and solve for $X$:

$$\begin{alignedat}{1} U &= \frac{X - a}{b - a} \\[2ex] (b - a)U &= X - a \\ a + (b - a)U &= X \end{alignedat}$$

Intuitively, this result makes perfect sense. If we take a Unif(0,1) random variable and multiply it by $b-a$, we end up with a Unif(0, $b-a$) random variable. Then, if we add $a$ to each end, we get a Unif($a$, $b$) random variable.

Exponential Example

Consider the $\text{Exp}(\lambda)$ distribution, which has the following cdf:

$$F(x) = 1 - e^{-\lambda x}, \quad x \geq 0$$

Let's set $F(X) = U$ and solve for $X$:

$$\begin{alignedat}{1} U &= 1 - e^{-\lambda X} \\ 1 - U &= e^{-\lambda X} \\ \ln(1 - U) &= -\lambda X \\ \frac{\ln(1 - U)}{-\lambda} &= X \end{alignedat}$$

Also, we know that the expression $1-U$ is itself Unif(0,1), so we can simplify:

$$X = \frac{\ln(U)}{-\lambda}$$
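
Here's a minimal Matlab sketch of this transformation; the rate lambda and the sample size m are arbitrary choices for illustration:

lambda = 2;               % arbitrary rate parameter
m = 100000;               % number of variates to generate
U = rand(1, m);           % Unif(0,1) draws
X = -log(U) / lambda;     % X = -ln(U)/lambda ~ Exp(lambda)
histogram(X)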

Inverse Transform Method - Continuous Examples

In this lesson, we'll apply the inverse transform method to trickier continuous examples.

Weibull Example

Let's start with the Weibull distribution, which has the following cdf:

$$F(x) = 1 - e^{-(\lambda x)^\beta}, \quad x \geq 0$$

We can solve $F(X) = U$ for $X$:

$$\begin{alignedat}{1} U &= 1 - e^{-(\lambda X)^\beta} \\ 1 - U &= e^{-(\lambda X)^\beta} \\ -\ln(1 - U) &= (\lambda X)^\beta \\ [-\ln(1 - U)]^{1/\beta} &= \lambda X \\ \frac{1}{\lambda}[-\ln(1 - U)]^{1/\beta} &= X \end{alignedat}$$

As we saw before, we know that the expression $1-U$ is itself Unif(0,1), so we can simplify:

$$X = \frac{1}{\lambda}[-\ln(U)]^{1/\beta}$$

Notice that, if $\beta = 1$, then $X = -\ln(U) / \lambda \sim \text{Exp}(\lambda)$. We say that the Weibull distribution generalizes the exponential distribution.
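
As a quick check, here's a small Matlab sketch of the Weibull transform; the parameter values are arbitrary:

lambda = 1.5; beta = 2;                   % arbitrary Weibull parameters
m = 100000;
U = rand(1, m);
X = (1/lambda) * (-log(U)).^(1/beta);     % inverse transform for the Weibull
histogram(X)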

Triangular Example

Now let's look at the triangular distribution. This distribution is important when we don't have a lot of data. For example, suppose we ask someone how long it takes to complete a particular job. Without much data, all they might be able to tell us is the minimum, maximum, and most common (mode) amounts of time. A triangular distribution - parameterized by these three values - is often a good first approximation.

Let's specifically consider the triangular(0,1,2) distribution, which has the following pdf:

$$f(x) = \left\{ \begin{matrix} x & 0 \leq x < 1 \\ 2 - x & 1 \leq x \leq 2 \end{matrix} \right.$$

If we integrate, we get the following cdf:

$$F(x) = \left\{ \begin{matrix} x^2 / 2 & 0 \leq x < 1 \\ 1 - (x-2)^2 / 2 & 1 \leq x \leq 2 \end{matrix} \right.$$

Since the cdf contains distinct expressions for different ranges of $x$, we need to transform each expression independently. How do we know which uniforms belong to which expressions? Well, the cdf evaluates the top expression for the first half of the possible $x$-values and the bottom expression for the second half (note that $F(1) = 1/2$), so we can similarly divide the unit interval in half.

Specifically, if $U < 1/2$, we solve $X^2/2 = U$ to get $X = \sqrt{2U}$. Note that the square root technically has both a negative and a positive root, but we only consider the positive root since we know that $X$ must be between zero and one.

If $U \geq 1/2$, we solve the second expression for $X$:

$$\begin{alignedat}{1} U &= 1 - (X-2)^2 / 2 \\ 1 - U &= (X-2)^2 / 2 \\ 2(1 - U) &= (X-2)^2 \\ -\sqrt{2(1 - U)} &= X-2 \\ 2 - \sqrt{2(1 - U)} &= X \end{alignedat}$$

Technically, we could have a $\pm$ in front of the square root, but it doesn't make sense to consider the positive root since $X$ must be between one and two. We only consider the negative root to keep $X$ within bounds.

Be aware that we cannot replace $1-U$ with $U$ in this example. We need $1-U$ to keep $X$ bounded appropriately. For example, suppose we made the replacement and drew $U = 1$. Then, $X = 2 - \sqrt{2} < 1$, which doesn't make sense since we know that $X$ must be between one and two.

In any event, let's look at two simple examples. If $U = 0.4$, we evaluate $X = \sqrt{2U} = \sqrt{0.8}$. Alternatively, if $U = 0.5$, then $X = 2 - \sqrt{2(1-0.5)} = 2 - 1 = 1$.

Inverse Transform Method - Continuous Examples - DEMO 1

In this demo, we'll look at some Matlab code that applies the inverse transform method to simulate a triangular(0,1,2) random variable. Here's the code.

m = 100;                        % number of variates to generate
X = [];
for i=1:m
  U = rand;                     % U ~ Unif(0,1)
  if U < 0.5
    X(i) = sqrt(2 * U);         % invert F(x) = x^2/2 on [0,1)
  else
    X(i) = 2 - sqrt(2 * (1-U)); % invert F(x) = 1 - (x-2)^2/2 on [1,2]
  end
end
histogram(X)

First, we initialize an empty array, X, and then we iterate one hundred times. On each iteration, we sample a uniform random variable with the rand function, perform the appropriate transform depending on whether U < 0.5, and then store the result in X. After we finish iterating, we plot a histogram of the elements in X.

Let's take a look at the histogram we created, which slightly resembles a triangle.

Now, let's iterate one hundred thousand times and look at the resulting histogram. This result looks much nicer.

Inverse Transform Method - Continuous Examples - DEMO 2

Now, let's talk about generating standard normal random variates. Remember our general formula: to compute $F^{-1}(U)$, we first set $F(X) = U$ and then solve for $X$.

Unfortunately, the inverse cdf of the normal distribution, $\Phi^{-1}(\cdot)$, does not have an analytical form, which means we can't use our normal approach. This obstacle frequently occurs when working with the inverse transform method.

Generating Standard Normal Random Variables

One easy solution to get around this issue is to look up the values in a standard normal table. For instance, if $U = 0.975$, then the value of $Z$ for which $\Phi(Z) = 0.975$ is $1.96$. If we want to compute $\Phi^{-1}(0.975)$ in Excel, we can execute this function: NORMSINV(0.975).

We can also use the following crude approximation for $Z$, which gives at least one decimal place of accuracy for $0.00134 \leq U \leq 0.98865$:

$$Z = \Phi^{-1}(U) \approx \frac{U^{0.135} - (1-U)^{0.135}}{0.1975}$$

Here's a more accurate, albeit significantly more complicated, approximation, which has an absolute error $\leq 0.45 \times 10^{-3}$:

$$Z = \text{sign}(U - 1/2)\left(t - \frac{c_0 + c_1t + c_2t^2}{1 + d_1t + d_2t^2 + d_3t^3}\right)$$

where

$$\text{sign}(x) = \left\{ \begin{matrix} -1 & x < 0 \\ 0 & x = 0 \\ 1 & x > 0 \end{matrix} \right., \qquad t = \sqrt{-2\ln[\min(U, 1-U)]},$$

$$c_0 = 2.515517, \quad c_1 = 0.802853, \quad c_2 = 0.010328,$$

$$d_1 = 1.432788, \quad d_2 = 0.189269, \quad d_3 = 0.001308$$
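
Here's a small Matlab sketch of these two approximations; with U = 0.975, both expressions should return a value close to 1.96:

U = 0.975;

% Crude approximation, valid for 0.00134 <= U <= 0.98865
Z_crude = (U^0.135 - (1-U)^0.135) / 0.1975;

% More accurate rational approximation (absolute error <= 0.00045)
c0 = 2.515517; c1 = 0.802853; c2 = 0.010328;
d1 = 1.432788; d2 = 0.189269; d3 = 0.001308;
t = sqrt(-2 * log(min(U, 1-U)));
Z_accurate = sign(U - 0.5) * (t - (c0 + c1*t + c2*t^2) / (1 + d1*t + d2*t^2 + d3*t^3));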

Transforming Standard Normals to Other Normals

Now, suppose we have $Z \sim \text{Nor}(0,1)$, and we want $X \sim \text{Nor}(\mu, \sigma^2)$. We can apply the following transformation:

$$X \leftarrow \mu + \sigma Z$$

Note that we multiply $Z$ by $\sigma$, not $\sigma^2$!

Let's look at an example. Suppose we want to generate $X \sim \text{Nor}(3,16)$, and we start with $U = 0.59$. We know that:

$$X = \mu + \sigma Z$$

Remember that $Z = \Phi^{-1}(U)$:

$$X = \mu + \sigma \Phi^{-1}(U) = \mu + \sigma \Phi^{-1}(0.59)$$

If we compute $\Phi^{-1}(0.59)$, using one of the methods above, we get $\Phi^{-1}(0.59) = 0.2275$. Now we can plug and chug:

$$X = 3 + 4(0.2275) = 3.91$$

Matlab Demonstration

Let's look at some Matlab code to generate standard normal random variables from Unif(0,1) observations. Consider the following:

m = 100;                        % number of variates to generate
X = [];
for i=1:m
  X(i) = norminv(rand, 0, 1);   % transform a Unif(0,1) into a Nor(0,1)
end
histogram(X)

First, we initialize an empty array, X, and then we iterate one hundred times. On each iteration, we first sample a Unif(0,1) random variable with the rand function. Next, we transform it into a Nor(0,1) random variable using the norminv function, which receives a real number, a mean, and a standard deviation and returns the transformed value. Finally, we store the result in X. After we finish iterating, we plot a histogram of the elements in X.

Let's look at the resulting histogram, which doesn't look like the famous bell curve since we are only drawing one hundred observations.

Let's now sample one hundred thousand uniforms and plot the corresponding histogram. This result looks much prettier.

Inverse Transform Method - Discrete Examples

In this lesson, we will use the inverse transform method to generate discrete random variables.

Discrete Examples

It's often best to construct a table for discrete distributions when looking to apply the inverse transform method, especially if the distribution doesn't take on too many values.

Bernoulli Example

Let's look at the Bern($p$) distribution. $X \sim \text{Bern}(p)$ takes on the value one with probability $p$ and the value zero with probability $1 - p$.

The cdf, $F(x)$, takes on the value $1-p$ at $x = 0$ since $P(X \leq 0) = P(X = 0) = 1 - p$. For $x = 1$, $F(x) = 1$, since $P(X \leq 1) = P(X = 0) + P(X = 1) = 1 - p + p = 1$.

Let's construct a table to see how we might map the unit interval onto these two values of $X$:

$$\begin{array}{cccc} x & P(X = x) & F(x) & \mathcal{U}(0,1)\text{'s} \\ \hline 0 & 1 - p & 1 - p & [0, 1 - p] \\ 1 & p & 1 & (1 - p, 1] \end{array}$$

Since $P(X = 0)$ occurs with probability $1-p$, we transform all uniforms on the range $[0, 1-p]$ to $X = 0$. This leaves a range of uniforms of width $p$, which maps to $X = 1$, so we draw that value with probability $p$.

Here's a simple implementation. If we draw a uniform $U \leq 1 - p$, then we take $X = 0$; otherwise, we take $X = 1$. For example, if $p = 0.75$ and $U = 0.13$, we take $X = 0$ since $U \leq 1 - p = 0.25$.

Alternatively, we can construct the following "backwards" table, starting with $X = 1$ instead of $X = 0$, which isn't a strict application of the inverse transform method. However, this approach slices the uniforms in a much more intuitive manner. Here's the table:

$$\begin{array}{cccc} x & P(X = x) & F(x) & \mathcal{U}(0,1)\text{'s} \\ \hline 1 & p & 1 & [0, p] \\ 0 & 1 - p & 1 - p & (p, 1] \end{array}$$

Here, we take $X = 1$ if $U \leq p$, which occurs with probability $p$, and we take $X = 0$ if $U > p$, which occurs with probability $1 - p$.
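
Here's a one-line Matlab sketch of this rule; p is an arbitrary choice:

p = 0.75;          % arbitrary success probability
U = rand;          % U ~ Unif(0,1)
X = (U <= p);      % X = 1 with probability p, X = 0 otherwise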

Generic Example

Consider this slightly less-trivial pmf:

$$\begin{array}{cccc} x & P(X = x) & F(x) & \mathcal{U}(0,1)\text{'s} \\ \hline -1 & 0.6 & 0.6 & [0.0, 0.6] \\ 2.5 & 0.3 & 0.9 & (0.6, 0.9] \\ 4 & 0.1 & 1.0 & (0.9, 1.0] \end{array}$$

Remember that, for a discrete distribution, $F(x_i) = P(X = x_i) + F(x_{i-1})$, where $F(x_0) = 0$. Correspondingly, the range of uniforms that maps to $x_i$ is $(F(x_{i-1}), F(x_i)]$. For example, if $U = 0.63$, we take $X = 2.5$.
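
Here's a short Matlab sketch of this table lookup, using the values and cdf from the table above:

vals = [-1, 2.5, 4];           % support of X
F    = [0.6, 0.9, 1.0];        % cdf values F(x_i)
U = rand;                      % U ~ Unif(0,1)
X = vals(find(U <= F, 1));     % first x_i whose cdf value is at least U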

Discrete Uniform Example

Sometimes, there's an easy way to avoid constructing a table. For example, consider the discrete uniform distribution over $\{1,2,...,n\}$, which has the following pmf:

$$P(X = k) = \frac{1}{n}, \quad k = 1,2,...,n$$

We can think of this random variable as an $n$-sided die toss. How do we transform a Unif(0,1) correctly? If we take a real number between zero and one and multiply it by $n$, we get a real number between zero and $n$. If we round up, we get an integer between one and $n$. For example, if $n = 10$ and $U = 0.376$, then $X = \lceil 3.76 \rceil = 4$, where $\lceil\cdot\rceil$ is the ceiling/"round up" function.
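
In Matlab, this transform is a one-liner; n is arbitrary:

n = 10;            % arbitrary number of sides
U = rand;          % U ~ Unif(0,1)
X = ceil(n * U);   % discrete uniform on {1, 2, ..., n}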

Geometric Example

Now let's look at the geometric distribution. Intuitively, this distribution represents the number of Bern($p$) trials until the first success. Therefore, the pmf, $f(k)$, represents the probability that the first success occurs on the $k$-th try: $k - 1$ failures, with probability $q^{k-1}$ (where $q = 1 - p$), followed by a single success, with probability $p$.

Accordingly, this distribution has the following pmf and cdf:

$$f(k) = q^{k-1}p, \qquad F(k) = 1 - q^k, \quad k = 1,2,...$$

Let's set $F(X) = U$ and solve for $X$, finding the minimum $k$ such that $1 - q^k \geq U$. After some algebra, we see that:

$$X = \min[k : 1 - q^k \geq U] = \left\lceil\frac{\ln(1-U)}{\ln(1-p)}\right\rceil$$

As we've seen in the past, we can also replace $1-U$ with $U$:

$$X = \left\lceil\frac{\ln(U)}{\ln(1-p)}\right\rceil$$

For instance, if $p = 0.3$ and $U = 0.72$, we obtain:

$$X = \left\lceil\frac{\ln(1- U)}{\ln(1-p)}\right\rceil = \left\lceil\frac{\ln(0.28)}{\ln(0.7)}\right\rceil = 4$$

We can also generate a Geom($p$) random variable by counting Bern($p$) trials until we see a success. Suppose we want to generate $X \sim \text{Geom}(1/6)$. We can generate Bern($1/6$) random variables until we see a one. For instance, if we generate $0,0,0,0,1$, then $X = 5$.

More generally, we generate a Geom($p$) random variable by counting the number of trials until $U_i \leq p$, which occurs with probability $p$ for each $U_i$. For example, if $p=0.3$, and we draw $U_1 = 0.71$, $U_2 = 0.96$, and $U_3 = 0.12$, then $X=3$.
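
Here's a minimal Matlab sketch of both approaches; p is an arbitrary choice:

p = 0.3;                            % arbitrary success probability

% Closed-form inverse transform
X1 = ceil(log(rand) / log(1-p));

% Trial counting: count uniforms until one falls at or below p
X2 = 1;
while rand > p
  X2 = X2 + 1;
end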

Poisson Example

If we have a discrete distribution, like Pois($\lambda$), which has an infinite number of values, we could write out table entries until the cdf is nearly one, generate a uniform, $U$, and then search the table to find $X = x_i = F^{-1}(U)$, such that $U \in (F(x_{i-1}), F(x_i)]$. Consider the following table:

$$\begin{array}{cccc} x & P(X = x) & F(x) & \mathcal{U}(0,1)\text{'s} \\ \hline x_1 & f(x_1) & F(x_1) & [0.0, F(x_1)] \\ x_2 & f(x_2) & F(x_2) & (F(x_1), F(x_2)] \\ x_3 & f(x_3) & F(x_3) & (F(x_2), F(x_3)] \\ \vdots \end{array}$$

For example, if we generate a uniform in the range $(F(x_1), F(x_2)]$, we select $X = x_2$.

Let's look at a concrete example, in which we will generate $X \sim \text{Pois}(2)$, which has the following pmf:

$$f(x) = \frac{e^{-2}2^x}{x!}, \quad x = 0,1,2,...$$

We can construct the following table:

$$\begin{array}{cccc} x & P(X = x) & F(x) & \mathcal{U}(0,1)\text{'s} \\ \hline 0 & 0.1353 & 0.1353 & [0.0, 0.1353] \\ 1 & 0.2706 & 0.4059 & (0.1353, 0.4059] \\ 2 & 0.2706 & 0.6765 & (0.4059, 0.6765] \\ \vdots \end{array}$$

For example, if $U = 0.313$, then $X = 1$.
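
Here's a small Matlab sketch of this search for Pois(2), building the cdf on the fly instead of storing a table:

lambda = 2;
U = rand;                                    % U ~ Unif(0,1)
x = 0;
F = exp(-lambda);                            % F(0)
while U > F                                  % walk down the "table" until F(x) >= U
  x = x + 1;
  F = F + exp(-lambda) * lambda^x / factorial(x);
end
X = x;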

Inverse Transform Method - Empirical Distributions (OPTIONAL)

In this lesson, we will look at how to apply the inverse transform method to unknown, empirical, continuous distributions. Here, we don't know $F(x)$, so we have to approximate it. We will blend continuous and discrete strategies to transform these random variables.

Continuous Empirical Distributions

If we can't find a good theoretical distribution to model a certain random variable, or we don't have enough data, we can use the empirical cdf of the data, $X_1, X_2,...,X_n$:

$$\hat F_n(x) \equiv \frac{\text{number of } X_i\text{'s} \leq x}{n}$$

The empirical distribution places probability $1/n$ on each observation $X_i$ drawn from our unknown continuous distribution. As a result, we can construct a cdf, $\hat F_n$, where $\hat F_n(x)$ is equal to the number of observations less than or equal to $x$, divided by the total number of observations.

Note that $\hat F_n(x)$ is a step function, which jumps by $1/n$ at every $X_i$.

Even though $X$ is continuous, the Glivenko-Cantelli Lemma states that $\hat F_n(x) \to F(x)$ for all $x$ as $n \to \infty$. As a result, $\hat F_n(x)$ is a good approximation to the true cdf, $F(x)$.

We can use the ARENA functions DISC and CONT to generate random variables from the empirical cdf's of discrete and continuous distributions, respectively.

To generate these random variables ourselves, let's first define the ordered points $X_{(1)} \leq X_{(2)} \leq \cdots \leq X_{(n)}$, also called order statistics. Note that we wrap the subscripts in parentheses to differentiate them from the order in which the observations arrived. For example, if $X_1 = 4$, $X_2 = 1$, and $X_3 = 6$, then $X_{(1)} = 1$, $X_{(2)} = 4$, and $X_{(3)} = 6$.

Here's what the empirical cdf looks like, shown in red below:

Notice that we step from zero to $1/3$ at $X=1$, from $1/3$ to $2/3$ at $X=4$, and from $2/3$ to $1$ at $X=6$. If we were to take many observations, then the empirical cdf would approach the true cdf, shown above in green.

Linear Interpolation

Given that we only have a finite number, $n$, of data points, we can attempt to transform the empirical cdf into a continuous function by using linear interpolation between the $X_{(i)}$'s. Here is the corresponding interpolated cdf (which is not the true cdf, even though we denote it as $F(x)$):

$$F(x) = \left\{ \begin{matrix} 0 & \text{if } x < X_{(1)} \\ \frac{i - 1}{n - 1} + \frac{x - X_{(i)}}{(n-1)(X_{(i+1)} - X_{(i)})} & \text{if } X_{(i)} \leq x < X_{(i+1)}, \forall i \\ 1 & \text{if } x \geq X_{(n)} \end{matrix} \right.$$

If $x$ is less than the first order statistic, then $F(x) = 0$. Given our current observations, we have no evidence that we will encounter a value smaller than our current minimum. Similarly, if $x$ is greater than our last order statistic, then $F(x) = 1$. Again, we have no evidence that we will encounter a value greater than our current maximum. For all other points, we interpolate.

Here's the interpolation algorithm. First, we set $F(X) = U \sim \mathcal U(0,1)$. Next, we set $P = (n-1)U$ and $I = \lceil P \rceil$. Note that $I$ is a discrete uniform random variable over $\{1,2,...,n-1\}$. Finally, we evaluate the following equation for $X$:

$$X = X_{(I)} + (P - I + 1)(X_{(I+1)} - X_{(I)})$$

Let's break down the above equation. $X_{(I)}$ corresponds to our random starting order statistic. The expression $P - I + 1$ turns out to be Unif(0,1): we take $P$, subtract the integer it was rounded up to, and add one. We multiply that uniform value by $X_{(I+1)} - X_{(I)}$ to move the appropriate distance between our starting point and the next order statistic.

For example, suppose $X_{(1)} = 1$, $X_{(2)} = 4$, and $X_{(3)} = 6$. If $U = 0.73$, then $P = (n-1)U = 1.46$ and $I = \lceil P \rceil = 2$. Then:

$$\begin{alignedat}{1} X &= X_{(I)} + (P - I + 1)(X_{(I+1)} - X_{(I)}) \\ &= X_{(2)} + (1.46 - 2 + 1)(X_{(3)} - X_{(2)}) \\ &= 4 + (0.46)(6 - 4) \\ &= 4.92 \end{alignedat}$$
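
Here's a small Matlab sketch of this interpolation routine, using the three sorted observations from the example:

Xs = [1, 4, 6];                                % sorted observations X_(1) <= ... <= X_(n)
n = length(Xs);
U = rand;                                      % U ~ Unif(0,1), strictly between 0 and 1
P = (n - 1) * U;
I = ceil(P);
X = Xs(I) + (P - I + 1) * (Xs(I+1) - Xs(I));   % linear interpolation between order statistics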

Let's see what the linearly interpolated cdf looks like, shown below in orange. Notice that, between $X = 1$ and $X=6$, we have two line segments: one from $X = 1$ to $X=4$, and one from $X=4$ to $X=6$.

Suppose we draw a uniform, $U = 0.73$.

If we drop a line from the intersection point with the cdf vertically down, the $x$-coordinate of the intersection with the $x$-axis is $F^{-1}(U)$.

Alternatively, we can write the interpolated cdf explicitly by using the equation we saw previously and enumerating the two $[X_{(i)}, X_{(i+1)}]$ ranges:

$$F(x) = \left\{ \begin{matrix} 0 + \frac{x-1}{2(4-1)} & \text{if } 1 \leq x < 4 \quad (i = 1 \text{ case}) \\[2ex] \frac{1}{2} + \frac{x-4}{2(6-4)} & \text{if } 4 \leq x < 6 \quad (i = 2 \text{ case}) \end{matrix} \right.$$

If we set $F(X) = U$ and solve for both cases, we have:

$$X = \left\{\begin{matrix} 1 + 6U & \text{if } U < 1/2 \\[2ex] 2 + 4U & \text{if } U \geq 1/2 \end{matrix} \right.$$

Again, if $U = 0.73$, then $X = 2 + 4(0.73) = 4.92$.

Convolution Method

In this lesson, we'll look at an alternative to the inverse transform method: the convolution method.

Binomial Example

The term convolution refers to adding things up. For example, if we add up $n$ iid Bern($p$) random variables, we get a Binomial($n$, $p$) random variable:

$$Y = \sum_{i=1}^n X_i \sim \text{Bin}(n,p), \quad \text{where } X_i \overset{\text{iid}}{\sim} \text{Bern}(p)$$

We already know how to generate Bern($p$) random variables using the inverse transform method. Suppose we have a collection of uniforms, $U_1,...,U_n \overset{\text{iid}}{\sim} \mathcal{U}(0,1)$. If $U_i \leq p$, we set $X_i = 1$; otherwise, $X_i = 0$. If we repeat this process for all $X_i$ and then add up the $X_i$'s, we get $Y$.

For instance, suppose we want to generate $Y \sim \text{Bin}(3, 0.4)$. Let's draw three uniforms: $U_1 = 0.63$, $U_2 = 0.17$, and $U_3 = 0.81$. We transform these uniforms into the respective Bern($p$) random variables: $X_1 = 0$, $X_2 = 1$, and $X_3 = 0$. If we add these values together, we get $Y = 1$.
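
Here's a minimal Matlab sketch of this convolution; n and p are arbitrary:

n = 3; p = 0.4;        % arbitrary binomial parameters
U = rand(1, n);        % n iid Unif(0,1) draws
Y = sum(U <= p);       % sum of n Bern(p) variates ~ Bin(n, p)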

Triangular Example

Now let's turn to the Tria(0,1,2) distribution. Remember, we generated triangular random variables using the inverse transform method previously. Alternatively, we can add two iid uniforms together to generate a Tria(0,1,2) random variable. In other words, if $U_1, U_2 \overset{\text{iid}}{\sim} \mathcal U(0,1)$, then $U_1 + U_2 \sim \text{Tria}(0,1,2)$.

Let's look at a pictorial representation of this result.

In the Tria(0,1,2) distribution, values close to either end of the range are quite rare. Indeed, it is highly unlikely that two iid uniforms will both have a value very close to zero or one. Values closer to the middle of the range are much more likely, and the resulting distribution, whose pdf attains its maximum at $X = 1$, reflects this fact.

This result should remind us of dice toss outcomes. Suppose we toss two six-sided dice. It's highly unlikely for us to see a two or a twelve, each of which occurs with probability 1/361/36, but it's much more common to see a seven, which occurs with probability 1/61/6. There are just more ways to roll a seven.

Erlang Example

Suppose $X_1,..., X_n \overset{\text{iid}}{\sim} \text{Exp}(\lambda)$. By definition, the sum of these exponential random variables forms an Erlang$_n(\lambda)$ random variable:

$$Y = \sum_{i=1}^n X_i \sim \text{Erlang}_n(\lambda)$$

Let's use the inverse transform method with convolution to express $Y$ in terms of uniform random variables. First, let's remember how we transform uniforms into exponential random variables:

$$\begin{alignedat}{1} F(x) &= 1 - e^{-\lambda x} \\ U &= 1 - e^{-\lambda X} \\ \vdots \\ X &= \frac{-1}{\lambda}\ln(U) \end{alignedat}$$

Let's rewrite our summation expression to reflect this transformation:

$$Y = \sum_{i=1}^n X_i = \frac{-1}{\lambda}\sum_{i=1}^n \ln(U_i)$$

Now, we can take advantage of the fact that $\ln(a) + \ln(b) = \ln(ab)$ and replace $n$ natural log invocations with just one. Why might we do this? Natural logarithms are expensive to evaluate on a computer, so we want to reduce their number where possible. Consider the following manipulation:

$$Y = \sum_{i=1}^n \left[\frac{-1}{\lambda}\ln(U_i)\right] = \frac{-1}{\lambda}\ln\left(\prod_{i=1}^n U_i\right)$$
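
Here's a minimal Matlab sketch of this Erlang generator; n and lambda are arbitrary:

n = 5; lambda = 2;              % arbitrary Erlang parameters
U = rand(1, n);                 % n iid Unif(0,1) draws
Y = -log(prod(U)) / lambda;     % one log of the product instead of n separate logs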

Desert Island Nor(0,1) Approximate Generator

Suppose we have the random variables $U_1,...,U_n \overset{\text{iid}}{\sim} \mathcal U(0,1)$, and we let $Y$ equal the summation of the $U_i$'s: $Y = \sum_{i=1}^n U_i$. The central limit theorem tells us that, for large enough $n$, $Y$ is approximately normal. Furthermore, we know that the mean of $Y$ is $nE[U_i] = n/2$ and the variance of $Y$ is $n\text{Var}(U_i) = n/12$. As a result, $Y \approx \text{Nor}(n/2, n/12)$.

Let's choose $n=12$ and assume that it's "large enough". Then $Y \approx \text{Nor}(6, 1)$. If we subtract $6$ from $Y$, then the resulting mean is $0$, and we end up with an approximately standard normal random variable:

$$Y - 6 = \sum_{i=1}^{12} U_i - 6 \approx \text{Nor}(0,1)$$
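
Here's the corresponding one-liner in Matlab:

Z = sum(rand(1, 12)) - 6;    % approximately Nor(0,1) by the central limit theorem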

If $X_1,...,X_n$ are iid Geom($p$) random variables, then the sum of the $X_i$'s is a negative binomial random variable:

$$\sum_{i=1}^{n} X_i \sim \text{NegBin}(n,p)$$

If $Z_1,..., Z_n$ are iid Nor(0, 1) random variables, then the sum of the squares of the $Z_i$'s is a $\chi^2$ random variable:

$$\sum_{i=1}^{n} Z_i^2 \sim \chi^2(n)$$

If $X_1,...,X_n$ are iid Cauchy random variables, then the sample mean, $\bar X$, is also a Cauchy random variable. We might think that $\bar X$ is approximately normal for large $n$, but this is not the case: the Cauchy distribution doesn't satisfy the assumptions of the central limit theorem, since its mean and variance don't exist.

Convolution Method DEMO

In this demo, we will look at some of the random variables we can generate using convolution. First, we'll look at how we generate a uniform distribution, which, admittedly, involves no convolution. Consider the following code:

n = 1;                          % number of uniforms to add together
m = 1000000;                    % number of variates to generate
Z = zeros(1, m);                % initialize the running sum to zero

for i=1:n
  Z = Z + unifrnd(0,1,[1 m]);   % add another vector of Unif(0,1)'s
end
Z = Z/n;                        % rescale by n
hist(Z,100);

This code generates the following histogram, which indeed appears uniform.

If we change n from one to two, then Z becomes the sum of two uniforms. This code generates the following histogram, which appears triangular, as expected.

Let's change n to three and repeat. The following histogram appears to be taking on a normal shape.

Let's change n to 12 to emulate the "desert island" normal random variable generator we discussed previously. This histogram resembles the normal distribution even more closely.

Now, let's look at generating Exp(1) random variables. Consider the following code:

n = 1;                          % number of exponentials to add together
m = 1000000;
Z = zeros(1, m);

for i=1:n
  Z = Z - log(unifrnd(0,1,[1 m])); % add another vector of Exp(1)'s; this is the only line that changes
end
Z = Z/n;
hist(Z,100);

Here's the corresponding histogram.

Now, if we change n from one to two, then Z becomes the sum of two exponential random variables, itself an Erlang random variable. Here's the corresponding histogram.

Let's add n=30 exponential random variables, which the central limit theorem tells us will produce an approximately normal result. Here's the histogram. This histogram is clearly not the normal distribution, as it contains considerable right skew.

For a much better approximation, we likely need to add several hundred observations together. Here's the histogram for n=200.

Finally, let's look at generating Cauchy random variables. Consider the following code:

n = 1;                          % number of Cauchy variates to add together
m = 1000;
Z = zeros(1, m);

for i=1:n
  % ratio of two independent Nor(0,1)'s is Cauchy; this is the only line that changes
  Z = Z + normrnd(0.0, 1.0, [1 m]) ./ normrnd(0.0, 1.0, [1 m]);
end
Z = Z/n;
hist(Z,100);

If we generate m = 1000 of these random variables, we get the following histogram. The Cauchy distribution has extremely heavy tails - its variance doesn't even exist - so, even though most of the observations are close to zero, there are a few observations far out on either side.

If we add n = 1000 Cauchy random variables, we might expect to see a histogram approximating the normal distribution. However, as we can see from the following plot, the Cauchy distribution does not obey the central limit theorem: the average is still Cauchy.

Acceptance-Rejection Method

In this lesson, we'll talk about the acceptance-rejection method, which is an alternative to the inverse transform method and the convolution technique we studied previously. In contrast to those two methods, acceptance-rejection is very general, and it always works regardless of the random variate we want to generate.

Acceptance-Rejection Method

The motivation for the acceptance-rejection method is that most cdf's cannot be inverted efficiently, which means that the inverse transform method has limited applicability.

The acceptance-rejection method samples from an "easier" distribution, which is close to but not identical to the one we want. Then, it adjusts for this discrepancy by "accepting" (keeping) only a certain proportion of the variates it generates from the approximating distribution.

The acceptance proportion is based on how close the approximate distribution is to the one we want. If the sampling distribution resembles the desired distribution closely, we keep a high proportion of the samples.

Let's look at a simple example. Suppose we want to generate $X \sim \mathcal U(2/3, 1)$. Of course, we'd usually use the inverse transform method for something so trivial, but let's see how we'd generate $X$ using acceptance-rejection.

Here's the algorithm. First, we generate $U \sim \mathcal U(0, 1)$. Next, we check to see if $U \geq 2/3$. If so, we accept $U$ and set $X=U$, because the conditional distribution of $X$, given that $U \geq 2/3$, is $\mathcal U (2/3, 1)$. If $U < 2/3$, we reject the observation, and we return to step one, where we draw another uniform. Eventually, we'll get $U \geq 2/3$, at which point we stop.

Notation

Suppose we want to simulate a continuous random variable, $X$, with pdf $f(x)$, but $X$ is difficult to generate directly. Also suppose that we can easily generate a random variable that has a distinct pdf, $h(x) \equiv t(x)/c$, where $c$ is a constant and $t(x)$ majorizes $f(x)$.

When we say that $t(x)$ majorizes $f(x)$, we mean that $t(x) \geq f(x)$ for all $x$.

Note that only $h(x)$, not $t(x)$, is a pdf. Since $t(x) \geq f(x)$, and the integral of $f(x)$ over the real line equals one - by virtue of $f(x)$ being a pdf - the corresponding integral of $t(x)$ must be greater than or equal to one. Since $t(x)$ generally doesn't integrate to exactly one, it isn't a pdf.

Let's look at the constant, $c$. We define $c$, which we assume to be finite, as the integral of $t(x)$ over the real line, which we just said is greater than or equal to the corresponding integral of $f(x)$:

$$c \equiv \int_\mathbb{R} t(x)\,dx \geq \int_\mathbb{R} f(x)\,dx = 1, \quad c < \infty$$

Theorem

Let's define a function $g(x) \equiv f(x)/t(x)$. Notice that, since $f(x) \leq t(x)$, we have $0 \leq g(x) \leq 1$ for all $x$. Additionally, let's sample $U \sim \mathcal U(0,1)$, and let $Y$ be a random variable (independent of $U$) with the pdf we defined earlier: $h(y) \equiv t(y) / c$.

If $U \leq g(Y)$, then the conditional distribution of $Y$ has pdf $f(y)$. Remember, the random variable we are trying to generate has pdf $f(y)$. In other words, we accept that $Y$ has the pdf of the random variable we want when $U \leq g(Y)$.

This result suggests the following acceptance-rejection algorithm for generating a random variable $X$ with pdf $f(x)$. First, we generate $U \sim \mathcal U(0,1)$ and $Y$ with pdf $h(y)$, independent of $U$. If $U \leq g(Y)$, we accept $Y$ and we return $X = Y$; otherwise, we reject $Y$ and start over, continuing until we produce an acceptable $(U, Y)$ pair. We refer to the event $U \leq g(Y)$ as the acceptance event, which always happens eventually.

Visual Walkthrough

Consider the following pdf, outlined in green, which is the pdf of the random variable we want to generate.

Now consider $t(x)$, shown in red below, which majorizes $f(x)$. As we can see, $t(x) \geq f(x)$ for all $x$.

Finally, let's consider $h(x)$, which is defined as $t(x) / c$, where $c$ is the area under $t(x)$. As a result, $h(x)$ is proportional to $t(x)$ but has an area equal to one, which makes it a valid pdf.

Let's generate a point, $Y$, uniformly under $t(x)$. We accept that point with probability $g(Y) = f(Y)/t(Y) = f(Y)/[c\,h(Y)]$. For the point below, we can see the corresponding values of $t(Y)$ and $f(Y)$. As the distance between $t(Y)$ and $f(Y)$ decreases, the probability of accepting $Y$ as a valid value of $X$ increases.

Proof of Acceptance-Rejection Method (OPTIONAL)

In this lesson, we will prove the acceptance-rejection method.

A-R Method Recap

Let's define $g(x) \equiv f(x)/t(x)$. Remember that $f(x)$ is the pdf of the random variable, $X$, that we wish to generate, and $t(x)$ majorizes $f(x)$. Let $U \sim \mathcal U(0,1)$, and let $Y$ be a random variable, independent of $U$, with pdf $h(y) = t(y)/c$.

Note that $t(y)$ is not a pdf, but we can transform it into one by dividing it by $c$ - the area under $t(y)$ - which guarantees that it integrates to one.

If $U \leq g(Y)$, we set $X=Y$ because the conditional distribution of $Y$, given $U \leq g(Y)$, has pdf $f$. Again, we refer to $U \leq g(Y)$ as the acceptance event. We continue sampling $U$ and $Y$ until we have an acceptance, at which point we set $X=Y$ and stop.

Proof

We need to prove that the value that comes out of this algorithm, which we claim has pdf $f(x)$, indeed has pdf $f(x)$.

Let $A$ be the acceptance event. The cdf of $X$ is, by definition:

$$P(X \leq x)$$

Given that we have experienced $A$, we have set $X = Y$. As a result, we can express the cdf of $X$ with a conditional probability concerning $Y$ and $A$:

$$P(X \leq x) = P(Y \leq x \mid A)$$

We can expand the conditional probability:

$$P(X \leq x) = P(Y \leq x \mid A) = \frac{P(A, Y \leq x)}{P(A)} \quad (1)$$

Now, what's the probability of the acceptance event, $A$, given $Y$? Well, from the definition of $A$, we see that:

$$P(A \mid Y=y) = P(U \leq g(Y) \mid Y=y)$$

Since $Y=y$, we can substitute $y$ for $Y$:

$$P(U \leq g(Y) \mid Y=y) = P(U \leq g(y) \mid Y=y)$$

Earlier, we stated that $U$ and $Y$ are independent. As a result, information about $Y$ gives us no information about $U$, so:

$$P(U \leq g(y) \mid Y=y) = P(U \leq g(y))$$

Additionally, we know that $U$ is uniform, so, by definition, $P(U \leq x) = x$ for $0 \leq x \leq 1$. Since $g(y)$ also has a range of $[0,1]$, by definition, then:

$$P(U \leq g(y)) = g(y) \quad (2)$$

Now let's consider the joint probability $P(A, Y \leq x)$. Here, we can use the law of total probability, also known as the standard conditioning argument:

$$P(A, Y \leq x) = \int_{-\infty}^\infty P(A, Y \leq x \mid Y=y)h(y)\,dy$$

We can express $Y \leq x$ directly in the limits of integration, instead of in the conditional probability expression inside the integral. If we integrate over all $y$ such that $-\infty < y \leq x$, then we are still only considering values of $Y$ such that $Y \leq x$:

$$P(A, Y \leq x) = \int_{-\infty}^x P(A \mid Y=y)h(y)\,dy$$

We can substitute $t(y)/c = h(y)$:

$$P(A, Y \leq x) = \frac{1}{c}\int_{-\infty}^x P(A \mid Y=y)t(y)\,dy$$

We might remember, from result $(2)$ above, that $P(A \mid Y=y) = g(y)$. Therefore:

$$P(A, Y \leq x) = \frac{1}{c}\int_{-\infty}^x g(y)t(y)\,dy$$

Also remember that, by definition, $g(y) = f(y)/t(y)$. Therefore:

$$P(A, Y \leq x) = \frac{1}{c}\int_{-\infty}^x f(y)\,dy \quad (3)$$

If we let $x \to \infty$, then we have:

$$P(A) = \frac{1}{c}\int_{-\infty}^\infty f(y)\,dy$$

Notice that two things changed here. First, we changed the upper limit of integration from $x$ to $\infty$. Second, we changed the probability expression from $P(A, Y \leq x)$ to $P(A)$. Why? If $x \to \infty$, then $P(Y \leq x) \to 1$, so $P(A, Y \leq x) \to P(A)$.

We know that $f(x)$ is a pdf, and the integral of a pdf over the real line is equal to one, so:

$$P(A) = \frac{1}{c}\int_{-\infty}^\infty f(y)\,dy = \frac{1}{c} \quad (4)$$

Together, facts $(1)$, $(3)$, and $(4)$ imply that:

$$P(X \leq x) = \frac{P(A, Y \leq x)}{P(A)} = \int_{-\infty}^x f(y)\,dy$$

Essentially, we have shown that the cdf of $X$ is equal to the integral of $f(y)$, from $-\infty$ to $x$. If we take the derivative of both sides, we see that the pdf of $X$ is equal to $f(x)$.

There are two main issues here. First, we need the ability to sample $Y$ from $h(y)$ quickly, since we can't sample $X$ from $f(x)$ easily. Second, $c$ must be small, since the probability of the acceptance event is:

$$P(U \leq g(Y)) = \frac{1}{c}$$

Now, $c$ is bounded below by one - we get $c = 1$ only if $t(x) = f(x)$ - so we want $c$ to be as close to one as possible. If so, then:

$$P(U \leq g(Y)) = \frac{1}{c} \approx 1$$

If $c \approx 1$, then we accept almost every $Y$ we sample. If $c$ is very large, then $1/c$ is very small, and we will likely have to draw many $(U, Y)$ pairs before acceptance. In fact, the number of trials until "success", defined as $U \leq g(Y)$, is Geom($1/c$), which has an expected value of $c$ trials.

A-R Method - Continuous Examples

In this lesson, we will look at some examples of applying the acceptance-rejection method on continuous distributions. As we said, acceptance-rejection is a general method that always works, even when other methods may be difficult to apply.

Theorem Refresher

Define $g(x) \equiv f(x)/t(x)$, and note that $0 \leq g(x) \leq 1$ for all $x$. We sample $U \sim \mathcal U(0,1)$ and $Y$ from the pdf $h(y) = t(y)/c$, where $t$ majorizes $f$ and $c$ is the integral of $t(y)$ over the real line. If $U \leq g(Y)$, then $Y$ has the conditional pdf $f(y)$.

Here's the basic algorithm, which we repeat until acceptance. First, we draw $U$ from $\mathcal U(0,1)$ and $Y$ from $h(y)$, independent of $U$. If $U \leq g(Y)$, we return $X \leftarrow Y$. Otherwise, we repeat, sampling $U$ and $Y$ again.

Example

Let's generate a random variable with pdf $f(x) = 60x^3(1-x)^2$, $0 \leq x \leq 1$. This pdf is shown below.

We can't invert the corresponding cdf analytically, so we cannot use the inverse transform method; we must use acceptance-rejection. We can use some calculus to determine that the maximum of $f(x)$ occurs at $x = 0.6$: $f(0.6) = 2.0736$. With this knowledge, we can generate a basic majorizer, $t(x) = 2.0736$.

Remember that we said we wanted $c$ to be as close to one as possible so that we accept samples from $h(y)$ with high probability. We know that $c$ equals the integral of $t(x)$ from zero to one, so, in this case, $c = 2.0736$. All this to say, $t(x)$ is a relatively inefficient majorizer.

Like we said, $h(x) = t(x)/c$. In this case, $h(x) = 1$, since $t(x) = 2.0736 = c$. This result means that $Y$ is a Unif(0,1) random variable, since its pdf is one.

Finally, let's compute $g(x)$:

$$g(x) = \frac{f(x)}{t(x)} = \frac{60x^3(1-x)^2}{2.0736}$$

Let's look at a simple example. Let's draw two uniforms, $U = 0.13$ and $Y = 0.25$. If we plug and chug, we see that $g(Y) \approx 0.25$. Therefore, $U \leq g(Y)$, and we take $X \leftarrow 0.25$.

A-R Method - Continuous Examples DEMO

Previous Example

Let's demo the acceptance-rejection example we looked at previously. We wanted to generate $X$ with pdf $f(x) = 60x^3(1-x)^2$, $0 \leq x \leq 1$. Let's recall $g(x)$:

$$g(x) = \frac{f(x)}{t(x)} = \frac{60x^3(1-x)^2}{2.0736}$$

Additionally, remember that our acceptance event is $U \leq g(Y)$. Consider the following code:

m = 10000;                                  % number of accepted variates to collect
X = [];
for i=1:m
  U = 1; % Initialize to known failing condition
  Y = 0; % Initialize to known failing condition
  while U > 60 * (Y^3) * (1-Y)^2 / 2.0736   % repeat until the acceptance event U <= g(Y)
    U = rand;
    Y = rand;
  end
  X(i) = Y;                                 % store the accepted value
end
histogram(X)

This code iterates for m=10000 iterations and, on each iteration, generates Unif(0,1) random variables, Y and U, until the acceptance event occurs. Once it does, we store away the accepted value of Y and proceed to the next iteration. Finally, we generate a histogram of the m accepted values of Y.

Let's look at the generated histogram after 10,000 samples.

Here's the histogram after 100,000 samples. As we can see, the histogram of generated values looks quite close to the graph of f(x)f(x).

Half-Normal Example

In this example, we are going to generate a standard half-normal random variable, which has the following pdf:

$$f(x) = \frac{2}{\sqrt{2\pi}}e^{-x^2/2}, \quad x \geq 0$$

This random variable closely resembles a standard normal random variable, except that we've "flipped" the negative portion of the distribution over the $y$-axis and doubled $f(x)$ for all $x \geq 0$.

We can use the following majorizing function, $t(x)$, as it is always at least as large as $f(x)$:

$$t(x) = \sqrt{\frac{2e}{\pi}}e^{-x} \geq f(x)$$

If we take the integral of $t(x)$ over the domain of $f(x)$, we get $c$:

$$c = \int_0^{\infty} t(x)\,dx = \sqrt{\frac{2e}{\pi}}$$

Now, let's compute $h(x)$:

$$h(x) = \frac{t(x)}{c} = e^{-x}$$

Let's remember the pdf for an exponential random variable:

$$f(x) = \lambda e^{-\lambda x}, \quad x \geq 0$$

We can see that $h(x)$ is simply the pdf of an Exp(1) random variable, which we can generate easily!

Finally, let's look at $g(x)$:

$$g(x) = \frac{f(x)}{t(x)} = e^{-(x-1)^2/2}$$

To generate a half-normal, we simply generate $U \sim \mathcal U(0,1)$ and $Y \sim \text{Exp}(1)$ and accept $Y$ if $U \leq g(Y)$.

We can use the half-normal result to generate a Nor(0,1) random variable. We simply have to "flip back" half of the $X$ values over the $y$-axis. Given $U \sim \mathcal U(0,1)$ and $X$ from the half-normal distribution, we can see that:

$$Z = \left\{ \begin{matrix} -X & U \leq 1/2 \\ X & U > 1/2 \end{matrix} \right. \sim \text{Nor}(0,1)$$

As always, we can generate a Nor($\mu$, $\sigma^2$) random variable by applying the transformation $\mu + \sigma Z$.
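
Here's a minimal Matlab sketch of this whole recipe for a single Nor(0,1) variate:

U = 1; Y = 0;                    % force the acceptance loop to run at least once
while U > exp(-(Y - 1)^2 / 2)    % acceptance event: U <= g(Y)
  U = rand;
  Y = -log(rand);                % Y ~ Exp(1) via the inverse transform
end
X = Y;                           % X is half-normal
if rand <= 0.5                   % flip half of the values back over the y-axis
  Z = -X;
else
  Z = X;
end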

A-R Method - Poisson Distribution

In this lesson, we will use a method similar to acceptance-rejection to generate a discrete random variable.

Poisson Distribution

The Poisson distribution has the following pmf:

$$P(X=n) = \frac{e^{-\lambda}\lambda^n}{n!}, \quad n = 0, 1,...$$

We'll use a variation of the acceptance-rejection method here to generate a realization of $X$. The algorithm will go through a series of equivalent statements to arrive at a rule that gives $X = n$ for some $n$.

Remember that, by definition, $X = n$ if and only if we observe exactly $n$ arrivals from a Poisson($\lambda$) process in one time unit. We can define $A_i$ as the $i$th interarrival time from a Pois($\lambda$) process: the time between arrivals $i-1$ and $i$.

Like we said, $X=n$ if and only if we see exactly $n$ Pois($\lambda$) arrivals by $t = 1$. We can express this requirement as an inequality of sums:

$$X = n \iff \sum_{i=1}^n A_i \leq 1 < \sum_{i=1}^{n+1} A_i$$

In other words, the sum of the first $n$ interarrival times must be less than or equal to one, and, correspondingly, the sum of the first $n+1$ interarrival times must be greater than one. Put another way, the $n$th arrival occurs by time $t=1$, but the $(n+1)$th arrival occurs after time $t=1$.

We might recall that interarrival times from a Pois($\lambda$) process are Exp($\lambda$) random variables, and we know how to transform uniforms into Exp($\lambda$) random variables:

$$X = n \iff \sum_{i=1}^n \left[\frac{-1}{\lambda}\ln U_i\right] \leq 1 < \sum_{i=1}^{n+1} \left[\frac{-1}{\lambda}\ln U_i\right]$$

As we saw previously, we can move the natural logarithm outside of the sum and convert the sum to a product, since $\ln(a) + \ln(b) = \ln(ab)$. Consider:

$$X = n \iff \frac{-1}{\lambda}\ln\left(\prod_{i=1}^nU_i\right) \leq 1 < \frac{-1}{\lambda}\ln\left(\prod_{i=1}^{n+1}U_i\right)$$

Let's multiply through by $-\lambda$ - making sure to flip the inequalities when multiplying by a negative - and then exponentiate:

$$X = n \iff \prod_{i=1}^nU_i \geq e^{-\lambda} > \prod_{i=1}^{n+1}U_i$$

To generate a Pois($\lambda$) random variable, we keep multiplying uniforms into a running product until that product first drops below $e^{-\lambda}$; if this happens on the $(n+1)$th uniform, we set $X=n$.

Poisson Generation Algorithm

Let's set $a = e^{-\lambda}$, $p=1$, and $X=-1$. Note that, even though we set $X$ to $-1$, the first thing we do is increment it. We repeatedly generate a uniform, $U$, update our running product, $p = pU$, and increment $X$ by one. Once $p < a$, we stop and return $X$.
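
Here's a minimal Matlab sketch of this algorithm; lambda = 2 is an arbitrary choice:

lambda = 2;
a = exp(-lambda);     % stopping threshold e^{-lambda}
p = 1;                % running product of uniforms
X = -1;
while p >= a          % keep multiplying until the product drops below a
  p = p * rand;
  X = X + 1;
end
% X is now a Pois(lambda) variate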

Example

Let's obtain a Pois(2) random variable. We will sample uniforms until the following condition holds:

$$e^{-\lambda} = e^{-2} = 0.1353 > \prod_{i=1}^{n+1}U_i$$

Consider the following table:

$$\begin{array}{ccc} n & U_{n+1} & \prod_{i=1}^{n+1}U_i \\ \hline \end{array}$$