Simulation

Random Variate Generation, Continued


Composition (OPTIONAL)

In this lesson, we will learn about the composition method. This technique only works for random variables that are themselves combinations or mixtures of random variables, but it works well when we can use it.

Composition

We use composition when working with a random variable that comes from a combination of two random variables and can be decomposed into its constituent parts.

Let's consider a concrete example. An airplane might be late leaving the gate for two reasons, air traffic delays and maintenance delays, and these two delays compose the overall delay time. Most of the time, the total delay is due to a small air traffic delay, while, less often, the total delay is due to a larger maintenance delay.

If we can break down the total delay into the component delays - air traffic and maintenance - and those components are easy to generate, we can produce an expression for the total delay that combines the individual pieces.

The goal is to generate a random variable with a cdf, F(x)F(x), of the following form:

F(x)=βˆ‘j=1∞pjFj(x)F(x) = \sum_{j=1}^\infty p_jF_j(x)

In plain English, we want to generate a random variable whose cdf is a linear combination of other, "easier" cdf's, the FjF_j's. Note that pj>0p_j > 0 and the sum of all pjp_j's equals one.

Again, the main idea here is that, instead of trying to generate one complicated random variable, we can generate a collection of simpler random variables, weight them accordingly, and then sum them to realize the single random variable.

Don't fixate on the upper limit being infinity in the summation expression. Infinity is just the theoretical upper limit. In our airplane example, the actual limit was 2.

Here's the algorithm that we will use. First, we generate a positive integer JJ such that P(J=j)=pjP(J=j) = p_j for all jj. Then, we return XX from cdf FJ(x)F_J(x).

Proof

Let's look at a proof that XX has cdf F(x)F(x). First, we know that the cdf of XX, by definition, is P(X≀x)P(X \leq x). Additionally, by the law of total probability:

P(X≀x)=βˆ‘j=1∞P(X≀x∣J=j)P(J=j)P(X \leq x) = \sum_{j=1}^\infty P(X \leq x | J=j)P(J=j)

Now, we expressed P(J=j) as p_j, and we know that P(X \leq x | J=j) is equal to F_j(x). Therefore:

P(X≀x)=βˆ‘j=1∞P(X≀x∣J=j)P(J=j)=βˆ‘j=1∞Fj(x)pj=F(x)\begin{alignedat}{1} P(X \leq x) &= \sum_{j=1}^\infty P(X \leq x | J=j)P(J=j) \\ &= \sum_{j=1}^\infty F_j(x)p_j = F(x) \end{alignedat}

Example

Consider the Laplace distribution, with the following pdf and cdf:

f(x)≑{12ex,x<012eβˆ’x,x>0andF(x)≑{12ex,x<01βˆ’12eβˆ’x,x>0f(x) \equiv \left\{ \begin{matrix} \frac{1}{2}e^x, & x < 0 \\ \frac{1}{2}e^{-x}, & x > 0 \\ \end{matrix} \right. \quad \text{and} \quad F(x) \equiv \left\{ \begin{matrix} \frac{1}{2}e^x, & x < 0 \\ 1 - \frac{1}{2}e^{-x}, & x > 0 \\ \end{matrix} \right. \quad

Note that the pdf and cdf look similar to those for exponential random variables, except that xx is allowed to be less than zero and the exponential has a coefficient of one-half. We can think of this distribution as the exponential distribution reflected off the yy-axis.

Let's decompose this X into a "negative exponential" distribution and a regular exponential distribution:

F1(x)≑{ex,x<01,x>0andF2(x)≑{0,x<01βˆ’eβˆ’x,x>0F_1(x) \equiv \left\{ \begin{matrix} e^x, & x < 0 \\ 1, & x > 0 \\ \end{matrix} \right. \quad \text{and} \quad F_2(x) \equiv \left\{ \begin{matrix} 0, & x < 0 \\ 1 - e^{-x}, & x > 0 \\ \end{matrix} \right. \quad

If we multiply each CDF here by one-half and sum them together, we get a Laplace random variable:

F(x)=12F1(x)+12F2(x)F(x) = \frac{1}{2}F_1(x) + \frac{1}{2}F_2(x)

Let's look at both cases. When x<0x < 0:

F(x)=12ex+12(0)=12exF(x) = \frac{1}{2}e^x + \frac{1}{2}(0) = \frac{1}{2}e^x

When x>0x > 0:

F(x) = \frac{1}{2}(1) + \frac{1}{2}(1-e^{-x}) = 1 - \frac{1}{2}e^{-x}

As we can see, the composed cdf matches the expected cdf, both for x>0x > 0 and x<0x < 0.

The cdf tells us that we generate X from F_1(x) half the time and from F_2(x) the other half of the time. Correspondingly, we'll use inverse transform to solve F_1(X) = e^X = U for X half the time, and we'll solve F_2(X) = 1 - e^{-X} = U for X the other half. Consider:

X←{ln⁑(U),w/Β probabilityΒ 1/2βˆ’ln⁑(U),w/Β probabilityΒ 1/2X \leftarrow \left\{ \begin{matrix} \ln(U), & \text{w/ probability } 1/2 \\ -\ln(U), & \text{w/ probability } 1/2 \\ \end{matrix} \right.

Precisely, we transform our uniform into an exponential with probability one-half and a "negative exponential" with probability one-half.
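To make this concrete, here is a minimal Python sketch of the composition scheme above (the function name and the use of the standard random module are my own choices, not from the lecture):

```python
import math
import random

def laplace_composition():
    """Generate a standard Laplace variate by composition."""
    u1 = random.random()           # chooses which component cdf to sample from
    u2 = 1.0 - random.random()     # strictly positive, so log(u2) is defined
    if u1 < 0.5:
        return math.log(u2)        # "negative exponential" piece: X = ln(U) <= 0
    return -math.log(u2)           # ordinary exponential piece: X = -ln(U) >= 0

print([round(laplace_composition(), 3) for _ in range(5)])
```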

Box-Muller Normal RVs

In this lesson, we will talk about the Box-Muller method, which is a special-case algorithm used to generate standard normal random variables.

Box-Muller Method

If U_1, U_2 are iid \mathcal U(0,1), then the following quantities are iid Nor(0,1):

Z1=βˆ’2ln⁑(U1)cos⁑(2Ο€U2)Z2=βˆ’2ln⁑(U1)sin⁑(2Ο€U2)Z_1 = \sqrt{-2\ln{(U_1)}}\cos(2\pi U_2) \\ Z_2 = \sqrt{-2\ln{(U_1)}}\sin(2\pi U_2)

Note that we must perform the trig calculations in radians! In degrees, 2\pi U_2 is a very small angle, and the resulting Z_1 and Z_2 will not be iid Nor(0,1).
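Here is a small Python sketch of Box-Muller (assuming only the standard math and random modules; note that math.cos and math.sin already work in radians):

```python
import math
import random

def box_muller():
    """Return a pair of iid Nor(0,1) variates from two iid U(0,1) variates."""
    u1 = 1.0 - random.random()               # keep u1 > 0 so log(u1) is defined
    u2 = random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    z1 = r * math.cos(2.0 * math.pi * u2)    # math.cos / math.sin work in radians
    z2 = r * math.sin(2.0 * math.pi * u2)
    return z1, z2

print(box_muller())
```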

Interesting Corollaries

We've mentioned before that if we square a standard normal random variable, we get a chi-squared random variable with one degree of freedom. If we add n independent chi-squared random variables, each with one degree of freedom, we get a chi-squared random variable with n degrees of freedom:

Z12+Z22βˆΌΟ‡2(1)+Ο‡2(1)βˆΌΟ‡2(2)Z^2_1 + Z^2_2 \sim \chi^2(1) + \chi^2(1) \sim \chi^2(2)

Meanwhile, let's consider the sum of Z12Z^2_1 and Z22Z^2_2 algebraically:

Z12+Z22=βˆ’2ln⁑(U1)(cos⁑2(2Ο€U2)+sin⁑2(2Ο€U2))Z^2_1 + Z^2_2 = -2\ln(U_1)(\cos^2(2\pi U_2) + \sin^2(2\pi U_2))

Remember from trig class that \sin^2(x) + \cos^2(x) = 1, so:

Z12+Z22=βˆ’2ln⁑(U1)Z^2_1 + Z^2_2 = -2\ln(U_1)

Remember also how we transform exponential random variables:

X=βˆ’1Ξ»ln⁑U∼Exp(Ξ»)X = \frac{-1}{\lambda}\ln U \sim \text{Exp}(\lambda)

All this to say, we just demonstrated that Z12+Z22∼Exp(1/2)Z^2_1 + Z^2_2 \sim \text{Exp}(1/2). Furthermore, and more interestingly, we just proved that:

Ο‡2(2)∼Exp(1/2)\chi^2(2) \sim \text{Exp}(1/2)

Let's look at another example. Suppose we take Z1/Z2Z_1/Z_2. One standard normal divided by another is a Cauchy random variable, and also a t(1) random variable.

As an aside, we can see how Cauchy random variables take on such a wide range of values. Suppose the normal in the denominator is very close to zero, and the normal in the numerator is further from zero. In that case, the quotient can take on very large positive and negative values.

Moreover:

Z_2/Z_1 = \frac{\sqrt{-2\ln{(U_1)}}\sin(2\pi U_2)}{\sqrt{-2\ln{(U_1)}}\cos(2\pi U_2)} = \tan(2\pi U_2)

Thus, we've just proven that tan⁑(2Ο€U)∼Cauchy\tan(2\pi U) \sim \text{Cauchy}. Likewise, we can take Z1/Z2Z_1/Z_2, which proves additionally that cot⁑(2Ο€U)∼Cauchy\cot(2\pi U) \sim \text{Cauchy}.

Furthermore:

Z22/Z12=tan⁑2(2Ο€U)∼t2(1)∼F(1,1)Z_2^2 / Z_1^2 = \tan^2(2\pi U) \sim t^2(1) \sim F(1,1)

Polar Method

We can also use the polar method to generate standard normals, and this method is slightly faster than Box-Muller.

First, we generate two uniforms, U1,U2∼iidU(0,1)U_1, U_2 \overset{\text{iid}}{\sim} \mathcal U(0,1). Next, let's perform the following transformation:

Vi=2Uiβˆ’1,i=1,2W=V12+V22V_i = 2U_i - 1, i = 1,2 \\ W = V_1^2 + V_2^2

Finally, we use the acceptance-rejection method. If W>1W > 1, we reject and return to sampling uniforms. Otherwise:

Y=βˆ’2ln⁑(W)/WY = \sqrt{-2\ln(W)/W}

We accept Zi←ViY,i=1,2Z_i \leftarrow V_iY,i=1,2. As it turns out, Z1Z_1 and Z2Z_2 are iid Nor(0,1). This method is slightly faster than Box-Muller because it avoids expensive trigonometric calculations.
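Here is a minimal Python sketch of the polar method (again, the function name and the use of the standard library are my own choices):

```python
import math
import random

def polar_normals():
    """Return a pair of iid Nor(0,1) variates via the polar method."""
    while True:
        v1 = 2.0 * random.random() - 1.0
        v2 = 2.0 * random.random() - 1.0
        w = v1 * v1 + v2 * v2
        if 0.0 < w <= 1.0:                          # reject if W > 1 (or the degenerate W = 0)
            y = math.sqrt(-2.0 * math.log(w) / w)
            return v1 * y, v2 * y

print(polar_normals())
```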

Order Statistics and Other Stuff

In this lesson, we will talk about how to generate order statistics efficiently.

Order Statistics

Suppose that we have the iid observations, X1,X2,...,XnX_1, X_2,...,X_n from some distribution with cdf F(x)F(x). We are interested in generating a random variable, YY, such that YY is the minimum of the XiX_i's: Y=min⁑{X1,...,Xn}Y = \min\{X_1,...,X_n\}. Since YY is itself a random variable, let's call its cdf G(y)G(y). Furthermore, since YY refers to the smallest XiX_i, it's called the first order statistic.

To generate YY, we could generate nn XiX_i's individually, which takes nn units of work. Could we generate YY more efficiently, using just one U(0,1)\mathcal U(0,1)?

The answer is yes! Since the XiX_i's are iid, we have:

G(y)=1βˆ’P(Y>y)=1βˆ’P(min⁑iXi>y)G(y) = 1 - P(Y > y) = 1 - P(\min_iX_i > y)

In English, the cdf of YY is P(Y≀y)P(Y \leq y), which is equivalent to the complement: 1βˆ’P(Y>y)1 - P(Y > y). Since Y=min⁑iXiY = \min_i X_i, G(y)=1βˆ’P(min⁑iXi>y)G(y) = 1 - P(\min_i X_i > y).

Now, the minimum of the X_i's is greater than y if and only if all of the X_i's are greater than y:

G(y)=1βˆ’P(allΒ Xi’s>y)=1βˆ’[P(X1>y)βˆ—P(X2>y)βˆ—...βˆ—P(Xn>y)]\begin{alignedat}{1} G(y) &= 1 - P(\text{all } X_i\text{'s} > y) \\ &= 1 - [P(X_1 > y) * P(X_2 > y) * ... * P(X_n > y)] \end{alignedat}

Since the XiX_i's are iid, P(Xi>y)=P(X1>y)P(X_i > y) = P(X_1 > y). Therefore:

G(y)=1βˆ’[P(X1>y)βˆ—P(X1>y)βˆ—...βˆ—P(X1>y)]=1βˆ’[P(X1>y)]n\begin{alignedat}{1} G(y) &= 1 - [P(X_1 > y) * P(X_1 > y) * ... * P(X_1 > y)] \\ &= 1 - [P(X_1 > y)]^n \end{alignedat}

Finally, note that F(y) = P(X_1 \leq y). Therefore, 1 - F(y) = P(X_1 > y), so:

G(y)=1βˆ’[1βˆ’F(y)]nG(y) = 1 - [1- F(y)]^n

At this point, we can use inverse transform to express YY in terms of UU:

U=1βˆ’[1βˆ’F(Y)]n[1βˆ’F(Y)]n=1βˆ’U1βˆ’F(Y)=(1βˆ’U)1/nY=Fβˆ’1(1βˆ’(1βˆ’U)1/n)\begin{alignedat}{1} U &= 1 - [1- F(Y)]^n \\ [1- F(Y)]^n &= 1 - U \\ 1 - F(Y) &= (1-U)^{1/n}\\ Y &= F^{-1}(1 - (1-U)^{1/n}) \end{alignedat}

Example

Suppose X1,...Xn∼Exp(λ)X_1,...X_n \sim \text{Exp}(\lambda). Then:

G(y)=1βˆ’(1βˆ’(1βˆ’eβˆ’Ξ»y))n=1βˆ’eβˆ’nΞ»y\begin{alignedat}{1} G(y) &= 1 - (1 - (1 - e^{-\lambda y}))^n \\ &= 1 - e^{-n\lambda y} \end{alignedat}

So, Y=min⁑iXi∼Exp(nλ)Y = \min_iX_i \sim \text{Exp}(n\lambda). If we apply inverse transform we get:

Y=βˆ’1nΞ»ln⁑(U)Y = \frac{-1}{n\lambda}\ln(U)

We can do the same kind of thing for Z=max⁑iXiZ = \max_i X_i. Let's try it! If Z=max⁑iXiZ = \max_iX_i, then we can express H(z)H(z) as:

P(Z≀z)=P(max⁑iXi≀z)P(Z \leq z) = P(\max_iX_i \leq z)

Now, the maximum of the X_i's is less than or equal to z if and only if all of the X_i's are less than or equal to z:

H(z)=P(allΒ Xi’s≀z)=[P(X1≀z)βˆ—P(X2≀z)βˆ—...βˆ—P(Xn≀z)]\begin{alignedat}{1} H(z) &= P(\text{all } X_i\text{'s} \leq z) \\ &= [P(X_1 \leq z) * P(X_2 \leq z) * ... * P(X_n \leq z)] \end{alignedat}

Since the XiX_i's are iid, P(Xi≀z)=P(X1≀z)P(X_i \leq z) = P(X_1 \leq z). Therefore:

H(z)=[P(X1≀z)βˆ—P(X1≀z)βˆ—...βˆ—P(X1≀z)]=[P(X1≀z)]n\begin{alignedat}{1} H(z) &= [P(X_1 \leq z) * P(X_1 \leq z) * ... * P(X_1 \leq z)] \\ &= [P(X_1 \leq z)]^n \end{alignedat}

Finally, note that F(z) = P(X_1 \leq z). Therefore:

H(z)=F(z)nH(z) =F(z)^n

Suppose X1,...Xn∼Exp(λ)X_1,...X_n \sim \text{Exp}(\lambda). Then:

H(z)=(1βˆ’eβˆ’Ξ»z)nH(z) = (1 - e^{-\lambda z})^n

Let's apply inverse transform:

U=(1βˆ’eβˆ’Ξ»Z)nU1/n=1βˆ’eβˆ’Ξ»Zeβˆ’Ξ»Z=1βˆ’U1/nβˆ’Ξ»Z=ln⁑(1βˆ’U1/n)Z=βˆ’ln⁑(1βˆ’U1/n)Ξ»\begin{alignedat}{1} U &= (1 - e^{-\lambda Z})^n \\ U^{1/n} &= 1 - e^{-\lambda Z} \\ e^{-\lambda Z} &= 1 - U^{1/n} \\ -\lambda Z &= \ln(1 - U^{1/n}) \\ Z &= \frac{-\ln(1 - U^{1/n})}{\lambda} \end{alignedat}
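Here is a small Python sketch of both shortcuts, each using a single uniform (the function names are hypothetical, and the exponential case is just the example worked above):

```python
import math
import random

def min_exponential(n, lam):
    """Y = min of n iid Exp(lam) variates, generated from a single uniform."""
    u = 1.0 - random.random()            # strictly positive, so log(u) is defined
    return -math.log(u) / (n * lam)      # Y ~ Exp(n * lam)

def max_exponential(n, lam):
    """Z = max of n iid Exp(lam) variates, generated from a single uniform."""
    u = random.random()
    return -math.log(1.0 - u ** (1.0 / n)) / lam

print(min_exponential(10, 2.0), max_exponential(10, 2.0))
```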

Other Stuff

If Z1,Z2,...,ZnZ_1, Z_2,...,Z_n are iid Nor(0,1), then the sum of the squares of the ZiZ_i's is a chi-squared random variable with nn degrees of freedom:

βˆ‘i=1nZi2βˆΌΟ‡2(n)\sum_{i=1}^n Z^2_i \sim \chi^2(n)

If Z \sim \text{Nor}(0,1), and Y \sim \chi^2(n), and Z and Y are independent, then:

ZY/n∼t(n)\frac{Z}{\sqrt{Y/n}} \sim t(n)

Note that t(1) is the Cauchy distribution.

If XβˆΌΟ‡2(n)X \sim \chi^2(n), and YβˆΌΟ‡2(m)Y \sim \chi^2(m), and XX and YY are independent, then:

\frac{X/n}{Y/m} \sim F(n,m)

If we want to generate random variables from continuous empirical distributions, we'll have to settle for the CONT function in Arena for now.

Multivariate Normal Distribution

In this lesson, we will look at the multivariate normal distribution. This distribution is very important, and people use it all the time to model observations that are correlated with one another.

Bivariate Normal Distribution

Consider a random vector (X,Y)(X, Y). For example, we might draw XX from a distribution of heights and YY from a distribution of weights. This vector has the bivariate normal distribution with means ΞΌX=E[X]\mu_X = E[X] and ΞΌY=E[Y]\mu_Y = E[Y], variances ΟƒX2=Var(X)\sigma^2_X = \text{Var}(X) and ΟƒY2=Var(Y)\sigma^2_Y = \text{Var}(Y), and correlation ρ=Corr(X,Y)\rho = \text{Corr}(X,Y) if it has the following joint pdf:

f(x,y)=12πσXΟƒY1βˆ’Ο2exp⁑{βˆ’[zX2(x)+zY2(y)βˆ’2ρzX(x)zY(y)]2(1βˆ’Ο2)}f(x,y) = \frac{1}{2\pi \sigma_X \sigma_Y\sqrt{1-\rho^2}}\exp\left\{\frac{-\left[z^2_X(x) + z^2_Y(y) - 2 \rho z_X(x)z_Y(y)\right]}{2(1-\rho^2)}\right\}

Note the following definitions for zX(x)z_X(x) and zY(y)z_Y(y), which are standardized versions of the corresponding variables:

zX(x)≑xβˆ’ΞΌXΟƒX,zY(y)≑yβˆ’ΞΌYΟƒYz_X(x) \equiv \frac{x - \mu_X}{\sigma_X}, \quad z_Y(y) \equiv \frac{y - \mu_Y}{\sigma_Y}

Let's contrast this pdf with the univariate normal pdf:

f(x)=1Οƒ2Ο€exp⁑{βˆ’12(xβˆ’ΞΌΟƒ)2}f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{\frac{-1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\}

The fraction in front of the exponential looks similar in both cases. In fact, we could argue that the fraction for the univariate case might look like this:

1Οƒ2Ο€1βˆ’Ο2\frac{1}{\sigma\sqrt{2\pi}\sqrt{1-\rho^2}}

In the univariate case, however, there is no second variable to correlate with, so \rho = 0 and the factor \sqrt{1-\rho^2} reduces to one. All this to say, the bivariate case expands on the univariate case by incorporating another \sigma\sqrt{2\pi} term. Consider:

f(x,y)=1ΟƒX2πσY2Ο€1βˆ’Ο2exp⁑{⋯ }=12πσXΟƒY1βˆ’Ο2exp⁑{⋯ }\begin{alignedat}{1} f(x,y) &= \frac{1}{\sigma_X \sqrt{2\pi} \sigma_Y \sqrt{2\pi}\sqrt{1-\rho^2}}\exp\{\cdots\} \\ &= \frac{1}{2\pi\sigma_X \sigma_Y \sqrt{1-\rho^2}}\exp\{\cdots\} \end{alignedat}

Now let's look at the expression inside the exponential. Consider again the exponential for the univariate case:

exp⁑{βˆ’12(xβˆ’ΞΌΟƒ)2}\exp\left\{\frac{-1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\}

Again, we might argue that this expression contains a hidden term involving \rho that drops out in the univariate case, where \rho = 0:

\exp\left\{\frac{-1}{2(1-\rho^2)}\left(\frac{x-\mu}{\sigma}\right)^2\right\}

If we look at the bivariate case again, we see something that looks similar. We retain the βˆ’1/2-1/2 coefficient and we square standardized versions of XX and YY, although the standardization part is hidden away in zXz_X and zYz_Y:

exp⁑{βˆ’[zX2(x)+zY2(y)βˆ’2ρzX(x)zY(y)]2(1βˆ’Ο2)}\exp\left\{\frac{-\left[z^2_X(x) + z^2_Y(y) - 2 \rho z_X(x)z_Y(y)\right]}{2(1-\rho^2)}\right\}

As we said, we can model heights and weights of people as a bivariate normal random variable because those two observations are correlated with one another. In layman's terms: shorter people tend to be lighter, and taller people tend to be heavier.

For example, consider the following observations, taken from a bivariate normal distribution where both means are zero, both variances are one, and the correlation between the two variables is 0.9.

Multivariate Normal

The random vector X=(X1,...,Xk)⊀\bold X = (X_1,...,X_k)^\top has the multivariate normal distribution with mean vector ΞΌ=(ΞΌ1,...,ΞΌk)⊀\boldsymbol \mu = (\mu_1,...,\mu_k)^\top and kΓ—kk \times k covariance matrix Ξ£=(Οƒij)\Sigma = (\sigma_{ij}), if it has the following pdf:

f(x)=1(2Ο€)k/2∣Σ∣1/2exp⁑{βˆ’(xβˆ’ΞΌ)βŠ€Ξ£βˆ’1(xβˆ’ΞΌ)2},x∈Rkf(\bold x) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}} \exp\left\{-\frac{(\bold x - \boldsymbol \mu)^\top\Sigma^{-1}(\bold x - \boldsymbol \mu)}{2}\right\}, \quad x \in \mathbb{R}^k

Notice that the vector X\bold X and the vector ΞΌ\boldsymbol \mu have the same dimensions: random variable XiX_i has the expected value ΞΌi\mu_i.

Now let's talk about the covariance matrix Ξ£\Sigma. This matrix has kk rows and kk columns, and the cell at the iith row and jjth column holds the covariance between XiX_i and XjX_j. Of course, the cell at Ξ£ii\Sigma_{ii} holds the covariance of XiX_i with itself, which is just the variance of XiX_i.

Now let's look at the fraction in front of the exponential expression. Remember that, in one dimension, we took \sqrt{2\pi}. In k dimensions, we take (2\pi)^{k/2}. In one dimension, we take \sqrt{\sigma^2}. In k dimensions, we take the square root of the determinant of \Sigma, where the determinant is a generalization of the variance.

Now let's look at the exponential expression. In one dimension, we computed:

exp⁑{βˆ’12(xβˆ’ΞΌΟƒ)2}\exp\left\{\frac{-1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\}

In kk dimensions, we compute:

exp⁑{βˆ’(xβˆ’ΞΌ)βŠ€Ξ£βˆ’1(xβˆ’ΞΌ)2}\exp\left\{-\frac{(\bold x - \boldsymbol \mu)^\top\Sigma^{-1}(\bold x - \boldsymbol \mu)}{2}\right\}

Here, (\bold x - \boldsymbol \mu)^\top(\bold x - \boldsymbol \mu) corresponds to (x - \mu)^2, and \Sigma^{-1} is the inverse of the covariance matrix, which corresponds to 1/\sigma^2 in the univariate case.

All this to say, the multivariate normal distribution generalizes the univariate normal distribution.

The multivariate case has the following properties:

E[Xi]=ΞΌi,Var(Xi)=Οƒii,Cov(Xi,Xj)=ΟƒijE[X_i] = \mu_i, \quad \text{Var}(X_i) = \sigma_{ii}, \text{Cov}(X_i, X_j) = \sigma_{ij}

We can express a random variable coming from this distribution with the following notation: X∼Nork(μ,Σ)\bold X \sim \text{Nor}_k(\boldsymbol{\mu}, \Sigma).

Generating Multivariate Random Variables

To generate \bold X, let's start out with a vector of iid Nor(0,1) random variables in the same dimension as \bold X: \bold Z = (Z_1,...,Z_k). We can express \bold Z as \bold Z \sim \text{Nor}_k(\bold 0, I), where \bold 0 is a k-dimensional vector of zeroes and I is the k \times k identity matrix. The identity matrix makes sense here: since the Z_i's are independent with variance one, the diagonal entries of the covariance matrix are all one and everything off the diagonal is zero.

Now, suppose that we can find the lower triangular kΓ—kk \times k Cholesky matrix CC, such that Ξ£=CC⊀\Sigma = CC^\top. We can think of this matrix, informally, as the "square root" of the covariance matrix.

It can be shown that X=ΞΌ+CZ\bold X = \boldsymbol \mu + C \bold Z is multivariate normal with mean ΞΌ\boldsymbol \mu and covariance matrix:

Σ≑E[(CZ)(CZ)⊀]=E[CZZ⊀C⊀]=C(E[ZZ⊀])C⊀=CC⊀\Sigma \equiv E[(C\bold Z)(C \bold Z)^\top] = E[C \bold Z \bold Z^\top C^\top] = C(E[\bold Z \bold Z^\top])C^\top = CC^\top

How do we find this magic matrix CC? For k=2k = 2, we can easily derive CC:

C=(Οƒ110Οƒ12Οƒ11Οƒ22βˆ’Οƒ122Οƒ11)C = \left( \begin{matrix} \sqrt{\sigma_{11}} & 0 \\ \frac{\sigma_{12}}{\sqrt{\sigma_{11}}} & \sqrt{\sigma_{22} - \frac{\sigma^2_{12}}{\sigma_{11}}} \end{matrix} \right)

Again, if we multiply CC by C⊀C^\top, we get the covariance matrix, Σ\Sigma:

Ξ£=CCT=(Οƒ110Οƒ12Οƒ11Οƒ22βˆ’Οƒ122Οƒ11)Γ—(Οƒ11Οƒ12Οƒ110Οƒ22βˆ’Οƒ122Οƒ11)=(Οƒ11Οƒ12Οƒ12Οƒ22)\begin{alignedat}{1} \Sigma &= CC^T \\ &= \left(\begin{matrix} \sqrt{\sigma_{11}} & 0 \\ \frac{\sigma_{12}}{\sqrt{\sigma_{11}}} & \sqrt{\sigma_{22} - \frac{\sigma^2_{12}}{\sigma_{11}}} \end{matrix} \right) \times \left(\begin{matrix} \sqrt{\sigma_{11}} & \frac{\sigma_{12}}{\sqrt{\sigma_{11}}} \\ 0 & \sqrt{\sigma_{22} - \frac{\sigma^2_{12}}{\sigma_{11}}} \end{matrix} \right) \\[4ex] &= \left(\begin{matrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{matrix} \right) \end{alignedat}

Here's how we generate X\bold X. Since X=ΞΌ+CZ\bold X = \boldsymbol \mu + C \bold Z, if we carry the vector addition and matrix multiplication out, we have:

X_1 = \mu_1 + \sqrt{\sigma_{11}}Z_1 + 0Z_2 \\ X_2 = \mu_2 + \frac{\sigma_{12}}{\sqrt{\sigma_{11}}}Z_1 + \sqrt{\sigma_{22} - \frac{\sigma^2_{12}}{\sigma_{11}}}Z_2

Here's the algorithm for computing a kk-dimensional Cholesky matrix, CC.
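The course's pseudocode isn't reproduced here, but the following Python sketch implements a standard lower-triangular Cholesky factorization that produces the same C (the function name and nested-list representation are my own choices):

```python
import math

def cholesky(sigma):
    """Lower-triangular C with sigma = C C^T (sigma symmetric positive definite)."""
    k = len(sigma)
    c = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(c[i][m] * c[j][m] for m in range(j))
            if i == j:
                c[i][j] = math.sqrt(sigma[i][i] - s)      # diagonal entries
            else:
                c[i][j] = (sigma[i][j] - s) / c[j][j]     # below-diagonal entries
    return c

print(cholesky([[1.0, 0.9], [0.9, 1.0]]))   # matches the 2 x 2 formula above
```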

Once we compute C, we can easily generate the multivariate normal random variable \bold X = \boldsymbol \mu + C \bold Z. First, we generate k iid Nor(0,1) random variables, Z_1, Z_2,...,Z_k. Next, we set X_i equal to the following expression:

Xi←μi+βˆ‘j=1icijZj,i=1,2,...,kX_i \leftarrow \mu_i + \sum_{j=1}^i c_{ij}Z_j, i = 1,2,...,k

Note that the sum we take above is nothing more than the standard normals weighted by the entries in the ith row of C.

Finally, we return \bold X = (X_1, X_2, ..., X_k).
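Here is a sketch of the full generation step, assuming NumPy is available (numpy.linalg.cholesky returns the same lower-triangular factor as the sketch above; the example \boldsymbol \mu and \Sigma are just illustrative):

```python
import numpy as np

def multivariate_normal(mu, sigma, rng=None):
    """Generate X ~ Nor_k(mu, Sigma) as X = mu + C Z, where Sigma = C C^T."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    c = np.linalg.cholesky(np.asarray(sigma, dtype=float))  # lower-triangular factor
    z = rng.standard_normal(len(mu))                        # k iid Nor(0,1) components
    return mu + c @ z

# Bivariate example: means 0, variances 1, covariance (= correlation) 0.9.
print(multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]]))
```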

Baby Stochastic Processes (OPTIONAL)

In this lesson, we will talk about how we generate some easy stochastic processes: Markov chains and Poisson arrivals.

Markov Chains

A time series is a collection of observations ordered by time: day to day or minute to minute, for instance. We will look at a time series of daily weather observations and determine the probability of moving between two different states, sunny and rainy.

It could be the case that the days are not independent of one another; in other words, if we have thunderstorms today, there is likely a higher than average probability that we will have thunderstorms tomorrow.

Suppose it turns out the probability that we have a thunderstorm tomorrow only depends on whether we had a thunderstorm today, and nothing else. In that case, the sequence of weather outcomes is a type of stochastic process called a Markov chain.

Example

Let's let Xi=0X_i = 0 if it rains on day ii; otherwise, Xi=1X_i = 1. Note that we aren't dealing with Bernoulli observations here because the trials are not independent; that is, the weather on day i+1i+1 depends on the weather on day ii.

We can denote the day-to-day transition probabilities with the following equation:

Pjk=P(state k on day i+1∣ state j on day i),j,k=0,1P_{jk} = P(\text{state } k \text{ on day } i + 1 | \text{ state } j \text{ on day } i), \quad j,k = 0,1

In other words, PjkP_{jk} refers to the probability of experiencing state X=kX = k on day i+1i+1, given that we experienced state X=jX = j on day ii. For example, P01P_{01} refers to the probability that it is sunny tomorrow, given that it rained today.

Suppose that we know the various transition probabilities, and we have the following probability state matrix:

P=(P00P01P10P11)=(0.70.30.40.6)P = \begin{pmatrix} P_{00} & P_{01} \\ P_{10} & P_{11} \end{pmatrix} = \begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix}

Remember that state zero is rain and state one is sun. Therefore, P00=0.7P_{00} = 0.7 means that the probability that we have rain on day i+1i+1 given that we had rain on day ii equals 0.7. Likewise, P01=0.3P_{01} = 0.3 means that the probability that we have sun on day i+1i+1 given that we had rain on day ii equals 0.3.

Notice that the probabilities add up to one across the rows, signifying that, for a given state on day ii, the probability of moving to some state on day i+1i+1 is one: the process must transition to a new state (of course, this new state could be equivalent to the current state).

Suppose it rains on Monday, and we want to simulate the rest of the week. We will fill out the following table:

dayP(R∣Xiβˆ’1)UiUi<Pβ‹…0?R/SMβˆ’βˆ’βˆ’R\begin{array}{ccccc} \text{day} & P(R|X_{i-1}) & U_i & U_i < P_{\cdot 0}? & \text{R/S} \\ \hline \text{M} & - & - & - & \text{R} \end{array}

Let's talk about the columns. Obviously, the first column refers to the day of the week. The column titled P(R∣Xiβˆ’1)P(\text{R} | X_{i-1}) refers to the probability of rain given yesterday's weather. Note that we also represent this probability with Pβ‹…0P_{\cdot 0}, where the dot refers to yesterday's state.

In the third column, we sample a uniform. In the fourth column, we check whether that uniform is less than the transition probability from yesterday's state to rain. Finally, in the fifth column, we remark whether it rains, based on whether the uniform is less than or equal to the transition probability.

Let's see what happens on Tuesday:

dayP(R∣Xiβˆ’1)UiUi<Pβ‹…0?R/SMβˆ’βˆ’βˆ’RTuP00=0.70.62YR\begin{array}{ccccc} \text{day} & P(R|X_{i-1}) & U_i & U_i < P_{\cdot 0}? & \text{R/S} \\ \hline \text{M} & - & - & - & \text{R} \\ \text{Tu} & P_{00} = 0.7 & 0.62 & \text{Y} & \text{R} \end{array}

Here, we are looking at the probability of transitioning from a rainy day to another rainy day. If we look up that particular probability in our state transition matrix, we get P00=0.7P_{00} = 0.7. Next, we sample a uniform, Ui=0.62U_i = 0.62, and check whether that uniform is less than the transition probability. Since 0.62<0.70.62 < 0.7, we say that it will rain on Tuesday.

Let's see what happens on Wednesday:

dayP(R∣Xiβˆ’1)UiUi<Pβ‹…0?R/SMβˆ’βˆ’βˆ’RTuP00=0.70.62YRWP00=0.70.03YR\begin{array}{ccccc} \text{day} & P(R|X_{i-1}) & U_i & U_i < P_{\cdot 0}? & \text{R/S} \\ \hline \text{M} & - & - & - & \text{R} \\ \text{Tu} & P_{00} = 0.7 & 0.62 & \text{Y} & \text{R} \\ \text{W} & P_{00} = 0.7 & 0.03 & \text{Y} & \text{R} \end{array}

Similarly, we are looking at the probability of again transitioning from a rainy day to another rainy day, so our transition probability is still P00=0.7P_{00} = 0.7. We draw another uniform, Ui=0.03U_i = 0.03, and, since 0.03<0.70.03 < 0.7, we say that it will rain on Wednesday.

Let's see what happens on Thursday:

dayP(R∣Xiβˆ’1)UiUi<Pβ‹…0?R/SMβˆ’βˆ’βˆ’RTuP00=0.70.62YRWP00=0.70.03YRThP00=0.70.77NS\begin{array}{ccccc} \text{day} & P(R|X_{i-1}) & U_i & U_i < P_{\cdot 0}? & \text{R/S} \\ \hline \text{M} & - & - & - & \text{R} \\ \text{Tu} & P_{00} = 0.7 & 0.62 & \text{Y} & \text{R} \\ \text{W} & P_{00} = 0.7 & 0.03 & \text{Y} & \text{R} \\ \text{Th} & P_{00} = 0.7 & 0.77 & \text{N} & \text{S} \end{array}

Here, we have the same transition probability as previously, but our uniform, Ui=0.77U_i = 0.77, happens to be larger than our transition probability: 0.77>0.70.77 > 0.7. As a result, we say that it will not rain on Thursday.

Finally, let's look at Friday:

dayP(R∣Xiβˆ’1)UiUi<Pβ‹…0?R/SMβˆ’βˆ’βˆ’RTuP00=0.70.62YRWP00=0.70.03YRThP00=0.70.77NSFP10=0.40.91NS\begin{array}{ccccc} \text{day} & P(R|X_{i-1}) & U_i & U_i < P_{\cdot 0}? & \text{R/S} \\ \hline \text{M} & - & - & - & \text{R} \\ \text{Tu} & P_{00} = 0.7 & 0.62 & \text{Y} & \text{R} \\ \text{W} & P_{00} = 0.7 & 0.03 & \text{Y} & \text{R} \\ \text{Th} & P_{00} = 0.7 & 0.77 & \text{N} & \text{S} \\ \text{F} & P_{10} = 0.4 & 0.91 & \text{N} & \text{S} \\ \end{array}

On Friday, we have a new transition probability. This time, we are looking at the probability of rain given sun, and P10=0.4P_{10} = 0.4. Again we draw our uniform, which happens to be greater than the transition probability, so we say it will be sunny on Friday.
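Here is a minimal Python sketch of this simulation (the transition matrix is the one from the example above; the function name is my own):

```python
import random

# State 0 = rain, state 1 = sun; rows are today's state, columns tomorrow's.
P = [[0.7, 0.3],
     [0.4, 0.6]]

def simulate_days(start_state, days):
    """Simulate `days` further days of weather given today's state."""
    states = [start_state]
    for _ in range(days):
        u = random.random()
        prob_rain = P[states[-1]][0]              # P(rain tomorrow | today's state)
        states.append(0 if u < prob_rain else 1)
    return states

# Rain on Monday (state 0), then simulate Tuesday through Friday.
print(simulate_days(start_state=0, days=4))
```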

Poisson Arrivals

Let's suppose we have a Poisson(Ξ»\lambda) process with a constant arrival rate of Ξ»\lambda. The interarrival times of such a process are iid Exp(Ξ»\lambda). Remember that, even though the rate is constant, the arrivals themselves are still random because the interarrival times are random.

Let's look at how we generate the arrival times. Note that we set T0←0T_0 \leftarrow 0 to initialize the process at time zero. From there, we can compute the iith arrival time, TiT_i, as:

Ti←Tiβˆ’1βˆ’1Ξ»ln⁑Ui,iβ‰₯1T_i \leftarrow T_{i-1} - \frac{1}{\lambda}\ln U_i, \quad i \geq 1

In other words, the next arrival time equals the previous arrival time, plus an exponential random variable referring to the interarrival time. Even more basically, the time at which the next person shows up is equal to the time at which the last person showed up, plus some randomly generated interarrival time.

Note that, even though we seem to be subtracting from T_{i-1}, the natural log of U_i \in (0,1) is a negative number, so we are, in fact, adding two positive quantities here.

We refer to this iterative process as bootstrapping, whereby we build subsequent arrivals on previous arrivals.
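Here is a small Python sketch of this bootstrapping scheme (standard library only; the function name is my own):

```python
import math
import random

def poisson_arrival_times(lam, n):
    """First n arrival times of a Poisson(lam) process, built by bootstrapping."""
    times, t = [], 0.0
    for _ in range(n):
        u = 1.0 - random.random()        # strictly positive, so log(u) is defined
        t -= math.log(u) / lam           # add an Exp(lam) interarrival time
        times.append(t)
    return times

print(poisson_arrival_times(lam=2.0, n=5))
```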

Fixed Arrivals

Suppose we want to generate a fixed number, n, of arrivals from a Poisson(\lambda) process in a fixed time interval [a,b]. Of course, we cannot guarantee a particular number of arrivals in a fixed time interval of a Poisson process, because the interarrival times are random.

It turns out that there is a theorem that states the following: the joint distribution of nn arrivals from the Poisson process during some interval [a,b][a,b] is equivalent to the joint distribution of the order statistics of nn iid U(a,b)\mathcal U(a,b) random variables.

Here's the algorithm. First, we generate nn iid U(0,1)\mathcal U(0,1) uniforms, U1,...,UnU_1,...,U_n. Next, we sort them: U(1)<U(2)<...<U(n)U_{(1)} < U_{(2)} < ... < U_{(n)}. Finally, we transform them to lie on the interval [a,b][a,b]:

Ti←a+(bβˆ’a)U(i)T_i \leftarrow a + (b-a)U_{(i)}
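Here is a minimal Python sketch of this algorithm (the function name is my own):

```python
import random

def poisson_arrivals_given_n(n, a, b):
    """n Poisson-process arrival times on [a, b], given that exactly n occurred."""
    ordered = sorted(random.random() for _ in range(n))   # U_(1) < ... < U_(n)
    return [a + (b - a) * u for u in ordered]

print(poisson_arrivals_given_n(n=5, a=2.0, b=10.0))
```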

Nonhomogeneous Poisson Processes

In this lesson, we will look at nonhomogeneous Poisson processes (NHPPs), where the arrival rate changes over time. We have to be careful here: we can't use the algorithm we discussed last time to generate arrivals from NHPPs.

NHPPs - Nonstationary Arrivals

An NHPP maintains the same assumptions as a standard Poisson process, except that the arrival rate, Ξ»\lambda, isn't constant, so the stationary increments assumption no longer applies.

Let's define the function Ξ»(t)\lambda(t), which describes the arrival rate at time tt. Let's also define the function N(t)N(t), which counts the number of arrivals during [0,t][0,t].

Consider the following theorem to describe the number of arrivals from an NHPP between time ss and time tt:

N(s+t)βˆ’N(s)∼Poisson(∫ss+tΞ»(u)du)N(s+t) - N(s) \sim \text{Poisson}\left(\int_s^{s+t} \lambda(u)du\right)

Example

Suppose that the arrival pattern to the Waffle House over a certain time period is an NHPP with Ξ»(t)=t2\lambda(t) = t^2. Let's find the probability that there will be exactly four arrivals between times t=1t=1 and t=2t=2.

From the equation above, we can see that the number of arrivals in this time interval is:

N(2)βˆ’N(1)∼Pois(∫12t2dt)∼Pois(7/3)N(2) - N(1) \sim \text{Pois}\left(\int_1^{2} t^2dt\right) \sim \text{Pois}(7/3)

Since the number of arrivals is X∼Pois(7/3)X \sim \text{Pois}(7/3), we can calculate the P(X=4)P(X = 4) using the pmf, f(x)f(x):

f(x)=eβˆ’Ξ»(Ξ»)xx!=eβˆ’7/3(7/3)44!=0.120f(x) = \frac{e^{-\lambda}(\lambda)^x}{x!} = \frac{e^{-7/3}(7/3)^4}{4!} = 0.120
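As a quick sanity check of this arithmetic, here is the same calculation in Python:

```python
import math

lam = 7.0 / 3.0                                     # integral of t^2 from 1 to 2
p4 = math.exp(-lam) * lam ** 4 / math.factorial(4)  # Poisson pmf at x = 4
print(round(p4, 3))                                 # roughly 0.12
```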

Incorrect NHPP Algorithm

Now, let's look at an incorrect algorithm for generating NHPP arrivals. This algorithm is particularly bad because it can "skip" intervals if Ξ»(t)\lambda(t) grows very quickly.

Here's the algorithm.

We initialize the algorithm with T0←0;i←0T_0 \leftarrow 0; i \leftarrow 0. To generate the iith arrival, we first sample U∼U(0,1)U \sim \mathcal U(0,1), and then we perform the following computation:

Ti+1←Tiβˆ’1Ξ»(Ti)ln⁑UT_{i+1} \leftarrow T_i - \frac{1}{\lambda(T_i)}\ln U

Note that, instead of multiplying by βˆ’1/Ξ»-1 / \lambda, we multiply by βˆ’1/Ξ»(Ti)-1 / \lambda(T_i), which is the arrival rate at the time of the previous arrival. The problem arises when the arrival rate radically changes between the time we generate the new arrival and the time that arrival occurs.

Specifically, if Ξ»(Ti+1)>>Ξ»(Ti)\lambda(T_{i+1}) >> \lambda(T_{i}), then the arrival rate we use to generate arrival Ti+1T_{i+1}, evaluated at time TiT_i, is too small. Since we capture the arrival rate of the next arrival at the time of the current arrival, we essentially consider the rate to be fixed between that time and the time the arrival actually occurs.

Thinning Algorithm

Consider the following rate function, Ξ»(t)\lambda(t), graphed below. We can suppose, perhaps, that this function depicts the arrivals in a restaurant, and the peaks correspond to breakfast, lunch, and dinner rushes.

We are going to use the thinning algorithm to generate the iith arrival. First, we assume that Ξ»βˆ—=max⁑tΞ»(t)\lambda^* = \max_t\lambda(t) is finite. We can see the line y=Ξ»βˆ—y = \lambda^* drawn below.

From there, we will generate potential arrivals at that maximum rate Ξ»βˆ—\lambda^*:

T_{i+1} = T_i - \frac{1}{\lambda^*}\ln U

We can see the potential arrivals in dotted purple below.

We will accept a potential arrival at time tt as a real arrival with probability Ξ»(t)/Ξ»βˆ—\lambda(t) / \lambda^*. Thus, as Ξ»(t)\lambda(t) approaches Ξ»βˆ—\lambda^*, the likelihood that we accept tt as a real arrival increases proportionally to Ξ»(t)/Ξ»βˆ—\lambda(t) / \lambda^*. We call this algorithm the thinning algorithm because it thins out potential arrivals.

Let's look at the algorithm.

We initialize the algorithm T0←0;i←0T_0 \leftarrow 0; i \leftarrow 0. Next, we initialize t←Tit \leftarrow T_i, and then we iterate. We generate two iid Unif(0,1) random variables, UU and VV, and then we generate a potential arrival:

t \leftarrow t - \frac{1}{\lambda^*}\ln U

We keep generating UU and VV until this condition holds:

V≀λ(t)/Ξ»βˆ—V \leq \lambda(t)/\lambda^*

In other words, we only keep the potential arrival with probability Ξ»(t)/Ξ»βˆ—\lambda(t)/\lambda^*. After we accept tt, we set ii to i+1i+1, and TiT_i to tt, and repeat.
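Here is a minimal Python sketch of the thinning algorithm, truncated to a finite time horizon so that it terminates (the rate function and horizon are just illustrative; the function name is my own):

```python
import math
import random

def nhpp_thinning(rate, rate_max, t_end):
    """Arrival times on [0, t_end] for an NHPP with rate function `rate`,
    thinned from a homogeneous process with rate rate_max >= rate(t)."""
    arrivals, t = [], 0.0
    while True:
        u = 1.0 - random.random()
        t -= math.log(u) / rate_max               # next potential arrival
        if t > t_end:
            return arrivals
        v = random.random()
        if v <= rate(t) / rate_max:               # accept w.p. rate(t) / rate_max
            arrivals.append(t)

# Illustrative rate function (the one used in the demo below): lambda(t) = 1 + sin(t/5).
print(nhpp_thinning(lambda t: 1.0 + math.sin(t / 5.0), rate_max=2.0, t_end=20.0))
```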

DEMO

In this demo, we will look at how to implement an NHPP in Excel.

In column A, we have potential NHPP arrival times.

In column B, we have the rate function Ξ»(t)=1+sin⁑(t/5)\lambda(t) = 1 + \sin(t/5). Because the sine function is periodic, so are the arrival rates generated by Ξ»(t)\lambda(t). In fact, Ξ»(t)\lambda(t) never goes below zero, and never goes above two, so Ξ»βˆ—=2\lambda^* = 2.

In column D, we have a sequence of Ui∼U(0,1)U_i \sim \mathcal U(0,1) random variables, which we will use to generate interarrival times.

In column E, we generate the potential interarrival times by transforming UiU_i into an Exp(Ξ»βˆ—\lambda^*) random variable.

In column F, we bootstrap the arrivals, generating the ith arrival by adding the (i-1)th arrival and the ith interarrival time. Note that these potential arrivals are the same values as in column A.

In column G, we generate another sequence of random variables Vi∼U(0,1)V_i \sim \mathcal U(0,1), which we will use to determine whether we accept the potential arrival as a real arrival.

In column H, we generate Ξ»(t)/Ξ»βˆ—\lambda(t)/\lambda^*, which refers to the probability that we accept the corresponding potential arrival at time tt.

If Vi<Ξ»(ti)/Ξ»βˆ—V_i < \lambda(t_i)/\lambda^*, then we accept the arrival, and we show this boolean value in column I.

In column J, we copy over the real arrival times for all potential arrivals where Vi<Ξ»(ti)/Ξ»βˆ—V_i < \lambda(t_i)/\lambda^*. Otherwise, we mark the cell as "FALSE".

Finally, in column K, we update the arrival rate for the next iteration.

Here's what the plot of arrival rates over time looks like for this NHPP. Every dot represents an accepted arrival. In the space between the dots, we may have generated potential arrivals, but we didn't accept them as real.

We can see that we have many real arrivals when the Ξ»(t)\lambda(t) is close to Ξ»βˆ—\lambda^* and few real arrivals when the two values diverge. Of course, this observation makes sense, given that we accept potential arrivals with probability Ξ»(t)/Ξ»βˆ—\lambda(t) / \lambda^*.

Time Series (OPTIONAL)

In this lesson, we will look at various time series processes, most of which are used in forecasting. In particular, we will be looking at auto-regressive moving average processes (ARMA), which have standard-normal noise, and auto-regressive Pareto (ARP) processes, which have Pareto noise.

First-Order Moving Average Process

A first-order moving average process, or MA(1), is a popular time series for modeling and detecting trends. This process is defined as follows:

Yi=Ο΅i+ΞΈΟ΅iβˆ’1,i=1,2,...,Y_i = \epsilon_i + \theta\epsilon_{i-1}, \quad i = 1,2,...,

Here, ΞΈ\theta is a constant, and the Ο΅i\epsilon_i terms are typically iid Nor(0,1) - in practice, they can be iid anything - and they are independent of Y0Y_0.

Notice that Y1=Ο΅1+ΞΈΟ΅0Y_1 = \epsilon_1 + \theta\epsilon_0, and Y2=Ο΅2+ΞΈΟ΅1Y_2 = \epsilon_2 + \theta\epsilon_1. Both Y1Y_1 and Y2Y_2 contain an Ο΅1\epsilon_1 term; generally, YiY_i and Yiβˆ’1Y_{i-1} pairs are correlated.

Let's derive the variance of Y_i. Remember that:

Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X,Y)

Let's consider Yi=Ο΅i+ΞΈΟ΅iβˆ’1Y_i = \epsilon_i + \theta\epsilon_{i-1}:

Var(Ο΅i+ΞΈΟ΅iβˆ’1)=Var(Ο΅i)+Var(ΞΈΟ΅iβˆ’1)+2Cov(Ο΅i,ΞΈΟ΅iβˆ’1)\text{Var}(\epsilon_i + \theta\epsilon_{i-1}) = \text{Var}(\epsilon_i) + \text{Var}(\theta\epsilon_{i-1}) + 2\text{Cov}(\epsilon_i,\theta\epsilon_{i-1})

Now, we know that successive Ο΅i\epsilon_i's are independent, so the covariance is zero:

Var(Ο΅i+ΞΈΟ΅iβˆ’1)=Var(Ο΅i)+Var(ΞΈΟ΅iβˆ’1)\text{Var}(\epsilon_i + \theta\epsilon_{i-1}) = \text{Var}(\epsilon_i) + \text{Var}(\theta\epsilon_{i-1})

Furthermore, we know that Var(aX)=a2Var(X)\text{Var}(aX) = a^2\text{Var}(X), so:

Var(Ο΅i+ΞΈΟ΅iβˆ’1)=Var(Ο΅i)+ΞΈ2Var(Ο΅iβˆ’1)\text{Var}(\epsilon_i + \theta\epsilon_{i-1}) = \text{Var}(\epsilon_i) + \theta^2\text{Var}(\epsilon_{i-1})

Finally, since the Ο΅i\epsilon_i's are Nor(0,1), they both have variance one, leaving us with:

Var(Ο΅i+ΞΈΟ΅iβˆ’1)=1+ΞΈ2\text{Var}(\epsilon_i + \theta\epsilon_{i-1}) = 1 + \theta^2

Now, let's look at the covariance between successive Yi,Yi+1Y_i, Y_{i+1} pairs. Here's a covariance property:

Cov(aX+bY,cW+dV)=acCov(X,W)+adCov(X,V)+bcCov(Y,W)+bdCov(Y,V)\begin{alignedat}{1} & \text{Cov}(aX + bY, cW + dV) = \\ & ac\text{Cov}(X,W) + ad\text{Cov}(X,V) + bc\text{Cov}(Y,W) + bd\text{Cov}(Y,V) \end{alignedat}

So:

Cov(Ο΅i+ΞΈΟ΅iβˆ’1,Ο΅i+1+ΞΈΟ΅i)=Cov(Ο΅i,Ο΅i+1)+ΞΈCov(Ο΅i,Ο΅i)+ΞΈCov(Ο΅iβˆ’1,Ο΅i+1)+ΞΈ2Cov(Ο΅iβˆ’1,Ο΅i)\begin{alignedat}{1} & \text{Cov}(\epsilon_i + \theta\epsilon_{i-1}, \epsilon_{i+1} + \theta\epsilon_i) = \\ & \text{Cov}(\epsilon_i, \epsilon_{i+1}) + \theta\text{Cov}(\epsilon_i,\epsilon_i) + \theta\text{Cov}(\epsilon_{i-1},\epsilon_{i+1}) + \theta^2\text{Cov}(\epsilon_{i-1},\epsilon_i) \end{alignedat}

Since the ϡi\epsilon_i's are independent, we know that Cov(ϡi,ϡj)=0,i≠j\text{Cov}(\epsilon_i, \epsilon_j) = 0, i \neq j. So:

Cov(Ο΅i+ΞΈΟ΅iβˆ’1,Ο΅i+1+ΞΈΟ΅i)=ΞΈCov(Ο΅i,Ο΅i)\text{Cov}(\epsilon_i + \theta\epsilon_{i-1}, \epsilon_{i+1} + \theta\epsilon_i) = \theta\text{Cov}(\epsilon_i,\epsilon_i)

We also know that Cov(X,X)=Var(X)\text{Cov}(X,X) = \text{Var}(X), and, since ϡi∼Nor(0,1)\epsilon_i \sim \text{Nor}(0,1), Var(ϡi)=1\text{Var}(\epsilon_i) = 1. So:

Cov(Ο΅i+ΞΈΟ΅iβˆ’1,Ο΅i+1+ΞΈΟ΅i)=ΞΈ\text{Cov}(\epsilon_i + \theta\epsilon_{i-1}, \epsilon_{i+1} + \theta\epsilon_i) = \theta

We can also demonstrate, using the above formula, that Cov(Yi,Yi+k)=0,kβ‰₯2\text{Cov}(Y_i, Y_{i+k}) = 0, k \geq 2.

As we can see, the covariances die off pretty quickly. We probably wouldn't use an MA(1) process to model, say, month-to-month unemployment, because unemployment rates remain correlated over several months, and this type of process cannot express that.

How might we generate an MA(1) process? Let's start with ϡ0∼Nor(0,1)\epsilon_0 \sim \text{Nor}(0,1). Then, we generate ϡ1∼Nor(0,1)\epsilon_1 \sim \text{Nor}(0,1) to get Y1Y_1, ϡ2∼Nor(0,1)\epsilon_2 \sim \text{Nor}(0,1) to get Y2Y_2, and so on. Every time we generate a new ϡi\epsilon_i, we can compute the corresponding YiY_i.
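Here is a minimal Python sketch of that recipe (random.gauss supplies the Nor(0,1) noise; the function name is my own):

```python
import random

def ma1(theta, n):
    """Generate Y_1, ..., Y_n from an MA(1) process with iid Nor(0,1) noise."""
    eps_prev = random.gauss(0.0, 1.0)          # epsilon_0
    ys = []
    for _ in range(n):
        eps = random.gauss(0.0, 1.0)
        ys.append(eps + theta * eps_prev)      # Y_i = eps_i + theta * eps_{i-1}
        eps_prev = eps
    return ys

print(ma1(theta=0.8, n=10))
```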

First-Order Autoregressive Process

Now let's look at a first-order autoregressive process, or AR(1), which is used to model many different real-world processes. The AR(1) process is defined as:

Yi=Ο•Yiβˆ’1+Ο΅i,i=1,2,...,Y_i = \phi Y_{i-1} + \epsilon_i, i =1,2,...,

In order for this process to remain stationary, we need to ensure that βˆ’1<Ο•<1-1 < \phi < 1, Y0∼Nor(0,1)Y_0 \sim \text{Nor}(0,1), and that the Ο΅i\epsilon_i's are iid Nor(0,1βˆ’Ο•2)\text{Nor}(0, 1 - \phi^2).

Unlike the MA(1) process, non-consecutive observations in the AR(1) process have non-zero correlation. Specifically, the covariance function between YiY_i and Yi+kY_{i+k} is as follows:

\text{Cov}(Y_i, Y_{i+k}) = \phi^{|k|}, \quad k = 0, \pm 1, \pm 2, \ldots

The correlations start out large for consecutive observations and decrease as kk increases, but they never quite become zero; there is always some correlation between any two observations in the series.

If \phi is close to one, observations in the series are highly positively correlated. Alternatively, if \phi is close to minus one, observations in the series are highly negatively correlated. If \phi is close to zero, then the Y_i's are nearly independent.

How do we generate observations from an AR(1) process? Let's start with Y0∼Nor(0,1)Y_0 \sim \text{Nor}(0,1) and Ο΅1∼1βˆ’Ο•2Nor(0,1)\epsilon_1 \sim \sqrt{1-\phi^2}\text{Nor}(0,1) to get Y1=Ο•Y0+Ο΅1Y_1 = \phi Y_0 + \epsilon_1. Then, we can generate Ο΅2∼1βˆ’Ο•2Nor(0,1)\epsilon_2 \sim \sqrt{1-\phi^2}\text{Nor}(0,1) to get Y2=Ο•Y1+Ο΅2Y_2 = \phi Y_1 + \epsilon_2, and so on.

Remember that:

aX \sim \text{Nor}(a\mu, a^2\sigma^2), \quad X \sim \text{Nor}(\mu, \sigma^2)

So:

1βˆ’Ο•2Ο΅i∼Nor(0,1βˆ’Ο•2),Ο΅i∼Nor(0,1)\sqrt{1-\phi^2}\epsilon_i \sim \text{Nor}(0, 1-\phi^2), \quad \epsilon_i \sim \text{Nor}(0,1)

Here are the plots of three AR(1) processes, each parameterized with a different value of Ο•\phi.
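Here is a minimal Python sketch of the AR(1) recipe above (standard library only; the function name is my own):

```python
import math
import random

def ar1(phi, n):
    """Generate Y_1, ..., Y_n from a stationary AR(1) process (-1 < phi < 1)."""
    y = random.gauss(0.0, 1.0)                  # Y_0 ~ Nor(0, 1)
    ys = []
    for _ in range(n):
        eps = math.sqrt(1.0 - phi * phi) * random.gauss(0.0, 1.0)   # Nor(0, 1 - phi^2)
        y = phi * y + eps
        ys.append(y)
    return ys

print(ar1(phi=0.95, n=10))
```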

ARMA(p,q) Process

The ARMA(p, q) process is an obvious generalization of the MA(1) and AR(1) processes: it combines a pth-order AR component with a qth-order MA component, and we define it as:

Y_i = \sum_{j=1}^p \phi_j Y_{i-j} + \epsilon_i + \sum_{j=1}^q \theta_j\epsilon_{i-j}, \quad i = 1,2,...,

In the first sum, we can see the pp autoregressive components, and, in the second sum, we can see the qq moving average components.

We have to take care to choose the Ο•j\phi_j's and the ΞΈj\theta_j's in such a way to ensure that the process doesn't blow up. In any event, ARMA(pp, qq) processes are used in a variety of forecasting and modeling applications.

Exponential AR Process

An exponential autoregressive process, or EAR, is an autoregressive process that has exponential, not normal, noise. Here's the definition:

Yi={Ο•Yiβˆ’1,w.p. ϕϕYiβˆ’1+Ο΅i,w.p.Β 1βˆ’Ο•i=1,2,...,Y_i = \begin{cases} \phi Y_{i-1}, & \text{w.p. } \phi \\ \phi Y_{i-1} + \epsilon_i, & \text{w.p. } 1 - \phi \\ \end{cases} \quad i = 1,2,...,

Here we need to ensure that 0≀ϕ<10 \leq \phi < 1, Y0∼Exp(1)Y_0 \sim \text{Exp}(1), and the Ο΅i\epsilon_i's are iid Exp(1) random variables.

Believe it or not, the EAR(1) has the same covariance function as the AR(1), except that the bounds of Ο•\phi are different:

Cov(Yi,Yi+k)=Ο•βˆ£k∣,0≀ϕ<1\text{Cov}(Y_i, Y_{i+k}) = \phi^{|k|}, 0 \leq \phi < 1

Let's look at a plot of an EAR(1) process with Ο•=0\phi = 0. These observations are iid exponential, and if we were to make a histogram of these observations, it would resemble the exponential distribution.

Here's a plot of an EAR(1) process with Ο•=0.95\phi = 0.95. We can see that when successive observations decrease, they appear to decrease exponentially. With probability 1βˆ’Ο•=0.051 - \phi = 0.05, they receive a jump from Ο΅i\epsilon_i, which causes the graph to shoot up.
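Here is a minimal Python sketch of an EAR(1) generator (random.expovariate supplies the Exp(1) terms; the function name is my own):

```python
import random

def ear1(phi, n):
    """Generate Y_1, ..., Y_n from an EAR(1) process (0 <= phi < 1)."""
    y = random.expovariate(1.0)                 # Y_0 ~ Exp(1)
    ys = []
    for _ in range(n):
        y = phi * y
        if random.random() >= phi:              # w.p. 1 - phi, add an Exp(1) jump
            y += random.expovariate(1.0)
        ys.append(y)
    return ys

print(ear1(phi=0.95, n=10))
```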

Autoregressive Pareto - ARP

A random variable XX has the Pareto distribution with parameters Ξ»>0\lambda > 0 and Ξ²>0\beta > 0 if it has the cdf:

FX(x)=1βˆ’(Ξ»/x)Ξ²,xβ‰₯Ξ»F_X(x) = 1 - (\lambda/x)^\beta, x \geq \lambda