In this lesson, we are going to continue our exploration of conditional expectation and look at several cool applications. This lesson will likely be the toughest one for the foreseeable future, but don't panic!

Let's revisit the conditional expectation of $Y$ given $X = x$. The definition of this expectation is as follows:

$E[Y|X = x] = \left\{\begin{matrix}
\sum_y y f(y|x) & \text{discrete} \\
\int_\R y f(y|x)dy & \text{continuous}
\end{matrix}\right.$

For example, suppose $f(x,y) = 21x^2y / 4$ for $x^2 \leq y \leq 1$. Then, by definition:

$f(y|x) = \frac{f(x,y)}{f_X(x)}$

We calculated the marginal pdf, $f_X(x)$, previously, as the integral of $f(x,y)$ over all possible values of $y \in [x^2, 1]$. We can plug in $f_X(x)$ and $f(x, y)$ below:

$f(y|x) = \frac{\frac{21}{4}x^2y}{\frac{21}{8}x^2(1- x^4)} = \frac{2y}{1 - x^4}, \quad x^2 \leq y \leq 1$

Given $f(y|x)$, we can now compute $E[Y | X = x]$:

$E[Y | X = x] = \int_\R y * \left(\frac{2y}{1 - x^4}\right)dy$

We adjust the limits of integration to match the limits of $y$:

$E[Y | X = x] = \int_{x^2}^1 y * \left(\frac{2y}{1 - x^4}\right)dy$

Now, complete the integration:

$E[Y | X = x] = \int_{x^2}^1 \frac{2y^2}{1 - x^4}dy$

$E[Y | X = x] = \frac{2}{1 - x^4} \int_{x^2}^1 y^2dy$

$E[Y | X = x] = \frac{2}{1 - x^4} \frac{y^3}{3}\Big|^1_{x^2}$

$E[Y | X = x] = \frac{2}{3(1 - x^4)} y^3\Big|^1_{x^2} = \frac{2(1 - x^6)}{3(1 - x^4)}$
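As a quick sanity check on the closed form, here is a minimal Python sketch that compares it with a numerical approximation of $\int_{x^2}^1 y f(y|x)dy$. The function names, the midpoint rule, and the evaluation point $x = 0.5$ are illustrative choices, not part of the lesson.

```python
def conditional_mean_closed_form(x):
    """E[Y | X = x] = 2(1 - x^6) / (3(1 - x^4)), valid for -1 < x < 1."""
    return 2 * (1 - x**6) / (3 * (1 - x**4))


def conditional_mean_numeric(x, steps=100_000):
    """Midpoint-rule approximation of the integral of y * f(y|x) over [x^2, 1]."""
    lo, hi = x**2, 1.0
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * width
        total += y * (2 * y / (1 - x**4)) * width  # y * f(y|x) * dy
    return total


x = 0.5  # hypothetical evaluation point
closed = conditional_mean_closed_form(x)
numeric = conditional_mean_numeric(x)
print(closed, numeric)
```

At $x = 0.5$ the closed form simplifies to $126/180 = 0.7$, and the numerical integral should agree to several decimal places.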

We just looked at the expected value of $Y$ given a particular value $X = x$. Now we are going to average the expected value of $Y$ over all values of $X$. In other words, we are going to take the average expected value of all the conditional expected values, which will give us the overall population average for $Y$.

The theorem of **double expectations** states that the expected value of the expected value of $Y$ given $X$ is the expected value of $Y$. In other words:

$E[E(Y|X)] = E[Y]$

Let's look at $E[Y|X]$. We can use the formula that we used to calculate $E[Y|X=x]$ to find $E[Y|X]$, replacing $x$ with $X$. Let's look back at our conditional expectation from the previous slide:

$E[Y | X = x] = \frac{2(1 - x^6)}{3(1 - x^4)}$

If we replace the fixed value $x$ with the random variable $X$, we get the following expression:

$E[Y | X = X] = E[Y | X] = \frac{2(1 - X^6)}{3(1 - X^4)}$

What does this mean? $E[Y|X]$ is itself a random variable that is a function of the random variable $X$. Let's call this function $h$:

$h(X) = \frac{2(1 - X^6)}{3(1 - X^4)}$

We now have to calculate $E[h(X)]$, which we can accomplish using the definition of LOTUS:

$E[h(X)] = \int_\R h(x)f_X(x)dx$

Let's substitute in for $h(x)$ and $h(X)$:

$E[E[Y|X]] = \int_\R E(Y|x)f_X(x)dx$

Remember the definition for $E[Y|X = x]$:

$E[Y|X = x] = \left\{\begin{matrix}
\sum_y y f(y|x) & \text{discrete} \\
\int_\R y f(y|x)dy & \text{continuous}
\end{matrix}\right.$

Thus:

$E[E[Y|X]] = \int_\R \left(\int_\R y f(y|x)dy\right)f_X(x)dx$

We can rearrange the right-hand side by swapping the order of integration. Note that we can move $y$ outside of the inner integral since it is a constant value when we integrate with respect to $x$:

$E[E[Y|X]] = \int_\R \int_\R y f(y|x)f_X(x)dx dy = \int_\R y \int_\R f(y|x)f_X(x)dx dy$

Remember now the definition for the conditional pdf:

$f(y|x) = \frac{f(x,y)}{f_X(x)}; \quad f(y|x)f_X(x) = f(x, y)$

We can substitute in $f(x,y)$ for $f(y|x)f_X(x)$:

$E[E[Y|X]] = \int_\R y \int_\R f(x,y)dx dy$

Let's remember the definition for the marginal pdf of $Y$:

$f_Y(y) = \int_\R f(x,y)dx$

Let's substitute:

$E[E[Y|X]] = \int_\R y f_Y(y)dy$

Of course, the expected value of $Y$, $E[Y]$ equals:

$E[Y] = \int_\R y f_Y(y) dy$

Thus:

$E[E[Y|X]] = E[Y]$

Let's apply this theorem using our favorite joint pdf: $f(x,y) = 21x^2y / 4, x^2 \leq y \leq 1$. Through previous examples, we know $f_X(x)$, $f_Y(y)$ and $E[Y|x]$:

$f_X(x) = \frac{21}{8}x^2(1-x^4)$

$f_Y(y) = \frac{7}{2}y^{\frac{5}{2}}$

$E[Y|x] = \frac{2(1 - x^6)}{3(1 - x^4)}$

We are going to look at two ways to compute $E[Y]$. First, we can just use the definition of expected value and integrate the product $y f_Y(y)$ over the real line:

$E[Y] = \int_\R y f_Y(y)dy$

$E[Y] = \int_0^1 y * \frac{7}{2}y^{\frac{5}{2}} dy$

$E[Y] = \int_0^1 \frac{7}{2}y^{\frac{7}{2}} dy$

$E[Y] = \frac{7}{2} \int_0^1 y^{\frac{7}{2}} dy$

$E[Y] = \frac{7}{2} \frac{2}{9}y^\frac{9}{2}\Big|_0^1 = \frac{7}{9} y^\frac{9}{2}\Big|_0^1 = \frac{7}{9}$

Now, let's calculate $E[Y]$ using the double expectation theorem we just learned:

$E[Y] = E[E(Y|X)] = \int_\R E(Y|x) f_X(x)dx$

$E[Y] = \int_{-1}^1 \frac{2(1 - x^6)}{3(1 - x^4)} \times \frac{21}{8}x^2(1-x^4) dx$

$E[Y] = \frac{42}{24}\int_{-1}^1 \frac{(1 - x^6)}{(1 - x^4)} \times x^2(1-x^4) dx$

$E[Y] = \frac{42}{24}\int_{-1}^1 x^2(1 - x^6) dx$

$E[Y] = \frac{42}{24}\int_{-1}^1 x^2 - x^8 dx$

$E[Y] = \frac{42}{24} \left(\frac{x^3}{3} - \frac{x^9}{9}\right) \Big|_{-1}^1$

$E[Y] = \frac{42}{24} \left(\frac{3x^3 - x^9}{9} \right) \Big|_{-1}^1$

$E[Y] = \frac{42}{24} \left(\frac{3 - 1 - (-3+1)}{9} \right) = \frac{42}{24} * \frac{4}{9} = \frac{7}{9}$
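Both routes can also be checked numerically. The sketch below is an illustration under stated assumptions: the helper `midpoint_integral` and the step count are hypothetical choices, and the midpoint rule is used so that the integrand is never evaluated at $x = \pm 1$, where $E[Y|x]$ has a removable $0/0$ form.

```python
def midpoint_integral(g, lo, hi, steps=200_000):
    """Approximate the integral of g over [lo, hi] with the midpoint rule."""
    width = (hi - lo) / steps
    return sum(g(lo + (i + 0.5) * width) for i in range(steps)) * width


def f_Y(y):
    return 3.5 * y**2.5  # marginal pdf of Y on [0, 1]


def f_X(x):
    return (21 / 8) * x**2 * (1 - x**4)  # marginal pdf of X on [-1, 1]


def cond_mean(x):
    return 2 * (1 - x**6) / (3 * (1 - x**4))  # E[Y | X = x]


# Route 1: E[Y] directly from the marginal pdf of Y.
direct = midpoint_integral(lambda y: y * f_Y(y), 0.0, 1.0)

# Route 2: E[Y] = E[E(Y|X)], integrating E[Y|x] against f_X(x).
double = midpoint_integral(lambda x: cond_mean(x) * f_X(x), -1.0, 1.0)

print(direct, double, 7 / 9)
```

Both approximations should land very close to $7/9 \approx 0.7778$.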

In this application, we are going to see how we can use double expectation to calculate the mean of a geometric distribution.

Let $Y$ equal the number of coin flips until a head, $H$, appears (counting the flip that lands heads), where $P(H) = p$. Thus, $Y$ is distributed as a geometric random variable parameterized by $p$: $Y \sim \text{Geom}(p)$. We know that the pmf of $Y$ is $f_Y(y) = P(Y = y) = (1-p)^{y-1}p, y = 1,2,...$. In other words, $P(Y = y)$ is the product of the probability of $y-1$ failures and the probability of one success.

Let's calculate the expected value of $Y$ using the summation equation we've used previously (take the result on faith):

$E[Y] = \sum_y y f_Y(y) = \sum_{y=1}^\infty y(1-p)^{y-1}p = \frac{1}{p}$

Now we are going to use double expectation and a *standard one-step conditioning argument* to compute $E[Y]$. First, let's define $X = 1$ if the first flip is $H$ and $X = 0$ otherwise. Let's pretend that we have knowledge of the first flip. We don't really have this knowledge, but we do know that the first flip can either be heads or tails: $P(X = 1) = p, P(X = 0) = 1 - p$.

Let's remember the double expectation formula:

$E[Y] = E[E(Y|X)] = \sum_x E(Y|x)f_X(x)$

What are the $x$-values? $X$ can only equal $0$ or $1$, so:

$E[Y] = E(Y|X = 0)P(X = 0) + E(Y|X=1)P(X=1)$

Now, if $X= 0$, the first flip was tails, and I have to start counting all over again. The expected number of flips I have to make until I see heads is $E[Y]$. However, I have already flipped once, and I flipped tails: that's what $X = 0$ means. So, the expected number of flips I need, given that I already flipped tails, is $1 + E[Y]$: $E[Y|X=0] = 1 + E[Y]$. What is $P(X = 0)$? It's just $1 - p$. Thus:

$E[Y|X = 0]P(X = 0) = (1 + E[Y])(1 - p)$

Now, if $X = 1$, the first flip was heads. I won! Given that $X = 1$, the expected value of $Y$ is one. If I know that I flipped heads on the first try, the expected number of trials until I flip heads is that one trial: $E[Y|X=1] = 1$. What is $P(X = 1)$? It's just $p$. Thus:

$E[Y|X = 1]P(X = 1) = (1)(p) = p$

Let's solve for $E[Y]$:

$E[Y] = (1 + E[Y])(1 - p) + p$

$E[Y] = 1 + E[Y] -p -pE[Y] + p$

$E[Y] = 1 + E[Y] - pE[Y]$

$pE[Y] = 1; \quad E[Y] = \frac{1}{p}$
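A small simulation offers an empirical check of $E[Y] = 1/p$. This is only a sketch: the value $p = 0.25$, the seed, and the number of replications are arbitrary assumptions made for illustration.

```python
import random


def flips_until_head(p, rng):
    """Count flips up to and including the first head, where P(head) = p."""
    flips = 1
    while rng.random() >= p:  # rng.random() < p counts as a head
        flips += 1
    return flips


rng = random.Random(0)  # fixed seed so the run is reproducible
p = 0.25                # hypothetical success probability
n = 200_000             # number of simulated experiments

sample_mean = sum(flips_until_head(p, rng) for _ in range(n)) / n
print(sample_mean, 1 / p)
```

With these settings, the sample mean should sit close to $1/p = 4$, up to simulation noise.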

Let $A$ be some event. We define the random variable $Y=1$ if $A$ occurs, and $Y = 0$ otherwise. We refer to $Y$ as an indicator function of $A$; that is, the value of $Y$ indicates the occurrence of $A$. The expected value of $Y$ is given by:

$E[Y] = \sum_y y f_Y(y)$

Let's enumerate the $y$-values:

$E[Y] = 0(P(Y = 0)) + 1(P(Y = 1)) = P(Y = 1)$

What is $P(Y = 1)$? Well, $Y = 1$ when $A$ occurs, so $P(Y = 1) = P(A) = E[Y]$. Indeed, the expected value of an indicator function is the probability of the corresponding event.

Similarly, for any random variable, $X$, we have:

$E[Y | X = x] = \sum_y y f_Y(y|x)$

If we enumerate the $y$-values, we have:

$\begin{alignedat}{1}
E[Y | X = x] & = 0 \cdot P(Y = 0|X= x) + 1 \cdot P(Y = 1|X = x) \\[2ex]
& = P(Y = 1|X = x)
\end{alignedat}$

Since $Y = 1$ exactly when $A$ occurs, $P(Y = 1|X = x) = P(A|X = x)$, so:

$E[Y | X = x] = P(A|X = x)$

Let's look at an implication of the above result. By definition:

$P[A] = E[Y] = E[E(Y | X)]$

Using LOTUS:

$P[A] = \int_\R E[Y|X=x]dF_X(x)$

Since we saw that $E[Y|X=x] = P(A|X=x)$, then:

$P[A] = \int_\R P(A|X=x)dF_X(x)$

The result above implies that, if $X$ and $Y$ are independent, continuous random variables, then:

$P(Y < X) = \int_\R P(Y < x)f_X(x)dx$

To prove, let $A = \{Y < X\}$. Then:

$P[A] = \int_\R P(A|X=x)dF_X(x)$

Substitute $A = \{Y < X\}$:

$P[A] = \int_\R P(Y < X|X=x)dF_X(x)$

What's $P(Y < X|X=x)$? In other words, for a given $X = x$, what's the probability that $Y < X$? That's a long way of saying $P(Y < x)$:

$P[A] = \int_\R P(Y < x)dF_X(x)$

$P[A] = P[Y < X] = \int_\R P(Y < x)f_X(x)dx, \quad dF_X(x) = f_X(x)dx$

Suppose we have two independent random variables, $X \sim \text{Exp}(\mu)$ and $Y \sim \text{Exp}(\lambda)$. Then:

$P(Y < X) = \int_\R P(Y < x)f_X(x)dx$

Note that $P(Y < x)$ is the cdf of $Y$ at $x$: $F_Y(x)$. Thus:

$P(Y < X) = \int_\R F_Y(x)f_X(x)dx$

Since $X$ and $Y$ are both exponentially distributed, we know that they have the following pdf and cdf, by definition:

$f(x; \lambda) = \lambda e^{-\lambda x}, x \geq 0$

$F(x; \lambda) = 1 - e^{-\lambda x}, x \geq 0$

Let's substitute these values in, adjusting the limits of integration appropriately:

$P(Y < X) = \int_0^\infty \left(1 - e^{-\lambda x}\right)\left(\mu e^{-\mu x}\right)dx$

Let's rearrange:

$P(Y < X) = \mu \int_0^\infty e^{-\mu x} - e^{-\lambda x - \mu x} dx$

$P(Y < X) = \mu \left[\int_0^\infty e^{-\mu x} dx - \int_0^\infty e^{-\lambda x - \mu x} dx\right]$

Let $u_1 = -\mu x$. Then $du_1 = -\mu dx$. Let $u_2 = -\lambda x - \mu x$. Then $du_2 = -(\lambda + \mu)dx$. Keeping the limits of integration in terms of $x$:

$P(Y < X) = \mu \left[-\int_{x=0}^{x=\infty} \frac{e^{u_1}}{\mu} du_1 + \int_{x=0}^{x=\infty} \frac{e^{u_2}}{\lambda + \mu} du_2\right]$

Now we can integrate:

$P(Y < X) = \mu \left[\int_{x=0}^{x=\infty} \frac{e^{u_2}}{\lambda + \mu} du_2 - \int_{x=0}^{x=\infty} \frac{e^{u_1} }{\mu}du_1 \right]$

$P(Y < X) = \mu \left[\frac{e^{u_2}}{\lambda + \mu} - \frac{e^{u_1}}{\mu} \right]_{x=0}^{x=\infty}$

$P(Y < X) = \mu \left[\frac{e^{-\lambda x - \mu x}}{\lambda + \mu} - \frac{e^{-\mu x}}{\mu} \right]_0^\infty$

$P(Y < X) = \mu \left[0 - \frac{1}{\lambda + \mu} + \frac{1}{\mu} \right]$

$P(Y < X) = \mu \left[\frac{1}{\mu} - \frac{1}{\lambda + \mu} \right]$

$P(Y < X) = \frac{\mu}{\mu} - \frac{\mu}{\lambda + \mu}$

$P(Y < X) = \frac{\lambda + \mu}{\lambda + \mu} - \frac{\mu}{\lambda + \mu} = \frac{\lambda}{\lambda + \mu}$

As it turns out, this result makes sense because $X$ and $Y$ correspond to interarrival times from two independent Poisson processes, and $\mu$ and $\lambda$ are the respective arrival rates. For example, suppose that $Y$ is the time until the next woman arrives at a store, and $X$ is the time until the next man arrives. If women are coming in at a rate of three per hour ($\lambda = 3$) and men are coming in at a rate of nine per hour ($\mu = 9$), then the probability of a woman arriving before a man is $P(Y < X) = 3 / (3 + 9) = 1/4$.
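The closed form $\lambda / (\lambda + \mu)$ can be compared against a simulation. The sketch below uses Python's standard `random.expovariate`, which takes a rate parameter; the rates $\lambda = 2$ and $\mu = 1$, the seed, and the sample size are hypothetical values chosen only for this illustration.

```python
import random


def prob_y_before_x(lam, mu):
    """Closed form P(Y < X) for independent Y ~ Exp(lam) and X ~ Exp(mu)."""
    return lam / (lam + mu)


def simulate_prob_y_before_x(lam, mu, n=200_000, seed=1):
    """Estimate P(Y < X) by drawing independent exponential pairs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        y = rng.expovariate(lam)  # Y has rate lam
        x = rng.expovariate(mu)   # X has rate mu
        if y < x:
            hits += 1
    return hits / n


lam, mu = 2.0, 1.0  # hypothetical rates
exact = prob_y_before_x(lam, mu)
estimate = simulate_prob_y_before_x(lam, mu)
print(exact, estimate)
```

The variable with the larger rate tends to "arrive" first, so with $\lambda = 2$ and $\mu = 1$ the estimate should be near $2/3$.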

Just as we can use double expectation for the expected value of $Y$, we can express the variance of $Y$, $\text{Var}(Y)$ in a similar fashion, which we refer to as **variance decomposition**:

$\text{Var}(Y) = E[\text{Var}(Y|X)] + \text{Var}[E(Y|X)]$

Let's start with the first term: $E[\text{Var}(Y|X)]$. Remember the definition of variance, as the second central moment:

$\text{Var}(X) = E[X^2] - (E[X])^2$

Thus, we can express $E[\text{Var}(Y|X)]$ as:

$E[\text{Var}(Y|X)] = E[E[Y^2 | X] - (E[Y|X])^2]$

Note that, since expectation is linear:

$E[\text{Var}(Y|X)] = E[E[Y^2 | X]] - E[(E[Y|X])^2]$

Notice the first expression on the right-hand side. That's a double expectation, and we know how to simplify that:

$E[\text{Var}(Y|X)] = E[Y^2] - E[(E[Y|X])^2], \quad 1.$

Now let's look at the second term in the variance decomposition: $\text{Var}[E(Y|X)]$. Considering again the definition for variance above, we can transform this term:

$\text{Var}[E(Y|X)] = E[(E[Y | X])^2] - (E[E[Y|X]])^2$

In this equation, we again see a double expectation, quantity squared. So:

$\text{Var}[E(Y|X)] = E[(E[Y| X])^2] - (E[Y])^2, \quad 2.$

Remember the equation for variance decomposition:

$\text{Var}(Y) = E[\text{Var}(Y|X)] + \text{Var}[E(Y|X)]$

Let's plug in $1$ and $2$ for the first and second term, respectively:

$\text{Var}(Y) =E[Y^2] - E[(E[Y|X])^2] + E[(E[Y | X])^2] - (E[Y])^2$

Notice the cancellation of the two scary inner terms to reveal the definition for variance:

$\text{Var}(Y) = E[Y^2] - E[Y]^2 = \text{Var}(Y)$
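To see the decomposition hold on concrete numbers, the following sketch computes both terms exactly with `fractions.Fraction`. The joint pmf here is a made-up example chosen only for illustration; it does not come from the lesson.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X = x, Y = y), keyed by (x, y).
joint = {
    (0, 1): F(1, 4), (0, 3): F(1, 4),
    (1, 2): F(1, 8), (1, 6): F(3, 8),
}


def mean(pmf):
    return sum(v * p for v, p in pmf.items())


def variance(pmf):
    m = mean(pmf)
    return sum((v - m) ** 2 * p for v, p in pmf.items())


# Marginal pmfs of X and Y.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

# Conditional pmf of Y for each value of X.
cond = {
    x: {y: p / p_x[x] for (xx, y), p in joint.items() if xx == x}
    for x in p_x
}

# E[Var(Y|X)]: average the conditional variances over X.
expected_cond_var = sum(variance(cond[x]) * p_x[x] for x in p_x)

# Var(E[Y|X]): variance of the conditional means over X.
cond_means = {x: mean(cond[x]) for x in p_x}
overall_mean = sum(cond_means[x] * p_x[x] for x in p_x)
var_of_cond_mean = sum(
    (cond_means[x] - overall_mean) ** 2 * p_x[x] for x in p_x
)

total_var = variance(p_y)
print(expected_cond_var, var_of_cond_mean, total_var)
```

For this pmf the two pieces are $2$ and $9/4$, and they add up to $\text{Var}(Y) = 17/4$ exactly.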

In this lesson, we are going to talk about independence, covariance, correlation, and some related results. Correlation shows up all over the place in simulation, from inputs to outputs to everywhere in between.

Suppose that $h(X,Y)$ is some function of two random variables, $X$ and $Y$. Then, via LOTUS, we know how to calculate the expected value, $E[h(X,Y)]$:

$E[h(X,Y)] = \left\{\begin{matrix}
\sum_x \sum_y h(x,y)f(x,y) & \text{if (X,Y) is discrete} \\
\int_\R \int_\R h(x,y)f(x,y)dx dy & \text{if (X,Y) is continuous} \\
\end{matrix}\right.$

Whether or not $X$ and $Y$ are independent, the sum of the expected values equals the expected value of the sum:

$E[X+Y] = E[X] + E[Y]$

If $X$ and $Y$ are independent, then the sum of the variances equals the variance of the sum:

$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$

Note that we need the equations for LOTUS in two dimensions to prove both of these theorems.

Aside: I tried to prove these theorems. It went terribly! Check out the proper proofs here.

Let's suppose we have a set of $n$ random variables: $X_1,...,X_n$. This set is said to form a **random sample** from the pmf/pdf $f(x)$ if all the variables are (i) independent and (ii) each $X_i$ has the same pdf/pmf $f(x)$.

We can use the following notation to refer to such a random sample:

$X_1,...,X_n \overset{\text{iid}}{\sim} f(x)$

Note that "iid" means "independent and identically distributed", which is what (i) and (ii) mean, respectively, in our definition above.

Given a random sample, $X_1,...,X_n \overset{\text{iid}}{\sim} f(x)$, the sample mean, $\bar{X_n}$ equals the following:

$\bar{X_n} \equiv \sum_{i =1}^n \frac{X_i}{n}$

Given the sample mean, the expected value of the sample mean is the expected value of any of the individual variables, and the variance of the sample mean is the variance of any of the individual variables divided by $n$:

$E[\bar{X_n}] =E[X_i]; \quad\text{Var}(\bar{X_n}) = \text{Var}(X_i) / n$

We can observe that as $n$ increases, $E[\bar{X_n}]$ is unaffected, but $\text{Var}(\bar{X_n})$ decreases.
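A simulation can illustrate both claims. The sketch below assumes $X_i \overset{\text{iid}}{\sim} \text{Unif}(0,1)$, for which $E[X_i] = 1/2$ and $\text{Var}(X_i) = 1/12$; the distribution, sample size, seed, and replication count are all illustrative assumptions.

```python
import random


def sample_mean_stats(n, reps=50_000, seed=2):
    """Simulate `reps` sample means of n Unif(0,1) draws.

    Returns the average and the variance of those sample means.
    """
    rng = random.Random(seed)
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(reps)]
    grand_mean = sum(means) / reps
    var_of_means = sum((m - grand_mean) ** 2 for m in means) / reps
    return grand_mean, var_of_means


n = 10  # hypothetical sample size
grand_mean, var_of_means = sample_mean_stats(n)
print(grand_mean, var_of_means, (1 / 12) / n)
```

The average of the sample means should stay near $1/2$ regardless of $n$, while their variance should be close to $(1/12)/n$.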

Covariance is one of the most fundamental measures of non-independence between two random variables. The **covariance** between $X$ and $Y$, $\text{Cov}(X, Y)$ is defined as:

$\text{Cov}(X,Y) \equiv E[(X-E[X])(Y - E[Y])]$

The right-hand side of this equation looks daunting, so let's see if we can simplify it. We can first expand the product:

$\begin{alignedat}{1}
& E[(X-E[X])(Y - E[Y])] = \\
& E[XY - XE[Y] - YE[X] + E[Y]E[X]]
\end{alignedat}$

Since expectation is linear, we can rewrite the right-hand side as a difference of expected values:

$\begin{alignedat}{1}
& E[(X-E[X])(Y - E[Y])] = \\
& E[XY] - E[XE[Y]] - E[YE[X]] + E[E[Y]E[X]]
\end{alignedat}$

Note that both $E[X]$ and $E[Y]$ are just numbers: the expected values of the corresponding random variables. As a result, we can apply two principles here: $E[aX] = aE[X]$ and $E[a] = a$. Consider the following rearrangement:

$\begin{alignedat}{1}
& E[(X-E[X])(Y - E[Y])] = \\
& E[XY] - E[Y]E[X] - E[X]E[Y] + E[Y]E[X]
\end{alignedat}$

The last three terms are the same up to sign, and they sum to $-E[Y]E[X]$. Thus:

$\begin{alignedat}{1}
\text{Cov}(X,Y) & \equiv E[(X-E[X])(Y - E[Y])] \\[2ex]
& = E[XY] - E[Y]E[X]
\end{alignedat}$

This equation is much easier to work with; namely, $h(X,Y) = XY$ is a much simpler function than $h(X,Y) = (X-E[X])(Y - E[Y])$ when it comes time to apply LOTUS.

Let's understand what happens when we take the covariance of $X$ with itself:

$\begin{alignedat}{1}
\text{Cov}(X,X) & = E[X * X] - E[X]E[X] \\[2ex]
& = E[X^2] - (E[X])^2 \\[2ex]
& = \text{Var}(X)
\end{alignedat}$

If $X$ and $Y$ are independent random variables, then $\text{Cov}(X, Y) = 0$. On the other hand, a covariance of $0$ does **not** mean that $X$ and $Y$ are independent.

For example, consider two random variables, $X \sim \text{Unif}(-1,1)$ and $Y = X^2$. Since $Y$ is a function of $X$, the two random variables are dependent: if you know $X$, you know $Y$. However, take a look at the covariance:

$\text{Cov}(X, Y) = E[X^3] - E[X]E[X^2]$

What is $E[X]$? Well, we can integrate the pdf from $-1$ to $1$, or we can understand that the expected value of a uniform random variable is the average of the bounds of the distribution. That's a long way of saying that $E[X] =(-1 + 1) / 2 = 0$.

Now, what is $E[X^3]$? We can apply LOTUS:

$E[X^3] = \int_{-1}^1 x^3f(x)dx$

What is the pdf of a uniform random variable? By definition, it's one over the difference of the bounds:

$E[X^3] = \frac{1}{1 - (-1)}\int_{-1}^1 x^3dx$

Let's integrate and evaluate:

$E[X^3] = \frac{1}{2} \frac{x^4}{4}\Big|_{-1}^1 = \frac{1^4}{8} - \frac{(-1)^4}{8} = 0$

Thus:

$\text{Cov}(X, Y) = E[X^3] - E[X]E[X^2] = 0$

Just because the covariance between $X$ and $Y$ is $0$ does not mean that they are independent!
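The same conclusion can be reached with exact arithmetic. This sketch uses the moment formula $E[X^k] = \int_{-1}^1 \frac{x^k}{2}dx = \frac{1 - (-1)^{k+1}}{2(k+1)}$ for $X \sim \text{Unif}(-1,1)$; the helper name `uniform_moment` is a hypothetical choice.

```python
from fractions import Fraction


def uniform_moment(k):
    """E[X^k] for X ~ Unif(-1, 1), i.e. the integral of x^k / 2 over [-1, 1]."""
    return Fraction(1 - (-1) ** (k + 1), 2 * (k + 1))


# With Y = X^2: Cov(X, Y) = E[X^3] - E[X] E[X^2].
cov_x_y = uniform_moment(3) - uniform_moment(1) * uniform_moment(2)
print(cov_x_y)  # odd moments vanish by symmetry, so this is 0
```

The covariance is exactly zero even though $Y$ is completely determined by $X$.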

Suppose that we have two random variables, $X$ and $Y$, as well as two constants, $a$ and $b$. We have the following theorem:

$\text{Cov}(aX, bY) = ab\text{Cov}(X,Y)$

Whether or not $X$ and $Y$ are independent,

$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$

$\text{Var}(X - Y) = \text{Var}(X) + \text{Var}(Y) - 2\text{Cov}(X, Y)$

Note that we looked at a theorem previously which gave an equation for the variance of $X + Y$ when both variables are independent: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$. That equation was a special case of the theorem above, where $\text{Cov}(X,Y) = 0$ as is the case between two independent random variables.

The **correlation** between $X$ and $Y$, $\rho$, is equal to:

$\rho \equiv \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}}$

Note that correlation is *standardized covariance*. In other words, for any $X$ and $Y$, $-1 \leq \rho \leq 1$.

If two variables are highly correlated, then $\rho$ will be close to $1$. If two variables are highly negatively correlated, then $\rho$ will be close to $-1$. Two variables with low correlation will have a $\rho$ close to $0$.

Consider the following joint pmf:

$\begin{array}{c|ccc|c}
f(x,y) & X = 2 & X = 3 & X = 4 & f_Y(y) \\ \hline
Y = 40 & 0.00 & 0.20 & 0.10 & 0.3 \\
Y = 50 & 0.15 & 0.10 & 0.05 & 0.3 \\
Y = 60 & 0.30 & 0.00 & 0.10 & 0.4 \\ \hline
f_X(x) & 0.45 & 0.30 & 0.25 & 1 \\
\end{array}$

For this pmf, $X$ can take values in $\{2, 3, 4\}$ and $Y$ can take values in $\{40, 50, 60\}$. Note the marginal pmfs along the table's right and bottom, and remember that all pmfs sum to one when calculated over all appropriate values.

What is the expected value of $X$? Let's use $f_X(x)$:

$E[X] = 2(0.45) + 3(0.3) + 4(0.25) = 2.8$

Now let's calculate the variance:

$\text{Var}(X) = E[X^2] - (E[X])^2$

$\text{Var}(X) = 4(0.45) + 9(0.3) + 16(0.25) - (2.8)^2 = 0.66$

What is the expected value of $Y$? Let's use $f_Y(y)$:

$E[Y] = 40(0.3) + 50(0.3) + 60(0.4) = 51$

Now let's calculate the variance:

$\text{Var}(Y) = E[Y^2] - (E[Y])^2$

$\text{Var}(Y) = 1600(0.3) + 2500(0.3) + 3600(0.4) - (51)^2 = 69$

If we want to calculate the covariance of $X$ and $Y$, we need to know $E[XY]$, which we can calculate using two-dimensional LOTUS:

$E[XY] = \sum_x \sum_y xy f(x,y)$

$E[XY] = (2 * 40 * 0.00) + (2 * 50 * 0.15) + ... + (4 * 60 * 0.1) = 140$

With $E[XY]$ in hand, we can calculate the covariance of $X$ and $Y$:

$\text{Cov}(X,Y) = E[XY] - E[X]E[Y] = 140 - (2.8 * 51) = -2.8$

Finally, we can calculate the correlation:

$\rho = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}}$

$\rho = \frac{-2.8}{\sqrt{0.66(69)}} \approx -0.415$
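The whole calculation can be reproduced in a few lines of Python. The sketch below stores the joint pmf from the table in a dictionary keyed by $(x, y)$ and applies two-dimensional LOTUS through a small helper, `expect`, whose name is an illustrative choice.

```python
import math

# Joint pmf f(x, y) from the table, keyed by (x, y).
joint = {
    (2, 40): 0.00, (3, 40): 0.20, (4, 40): 0.10,
    (2, 50): 0.15, (3, 50): 0.10, (4, 50): 0.05,
    (2, 60): 0.30, (3, 60): 0.00, (4, 60): 0.10,
}


def expect(h):
    """Two-dimensional LOTUS: E[h(X, Y)] = sum over (x, y) of h(x, y) f(x, y)."""
    return sum(h(x, y) * p for (x, y), p in joint.items())


e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: x * x) - e_x**2
var_y = expect(lambda x, y: y * y) - e_y**2
e_xy = expect(lambda x, y: x * y)

cov = e_xy - e_x * e_y
rho = cov / math.sqrt(var_x * var_y)
print(e_x, var_x, e_y, var_y, e_xy, cov, rho)
```

Up to floating-point rounding, the printed values should match the hand calculations above, with $\rho \approx -0.415$.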

Let's look at two different assets, $S_1$ and $S_2$, that we hold in our portfolio. The expected yearly returns of the assets are $E[S_1] = \mu_1$ and $E[S_2] = \mu_2$, and the variances are $\text{Var}(S_1) = \sigma_1^2$ and $\text{Var}(S_2) = \sigma_2^2$. The covariance between the assets is $\sigma_{12}$.

A portfolio is just a weighted combination of assets, and we can define our portfolio, $P$, as:

$P = wS_1 + (1 - w)S_2, \quad w \in [0,1]$

The portfolio's expected value is the sum of the expected values of the assets times their corresponding weights:

$E[P] = E[wS_1 + (1 - w)S_2]$

$E[P] = E[wS_1] + E[(1 - w)S_2]$

$E[P] = wE[S_1] + (1 - w)E[S_2]$

$E[P] = w\mu_1 + (1-w)\mu_2$

Let's calculate the variance of the portfolio:

$\text{Var}(P) = \text{Var}(wS_1 + (1-w)S_2)$

Remember how we express $\text{Var}(X + Y)$:

$\text{Var}(P) = \text{Var}(wS_1) + \text{Var}((1-w)S_2) + 2\text{Cov}(wS_1, (1-w)S_2)$

Remember that $\text{Var}(aX) = a^2\text{Var}(X)$ and $\text{Cov}(aX, bY) = ab\text{Cov}(X,Y)$. Thus:

$\text{Var}(P) = w^2\text{Var}(S_1) + (1-w)^2\text{Var}(S_2) + 2w(1-w)\text{Cov}(S_1, S_2)$

Finally, let's substitute in the appropriate variables:

$\text{Var}(P) = w^2\sigma^2_1 + (1-w)^2\sigma^2_2 + 2w(1-w)\sigma_{12}$

How might we optimize this portfolio? One thing we might want to optimize for is minimal variance: many people want their portfolios to have as little volatility as possible.

Let's recap. Given a function $f(x)$, how do we find the $x$ that minimizes $f(x)$? We can take the derivative, $f'(x)$, set it to $0$ and then solve for $x$. Let's apply this logic to $\text{Var}(P)$. First, we take the derivative with respect to $w$:

$\frac{d}{dw}\text{Var}(P) = 2w\sigma^2_1 - 2(1-w)\sigma^2_2 + 2\sigma_{12} - 4w\sigma_{12}$

$\frac{d}{dw}\text{Var}(P) = 2w\sigma^2_1 - 2\sigma^2_2 +2w\sigma^2_2 + 2\sigma_{12} - 4w\sigma_{12}$

Then, we set the derivative equal to $0$ and solve for $w$:

$0 = 2w\sigma^2_1 - 2\sigma^2_2 +2w\sigma^2_2 + 2\sigma_{12} - 4w\sigma_{12}$

$0 = w\sigma^2_1 - \sigma^2_2 +w\sigma^2_2 + \sigma_{12} - 2w\sigma_{12}$

$\sigma^2_2 - \sigma_{12} = w\sigma^2_1 +w\sigma^2_2 - 2w\sigma_{12}$

$\sigma^2_2 - \sigma_{12} = w(\sigma^2_1 +\sigma^2_2 - 2\sigma_{12})$

$\frac{\sigma^2_2 - \sigma_{12}}{\sigma^2_1 +\sigma^2_2 - 2\sigma_{12}} = w$

Suppose $E[S_1] = 0.2$, $E[S_2] = 0.1$, $\text{Var}(S_1) = 0.2$, $\text{Var}(S_2) = 0.4$, and $\text{Cov}(S_1, S_2) = -0.1$.

What value of $w$ maximizes the expected return of this portfolio? We don't even have to do any math: just allocate 100% of the portfolio to the asset with the higher expected return - $S_1$. Since we define our portfolio as $wS_1 + (1 - w)S_2$, the correct value for $w$ is $1$.

What value of $w$ minimizes the variance? Let's plug and chug:

$w = \frac{\sigma^2_2 - \sigma_{12}}{\sigma^2_1 +\sigma^2_2 - 2\sigma_{12}}$

$w = \frac{0.4 + 0.1}{0.2 + 0.4 + 0.2} = 0.5 / 0.8 = 0.625$

To minimize variance, we should hold a portfolio consisting of $5/8$ $S_1$ and $3/8$ $S_2$.
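Here is a short sketch of the same computation in Python. The function names are hypothetical. Note one assumption the closed form relies on: it identifies a minimum only when $\sigma_1^2 + \sigma_2^2 - 2\sigma_{12} > 0$, and the resulting $w$ is not guaranteed to fall in $[0,1]$ for every choice of inputs.

```python
def portfolio_variance(w, var1, var2, cov12):
    """Var(P) for P = w*S1 + (1 - w)*S2."""
    return w**2 * var1 + (1 - w) ** 2 * var2 + 2 * w * (1 - w) * cov12


def min_variance_weight(var1, var2, cov12):
    """Weight on S1 that sets d/dw Var(P) to zero."""
    return (var2 - cov12) / (var1 + var2 - 2 * cov12)


var1, var2, cov12 = 0.2, 0.4, -0.1  # values from the example above
w_star = min_variance_weight(var1, var2, cov12)
best = portfolio_variance(w_star, var1, var2, cov12)

print(w_star, best)
# Nudging w in either direction should not reduce the variance.
print(portfolio_variance(w_star - 0.01, var1, var2, cov12) >= best)
print(portfolio_variance(w_star + 0.01, var1, var2, cov12) >= best)
```

For these inputs the weight comes out to $0.625$, and the portfolio variance at that weight is lower than the variance of either asset held alone.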

There are tradeoffs in any optimization. For example, optimizing for maximal expected return may introduce high levels of volatility into the portfolio. Conversely, optimizing for minimal variance may result in paltry returns.

In this lesson, we are going to review several popular discrete and continuous distributions.

Suppose we have a random variable, $X \sim \text{Bernoulli}(p)$. $X$ has the following pmf:

$f(x) = \left\{\begin{matrix}
p & \text{if } x = 1 \\
1 - p (= q) & \text{if } x = 0
\end{matrix}\right.$

Additionally, $X$ has the following properties:

$E[X] = p$

$\text{Var}(X) = pq$

$M_X(t) = pe^t + q$

The Bernoulli distribution generalizes to the binomial distribution. Suppose we have $n$ iid Bernoulli random variables: $X_1,X_2,...,X_n \overset{\text{iid}}\sim \text{Bern}(p)$. Each $X_i$ takes on the value $1$ with probability $p$ and $0$ with probability $1-p$. If we take the sum of the successes, we have the following random variable, $Y$:

$Y = \sum_{i = 1}^n X_i \sim \text{Bin}(n,p)$

$Y$ has the following pmf:

$f(y) = \binom{n}{y}p^yq^{n-y}, \quad y = 0, 1,...,n.$

Notice the binomial coefficient in this equation. We read this as "n choose y", which is defined as:

$\binom{n}{y} = \frac{n!}{y!(n-y)!}$

What's going on here? First, what is the probability of $y$ successes? Well, more completely, it's the probability of $y$ successes and $n-y$ failures: $p^yq^{n-y}$. Of course, the outcome of $y$ *consecutive* successes followed by $n-y$ *consecutive* failures is just one particular arrangement of many. How many? $n$ choose $y$. This is what the binomial coefficient expresses.

Additionally, $Y$ has the following properties:

$E[Y] = np$

$\text{Var}(Y) = npq$

$M_Y(t) = (pe^t + q)^n$

Note that the variance and the expected value are equal to $n$ times the variance and the expected value of the Bernoulli random variable. This relationship makes sense: a binomial random variable is the sum of $n$ Bernoulli's. The moment-generating function looks a little bit different. As it turns out, we multiply the moment-generating functions when we sum the random variables.
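The pmf and its moments are easy to check numerically with `math.comb`. In this sketch, $n = 10$ and $p = 0.3$ are arbitrary illustrative parameters.

```python
from math import comb


def binomial_pmf(y, n, p):
    """P(Y = y) for Y ~ Bin(n, p)."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)


n, p = 10, 0.3  # hypothetical parameters
pmf = [binomial_pmf(y, n, p) for y in range(n + 1)]

total = sum(pmf)                                              # should be 1
mean = sum(y * f for y, f in enumerate(pmf))                  # should be n*p
var = sum((y - mean) ** 2 * f for y, f in enumerate(pmf))     # should be n*p*q
print(total, mean, var)
```

Up to rounding, the probabilities sum to one, the mean is $np = 3$, and the variance is $npq = 2.1$.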

Suppose we have a random variable, $X \sim \text{Geometric}(p)$. A geometric random variable corresponds to the number of $\text{Bern}(p)$ trials until a success occurs. For example, three failures followed by a success ("FFFS") implies that $X = 4$. A geometric random variable has the following pmf:

$f(x) = q^{x-1}p, \quad x = 1,2,...$

We can see that this equation directly corresponds to the probability of $x - 1$ failures, each with probability $q$ followed by one success, with probability $p$.

Additionally, $X$ has the following properties:

$E[X] = \frac{1}{p}$

$\text{Var}(X) = \frac{q}{p^2}$

$M_X(t) = \frac{pe^t}{1-qe^t}$

The geometric distribution generalizes to the negative binomial distribution. Suppose that we are interested in the number of trials it takes to see $r$ successes. We can add $r$ iid $\text{Geom}(p)$ random variables to get the random variable $Y \sim \text{NegBin}(r, p)$. For example, if $r = 3$, then a run of "FFFSSFS" implies that $Y = 7$, where $Y \sim \text{NegBin}(3, p)$. $Y$ has the following pmf:

$f(y) = \binom{y-1}{r-1}q^{y-r}p^{r}, \quad y = r, r + 1,...$

Additionally, $Y$ has the following properties:

$E[Y] = \frac{r}{p}$

$\text{Var}(Y) = \frac{qr}{p^2}$

Note that the variance and the expected value are equal to $r$ times the variance and the expected value of the geometric random variable. This relationship makes sense: a negative binomial random variable is the sum of $r$ geometric random variables.
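A quick numerical check of the negative binomial pmf follows. This is a sketch under stated assumptions: $r = 3$ and $p = 0.5$ are illustrative values, and the infinite sums are truncated at $y = 200$, where the remaining tail is negligible for these parameters.

```python
from math import comb


def negbin_pmf(y, r, p):
    """P(Y = y) for Y ~ NegBin(r, p): the r-th success occurs on trial y."""
    return comb(y - 1, r - 1) * (1 - p) ** (y - r) * p**r


r, p = 3, 0.5  # hypothetical parameters
support = range(r, 201)  # truncated support; the tail beyond 200 is negligible here

total = sum(negbin_pmf(y, r, p) for y in support)
mean = sum(y * negbin_pmf(y, r, p) for y in support)
var = sum((y - mean) ** 2 * negbin_pmf(y, r, p) for y in support)

# Probability that the third success lands on trial 7, as in "FFFSSFS".
p_seven = negbin_pmf(7, r, p)
print(total, mean, var, p_seven)
```

For these parameters the mean should be near $r/p = 6$, the variance near $qr/p^2 = 6$, and $P(Y = 7) = \binom{6}{2}(0.5)^7 = 15/128$.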

A **counting process**, $N(t)$ keeps track of the number of "arrivals" observed between time $0$ and time $t$. For example, if $7$ people show up to a store by time $t=3$, then $N(3) = 7$. A **Poisson process** is a counting process that satisfies the following criteria.

- Arrivals must occur one-at-a-time at a rate, $\lambda$. For example, $\lambda = 4 / \text{hr}$ means that, on average, arrivals occur every fifteen minutes, yet no two arrivals coincide.
- Disjoint time increments are independent. Suppose we are looking at arrivals on the intervals 12 am - 2 am and 5 am - 10 am. Independent increments means that the arrivals in the first interval don't impact arrivals in the second.
- Increments are stationary; in other words, the distribution of the number of arrivals in the interval $[s, s + t]$ depends only on the interval's length, $t$. It does not depend on where the interval starts, $s$.

A random variable $X \sim \text{Pois}(\lambda)$ describes the number of arrivals that a Poisson process experiences in one time unit, i.e., $N(1)$. $X$ has the following pmf:

$f(x) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0,1,...$

Additionally, $X$ has the following properties:

$E[X] = \text{Var}(X) = \lambda$

$M_X(t) = e^{\lambda(e^t - 1)}$
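The claim that the mean and variance both equal $\lambda$ can be verified numerically from the pmf. In this sketch, $\lambda = 4$ is an illustrative rate, and the sums are truncated at $x = 60$, beyond which the terms are negligible for this $\lambda$.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Pois(lam)."""
    return exp(-lam) * lam**x / factorial(x)


lam = 4.0            # hypothetical arrival rate per time unit
support = range(61)  # truncated support; the tail is negligible for lam = 4

total = sum(poisson_pmf(x, lam) for x in support)
mean = sum(x * poisson_pmf(x, lam) for x in support)
var = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in support)
print(total, mean, var)
```

Up to truncation and rounding, the probabilities sum to one and both the mean and the variance come out to $4$.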

A uniform random variable, $X \sim \text{Uniform}(a,b)$, has the following pdf:

$f(x) = \frac{1}{b - a}, \quad a \leq x \leq b$

Additionally, $X$ has the following properties:

$E[X] = \frac{a + b}{2}$

$\text{Var}(X) = \frac{(b - a)^2}{12}$

$M_X(t) = \frac{e^{tb} - e^{ta}}{tb - ta}$

A continuous, exponential random variable $X \sim \text{Exponential}(\lambda)$ has the following pdf:

$f(x) = \lambda e^{-\lambda x}, \quad x \geq 0$

Additionally, $X$ has the following properties:

$E[X] = \frac{1}{\lambda}$

$\text{Var}(X) = \frac{1}{\lambda^2}$

$M_X(t) = \frac{\lambda}{\lambda - t}, \quad t < \lambda$

The exponential distribution also has a *memoryless property*, which means that for $s, t > 0$, $P(X > s + t | X > s) = P(X > t)$. For example, if we have a light bulb, and we know that it has lived for $s$ time units, the conditional probability that it will live for more than $s + t$ time units (an additional $t$ units) is the unconditional probability that a new bulb will live for more than $t$ time units. In other words, there is no "memory" of the prior $s$ time units.

Let's look at a concrete example. If $X \sim \text{Exp}(1/100)$, then:

$P(X > 200 | X > 50) = P(X > 150) = e^{-\lambda t} = e^{-150/100}$
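The memoryless property can be confirmed directly from the survival function $P(X > t) = 1 - F(t) = e^{-\lambda t}$. The helper name `exp_survival` in this sketch is a hypothetical choice.

```python
import math


def exp_survival(t, lam):
    """P(X > t) for X ~ Exp(lam)."""
    return math.exp(-lam * t)


lam = 1 / 100

# P(X > 200 | X > 50) = P(X > 200) / P(X > 50), since {X > 200} implies {X > 50}.
conditional = exp_survival(200, lam) / exp_survival(50, lam)
unconditional = exp_survival(150, lam)
print(conditional, unconditional)
```

Both quantities equal $e^{-1.5} \approx 0.223$, as the property predicts.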