Simulation

Input Analysis

89 minute read



Notice a tyop typo? Please submit an issue or open a PR.

Input Analysis

Introduction

The goal of input analysis is to ensure that the random variables we use in our simulations adequately approximate the real-world phenomena we are modeling. We might use random variables to model interarrival times, service times, breakdown times, or maintenance times, among other things. These variables don't come out of thin air; we have to specify them appropriately.

Why Worry? GIGO!

If we specify our random variables improperly, we can end up with a GIGO simulation: garbage-in-garbage-out. This situation can ruin the entire model and invalidate any results we might obtain.

GIGO Example

Let's consider a single-server queueing system with constant service times of ten minutes. Let's incorrectly assume that we have constant interarrival times of twelve minutes. We should expect to never have a line under this assumption because the service times are always shorter than the interarrival times.

However, suppose that, in reality, we have Exp(12) interarrival times. In this case, the simulation never sees the line that actually occurs. In fact, the line might get quite long, and the simulation has no way of surfacing that fact.

So What To Do?

Here's the high-level game plan for performing input analysis. First, we'll collect some data for analysis for any random variable of interest. Next, we'll determine, or at least estimate, the underlying distribution along with associated parameters: for example, Nor(30,8). We will then conduct a formal statistical test to see if the distribution we chose is "approximately" correct. If we fail the test, then our guess is wrong, and we have to go back to the drawing board.

Identifying Distributions

In this lesson, we will look at some high-level methods for examining data and guessing the underlying distribution.

Three Little Bears

We can always present data in the form of a histogram. Suppose we collect one hundred observations, and we plot the following histogram. We can't really determine the underlying distribution because we don't have enough cells.

Suppose we plot the following histogram. The resolution on this plot is too granular. We can potentially see the underlying distribution, but we risk missing the forest for the trees.

If we take the following histogram, we get a much clearer picture of the underlying distribution.

It turns out that, if we take enough observations, the histogram will eventually converge to the true pdf/pmf of the random variable we are trying to model, according to the Glivenko–Cantelli theorem.

Stem-and-Leaf

If we turn the histogram on its side and add some numerical information, we get a stem-and-leaf diagram, where each stem represents the common root shared among a collection of observations, and each leaf represents the observation itself.

Which Distribution?

When looking at empirical data, what questions might we ask to arrive at the underlying distribution? For example, can we at least tell if the observations are discrete or continuous?

We might want to ask whether the distribution is univariate or multivariate. We might be interested in someone's weight, but perhaps we need to generate height and weight observations simultaneously.

Additionally, we might need to check how much data we have available. Certain distributions lend themselves more easily to smaller samples of data.

Furthermore, we might need to communicate with experts regarding the nature of the data. For example, we might want to know if the arrival rate changes at our facility as the day progresses. While we might observe the rate directly, we might want to ask the floor supervisor what to expect beforehand.

Finally, what happens when we don't have much or any data? What if the system we want to model doesn't exist yet? How might we guess a good distribution?

Which Distribution, II?

Let's suppose we know that we have a discrete distribution. For example, we might realize that we only see a finite number of observations during our data collection process. How do we determine which discrete distribution to use?

If we want to model success and failures, we might use a Bernoulli random variable and estimate pp. If we want to look at the number of successes in nn trials, we need to consider using a binomial random variable.

Perhaps we want to understand how many trials we need until we get our first success. In that case, we need to look at a geometric random variable. Alternatively, if we want to know how many trials we need until the nnth success, we need a negative binomial random variable.

We can use the Poisson(Ξ»\lambda) distribution to count the number of arrivals over time, assuming that the arrival process satisfies certain elementary assumptions.

If we honestly don't have a good model for the discrete data, perhaps we can use an empirical or sample distribution.

Which Distribution, III?

What if the distribution is continuous?

We might consider the uniform distribution if all we know about the data is the minimum and maximum possible values. If we know the most likely value as well, we might use the triangular distribution.

If we are looking at interarrival times from a Poisson process, then we know we should be looking at the Exp(Ξ»\lambda) distribution. If the process is nonhomogeneous, we might have to do more work, but the exponential distribution is a good starting point.

We might consider the normal distribution if we are looking at heights, weights, or IQs. Furthermore, if we are looking at sample means or sums, the normal distribution is a good choice because of the central limit theorem.

We can use the Beta distribution, which generalizes the uniform distribution, to specify bounded data. We might use the gamma, Weibull, Gumbel, or lognormal distribution if we are dealing with reliability data.

When in doubt, we can use the empirical distribution, which is based solely on the sample.

Game Plan

As we said, we will choose a "reasonable" distribution, and then we'll perform a hypothesis test to make sure that our choice is not too ridiculous.

For example, suppose we hypothesize that some data is normal. This data should fall approximately on a straight line when we graph it on a normal probability plot, and it should look normal when we graph it on a standard plot. At the very least, it should also pass a goodness-of-fit test for normality, of which there are several.

Unbiased Point Estimation

It's not enough to decide that some sequence of observations is normal; we still have to estimate ΞΌ\mu and Οƒ2\sigma^2. In the next few lessons, we will look at point estimation, which lets us understand how to estimate these unknown parameters. We'll cover the concept of unbiased estimation first.

Statistic Definition

A statistic is a function of the observations X1,...,XnX_1,...,X_n that is not explicitly dependent on any unknown parameters. For example, the sample mean, Xˉ\bar{X}, and the sample variance, S2S^2, are two statistics:

XΛ‰=1nβˆ‘i=1nXiS2=1nβˆ’1βˆ‘i=1x(Xiβˆ’XΛ‰)2\begin{alignedat}{1} \bar{X} = &\frac{1}{n} \sum_{i=1}^n X_i \\[2ex] S^2 = &\frac{1}{n-1} \sum_{i=1}^x(X_i - \bar{X})^2 \end{alignedat}

Statistics are random variables. In other words, if we take two different samples, we should expect to see two different values for a given statistic.

We usually use statistics to estimate some unknown parameter from the underlying probability distribution of the XiX_i's. For instance, we use the sample mean, XΛ‰\bar{X}, to estimate the true mean, ΞΌ\mu, of the underlying distribution, which we won't normally know. If ΞΌ\mu is the true mean, then we can take a bunch of samples and use XΛ‰\bar{X} to estimate ΞΌ\mu. We know, via the law of large numbers that, as nβ†’βˆžn \to \infty, XΛ‰β†’ΞΌ\bar{X} \to \mu.

Point Estimator

Let's suppose that we have a collection of iid random variables, X1,...,XnX_1,...,X_n. Let T(X)≑T(X1,...,Xn)T(\bold{X}) \equiv T(X_1,...,X_n) be a function that we can compute based only on the observations. Therefore, T(X)T(\bold{X}) is a statistic. If we use T(X)T(\bold{X}) to estimate some unknown parameter ΞΈ\theta, then T(X)T(\bold{X}) is known as a point estimator for ΞΈ\theta.

For example, Xˉ\bar{X} is usually a point estimator for the true mean, μ=E[Xi]\mu = E[X_i], and S2S^2 is often a point estimator for the true variance, σ2=Var(X)\sigma^2 = \text{Var}(X).

T(X)T(\bold{X}) should have specific properties:

  • Its expected value should equal the parameter it's trying to estimate. This property is known as unbiasedness.
  • It should have a low variance. It doesn't do us any good if T(X)T(\bold{X}) is bouncing around depending on the sample we take.

Unbiasedness

We say that T(X)T(\bold{X}) is unbiased for ΞΈ\theta if E[T(X)]=ΞΈE[T(\bold{X})] = \theta. For example, suppose that random variables, X1,...,XnX_1,...,X_n are iid anything with mean ΞΌ\mu. Then:

E[XΛ‰]=E[1nβˆ‘i=1nXi]=1nβˆ‘i=1nE[Xi]=nE[Xi]n=E[Xi]=ΞΌ\begin{alignedat}{1} E[\bar{X}] & = E\left[\frac{1}{n}\sum_{i=1}^nX_i\right] \\[3ex] & = \frac{1}{n}\sum_{i=1}^nE[X_i] \\[3ex] & = \frac{nE[X_i]}{n} \\[2ex] & = E[X_i] = \mu \end{alignedat}

Since E[Xˉ]=μE[\bar{X}] = \mu, Xˉ\bar{X} is always unbiased for μ\mu. That's why we call it the sample mean.

Similarly, suppose we have random variables, X1,...,XnX_1,...,X_n which are iid Exp(λ)\text{Exp}(\lambda). Then, Xˉ\bar{X} is unbiased for μ=E[Xi]=1/λ\mu = E[X_i] = 1 / \lambda. Even though λ\lambda is unknown, we know that Xˉ\bar{X} is a good estimator for 1/λ1/ \lambda.

Be careful, though. Just because Xˉ\bar{X} is unbiased for 1/λ1 / \lambda does not mean that 1/Xˉ1 / \bar{X} is unbiased for λ\lambda: E[1/Xˉ]≠1/E[Xˉ]=λE[1/\bar{X}] \neq 1 /E[\bar{X}] = \lambda. In fact, 1/Xˉ1/\bar{X} is biased for λ\lambda in this exponential case.

Here's another example. Suppose that random variables, X1,...,XnX_1,...,X_n are iid anything with mean ΞΌ\mu and variance Οƒ2\sigma^2. Then:

E[S2]=E[βˆ‘i=1n(Xiβˆ’XΛ‰)2nβˆ’1]=Var(Xi)=Οƒ2E[S^2] = E\left[\frac{\sum_{i=1}^n(X_i - \bar{X})^2}{n - 1}\right] = \text{Var}(X_i) = \sigma^2

Since E[S2]=Οƒ2E[S^2] = \sigma^2, S2S^2 is always unbiased for Οƒ2\sigma^2. That's why we called it the sample variance.

For example, suppose random variables X1,...,XnX_1,...,X_n are iid Exp(Ξ»)\text{Exp}(\lambda). Then S2S^2 is unbiased for Var(Xi)=1/Ξ»2\text{Var}(X_i) = 1 / \lambda^2.

Let's give a proof for the unbiasedness of S2S^2 as an estimate for Οƒ2\sigma^2. First, let's convert S2S^2 into a better form:

S2=βˆ‘i=1n(Xiβˆ’XΛ‰)2nβˆ’1=βˆ‘i=1nXi2βˆ’2XiXΛ‰+XΛ‰2nβˆ’1=1nβˆ’1(βˆ‘i=1nXi2βˆ’2XiXΛ‰+XΛ‰2)=1nβˆ’1(βˆ‘i=1nXi2βˆ’βˆ‘i=1n2XiXΛ‰+βˆ‘i=1nXΛ‰2)\begin{alignedat}{1} S^2 & = \frac{\sum_{i=1}^n(X_i - \bar{X})^2}{n - 1} \\[3ex] & = \frac{\sum_{i=1}^n X_i^2 -2X_i\bar{X} + \bar{X}^2}{n - 1} \\[3ex] & = \frac{1}{n-1}\left(\sum_{i=1}^n X_i^2 -2X_i\bar{X} + \bar{X}^2\right) \\[3ex] & = \frac{1}{n-1}\left(\sum_{i=1}^n X_i^2 - \sum_{i=1}^n 2X_i\bar{X} + \sum_{i=1}^n \bar{X}^2\right) \\[3ex] \end{alignedat}

Let's rearrange the middle sum:

βˆ‘i=1n2XiXΛ‰=2XΛ‰βˆ‘i=1nXi \sum_{i=1}^n 2X_i\bar{X} = 2\bar{X}\sum_{i=1}^n X_i

Remember that XΛ‰\bar{X} represents the average of all the XiX_i's: βˆ‘Xi/n\sum X_i / n. Thus, if we just sum the XiX_i's and don't divide by nn, we have a quantity equal to nXΛ‰n\bar{X}:

2XΛ‰βˆ‘i=1nXi=2XΛ‰nXΛ‰=2nXΛ‰22\bar{X}\sum_{i=1}^n X_i = 2\bar{X}n\bar{X} = 2n\bar{X}^2

Now, back to action:

S2=1nβˆ’1(βˆ‘i=1nXi2βˆ’βˆ‘i=1n2XiXΛ‰+βˆ‘i=1nXΛ‰2)=1nβˆ’1(βˆ‘i=1nXi2βˆ’2nXΛ‰2+nXΛ‰2)=1nβˆ’1(βˆ‘i=1nXi2βˆ’nXΛ‰2)=βˆ‘i=1nXi2βˆ’nXΛ‰2nβˆ’1\begin{alignedat}{1} S^2 & = \frac{1}{n-1}\left(\sum_{i=1}^n X_i^2 - \sum_{i=1}^n 2X_i\bar{X} + \sum_{i=1}^n \bar{X}^2\right) \\[3ex] & = \frac{1}{n-1}\left(\sum_{i=1}^n X_i^2 - 2n\bar{X}^2 + n\bar{X}^2\right) \\[3ex] & = \frac{1}{n-1}\left(\sum_{i=1}^n X_i^2 - n\bar{X}^2 \right) \\[3ex] & = \frac{\sum_{i=1}^n X_i^2 - n\bar{X}^2}{n-1} \\[3ex] \end{alignedat}

Let's take the expected value:

E[S2]=βˆ‘i=1nE[Xi2]βˆ’nE[XΛ‰2]nβˆ’1\begin{alignedat}{1} E[S^2] & = \frac{\sum_{i=1}^n E[X_i^2] - nE[\bar{X}^2]}{n-1} \end{alignedat}

Note that E[Xi2]E[X_i^2] is the same for all XiX_i, so the sum is just nE[X12]nE[X_1^2]:

E[S2]=nE[X12]βˆ’nE[XΛ‰2]nβˆ’1=1nβˆ’1(E[X12]βˆ’E[XΛ‰2])\begin{alignedat}{1} E[S^2] & = \frac{n E[X_1^2] - nE[\bar{X}^2]}{n-1} \\[2ex] & = \frac{1}{n-1} \left(E[X_1^2] - E[\bar{X}^2]\right) \end{alignedat}

We know that Var(X)=E[X2]βˆ’(E[X])2\text{Var}(X) = E[X^2] - (E[X])^2, so E[X2]=Var(X)+(E[X])2E[X^2] = \text{Var}(X) + (E[X])^2. Therefore:

E[S2]=1nβˆ’1(Var(X1)+(E[X1]2)βˆ’Var(XΛ‰)βˆ’(E[XΛ‰]2))\begin{alignedat}{1} E[S^2] & = \frac{1}{n-1} \left(\text{Var}(X_1) + (E[X_1]^2) - \text{Var}(\bar{X}) - (E[\bar{X}]^2)\right) \end{alignedat}

Remember that E[X1]=E[Xˉ]E[X_1] = E[\bar{X}], so:

E[S2]=1nβˆ’1(Var(X1)βˆ’Var(XΛ‰))\begin{alignedat}{1} E[S^2] & = \frac{1}{n-1} \left(\text{Var}(X_1) - \text{Var}(\bar{X}) \right) \\[3ex] \end{alignedat}

Furthermore, remember that Var(Xˉ)=Var(X1)/n=σ2/n\text{Var}(\bar{X}) = \text{Var}(X_1) /n = \sigma_2 / n. Therefore:

=1nβˆ’1(Οƒ2βˆ’Οƒ2/n))=nΟƒ2nβˆ’1βˆ’Οƒ2nβˆ’1=nΟƒ2βˆ’Οƒ2nβˆ’1=Οƒ2(nβˆ’1)nβˆ’1=Οƒ2\begin{alignedat}{1} & = \frac{1}{n-1} \left(\sigma^2 - \sigma^2/n) \right) \\[3ex] & = \frac{n\sigma^2}{n-1} - \frac{\sigma^2}{n-1} \\[3ex] & = \frac{n\sigma^2- \sigma^2}{n-1} = \frac{\sigma^2(n-1)}{n-1} = \sigma^2 \end{alignedat}

Unfortunately, while S2S^2 is unbiased for the variance Οƒ2\sigma^2, SS is biased for the standard deviation Οƒ\sigma.

Mean Squared Error

In this lesson, we'll look at mean squared error, a performance measure that evaluates an estimator by combining its bias and variance.

Bias and Variance

We want to choose an estimator with the following properties:

  • Low bias (defined as the difference between the estimator's expected value and the true parameter value)
  • Low variance

Furthermore, we want the estimator to have both of these properties simultaneously. If the estimator has low bias but high variance, then its estimates are meaninglessly noisy. Its average estimate is correct, but any individual estimate may be way off the mark. On the other hand, an estimator with low variance but high bias is very confident about the wrong answer.

Example

Suppose that we have nn random variables, X1,...,Xn∼iidUnif(0,θ)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Unif}(0,\theta). We know that our observations have a lower bound of 00, but we don't know the value of the upper bound, θ\theta. As is often the case, we sample many observations from the distribution and use that sample to estimate the unknown parameter.

Consider two estimators for ΞΈ\theta:

Y1≑2XΛ‰Y2≑n+1nmax⁑1≀i≀XiXi\begin{alignedat}{1} Y_1 &\equiv 2\bar{X} \\[2ex] Y_2 &\equiv \frac{n+1}{n} \max_{1 \leq i \leq X_i}X_i \end{alignedat}

Let's look at the first estimator. We know that E[Y1]=2E[Xˉ]E[Y_1] = 2E[\bar{X}], by definition. Similarly, we know that 2E[Xˉ]=2E[Xi]2E[\bar{X}] = 2E[X_i], since Xˉ\bar{X} is always unbiased for the mean. Recall how we compute the expected value for a uniform random variable:

E[A]=(bβˆ’a)/2,A∼Unif(a,b)E[A] = (b - a) / 2, \quad A \sim \text{Unif}(a,b)

Therefore:

2E[Xi]=2(ΞΈβˆ’02)=ΞΈ=E[Y1]2E[X_i] = 2\left(\frac{\theta - 0}{2}\right) = \theta = E[Y_1]

As we can see, Y1Y_1 is unbiased for ΞΈ\theta.

It's also the case that Y2Y_2 is unbiased, but it takes more work to demonstrate. As a first step, take the cdf of the maximum of the XiX_i's, M≑max⁑iXiM \equiv \max_iX_i. Here's what P(M≀y)P(M \leq y) looks like:

P(M≀y)=P(X1≀yΒ andΒ X2≀yΒ and … andΒ Xn≀y)P(M \leq y) = P(X_1 \leq y \text{ and } X_2 \leq y \text{ and } \dots \text{ and } X_n \leq y)

If M≀yM \leq y, and MM is the maximum, then P(M≀y)P(M \leq y) is the probability that all the XiX_i's are less than yy. Since the XiX_i's are independent, we can take the product of the individual probabilities:

P(M≀y)=∏i=inP(Xi≀y)=[P(X1≀y)]nP(M \leq y) = \prod_{i = i}^n P(X_i \leq y) = [P(X_1 \leq y)]^n

Now, we know, by definition, that the cdf is the integral of the pdf. Remember that the pdf for a uniform distribution, Unif(a,b)\text{Unif}(a,b), is:

f(x)=1bβˆ’a,a<x<bf(x) = \frac{1}{b-a}, a < x < b

Let's rewrite P(M≀y)P(M \leq y):

P(M≀y)=[∫0yfX1(x)dx]n=[∫0y1ΞΈdx]n=(yΞΈ)n\begin{alignedat}{1} P(M \leq y) &= \left[\int_0^yf_{X_1}(x)dx\right]^n \\[2ex] & = \left[\int_0^y\frac{1}{\theta}dx\right]^n \\[2ex] & = \left(\frac{y}{\theta}\right)^n \end{alignedat}

Again, we know that the pdf is the derivative of the cdf, so:

fM(y)=FM(y)dy=ddyynΞΈn=nynβˆ’1ΞΈn\begin{alignedat}{1} f_M(y) & = F_M(y)dy \\[2ex] & = \frac{d}{dy} \frac{y^n}{\theta^n} \\[2ex] & = \frac{ny^{n-1}}{\theta^n} \end{alignedat}

With the pdf in hand, we can get the expected value of MM:

E[M]=∫0θyfM(y)dy=∫0θnynθndy=nyn+1(n+1)θn∣0θ=nθn+1(n+1)θn=nθn+1\begin{alignedat}{1} E[M] & = \int_0^\theta yf_M(y)dy \\[2ex] & = \int_0^\theta \frac{ny^n}{\theta^n} dy \\[2ex] & = \frac{ny^{n+1}}{(n+1)\theta^n}\Big|_0^\theta \\[2ex] & = \frac{n\theta^{n+1}}{(n+1)\theta^n} \\[2ex] & = \frac{n\theta}{n+1} \\[2ex] \end{alignedat}

Note that E[M]β‰ ΞΈE[M] \neq \theta, so MM is not an unbiased estimator for ΞΈ\theta. However, remember how we defined Y2Y_2:

Y2≑n+1nmax⁑1≀i≀XiXiY_2 \equiv \frac{n+1}{n} \max_{1 \leq i \leq X_i}X_i

Thus:

E[Y2]=n+1nE[M]=n+1n(nΞΈn+1)=ΞΈ\begin{alignedat}{1} E[Y_2] & = \frac{n+1}{n}E[M] \\[2ex] & = \frac{n+1}{n} \left(\frac{n\theta}{n+1}\right) \\[2ex] & = \theta \end{alignedat}

Therefore, Y2Y_2 is unbiased for ΞΈ\theta.

Both indicators are unbiased, so which is better? Let's compare variances now. After similar algebra, we see:

Var(Y1)=ΞΈ23n,Var(Y2)=ΞΈ2n(n+2)\text{Var}(Y_1) = \frac{\theta^2}{3n}, \quad \text{Var}(Y_2) = \frac{\theta^2}{n(n+2)}

Since the variance of Y2Y_2 involves dividing by n2n^2, while the variance of Y1Y_1 only divides by nn, Y2Y_2 has a much lower variance than Y1Y_1 and is, therefore, the better indicator.

Bias and Mean Squared Error

The bias of an estimator, T[X]T[\bold{X}], is the difference between the estimator's expected value and the value of the parameter its trying to estimate: Bias(T)≑E[T]βˆ’ΞΈ\text{Bias}(T) \equiv E[T] - \theta. When E[T]=ΞΈE[T] = \theta, then the bias is 00 and the estimator is unbiased.

The mean squared error of an estimator, T[X]T[\bold{X}], the expected value of the squared deviation of the estimator from the parameter: MSE(T)≑E[(Tβˆ’ΞΈ)2]\text{MSE}(T) \equiv E[(T-\theta)^2].

Remember the equation for variance:

Var(X)=E[X2]βˆ’(E[X])2\text{Var}(X) = E[X^2] - (E[X])^2

Using this equation, we can rewrite MSE(T)\text{MSE}(T):

MSE(T)=E[(Tβˆ’ΞΈ)2]=Var(T)+(E[T]βˆ’ΞΈ)2=Var(T)+Bias2\begin{alignedat}{1} \text{MSE}(T) & = E[(T-\theta)^2] \\[2ex] & = \text{Var}(T) + (E[T] - \theta)^2 \\[2ex] & = \text{Var}(T) + \text{Bias}^2 \\[2ex] \end{alignedat}

Usually, we use mean squared error to evaluate estimators. As a result, when selecting between multiple estimators, we might not choose the unbiased estimator, so long as that estimator's MSE is the lowest among the options.

Relative Efficiency

The relative efficiency of one estimator, T1T_1, to another, T2T_2, is the ratio of the mean squared errors: MSE(T1)/MSE(T2)\text{MSE}(T_1) / \text{MSE}(T_2). If the relative efficiency is less than one, we want T1T_1; otherwise, we want T2T_2.

Let's compute the relative efficiency of the two estimators we used in the previous example:

Y1≑2XΛ‰Y2≑n+1nmax⁑1≀i≀XiXi\begin{alignedat}{1} Y_1 &\equiv 2\bar{X} \\[2ex] Y_2 &\equiv \frac{n+1}{n} \max_{1 \leq i \leq X_i}X_i \end{alignedat}

Remember that both estimators are unbiased, so the bias is zero by definition. As a result, the mean squared errors of the two estimators is determined solely by the variance:

MSE(Y1)=Var(Y1)=ΞΈ23n,MSE(Y2)=Var(Y2)=ΞΈ2n(n+2)\text{MSE}(Y_1) = \text{Var}(Y_1) = \frac{\theta^2}{3n}, \quad \text{MSE}(Y_2) = \text{Var}(Y_2) = \frac{\theta^2}{n(n+2)}

Let's calculate the relative efficiency:

e(Y1,Y2)=ΞΈ23nΞΈ2n(n+2)=ΞΈ2βˆ—n(n+2)3nΞΈ2=n(n+2)3n=n+23\begin{alignedat}{1} e(Y_1, Y_2) & = \frac{\frac{\theta^2}{3n}}{\frac{\theta^2}{n(n+2)}} \\[3ex] & = \frac{\theta^2 * n(n+2)}{3n\theta^2} \\[2ex] & = \frac{n(n+2)}{3n} \\[2ex] & = \frac{n+2}{3} \\[2ex] \end{alignedat}

The relative efficiency is greater than one for all n>1n > 1, so Y2Y_2 is the better estimator just about all the time.

Maximum Likelihood Estimation

In this lesson, we are going to talk about maximum likelihood estimation, which is perhaps the most important point estimation method. It's a very flexible technique that many software packages use to help estimate parameters from various distributions.

Likelihood Function and Maximum Likelihood Estimator

Consider an iid random sample, X1,...,XnX_1,...,X_n, where each XiX_i has pdf/pmf f(x)f(x). Additionally, suppose that ΞΈ\theta is some unknown parameter from XiX_i that we would like to estimate. We can define a likelihood function, L(ΞΈ)L(\theta) as:

L(ΞΈ)β‰‘βˆi=1nf(xi)L(\theta) \equiv \prod_{i=1}^n f(x_i)

The maximum likelihood estimator (MLE) of ΞΈ\theta is the value of ΞΈ\theta that maximizes L(ΞΈ)L(\theta). The MLE is a function of the XiX_i's and is itself a random variable.

Exponential Example

Consider a random sample, X1,...,Xn∼iidExp(λ)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Exp}(\lambda). Find the MLE for λ\lambda. Note that, in this case, λ\lambda, is taking the place of the abstract parameter, θ\theta. Now:

L(Ξ»)β‰‘βˆi=1nf(xi)L(\lambda) \equiv \prod_{i=1}^n f(x_i)

We know that exponential random variables have the following pdf:

f(x,Ξ»)=Ξ»eβˆ’Ξ»xf(x, \lambda) = \lambda e^{-\lambda x}

Therefore:

L(Ξ»)β‰‘βˆi=1nΞ»eβˆ’Ξ»xiL(\lambda) \equiv \prod_{i=1}^n \lambda e^{-\lambda x_i}

Let's expand the product:

L(Ξ»)=(Ξ»eβˆ’Ξ»x1)βˆ—(Ξ»eβˆ’Ξ»x2)βˆ—β‹―βˆ—(Ξ»eβˆ’Ξ»xn)L(\lambda) = (\lambda e^{-\lambda x_1}) * (\lambda e^{-\lambda x_2}) * \dots * (\lambda e^{-\lambda x_n})

We can pull out a Ξ»n\lambda^n term:

L(Ξ»)=Ξ»nβˆ—(eβˆ’Ξ»x1)βˆ—(eβˆ’Ξ»x2)βˆ—β‹―βˆ—(eβˆ’Ξ»xn)L(\lambda) = \lambda^n * (e^{-\lambda x_1}) * (e^{-\lambda x_2}) * \dots * (e^{-\lambda x_n})

Remember what happens to exponents when we multiply bases:

axβˆ—ay=ax+ya^x * a^y = a^{x+y}

Let's apply this to our product (and we can swap in exp⁑\exp notation to make things easier to read):

L(Ξ»)=Ξ»nexp⁑[βˆ’Ξ»βˆ‘i=1nxi]L(\lambda) = \lambda^n \exp\left[-\lambda \sum_{i=1}^nx_i\right]

Now, we need to maximize L(λ)L(\lambda) with respect to λ\lambda. We could take the derivative of L(λ)L(\lambda), but we can use a trick! Since the natural log function is one-to-one, the λ\lambda that maximizes L(λ)L(\lambda) also maximizes ln⁑(L(λ))\ln(L(\lambda)). Let's take the natural log of L(λ)L(\lambda):

ln⁑(L(Ξ»))=ln⁑(Ξ»nexp⁑[βˆ’Ξ»βˆ‘i=1nxi])\ln(L(\lambda)) = \ln\left(\lambda^n \exp\left[-\lambda \sum_{i=1}^nx_i\right]\right)

Let's remember three different rules here:

ln⁑(ab)=ln⁑(a)+ln⁑(b),ln⁑(ab)=bln⁑(a),ln⁑(ex)=x\ln(ab) = \ln(a) + \ln(b), \quad \ln(a^b) = b\ln(a), \quad \ln(e^x) = x

Therefore:

ln⁑(L(Ξ»))=ln⁑(Ξ»n)+ln⁑(exp⁑[βˆ’Ξ»βˆ‘i=1nxi])=nln⁑(Ξ»)+ln⁑(exp⁑[βˆ’Ξ»βˆ‘i=1nxi])=nln⁑(Ξ»)βˆ’Ξ»βˆ‘i=1nxi\begin{alignedat}{1} \ln(L(\lambda)) & = \ln(\lambda^n) +\ln\left(\exp\left[-\lambda \sum_{i=1}^nx_i\right]\right) \\[2ex] & = n\ln(\lambda) +\ln\left(\exp\left[-\lambda \sum_{i=1}^nx_i\right]\right) \\[2ex] & = n\ln(\lambda) -\lambda \sum_{i=1}^nx_i \\[2ex] \end{alignedat}

Now, let's take the derivative:

ddΞ»ln⁑(L(Ξ»))=ddΞ»(nln⁑(Ξ»)βˆ’Ξ»βˆ‘i=1nxi)=nΞ»βˆ’βˆ‘i=1nxi\begin{alignedat}{1} \frac{d}{d\lambda}\ln(L(\lambda)) & = \frac{d}{d\lambda}\left(n\ln(\lambda) -\lambda \sum_{i=1}^nx_i\right) \\[2ex] & = \frac{n}{\lambda} - \sum_{i=1}^nx_i \\[2ex] \end{alignedat}

Finally, we set the derivative equal to 00 and solve for Ξ»\lambda:

0≑nΞ»βˆ’βˆ‘i=1nxinΞ»=βˆ‘i=1nxiΞ»=nβˆ‘i=1nxiΞ»=1XΛ‰\begin{alignedat}{1} 0 \equiv \frac{n}{\lambda} & - \sum_{i=1}^nx_i \\[2ex] \frac{n}{\lambda} & = \sum_{i=1}^nx_i \\[3ex] \lambda & = \frac{n}{\sum_{i=1}^nx_i} \\[2ex] \lambda & = \frac{1}{\bar{X}} \\[2ex] \end{alignedat}

Thus, the maximum likelihood estimator for λ\lambda is 1/Xˉ1 / \bar{X}, which makes a lot of sense. The mean of the exponential distribution is 1/λ1 / \lambda, and we usually estimate that mean by Xˉ\bar{X}. Since Xˉ\bar{X} is a good estimator for λ\lambda, it stands to reason that a good estimator for λ\lambda is 1/Xˉ1 / \bar{X}.

Conventionally, we put a "hat" over the Ξ»\lambda that maximizes the likelihood function to indicate that it is the MLE. Such notation looks like this: Ξ»^\widehat{\lambda}.

Note that we went from "little x's", xix_i, to "big x", Xˉ\bar{X}, in the equation. We do this to indicate that λ^\widehat{\lambda} is a random variable.

Just to be careful, we probably should have performed a second-derivative test on the function, ln⁑(L(λ))\ln(L(\lambda)), to ensure that we found a maximum likelihood estimator and not a minimum likelihood estimator.

Bernoulli Example

Let's look at a discrete example. Suppose we have X1,...,Xn∼iidBern(p)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Bern}(p). Let's find the MLE for pp. We might remember that the expected value of Bern(pp) random variable is pp, so we shouldn't be surprised if XΛ‰\bar X is our MLE.

Let's remember the values that XiX_i can take:

Xi={1w.p.Β p0w.p.Β 1βˆ’pX_i = \begin{cases} 1 & \text{w.p. } p \\ 0 & \text{w.p. } 1-p \end{cases}

Therefore, we can write the pmf for a Bern(pp) random variable as follows:

f(x)=px(1βˆ’p)1βˆ’x,x=0,1f(x) = p^x(1-p)^{1-x}, \quad x = 0, 1

Now, let's calculate L(p)L(p). First:

L(p)=∏i=1nf(xi)=∏pxi(1βˆ’p)1βˆ’xiL(p) = \prod_{i=1}^n f(x_i) = \prod p^{x_i}(1-p)^{1-x_i}

Remember that axβˆ—ay=ax+ya^x * a^y = a^{x+y}. So:

∏pxi(1βˆ’p)1βˆ’xi=pβˆ‘i=1nxi(1βˆ’p)nβˆ’βˆ‘i=1nxi\prod p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^n x_i}(1-p)^{n-\sum_{i=1}^n x_i}

Let's take the natural log of both sides, remembering that ln⁑(ab)=ln⁑(a)+ln⁑(b)\ln(ab) = \ln(a) + \ln(b):

ln⁑(L(p))=βˆ‘i=1nxiln⁑(p)+(nβˆ’βˆ‘i=1nxi)ln⁑(1βˆ’p)\ln(L(p)) = \sum_{i=1}^n x_i\ln(p) + (n-\sum_{i=1}^n x_i)\ln(1-p)

Let's take the derivative, remembering that the derivative of ln⁑(1βˆ’p)\ln(1-p) equals βˆ’1/(1βˆ’p)-1/(1-p):

ddpln⁑(L(p))=βˆ‘i=1nxipβˆ’nβˆ’βˆ‘i=1nxi1βˆ’p\frac{d}{dp}\ln(L(p)) = \frac{\sum_{i=1}^n x_i}{p} - \frac{n-\sum_{i=1}^n x_i}{1-p}

Now we can set the derivative equal to zero, and solve for p^\widehat p:

βˆ‘i=1nxip^βˆ’nβˆ’βˆ‘i=1nxi1βˆ’p^=0βˆ‘i=1nxip^=nβˆ’βˆ‘i=1nxi1βˆ’p^βˆ‘i=1nxinβˆ’βˆ‘i=1nxi=p^1βˆ’p^XΛ‰βˆ’1=p^βˆ’1XΛ‰=p^\begin{alignedat}{1} \frac{\sum_{i=1}^n x_i}{\widehat p} &- \frac{n-\sum_{i=1}^n x_i}{1-\widehat p} = 0 \\ \frac{\sum_{i=1}^n x_i}{\widehat p} &= \frac{n-\sum_{i=1}^n x_i}{1-\widehat p} \\ \frac{\sum_{i=1}^n x_i}{n-\sum_{i=1}^n x_i} &= \frac{\widehat p}{1-\widehat p} \\[3ex] \bar X - 1 &= \widehat p - 1 \\ \bar X &= \widehat p \end{alignedat}

MLE Examples

In this lesson, we'll look at some additional MLE examples. MLEs will become very important when we eventually carry out goodness-of-fit tests.

Normal Example

Suppose we have X1,...,Xn∼iidNor(ΞΌ,Οƒ2)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Nor}(\mu, \sigma^2). Let's find the simultaneous MLE's for ΞΌ\mu and Οƒ2\sigma^2:

L(ΞΌ,Οƒ2)=∏i=1nf(xi)=∏i=1n12πσ2exp⁑{βˆ’12(xiβˆ’ΞΌ)2Οƒ2}=1(2πσ2)n/2exp⁑{βˆ’12βˆ‘i=1n(xiβˆ’ΞΌ)2Οƒ2}\begin{alignedat}{1} L(\mu, \sigma^2) &= \prod_{i=1}^n f(x_i) \\ &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2}\frac{(x_i - \mu)^2}{\sigma^2}\right\} \\ &= \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left\{-\frac{1}{2}\sum_{i=1}^n\frac{(x_i - \mu)^2}{\sigma^2}\right\} \end{alignedat}

Let's take the natural log:

ln⁑(L(ΞΌ,Οƒ2))=βˆ’n2ln⁑(2Ο€)βˆ’n2ln⁑(Οƒ2)βˆ’12Οƒ2βˆ‘i=1n(xiβˆ’ΞΌ)2\ln(L(\mu, \sigma^2)) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2

Let's take the first derivative with respect to ΞΌ\mu, to find the MLE, ΞΌ^\widehat\mu, for ΞΌ\mu. Remember that the derivative of terms that don't contain ΞΌ\mu are zero:

βˆ‚βˆ‚ΞΌln⁑(L(ΞΌ,Οƒ2))=βˆ‚βˆ‚ΞΌβˆ’12Οƒ2βˆ‘i=1n(xiβˆ’ΞΌ)2\frac{\partial}{\partial\mu} \ln(L(\mu, \sigma^2)) = \frac{\partial}{\partial\mu} \frac{-1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2

What's the derivative of (xiβˆ’ΞΌ)2(x_i - \mu)^2 with respect to ΞΌ\mu? Naturally, it's βˆ’2(xiβˆ’ΞΌ)-2(x_i - \mu). Therefore:

βˆ‚βˆ‚ΞΌln⁑(L(ΞΌ,Οƒ2))=1Οƒ2βˆ‘i=1n(xiβˆ’ΞΌ)\frac{\partial}{\partial\mu} \ln(L(\mu, \sigma^2)) = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i - \mu)

We can set the expression on the right equal to zero and solve for ΞΌ^\widehat\mu:

0=1Οƒ2βˆ‘i=1n(xiβˆ’ΞΌ^)0 = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i - \widehat\mu)

If we solve for μ^\widehat\mu, we see that μ^=Xˉ\widehat\mu = \bar X, which we expect. In other words, the MLE for the true mean, μ\mu, is the sample mean, Xˉ\bar X.

Now, let's take the derivative of ln⁑(L(ΞΌ,Οƒ2))\ln(L(\mu,\sigma^2)) with respect to Οƒ2\sigma^2. Consider:

βˆ‚βˆ‚Οƒ2ln⁑(L(ΞΌ,Οƒ2))=βˆ’n2Οƒ2+12Οƒ4βˆ‘i=1n(xiβˆ’ΞΌ^)2\frac{\partial}{\partial\sigma^2} \ln(L(\mu, \sigma^2)) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \widehat\mu)^2

We can set the expression on the right equal to zero and solve for Οƒ2^\widehat{\sigma^2}:

n2Οƒ2^+12Οƒ4^βˆ‘i=1n(xiβˆ’ΞΌ^)2=012Οƒ4^βˆ‘i=1n(xiβˆ’ΞΌ^)2=n2Οƒ2^1nβˆ‘i=1n(xiβˆ’ΞΌ^)2=2Οƒ4^2Οƒ2^βˆ‘i=1n(Xiβˆ’XΛ‰)2n=Οƒ2^\begin{alignedat} -\frac{n}{2\widehat{\sigma^2}} + \frac{1}{2\widehat{\sigma^4}}\sum_{i=1}^n (x_i - \widehat\mu)^2 &= 0 \\ \frac{1}{2\widehat{\sigma^4}}\sum_{i=1}^n (x_i - \widehat\mu)^2 &= \frac{n}{2\widehat{\sigma^2}} \\ \frac{1}{n}\sum_{i=1}^n (x_i - \widehat\mu)^2 &= \frac{2\widehat{\sigma^4}}{2\widehat{\sigma^2}} \\ \frac{\sum_{i=1}^n(X_i - \bar X)^2}{n} = \widehat{\sigma^2} \end{alignedat}

Notice how close Οƒ2^\widehat{\sigma^2} is to the unbiased sample variance:

S2=βˆ‘i=1n(Xiβˆ’XΛ‰)2nβˆ’1=nΟƒ2^nβˆ’1S^2 = \frac{\sum_{i=1}^n(X_i - \bar X)^2}{n - 1} = \frac{n\widehat{\sigma^2}}{n-1}

Because S2S^2 is unbiased, we have to expect that Οƒ2^\widehat{\sigma^2} is slightly biased. However, Οƒ2^\widehat{\sigma^2} has slightly less variance than S2S^2, making it the MLE. Regardless, the two quantities converge as nn grows.

Gamma Example

Let's look at the Gamma distribution, parameterized by rr and ΞΈ\theta. The pdf for this distribution is shown below. Recall that Ξ“(r)\Gamma(r) is the gamma function.

f(x)=Ξ»rΞ“(r)xrβˆ’1eβˆ’Ξ»x,x>0f(x) = \frac{\lambda^r}{\Gamma(r)}x^{r-1}e^{-\lambda x}, \quad x > 0

Suppose we have X1,...,Xn∼iidGam(r,λ)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Gam}(r, \lambda). Let's find the MLE's for rr and λ\lambda:

L(r,Ξ»)=∏i=1nf(xi)=Ξ»nr[Ξ“(r)]n(∏i=1nxi)rβˆ’1eβˆ’Ξ»βˆ‘ixi\begin{alignedat}{1} L(r, \lambda) &= \prod_{i=1}^n f(x_i) \\[3ex] &= \frac{\lambda^{nr}}{[\Gamma(r)]^n} \left(\prod_{i=1}^n x_i\right)^{r-1}e^{-\lambda\sum_i x_i} \end{alignedat}

Let's take the natural logarithm of both sides, remembering that ln⁑(a/b)=ln⁑(a)βˆ’ln⁑(b)\ln(a/b) = \ln(a) - \ln(b):

ln⁑(L)=rnln⁑(Ξ»)βˆ’nln⁑(Ξ“(r))+(rβˆ’1)ln⁑(∏ixi)βˆ’Ξ»βˆ‘ixi\ln(L) = rn\ln(\lambda) - n\ln(\Gamma(r)) + (r-1)\ln\left(\prod_ix_i\right) - \lambda\sum_ix_i

Let's get the MLE of Ξ»\lambda first by taking the derivative with respect to Ξ»\lambda. Notice that the middle two terms disappear:

βˆ‚βˆ‚Ξ»ln⁑(L)=rnΞ»βˆ’βˆ‘i=1nxi\frac{\partial}{\partial\lambda}\ln(L) = \frac{rn}{\lambda} - \sum_{i=1}^n x_i

Let's set the expression on the right equal to zero and solve for Ξ»^\widehat\lambda:

0=r^nΞ»^βˆ’βˆ‘i=1nxiβˆ‘i=1nxi=r^nΞ»^Ξ»^=r^nβˆ‘i=1nxi=r^/XΛ‰\begin{alignedat}{1} 0 &= \frac{\widehat rn}{\widehat\lambda} - \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i &= \frac{\widehat rn}{\widehat\lambda} \\ \widehat\lambda &= \frac{\widehat rn}{\sum_{i=1}^n x_i} = \widehat r / \bar X\\ \end{alignedat}

It turns out the mean of the Gamma distribution is r/λr / \lambda. If we pretend that Xˉ=μ\bar X = \mu then we can see how, with a simple rearrangement, that λ^=r^/Xˉ\widehat\lambda = \widehat r / \bar X.

Now let's find the MLE of rr. First, we take the derivative with respect to rr:

βˆ‚βˆ‚rln⁑(L)=nln⁑(Ξ»)βˆ’nΞ“(r)ddrΞ“(r)+ln⁑(∏ixi)\frac{\partial}{\partial r}\ln(L) = n\ln(\lambda) - \frac{n}{\Gamma(r)}\frac{d}{dr}\Gamma(r) + \ln\left(\prod_ix_i\right)

We can define the digamma function, Ξ¨(r)\Psi(r), to help us with the term involving the gamma function and it's derivative:

Ξ¨(r)≑Γ′(r)/Ξ“(r)\Psi(r) \equiv \Gamma'(r) / \Gamma(r)

At this point, we can substitute in λ^=r^/Xˉ\widehat\lambda = \widehat r/\bar X, and then use a computer to solve the following equation, either by bisection, Newton's method, or some other method:

nln⁑(r^/XΛ‰)βˆ’nΞ¨(r)+ln⁑(∏ixi)=0n\ln(\widehat r/\bar X) - n\Psi(r) + \ln\left(\prod_ix_i\right) = 0

The challenging part of evaluating the digamma function is computing the derivative of the gamma function. We can use the definition of the derivative here to help us, choosing our favorite small hh and then evaluating:

Ξ“β€²(r)β‰ˆΞ“(r+h)βˆ’Ξ“(r)h\Gamma'(r) \approx \frac{\Gamma(r+h) - \Gamma(r)}{h}

Uniform Example

Suppose we have X1,...,Xn∼iidUnif(0,θ)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Unif}(0, \theta). Let's find the MLE for θ\theta.

Remember that the pdf f(x)=1/ΞΈ,0<x<ΞΈf(x) = 1/\theta, 0 < x < \theta. We can take the likelihood function as the product of the f(xi)f(x_i)'s:

L(ΞΈ)=∏i=1nf(xi)={1/ΞΈn,ifΒ 0≀xi≀θ,βˆ€i0otherwiseL(\theta) = \prod_{i=1}^n f(x_i) = \begin{cases} 1/\theta^n,& \text{if } 0 \leq x_i \leq \theta, \forall i \\ 0 & \text{otherwise} \end{cases}

In order to have L(ΞΈ)>0L(\theta) > 0, we must have 0≀xi≀θ,βˆ€i0 \leq x_i \leq \theta, \forall i. In other words, ΞΈ\theta must be at least as large as the largest observation we've seen yet: ΞΈβ‰₯max⁑ixi\theta \geq \max_i x_i.

Subject to this constraint, L(θ)=1/θnL(\theta) = 1 / \theta^n is not maximized at θ=0\theta = 0. Instead L(θ)=1/θnL(\theta) = 1 / \theta^n is maximized at the smallest possible θ\theta value, namely θ^=max⁑iXi\widehat\theta = \max_i X_i.

This result makes sense in light of the similar (unbiased) estimator Y2=(n+1)max⁑iXi/nY_2 = (n+1)\max_i X_i/n that we saw previously.

Invariance Properties of MLEs

In this lesson, we will expand the vocabulary of maximum likelihood estimators by looking at the invariance property of MLEs. In a nutshell, if we have the MLE for some parameter, then we can use the invariance property to determine the MLE for any reasonable function of that parameter.

Invariance Property of MLE's

If ΞΈ^\widehat{\theta} is the MLE of some parameter, ΞΈ\theta, and h(β‹…)h(\cdot) is a 1:1 function, then h(ΞΈ^)h(\widehat{\theta}) is the MLE of h(ΞΈ)h(\theta).

Remember that this invariance property does not hold for unbiasedness. For instance, we said previously that the sample variance is an unbiased estimator for the true variance because E[S2]=Οƒ2E[S^2] = \sigma^2. However, E[S2]β‰ ΟƒE[\sqrt{S^2}] \neq \sigma, so we cannot use the sample standard deviation as an unbiased estimator for the true standard deviation.

Examples

Suppose we have a random sample, X1,...,Xn∼iidBern(p)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Bern}(p). We might remember that the MLE of pp is p^=XΛ‰\widehat p = \bar X. If we consider the 1:1 function h(ΞΈ)=ΞΈ2,ΞΈ>0h(\theta) = \theta^2, \theta > 0, then the invariance property says that the MLE of p2p^2 is XΛ‰2\bar X^2.

Suppose we have a random sample, X1,...,Xn∼iidNor(ΞΌ,Οƒ2)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Nor}(\mu, \sigma^2). We saw previously that the MLE for Οƒ2\sigma^2 is:

Οƒ2^=1nβˆ‘i=1n(Xiβˆ’XΛ‰)2\widehat{\sigma^2} = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2

We just said that we couldn't take the square root of S2S^2 to estimate Οƒ \sigma in an unbiased way. However, we can use the square root of Οƒ2^\widehat{\sigma^2} to get the MLE for Οƒ\sigma.

If we consider the 1:1 function h(ΞΈ)=+ΞΈh(\theta) = +\sqrt\theta, then the invariance property says that the MLE of Οƒ\sigma is:

Οƒ^=Οƒ2^=βˆ‘i=1n(Xiβˆ’XΛ‰)2n\widehat\sigma = \sqrt{\widehat{\sigma^2}} = \sqrt\frac{\sum_{i=1}^n (X_i - \bar X)^2}{n}

Suppose we have a random sample, X1,...,Xn∼iidExp(Ξ»)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Exp}(\lambda). The survival function, FΛ‰(x)\bar{F}(x), is:

FΛ‰(x)=P(X>x)=1βˆ’F(x)=1βˆ’(1βˆ’eβˆ’Ξ»x)=eβˆ’Ξ»x\bar{F}(x) = P(X > x) = 1 - F(x) = 1 - (1 - e^{-\lambda x}) = e^{-\lambda x}

We saw previously the the MLE for λ\lambda is λ^=1/Xˉ\widehat\lambda = 1/\bar X.Therefore, using the invariance property, we can see that the MLE for Fˉ(x)\bar{F}(x) is Fˉ(λ^)\bar{F}(\widehat{\lambda}):

FΛ‰(x)^=eβˆ’Ξ»^x=eβˆ’x/XΛ‰\widehat{\bar{F}(x)} = e^{-\widehat{\lambda}x} = e^{-x / \bar{X}}

The MLE for the survival function is used all the time in actuarial sciences to determine - somewhat gruesomely, perhaps - the probability that people will live past a certain age.

The Method of Moments (Optional)

In this lesson, we'll finish off our discussion on estimators by talking about the Method of Moments.

The Method Of Moments

The kkth moment of a random variable XX is:

E[Xk]={βˆ‘xxkf(x)ifΒ XΒ isΒ discrete∫Rxkf(x)dxifΒ XΒ isΒ continuousE[X^k] = \begin{cases} \sum_x x^k f(x) & \text{if X is discrete} \\ \int_\mathbb{R} x^k f(x)dx & \text{if X is continuous} \end{cases}

Suppose we have a sequence of random variables, X1,...,XnX_1,...,X_n, which are iid from pmf/pdf f(x)f(x). The method of moments (MOM) estimator for E[Xk]E[X^k], mkm_k, is:

mk=1nβˆ‘i=1nXikm_k = \frac{1}{n} \sum_{i=1}^n X_i^k

Note that mkm_k is equal to the sample average of the XikX_i^k's. Indeed, the MOM estimator for μ=E[Xi]\mu = E[X_i], is the sample mean, Xˉ\bar X:

m1=1nβˆ‘i=1nXi=XΛ‰=E[Xi]m_1 = \frac{1}{n} \sum_{i=1}^n X_i = \bar X = E[X_i]

Similarly, we can find the MOM estimator for k=2k=2:

m2=1nβˆ‘i=1nXi2=E[Xi2]m_2 = \frac{1}{n}\sum_{i=1}^n X_i^2 = E[X_i^2]

We can combine the MOM estimators for k=1,2k=1,2 to produce an expression for the variance of XiX_i:

Var(Xi)=E[Xi2]βˆ’(E[Xi])2=1nβˆ‘i=1nXi2βˆ’XΛ‰2=βˆ‘i=1n(Xi2n)βˆ’XΛ‰2=βˆ‘i=1n(Xi2nβˆ’XΛ‰2n)=1nβˆ‘i=1n(Xi2βˆ’XΛ‰2)=nβˆ’1nS2\begin{alignedat}{1} \text{Var}(X_i) &= E[X_i^2] - (E[X_i])^2 \\ &= \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X^2 \\ &= \sum_{i=1}^n\left(\frac{X_i^2}{n}\right) - \bar X^2 \\ &= \sum_{i=1}^n\left(\frac{X_i^2}{n} - \frac{\bar X^2}{n} \right) \\ &= \frac{1}{n}\sum_{i=1}^n\left(X_i^2 - \bar X^2 \right) \\ &= \frac{n-1}{n}S^2 \end{alignedat}

Of course, it's perfectly okay to use S2S^2 to estimate the variance, and the two quantities converge as nn grows.

Poisson Example

Suppose that X1,...,Xn∼iidPois(Ξ»)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Pois}(\lambda). We know that, for the Poisson distribution, E[Xi]=Ξ»E[X_i] = \lambda, so a MOM estimator for Ξ»\lambda is XΛ‰\bar X.

We might remember that the variance of the Poisson distribution is also Ξ»\lambda, so another MOM estimator for Ξ»\lambda is:

nβˆ’1nS2\frac{n-1}{n}S^2