Simulation

Calculus, Probability, and Statistics Primers

91 minute read



Notice a typo? Please submit an issue or open a PR.

Calculus, Probability, and Statistics Primers

Calculus Primer [OPTIONAL]

In this lesson, we are going to take a quick review of calculus. If you are already familiar with basic calculus, there is nothing new here; regardless, it may be helpful to revisit these concepts.

Calculus Primer

Suppose that we have a function, $f(x)$, that maps values of $x$ from a domain, $X$, to a range, $Y$. We can represent this in shorthand as: $f(x) : X \to Y$.

For example, if $f(x) = x^2$, then $f(x)$ maps values from the set of real numbers, $\mathbb{R}$, to the nonnegative portion of that set, $\mathbb{R}^+$.

We say that $f(x)$ is continuous if $f(x)$ exists for all $x \in X$ and, for any $x_0, x \in X$, $\lim_{x \to x_0} f(x) = f(x_0)$. Here, "lim" refers to the limit.

For example, the function $f(x) = 3x^2$ is continuous for all $x$. However, consider the function $f(x) = \lfloor x \rfloor$, which rounds down $x$ to the nearest integer. This function is not continuous and has a jump discontinuity at every integer $x$. Here is a graph.

If $f(x)$ is continuous, then the derivative at $x$ - assuming that it exists and is well-defined for any $x$ - is:

$$\frac{d}{dx} f(x) \equiv f'(x) \equiv \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

Note that we also refer to the derivative at $x$ as the instantaneous slope at $x$. The expression $f(x + h) - f(x)$ represents the "rise", and $h$ represents the "run". As $h \to 0$, the slope between $(x, f(x))$ and $(x + h, f(x + h))$ approaches the instantaneous slope at $(x, f(x))$.
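We can watch this limit at work with a quick numerical sketch in Python (the function and names here are our own, not from the lesson): as $h$ shrinks, the difference quotient for $f(x) = x^2$ at $x = 3$ approaches the true slope, $f'(3) = 6$.

```python
def f(x):
    return x ** 2

def difference_quotient(f, x, h):
    # "Rise over run" between (x, f(x)) and (x + h, f(x + h)).
    return (f(x + h) - f(x)) / h

# As h shrinks, the quotient approaches the instantaneous slope f'(3) = 6.
for h in [0.1, 0.01, 0.001]:
    print(h, difference_quotient(f, 3, h))
```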

Some Old Friends

Let's revisit the derivative of some common expressions.

  • The derivative of a constant, $a$, is $0$.
  • The derivative of a polynomial term, $x^k$, is $kx^{k-1}$.
  • The derivative of $e^x$ is $e^x$.
  • The derivative of $\sin(x)$ is $\cos(x)$.
  • The derivative of $\cos(x)$ is $-\sin(x)$.
  • The derivative of the natural log of $x$, $\ln(x)$, is $\frac{1}{x}$.
  • Finally, the derivative of $\arctan(x)$ is $\frac{1}{1+x^2}$.

Now let's look at some well-known properties of derivatives. The derivative of a constant times a function is the constant times the derivative of the function:

$$\left[af(x)\right]' = af'(x)$$

The derivative of the sum of two functions is equal to the sum of the derivatives of the functions:

$$\left[f(x) + g(x)\right]' = f'(x) + g'(x)$$

The derivative of the product of two functions follows this rule:

$$\left[f(x)g(x)\right]' = f'(x)g(x) + g'(x)f(x)$$

The derivative of the quotient of two functions follows this rule:

$$\left[\frac{f(x)}{g(x)}\right]' = \frac{g(x)f'(x) - f(x)g'(x)}{g^2(x)}$$

We can remember this quotient rule with the following mnemonic, referring to the numerator as "hi" and the denominator as "lo": "lo dee hi minus hi dee lo over lo lo".

Finally, the derivative of the composition of two functions follows this rule:

$$\left[f(g(x))\right]' = f'(g(x))g'(x)$$

Example

Let's look at an example. Suppose that $f(x) = x^2$ and $g(x) = \ln(x)$. From our initial derivative rules, we know that $f'(x) = 2x$ and $g'(x) = \frac{1}{x}$.

Let's calculate the derivative of the product of $f(x)$ and $g(x)$:

$$\left[f(x)g(x)\right]' = f'(x)g(x) + g'(x)f(x)$$
$$\left[f(x)g(x)\right]' = 2x\ln{x} + \frac{x^2}{x} = 2x\ln{x} + x$$

Let's calculate the derivative of the quotient of $f(x)$ and $g(x)$:

$$\left[\frac{f(x)}{g(x)}\right]' = \frac{g(x)f'(x) - f(x)g'(x)}{g^2(x)}$$
$$\left[\frac{f(x)}{g(x)}\right]' = \frac{2x\ln{x} - x}{\ln^2{x}}$$

Let's calculate the derivative of the composition $f(g(x))$:

$$\left[f(g(x))\right]' = f'(g(x))g'(x)$$
$$\left[f(g(x))\right]' = \frac{2\ln(x)}{x}$$

The expression $f'(g(x))$ might look tricky at first. Remember that $f(x) = x^2$, so $f'(x) = 2x$. Thus, $f'(g(x)) = 2g(x) = 2\ln(x)$.
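We can sanity-check all three results numerically with central differences. This is a quick sketch; the helper `deriv` and the variable names are our own:

```python
from math import log

def deriv(func, x, h=1e-6):
    # Central-difference approximation of func'(x).
    return (func(x + h) - func(x - h)) / (2 * h)

x = 2.0
product = lambda t: t ** 2 * log(t)   # f(t)g(t)
quotient = lambda t: t ** 2 / log(t)  # f(t)/g(t)
composition = lambda t: log(t) ** 2   # f(g(t)) = (ln t)^2

# Compare against the closed-form derivatives from the example.
assert abs(deriv(product, x) - (2 * x * log(x) + x)) < 1e-5
assert abs(deriv(quotient, x) - (2 * x * log(x) - x) / log(x) ** 2) < 1e-5
assert abs(deriv(composition, x) - 2 * log(x) / x) < 1e-5
```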

Second Derivatives

The second derivative of $f(x)$ equals the derivative of the first derivative of $f(x)$: $f''(x) = \frac{d}{dx}f'(x)$. We spoke of the first derivative as the instantaneous slope of $f(x)$ at $x$. We can think of the second derivative as the slope of the slope.

A classic example comes from physics. If $f(x)$ describes the position of an object, then $f'(x)$ describes the object's velocity and $f''(x)$ describes the object's acceleration.

So why do we care about second derivatives?

A minimum or maximum of $f(x)$ can only occur when the slope of $f(x)$ equals $0$; in other words, when $f'(x) = 0$. Mentally visualizing the peaks and valleys of graphs of certain functions may help in understanding why this is true.

Consider a point, $x_0$, such that $f'(x_0) = 0$. If $f''(x_0) < 0$, then $f(x_0)$ is a maximum. If $f''(x_0) > 0$, then $f(x_0)$ is a minimum. If $f''(x_0) = 0$, then the test is inconclusive: $x_0$ may be a maximum, a minimum, or an inflection point.

Example

Consider the function $f(x) = e^{2x} + e^{-x}$. We want to find a point, $x_0$, that minimizes $f$. Let's first compute the derivative, using the composition rule for each term:

$$f'(x) = \left[e^{2x} + e^{-x}\right]' = 2e^{2x} - e^{-x}$$

Let's find $x_0$ such that $f'(x_0) = 0$:

$$2e^{2x} - e^{-x} = 0$$
$$2e^{2x} = e^{-x}$$
$$\frac{e^{-x}}{e^{2x}} = 2$$
$$\ln\left(\frac{e^{-x}}{e^{2x}}\right) = \ln(2)$$
$$\ln(e^{-x}) - \ln(e^{2x}) = \ln(2)$$
$$-x - 2x = \ln(2)$$
$$-3x = \ln(2)$$
$$x = \frac{\ln(2)}{-3} \approx -0.231$$

Now, let's calculate $f''(x)$:

$$f''(x) = \left[2e^{2x} - e^{-x}\right]' = 4e^{2x} + e^{-x}$$

Let's plug in $x_0$:

$$f''(-0.231) = 4e^{2(-0.231)} + e^{0.231} \approx 3.78$$

Since this value is positive, $f(x_0)$ is a minimum. Furthermore, since $e^x > 0$ for all $x$, $f''(x) > 0$ for all $x$. This means that $f(x_0)$ is not only a local minimum, but specifically the global minimum of $f(x)$.
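We can confirm this numerically with a short Python sketch (names our own): the derivative vanishes at $x_0 = -\ln(2)/3$, and nearby points give larger values of $f$.

```python
from math import exp, log

# f and its derivative, from the example above.
f = lambda x: exp(2 * x) + exp(-x)
f_prime = lambda x: 2 * exp(2 * x) - exp(-x)

x0 = -log(2) / 3        # the solution we derived, about -0.231
print(x0, f_prime(x0))  # the derivative is (numerically) zero here

# Nearby points give larger values of f, consistent with a minimum.
assert f(x0) < f(x0 - 0.1) and f(x0) < f(x0 + 0.1)
```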

Saved By Zero! Solving Equations [OPTIONAL]

In this lesson, we are going to look at formal ways to find solutions to nonlinear equations. We will use these techniques several times throughout the course, as solving equations is useful in a lot of different methodologies within simulation.

Finding Zeroes

When we talk about solving a nonlinear equation, $f$, what we mean is finding a value, $x$, such that $f(x) = 0$.

There are a few methods by which we might find such an $x$:

  • trial and error (not so good)
  • bisection (divide and conquer)
  • Newton's method or some variation
  • fixed-point method

Example

Let's remind ourselves of an example from the previous lesson.

Consider the function $f(x) = e^{2x} + e^{-x}$. We want to find a point, $x_0$, that minimizes $f$. Let's first compute the derivative, using the composition rule for each term:

$$f'(x) = \left[e^{2x} + e^{-x}\right]' = 2e^{2x} - e^{-x}$$

Let's find $x_0$ such that $f'(x_0) = 0$:

$$2e^{2x} - e^{-x} = 0$$
$$2e^{2x} = e^{-x}$$
$$\frac{e^{-x}}{e^{2x}} = 2$$
$$\ln\left(\frac{e^{-x}}{e^{2x}}\right) = \ln(2)$$
$$\ln(e^{-x}) - \ln(e^{2x}) = \ln(2)$$
$$-x - 2x = \ln(2)$$
$$-3x = \ln(2)$$
$$x = \frac{\ln(2)}{-3} \approx -0.231$$

Now, let's calculate $f''(x)$:

$$f''(x) = \left[2e^{2x} - e^{-x}\right]' = 4e^{2x} + e^{-x}$$

Let's plug in $x_0$:

$$f''(-0.231) = 4e^{2(-0.231)} + e^{0.231} \approx 3.78$$

Since this value is positive, $f(x_0)$ is a minimum. Furthermore, since $e^x > 0$ for all $x$, $f''(x) > 0$ for all $x$. This means that $f(x_0)$ is not only a local minimum, but specifically the global minimum of $f(x)$.

Bisection

Suppose we have a function, $g(x)$, and suppose that we can find two values, $x_1$ and $x_2$, such that $g(x_1) < 0$ and $g(x_2) > 0$. Given these conditions, we know, via the intermediate value theorem, that there must be a solution in between $x_1$ and $x_2$. In other words, there exists $x^* \in [x_1, x_2]$ such that $g(x^*) = 0$.

To find $x^*$, we first compute $x_3 = \frac{x_1 + x_2}{2}$. If $g(x_3) < 0$, then we know that $x^*$ exists on $[x_3, x_2]$. Otherwise, if $g(x_3) > 0$, then $x^*$ exists on $[x_1, x_3]$. We call this method bisection because we bisect the search interval - we cut it in half - on each round. We continue bisecting until the length of the search interval is as small as desired. See binary search.

Example

Now we are going to use bisection to find the solution to $g(x) = x^2 - 2 = 0$. Of course, we know analytically that $g(\sqrt{2}) = 0$, so we are essentially using bisection here to approximate $\sqrt{2}$.

Let's pick our two starting points, $x_1 = 1$ and $x_2 = 2$. Since $g(x_1) = -1$ and $g(x_2) = 2$, we know, from the intermediate value theorem, that there must exist an $x^* \in [1, 2]$ such that $g(x^*) = 0$.

We consider a point, $x_3$, halfway between $x_1$ and $x_2$: $x_3 = \frac{1 + 2}{2} = 1.5$. Since $g(x_3) = 0.25$, we know that $x^*$ lies on the interval $[1, 1.5]$.

Similarly, we can consider a point, $x_4$, halfway between $x_1$ and $x_3$: $x_4 = \frac{1 + 1.5}{2} = 1.25$. Since $g(x_4) = -0.4375$, we know that $x^*$ lies on the interval $[1.25, 1.5]$.

Let's do this twice more. $x_5 = \frac{1.25 + 1.5}{2} = 1.375$. $g(x_5) = -0.109$, so $x^*$ lies on the interval $[1.375, 1.5]$. $x_6 = \frac{1.375 + 1.5}{2} = 1.4375$. $g(x_6) = 0.0664$, so $x^*$ lies on the interval $[1.375, 1.4375]$.

We can see that our search is converging to $\sqrt{2} \approx 1.414$.
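The procedure above is easy to automate. Here is a minimal Python sketch of bisection (the names are our own), applied to $g(x) = x^2 - 2$ on $[1, 2]$:

```python
def bisect(g, lo, hi, tol=1e-6):
    # Assumes g(lo) < 0 < g(hi); halve the bracket until it is tiny.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid  # the root is in the upper half
        else:
            hi = mid  # the root is in the lower half
    return (lo + hi) / 2

root = bisect(lambda x: x ** 2 - 2, 1, 2)
print(root)  # close to sqrt(2) = 1.414...
```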

Newton's Method

Suppose that, for a function $g(x)$, we can find a reasonable first guess, $x_0$, for the solution of $g(x) = 0$. If $g(x)$ has a derivative that isn't too flat near the solution, then we can iteratively refine our estimate of the solution using the following sequence:

$$x_{i+1} = x_i - \frac{g(x_i)}{g'(x_i)}$$

We continue iterating until the sequence appears to converge.

Example

Let's try out Newton's method for $g(x) = x^2 - 2$. We can re-express the sequence above as follows:

$$x_{i+1} = x_i - \frac{x_i^2 - 2}{2x_i}$$
$$x_{i+1} = x_i - \left(\frac{x_i}{2} - \frac{1}{x_i}\right)$$
$$x_{i+1} = \frac{x_i}{2} + \frac{1}{x_i}$$

Let's start with a bad guess, $x_0 = 1$. Then:

$$x_1 = \frac{x_0}{2} + \frac{1}{x_0} = \frac{1}{2} + \frac{1}{1} = 1.5$$
$$x_2 = \frac{x_1}{2} + \frac{1}{x_1} = \frac{1.5}{2} + \frac{1}{1.5} \approx 1.4167$$
$$x_3 = \frac{x_2}{2} + \frac{1}{x_2} = \frac{1.4167}{2} + \frac{1}{1.4167} \approx 1.4142$$

After just three iterations, we have approximated $\sqrt{2}$ to four decimal places!
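The iteration above can be sketched in a few lines of Python (names our own):

```python
def newton(g, g_prime, x, iterations=3):
    # Repeatedly apply x <- x - g(x) / g'(x).
    for _ in range(iterations):
        x = x - g(x) / g_prime(x)
    return x

root = newton(lambda x: x ** 2 - 2, lambda x: 2 * x, 1.0)
print(root)  # about 1.4142 after three iterations
```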

Integration [OPTIONAL]

What goes up, must come down. A few lessons ago, we looked at derivatives. In this lesson, we will focus on integration.

Integration

A function, $F(x)$, whose derivative is $f(x)$ is called an antiderivative of $f(x)$. The antiderivative, also referred to as the indefinite integral of $f(x)$, is denoted $F(x) = \int f(x)dx$.

The fundamental theorem of calculus states that if $f(x)$ is continuous, then the area under the curve for $x \in [a, b]$ is given by the definite integral:

$$\int^b_a f(x)dx \equiv F(x) \Big|^b_a \equiv F(b) - F(a)$$

Friends of Mine

Let's look at some indefinite integrals:

$$\int x^k dx = \frac{x^{k + 1}}{k + 1} + C, \quad k \neq -1$$
$$\int \frac{dx}{x} = \ln|x| + C$$
$$\int e^x dx = e^x + C$$
$$\int \cos(x)dx = \sin(x) + C$$
$$\int \frac{dx}{1 + x^2} = \arctan(x) + C$$

Note that $C$ is a constant value. Consider a function $f(x)$. Since the derivative of a constant value is zero, $f'(x) = \left[f(x) + C\right]'$. When we integrate $f'(x)$, we need to re-include this constant expression: $\int f'(x)dx = f(x) + C$.

Let's look at some well-known properties of definite integrals.

The integral of a function from $a$ to $a$ is zero:

$$\int_a^a f(x)dx = 0$$

The integral of a function from $a$ to $b$ is the negative of the integral from $b$ to $a$:

$$\int_a^b f(x)dx = -\int_b^a f(x)dx$$

Given a third point, $c$, the integral of a function from $a$ to $b$ is the sum of the integrals from $a$ to $c$ and $c$ to $b$:

$$\int_a^b f(x)dx = \int_a^c f(x)dx + \int_c^b f(x)dx$$

Furthermore, the integral of a sum is the sum of the integrals:

$$\int \left[f(x) + g(x)\right]dx = \int f(x)dx + \int g(x)dx$$

Similar to the product rule for derivatives, we integrate products using integration by parts:

$$\int f(x)g'(x)dx = f(x)g(x) - \int g(x)f'(x)dx$$

Similar to the chain rule for derivatives, we integrate composed functions using the substitution rule, substituting $u$ for $g(x)$:

$$\int f(g(x))g'(x)dx = \int f(u)du, \text{ where } u = g(x)$$

Example

Let's look at an example. Given $f(x) = x$ and $g'(x) = e^{2x}$, let's compute the integral of $f(x)g'(x)dx$ over $[0, 1]$.

We know, via integration by parts, that:

$$\int_0^1 f(x)g'(x)dx = f(x)g(x)\Big|_0^1 - \int_0^1 g(x)f'(x)dx$$

Notice that we need to take the integral of $g'(x)$. We can calculate this using u-substitution. Let $a(x) = 2x$ and $b(x) = e^x$. Then, using the substitution rule above:

$$\int b(a(x))a'(x)dx = \int b(u)du, \text{ where } u = a(x)$$

Note that $a'(x) = 2$, and $b(a(x)) = b(2x) = e^{2x} = g'(x)$. Thus,

$$\int 2g'(x)dx = \int e^u du, \text{ where } u = a(x)$$

Divide both sides by two:

$$\int g'(x)dx = \frac{1}{2}\int e^u du, \text{ where } u = a(x)$$

Integrate:

$$g(x) + C = \frac{1}{2} e^u + C, \text{ where } u = a(x)$$

Subtract $C$ from both sides and substitute:

$$g(x) = \frac{1}{2} e^{2x}$$

Now that we know $g(x)$, let's return to our integration by parts:

$$\int_0^1 f(x)g'(x)dx = f(x)g(x)\Big|_0^1 - \int_0^1 g(x)f'(x)dx$$

Let's substitute in the appropriate values for $f(x)$, $f'(x)$, and $g(x)$:

$$\int_0^1 xe^{2x}dx = \frac{1}{2} xe^{2x}\Big|_0^1 - \int_0^1 \frac{1}{2} e^{2x}dx$$

Let's pull out the $\frac{1}{2}$:

$$\int_0^1 xe^{2x}dx = \frac{1}{2} \left(xe^{2x}\Big|_0^1 - \int_0^1 e^{2x}dx\right)$$

Of course, we already know how to integrate $e^{2x}$:

$$\int_0^1 xe^{2x}dx = \frac{1}{2} \left(xe^{2x}\Big|_0^1 - \frac{1}{2}e^{2x}\Big|_0^1\right)$$

Now, let's solve:

$$\int_0^1 xe^{2x}dx = \frac{1}{2} \left(\left(e^{2} - \frac{1}{2}e^{2}\right) - \left(0 - \frac{1}{2}e^0\right)\right)$$
$$\int_0^1 xe^{2x}dx = \frac{1}{2} \left(\frac{e^{2}}{2} + \frac{1}{2}\right)$$
$$\int_0^1 xe^{2x}dx = \frac{e^{2}}{4} + \frac{1}{4}$$
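As a sanity check on the algebra, a brute-force right-endpoint sum (a technique we will meet later in these notes) should land close to $e^2/4 + 1/4 \approx 2.097$. This is a quick sketch, not part of the original derivation:

```python
from math import exp

n = 100_000
dx = 1 / n
# Right-endpoint rectangle sum for the integral of x * e^(2x) on [0, 1].
riemann = sum((i * dx) * exp(2 * i * dx) for i in range(1, n + 1)) * dx

closed_form = exp(2) / 4 + 0.25
print(riemann, closed_form)  # both about 2.097
```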

Taylor and Maclaurin Series

Derivatives of arbitrary order $k$ can be written as $f^{(k)}(x)$ or $\frac{d^k}{dx^k}f(x)$. By convention, $f^{(0)}(x) = f(x)$.

The Taylor series expansion of $f(x)$ about a point $a$ is given by the following infinite sum:

$$f(x) = \sum_{k = 0}^\infty \frac{f^{(k)}(a)(x - a)^k}{k!}$$

The Maclaurin series expansion of $f(x)$ is simply the Taylor series about $a = 0$:

$$f(x) = \sum_{k = 0}^\infty \frac{f^{(k)}(0) x^k}{k!}$$

Maclaurin Friends

Let's look at some familiar Maclaurin series:

$$\sin(x) = \sum_{k = 0}^\infty \frac{(-1)^{k} x^{2k + 1}}{(2k + 1)!}$$
$$\cos(x) = \sum_{k = 0}^\infty \frac{(-1)^{k} x^{2k}}{(2k)!}$$
$$e^x = \sum_{k = 0}^\infty \frac{x^{k}}{k!}$$
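Truncating these infinite sums after a modest number of terms already matches the library functions closely. A quick sketch (the helper names are our own):

```python
from math import exp, factorial, sin

def maclaurin_sin(x, terms=10):
    # Partial sum of (-1)^k x^(2k+1) / (2k+1)!
    return sum((-1) ** k * x ** (2 * k + 1) / factorial(2 * k + 1)
               for k in range(terms))

def maclaurin_exp(x, terms=20):
    # Partial sum of x^k / k!
    return sum(x ** k / factorial(k) for k in range(terms))

print(maclaurin_sin(1.0), sin(1.0))
print(maclaurin_exp(1.0), exp(1.0))
```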

While We're At It...

Let's look at some other sums, unrelated to Taylor or Maclaurin series, that are helpful to know.

The sum of all the integers between $1$ and $n$ is given by the following equation:

$$\sum_{k = 1}^n k = \frac{n(n + 1)}{2}$$

Similarly, if we want to sum the squares of all the integers between $1$ and $n$, we can use this equation:

$$\sum_{k = 1}^n k^2 = \frac{n(n + 1)(2n + 1)}{6}$$

Finally, if we want to sum all of the powers of $p$, where $-1 < p < 1$, we can use this equation:

$$\sum_{k = 0}^\infty p^k = \frac{1}{1 - p}$$
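We can spot-check all three identities in Python; this is just a quick sketch with our own variable names:

```python
# Check the first two closed forms exactly, for n = 100.
n = 100
assert sum(range(1, n + 1)) == n * (n + 1) // 2
assert sum(k ** 2 for k in range(1, n + 1)) == n * (n + 1) * (2 * n + 1) // 6

# The geometric series is infinite, so truncate it far enough out
# that the remaining terms are negligible.
p = 0.5
partial = sum(p ** k for k in range(200))
print(partial, 1 / (1 - p))  # both about 2.0
```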

L'Hôpital's Rule

Occasionally, we run into trouble when we encounter indeterminate ratios of the form $0/0$ or $\infty/\infty$. L'Hôpital's Rule states that, when $\lim_{x \to a}f(x)$ and $\lim_{x \to a}g(x)$ both go to zero or both go to infinity, then:

$$\lim_{x \to a}\frac{f(x)}{g(x)} = \lim_{x \to a}\frac{f'(x)}{g'(x)}$$

For example, consider the following limit:

$$\lim_{x \to 0}\frac{\sin(x)}{x}$$

As $x \to 0$, both $\sin(x)$ and $x$ go to zero, giving the indeterminate form $0/0$. Thus, we can apply L'Hôpital's Rule:

$$\lim_{x \to 0}\frac{\sin(x)}{x} = \lim_{x \to 0}\frac{\cos(x)}{1} = 1$$
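We can watch this limit numerically (a quick sketch): the ratio creeps toward $1$ as $x$ approaches $0$.

```python
from math import sin

# sin(x)/x approaches 1 as x approaches 0.
for x in [0.1, 0.01, 0.001]:
    print(x, sin(x) / x)
```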

Integration Computer Exercises [OPTIONAL]

In this lesson, we will demonstrate several numerical techniques that we might need to use if we can't find a closed-form solution to a function we are integrating. One of these techniques incorporates simulation!

Riemann Sums

Suppose we have a continuous function, $f(x)$, under which we'd like to approximate the area from $a$ to $b$. We can fit $n$ adjacent rectangles between $a$ and $b$, such that each rectangle has a width $\Delta x = (b - a) / n$ and a height $f(x_i)$, where $x_i = a + i\Delta x$ is the right-hand endpoint of the $i$th rectangle.

The sum of the areas of the rectangles approximates the area under $f(x)$ from $a$ to $b$, which is equal to the integral of $f(x)$ from $a$ to $b$:

$$\int_a^b f(x)dx \approx \sum_{i = 1}^{n}f(x_i)\Delta x$$

We can simplify the right-hand side of the equation by pulling the $\Delta x$ term out in front of the sum and substituting in the appropriate values for $x_i$:

$$\sum_{i = 1}^{n}f(x_i)\Delta x = \frac{b - a}{n} \sum_{i = 1}^{n} f\left(a + \frac{i(b-a)}{n}\right) \approx \int_a^b f(x)dx$$

As $n \to \infty$, this approximation becomes an equality.

Example

Suppose we have a function, $f(x) = \sin(\pi x / 2)$, which we would like to integrate from $0$ to $1$. In other words, we want to compute:

$$\int_0^1\sin\left(\frac{\pi x}{2}\right)dx$$

We can approximate the area under this curve using the following formula:

$$\int_a^b f(x)dx \approx \frac{b - a}{n} \sum_{i = 1}^{n} f\left(a + \frac{i(b-a)}{n}\right)$$

Let's plug in $a = 0$ and $b = 1$:

$$\int_0^1 f(x)dx \approx \frac{1}{n} \sum_{i = 1}^{n} f\left(\frac{i}{n}\right)$$

Finally, let's replace $f$:

$$\int_0^1 f(x)dx \approx \frac{1}{n} \sum_{i = 1}^{n} \sin\left(\frac{\pi i}{2n}\right)$$

For $n = 100$, this sum calculates out to approximately $0.6416$, which is pretty close to the true answer of $2/\pi \approx 0.6366$. For $n = 1000$, our estimate improves to approximately $0.6371$.
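This final sum translates directly into Python (a quick sketch, names our own):

```python
from math import pi, sin

# Right-endpoint Riemann sum for the integral of sin(pi*x/2) on [0, 1].
def riemann(n):
    return sum(sin(pi * i / (2 * n)) for i in range(1, n + 1)) / n

print(riemann(100), riemann(1000), 2 / pi)
```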

Trapezoid Rule

Here we are going to perform the same type of numerical integration, but we are going to use the trapezoid rule instead of the rectangle rule/Riemann sum. Under this rule:

$$\int_a^b f(x)dx \approx \left[\frac{f(x_0)}{2} + \sum_{i = 1}^{n - 1} f(x_i) + \frac{f(x_n)}{2} \right]\Delta x$$

Substituting $a$ and $b$ simplifies the right-hand side of the formula:

$$\int_a^b f(x)dx \approx \frac{b - a}{n} \left[\frac{f(a)}{2} + \sum_{i = 1}^{n - 1} f\left(a + \frac{i(b-a)}{n}\right) + \frac{f(b)}{2}\right]$$

Example

Suppose we have a function, $f(x) = \sin(\pi x / 2)$, which we would like to integrate from $0$ to $1$. In other words, we want to compute:

$$\int_0^1\sin\left(\frac{\pi x}{2}\right)dx$$

We can approximate the area under this curve using the following formula:

$$\int_a^b f(x)dx \approx \frac{b - a}{n} \left[\frac{f(a)}{2} + \sum_{i = 1}^{n - 1} f\left(a + \frac{i(b-a)}{n}\right) + \frac{f(b)}{2}\right]$$

Let's plug in $a = 0$ and $b = 1$:

$$\int_0^1 f(x)dx \approx \frac{1}{n} \left[\frac{f(0)}{2} + \sum_{i = 1}^{n - 1} f\left(\frac{i}{n}\right) + \frac{f(1)}{2}\right]$$

Let's replace $f$:

$$\int_0^1 f(x)dx \approx \frac{1}{n} \left[\frac{\sin(0)}{2} + \sum_{i = 1}^{n - 1} \sin\left(\frac{\pi i}{2n}\right) + \frac{\sin(\pi / 2)}{2}\right]$$

Finally, let's evaluate and simplify:

$$\int_0^1 f(x)dx \approx \frac{1}{n} \left[\sum_{i = 1}^{n - 1} \sin\left(\frac{\pi i}{2n}\right) + \frac{1}{2}\right]$$

For $n = 100$, this sum calculates out to approximately $0.63661$, which is very close to the true answer of $2/\pi \approx 0.63662$. Note that, even when $n = 1000$, the Riemann estimate was not this precise; indeed, integration via the trapezoid rule often converges faster than the Riemann approach.
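Here is the same calculation as a quick Python sketch (names our own):

```python
from math import pi, sin

# Trapezoid rule for the integral of sin(pi*x/2) on [0, 1].
def trapezoid(n):
    interior = sum(sin(pi * i / (2 * n)) for i in range(1, n))
    return (interior + sin(0) / 2 + sin(pi / 2) / 2) / n

print(trapezoid(100), 2 / pi)  # both about 0.6366
```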

Monte Carlo Approximation

Suppose that we can generate an independent and identically distributed sequence of numbers, $U_1, U_2, ..., U_n$, sampled randomly from a uniform $(0, 1)$ distribution. If so, it can be shown that we can approximate the integral of $f(x)$ from $a$ to $b$ according to the following formula:

$$\int_a^b f(x)dx \approx \frac{b - a}{n} \sum_{i = 1}^n f(a + U_i(b - a))$$

Note that this looks a lot like the Riemann integral summation. The difference is that these rectangles are not adjacent, but rather scattered randomly between $a$ and $b$. As $n \to \infty$, the approximation converges to an equality, and it does so about as quickly as the Riemann approach.

Example

Suppose we have a function, $f(x) = \sin(\pi x / 2)$, which we would like to integrate from $0$ to $1$. In other words, we want to compute:

$$\int_0^1\sin\left(\frac{\pi x}{2}\right)dx$$

We can approximate the area under this curve using the following formula:

$$\int_a^b f(x)dx \approx \frac{b - a}{n} \sum_{i = 1}^n f(a + U_i(b - a))$$

Let's plug in $a = 0$ and $b = 1$:

$$\int_0^1 f(x)dx \approx \frac{1}{n} \sum_{i = 1}^n f(U_i)$$

Let's replace $f$:

$$\int_0^1 f(x)dx \approx \frac{1}{n} \sum_{i = 1}^n \sin\left(\frac{\pi U_i}{2}\right)$$

Here is some Python code for how we might simulate this:

```python
# Tested with Python 3.8.3

from random import random
from math import pi, sin

def simulate(n):
  # Average n random samples of sin(pi * U / 2), U ~ Uniform(0, 1).
  result = 0

  for _ in range(n):
    result += sin(pi * random() / 2)

  return result / n

# Average 100 independent runs of simulate(100).
trials_100  = sum([simulate(100) for _ in range(100)]) / 100
print(trials_100)

# Average 1000 independent runs of simulate(100).
trials_1000 = sum([simulate(100) for _ in range(1000)]) / 1000
print(trials_1000)
```

After running this script once on my laptop, trials_100 equals approximately 0.6355, and trials_1000 equals approximately 0.6366.

Probability Basics

In this lesson, we will start our review of probability.

Basics

Hopefully, we already know the very basics, such as sample spaces, events, and the definition of probability. For example, if someone tells us that some event has a probability greater than one or less than zero, we should immediately know that what they are saying is false.

Conditional Probability

The probability of some event, $A$, given some other event, $B$, equals the probability of the intersection of $A$ and $B$, divided by the probability of $B$. In other words, the conditional probability of $A$ given $B$ is:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Note that we assume that $P(B) > 0$ so we can avoid any division-by-zero errors.

A non-mathematical way to think about conditional probability is the probability that $A$ will occur given some updated information $B$. For example, think about the probability that your best friend is asleep at any point in time. Now, consider that same probability, given that it's Tuesday at 3 am.

Example

Let's roll a fair die. Let $A = \{1,2,3\}$ and $B = \{3,4,5,6\}$. What is the probability that the roll is in $A$ given that we know it is in $B$?

We can think about this problem intuitively first. There are four values in $B$, each of which is equally likely to occur. One of those values, three, is also in $A$. If we know that the roll is one of the values in $B$, then there is a one in four chance that the roll is three. Thus, the probability is $1/4$.

We can also use the conditional probability equation to calculate $P(A|B)$:

$$P(A | B) = \frac{P(A \cap B)}{P(B)}$$

Let's calculate $P(A \cap B)$. There are six possible rolls total, and $A$ and $B$ share one of them. Therefore, $P(A \cap B) = 1/6$. Now, let's calculate $P(B)$. There are six possible rolls total, and $B$ contains four of them, so $P(B) = 4/6$. As a result:

$$P(A | B) = \frac{P(A \cap B)}{P(B)} = \frac{1/6}{4/6} = \frac{1}{4}$$

Note that $P(A | B) \neq P(A)$, since $P(A) = 1/2$. Prior information changes probabilities.
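We can also estimate this conditional probability by simulation, in the spirit of this course: roll many dice, keep only the rolls that land in $B$, and see how often those land in $A$. A quick sketch (names our own):

```python
from random import randint

A = {1, 2, 3}
B = {3, 4, 5, 6}

in_b = 0
in_both = 0
for _ in range(100_000):
    roll = randint(1, 6)  # one roll of a fair die
    if roll in B:         # condition on B: keep only rolls in B
        in_b += 1
        if roll in A:
            in_both += 1

print(in_both / in_b)  # about 0.25
```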

Independent Events

If $P(A \cap B) = P(A)P(B)$, then $A$ and $B$ are independent events.

For instance, consider the temperature on Mars and the stock price of IBM. Those two events are independent; in other words, the temperature on Mars has no impact on IBM stock, and vice versa.

Let's consider a theorem: if $A$ and $B$ are independent, then $P(A|B) = P(A)$. This means that if $A$ and $B$ are independent, then prior information about $B$ in no way influences the probability of $A$ occurring.

For example, consider two consecutive coin flips. Knowing that you just flipped heads has no impact on the probability that you will flip heads again: successive coin flips are independent events. However, knowing that it rained yesterday almost certainly impacts the probability that it will rain today. Today's weather is often very much dependent on yesterday's weather.

Example

Toss two dice. Let $A = \text{Sum is 7}$ and $B = \text{First die is 4}$. Since there are six ways to roll a seven with two dice among thirty-six possible outcomes, $P(A) = 1/6$. Similarly, since there is one way to roll a four among six possible rolls, $P(B) = 1/6$.

Out of all thirty-six possible dice rolls, only one meets both criteria: rolling a four followed by a three. As a result:

$$P(A \cap B) = P((4, 3)) = \frac{1}{36} = P(A)P(B)$$

Because of this equality, we can conclude that $A$ and $B$ are independent events.

Random Variables

A random variable, $X$, is a function that maps the sample space, $\Omega$, to the real line: $X: \Omega \to \mathbb{R}$.

For example, let $X$ be the sum of two dice rolls. What is the sample space? Well, it's the set of all possible combinations of two dice rolls: $\{(1,1), (1,2), ..., (6,5), (6,6)\}$. What is the output of $X$? It's a real number: the sum of the two rolls. Thus, the function $X$ maps an element from the sample space to a real number. As a concrete example, $X((4,6)) = 10$.

Additionally, we can enumerate the probabilities that our random variable takes any specific value within the sample space. We refer to this as $P(X = x)$, where $X$ is the random variable, and $x$ is the observation we are interested in.

Consider our $X$ above. What is the probability that the sum of two dice rolls takes on any of the possible values?

$$P(X = x) = \left\{ \begin{array}{ll} 1/36 & \text{if } x = 2 \\ 2/36 & \text{if } x = 3 \\ \vdots \\ 6/36 & \text{if } x = 7 \\ \vdots \\ 1/36 & \text{if } x = 12 \\ 0 & \text{otherwise} \end{array} \right.$$
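Because all thirty-six outcomes are equally likely, we can reproduce this table by brute-force enumeration. A quick sketch (names our own):

```python
from collections import Counter
from fractions import Fraction

# Count how many of the 36 equally likely outcomes give each sum,
# then divide by 36 to get P(X = x).
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}
print(pmf)  # e.g. pmf[2] == Fraction(1, 36), pmf[7] == Fraction(1, 6)
```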

Discrete Random Variables

If the number of possible values of a random variable, $X$, is finite or countably infinite, then $X$ is a discrete random variable. The probability mass function (pmf) of a discrete random variable is given by a function, $f(x) = P(X = x)$. Note that, necessarily, $\sum_x f(x) = 1$.

By countably infinite, we mean that there could be an infinite number of possible values for $x$, but they have a one-to-one correspondence with the integers.

For example, flip two coins. Let $X$ be the number of heads. We can define the pmf, $f(x)$, as:

$$f(x) = \left\{ \begin{array}{ll} 1/4 & \text{if } x = 0 \\ 1/2 & \text{if } x = 1 \\ 1/4 & \text{if } x = 2 \\ 0 & \text{otherwise} \end{array} \right.$$

Out of the four possible pairs of coin flips, one includes no heads, two include one head, and one includes two heads. All other values are impossible, so we assign them a probability of zero. As expected, the sum of $f(x)$ over all $x$ equals one.

Some other well-known discrete random variables include Bernoulli($p$), Binomial($n$, $p$), Geometric($p$), and Poisson($\lambda$). We will talk about each of these types of random variables as we need them.

Continuous Random Variables

We are also interested in continuous random variables. A continuous random variable is one that has probability zero for every individual point, and for which there exists a probability density function (pdf), $f(x)$, such that $P(X \in A) = \int_A f(x)dx$ for every set $A$. Note that, necessarily, $\int_{\mathbb{R}} f(x)dx = 1$.

To reiterate, the pdf does not provide probabilities directly like the pmf; instead, we integrate the pdf over the set of events $A$ to determine $P(X \in A)$.

For example, pick a random real number between three and seven. There is an uncountably infinite number of real numbers in this range, and the probability that I pick any particular value is zero. The pdf, $f(x)$, for this continuous random variable is:

$$f(x) = \left\{ \begin{array}{ll} 1/4 & \text{if } 3 \leq x \leq 7 \\ 0 & \text{otherwise} \end{array} \right.$$

Even though $f(x)$ does not give us probabilities directly, it's the pdf, which means we can integrate it to calculate probabilities.

For instance, what is $P(X \leq 5)$? To calculate this, we need to integrate $f(x)$ from $-\infty$ to $5$. The integral of $f(x)$ from $-\infty$ to $3$ is zero, since $f(x) = 0$ on that interval. The integral of $f(x)$ from $3$ to $5$ is $5/4 - 3/4 = 2/4 = 1/2$. Thus, $P(X \leq 5) = 1/2$, which makes sense as we are splitting our range of numbers in half.

Notice that our function describes a rectangle of width $4$ and height $1/4$. If we take the area under the curve of this function - if we integrate it - we get $1$.

Some other well-known continuous random variables include Uniform($a$, $b$), Exponential($\lambda$), and Normal($\mu$, $\sigma^2$). We will talk about each of these types of random variables as we need them.

Notation

Just a note on notation. The symbol ∼\sim means "is distributed as". For instance, X∼Unif(0,1)X \sim{\text{Unif}(0,1)} means that a random variable, XX is distributed according to the uniform (0,1)(0, 1) probability distribution.

Cumulative Distribution Function

For a random variable, XX, either discrete or continuous, the cumulative distribution function (cdf), F(x)F(x) is the probability that X≤xX \leq x. In other words,

F(x)≡P(X≤x)={∑y≤xf(y) if X is discrete ∫−∞xf(y)dy if X is continuous F(x) \equiv P(X \leq x) = \left\{ \begin{array}{ll} \sum_{y \leq x}f(y) \quad \text{ if } X \text{ is discrete } \\ \\ \int_{-\infty}^xf(y)dy \quad \text{ if } X \text{ is continuous } \end{array} \right.

For discrete random variables, F(x)F(x) is equal to the sum of the discrete probabilities for all y≤xy \leq x. For continuous random variables, F(x)F(x) is equal to the integral of the pdf from −∞-\infty to xx.

Note that as x→−∞x \to -\infty, F(x)→0F(x) \to 0 and as x→∞x \to \infty, F(x)→1F(x) \to 1. In other words, P(X≤−∞)=0P(X \leq -\infty) = 0 and P(X≤∞)=1P(X \leq \infty) = 1. Additionally, if XX is continuous, then F′(x)=f(x)F'(x) = f(x).

Let's look at a discrete example. Flip two coins and let XX be the number of heads. XX has the following cdf:

F(x)={0 if X<01/4 if 0≤X<13/4 if 1≤X<21 if X≥2F(x) = \left\{ \begin{array}{ll} 0 \quad\quad \text{ if } X < 0 \\ 1/4 \quad \text{ if } 0 \leq X < 1 \\ 3/4 \quad \text{ if } 1 \leq X < 2 \\ 1 \quad\quad \text{ if } X \geq 2 \\ \end{array} \right.

For any x<0x < 0, P(X≤x)=0P(X \leq x) = 0. We can't observe fewer than zero heads. For any 0≤x<10 \leq x < 1, P(X≤x)=1/4P(X \leq x) = 1/4, since P(X≤x)P(X \leq x) covers P(X=0)P(X = 0), which is 1/41/4. For any 1≤x<21 \leq x < 2, P(X≤x)P(X \leq x) also covers P(X=1)P(X = 1), which is 1/21/2, which we add to the previous 1/41/4 to get 3/43/4. Finally, for x≥2x \geq 2, F(x)=1F(x) = 1, since we have covered all possible outcomes.
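As a sanity check, a short Python simulation (a sketch, not from the notes) flips two fair coins many times and estimates the cdf at the three jump points:

```python
import random

def empirical_cdf_two_coins(trials=100_000, seed=42):
    """Flip two fair coins `trials` times; return estimates of P(X <= x) for x = 0, 1, 2."""
    rng = random.Random(seed)
    counts = [0, 0, 0]                 # counts[k] = number of trials with k heads
    for _ in range(trials):
        heads = rng.randint(0, 1) + rng.randint(0, 1)
        counts[heads] += 1
    cdf, running = [], 0
    for k in range(3):
        running += counts[k]
        cdf.append(running / trials)
    return cdf                          # should be close to [0.25, 0.75, 1.00]

print(empirical_cdf_two_coins())
```

With 100,000 trials the estimates land within about 0.005 of the exact values 1/41/4, 3/43/4, and 11.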

Let's consider a continuous example. Suppose we have X∼Exp(λ)X \sim \text{Exp}(\lambda). By definition, f(x)=λe−λx,x≥0f(x) = \lambda e^{-\lambda x}, x \geq 0. If we integrate f(x)f(x) from 00 to xx, we get the cdf F(x)=1−e−λxF(x) = 1 - e^{-\lambda x}.
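We can confirm this integration numerically. The sketch below (illustrative code, not from the notes) applies the same midpoint rule to the exponential pdf and compares against the closed-form cdf:

```python
import math

def exp_pdf(x, lam):
    """pdf of Exp(lam): lam * e^(-lam * x) for x >= 0."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def exp_cdf_numeric(x, lam, n=100_000):
    """Approximate F(x) by integrating the pdf from 0 to x (midpoint rule)."""
    h = x / n
    return sum(exp_pdf((i + 0.5) * h, lam) for i in range(n)) * h

lam, x = 2.0, 1.5
closed_form = 1 - math.exp(-lam * x)   # F(x) = 1 - e^(-lam * x)
print(exp_cdf_numeric(x, lam), closed_form)
```

The two values agree to many decimal places, as the midpoint rule's error shrinks quadratically in the step size.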

Simulating Random Variables

In this lesson, we are going to look at simulating some simple random variables using a computer.

Discrete Uniform Example

For the simplest example, let's consider a discrete uniform distribution, DUDU, from 11 to nn: DU={1,2,...,n}DU = \{1, 2, ..., n\}. Let X=iX = i with probability 1/n1/n for i∈DUi \in DU. This example might look complicated, but we can think of it basically as an nn-sided die toss.

If UU is a uniform (0,1)(0, 1) random variable - that is, U∼Unif(0,1)U \sim \text{Unif}(0, 1) - we can obtain X∼DU(1,n)X \sim DU(1, n) through the following transformation: X=⌈nU⌉X = \lceil{nU}\rceil, where ⌈⋅⌉\lceil\cdot\rceil is the "ceiling", or "round up" function.

For example, suppose n=10n = 10 and U∼Unif(0,1)U \sim \text{Unif}(0, 1). If U=0.73U = 0.73, then X=⌈10(0.73)⌉=⌈7.3⌉=8X = \lceil{10(0.73)}\rceil = \lceil{7.3}\rceil = 8.
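This transformation is a one-liner in code. Here is a minimal Python sketch (the function name is ours; it assumes UU lands in (0,1](0, 1], ignoring the probability-zero event U=0U = 0):

```python
import math
import random

def discrete_uniform(n, u=None, rng=random):
    """Return X ~ DU(1, n) via X = ceil(n * U), where U ~ Unif(0, 1)."""
    if u is None:
        u = rng.random()
    return math.ceil(n * u)

print(discrete_uniform(10, u=0.73))   # 8, matching the worked example
```

Passing a fixed uu reproduces the worked example; omitting it draws a fresh uniform each call.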

Another Discrete Random Variable Example

Let's look at another discrete random variable. Consider the following pmf, f(x)f(x) for XX:

f(x)≡P(X=x)={0.25 if x=−20.10 if x=30.65 if x=4.20 otherwise f(x) \equiv P(X = x) = \left\{ \begin{array}{ll} 0.25 \quad \text{ if } x = -2\\ 0.10 \quad \text{ if } x = 3 \\ 0.65 \quad \text{ if } x = 4.2 \\ 0 \quad\quad \text{ otherwise } \end{array} \right.

We can't use a die toss to simulate this random variable. Instead, we have to use the inverse transform method.

Consider the following table:

xf(x)P(X≤x)Unif(0,1)−20.250.25[0.00,0.25]30.10.35(0.25,0.35]4.20.651.00(0.35,1.00)\begin{array}{c|c|c|c} x & f(x) & P(X \leq x) & \text{Unif}(0,1) \\ \hline -2 & 0.25 & 0.25 & [0.00, 0.25] \\ 3 & 0.1 & 0.35 & (0.25, 0.35] \\ 4.2 & 0.65 & 1.00 & (0.35, 1.00) \\ \end{array}

In this first column, we see the three discrete values that XX can take: {−2,3,4.2}\{-2, 3, 4.2\}. In the second column, we see the values for f(x)=P(X=x)f(x) = P(X = x) as defined by the pmf above. In the third column, we see the cdf, F(x)=P(X≤x)F(x) = P(X \leq x), which we obtain by accumulating the pmf.

We need to associate uniforms with xx-values using both the pmf and the cdf. We accomplish this task in the fourth column.

Consider x=−2x = -2. f(x)=0.25f(x) = 0.25 and P(X≤x)=0.25P(X \leq x) = 0.25. With this information, we can associate the uniforms on [0.00,0.25][0.00, 0.25] to x=−2x = -2. In other words, if we draw a uniform, and it falls on [0.00,0.25][0.00, 0.25] - which occurs with probability 0.25 - we select x=−2x = -2, which has a corresponding f(x)f(x) of 0.25.

Consider x=3x = 3. f(x)=0.10f(x) = 0.10 and P(X≤x)=0.35P(X \leq x) = 0.35. With this information, we can associate the uniforms on (0.25,0.35](0.25, 0.35] to x=3x = 3. In other words, if we draw a uniform, and it falls on (0.25,0.35](0.25, 0.35] - which occurs with probability 0.1 - we select x=3x = 3, which has a corresponding f(x)f(x) of 0.10.

Finally, consider x=4.2x = 4.2. f(x)=0.65f(x) = 0.65 and P(X≤x)=1P(X \leq x) = 1. With this information, we can associate the uniforms on (0.35,1.00)(0.35, 1.00) to x=4.2x = 4.2. In other words, if we draw a uniform, and it falls on (0.35,1.00)(0.35, 1.00) - which occurs with probability 0.65 - we select x=4.2x = 4.2, which has a corresponding f(x)f(x) of 0.65.

For a concrete example, let's sample U∼Unif(0,1)U \sim \text{Unif}(0,1). Given the mapping from xx-values to their associated sets of uniforms, we can compute XX, given UU, by inverting that mapping: X=F−1(U)X = F^{-1}(U). For instance, suppose we draw U=0.46U = 0.46. Since 0.460.46 falls in (0.35,1.00)(0.35, 1.00), the interval associated with x=4.2x = 4.2, we have X=F−1(0.46)=4.2X = F^{-1}(0.46) = 4.2.
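The table lookup above translates directly into code. This Python sketch (our own helper, not from the notes) walks the cumulative column until it first meets or exceeds the drawn uniform:

```python
def sample_from_table(u):
    """Inverse-transform lookup for the pmf P(X=-2)=0.25, P(X=3)=0.10, P(X=4.2)=0.65.

    Walk the cdf until it first meets or exceeds u and return that x-value.
    """
    table = [(-2, 0.25), (3, 0.35), (4.2, 1.00)]   # (x, F(x)) pairs
    for x, cum in table:
        if u <= cum:
            return x
    return table[-1][0]   # guard against floating-point edge cases near u = 1

print(sample_from_table(0.46))   # 4.2, matching the worked example
print(sample_from_table(0.20))   # -2
```

Drawing uu from Unif(0,1)\text{Unif}(0,1) and feeding it to this function produces XX with exactly the pmf in the table.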

Inverse Transform Method

Let's now use the inverse transform method to generate a continuous random variable. Consider the following theorem: if XX is a continuous random variable with cdf, F(x)F(x), then the random variable F(X)∼Unif(0,1)F(X) \sim \text{Unif}(0, 1).

Notice that F(X)F(X) is not a cdf because XX is a random variable, not a particular value. F(X)F(X) itself is a random variable distributed as Unif(0,1)\text{Unif}(0,1).

Given this theorem, we can set F(X)=U∼Unif(0,1)F(X) = U \sim \text{Unif}(0, 1), and then solve backwards for XX using the inverse of FF: X=F−1(U)X = F^{-1}(U). If we can compute F−1(U)F^{-1}(U), we can generate a value for XX given a uniform.

For example, suppose we have a random variable X∼Exp(λ)X \sim \text{Exp}(\lambda). The cdf of XX is given by the function F(x)=1−e−λx,x>0F(x) = 1 - e^{-\lambda x}, x > 0. Correspondingly, F(X)=1−e−λXF(X) = 1 - e^{-\lambda X}