
In this lesson, we are going to take a quick review of calculus. If you are already familiar with basic calculus, there is nothing new here; regardless, it may be helpful to revisit these concepts.

Suppose that we have a function, $f(x)$, that maps values of $x$ from a domain, $X$, to a range, $Y$. We can represent this in shorthand as: $f(x) : X \to Y$.

For example, if $f(x) = x^2$, then $f(x)$ maps values from the set of real numbers, $\mathbb{R}$, to the nonnegative portion of that set, $\mathbb{R^+}$.

We say that $f(x)$ is **continuous** if $f(x)$ exists for all $x \in X$ and, for any $x_0 \in X$, $\lim_{x \to x_0} f(x) = f(x_0)$. Here, "lim" refers to the limit.

For example, the function $f(x) = 3x^2$ is continuous for all $x$. However, consider the function $f(x) = \lfloor{x}\rfloor$, which rounds $x$ down to the nearest integer. This function is not continuous: it has a jump discontinuity at every integer $x$.

If $f(x)$ is continuous, then the **derivative** at $x$ - assuming that it exists and is well-defined for any $x$ - is:

$\frac{d}{dx} f(x) \equiv f'(x) \equiv \lim_{h \to\ 0} \frac{f(x + h) - f(x)}{h}$

Note that we also refer to the derivative at $x$ as the instantaneous slope at $x$. The expression $f(x + h) - f(x)$ represents the "rise", and $h$ represents the "run". As $h \to 0$, the slope between $(x, f(x))$ and $(x + h, f(x + h))$ approaches the instantaneous slope at $(x, f(x))$.
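To make the limit definition concrete, here is a minimal Python sketch that evaluates the difference quotient for shrinking values of $h$. The helper name `difference_quotient` and the choice of $f(x) = x^2$ at $x = 3$ are illustrative assumptions, not part of the lesson.

```python
def difference_quotient(f, x, h):
    """Slope of the line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x ** 2


# The true derivative of x^2 at x = 3 is 2 * 3 = 6.
for h in (1.0, 0.1, 0.01, 0.001):
    print(h, difference_quotient(square, 3.0, h))
```

As $h$ shrinks, the printed slopes should move toward $6$, the instantaneous slope at $x = 3$.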

Let's revisit the derivative of some common expressions.

- The derivative of a constant, $a$, is $0$.
- The derivative of a polynomial term, $x^k$, is $kx^{k-1}$.
- The derivative of $e^x$ is $e^x$.
- The derivative of $\sin(x)$ is $\cos(x)$.
- The derivative of $\cos(x)$ is $-\sin(x)$.
- The derivative of the natural log of $x$, $\ln(x)$, is $\frac{1}{x}$.
- Finally, the derivative of $\arctan(x)$ is equal to $\frac{1}{1+x^2}$.

Now let's look at some well-known properties of derivatives. The derivative of a function times a constant value is equal to the derivative of the function times the constant:

$\left[af(x)\right]' = af'(x)$

The derivative of the sum of two functions is equal to the sum of the derivatives of the functions:

$\left[f(x) + g(x)\right]' = f'(x) + g'(x)$

The derivative of the product of two functions follows this rule:

$\left[f(x)g(x)\right]' = f'(x)g(x) + g'(x)f(x)$

The derivative of the quotient of two functions follows this rule:

$\left[\frac{f(x)}{g(x)}\right]' = \frac{g(x)f'(x) - f(x)g'(x)}{g^2(x)}$

We can remember this quotient rule with the following mnemonic, referring to the numerator as "hi" and the denominator as "lo": "lo dee hi minus hi dee lo over lo lo".

Finally, the derivative of the composition of two functions follows this rule:

$\left[f(g(x))\right]' = f'(g(x))g'(x)$

Let's look at an example. Suppose that $f(x) = x^2$ and $g(x) = \ln(x)$. From our initial derivative rules, we know that $f'(x) = 2x$ and $g'(x) = \frac{1}{x}$.

Let's calculate the derivative of the product of $f(x)$ and $g(x)$:

$\left[f(x)g(x)\right]' = f'(x)g(x) + g'(x)f(x)$

$\left[f(x)g(x)\right]' = 2x\ln{x} + \frac{x^2}{x} = 2x\ln{x} + x$

Let's calculate the derivative of the quotient of $f(x)$ and $g(x)$:

$\left[\frac{f(x)}{g(x)}\right]' = \frac{g(x)f'(x) - f(x)g'(x)}{g^2(x)}$

$\left[\frac{f(x)}{g(x)}\right]' = \frac{2x\ln{x} - x}{\ln^2{x}}$

Let's calculate the derivative of the composition $f(g(x))$:

$\left[f(g(x))\right]' = f'(g(x))g'(x)$

$\left[f(g(x))\right]' = \frac{2\ln(x)}{x}$

The expression $f'(g(x))$ might look tricky at first. Remember that $f(x) = x^2$, so $f'(x) = 2x$. Thus, $f'(g(x)) = 2g(x) = 2\ln(x)$.
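One way to sanity-check the three results above is to compare them against a numerical derivative. The sketch below assumes a central-difference helper, `numerical_derivative`, and an evaluation point $x = 2$; both are illustrative choices rather than anything prescribed by the lesson.

```python
from math import log


def numerical_derivative(func, x, h=1e-6):
    """Central-difference approximation of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)


x = 2.0  # x must be positive and not equal to 1 so ln(x) is defined and nonzero

product = numerical_derivative(lambda t: t ** 2 * log(t), x)
quotient = numerical_derivative(lambda t: t ** 2 / log(t), x)
composition = numerical_derivative(lambda t: log(t) ** 2, x)

# Each line prints the numerical estimate next to the closed-form result.
print(product, 2 * x * log(x) + x)
print(quotient, (2 * x * log(x) - x) / log(x) ** 2)
print(composition, 2 * log(x) / x)
```

Each pair of printed values should agree to several decimal places.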

The second derivative of $f(x)$ equals the derivative of the first derivative of $f(x)$: $f^{\prime\prime}(x) = \frac{d}{dx}f'(x)$. We spoke of the first derivative as the instantaneous slope of $f(x)$ at $x$. We can think of the second derivative as the slope of the slope.

A classic example comes from physics. If $f(x)$ describes the position of an object, then $f'(x)$ describes the object's velocity and $f^{\prime\prime}(x)$ describes the object's acceleration.

So why do we care about second derivatives?

A minimum or maximum of $f(x)$ can only occur when the slope of $f(x)$ equals $0$; in other words, when $f'(x) = 0$. Mentally visualizing the peaks and valleys of graphs of certain functions may help in understanding why this is true.

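Consider a point, $x_0$, such that $f'(x_0) = 0$. If $f^{\prime\prime}(x_0) < 0$, then $f(x_0)$ is a local maximum. If $f^{\prime\prime}(x_0) > 0$, then $f(x_0)$ is a local minimum. If $f^{\prime\prime}(x_0) = 0$, then the test is inconclusive: $x_0$ could be a minimum, a maximum, or an inflection point. For example, $f(x) = x^3$ has an inflection point at $x_0 = 0$, while $f(x) = x^4$ has a minimum there.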

Consider the function $f(x) = e^{2x} + e^{-x}$. We want to find a point, $x_0$, that minimizes $f$. Let's first compute the derivative, using the composition rule for each term:

$f'(x) = \left[e^{2x} + e^{-x}\right]' = 2e^{2x} - e^{-x}$

Let's find $x_0$ such that $f'(x_0) = 0$.

$2e^{2x} - e^{-x} = 0$

$2e^{2x} = e^{-x}$

$\frac{e^{-x}}{e^{2x}} = 2$

$\ln(\frac{e^{-x}}{e^{2x}}) = \ln(2)$

$\ln(e^{-x}) - \ln({e^{2x}}) = \ln(2)$

$-x - 2x = \ln(2)$

$-3x = \ln(2)$

$x = \frac{\ln(2)}{-3} \approx -0.231$

Now, let's calculate $f^{\prime\prime}(x)$:

$f^{\prime\prime}(x) = \left[2e^{2x} - e^{-x}\right]' = 4e^{2x} + e^{-x}$

Let's plug in $x_0$:

$f^{\prime\prime}(-0.231) = 4e^{2 * -0.231} + e^{0.231}\approx 3.78$

Since this value is positive, $f(x_0)$ is a minimum. Furthermore, since $e^x > 0$ for all $x$, $f^{\prime\prime}(x) > 0$ for all $x$. This means that $f(x_0)$ is not only a *local* minimum, but specifically is the *global* minimum of $f(x)$.
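As a quick numerical check of this example, the following sketch evaluates $f'$ at the critical point and compares $f(x_0)$ with nearby values. The function names `f` and `f_prime` simply mirror the math above, and the offset of $0.1$ is an arbitrary choice.

```python
from math import exp, log


def f(x):
    return exp(2 * x) + exp(-x)


def f_prime(x):
    return 2 * exp(2 * x) - exp(-x)


x0 = -log(2) / 3  # critical point derived above

print(x0)           # roughly -0.231
print(f_prime(x0))  # zero up to floating-point rounding
# f should be larger at nearby points on either side of x0.
print(f(x0 - 0.1) > f(x0), f(x0 + 0.1) > f(x0))
```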

In this lesson, we are going to look at formal ways to find solutions to nonlinear equations. We will use these techniques several times throughout the course, as solving equations is useful in a lot of different methodologies within simulation.

When we talk about solving a nonlinear equation, $f$, what we mean is finding a value, $x$, such that $f(x) = 0$.

There are a few methods by which we might find such an $x$:

- trial and error (not so good)
- bisection (divide and conquer)
- Newton's method or some variation
- Fixed-point method

Let's remind ourselves of an example from the previous lesson. To minimize $f(x) = e^{2x} + e^{-x}$, we set $f'(x) = 2e^{2x} - e^{-x} = 0$ and solved for $x_0 = \frac{\ln(2)}{-3} \approx -0.231$. Since $f^{\prime\prime}(x) = 4e^{2x} + e^{-x} > 0$ for all $x$, $f(x_0)$ is the global minimum of $f(x)$. In other words, minimizing $f$ required solving the equation $f'(x) = 0$.

Suppose we have a continuous function, $g(x)$, and suppose that we can find two values, $x_1$ and $x_2$, such that $g(x_1) < 0$ and $g(x_2) > 0$. Given these conditions, we know, via the intermediate value theorem, that there must be a solution in between $x_1$ and $x_2$. In other words, there exists $x^* \in [x_1, x_2]$ such that $g(x^*) = 0$.

To find $x^*$, we first compute $x_3 = \frac{x_1 + x_2}{2}$. If $g(x_3) < 0$, then we know that $x^*$ exists on $[x_3, x_2]$. Otherwise, if $g(x_3) > 0$, then $x^*$ exists on $[x_1, x_3]$. (If $g(x_3) = 0$, we have found $x^*$ exactly.) We call this method **bisection** because we bisect the search interval - we cut it in half - on each round. We continue bisecting until the length of the search interval is as small as desired. This is the same idea as binary search.

Now we are going to use bisection to find a root of $g(x) = x^2 - 2$. Of course, we know analytically that $g(\sqrt{2}) = 0$, so we are essentially using bisection here to approximate $\sqrt{2}$.

Let's pick our two starting points, $x_1 = 1$ and $x_2 = 2$. Since $g(x_1) = -1$ and $g(x_2) = 2$, we know, from the intermediate value theorem, that there must exist an $x^* \in [1, 2]$ such that $g(x^*) = 0$.

We consider a point, $x_3$, halfway between $x_1$ and $x_2$: $x_3 = \frac{1 + 2}{2} = 1.5$. Since $g(x_3) = 0.25$, we know that $x^*$ lies on the interval $[1, 1.5]$.

Similarly, we can consider a point, $x_4$, halfway between $x_1$ and $x_3$: $x_4 = \frac{1 + 1.5}{2} = 1.25$. Since $g(x_4) = -0.4375$, we know that $x^*$ lies on the interval $[1.25, 1.5]$.

Let's do this twice more. $x_5 = \frac{1.25 + 1.5}{2} = 1.375$. $g(x_5) \approx -0.109$, so $x^*$ lies on the interval $[1.375, 1.5]$. $x_6 = \frac{1.375 + 1.5}{2} = 1.4375$. $g(x_6) \approx 0.0664$, so $x^*$ lies on the interval $[1.375, 1.4375]$.

We can see that our search is converging to $\sqrt{2} \approx 1.414$.
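The hand computation above can be automated. Here is a minimal sketch of bisection in Python, assuming a hypothetical helper named `bisect` and a stopping tolerance of $10^{-6}$ on the interval length; neither is specified in the lesson.

```python
def bisect(g, lo, hi, tol=1e-6):
    """Approximate a root of a continuous g on [lo, hi], assuming g(lo) < 0 < g(hi)."""
    if not (g(lo) < 0 < g(hi)):
        raise ValueError("need g(lo) < 0 and g(hi) > 0")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid  # the root lies in [mid, hi]
        else:
            hi = mid  # the root lies in [lo, mid]
    return (lo + hi) / 2


root = bisect(lambda x: x ** 2 - 2, 1.0, 2.0)
print(root)  # close to sqrt(2), about 1.41421
```

Because the interval halves on every pass, shrinking it from length $1$ to below $10^{-6}$ takes about twenty iterations.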

Suppose that, for a function $g(x)$, we can find a reasonable first guess, $x_0$, for the solution of $g(x) = 0$. If $g(x)$ has a derivative that isn't too flat near the solution, then we can iteratively refine our estimate of the solution using the following sequence:

$x_{i+1} = x_i - \frac{g(x_i)}{g'(x_i)}$

We continue iterating until the sequence appears to converge.

Let's try out Newton's method for $g(x) = x^2 - 2$. We can re-express the sequence above as follows:

$x_{i+1} = x_i - \frac{x_i^2 - 2}{2x_i}$

$x_{i+1} = x_i - (\frac{x_i}{2} - \frac{1}{x_i})$

$x_{i+1} = \frac{x_i}{2} + \frac{1}{x_i}$

Let's start with a bad guess, $x_0 = 1$. Then:

$x_1 = \frac{x_0}{2} + \frac{1}{x_0} = \frac{1}{2} + \frac{1}{1} = 1.5$

$x_2 = \frac{x_1}{2} + \frac{1}{x_1} = \frac{1.5}{2} + \frac{1}{1.5} \approx 1.4167$

$x_3 = \frac{x_2}{2} + \frac{1}{x_2} = \frac{1.4167}{2} + \frac{1}{1.4167} \approx 1.4142$

After just three iterations, we have approximated $\sqrt{2}$ to four decimal places!
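The same iteration is straightforward to code. The sketch below assumes a hypothetical `newton` helper that takes both $g$ and $g'$ as arguments and stops when successive iterates differ by less than a tolerance; the tolerance and iteration cap are illustrative choices.

```python
def newton(g, g_prime, x0, tol=1e-10, max_iter=50):
    """Iterate x_{i+1} = x_i - g(x_i) / g'(x_i) until successive values agree."""
    x = x0
    for _ in range(max_iter):
        x_next = x - g(x) / g_prime(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    raise RuntimeError("Newton's method did not converge")


root = newton(lambda x: x ** 2 - 2, lambda x: 2 * x, x0=1.0)
print(root)
```

Compared with bisection, this approach needs a derivative and a sensible starting point, but it typically needs far fewer iterations when those are available.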

What goes up, must come down. A few lessons ago, we looked at derivatives. In this lesson, we will focus on integration.

A function, $F(x)$, having derivative $f(x)$ is called an **antiderivative** of $f(x)$. The antiderivative, also referred to as the **indefinite integral** of $f(x)$, is denoted $F(x) = \int{f(x)dx}$.

The fundamental theorem of calculus states that if $f(x)$ is continuous, then the area under the curve for $x \in [a, b]$ is given by the **definite integral**:

$\int^b_a f(x)dx \equiv F(x) \Big|^b_a \equiv F(b) - F(a)$

Let's look at some indefinite integrals:

$\int x^kdx = \frac{x^{k + 1}}{k+ 1} + C, k \neq -1$

$\int \frac{dx}{x} = \ln|x| + C$

$\int e^xdx = e^x + C$

$\int \cos(x)dx = \sin(x) + C$

$\int \frac{dx}{1 + x^2} = \arctan(x) + C$

Note that $C$ is an arbitrary constant. Consider a function $f(x)$. Since the derivative of a constant value is zero, $f'(x) = \left[f(x) + C\right]'$. When we integrate $f'(x)$, we need to re-include this constant expression: $\int f'(x)dx = f(x) + C$.

Let's look at some well-known properties of definite integrals.

The integral of a function from $a$ to $a$ is zero:

$\int_a^a f(x)dx = 0$

The integral of a function from $a$ to $b$ is the negative of the integral from $b$ to $a$:

$\int_a^b f(x)dx = -\int_b^a f(x)dx$

Given a third point, $c$, the integral of a function from $a$ to $b$ is the sum of the integrals from $a$ to $c$ and $c$ to $b$:

$\int_a^b f(x)dx = \int_a^c f(x)dx + \int_c^b f(x)dx$

Furthermore, the integral of a sum is the sum of the integrals:

$\int \left[f(x) + g(x) \right]dx = \int f(x)dx + \int g(x)dx$

Similar to the product rule for derivatives, we integrate products using integration by parts:

$\int f(x)g'(x)dx = f(x)g(x) - \int g(x)f'(x)dx$

Similar to the chain rule for derivatives, we integrate composed functions using the substitution rule, substituting $u$ for $g(x)$:

$\int f(g(x))g'(x)dx = \int f(u)du, \text{ where } u = g(x)$

Let's look at an example. Given $f(x) = x$ and $g'(x) = e^{2x}$, let's compute the integral of $f(x)g'(x)dx$ over $[0, 1]$.

We know, via integration by parts, that:

$\int_0^1 f(x)g'(x)dx = f(x)g(x)\Big|_0^1 - \int_0^1 g(x)f'(x)dx$

Notice that we need to take the integral of $g'(x)$. We can calculate this using u-substitution. Let $a(x) = 2x$ and $b(x) = e^x$. Then, using the substitution rule above:

$\int b(a(x))a'(x)dx = \int b(u)du, \text{ where } u = a(x)$

Note that $a'(x) = 2$, and $b(a(x)) = b(2x) = e^{2x} = g'(x)$. Thus,

$\int 2g'(x)dx = \int e^udu, \text{ where } u = a(x)$

Divide both sides by two:

$\int g'(x)dx = \frac{1}{2}\int e^udu, \text{ where } u = a(x)$

Integrate:

$g(x) + C = \frac{1}{2} e^u + C, \text{ where } u = a(x)$

Subtract $C$ from both sides and substitute:

$g(x) = \frac{1}{2} e^{2x}$

Now that we know $g(x)$, let's return to our integration by parts:

$\int_0^1 f(x)g'(x)dx = f(x)g(x)\Big|_0^1 - \int_0^1 g(x)f'(x)dx$

Let's substitute in the appropriate values for $f(x)$, $f'(x)$ and $g(x)$:

$\int_0^1 xe^{2x}dx = \frac{1}{2} xe^{2x}\Big|_0^1 - \int_0^1 \frac{1}{2} e^{2x}dx$

Let's pull out the $\frac{1}{2}$:

$\int_0^1 xe^{2x}dx = \frac{1}{2} \left(xe^{2x}\Big|_0^1 - \int_0^1 e^{2x}dx\right)$

Of course, we already know how to integrate $e^{2x}$:

$\int_0^1 xe^{2x}dx = \frac{1}{2} \left(xe^{2x}\Big|_0^1 - \frac{1}{2}e^{2x}\Big|_0^1\right)$

Now, let's solve:

$\int_0^1 xe^{2x}dx = \frac{1}{2} \left(\left(e^{2} - \frac{1}{2}e^{2}\right) - \left(0 - \frac{1}{2}e^0\right)\right)$

$\int_0^1 xe^{2x}dx = \frac{1}{2} \left(\frac{e^{2}}{2} + \frac{1}{2}\right)$

$\int_0^1 xe^{2x}dx = \frac{e^{2}}{4} + \frac{1}{4}$

Derivatives of arbitrary order $k$ can be written as $f^{(k)}(x)$ or $\frac{d^k}{dx^k}f(x)$. By convention, $f^{(0)}(x) = f(x)$.

The **Taylor series expansion** of $f(x)$ about a point $a$ is given by the following infinite sum:

$f(x) = \sum_{k = 0}^\infty \frac{f^{(k)}(a)(x - a)^k}{k!}$

The **Maclaurin series expansion** of $f(x)$ is simply the Taylor series about $a = 0$:

$f(x) = \sum_{k = 0}^\infty \frac{f^{(k)}(0) * x^k}{k!}$

Let's look at some familiar Maclaurin series:

$\sin(x) = \sum_{k = 0}^\infty \frac{(-1)^{k} * x^{2k + 1}}{(2k + 1)!}$

$\cos(x) = \sum_{k = 0}^\infty \frac{(-1)^{k} * x^{2k}}{(2k)!}$

$e^x = \sum_{k = 0}^\infty \frac{x^{k}}{k!}$
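To see how quickly such a series converges, the following sketch sums the first few terms of the Maclaurin series for $e^x$ and compares the result with `math.exp`. The helper name `exp_maclaurin` and the evaluation point $x = 1$ are illustrative assumptions.

```python
from math import exp, factorial


def exp_maclaurin(x, terms):
    """Sum the first `terms` terms of the Maclaurin series for e^x."""
    return sum(x ** k / factorial(k) for k in range(terms))


# Each line prints the number of terms, the partial sum, and the true value.
for terms in (2, 5, 10):
    print(terms, exp_maclaurin(1.0, terms), exp(1.0))
```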

Let's look at some other sums, unrelated to Taylor or Maclaurin series, that are helpful to know.

The sum of all the integers between 1 and $n$ is given by the following equation:

$\sum_{k = 1}^n k = \frac{n(n + 1)}{2}$

Similarly, if we want to sum the squares of all the integers between 1 and $n$, we can use this equation:

$\sum_{k = 1}^n k^2 = \frac{n(n + 1)(2n + 1)}{6}$

Finally, if we want to sum all of the nonnegative integer powers of $p$, where $-1 < p < 1$, we can use this equation for the geometric series:

$\sum_{k = 0}^\infty p^k = \frac{1}{1 - p}$
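These closed forms are easy to spot-check by brute force. The sketch below uses $n = 100$ and $p = 0.5$ as arbitrary test values, and truncates the infinite geometric sum at 60 terms, which is an assumption made only so the loop terminates.

```python
n = 100

# Sum of the integers 1..n against n(n + 1) / 2.
print(sum(range(1, n + 1)), n * (n + 1) // 2)

# Sum of the squares 1^2..n^2 against n(n + 1)(2n + 1) / 6.
print(sum(k ** 2 for k in range(1, n + 1)), n * (n + 1) * (2 * n + 1) // 6)

# Truncated geometric series against 1 / (1 - p).
p = 0.5
partial = sum(p ** k for k in range(60))
print(partial, 1 / (1 - p))
```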

Occasionally, we run into trouble when we encounter indeterminate ratios of the form $0/0$ or $\infty/\infty$. L'Hôpital's Rule states that, when $\lim_{x \to a}f(x)$ and $\lim_{x \to a}g(x)$ both go to zero or both go to infinity, then, provided the limit on the right-hand side exists:

$\lim_{x \to a}\frac{f(x)}{g(x)} = \lim_{x \to a}\frac{f'(x)}{g'(x)}$

For example, consider the following limit:

$\lim_{x \to 0}\frac{\sin(x)}{x}$

As $x \to 0$, both $\sin(x) \to 0$ and $x \to 0$. Thus, we can apply L'Hôpital's Rule:

$\lim_{x \to 0}\frac{\sin(x)}{x} = \lim_{x \to 0}\frac{\cos(x)}{1} = 1$
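We can also observe this limit numerically. The short sketch below evaluates the ratio at a few small values of $x$; the helper name `ratio` and the sample points are illustrative choices.

```python
from math import sin


def ratio(x):
    """The ratio sin(x) / x, which is undefined at x = 0 itself."""
    return sin(x) / x


for x in (0.1, 0.01, 0.001):
    print(x, ratio(x))
```

The printed ratios should creep toward $1$ as $x$ shrinks, in agreement with the limit above.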

In this lesson, we will demonstrate several numerical techniques that we might need to use if we can't find a closed-form solution to a function we are integrating. One of these techniques incorporates simulation!

Suppose we have a continuous function, $f(x)$, under which we'd like to approximate the area from $a$ to $b$. We can fit $n$ adjacent rectangles between $a$ and $b$, such that each rectangle has a width $\Delta x = (b - a) / n$ and a height $f(x_i)$, where $x_i = a + i\Delta x$ is the right-hand endpoint of the $i$th rectangle.

The sum of the areas of the rectangles approximates the area under $f(x)$ from $a$ to $b$, which is equal to the integral of $f(x)$ from $a$ to $b$:

$\int_a^b f(x)dx \approx \sum_{i = 1}^{n}\left[f(x_i)\Delta x\right]$

We can simplify the right-hand side of the equation by pulling the $\Delta x$ term out in front of the sum and substituting in the appropriate values for $x_i$:

$\sum_{i = 1}^{n}\left[f(x_i)\Delta x\right] = \frac{b - a}{n} \sum_{i = 1}^{n} f\left(a + \frac{i(b-a)}{n}\right) \approx \int_a^b f(x)dx$

As $n \to \infty$, this approximation becomes an equality.

Suppose we have a function, $f(x) = \sin((\pi x) / 2)$, which we would like to integrate from $0$ to $1$. In other words, we want to compute:

$\int_0^1\sin\left(\frac{\pi x}{2}\right)dx$

We can approximate the area under this curve using the following formula:

$\int_a^b f(x)dx \approx \frac{b - a}{n} \sum_{i = 1}^{n} f\left(a + \frac{i(b-a)}{n}\right)$

Let's plug in $a = 0$ and $b = 1$:

$\int_0^1 f(x)dx \approx \frac{1}{n} \sum_{i = 1}^{n} f\left(\frac{i}{n}\right)$

Finally, let's replace $f$:

$\int_0^1 f(x)dx \approx \frac{1}{n} \sum_{i = 1}^{n} \sin\left(\frac{\pi i}{2n}\right)$

For $n = 100$, this sum calculates out to approximately $0.6416$, which is pretty close to the true answer of $2/\pi \approx 0.6366$. For $n = 1000$, our estimate improves to approximately $0.6371$.
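Here is a minimal sketch of how we might compute these estimates in Python. The function name `riemann_sum` is a hypothetical helper; it uses right-hand endpoints, matching the formula above.

```python
from math import pi, sin


def riemann_sum(f, a, b, n):
    """Right-endpoint Riemann sum of f over [a, b] with n rectangles."""
    dx = (b - a) / n
    return dx * sum(f(a + i * dx) for i in range(1, n + 1))


def f(x):
    return sin(pi * x / 2)


print(riemann_sum(f, 0, 1, 100))   # about 0.6416
print(riemann_sum(f, 0, 1, 1000))  # about 0.6371
print(2 / pi)                      # about 0.6366
```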

Here we are going to perform the same type of numerical integration, but we are going to use the trapezoid rule instead of the rectangle rule/Riemann sum. Under this rule:

$\int_a^b f(x)dx \approx \left[\frac{f(x_0)}{2} + \sum_{i = 1}^{n - 1} f(x_i) + \frac{f(x_n)}{2} \right]\Delta x$

Substituting $a$ and $b$ simplifies the right-hand side of the formula:

$\int_a^b f(x)dx \approx \frac{b - a}{n} \left[\frac{f(a)}{2} + \sum_{i = 1}^{n - 1} f\left(a + \frac{i(b-a)}{n}\right) + \frac{f(b)}{2}\right]$

Suppose we have a function, $f(x) = \sin((\pi x) / 2)$, which we would like to integrate from $0$ to $1$. In other words, we want to compute:

$\int_0^1\sin\left(\frac{\pi x}{2}\right)dx$

We can approximate the area under this curve using the following formula:

$\int_a^b f(x)dx \approx \frac{b - a}{n} \left[\frac{f(a)}{2} + \sum_{i = 1}^{n - 1} f\left(a + \frac{i(b-a)}{n}\right) + \frac{f(b)}{2}\right]$

Let's plug in $a = 0$ and $b = 1$:

$\int_0^1 f(x)dx \approx \frac{1}{n} \left[\frac{f(0)}{2} + \sum_{i = 1}^{n - 1} f\left(\frac{i}{n}\right) + \frac{f(1)}{2}\right]$

Let's replace $f$:

$\int_0^1 f(x)dx \approx \frac{1}{n} \left[\frac{\sin(0)}{2} + \sum_{i = 1}^{n - 1} \sin\left(\frac{\pi i}{2n}\right) + \frac{\sin(\pi / 2)}{2}\right]$

Finally, let's evaluate and simplify:

$\int_0^1 f(x)dx \approx \frac{1}{n} \left[\sum_{i = 1}^{n - 1} \sin\left(\frac{\pi i}{2n}\right) + \frac{1}{2}\right]$

For $n = 100$, this sum calculates out to approximately $0.63661$, which is very close to the true answer of $2/\pi \approx 0.63662$. Note that, even when $n = 1000$, the Riemann estimation was not this precise; indeed, integration via the trapezoid rule often converges faster than the Riemann approach.
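The trapezoid rule is just as easy to code as the rectangle rule. The sketch below assumes a hypothetical helper named `trapezoid` that follows the formula above term by term.

```python
from math import pi, sin


def trapezoid(f, a, b, n):
    """Trapezoid-rule approximation of the integral of f over [a, b]."""
    dx = (b - a) / n
    interior = sum(f(a + i * dx) for i in range(1, n))
    return dx * (f(a) / 2 + interior + f(b) / 2)


estimate = trapezoid(lambda x: sin(pi * x / 2), 0, 1, 100)
print(estimate, 2 / pi)
```

The only difference from the rectangle rule is that the two endpoint heights each receive half weight.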

Suppose that we can generate an independent and identically distributed sequence of numbers, $U_1, U_2, ..., U_n$, sampled randomly from a uniform $(0, 1)$ distribution. If so, it can be shown that we can approximate the integral of $f(x)$ from $a$ to $b$ according to the following formula:

$\int_a^b f(x)dx \approx \frac{b - a}{n} \sum_{i = 1}^n f(a + U_i(b - a))$

Note that this looks a lot like the Riemann integral summation. The difference is that these rectangles are not adjacent, but rather scattered randomly between $a$ and $b$. As $n \to \infty$, the approximation converges to the true integral, though typically more slowly than the Riemann approach: the standard error of the estimate shrinks in proportion to $1/\sqrt{n}$.

Suppose we have a function, $f(x) = \sin((\pi x) / 2)$, which we would like to integrate from $0$ to $1$. In other words, we want to compute:

$\int_0^1\sin\left(\frac{\pi x}{2}\right)dx$

We can approximate the area under this curve using the following formula:

$\int_a^b f(x)dx \approx \frac{b - a}{n} \sum_{i = 1}^n f(a + U_i(b - a))$

Let's plug in $a = 0$ and $b = 1$:

$\int_0^1 f(x)dx \approx \frac{1}{n} \sum_{i = 1}^n f(U_i)$

Let's replace $f$:

$\int_0^1 f(x)dx \approx \frac{1}{n} \sum_{i = 1}^n \sin\left(\frac{\pi U_i}{2}\right)$

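Here is some Python code for how we might simulate this:

```
# Tested with Python 3.8.3
from random import random
from math import pi, sin


def simulate(n):
    """Monte Carlo estimate of the integral using n uniform draws."""
    result = 0
    for _ in range(n):
        result += sin(pi * random() / 2)
    return result / n


# Average 100 independent estimates, each built from 100 draws.
trials_100 = sum([simulate(100) for _ in range(100)]) / 100
print(trials_100)
# Average 1000 independent estimates, each still built from 100 draws.
trials_1000 = sum([simulate(100) for _ in range(1000)]) / 1000
print(trials_1000)
```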

After running this script once on my laptop, `trials_100` equals approximately $0.6355$, and `trials_1000` equals approximately $0.6366$.

In this lesson, we will start our review of probability.

Hopefully, we already know the very basics, such as sample spaces, events, and the definition of probability. For example, if someone tells us that some event has a probability greater than one or less than zero, we should immediately know that what they are saying is false.

The probability of some event, $A$, given some other event, $B$, equals the probability of the intersection of $A$ and $B$, divided by the probability of $B$. In other words, the **conditional probability** of $A$ given $B$ is:

$P(A|B) = \frac{P(A \cap B)}{P(B)}$

Note that we assume that $P(B) > 0$ so we can avoid any division-by-zero errors.

A non-mathematical way to think about conditional probability is the probability that $A$ will occur given some updated information $B$. For example, think about the probability that your best friend is asleep at any point in time. Now, consider that same probability, given that it's Tuesday at 3 am.

Let's toss a fair die. Let $A = \{1,2,3\}$ and $B = \{3,4,5,6\}$. What is the probability that the die roll is in $A$ given that we know it is in $B$?

We can think about this problem intuitively first. There are four values in $B$, each of which is equally likely to occur. One of those values, three, is also in $A$. If we know that the roll is one of the values in $B$, then there is a one in four chance that the roll is three. Thus, the probability is $1/4$.

We can also use the conditional probability equation to calculate $P(A|B)$:

$P(A | B) = \frac{P(A \cap B)}{P(B)}$

Let's calculate $P(A \cap B)$. There are six possible rolls total, and $A$ and $B$ share one of them. Therefore, $P(A \cap B) = 1/6$. Now, let's calculate $P(B)$. There are six possible rolls total, and $B$ contains four of them, so $P(B) = 4/6$. As a result:

$P(A | B) = \frac{P(A \cap B)}{P(B)} = \frac{1/6}{4/6} = \frac{1}{4}$
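Because every outcome is equally likely, we can reproduce this calculation by counting. The sketch below represents events as Python sets and uses exact fractions; the helper name `prob` is a hypothetical convenience.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}  # one fair die, all outcomes equally likely
A = {1, 2, 3}
B = {3, 4, 5, 6}


def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event & outcomes), len(outcomes))


p_a_given_b = prob(A & B) / prob(B)
print(p_a_given_b)  # 1/4
```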

Note that $P(A | B) \neq P(A)$. $P(A) = 1/2$. Prior information changes probabilities.

If $P(A \cap B) = P(A)P(B)$, then $A$ and $B$ are **independent events**.

For instance, consider the temperature on Mars and the stock price of IBM. Those two events are independent; in other words, the temperature on Mars has no impact on IBM stock, and vice versa.

Let's consider a theorem: if $A$ and $B$ are independent, then $P(A|B) = P(A)$. This means that if $A$ and $B$ are independent, then prior information about $B$ in no way influences the probability of $A$ occurring.

For example, consider two consecutive coin flips. Knowing that you just flipped heads has no impact on the probability that you will flip heads again: successive coin flips are independent events. However, knowing that it rained yesterday almost certainly impacts the probability that it will rain today. Today's weather is often very much dependent on yesterday's weather.

Toss two dice. Let $A = \text{Sum is 7}$ and $B = \text{First die is 4}$. Since there are six ways to roll a seven with two dice among thirty-six possible outcomes, $P(A) = 1/6$. Similarly, since there is one way to roll a four among six possible rolls, $P(B) = 1/6$.

Out of all thirty-six possible dice rolls, only one meets both criteria: rolling a four followed by a three. As a result:

$P(A \cap B) = P((4, 3)) = \frac{1}{36} = P(A)P(B)$

Because of this equality, we can conclude that $A$ and $B$ are independent events.
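We can confirm this by enumerating all thirty-six outcomes. The following sketch builds the two events as sets of ordered pairs and checks the product condition exactly; the helper name `prob` is an illustrative assumption.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # all 36 ordered pairs

A = {r for r in rolls if sum(r) == 7}  # the sum is 7
B = {r for r in rolls if r[0] == 4}    # the first die is 4


def prob(event):
    """Probability of an event when all 36 rolls are equally likely."""
    return Fraction(len(event), len(rolls))


print(prob(A), prob(B), prob(A & B))
print(prob(A & B) == prob(A) * prob(B))  # True, so A and B are independent
```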

A **random variable**, $X$, is a function that maps the sample space, $\Omega$, to the real line: $X: \Omega \to \mathbb{R}$.

For example, let $X$ be the sum of two dice rolls. What is the sample space? Well, it's the set of all possible combinations of two dice rolls: $\{ (1,1), (1,2), ..., (6,5), (6,6)\}$. What is the output of $X$? It's a real number: the sum of the two rolls. Thus, the function $X$ maps an element from the sample space to a real number. As a concrete example, $X((4,6)) = 10$.

Additionally, we can enumerate the probabilities that our random variable takes any specific value. We refer to this as $P(X = x)$, where $X$ is the random variable, and $x$ is the value we are interested in.

Consider our $X$ above. What is the probability that the sum of two dice rolls takes on any of the possible values?

$P(X = x) = \left\{
\begin{array}{ll}
1/36 \quad \text{ if } x = 2 \\
2/36 \quad \text{ if } x = 3 \\
\vdots \\
6/36 \quad \text{ if } x = 7 \\
\vdots \\
1/36 \quad \text { if } x = 12 \\
0 \quad\quad \text { otherwise }
\end{array}
\right.$
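The elided rows of this table can be filled in by counting outcomes. Here is a minimal sketch that tallies the sum of every ordered pair of rolls; the variable names `counts` and `pmf` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 ordered pairs produce each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)
```

Note that `Fraction` prints each probability in lowest terms, so $2/36$ appears as `1/18`.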

If the number of possible values of a random variable, $X$, is finite or countably infinite, then $X$ is a **discrete random variable**. The **probability mass function** (pmf) of a discrete random variable is given by a function, $f(x) = P(X = x)$. Note that, necessarily, $\sum_xf(x) = 1$.

By countably infinite, we mean that there could be an infinite number of possible values for $x$, but they have a one-to-one correspondence with the integers.

For example, flip two coins. Let $X$ be the number of heads. We can define the pmf, $f(x)$ as:

$f(x) = \left\{
\begin{array}{ll}
1/4 \quad \text{ if } x = 0\\
1/2 \quad \text{ if } x = 1 \\
1/4 \quad \text{ if } x = 2 \\
0 \quad\quad \text{ otherwise }
\end{array}
\right.$

Out of the four possible pairs of coin flips, one includes no heads, two include one head, and one includes two heads. All other values are impossible, so we assign them all a probability of zero. As expected, the sum of all $f(x)$ for all $x$ equals one.

Some other well-known discrete random variables include Bernoulli($p$), Binomial($n$, $p$), Geometric($p$) and Poisson($\lambda$). We will talk about each of these types of random variables as we need them.

We are also interested in continuous random variables. A **continuous random variable** is one that has probability zero for every individual point, and for which there exists a **probability density function** (pdf), $f(x)$, such that $P(X \in A) = \int_A f(x)dx$ for every set $A$. Note that, necessarily, $\int_{\mathbb{R}} f(x)dx = 1$.

To reiterate, the pdf does not provide probabilities directly like the pmf; instead, we integrate the pdf over the set $A$ to determine $P(X \in A)$.

For example, pick a random real number between three and seven. There is an uncountably infinite number of real numbers in this range, and the probability that I pick any particular value is zero. The pdf, $f(x)$, for this continuous random variable is

$f(x) = \left\{
\begin{array}{ll}
1/4 \quad \text{ if } 3 \leq x \leq 7\\
0 \quad\quad \text{ otherwise }
\end{array}
\right.$

Even though $f(x)$ does not give us probabilities directly, it's the pdf, which means we can integrate it to calculate probabilities.

For instance, what is $P(X \leq 5)$? To calculate this, we need to integrate $f(x)$ from $-\infty$ to $5$. The integral of $f(x)$ from $-\infty$ to $3$ is zero, since the integral of 0 is 0. The integral of $f(x)$ from $3$ to $5$ is $5/4 - 3/4 = 2/4 = 1/2$. Thus, $P(X \leq 5) = 1/2$, which makes sense as we are splitting our range of numbers in half.

Notice that our function describes a rectangle of width $4$ and height $1/4$. If we take the area under the curve of this function - if we integrate it - we get 1.

Some other well-known continuous random variables include Uniform($a$, $b$), Exponential($\lambda$) and Normal($\mu$, $\sigma^2$). We will talk about each of these types of random variables as we need them.

Just a note on notation. The symbol $\sim$ means "is distributed as". For instance, $X \sim{\text{Unif}(0,1)}$ means that a random variable, $X$ is distributed according to the uniform $(0, 1)$ probability distribution.

For a random variable, $X$, either discrete or continuous, the **cumulative distribution function** (cdf), $F(x)$ is the probability that $X \leq x$. In other words,

$F(x) \equiv P(X \leq x) = \left\{
\begin{array}{ll}
\sum_{y \leq x}f(y) \quad \text{ if } X \text{ is discrete } \\ \\
\int_{-\infty}^xf(y)dy \quad \text{ if } X \text{ is continuous }
\end{array}
\right.$

For discrete random variables, $F(x)$ is equal to the sum of the discrete probabilities for all $y \leq x$. For continuous random variables, $F(x)$ is equal to the integral of the pdf from $-\infty$ to $x$.

Note that as $x \to -\infty$, $F(x) \to 0$ and as $x \to \infty$, $F(x) \to 1$. In other words, $P(X \leq -\infty) = 0$ and $P(X \leq \infty) = 1$. Additionally, if $X$ is continuous, then $F'(x) = f(x)$.

Let's look at a discrete example. Flip two coins and let $X$ be the number of heads. $X$ has the following cdf:

$F(x) = \left\{
\begin{array}{ll}
0 \quad\quad \text{ if } x < 0 \\
1/4 \quad \text{ if } 0 \leq x < 1 \\
3/4 \quad \text{ if } 1 \leq x < 2 \\
1 \quad\quad \text{ if } x \geq 2 \\
\end{array}
\right.$

For any $x < 0$, $P(X \leq x) = 0$, since we can't observe fewer than zero heads. For any $0 \leq x < 1$, $P(X \leq x) = 1/4$, since $P(X \leq x)$ covers only $P(X = 0)$, which is $1/4$. For any $1 \leq x < 2$, $P(X \leq x)$ also covers $P(X = 1)$, which is $1/2$, which we add to the previous $1/4$ to get $3/4$. Finally, for $x \geq 2$, $F(x) = 1$, since we have covered all possible outcomes.

Let's consider a continuous example. Suppose we have $X \sim \text{Exp}(\lambda)$. By definition, $f(x) = \lambda e^{-\lambda x}, x \geq 0$. If we integrate $f(x)$ from $0$ to $x$, we get the cdf $F(x) = 1 - e^{-\lambda x}$ for $x \geq 0$.

In this lesson, we are going to look at simulating some simple random variables using a computer.

For the simplest example, let's consider a discrete uniform distribution, $DU$, from $1$ to $n$: $DU = \{1, 2, ..., n\}$. Let $X = i$ with probability $1/n$ for $i \in DU$. This example might look complicated, but we can think of it basically as an $n$-sided die toss.

If $U$ is a uniform $(0, 1)$ random variable - that is, $U \sim \text{Unif}(0, 1)$ - we can obtain $X \sim DU(1, n)$ through the following transformation: $X = \lceil{nU}\rceil$, where $\lceil\cdot\rceil$ is the "ceiling", or "round up" function.

For example, suppose $n = 10$ and $U \sim \text{Unif}(0, 1)$. If $U = 0.73$, then $X = \lceil{10(0.73)}\rceil = \lceil{7.3}\rceil = 8$.
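A minimal sketch of this transformation in Python follows. The function name `discrete_uniform`, the fixed seed, and the sample size are illustrative assumptions. Note that Python's `random()` can return exactly $0$, which would map to $0$ rather than $1$, so the sketch guards against that case.

```python
from collections import Counter
from math import ceil
from random import random, seed


def discrete_uniform(n):
    """Draw from {1, ..., n} via X = ceil(n * U), with U ~ Unif(0, 1)."""
    u = random()
    # random() returns values in [0, 1); guard the rare case U = 0.
    return max(1, ceil(n * u))


seed(0)  # fixed seed so the run is repeatable
counts = Counter(discrete_uniform(10) for _ in range(100_000))
print(sorted(counts.items()))
```

Each of the ten values should appear in roughly one tenth of the draws.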

Let's look at another discrete random variable. Consider the following pmf, $f(x)$ for $X$:

$f(x) \equiv P(X = x) = \left\{
\begin{array}{ll}
0.25 \quad \text{ if } x = -2\\
0.10 \quad \text{ if } x = 3 \\
0.65 \quad \text{ if } x = 4.2 \\
0 \quad\quad \text{ otherwise }
\end{array}
\right.$

We can't use a die toss to simulate this random variable. Instead, we have to use the **inverse transform method**.

Consider the following table:

$\begin{array}{c|c|c|c}
x & f(x) & P(X \leq x) & \text{Unif}(0,1) \\ \hline
-2 & 0.25 & 0.25 & [0.00, 0.25] \\
3 & 0.1 & 0.35 & (0.25, 0.35] \\
4.2 & 0.65 & 1.00 & (0.35, 1.00) \\
\end{array}$

In this first column, we see the three discrete values that $X$ can take: $\{-2, 3, 4.2\}$. In the second column, we see the values for $f(x) = P(X = x)$ as defined by the pmf above. In the third column, we see the cdf, $F(x) = P(X \leq x)$, which we obtain by accumulating the pmf.

We need to associate uniforms with $x$-values using both the pmf and the cdf. We accomplish this task in the fourth column.

Consider $x = -2$. $f(x) = 0.25$ and $P(X \leq x) = 0.25$. With this information, we can associate the uniforms on $[0.00, 0.25]$ to $x = -2$. In other words, if we draw a uniform, and it falls on $[0.00, 0.25]$ - which occurs with probability 0.25 - we select $x = -2$, which has a corresponding $f(x)$ of 0.25.

Consider $x = 3$. $f(x) = 0.10$ and $P(X \leq x) = 0.35$. With this information, we can associate the uniforms on $(0.25, 0.35]$ to $x = 3$. In other words, if we draw a uniform, and it falls on $(0.25, 0.35]$ - which occurs with probability 0.1 - we select $x = 3$, which has a corresponding $f(x)$ of 0.10.

Finally, consider $x = 4.2$. $f(x) = 0.65$ and $P(X \leq x) = 1$. With this information, we can associate the uniforms on $(0.35, 1.00)$ to $x = 4.2$. In other words, if we draw a uniform, and it falls on $(0.35, 1.00)$ - which occurs with probability 0.65 - we select $x = 4.2$, which has a corresponding $f(x)$ of 0.65.

For a concrete example, let's sample $U \sim \text{Unif}(0,1)$. The cdf, $F(x)$, associates each $x$-value with a set of uniforms, so we can compute $X$, given $U$, by taking the inverse: $X = F^{-1}(U)$. For instance, suppose we draw $U = 0.46$. Since $0.46$ falls on $(0.35, 1.00)$, the interval associated with $x = 4.2$, $X = F^{-1}(0.46) = 4.2$.
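The lookup in the table above can be sketched in a few lines of Python. The function name `sample_x`, the fixed seed, and the sample size are illustrative assumptions; the interval boundaries come directly from the cdf column.

```python
from random import random, seed


def sample_x(u):
    """Map a Unif(0, 1) draw to x using the intervals in the table above."""
    if u <= 0.25:
        return -2
    if u <= 0.35:
        return 3
    return 4.2


print(sample_x(0.46))  # 4.2

seed(1)  # fixed seed so the run is repeatable
n = 100_000
draws = [sample_x(random()) for _ in range(n)]
for x in (-2, 3, 4.2):
    print(x, draws.count(x) / n)
```

The observed frequencies should land close to the pmf values $0.25$, $0.10$, and $0.65$.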

Let's now use the inverse transform method to generate a continuous random variable. Consider the following theorem: if $X$ is a continuous random variable with cdf, $F(x)$, then the random variable $F(X) \sim \text{Unif}(0, 1)$.

Notice that $F(X)$ is not a cdf because $X$ is a random variable, not a particular value. $F(X)$ itself is a random variable distributed as $\text{Unif}(0,1)$.

Given this theorem, we can set $F(X) = U \sim \text{Unif}(0, 1)$, and then solve backwards for $X$ using the inverse of $F$: $X = F^{-1}(U)$. If we can compute $F^{-1}(U)$, we can generate a value for