
1.7: Expected values


    Construction of the expectation operator

    We wish to define the notion of the \textbf{expected value}, or \textbf{expectation}, of a random variable \(X\), which will be denoted \(\expec X\) (or \(\expec(X)\)). In measure theory this is denoted \(\int X dP\) and is called the ``Lebesgue integral''. It is one of the most important concepts in all of mathematical analysis! So time invested in understanding it is time well-spent.

    The idea is simple. For bounded random variables, we want the expectation to satisfy three properties: First, the expectation of an indicator variable \(1_A\), where \(A\) is an event, should be equal to \(\prob(A)\). Second, the expectation operator should be linear i.e., should satisfy \(\expec(aX+bY)=a\expec X+b\expec Y\) for real numbers \(a,b\) and r.v.'s \(X,Y\). Third, it should be monotone, i.e., if \(X\le Y\) (meaning \(X(\omega)\le Y(\omega)\) for all \(\omega\in \Omega\)) then \(\expec X\le \expec Y\).

    For unbounded random variables, we will also require some kind of continuity, but let's treat the bounded case first. It turns out that these properties determine the expectation/Lebesgue integral operator uniquely. Different textbooks may have some variation in how they construct it, but the existence and uniqueness are really the essential facts.

    Theorem: Expectation Properties

    Let \((\Omega, {\cal F}, \prob)\) be a probability space. Let \(B_\Omega\) denote the class of bounded random variables. There exists a unique operator \(\expec\) that takes a r.v.\ \(X\in B_\Omega\) and returns a number in \(\R\), and satisfies:

    1. If \(A\in {\cal F}\) then \(\expec(1_A) = \prob(A)\).
    2. If \(X,Y\in B_\Omega\), \(a,b\in\R\) then \(\expec(aX+bY) = a\expec(X)+b\expec(Y)\).
    3. If \(X,Y\in B_\Omega\) and \(X\ge Y\) then \(\expec(X)\ge \expec(Y)\).

    Call \(X\) a simple function if it is of the form \(X=\sum_{i=1}^n a_i 1_{B_i}\), where \(a_1,\ldots,a_n\in \R\) and \(B_1,\ldots,B_n\) are disjoint events. For such r.v.'s define \(\expec(X)=\sum a_i \prob(B_i)\). Show that the linearity and monotonicity properties hold, and so far uniqueness clearly holds since we had no choice in how to define \(\expec(X)\) for such functions if we wanted the properties above to hold. Now for a general bounded r.v.\ \(X\) with \(|X|\le M\), for any \(\epsilon>0\) it is possible to approximate \(X\) from below and above by simple functions \(Y \le X \le Z\) such that \(\expec(Z-Y) < \epsilon\). This suggests defining

    \begin{equation} \label{eq:expec-def}
    \expec(X) = \sup \{ \expec(Y) : Y\textrm{ is a simple function such that }Y \le X \}.
    \end{equation}

    By approximation, the construction is shown to still satisfy the properties in the Theorem and to be unique, since \(\expec(X)\) is squeezed between \(\expec(Y)\) and \(\expec(Z)\), and these can be made arbitrarily close to each other.
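    The squeezing argument can be illustrated numerically. The following is only a sketch under assumptions that are not in the text: we take \(\Omega=(0,1)\) with Lebesgue measure and \(X(\omega)=\omega^2\), and approximate \(X\) from below and above by simple functions that are constant on the events \(\{k/n\le X<(k+1)/n\}\). The function name `lower_upper_expectations` is hypothetical.

```python
import math

def lower_upper_expectations(n):
    """Expectations of simple functions Y <= X <= Z for X(w) = w**2 on (0,1).

    Y takes the value k/n and Z the value (k+1)/n on the event
    {k/n <= X < (k+1)/n}, whose Lebesgue measure is
    sqrt((k+1)/n) - sqrt(k/n).
    """
    lower = 0.0
    upper = 0.0
    for k in range(n):
        p = math.sqrt((k + 1) / n) - math.sqrt(k / n)  # P(k/n <= X < (k+1)/n)
        lower += (k / n) * p
        upper += ((k + 1) / n) * p
    return lower, upper

for n in (10, 100, 1000):
    lo, hi = lower_upper_expectations(n)
    print(n, lo, hi, hi - lo)
```

    In this example \(\expec(Z-Y)=1/n\), so the two bounds close in on a single value, which here is \(\int_0^1 \omega^2\,d\omega = 1/3\).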

    We can now extend the definition of the expectation operator to non-negative random variables. In that case we still define \(\expec X\) by eq.~\eqref{eq:expec-def}. This can be thought of as a kind of ``continuity from below'' axiom that is added to the properties 1--3 above, although we shall see that it can be reformulated in several equivalent ways. Note that now \(\expec X\) may sometimes be infinite.

    Finally, for a general random variable \(X\), we decompose \(X\) as a difference of two non-negative r.v.'s by writing

    \[ X = X_+ - X_-, \]

    where \(X_+ = \max(X,0)\) is called the positive part of \(X\) and \(X_-=\max(-X,0)\) is called the negative part of \(X\).

    We say that \(X\) has an expectation if the two numbers \(\expec X_-, \expec X_+\) are not both \(\infty\). In this case we define

    \[ \expec X = \expec X_+ - \expec X_-. \]

    This is a number in \(\R\cup\{ -\infty,\infty\}\). If both \(\expec X_-, \expec X_+\) are \(<\infty\), or in other words if \(\expec |X|<\infty\) (since \(|X|=X_+ + X_-\)), we say that \(X\) has finite expectation or is integrable.


    Suppose \(X,Y\ge 0\) or \(X,Y\le 0\) or \(\expec |X|, \expec|Y|<\infty\). Then:

    1. If \(X\) is a simple function then the definition \eqref{eq:expec-def} coincides with the original definition, namely \(\expec(\sum_i a_i 1_{B_i}) = \sum_i a_i \prob(B_i)\).
    2. \(\expec(aX+bY+c)=a\expec X + b\expec Y+c\) for any real numbers \(a,b,c\), where in the case where \(\expec(X)=\expec(Y) = \pm \infty\), we require \(a,b\) to have the same sign in order for the right-hand side of this identity to be well-defined.
    3. If \(X\ge Y\) then \(\expec X\ge \expec Y\).
    See [Dur2010], section 1.4.

    See here for a nice description of the difference in approaches between the more familiar Riemann integral and the Lebesgue integral.


    1. Expectation is invariant under ``almost-sure equivalence'': If \(X\le Y\) almost surely, meaning \(\prob(X\le Y)=1\), then by the definition we have \(\expec X \le \expec Y\), since any simple function \(Z\) such that \(Z\le X\) can be replaced with another simple function \(Z'\) such that \(Z'\le Y\) and \(Z=Z'\) almost surely. It follows also that if \(X=Y\) almost surely then \(\expec X=\expec Y\).
    2. Triangle inequality: \( |\expec X| \le \expec |X|\).
    Proof: Triangle inequality

    \[|\expec X| = |\expec X_+ - \expec X_- | \le \expec X_+ + \expec X_- = \expec |X|. \]

    1. Markov's inequality (called Chebyshev's inequality in [Dur2010]): If \(X\ge 0\) and \(t>0\) then \[ \prob(X \ge t) \le \frac{\expec X}{t}. \]
    Proof: Markov's inequality

    Use monotonicity twice to deduce:

    \[ \prob(X \ge t) = \expec(1_{\{X\ge t\}}) \le \expec\left[ \frac{1}{t} X 1_{\{X\ge t\}} \right] \le \frac{\expec X}{t}. \]

    1. Variance: If \(X\) has finite expectation, we define its variance to be

    \[ \var(X) = \expec(X-\expec X)^2. \]

    If \(\var(X)<\infty\), by expanding the square it is easy to rewrite the variance as

    \[ \var(X) = \expec(X^2)-(\expec X)^2. \]

    We denote \(\sigma(X) = \sqrt{\var(X)}\) and call this quantity the standard deviation of \(X\). Note that if \(a\in \R\) then \(\var(a X) = a^2 \var(X)\) and \(\sigma(a X)=|a| \sigma(X)\).

    1. Chebyshev's inequality: \[ \prob ( | X-\expec X| \ge t) \le \frac{\var(X)}{t^2}. \]
    Proof: Chebyshev's inequality
    Apply Markov's inequality to \(Y=(X-\expec X)^2\).
    1. Cauchy-Schwarz inequality: \[ \expec|XY| \le \left(\expec X^2 \expec Y^2\right)^{1/2}. \] Equality holds if and only if \(|X|\) and \(|Y|\) are linearly dependent, i.e. \(a|X|+b|Y|= 0\) holds almost surely for some \(a,b\in \R\) that are not both zero.
    Proof: Cauchy-Schwarz inequality

    Consider the function

    \[ p(t) = \expec(|X| + t |Y|)^2 = t^2 \expec Y^2 + 2t \expec|XY| + \expec X^2.\]

    Since \(p(t)=at^2+bt+c\) is a quadratic polynomial in \(t\) that satisfies \(p(t)\ge 0\) for all \(t\), its discriminant \(b^2-4ac\) must be non-positive. This gives

    \[ (\expec|XY|)^2 - \expec X^2 \expec Y^2 \le 0, \]

    as claimed. The condition for equality is left as an exercise.
    1. Jensen's inequality: A function \(\varphi:\R\to\R\) is called convex if it satisfies \[\varphi(\alpha x + (1-\alpha)y) \le \alpha \varphi(x)+(1-\alpha)\varphi(y)\] for all \(x,y\in\R\) and \(\alpha\in [0,1]\). If \(\varphi\) is convex then \[ \varphi(\expec X) \le \expec (\varphi(X)). \]
    Proof: Jensen's inequality
    See homework.
    1. \(L_p\)-norm monotonicity: If \(0<r\le s\) then \[ (\expec|X|^r)^{1/r} \le (\expec |X|^s)^{1/s}. \]
    Proof: \(L_p\)-norm monotonicity
    Apply Jensen's inequality to the r.v.\ \(|X|^r\) with the convex function \(\varphi(x) = x^{s/r}\).
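    As a sanity check, the inequalities above can be verified in exact arithmetic on a small example. This is a sketch only: the four-point probability space and the values of `X` and `Y` below are arbitrary assumptions, and the helper names `expec` and `prob_of` are hypothetical.

```python
from fractions import Fraction
from math import sqrt

# Assumed toy probability space: four equally likely outcomes.
prob = [Fraction(1, 4)] * 4
X = [0, 1, 2, 5]   # values X(w) on the four outcomes
Y = [3, 1, 4, 1]   # values Y(w) on the same outcomes

def expec(values):
    """Expectation of a r.v. given by its list of values on the outcomes."""
    return sum(v * p for v, p in zip(values, prob))

def prob_of(event):
    """Probability of an event given as a list of booleans."""
    return sum(p for e, p in zip(event, prob) if e)

mean_x = expec(X)
var_x = expec([(x - mean_x) ** 2 for x in X])

# Markov: P(X >= t) <= E X / t for X >= 0 and t > 0.
markov_ok = all(prob_of([x >= t for x in X]) <= mean_x / t for t in range(1, 7))

# Chebyshev: P(|X - E X| >= t) <= Var(X) / t^2.
chebyshev_ok = all(
    prob_of([abs(x - mean_x) >= t for x in X]) <= var_x / t**2
    for t in range(1, 7)
)

# Cauchy-Schwarz, in squared form so that the arithmetic stays exact.
e_abs_xy = expec([abs(x * y) for x, y in zip(X, Y)])
cauchy_schwarz_ok = e_abs_xy**2 <= expec([x * x for x in X]) * expec([y * y for y in Y])

# L_p-norm monotonicity with r = 1, s = 2 (floating point because of the root).
lp_ok = float(expec([abs(x) for x in X])) <= sqrt(expec([x * x for x in X]))

print(markov_ok, chebyshev_ok, cauchy_schwarz_ok, lp_ok)
```

    A check on one example is of course no proof, but it is a quick way to catch a misremembered exponent or a missing absolute value.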

    Convergence theorems for expectations

    We want to study notions of continuity for the expectation operator. If \(X_n\to X\) as \(n\to\infty\), under what conditions do we have that \(\expec(X_n)\to \expec X\)? First we have to decide what ``\(X_n\to X\)'' actually means. We define two notions of convergence of a sequence of random variables to a limit.


    Let \(X, X_1, X_2, \ldots\) be random variables all defined on the same probability space. We say that \(X_n\) converges in probability to \(X\), and denote \( X_n \xrightarrow[n\to\infty]{\prob} X\), if for all \(\epsilon>0\) we have that

    \[ \prob(|X_n-X|>\epsilon) \xrightarrow[n\to\infty]{} 0. \]


    With \(X,X_1,X_2,\ldots\) as before, we say that \(X_n\) converges almost surely to \(X\) (or converges to \(X\) with probability 1), and denote \( X_n \xrightarrow[n\to\infty]{\textrm{a.s.}} X\), if

    \[ \prob( X_n \to X ) = \prob\left(\left\{ \omega \in \Omega : X(\omega)=\lim_{n\to\infty} X_n(\omega) \right\} \right) = 1. \]

    Show that \(\left\{ \omega \in \Omega : X(\omega)=\lim_{n\to\infty} X_n(\omega) \right\}\) is an event and therefore has a well-defined probability. In other words, represent it in terms of countable union, intersection and complementation operations on simple sets that are known to be events. Hint: Use the \(\epsilon-\delta\) definition of a limit.
    Almost sure convergence is a stronger notion of convergence than convergence in probability. In other words, if \(X_n\xrightarrow[n\to\infty]{\textrm{a.s.}} X\) then \(X_n\xrightarrow[n\to\infty]{\prob} X\), but the converse is not true.

    Prove the preceding lemma. For the counterexample showing that convergence in probability does not imply almost sure convergence, consider the following sequence of random variables defined on the space \(((0,1),{\cal B}, \textrm{Lebesgue measure})\):

    \[ \begin{array}{l}
    1_{(0,1)}, \\
    1_{(0,1/2)}, 1_{(1/2,1)}, \\
    1_{(0,1/4)}, 1_{(1/4,2/4)}, 1_{(2/4,3/4)}, 1_{(3/4,1)}, \\
    1_{(0,1/8)}, 1_{(1/8,2/8)}, 1_{(2/8,3/8)}, 1_{(3/8,4/8)},
    1_{(4/8,5/8)}, 1_{(5/8,6/8)}, 1_{(6/8,7/8)}, 1_{(7/8,1)}, \\
    \ldots
    \end{array} \]
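    To get a feeling for this sequence before writing a proof, one can evaluate it at a fixed sample point. The sketch below assumes the terms are numbered \(X_1, X_2, \ldots\) by reading the display row by row; the function names and the choice \(\omega=0.3\) are hypothetical.

```python
def typewriter_interval(n):
    """Endpoints (a, b) of the interval whose indicator is the n-th term (n >= 1).

    Row j (j = 0, 1, 2, ...) of the display holds 2**j intervals of length
    2**-j, so the terms numbered 2**j, ..., 2**(j+1) - 1 belong to row j.
    """
    j = n.bit_length() - 1      # row index: 2**j <= n < 2**(j+1)
    i = n - 2**j                # position within the row
    return i / 2**j, (i + 1) / 2**j

def x_n(n, omega):
    """Value X_n(omega) of the n-th indicator at the point omega."""
    a, b = typewriter_interval(n)
    return 1 if a < omega < b else 0

omega = 0.3  # arbitrary sample point that is not a dyadic rational
hits = [n for n in range(1, 2**10) if x_n(n, omega) == 1]
lengths = [typewriter_interval(n)[1] - typewriter_interval(n)[0]
           for n in (1, 2, 4, 8, 512)]
print(hits)     # one index per row of the display
print(lengths)  # P(X_n != 0) is the length of the n-th interval
```

    The interval lengths shrink to 0, while the list `hits` picks up one new index in every row, so \(X_n(\omega)=1\) keeps recurring at this \(\omega\).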

    If \((X_n)_{n=1}^\infty\) is a sequence of r.v.'s such that \(X_n \xrightarrow[n\to\infty]{\prob} X\) then there exists a subsequence \((X_{n_k})_{k=1}^\infty\) such that \(X_{n_k} \xrightarrow[k\to\infty]{\textrm{a.s.}} X\).
    Prove the preceding lemma.

    We can now formulate the fundamental convergence theorems for Lebesgue integration.

    Theorem: Bounded convergence theorem
    If \(X_n\) is a sequence of r.v.'s such that \(|X_n|\le M\) for all \(n\), and \(X_n\to X\) in probability, then \(\expec X_n \to \expec X\).

    Fix \(\epsilon>0\). Then

    \begin{eqnarray*}
    |\expec X_n - \expec X| &\le& \expec |X_n - X| = \expec |X_n - X|1_{\{|X_n-X|>\epsilon\}}
    + \expec |X_n - X|1_{\{|X_n-X|\le \epsilon\}}
    \\ &\le& 2M \prob(|X_n-X|>\epsilon) + \epsilon\xrightarrow[n\to\infty]{} \epsilon.
    \end{eqnarray*}

    Since \(\epsilon\) was an arbitrary positive number, this implies that \(|\expec X_n-\expec X|\to 0\), as claimed.

    Theorem: Fatou's lemma
    If \(X_n \ge 0\) then \(\liminf_{n\to\infty} \expec X_n \ge \expec(\liminf_{n\to\infty} X_n).\)

    To see that the inequality in the lemma can fail to be an equality, let \(U \sim U(0,1)\), and define \(X_n = n 1_{\{U\le 1/n\}}\). Clearly \(\liminf_{n\to\infty}X_n=\lim_{n\to\infty} X_n \equiv 0\), but \(\expec(X_n) = 1\) for all \(n\).
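    This example is easy to tabulate. In the sketch below, the sample value \(u=3/100\) and the function names are assumptions made for illustration; the expectation is computed from \(\prob(U\le 1/n)=1/n\) rather than by simulation.

```python
from fractions import Fraction

def x_n(n, u):
    """X_n = n * 1{U <= 1/n} evaluated at the sample value U = u."""
    return n if u <= Fraction(1, n) else 0

def expectation_x_n(n):
    """E X_n = n * P(U <= 1/n) = n * (1/n) for U uniform on (0, 1)."""
    return n * Fraction(1, n)

u = Fraction(3, 100)  # an arbitrary sample value of U
path = [x_n(n, u) for n in range(1, 41)]
print(path)                                         # zero once n > 1/u
print([expectation_x_n(n) for n in (1, 10, 1000)])  # the same value for every n
```

    Along any fixed sample value \(u>0\) the path is eventually 0, yet the expectations do not move: the mass escapes on events of vanishing probability.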


    Let \(Y = \liminf_{n\to\infty} X_n\). Note that \(Y\) can be written as

    \[ Y = \sup_{n\ge 1} \inf_{m\ge n} X_m \]

    (this is a general fact about the lim inf of a sequence of real numbers), or \(Y = \sup_n Y_n\), where we denote

    \[ Y_n = \inf_{m\ge n} X_m. \]

    We have \(Y_n \le X_n\), and as \(n\to\infty\), \(Y_n \to Y\) (in fact \(Y_n \uparrow Y\)) almost surely. Therefore \(\expec Y_n \le \expec X_n\), so \(\liminf_{n\to\infty} \expec Y_n \le \liminf_{n\to\infty} \expec X_n\), and therefore it is enough to show that

    \[ \liminf_{n\to\infty} \expec Y_n \ge \expec Y. \]

    But for any \(M\) we have that

    \[ Y_n \wedge M \xrightarrow[n\to\infty]{\textrm{a.s.}} Y \wedge M, \]

    and this is a sequence of uniformly bounded r.v.'s, therefore by the bounded convergence theorem we get that

    \[ \expec(Y_n) \ge \expec(Y_n \wedge M) \xrightarrow[n\to\infty]{} \expec(Y\wedge M). \]

    We therefore get that \(\liminf_{n\to\infty} \expec(Y_n)\ge \expec(Y\wedge M)\) for any \(M>0\), which implies the result because of the following exercise.

    Let \(Y\ge 0\) be a random variable. Prove that

    \[ \expec(Y) = \sup_{M>0} \expec(Y\wedge M).\]
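    A concrete instance may help before attempting the exercise. The sketch below assumes a specific distribution that is not in the text, \(\prob(Y=k)=2^{-k}\) for \(k=1,2,\ldots\), for which \(\expec Y = 2\), and computes \(\expec(Y\wedge M)\) exactly; the function name is hypothetical.

```python
from fractions import Fraction

def truncated_expectation(M):
    """E(min(Y, M)) when P(Y = k) = 2**-k for k = 1, 2, ... (assumed example).

    Values k <= M contribute k * 2**-k, and the event {Y > M}, which has
    probability 2**-M, contributes M * 2**-M.
    """
    head = sum(k * Fraction(1, 2**k) for k in range(1, M + 1))
    return head + M * Fraction(1, 2**M)

for M in (1, 2, 5, 10, 20):
    print(M, truncated_expectation(M), float(truncated_expectation(M)))
```

    For this distribution the truncated expectations equal \(2-2^{1-M}\), which increase to \(\expec Y=2\) as \(M\to\infty\).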

    Theorem: Monotone convergence theorem
    If \(0\le X_n \uparrow X\) as \(n\to\infty\) then \(\expec X_n \uparrow \expec X\).
    \[ \expec X = \expec [\liminf_{n\to\infty} X_n] \le \liminf_{n\to\infty} \expec X_n \le \limsup_{n\to\infty}
    \expec X_n \le \limsup_{n\to\infty} \expec X = \expec X. \]
    Theorem: Dominated convergence theorem
    If \(X_n\to X\) almost surely, \(|X_n|\le Y\) for all \(n\ge 1\) and \(\expec Y<\infty\), then \(\expec X_n\to \expec X\).
    Apply Fatou's lemma separately to \(Y+X_n\) and to \(Y-X_n\).

    Computing expected values


    If \(X\) is a discrete r.v., that is, takes values in some countable set \(S\), then

    \[ \expec X = \sum_{s\in S} s\, \prob(X=s) \]

    when the right-hand side is well-defined, i.e., when at least one of the numbers

    \[\expec(X_-)=\sum_{s\in S, s<0} (-s)\,\prob(X=s), \qquad \expec(X_+)= \sum_{s\in S, s>0} s\, \prob(X=s) \]

    is finite. It follows that for any function \(g:\R\to\R\), we also have

    \[ \expec(g(X)) = \sum_{s\in S} g(s) \prob(X=s). \]

    If \(S\) is finite then \(X\) is a simple function, and can be written \(X=\sum_{s\in S} s\,1_{\{X=s\}}\), so this follows from the definition of \(\expec(\cdot)\) for simple functions. If \(S\) is infinite this follows (check!) from the convergence theorems in the previous section by considering approximations to \(X\) of the form \(\sum_{s\in S, |s|<M} s\, 1_{\{X=s\}}\).
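    The formula for \(\expec(g(X))\) says that we never need the distribution of \(g(X)\) itself. A minimal sketch, assuming \(X\) is uniform on \(\{-2,\ldots,2\}\) and \(g(x)=x^2\) (both choices are ours), compares the two ways of computing the same number:

```python
from fractions import Fraction
from collections import defaultdict

# Assumed example: X uniform on {-2, -1, 0, 1, 2} and g(x) = x**2.
pmf_x = {s: Fraction(1, 5) for s in range(-2, 3)}
g = lambda s: s * s

# Route 1: sum g(s) P(X = s) over the values of X.
via_x = sum(g(s) * p for s, p in pmf_x.items())

# Route 2: first find the distribution of g(X), then sum over its values.
pmf_gx = defaultdict(Fraction)
for s, p in pmf_x.items():
    pmf_gx[g(s)] += p
via_gx = sum(v * p for v, p in pmf_gx.items())

print(via_x, via_gx, dict(pmf_gx))
```

    Route 2 merges the outcomes \(s\) and \(-s\) into a single value of \(g(X)\); route 1 skips that bookkeeping and gives the same answer.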

    If \(X\) is a r.v.\ with a density function \(f_X\), then

    \[ \expec(X) = \int_{-\infty}^\infty x f_X(x)\,dx \]

    when the right-hand side is well-defined, i.e., when at least one of the numbers

    \[ \expec(X_-) = -\int_{-\infty}^0 x f_X(x)\,dx, \qquad \expec(X_+) = \int_0^\infty x f_X(x)\,dx \]

    is finite. Similarly, for any ``reasonable'' function \(g\) we have

    \[ \expec(g(X)) = \int_{-\infty}^\infty g(x) f_X(x)\, dx. \]


    Fix \(\epsilon>0\), and approximate \(X\) by a discrete r.v. \(Y\), e.g.,

    \[ Y = \sum_{k=-\infty}^\infty k\epsilon 1_{\{k\epsilon < X \le (k+1)\epsilon\}}.\]

    Then \(|\expec(X)-\expec(Y)| \le \expec|X-Y| \le \epsilon\). By the previous lemma we have

    \[ \expec(Y) = \sum_{k=-\infty}^\infty k\epsilon \prob(k\epsilon<X\le(k+1)\epsilon) =\sum_{k=-\infty}^\infty k\epsilon \int_{k\epsilon}^{(k+1)\epsilon} f_X(x)\,dx, \]

    so the result for \(\expec X\) follows by letting \(\epsilon\to 0\). For general functions \(g(X)\) repeat this argument, and invoke the relevant convergence theorem to deduce that \(\expec(g(Y))\to \expec g(X)\) as \(\epsilon\to 0\).
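    The discretization step can be tried out numerically. The sketch below assumes \(X\) has the exponential density \(f_X(x)=e^{-x}\) on \((0,\infty)\), so that \(\expec X=1\); the function name and the truncation point `cutoff` are hypothetical choices.

```python
import math

def discretized_mean(eps, cutoff=60.0):
    """E(Y) for Y = k*eps on {k*eps < X <= (k+1)*eps}, with X ~ Exp(1).

    Uses P(a < X <= b) = exp(-a) - exp(-b); the sum is truncated at
    x = cutoff, where the neglected tail is of order exp(-cutoff).
    """
    total = 0.0
    for k in range(int(cutoff / eps)):
        a, b = k * eps, (k + 1) * eps
        total += a * (math.exp(-a) - math.exp(-b))
    return total

for eps in (0.5, 0.1, 0.01):
    ey = discretized_mean(eps)
    print(eps, ey, 1.0 - ey)   # E X = 1 for the Exp(1) distribution
```

    In each case the gap \(\expec X-\expec Y\) lies between 0 and \(\epsilon\), as the bound \(\expec|X-Y|\le\epsilon\) predicts.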

    Expectation and independent random variables

    1. If \(X,Y\) are independent r.v.'s then \(\expec(X Y) = \expec X \expec Y\).
    2. \(X,Y\) are independent if and only if \(\expec[g(X)h(Y)] = \expec g(X) \expec h(Y)\) for all bounded measurable functions \(g,h:\R\to\R\).
    1. Part (i) follows either by approximation of \(X,Y\) using simple functions, or using Fubini's theorem, which you can read about in section 1.7 of [Dur2010] (Note that Fubini's theorem in turn is proved by approximation using simple functions, so these two seemingly different approaches are really equivalent).
    2. For part (ii), the ``only if'' follows from part (i) together with the observation that if \(X,Y\) are independent then so are \(g(X), h(Y)\). For the ``if'' part, observe that the function \(1_{(a,b)}\) is a bounded measurable function, so in particular the condition \(\expec[g(X)h(Y)]=\expec g(X) \expec h(Y)\) includes the information that \(\prob(X\in I, Y\in J)=\prob(X \in I) \prob(Y\in J)\) for any two finite intervals \(I,J\), which we already know is enough to imply independence.
    1. If \(X_1,X_2,\ldots,X_n\) are independent r.v.'s then \(\expec(X_1 \ldots X_n) = \prod_{k=1}^n \expec X_k\).
    2. \(X_1,\ldots,X_n\) are independent if and only if \(\expec\left(\prod_{k=1}^n g_k(X_k)\right) = \prod_{k=1}^n \expec g_k(X_k)\) for all bounded measurable functions \(g_1,\ldots,g_n:\R\to\R\).

    The fact that expectation is multiplicative for independent random variables implies an important fact about the variance of a sum of independent r.v.'s. Let \(X,Y\) be independent r.v.'s with finite variance. Then we get immediately that

    \begin{eqnarray*}
    \var(X+Y) &=& \expec\left[(X-\expec X)+(Y-\expec Y)\right]^2 \\ &=&
    \expec(X-\expec X)^2 + \expec(Y-\expec Y)^2 +2 \expec\left[ (X-\expec X)(Y-\expec Y)\right]
    \\ &=& \var(X)+\var(Y) + 0 = \var(X)+\var(Y).
    \end{eqnarray*}

    More generally, if \(X,Y\) are not necessarily independent, then we can define the \textbf{covariance} of \(X\) and \(Y\) by

    \[ \cov(X,Y) = \expec\left[ (X-\expec X)(Y-\expec Y)\right]. \]

    We then get the more general formula

    \[ \var(X+Y) = \var(X) + \var(Y) + 2 \cov(X,Y). \]

    Repeating this computation with a sum of \(n\) variables instead of just two, we get the following formula for the variance of a sum of r.v.'s.


    If \(X_1,\ldots,X_n\) are r.v.'s with finite variance, then

    \[ \var\left(\sum_{k=1}^n X_k\right) = \sum_{k=1}^n \var(X_k) + 2 \sum_{1\le i<j\le n} \cov(X_i,X_j). \]
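    The formula can be checked in exact arithmetic on a small example. The following sketch assumes three arbitrary random variables on a four-point probability space; the data and the helper names `expec` and `cov` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Assumed toy space: four equally likely outcomes; each row lists one r.v.
prob = [Fraction(1, 4)] * 4
variables = [
    [0, 1, 2, 5],
    [3, 1, 4, 1],
    [1, 1, 0, 2],
]

def expec(values):
    """Expectation of a r.v. given by its list of values on the outcomes."""
    return sum(v * p for v, p in zip(values, prob))

def cov(a, b):
    """cov(A, B) = E[(A - E A)(B - E B)]."""
    ma, mb = expec(a), expec(b)
    return expec([(x - ma) * (y - mb) for x, y in zip(a, b)])

total = [sum(column) for column in zip(*variables)]  # values of X_1 + X_2 + X_3
lhs = cov(total, total)                              # variance of the sum
rhs = sum(cov(v, v) for v in variables) + 2 * sum(
    cov(a, b) for a, b in combinations(variables, 2)
)
print(lhs, rhs)
```

    The variance of the sum is computed here as `cov(total, total)`, so the check rests only on the definition of the covariance.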

    Lemma: Properties of the covariance
    If \(X,Y\) are r.v.'s with finite variance, then:
    1. \(\cov(X,Y)=\cov(Y,X) = \expec(XY) - \expec(X) \expec(Y)\).
    2. \(\cov(X,X) = \var(X)\).
    3. \(\cov(aX_1+bX_2,Y) = a\cov(X_1,Y)+b\cov(X_2,Y)\).
    4. \(\cov(X,aY_1+bY_2) = a\cov(X,Y_1)+b\cov(X,Y_2)\).
    5. If \(X,Y\) are independent then \(\cov(X,Y)=0\).
    6. \(|\cov(X,Y)| \le \sigma(X) \sigma(Y)\), with equality if and only if \(X-\expec X\) and \(Y-\expec Y\) are linearly dependent.
    Properties 1--5 are obvious. Property 6 follows by applying the Cauchy-Schwarz inequality to the r.v.'s \(X-\expec X\) and \(Y-\expec Y\).

    If \(\cov(X,Y)=0\) we say that \(X\) and \(Y\) are uncorrelated, or orthogonal. This is a weaker condition than being independent, but because of the way the variance of a sum of r.v.'s behaves, it is still often useful for deriving bounds, as we shall see.
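    A standard example shows that uncorrelated r.v.'s need not be independent. The sketch below assumes \(X\) is uniform on \(\{-1,0,1\}\) and \(Y=X^2\); the example is ours and is not taken from the text.

```python
from fractions import Fraction

# Assumed example: X uniform on {-1, 0, 1} and Y = X**2.
outcomes = [(x, x * x) for x in (-1, 0, 1)]
p = Fraction(1, 3)

e_x = sum(x * p for x, _ in outcomes)
e_y = sum(y * p for _, y in outcomes)
covariance = sum((x - e_x) * (y - e_y) * p for x, y in outcomes)

# Independence would force P(X = 0, Y = 0) = P(X = 0) P(Y = 0).
p_joint = sum(p for x, y in outcomes if x == 0 and y == 0)
p_product = (sum(p for x, _ in outcomes if x == 0)
             * sum(p for _, y in outcomes if y == 0))

print(covariance, p_joint, p_product)
```

    Here the covariance vanishes by symmetry, while the joint probability \(\prob(X=0,Y=0)\) differs from the product of the marginals, so \(X\) and \(Y\) are not independent.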

    Define the correlation coefficient of \(X\) and \(Y\) by

    \[ \rho(X,Y) = \frac{\cov(X,Y)}{\sigma(X)\sigma(Y)}. \]

    This measures the correlation in units of the standard deviation of \(X\) and \(Y\) so does not depend on the choice of scale. From property 6 in the above lemma, we get that \[ -1\le \rho(X,Y)\le 1, \] with equality on either side if and only if \(X-\expec X\) and \(Y-\expec Y\) are linearly dependent.


    For an integer \(k\ge 0\), the \textbf{\(k\)-th moment} of a random variable \(X\) is the number \(\expec(X^k)\). The \textbf{\(k\)-th moment around a point} \(c\in\R\) is the number \(\expec(X-c)^k\). If \(c\) is not mentioned it is understood that the moment is around 0. The \textbf{\(k\)-th central moment} is the \(k\)-th moment around \(\expec X\) (when it exists), i.e., \(\expec(X-\expec X)^k\). In this terminology, the variance is the second central moment.

    The sequence of moments (usually around 0, or around \(\expec X\)) often contains important information about the behavior of \(X\) and is an important computational and theoretical tool. Important special distributions often turn out to have interesting sequences of moments. Also note that by the monotonicity of the \(L_p\)-norms (proved above among the properties of the expectation), the set of values \(r\ge 0\) such that \(\expec|X|^r<\infty\) (one can also talk about \(r\)-th moments for non-integer \(r\), but that is much less commonly discussed) is an interval containing \(0\).

    A nice characterization of the variance is that it is the minimal second moment. To compute the second moment around a point \(t\), we can write

    \begin{eqnarray*}
    \expec(X-t)^2 &=& \expec[ (X-\expec X)-(t-\expec X)]^2 \\ &=& \expec(X-\expec X)^2 + (t-\expec X)^2 - 2(t-\expec X) \expec(X-\expec X)
    \\ &=& \var(X) + (t-\expec X)^2 \ge \var(X).
    \end{eqnarray*}

    So the function \(t\to\expec(X-t)^2\) is a quadratic polynomial that attains its minimum at \(t=\expec X\), and the value of the minimum is \(\var(X)\). In words, the identity

    \[ \expec(X-t)^2 = \var(X) + (t-\expec X)^2 \]

    says that ``the second moment around \(t\) is equal to the second moment around the mean \(\expec X\) plus the square of the distance between \(t\) and the mean''. Note that this is analogous to (and mathematically equivalent to) the Huygens-Steiner theorem (also called the Parallel Axis theorem) from mechanics, which says that ``the moment of inertia of a body with unit mass around a given axis \(L\) is equal to the moment of inertia around the line parallel to \(L\) passing through the center of mass of the body, plus the square of the distance between the two lines''. Indeed, the ``moment'' terminology seems to originate in this physical context.
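    The identity is easy to confirm on an example. The sketch below assumes \(X\) is the outcome of a fair six-sided die (our choice) and compares both sides in exact arithmetic; the function name is hypothetical.

```python
from fractions import Fraction

# Assumed example: X is the outcome of a fair six-sided die.
pmf = {s: Fraction(1, 6) for s in range(1, 7)}

def second_moment_around(t):
    """E (X - t)**2 for the die."""
    return sum((s - t) ** 2 * p for s, p in pmf.items())

mean = sum(s * p for s, p in pmf.items())
variance = second_moment_around(mean)

for t in (0, 1, Fraction(7, 2), 5):
    print(t, second_moment_around(t), variance + (t - mean) ** 2)
```

    The two printed columns agree for every \(t\), and the smallest value occurs at \(t=\expec X\), where the second moment equals the variance.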