2.3 Conditional Expectation

2.3.1 Conditional Expectation

Definition 2.9 Let \(X\) be an integrable random variable on \((\Omega, \cal F,P)\).

  1. Let \(\cal A\) be a sub \(\sigma\)-field of \(\cal F\). The conditional expectation of \(X\) given \(\cal A\), denoted \(E(X|\cal A)\), is the a.s.-unique random variable satisfying: \(E(X|\cal A)\) is measurable from \((\Omega,\cal A)\) to \((\mathbb{R},\cal B)\), and \(\int_A E(X|\cal A) \,dP= \int_{A} X\, dP\) for all \(A \in \cal A\).

  2. Let \(B \in \cal F\). The conditional probability of \(B\) given \(\cal A\) is defined to be \(P(B|\cal{A})=\) \(E(I_B|\cal A)\).

  3. Let \(Y\) be measurable from \((\Omega, \cal F,P)\) to \((\Lambda, \cal G)\). The conditional expectation of \(X\) given \(Y\) is defined to be \(E(X|Y)=E(X|\sigma(Y)).\)

Define \(\mu^{+}(A)=\int_A X^{+}\, dP\) for \(A \in \cal A\); then \(\mu^{+}\) is a measure on \(\cal A\). Let \(P_0\) be the restriction of \(P\) to \(\cal A\). Then clearly \(\mu^{+}\ll P_0\). It is easy to check that \(E(X^{+}|\cal A)= \frac{d\mu^{+}}{dP_0}\) and \(E(X^{-}|\cal A)= \frac{d\mu^{-}}{dP_0}\) (with \(\mu^{-}\) defined similarly) satisfy the definition of conditional expectation, so that \(E(X|\cal A)=\frac{d\mu^{+}}{dP_0}-\frac{d\mu^{-}}{dP_0}\). Existence and uniqueness follow from the Radon–Nikodym theorem.

Example 2.8 Suppose \(\Omega=\{1,2,3,4\}\) and \(P(\{k\})=\frac{1}{4}\) for \(k\in\Omega\). Suppose that \(X(k)=k\) for \(k\in\Omega\). Let \(Y(1)=4,\ Y(2)=5,\ Y(3)=Y(4)=6\). Find \(E(X|\sigma(Y))=h(Y)\). By directly solving \(\int_A E(X|\sigma(Y))\, dP= \int_{A} X\, dP\) for each atom \(A\) of \(\sigma(Y)\), we derive that \(h(4)=1,\ h(5)=2,\ h(6)=\frac{7}{2}\).
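A minimal numerical sketch of this computation (plain Python, no external libraries): on a finite space, \(E(X|\sigma(Y))\) is just the \(P\)-weighted average of \(X\) over each level set of \(Y\).

```python
# Example 2.8 on the four-point space, computed directly:
# h(v) = E(X | Y = v) = sum_{k: Y(k)=v} X(k) P({k}) / P(Y = v).
Omega = [1, 2, 3, 4]
P = {k: 0.25 for k in Omega}          # uniform probability on Omega
X = {k: k for k in Omega}             # X(k) = k
Y = {1: 4, 2: 5, 3: 6, 4: 6}          # Y as in Example 2.8

h = {}
for v in set(Y.values()):
    level_set = [k for k in Omega if Y[k] == v]        # atoms of sigma(Y)
    mass = sum(P[k] for k in level_set)                # P(Y = v)
    h[v] = sum(X[k] * P[k] for k in level_set) / mass  # E(X | Y = v)

print(h)  # expect {4: 1.0, 5: 2.0, 6: 3.5}
```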

If we consider the trivial \(\sigma\)-algebra \(\cal A=\{\emptyset,\ \Omega\}\), then by measurability \(E(X|\cal A)\) must be a constant function, and the integral requirement forces \(E(X|\cal A)=E(X)\). At the other extreme, if \(X\) is measurable w.r.t. a sub \(\sigma\)-field \(\cal A_0\) of \(\cal A\) (in particular, if \(X\) is \(\cal A\)-measurable), then \(E(X|\cal A)=X\).

Note that \(E(X|\sigma(Y))=E(X|Y)=h(Y)\) for some Borel function \(h\) by Lemma 2.1. Thus we may write \(E(X|Y=y)=h(y)\). Below we give some propositions regarding conditional expectation.

Proposition 2.6 Given the same setup above, we have

  1. If \(X=c\) a.s. for \(c \in \mathbb{R}\) then \(E(X|\cal A)=c\) a.s. (measurability is trivial for a constant function).
  2. If \(X\leq Y\) a.s., then \(E(X|\cal A)\leq E(Y|\cal A)\) a.s. (which follows by applying the defining integral equality on the set \(\{E(X|\cal A)>E(Y|\cal A)\}\in\cal A\), together with linearity).
  3. For \(E|X|,\ E|Y|< \infty\), \(E(aX+bY|\cal A)=a\rm E(X|\cal A)+b\rm E(Y|\cal A)\).
  4. \(E(E(X|\cal A))= \int_{\Omega} \rm E(X|\cal A) \rm dP=\int_{\Omega} X dP=E(X)\).
  5. Let \(\cal A_0 \subset \cal A\) be sub \(\sigma\)-fields of \(\cal F\). Then \(E(E(X|\cal A)|\cal A_0)=E(E(X|\cal A_0)|\cal A)=E(X|\cal A_0)\).
  6. If \(\sigma(Y)\subset \cal A\) and \(E(|XY|)< \infty\), then \(E(XY|\cal A)=\rm YE(X|\cal A)\).
  7. Suppose \(X\) and \(Y\) are independent, \(g\) is a Borel function and \(E|g(X,Y)|<\infty\). Let \(h(y)=E(g(X,y))\) for \(y\) in the range of \(Y\). Then \(E(g(X,Y)|Y)=h(Y)\), or equivalently, \(E(g(X,Y)|Y=y)=h(y)\).
  8. If \(E(X^2)< \infty\), then \([E(X|\cal A)]^2\leq \rm E(X^2|\cal A)\) a.s.

Remark. We briefly discuss how to show some of these properties (a numerical sanity check of 5. and 8. follows the remarks):

  1. For 6., start by taking \(Y\) to be a simple function and extend to general measurable \(Y\) by the dominated convergence theorem.
  2. For 7., let \(g(X,Y)=I_A(X)I_B(Y)\); then \(\int_C h(y)\,dP_Y(y)=P(X\in A)P(Y\in B\cap C)\). On the other hand, \(E[I_A(X)I_B(Y)I_C(Y)]=P(X\in A, Y\in B\cap C)\), so the result follows by independence (and extends to general \(g\) by the usual approximation argument).
  3. For 8., we show directly that \(0 \leq E[(X-E(X|\cal A))^2|\cal A]= E(X^2|\cal A)- (E(X|\cal A))^2\) a.s. The equality follows by linearity and 6.
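The following sketch checks properties 5 (tower) and 8 (conditional Jensen) numerically on a small finite space. It is plain Python; the partition encoding of sub \(\sigma\)-fields and the helper `cond_exp` are our own illustration, not part of the notes.

```python
import random

# Finite probability space: outcomes 0..7 with uniform mass 1/8.
Omega = list(range(8))
P = {w: 1 / 8 for w in Omega}
rng = random.Random(0)
X = {w: w + rng.uniform(-1, 1) for w in Omega}   # an arbitrary random variable on Omega

# Sub-sigma-fields encoded by their atoms (partitions of Omega); A0 is coarser than A.
A  = [[0, 1], [2, 3], [4, 5], [6, 7]]
A0 = [[0, 1, 2, 3], [4, 5, 6, 7]]

def cond_exp(f, partition):
    """E(f | sigma-field generated by the partition), returned as a function on Omega."""
    out = {}
    for atom in partition:
        mass = sum(P[w] for w in atom)
        val = sum(f[w] * P[w] for w in atom) / mass
        for w in atom:
            out[w] = val
    return out

# Property 5 (tower): E(E(X|A) | A0) = E(X|A0).
lhs = cond_exp(cond_exp(X, A), A0)
rhs = cond_exp(X, A0)
assert all(abs(lhs[w] - rhs[w]) < 1e-12 for w in Omega)

# Property 8 (conditional Jensen): (E(X|A))^2 <= E(X^2|A).
EX_A = cond_exp(X, A)
EX2_A = cond_exp({w: X[w] ** 2 for w in Omega}, A)
assert all(EX_A[w] ** 2 <= EX2_A[w] + 1e-12 for w in Omega)
print("tower property and conditional Jensen hold on this finite space")
```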

Example 2.9 This example shows that \(E(X|\cal A)\) is the best guess of \(X\) given the knowledge in \(\cal A\) (assuming \(E(X^2)<\infty\)), in the sense that \[\int (X-E(X|\cal A))^2 \,dP\leq \int (X-Y)^2 \,dP\] for any square-integrable \(Y\) measurable w.r.t. \(\cal A\). Let \(Z=Y-E(X|\cal A)\), which is measurable w.r.t. \(\cal A\). Expanding \((X-Y)^2=(X-E(X|\cal A))^2-2Z(X-E(X|\cal A))+Z^2\), the claim follows since the cross term integrates to zero: \(\int Z(X-E(X|\cal A))\, dP=E(E(Z(X-E(X|\cal A))|\cal A))=E(Z\,E(X-E(X|\cal A)|\cal A))=0\).
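A Monte Carlo illustration of this best-guess property, a sketch under an assumed Gaussian model: take \(X = Y_0 + \varepsilon\) with \(\cal A=\sigma(Y_0)\), so \(E(X|\cal A)=Y_0\); any other \(\cal A\)-measurable predictor \(g(Y_0)\) should have larger mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
Y0 = rng.normal(size=n)                  # generates the sigma-field A = sigma(Y0)
X = Y0 + rng.normal(size=n)              # X = Y0 + noise, so E(X | A) = Y0

best = np.mean((X - Y0) ** 2)            # MSE of E(X|A)
for g in (lambda y: 0.5 * y, lambda y: y + 0.3, np.sin):   # other A-measurable guesses
    assert np.mean((X - g(Y0)) ** 2) >= best
print("E(X|A) attains the smallest MSE among the tested A-measurable predictors")
```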

2.3.2 Independence

First we extend the definition of independence to \(\sigma\)-algebras.

Let \((\Omega,\cal F,P)\) be a probability space.

  1. Let \(\cal C\) be a collection of subsets in \(\cal F\). Events in \(\cal C\) are said to be independent if for any \(n \in \mathbb{N}\) and distinct events \(A_1,\cdots,A_n\) in \(\cal C\), we have \(P(A_1\cap\cdots\cap A_n)=\prod_{i=1}^n P(A_i)\).

  2. Collections \(\cal C_i\subset \cal F,\, i\in I\), are said to be independent if, for any choice of \(A_i\in\cal C_i\), \(i\in I\), the events \(\{A_i:i\in I\}\) are independent.

  3. Random elements \(X_i\) are independent if \(\sigma(X_i)\) are independent.

Suppose that \(X\) is a random variable on \((\Omega,\cal F,P)\) with \(E|X|<\infty\), and \(\cal A_1\) and \(\cal A_2\) are sub \(\sigma\)-fields of \(\cal F\). If \(\sigma(\sigma(X)\cup\cal A_1)\) and \(\cal A_2\) are independent, then \[E(X|\sigma(\cal A_1\cup \cal A_2))=E(X|\cal A_1)\quad a.s.\] In fact, it is sufficient to show that \[\int_{A_1\cap A_2}E(X|\cal A_1)\,dP= \int_{A_1\cap A_2}X\, dP\] for any \(A_1\in \cal A_1\) and \(A_2\in \cal A_2\), since \(\mathbb{C}= \{A_1\cap A_2\,|\, A_1\in \cal A_1,A_2\in \cal A_2\}\) is a \(\pi\)-system and \(\sigma(\mathbb{C})=\sigma(\cal A_1\cup \cal A_2)\). This can be established from the independence assumption: \(E(E(X|\cal A_1)I_{A_1}I_{A_2})=E(E(X|\cal A_1)I_{A_1})P(A_2)=E(XI_{A_1})P(A_2)=E(XI_{A_1}I_{A_2})\).

As a special case, \(E(X|Y_1,Y_2)=E(X|Y_1)\) if \((X,Y_1)\) and \(Y_2\) are independent, by replacing \(\cal A_1\) and \(\cal A_2\) with \(\sigma(Y_1)\) and \(\sigma(Y_2)\) and using the exercise below to identify \(\sigma(Y_1,Y_2)\) with \(\sigma(\sigma(Y_1)\cup\sigma(Y_2))\). The result still holds with the random variable \(X\) replaced by \(h(X)\) for any Borel function \(h\) such that \(h(X)\) is integrable. In particular, taking \(h\) to be an indicator function, we have \[P(A|Y_1,Y_2)=P(A|Y_1)\] for any \(A\in \sigma(X)\) if \((X,Y_1)\) and \(Y_2\) are independent. In this case, we say \(X\) and \(Y_2\) are conditionally independent given \(Y_1\).

Also, if \(E|X|<\infty\) and \(\sigma(X)\) and \(\sigma(Y)\) are independent, then \(E(X|Y)=E(X)\) a.s.
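Both facts can be checked exactly on a small made-up finite model: with \(X=Y_1+U\), where \(Y_1\), \(U\), \(Y_2\) are independent, \((X,Y_1)\) is independent of \(Y_2\), so \(E(X|Y_1,Y_2)\) should coincide with \(E(X|Y_1)\), and the independent \(Y_2\) alone gives \(E(X|Y_2)=E(X)\).

```python
from itertools import product

# Finite model: Y1, U, Y2 independent and uniform on small sets; X = Y1 + U.
Y1_vals, U_vals, Y2_vals = [0, 1], [0, 1, 2], [0, 1]
outcomes = [(y1, u, y2) for y1, u, y2 in product(Y1_vals, U_vals, Y2_vals)]
P = {w: 1 / len(outcomes) for w in outcomes}            # independence = product of uniforms
X = {w: w[0] + w[1] for w in outcomes}                  # X = Y1 + U, so (X, Y1) indep. of Y2

def cond_exp_given(keys):
    """E(X | the coordinates listed in keys), returned as a dict on outcomes."""
    out = {}
    for w in outcomes:
        same = [v for v in outcomes if all(v[k] == w[k] for k in keys)]
        out[w] = sum(X[v] * P[v] for v in same) / sum(P[v] for v in same)
    return out

E_X_Y1Y2 = cond_exp_given([0, 2])      # condition on (Y1, Y2)
E_X_Y1 = cond_exp_given([0])           # condition on Y1 only
assert all(abs(E_X_Y1Y2[w] - E_X_Y1[w]) < 1e-12 for w in outcomes)

E_X = sum(X[w] * P[w] for w in outcomes)
E_X_Y2 = cond_exp_given([2])           # condition on the independent Y2
assert all(abs(E_X_Y2[w] - E_X) < 1e-12 for w in outcomes)
print("E(X|Y1,Y2) = E(X|Y1) and E(X|Y2) = E(X) verified")
```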

Exercise 2.1 Let \(Z=(Y_1,Y_2)\), \(\sigma(Z)\stackrel{?}=\sigma(\sigma(Y_1)\cup\sigma(Y_2))\).

First we show that \[\sigma(\sigma(Y_1)\cup\sigma(Y_2))=\sigma(\{Y_1^{-1}(B_1)\cap Y_2^{-1}(B_2):B_1\in \cal B^n, B_2\in \cal B^m\}).\] Then the result follows if \[\sigma(Z)=\sigma(\{Y_1^{-1}(B_1)\cap Y_2^{-1}(B_2):B_1\in \cal B^n, B_2\in \cal B^m\}).\]

For the first equality, the \(\subseteq\)-direction is clear since both \(\sigma(Y_1)\) and \(\sigma(Y_2)\) lie in the \(\sigma\)-field on the right hand side. For the other direction, note that for any \(B_1\) and \(B_2\), \(Y_i^{-1}(B_i) \in \sigma(Y_1)\cup \sigma(Y_2),\, i=1,2\), so the intersection \(Y_1^{-1}(B_1)\cap Y_2^{-1}(B_2)\) must lie in the \(\sigma\)-field on the left hand side.

For the second equality, the \(\supseteq\)-direction is clear since the collection \(\mathbb{D}:=\{Y_1^{-1}(B_1)\cap Y_2^{-1}(B_2):B_1\in \cal B^n, B_2\in \cal B^m\}\) is just \(\{Z^{-1}(B_1\times B_2):B_1\in \cal B^n, B_2\in \cal B^m\}=Z^{-1}(\mathbb{D}_0)\), where \(\mathbb{D}_0\) denotes the collection of measurable rectangles \(B_1\times B_2\). For the other direction, note that \(\cal B^{n+m}=\sigma(\mathbb{D}_0)\) and that \(Z^{-1}(\sigma(\mathbb{D}_0)) \subseteq \sigma(Z^{-1}(\mathbb{D}_0))=\sigma(\mathbb{D})\) (in fact, they are equal). This fact can be shown by proving that the collection \(\cal E:=\{A\subseteq\mathbb{R}^{n+m}\,|\, Z^{-1}(A)\in \sigma(Z^{-1}(\mathbb{D}_0))\}\) is a \(\sigma\)-field containing \(\mathbb{D}_0\). Then \(\sigma(\mathbb{D}_0)\subseteq \cal E\) and the result follows.

2.3.3 Conditional Distribution

First we define \(\mu(B,Y)=E(I_B(X)|Y)\); in other words, \(\int_{Y^{-1}(C)} \mu(B,Y)\,dP=\int_{Y^{-1}(C)} I_B(X)\,dP\) for all \(C\). Such a function \(\mu(\cdot,Y)\) is called a random probability measure (not to be confused with the dominating measure \(\mu\) appearing below). Furthermore, if \[\int I_C(y)\Big[\int_B f_{X|Y=y}(x)\, d\mu(x)\Big]dP_Y(y)=P((X,Y)\in B\times C)\] for all \(B\in \cal B_X\) and \(C \in \cal B_Y\), then we say \(f_{X|Y=y}(x)\) is the conditional density of \(X\) given \(Y=y\) w.r.t. \(\mu\).

Suppose \((X,Y)\) has a joint density function \(f_{X,Y}\) w.r.t. \(\mu \times \nu\). First, we verify the familiar fact from basic probability that the conditional density is the joint density divided by the marginal: let \(f_{X|Y=y}(x)=\frac{f_{X,Y}(x,y)}{f_Y(y)}\), where \(f_Y\) is the marginal pdf of \(Y\) w.r.t. \(\nu\); then \(f_{X|Y=y}(x)\) is the conditional pdf of \(X\) given \(Y=y\) w.r.t. \(\mu\). The claim can be validated by \[\int I_C(y)\Big(\int_B \frac{f_{X,Y}(x,y)}{f_Y(y)}\, d\mu(x)\Big)dP_Y(y)=\int_{B\times C }f_{X,Y}(x,y)\, d(\mu\times \nu)(x,y).\] Conversely, let \(g(x,y)=f_{X|Y=y}(x)f_Y(y)\); then \(g\) is the pdf of \((X,Y)\) w.r.t. \(\mu\times \nu\). In other words, we would like to show that \[\int_{B\times C} g(x,y)\, d(\mu\times \nu)(x,y)=P((X,Y)\in B\times C)\] for \(B\in \cal B^m\) and \(C\in \cal B^n\). Indeed, this follows directly from the definition of the conditional pdf and the density of \(Y\).

Secondly, we would like to show that \(E(X|Y=y)=\int xf_{X|Y=y}(x)\, d\mu(x)\). In other words, by the definition it suffices to validate that \[E[XI_C(Y)]=\int I_C(y) \Big[\int xf_{X|Y=y}(x)\, d\mu(x)\Big] dP_Y(y),\] which can be quickly proved by approximating \(X\) with simple functions.
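A quick numerical illustration under an assumed standard bivariate normal model with correlation \(\rho\): the formula \(E(X|Y=y)=\int x f_{X|Y=y}(x)\,d\mu(x)\) (with \(\mu\) Lebesgue measure, approximated by a grid sum) should reproduce the textbook answer \(\rho y\).

```python
import numpy as np

rho, y = 0.6, 1.3                       # assumed correlation and conditioning value
x = np.linspace(-10, 10, 4001)          # grid for integration w.r.t. Lebesgue measure mu
dx = x[1] - x[0]

def phi(t, mean=0.0, sd=1.0):
    return np.exp(-0.5 * ((t - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

f_joint = phi(x, rho * y, np.sqrt(1 - rho ** 2)) * phi(y)   # f_{X,Y}(x, y) at the fixed y
f_Y = phi(y)                                                # marginal density f_Y(y)
f_cond = f_joint / f_Y                                      # f_{X|Y=y}(x) = joint / marginal

E_X_given_y = np.sum(x * f_cond) * dx                       # integral of x f_{X|Y=y}(x) d mu(x)
print(E_X_given_y, rho * y)                                 # both approximately 0.78
```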

Example 2.10 Consider the following simple case of Bayesian variable selection. Let \(\cal D:=\{(X_{i1},X_{i2},Y_i)\}_{i=1}^n\) be an i.i.d. random sample given \((a_1,a_2)\), following the model \(Y=a_1X_1+a_2X_2+\epsilon\), where \(X_1\), \(X_2\) and \(\epsilon\) are assumed to be mutually independent. Given \((a_1,a_2)\), \(X_1\) and \(X_2\) have densities \(f_1\) and \(f_2\) w.r.t. \(\lambda\), and \(\epsilon \sim N(0,\sigma^2)\) for some \(\sigma>0\). We put the prior \(\pi_1\times \pi_1\) on \((a_1,a_2)\), with \(\pi_1\)’s density being \[f_0:=\frac{d\pi_1(a)}{d(\lambda+\mu_0)}=c_0I_{\{0\}}(a)+(1-c_0)\phi(a)I_{\{0\}^c}(a)\] w.r.t. the measure \(\lambda+\mu_0\), where \(\phi(a)\) denotes the density of the standard normal, \(\lambda\) is Lebesgue measure, \(\mu_0\) is the point mass at \(0\) and \(c_0\in(0,1)\). Calculate the corresponding posterior.

Let \(\phi_{\sigma}\) be the pdf of \(N(0,\sigma^2)\). Clearly the joint density of \((\cal D,a_1,a_2)\) is \[\prod_{i=1}^n\, \phi_{\sigma}(y_i-(a_1x_{1,i}+a_2x_{2,i}))f_1(x_{1,i})f_2(x_{2,i})f_0(a_1)f_0(a_2):=h(a_1,a_2)c(\tilde{x})\] w.r.t. \(\lambda^{3n}\times (\lambda+\mu_0)^2\). In addition, the marginal density of \(\cal D\) is obtained by integrating out \((a_1,a_2)\) w.r.t. \((\lambda+\mu_0)\times(\lambda+\mu_0)\), which splits into four pieces: \[c(\tilde{x})\Big(\int\int h(a_1,a_2)\, d\mu_0(a_1)d\mu_0(a_2)+\int\int h(a_1,a_2)\, d\mu_0(a_1)d\lambda(a_2)+\int\int h(a_1,a_2)\, d\lambda(a_1)d\mu_0(a_2)+ \int\int h(a_1,a_2)\, d\lambda(a_1)d\lambda(a_2)\Big).\] We omit the detailed calculation of the marginal here. (In particular, note that \(\int g(a)\, d\mu_0(a)=g(0)\).) Denote the posterior density as \(f_3\). Remarkably, the posterior probability of \((a_1,a_2)=(0,0)\) is \[\tilde{\pi}((a_1,a_2)=(0,0))=\int_{\{(0,0)\}} f_3(a_1,a_2)\,d((\lambda+\mu_0)\times(\lambda+\mu_0))(a_1,a_2)=f_3(0,0).\] Similarly, the posterior probability of \(a_1=0\) (or \(a_2=0\)) can also be derived.
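The following sketch computes this posterior numerically for a hypothetical data set; the choices \(f_1=f_2=\) standard normal, the sample size, the true coefficients, and the grid integration for the slab parts are our assumptions for illustration, not part of the example. It makes explicit that the marginal mixes four components, one for each factor of \((\lambda+\mu_0)\times(\lambda+\mu_0)\).

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, c0 = 50, 1.0, 0.5
a1_true, a2_true = 0.0, 1.0                                # hypothetical "truth"
x1, x2 = rng.normal(size=n), rng.normal(size=n)            # assumed f_1 = f_2 = standard normal
y = a1_true * x1 + a2_true * x2 + sigma * rng.normal(size=n)

def npdf(t, sd=1.0):
    return np.exp(-0.5 * (t / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def lik(a1, a2):
    """prod_i phi_sigma(y_i - a1 x_{1,i} - a2 x_{2,i}); the factor c(x~) cancels in the posterior."""
    return np.prod(npdf(y - a1 * x1 - a2 * x2, sigma))

# Grid integration for the continuous (slab) parts of (lambda + mu_0)^2.
grid = np.linspace(-4, 4, 161)
dg = grid[1] - grid[0]
slab = npdf(grid)                                          # phi(a) on the grid

m00 = c0 * c0 * lik(0.0, 0.0)                              # both coefficients at the point mass
m0s = c0 * (1 - c0) * dg * sum(slab[j] * lik(0.0, grid[j]) for j in range(len(grid)))
ms0 = (1 - c0) * c0 * dg * sum(slab[i] * lik(grid[i], 0.0) for i in range(len(grid)))
mss = (1 - c0) ** 2 * dg ** 2 * sum(
    slab[i] * slab[j] * lik(grid[i], grid[j])
    for i in range(len(grid)) for j in range(len(grid)))

marginal = m00 + m0s + ms0 + mss                           # all four components of the mixture
print("posterior P(a1 = 0, a2 = 0):", m00 / marginal)
print("posterior P(a1 = 0):        ", (m00 + m0s) / marginal)
```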

Finally, we consider a more general case in which a joint pdf with respect to a product measure may not exist in the above scenario for \((X,Y)\). One application of this setting is the factorization theorem for finding sufficient statistics. Suppose that \((X,Y)\)’s distribution has pdf \(f_{X,Y}\) w.r.t. \(P_{X_0,Y_0}\) for some \((X_0,Y_0)\) (with dim(\(X\))=dim(\(X_0\)) and dim(\(Y\))=dim(\(Y_0\))). Let \[ f_Y(y)=\int f_{X,Y}(x,y)\, dP_{X_0|Y_0=y}(x);\] then \(f_Y\) is a pdf of \(Y\) w.r.t. \(P_{Y_0}\). In other words, we can verify that \[P(Y\in B)=\int I_B(y)\int f_{X,Y}(x,y)\, dP_{X_0|Y_0=y}(x)\, dP_{Y_0}(y).\] Indeed, by the exercise below, the right hand side can be written as \[E[I_B(Y_0)f_{X,Y}(X_0,Y_0)]=\int I_B(y)f_{X,Y}(x,y)\, dP_{X_0,Y_0}(x,y)=P(Y\in B),\] since \(f_{X,Y}\) is the pdf of \((X,Y)\) w.r.t. \(P_{X_0,Y_0}\).

Next, we can check that \(f_{X|Y=y}(x)=\frac{f_{X,Y}(x,y)}{f_Y(y)}\) is the conditional pdf of \(X\) given \(Y=y\) w.r.t. \(P_{X_0|Y_0=y}\). That is to say, we shall validate (which is clear) \[\int_C \int _B \frac{f_{X,Y}(x,y)}{f_Y(y)}\, dP_{X_0|Y_0=y}(x)\, f_Y(y)\, dP_{Y_0}(y)= P((X,Y)\in B\times C).\]

Exercise 2.2 Let \(g(x,y)=\sum_i a_i I_{B_i}(x)I_{C_i}(y)\) (an approximation to a Borel function), prove that \(\int g(x,y) dP_{X_0|Y_0=y}(x)=E(g(X_0,Y_0)|Y_0=y)\).

Indeed, \[ \int\sum_i a_iI_{C_i}(y) I_{B_i}(x)\, dP_{X_0|Y_0=y}(x)=\sum_i a_iI_{C_i}(y)P(X_0\in B_i|Y_0=y)= E\Big(\sum_i a_iI_{C_i}(Y_0)I_{B_i}(X_0)\Big|Y_0=y\Big),\] which is exactly \(E(g(X_0,Y_0)|Y_0=y)\); the general case follows by approximating a Borel \(g\) with such simple functions.

Example 2.11 Let \(X:=(X_1,\cdots,X_n)\) be i.i.d. from \(N(\mu,1)\) and \(Y:=(Y_1,\cdots,Y_n)\) be i.i.d. from \(N(0,1)\). It can be seen that \((X,\bar{X})\)’s distribution does not have a density w.r.t. a product Lebesgue measure. Instead, we can consider the density of \((X,\bar{X})\) w.r.t. \(P_{Y,\bar{Y}}\). First we show that \(P_{X,\bar{X}}\ll P_{Y,\bar{Y}}\). Define the transformation \[T(y_1,\cdots,y_n)=\Big((y_1,\cdots,y_n),\frac{\sum_i y_i}{n}\Big).\] Then \[P_{Y,\bar{Y}}(A)=P(Y\in T^{-1}(A))=P_Y(T^{-1}(A))=0\] implies \(\lambda^n(T^{-1}(A))=0\), since \(P_Y\) and \(\lambda^n\) are equivalent. Hence \[P_{X,\bar{X}}(A)=P_X(T^{-1}(A))=0\] as \(P_X\ll \lambda^n\). Thus \(P_{X,\bar{X}}\ll P_{Y,\bar{Y}}\).

Secondly, we claim that \(\frac{dP_X}{dP_Y}\) is the density of \((X,\bar{X})\) w.r.t. \(P_{Y,\bar{Y}}\), where \(\frac{dP_X}{dP_Y}:=\frac{dP_X/d\lambda^n}{dP_Y/d\lambda^n}\). That is to say, we need to verify that \[\int_{A\times B}\frac{dP_X}{dP_Y}(x)\,dP_{Y,\bar{Y}}(x,s)=P_{X,\bar{X}}(A\times B)\] (equality on such rectangles suffices since they form a \(\pi\)-system generating the product \(\sigma\)-field). The LHS can be written as \(E[I_A(Y)I_B(\bar{Y})\frac{dP_X}{dP_Y}(Y)]\), which is \[E\Big[I_{T^{-1}(A\times B)}(Y)\frac{dP_X}{dP_Y}(Y)\Big].\] On the other hand, the RHS is just \(P_X(T^{-1}(A\times B))\), which equals \[P_X(T^{-1}(A\times B))=\int I_{T^{-1}(A\times B)}(y)\frac{dP_X}{dP_Y}(y)\,dP_Y(y).\] Therefore the LHS agrees with the RHS.

Lastly, we would like to write down the conditional density. Since the joint density is given above, it remains to calculate the marginal density \[f_{\bar{X}}(s)=\int f_{X,\bar{X}}(x,s)\, dP_{Y|\bar{Y}=s}(x)\] w.r.t. \(P_{\bar{Y}}\). The key observation is that the joint density \(\frac{dP_X}{dP_Y}\) can be written as \[\frac{dP_X}{dP_Y}(x)=e^{-\frac{n(\bar{x}-\mu)^2-n\bar{x}^2}{2}}=e^{n\mu\bar{x}-\frac{n\mu^2}{2}},\] which is a function of \(x\) only through \(\bar{x}\) (and the parameter \(\mu\)). Consequently, \[f_{\bar{X}}(s)=\int \frac{dP_X}{dP_Y}(x)\, dP_{Y|\bar{Y}=s}(x)=e^{n\mu s-\frac{n\mu^2}{2}}.\] Therefore the conditional density \(f_{X|\bar{X}=s}(x)=f_{X,\bar{X}}(x,s)/f_{\bar{X}}(s)\) equals \(1\), free of \(\mu\). In particular, recalling the definition of sufficiency, this shows that \(\bar{X}\) is indeed a sufficient statistic for \(\mu\).
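A Monte Carlo sanity check of this density, a sketch with assumed values of \(n\) and \(\mu\): reweighting draws of \(Y\) by \(\exp(n\mu\bar{Y}-n\mu^2/2)\) should reproduce probabilities computed under \(X\), for instance \(P(\bar{X}>0)=\Phi(\sqrt{n}\,\mu)\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, mu, reps = 5, 0.7, 200_000                  # assumed sample size and mean

Y = rng.normal(size=(reps, n))                 # rows ~ N(0,1)^n, the dominating distribution
Ybar = Y.mean(axis=1)
w = np.exp(n * mu * Ybar - n * mu ** 2 / 2)    # dP_X/dP_Y evaluated at Y (depends only on Ybar)

X = mu + rng.normal(size=(reps, n))            # rows ~ N(mu,1)^n, for comparison
Xbar = X.mean(axis=1)

print(np.mean(w * (Ybar > 0)))                 # E_Y[ I(Ybar > 0) dP_X/dP_Y ] ...
print(np.mean(Xbar > 0))                       # ... should match P(Xbar > 0) = Phi(sqrt(n) mu) ~ 0.94
```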

Example 2.12 Below we see a generalization of the result in the last example; see Lemma 2.1 in the textbook. Let \(X=(X_1,\cdots,X_n)\) with \(P_X\in\{P_{\theta}:\theta \in \Theta\}\), a family dominated by a \(\sigma\)-finite measure \(\nu\). Denote the pdf of \(X\) as \(f_\theta\) w.r.t. \(\nu\). There exist a sequence of positive numbers \(\{c_i\}_{i=1}^\infty\) and a sequence \(\{\theta_i\}_{i=1}^\infty\) in \(\Theta\) such that \(\sum_i c_i=1\) and \(P_\theta \ll \sum_i c_iP_{\theta_i}\) for all \(\theta \in \Theta\). There exists a random variable \(X_0\) such that \(P_{X_0}=\sum_i c_iP_{\theta_i}\). Furthermore, by the MCT we can show \[\frac{dP_{X_0}}{d\nu}=\sum_i c_i\frac{dP_{\theta_i}}{d\nu}=\sum_i c_i f_{\theta_i}.\] Hence we have \[\frac{dP_\theta}{dP_{X_0}}=\frac{f_{\theta}}{\sum_i c_i f_{\theta_i}}.\] One direction of the factorization theorem tells us that \(f_\theta(x)=g(\theta,T(x))h(x)\) implies \(T(X)\) is sufficient for \(\theta\). That is to say, we need to show that \(P_{X|T(X)=s}\) does not depend on \(\theta\). As in the last example, the first step is to show that \[\frac{dP_{X,T(X)}}{dP_{X_0,T(X_0)}}=\frac{dP_{X}}{dP_{X_0}}=\frac{f_{\theta}}{\sum_i c_i f_{\theta_i}},\] and the result then follows from the factorization condition and the same argument.

Remark. In the proof of Lemma 2.1, the author (Jun Shao) only considered the case of a finite measure \(\nu\). In fact, for a \(\sigma\)-finite measure \(\nu\), we can find a probability measure \(\mu\) equivalent to \(\nu\) (i.e. \(\mu\ll\nu\) and \(\nu\ll\mu\)). To see this, since \(\nu\) is \(\sigma\)-finite we can find sets \(E_k\) with \(0<\nu(E_k)<\infty\) such that \(\cup_k E_k\) is the whole measurable space. The claim then follows by defining \[\mu(A)=\sum_{k=1}^\infty \frac{\nu(A\cap E_k)}{2^k\nu(E_k)}\] for any \(A\) in the corresponding \(\sigma\)-algebra. Thus it suffices to consider the case of a finite measure.

2.3.4 Markov chains and Martingales

Definition 2.10 A sequence of random vectors \(\{X_n:n\in\mathbb{N}\}\) is said to be a (discrete time) Markov chain or Markov process if \[P(B|X_1,\cdots,X_n)=P(B|X_n) \quad \mbox{a.s.}\] for any \(B\in \sigma(X_{n+1}), n=2,3,\cdots\)

It can be seen that for a Markov chain \(\{X_n\}\), \(X_{n+1}\) is conditionally independent of \((X_1,\cdots,X_{n-1})\) given \(X_n\). We list some equivalent conditions for the Markov property below, without proof; a small simulation check follows the proposition.

Proposition 2.7 A sequence of random vectors \(X_n\) is a Markov chain if and only if any one of the following three conditions holds.

  1. For any integrable \(h(X_{n+1})\) with Borel function \(h\), \(E[h(X_{n+1})|X_1,\cdots,X_n]=E[h(X_{n+1})|X_n]\) a.s. for \(n\geq 2\).
  2. \(P(B|X_1,\cdots,X_n)=P(B|X_n)\) a.s. for \(n\in \mathbb{N}\) and \(B\in \sigma(X_{n+1},X_{n+2},\cdots)\).
  3. For any \(n\geq 2\), \(A\in \sigma(X_1,\cdots,X_n)\), and \(B\in \sigma(X_{n+1},X_{n+2},\cdots)\), \(P(A\cap B|X_n)=P(A|X_n)P(B|X_n)\) a.s.
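As promised, here is a small simulation check of the Markov property, a sketch using an arbitrary 3-state transition matrix of our own choosing: the empirical conditional distribution of \(X_3\) given \((X_1,X_2)\) should depend on \(X_2\) only.

```python
import numpy as np

rng = np.random.default_rng(3)
T = np.array([[0.6, 0.3, 0.1],        # an arbitrary 3-state transition matrix (our assumption)
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
reps = 500_000

def step(state, rng):
    """One Markov transition, vectorized: sample the next state from row T[state]."""
    u = rng.random(state.size)
    return (u[:, None] > np.cumsum(T[state], axis=1)).sum(axis=1)

X1 = rng.integers(0, 3, size=reps)    # X_1 uniform on {0, 1, 2}
X2 = step(X1, rng)
X3 = step(X2, rng)

# P(X_3 = 0 | X_1 = a, X_2 = b) should match P(X_3 = 0 | X_2 = b) = T[b, 0] for every a.
for b in range(3):
    given_X2 = np.mean(X3[X2 == b] == 0)
    for a in range(3):
        mask = (X1 == a) & (X2 == b)
        print(a, b, round(np.mean(X3[mask] == 0), 3), "vs", round(given_X2, 3))
```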

Further properties (periodicity, invariant distributions, irreducibility, etc.) and applications of Markov chains, such as MCMC, can be found in Chapter 4 of the textbook. Next, we introduce martingales, which are quite important in stochastic processes and financial applications. Current research in sequential analysis and game-theoretic statistics is also strongly connected with this topic.

Definition 2.11 The sequence \(\{X_n,\cal F_n\}\), with \(\{X_n\}\) a sequence of integrable random variables defined on a probability space \((\Omega, \cal F, P)\) and \(\cal F_1\subset \cal F_2\subset \cdots \subset \cal F\) a sequence of \(\sigma\)-fields (called a ``filtration’’) such that \(\sigma(X_n)\subset \cal F_n\) for all \(n\), is said to be a martingale if \[E(X_{n+1}|\cal F_n)= X_n \quad \mbox{a.s.}\] for all \(n\in \mathbb{N}\).

Furthermore, \(\{X_n,\cal F_n\}\) is said to be a submartingale (supermartingale) if the “\(=\)” in the formula is replaced by “\(\geq\)” (“\(\leq\)”). For a martingale \(\{X_n,\cal F_n\}\) we can derive, by iterating the formula in the definition, that (i) \(E(X_{n+j}|\cal F_n)=X_n\) a.s. for all \(j\geq 0\), and (ii) \(EX_1=EX_j\) for all \(j=1,2,\cdots\). We say \(\{X_n\}\) is a martingale (sub- or super-) if \(\{X_n,\sigma(X_1,\cdots,X_n)\}\) is a martingale (sub- or super-). In fact, whenever \(\sigma(X_k)\subset \cal F_k\) for all \(k\), we have \(\sigma(X_1,\cdots,X_n)=\sigma(\sigma(X_1)\cup\cdots\cup \sigma(X_n))\subset \cal F_n\), so \(\{\sigma(X_1,\cdots,X_n)\}\) is the smallest filtration \(\{\cal G_n\}\) satisfying \(\sigma(X_n) \subset \cal G_n\) for all \(n\) (i.e. the smallest filtration to which \(\{X_n\}\) is ``adapted’’).

One way to construct a martingale is to take \(X_n=E(Y|\cal F_n)\) for an integrable random variable \(Y\) and a filtration \(\{\cal F_n\}\); the martingale property follows from the tower rule for conditional expectation. The following example is known as the likelihood ratio martingale.

Example 2.13 Consider a sequence of random variables \(\{X_n\}\) whose joint distribution is governed by one of two probability measures \(P\) and \(Q\) on the space \((\Omega, \cal F)\). Let \(P_n\) and \(Q_n\) be \(P\) and \(Q\) restricted to \(\cal F_n=\sigma(X_1,\cdots,X_n)\). Suppose that \(Q_n \ll P_n\) for each \(n\). Then \(\{L_n,\cal F_n\}\) is a martingale under \(P\), where \(L_n=\frac{dQ_n}{dP_n}\). Moreover, suppose there exists a \(\sigma\)-finite measure \(\nu_n\) on \(\cal F_n\) which is equivalent to \(P_n\). Then \(L_n=\frac{dQ_n/d\nu_n}{dP_n/d\nu_n}\) is called the likelihood ratio.
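A quick simulation of the likelihood ratio martingale, a sketch assuming \(P=N(0,1)\) i.i.d. and \(Q=N(\theta,1)\) i.i.d.: under \(P\), \(E(L_n)=1\) for every \(n\), a consequence of the martingale property together with \(E(L_1)=1\).

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 0.5, 10, 200_000

X = rng.normal(size=(reps, n))                       # data generated under P = N(0,1) i.i.d.
# For Q = N(theta,1) i.i.d., L_n = dQ_n/dP_n = prod_{i<=n} exp(theta X_i - theta^2 / 2).
logL = np.cumsum(theta * X - theta ** 2 / 2, axis=1)
L = np.exp(logL)

print(L.mean(axis=0)[[0, 4, 9]])                     # E_P(L_1), E_P(L_5), E_P(L_10), each close to 1
```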

The second example is the random walk with mean-zero increments, which is both a Markov chain and a martingale.

Example 2.14 Let \(\{\epsilon_n\}\) be a sequence of independent and integrable random variables with \(E(\epsilon_n)=0\) for all \(n\). Let \(X_n=\sum_{i=1}^n \epsilon_i\). Then \(\{X_n\}\) is a martingale since \[E(X_{n+1}|X_1,\cdots,X_n)=E(X_n+\epsilon_{n+1}|X_1,\cdots,X_n)=X_n+E(\epsilon_{n+1})=X_n.\]
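A simulation check of the martingale property, a sketch assuming symmetric \(\pm1\) steps: the increment \(X_{n+1}-X_n\) should be uncorrelated with any bounded function of the past, i.e. \(E[(X_{n+1}-X_n)\,g(X_1,\cdots,X_n)]=0\).

```python
import numpy as np

rng = np.random.default_rng(5)
reps, n = 500_000, 10
eps = rng.choice([-1, 1], size=(reps, n + 1))       # symmetric +/-1 steps, so E(eps_i) = 0
X = np.cumsum(eps, axis=1)                          # column k holds X_{k+1} = eps_1 + ... + eps_{k+1}

increment = X[:, n] - X[:, n - 1]                   # X_{n+1} - X_n = eps_{n+1}
for g in (np.sign, np.tanh, lambda s: (s > 0).astype(float)):
    past = g(X[:, n - 1])                           # a bounded function of X_n, hence of the past
    print(np.mean(increment * past))                # each approximately 0
```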

The following theorem can be immediately derived from Jensen’s inequality for conditional expectation: for a convex function \(\phi\), \[\phi(X_n)=\phi(E(X_{n+1}|\cal F_n))\leq E(\phi(X_{n+1})|\cal F_n).\]

Theorem 2.5 Let \(\phi\) be a convex function on \(\mathbb{R}\).

  1. If \(\{X_n,\cal F_n\}\) is a martingale and \(\phi(X_n)\) is integrable for all \(n\), then \(\{\phi(X_n),\cal F_n\}\) is a submartingale.
  2. If \(\{X_n,\cal F_n\}\) is a submartingale, \(\phi\) is non-decreasing, and \(\phi(X_n)\) is integrable for all \(n\), then \(\{\phi(X_n),\cal F_n\}\) is a submartingale.

A well-known result for martingales is Doob’s decomposition, which decomposes any adapted, integrable stochastic process \(X_n\) (i.e. \(\sigma(X_n)\subset \cal F_n\) for all \(n\)) into a martingale \(Y_n\) plus a predictable process \(Z_n\) (i.e. \(Z_n\) is measurable with respect to \(\cal F_{n-1}\) for all \(n\geq 2\)). The following theorem is an extension of Doob’s decomposition.

Theorem 2.6 Let \(\{X_n, \cal F_n\}\) be a submartingale (supermartingale). Then \(X_n=Y_n+Z_n\) for all \(n\) where \(\{Y_n, \cal F_n\}\) is a martingale, and \(Z_n\) is an increasing (decreasing) sequence with \(EZ_n<\infty\) for all \(n\). Furthermore, if \(\mathrm{sup}_n E|X_n|<\infty\), then \(\mathrm{sup}_n E|Y_n|<\infty\) and \(\mathrm{sup}_n EZ_n<\infty\).

The decomposition can be constructed by letting \(Y_n=\sum_{i=1}^{n} \eta_i\) and \(Z_n=\sum_{i=1}^{n} \xi_i\), where \(\eta_i=X_i-X_{i-1}-E(X_i-X_{i-1}|\cal F_{i-1})\) and \(\xi_i=E(X_i-X_{i-1}|\cal F_{i-1})\) for \(i\geq 2\), with \(\eta_1=X_1\) and \(\xi_1=0\). The last theorem in this chapter is Doob’s martingale convergence theorem.
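A numerical sketch of this construction, assuming the submartingale \(X_n=S_n^2\) for a symmetric \(\pm1\) random walk \(S_n\): here \(\xi_i=E(X_i-X_{i-1}|\cal F_{i-1})=1\) for \(i\geq 2\), so the compensator is \(Z_n=n-1\) and the martingale part is \(Y_n=S_n^2-(n-1)\).

```python
import numpy as np

rng = np.random.default_rng(6)
reps, N = 200_000, 8
eps = rng.choice([-1, 1], size=(reps, N))
S = np.cumsum(eps, axis=1)                          # symmetric random walk; column k holds S_{k+1}
X = S ** 2                                          # a submartingale (convex image of a martingale)

# Doob decomposition: for i >= 2, xi_i = E(X_i - X_{i-1} | F_{i-1}) = 2 S_{i-1} E(eps_i) + 1 = 1,
# and eta_1 = X_1, xi_1 = 0, so Z_n = n - 1 (increasing, predictable) and Y_n = X_n - (n - 1).
Z = np.arange(N)                                    # Z_1, ..., Z_N = 0, 1, ..., N - 1
Y = X - Z                                           # candidate martingale part

# Martingale check: E[(Y_{n+1} - Y_n) g(S_n)] should be ~ 0 for a bounded function g of the past.
n = 4
diff = Y[:, n] - Y[:, n - 1]                        # Y_{n+1} - Y_n = 2 S_n eps_{n+1}
print(np.mean(diff), np.mean(diff * np.sign(S[:, n - 1])))   # both approximately 0
```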

Theorem 2.7 Let \(\{X_n, \cal F_n\}\) be a submartingale. If \(c:=\sup_n E|X_n|<\infty\), then \(X_n \to X\) a.s. as \(n\to \infty\), where \(X\) is a random variable satisfying \(E|X|\leq c\).