\section{Background}
Consider a Markov decision process (MDP) $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{T}\rangle$, where $\mathcal{S}=\{1,2,\ldots,n\}$ is a finite state space with $|\mathcal{S}|=n$, $\mathcal{A}$ is an action space, $\mathcal{T}:\mathcal{S}\times \mathcal{A}\times \mathcal{S}\rightarrow [0,1]$ is a transition function, and $\mathcal{R}:\mathcal{S}\times \mathcal{A}\times \mathcal{S}\rightarrow \mathbb{R}$ is a reward function. A policy $\pi:\mathcal{S}\times \mathcal{A}\rightarrow [0,1]$ selects action $a$ in state $s$ with probability $\pi(a|s)$. The state value function under policy $\pi$, denoted $V^{\pi}:\mathcal{S}\rightarrow \mathbb{R}$, is the expected sum of rewards collected in the MDP under policy $\pi$: $V^{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}r_t\,\middle|\,s_0=s\right]$.

Given a fixed policy $\pi$, the MDP reduces to a Markov chain on the state space $\mathcal{S}$ with transition matrix $P_{\pi}\in[0,1]^{n\times n}$, where $P_{\pi}(s_1,s_2)=\sum_{a\in \mathcal{A}}\pi(a|s_1)\mathcal{T}(s_1,a,s_2)$ is the transition probability from $s_1$ to $s_2$ and $\sum_{s'\in \mathcal{S}}P_{\pi}(s,s')=1$ for all $s\in \mathcal{S}$. A stationary distribution of $P_{\pi}$ is a probability distribution $d_{\pi}$ on $\mathcal{S}$ such that
\begin{equation}
d_{\pi}=P_{\pi}^{\top}d_{\pi}.
\label{invariance}
\end{equation}
That is, for every $s\in \mathcal{S}$,
\begin{equation}
\sum_{s'\in \mathcal{S}}P_{\pi}(s',s)d_{\pi}(s')=d_{\pi}(s).
\end{equation}
\begin{definition}[Ergodicity]
The ergodicity assumption on the MDP states that the stationary distribution $d_{\pi}$ exists for every policy $\pi$ and is independent of the initial state \cite{Sutton2018book}.
\end{definition}
This means that, under any policy, every state remains reachable from the current state after sufficiently many steps \cite{majeed2018q}. A sufficient condition for this assumption is that $1$ is a simple eigenvalue of the matrix $P_{\pi}$ and all other eigenvalues of $P_{\pi}$ have modulus strictly smaller than $1$.

\subsection{Ergodicity and Non-ergodicity of Markov Chains}
\input{pic/randomWalk}
The random walk in Figure \ref{randomwalk} is a Markov chain in which the agent starts from node C and moves left or right with probability $0.5$ each, until it reaches the leftmost or rightmost node, where the walk terminates. Such terminal states are usually called absorbing states. The transition probability matrix of the random walk with absorbing states, $P_{\text{absorbing}}$, is defined as follows:
\[
P_{\text{absorbing}}\dot{=}\begin{array}{c|ccccccc}
&\text{T}_1 & \text{A} & \text{B} & \text{C} & \text{D} & \text{E} & \text{T}_2 \\\hline
\text{T}_1 & 1 & 0 & 0 & 0 & 0 & 0& 0 \\
\text{A} & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0 & 0\\
\text{B} & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0\\
\text{C} & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0\\
\text{D} & 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 \\
\text{E} & 0 & 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} \\
\text{T}_2 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}
\]
A distribution satisfying (\ref{invariance}) is $d_{\text{absorbing}}=(\frac{1}{2}, 0, 0, 0, 0, 0, \frac{1}{2})$. Since the stationary probabilities of A, B, C, D and E are all zero, the random walk with absorbing states is non-ergodic.
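To make the failure of ergodicity explicit, one can check the balance equation (\ref{invariance}) state by state for $P_{\text{absorbing}}$:
\begin{align*}
d(\text{T}_1) &= 1\cdot d(\text{T}_1) + \tfrac{1}{2}\,d(\text{A}) = \tfrac{1}{2} + 0, &
d(\text{T}_2) &= \tfrac{1}{2}\,d(\text{E}) + 1\cdot d(\text{T}_2) = 0 + \tfrac{1}{2},\\
d(\text{A}) &= \tfrac{1}{2}\,d(\text{B}) = 0, &
d(\text{E}) &= \tfrac{1}{2}\,d(\text{D}) = 0,
\end{align*}
and similarly $d(\text{B})=d(\text{C})=d(\text{D})=0$. In fact, any distribution supported on $\{\text{T}_1,\text{T}_2\}$ satisfies (\ref{invariance}), and the limiting distribution depends on the initial state: a walk started in C is absorbed at either end with probability $\tfrac{1}{2}$, whereas a walk started in A is absorbed at $\text{T}_1$ with probability $\tfrac{5}{6}$. Hence no unique, initial-state-independent stationary distribution exists.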
\input{pic/randomWalkRestart}
However, in reinforcement learning the ergodicity assumption is commonly maintained by restarting: whenever an absorbing state is reached, the agent is immediately reset and transitions to the initial state. Figure \ref{randomwalkRestart} shows the random walk with restarts. The transition probability matrix of the random walk with restarts, $P_{\text{restart}}$, is defined as follows:
\[
P_{\text{restart}}\dot{=}\begin{array}{c|ccccccc}
&\text{T}_1 & \text{A} & \text{B} & \text{C} & \text{D} & \text{E} & \text{T}_2 \\\hline
\text{T}_1 & 0 & 0 & 0 & 1 & 0 & 0& 0 \\
\text{A} & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0 & 0\\
\text{B} & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0\\
\text{C} & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0\\
\text{D} & 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 \\
\text{E} & 0 & 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} \\
\text{T}_2 & 0 & 0 & 0 & 1 & 0 & 0 & 0
\end{array}
\]
Solving (\ref{invariance}) together with the normalization $\sum_{s}d(s)=1$ gives $d_{\text{restart}}=(0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05)$; a short derivation is given at the end of this subsection. Since the stationary probabilities of $\text{T}_1$, $\text{T}_2$, A, B, C, D and E are all non-zero, the random walk with restarts is ergodic.

In summary, this subsection has given the definition of ergodicity for Markov chains together with a sufficient condition, and the random walk example shows that the chain with absorbing states violates ergodicity while the restart setting used in reinforcement-learning training satisfies it. This paper focuses on ergodicity among the non-absorbing states once the absorbing states are removed. Using the St. Petersburg paradox as an example, we show that it does not satisfy ergodicity among the non-absorbing states, and we give a theorem that likewise proves the game 2048 does not satisfy ergodicity among the non-absorbing states.
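For completeness, the derivation of $d_{\text{restart}}$ promised above follows directly from the balance equation (\ref{invariance}) applied to $P_{\text{restart}}$:
\begin{align*}
d(\text{T}_1) &= \tfrac{1}{2}\,d(\text{A}), &
d(\text{A}) &= \tfrac{1}{2}\,d(\text{B}), &
d(\text{B}) &= \tfrac{1}{2}\,d(\text{A}) + \tfrac{1}{2}\,d(\text{C}),\\
d(\text{T}_2) &= \tfrac{1}{2}\,d(\text{E}), &
d(\text{E}) &= \tfrac{1}{2}\,d(\text{D}), &
d(\text{D}) &= \tfrac{1}{2}\,d(\text{E}) + \tfrac{1}{2}\,d(\text{C}).
\end{align*}
By symmetry $d(\text{T}_1)=d(\text{T}_2)$, $d(\text{A})=d(\text{E})$ and $d(\text{B})=d(\text{D})$. Writing $c=d(\text{C})$ and substituting $d(\text{A})=\tfrac{1}{2}d(\text{B})$ into the equation for $d(\text{B})$ gives $d(\text{B})=\tfrac{2}{3}c$, hence $d(\text{A})=\tfrac{1}{3}c$ and $d(\text{T}_1)=\tfrac{1}{6}c$. The normalization $\sum_{s}d(s)=c\left(1+\tfrac{4}{3}+\tfrac{2}{3}+\tfrac{1}{3}\right)=\tfrac{10}{3}c=1$ yields $c=0.3$ and therefore $d_{\text{restart}}=(0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05)$.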