Commit 38145d79 by Lenovo

Stopped here for now; to be continued.

parent 033f0223
\documentclass[lettersize,journal]{IEEEtran}
\usepackage{amsmath,amsfonts}
\usepackage{nicematrix}
\usepackage{algorithmic}
\usepackage{algorithm}
\usepackage{array}
......@@ -9,6 +10,7 @@
\usepackage{url}
\usepackage{verbatim}
\usepackage{graphicx}
%\usepackage{natbib}
\newtheorem{theorem}{Theorem}
\newtheorem{proposition}[theorem]{Proposition}
......@@ -26,7 +28,7 @@
\usetikzlibrary{decorations.markings}
\hyphenation{op-tical net-works semi-conduc-tor IEEE-Xplore}
% updated with editorial comments 8/9/2021
\newcommand{\highlight}[1]{\textcolor{red}{#1}}
\begin{document}
\title{Non-ergodicity of Game 2048}
......@@ -55,7 +57,7 @@ wangwenhao11@nudt.edu.cn).
\markboth{IEEE Transactions on Games,~Vol.~14, No.~8, August~202X}%
{Shell \MakeLowercase{\textit{et al.}}: A Sample Article Using IEEEtran.cls for IEEE Journals}
\IEEEpubid{0000--0000/00\$00.00~\copyright~2024 IEEE}
%\IEEEpubid{0000--0000/00\$00.00~\copyright~2024 IEEE}
% Remember, if you use this you must call \IEEEpubidadjcol in the second
% column for its text to clear the IEEEpubid mark.
......@@ -69,12 +71,14 @@ wangwenhao11@nudt.edu.cn).
\end{IEEEkeywords}
\input{main/background}
\input{main/introduction}
\input{main/nonergodicity}
\input{main/paradox}
\input{main/theorem}
\input{main/2048prove}
\input{main/background}
%\input{main/nonergodicity}
%\input{main/paradox}
%\input{main/theorem}
%\input{main/2048prove}
......
......@@ -21,7 +21,7 @@ p=2^{64} \cdot \sum_{m=0}^{15} I(B_m \neq 0) \cdot 2^{B_m} + \sum_{m=0}^{15} (1
This paper places this sum above bit 64, i.e., in bit positions 64--84. The main idea of the encoding is to put the sum of all tiles on the board in the high bits, so that boards with a larger sum are ordered later,
and every state transition therefore moves from a smaller index to a larger one. The lower 64 bits are the board encoding itself, which guarantees uniqueness: each board corresponds to exactly one value.
\input{../pic/2048encode}
\input{pic/2048encode}
The encoding of the board shown in the figure above is $p=(164)30784+\texttt{0xFEDC543200000020}$.
The cells are ordered from bottom to top and from right to left: the bottom-right cell occupies the lowest bits and the top-left cell the highest bits.
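A minimal sketch of this encoding is given below (it assumes, consistently with the hexadecimal example above, that each exponent $B_m$ occupies 4 bits of the lower 64 bits; the function name is illustrative):
\begin{verbatim}
def encode_board(exponents):
    """Encode a 2048 board as a single integer p.

    `exponents` lists the 16 exponents B_m, indexed from the bottom-right cell
    (m = 0) to the top-left cell (m = 15); 0 denotes an empty cell.
    """
    assert len(exponents) == 16
    tile_sum = sum(2 ** b for b in exponents if b != 0)          # sum of tile values
    low64 = sum(b << (4 * m) for m, b in enumerate(exponents))   # 4 bits per cell
    return (tile_sum << 64) + low64                              # sum above bit 63
\end{verbatim}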
......@@ -39,7 +39,7 @@ p=2^{64} \cdot \sum_{m=0}^{15} I(B_m \neq 0) \cdot 2^{B_m} + \sum_{m=0}^{15} (1
According to the game rules, when two tiles with the same exponent collide they merge into a single tile whose exponent is one higher, and a new 2 or 4 tile is then spawned at a random empty cell. This paper denotes this process as $S_i\to S_{i'}\to S_j$.
\input{../pic/2048example-p}
\input{pic/2048example-p}
As shown in Fig. 3.5, our ordering guarantees that a later state is also ranked later: in the transition $S_i\to S_j$ we always have $p_i<p_j$, so in the non-terminal transition matrix every transition goes from a smaller index to a larger one. During $S_i\to S_{i'}$ the board sum does not change; the sum only increases when the new tile is spawned.
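A short check of this monotonicity (a sketch under the encoding above, assuming the lower 64 bits always lie in $[0,2^{64})$): spawning a 2 or a 4 raises the board sum by at least 2, so the high part of $p$ grows by at least $2\cdot 2^{64}$ while the low part changes by at most $2^{64}-1$, hence
\[
p_j - p_i \;\ge\; 2\cdot 2^{64} - (2^{64}-1) \;=\; 2^{64}+1 \;>\; 0 .
\]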
......
......@@ -16,20 +16,91 @@ $V^{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}r_t|s_0=s\right]$.
Given a steady policy $\pi$, MDP becomes a Markov chain on state space
$\mathcal{S}$ with a matrix
$P^{\pi}\in[0,1]^{n\times n}$, where
$P^{\pi}(s_1,s_2)=\sum_{a\in \mathcal{A}}\pi(a|s_1)\mathcal{T}(s_1,a,s_2)$
$P_{\pi}\in[0,1]^{n\times n}$, where
$P_{\pi}(s_1,s_2)=\sum_{a\in \mathcal{A}}\pi(a|s_1)\mathcal{T}(s_1,a,s_2)$
is the transition probability from $s_1$ to $s_2$,
$\forall s\in \mathcal{S}$, $\sum_{s'\in \mathcal{S}}P^{\pi}(s,s')=1$.
A stationary measure for $P$ is a distribution measure
$d$ on $\mathcal{S}$ such that
$\forall s\in \mathcal{S}$, $\sum_{s'\in \mathcal{S}}P_{\pi}(s,s')=1$.
A stationary measure for $P_{\pi}$ is a probability distribution
$d_{\pi}$ on $\mathcal{S}$ such that
\begin{equation}
d^{\top}=d^{\top}P^{\pi}.
d_{\pi}=P_{\pi}^{\top}d_{\pi}.
\label{invariance}
\end{equation}
That is, $\forall s\in \mathcal{S}$, we have
\begin{equation}
\sum_{s'\in \mathcal{S}}P^{\pi}(s',s)d(s')=d(s).
\sum_{s'\in \mathcal{S}}P_{\pi}(s',s)d_{\pi}(s')=d_{\pi}(s).
\end{equation}
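For concreteness, the construction of $P_{\pi}$ from $\pi$ and $\mathcal{T}$ can be written as the following sketch (the tensors below are random placeholders used only to illustrate the shapes):
\begin{verbatim}
import numpy as np

# Illustrative shapes: T[s, a, s'] is the MDP kernel, pi[s, a] the policy pi(a|s).
n_states, n_actions = 7, 4
rng = np.random.default_rng(0)
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)          # each T[s, a, :] is a distribution
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)        # each pi[s, :] is a distribution

# P_pi(s1, s2) = sum_a pi(a|s1) * T(s1, a, s2)
P_pi = np.einsum('sa,sat->st', pi, T)
assert np.allclose(P_pi.sum(axis=1), 1.0)  # rows of P_pi sum to one
\end{verbatim}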
\begin{definition}[Ergodicity]
The ergodicity assumption for an MDP states that
$d_{\pi}(s)$ exists for every policy $\pi$ and is independent of
the initial state \cite{Sutton2018book}.
\end{definition}
This means that, under any policy, all states remain reachable from the
current state after sufficiently many steps \cite{majeed2018q}.
A sufficient condition for this assumption is that
$1$ is a simple eigenvalue of the matrix $P_{\pi}$ and
all other eigenvalues of $P_{\pi}$ have modulus strictly less than $1$.
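This condition can be checked directly from the spectrum of $P_{\pi}$; the sketch below is one way to do so (the two-state chain is only an illustrative example):
\begin{verbatim}
import numpy as np

def satisfies_sufficient_condition(P, tol=1e-8):
    """True iff 1 is a simple eigenvalue of P and all others have modulus < 1."""
    eig = np.linalg.eigvals(np.asarray(P, dtype=float))
    near_one = np.isclose(eig, 1.0, atol=tol)
    return near_one.sum() == 1 and bool(np.all(np.abs(eig[~near_one]) < 1.0 - tol))

# Example: an aperiodic two-state chain (eigenvalues 1 and 0.7) satisfies it.
print(satisfies_sufficient_condition([[0.9, 0.1],
                                      [0.2, 0.8]]))   # True
\end{verbatim}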
\subsection{Ergodicity and Non-ergodicity of Markov Chains}
\input{pic/randomWalk}
The random walk shown in Figure \ref{randomwalk}
is a Markov chain in which the agent starts from node C and moves
left or right with probability 0.5 each, until it
reaches the leftmost or rightmost node, where the episode terminates.
The terminal states are usually called absorbing states.
The transition probability matrix
of the random walk with absorbing states,
$P_{\text{absorbing}}$, is defined as follows:
\[
P_{\text{absorbing}}\doteq\begin{array}{c|ccccccc}
&\text{T}_1 & \text{A} & \text{B} & \text{C} & \text{D} & \text{E} & \text{T}_2 \\\hline
\text{T}_1 & 1 & 0 & 0 & 0 & 0 & 0& 0 \\
\text{A} & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0 & 0\\
\text{B} & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0\\
\text{C} & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0\\
\text{D} & 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 \\
\text{E} & 0 & 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} \\
\text{T}_2 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}
\]
According to (\ref{invariance}),
starting from node C the chain settles into the stationary distribution
$d_{\text{absorbing}}=\{\frac{1}{2}, 0, 0, 0, 0, 0, \frac{1}{2}\}$.
Since the stationary probabilities of A, B, C, D, and E are all zero,
the random walk with absorbing states is non-ergodic.
\input{pic/randomWalkRestart}
However, reinforcement learning typically relies on the ergodicity assumption:
whenever an absorbing state is encountered, the environment is immediately
reset to the initial state. Figure \ref{randomwalkRestart}
shows the random walk with restarts.
The transition probability matrix
of the random walk with restarts,
$P_{\text{restart}}$, is defined as follows:
\[
P_{\text{restart}}\doteq\begin{array}{c|ccccccc}
&\text{T}_1 & \text{A} & \text{B} & \text{C} & \text{D} & \text{E} & \text{T}_2 \\\hline
\text{T}_1 & 0 & 0 & 0 & 1 & 0 & 0& 0 \\
\text{A} & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0 & 0\\
\text{B} & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0\\
\text{C} & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0\\
\text{D} & 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 \\
\text{E} & 0 & 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} \\
\text{T}_2 & 0 & 0 & 0 & 1 & 0 & 0 & 0
\end{array}
\]
According to (\ref{invariance}),
the stationary distribution is
$d_{\text{restart}}=\{0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05\}$.
Since the stationary probabilities of T$_1$, A, B, C, D, E, and T$_2$ are all non-zero,
the random walk with restarts is ergodic.
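These two distributions can be reproduced with a short numerical check (a minimal sketch, not part of the formal argument: it starts from node C and averages the state distribution over many steps):
\begin{verbatim}
import numpy as np

# Transition matrices from the text; state order: T1, A, B, C, D, E, T2.
P_absorbing = np.array([
    [1, 0, 0, 0, 0, 0, 0],
    [.5, 0, .5, 0, 0, 0, 0],
    [0, .5, 0, .5, 0, 0, 0],
    [0, 0, .5, 0, .5, 0, 0],
    [0, 0, 0, .5, 0, .5, 0],
    [0, 0, 0, 0, .5, 0, .5],
    [0, 0, 0, 0, 0, 0, 1],
])
P_restart = P_absorbing.copy()
P_restart[0] = [0, 0, 0, 1, 0, 0, 0]   # T1 restarts to C
P_restart[6] = [0, 0, 0, 1, 0, 0, 0]   # T2 restarts to C

def long_run_distribution(P, start=3, steps=20000):
    """Average the state distribution over `steps` steps, starting from C."""
    d = np.zeros(P.shape[0]); d[start] = 1.0
    avg = np.zeros_like(d)
    for _ in range(steps):
        avg += d
        d = P.T @ d
    return avg / steps

print(np.round(long_run_distribution(P_absorbing), 3))  # ~[0.5  0  0  0  0  0  0.5]
print(np.round(long_run_distribution(P_restart), 3))    # ~[0.05 0.1 0.2 0.3 0.2 0.1 0.05]
\end{verbatim}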
% Notes: give the definition of ergodicity for Markov chains together with a sufficient condition;
% use the random-walk example to show that the variant with absorbing states is non-ergodic,
% while the reinforcement-learning training setup with restarts is ergodic.
......
......@@ -5,42 +5,111 @@ can move the tiles in four directions - up, down, left, and right,
and the objective is to reach the 2048 tile or a higher one.
While the game is simple to understand, it requires strategic
thinking and planning to reach the 2048 tile.
Natural decision problems in 2048 have been proved to be NP-complete
\cite{abdelkader20152048}.
2048 has gained widespread popularity due to its addictive
gameplay and simple mechanics, making it a favorite
among puzzle game enthusiasts.
\cite{szubert2014temporal}
\cite{wu2014multi}
\cite{oka2016systematic}
\cite{matsuzaki2016systematic}
\cite{yeh2016multistage}
\cite{jaskowski2017mastering}
\cite{matsuzaki2017developing}
\cite{kondo2019playing}
\cite{matsuzaki2020further}
\cite{matsuzaki2021developing}
\cite{guei2021optimistic}
\cite{bangole2023game}
\includegraphics{pic/2048epsilon-greedy}
\includegraphics{pic/maze-eps-greedy}
In 2014, Rodgers and Levine investigated search methods for
2048 AI strategies including
mini-max search, expectimax search, Monte-Carlo tree search,
and averaged depth-limited search \cite{rodgers2014an}.
Szubert and Ja{\'s}kowski first employed temporal difference learning
to train the 2048 AI, where afterstate values were
approximated with N-tuple networks \cite{szubert2014temporal}.
Wu et al. proposed multi-stage TD learning incorporating shallow
expectimax search to improve the performance \cite{wu2014multi}.
In 2016, sets of N-tuple network combinations yielding better performance
were selected by local greedy search
\cite{oka2016systematic,matsuzaki2016systematic}.
In 2017, Ja{\'s}kowski employed temporal coherence learning
with multi-stage weight promotion, redundant encoding,
and carousel shaping to master the game \cite{jaskowski2017mastering}.
Matsuzaki developed backward temporal coherence learning
and restart strategies for fast training \cite{matsuzaki2017developing}.
In 2022, Guei et al. introduced optimistic initialization
to encourage exploration in 2048, improving the learning
quality \cite{guei2021optimistic}.
Optimistic initialization is equivalent to
potential-based reward shaping with a static potential function
\cite{wiewiora2003potential,devlin2012dynamic}.
In addition, neural network approximators have been developed
for training 2048 AI
\cite{kondo2019playing,matsuzaki2020further,matsuzaki2021developing,bangole2023game}.
We observe a notable phenomenon in 2048 AI studies:
none of these works employ explicit exploration strategies
such as softmax or $\epsilon$-greedy to prevent reinforcement learning
methods from getting stuck in local optima.
Szubert and Ja{\'s}kowski said
``The exploration is
not needed in this game, as the environment is inherently
stochastic and thus provides sufficiently diversified experience'',
and they found that experiments with $\epsilon$-greedy exploration
did not improve the performance \cite{szubert2014temporal}.
We argue that in 2048 AI training it is not that
exploration is unnecessary, but rather that exploration cannot be carried
out effectively with softmax or $\epsilon$-greedy strategies.
% \begin{figure}
% \centering
% \includegraphics[width=2.5in][Maze]{pic/maze-eps-greedy}
% \includegraphics[width=2.5in][2048 Game]{pic/2048epsilon-greedy}
% \caption{Comparison of returns of $\epsilon$-greedy strageties.}
% \label{fig1}
% \end{figure}
\begin{figure*}[!t]
\centering
\subfloat[2048 Game]{\includegraphics[width=3in]{pic/2048epsilon-greedy}%
\label{fig_second_case}}
\hfil
\subfloat[Maze]{\includegraphics[width=3in]{pic/maze-eps-greedy}%
\label{fig_first_case}}
\caption{Comparison of returns of $\epsilon$-greedy strategies.}
\label{fig_sim}
\end{figure*}
To validate the above point, we designed two sets of experiments,
one with 2048 and the other with a maze.
In the experiments, we used nearly optimal value functions
combined with an $\epsilon$-greedy exploration strategy,
testing the average score and standard deviation obtained
for each value of $\epsilon\in\{0, 0.001, 0.002, 0.004,
0.008, 0.016, 0.032, 0.064, 0.128, 0.256, 0.512\}$.
In the 2048 game, the value function is an N-tuple network
trained with optimistic initialization \cite{guei2021optimistic},
achieving an average score of \highlight{300,000}.
In the maze game, the optimal value function is used,
with the optimal policy achieving a score of \highlight{$-58$} points.
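For reference, the $\epsilon$-greedy behavior policy used in both experiments can be sketched as follows (a minimal sketch; the function and variable names are illustrative and not taken from the actual experiment code):
\begin{verbatim}
import numpy as np

def epsilon_greedy(action_values, epsilon, rng=np.random.default_rng()):
    """Return a random legal action with probability epsilon, else the greedy one.

    `action_values` maps each legal action to its estimated value, e.g. the
    N-tuple afterstate value in 2048 or the optimal action value in the maze.
    """
    actions = list(action_values)
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]     # explore
    return max(actions, key=action_values.get)         # exploit

# Values of epsilon swept in the experiments.
EPSILONS = [0, 0.001, 0.002, 0.004, 0.008, 0.016, 0.032,
            0.064, 0.128, 0.256, 0.512]
\end{verbatim}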
As shown in Figure \ref{fig_sim},
the x-axis represents $\epsilon$,
the y-axis represents the average score per game,
and the shaded area represents the standard deviation.
We find that in the 2048 game the average score
decreases sharply as $\epsilon$ increases,
whereas in the maze game the average score
decreases only gradually with increasing $\epsilon$.
The comparison in this set of experiments indicates that
the $\epsilon$-greedy exploration strategy in the maze game still
results in the behavioral policy being an $\epsilon$-greedy policy,
while in the 2048 game, the behavioral policy is no
longer an $\epsilon$-greedy policy.
This raises a natural question:
what are the fundamental differences between the
2048 game and maze games?
In a maze game, when the agent deviates from
the optimal path during exploration,
it can immediately return to the optimal path,
while in the 2048 game, when the agent deviates
from the optimal state, it may never have the
chance to return to the previous state.
This relates to the ergodicity of the game.
In this paper, we prove that the game 2048 is non-ergodic.
\begin{figure}[!t]
\centering
\scalebox{0.9}{
\begin{tikzpicture}
\node[draw, rectangle, fill=gray!50] (DEAD) at (-2,0) ;
\node[draw, rectangle, fill=gray!50] (DEAD2) at (10,0) ;
\node[draw, circle] (A) at (0,0) {A};
\node[draw, circle] (B) at (2,0) {B};
\node[draw, circle] (C) at (4,0) {C};
\node[draw, rectangle, fill=gray!50] (DEAD) at (0,0) {T$_1$};
\node[draw, rectangle, fill=gray!50] (DEAD2) at (9,0) {T$_2$};
\node[draw, circle] (A) at (1.5,0) {A};
\node[draw, circle] (B) at (3,0) {B};
\node[draw, circle] (C) at (4.5,0) {C};
\node[draw, circle] (D) at (6,0) {D};
\node[draw, circle] (E) at (8,0) {E};
\node[draw, circle] (E) at (7.5,0) {E};
\draw[->] (A) -- (DEAD);
\draw[->] (B) -- (A);
\draw[->] (B) to [bend left=30] (C);
\draw[->] (C) to [bend left=30] (B);
\draw[->] (C) to [bend left=30] (D);
\draw[->] (D) to [bend left=30] (C);
\draw[->] (D) -- (E);
\draw[->] (E) -- (DEAD2);
\draw[->] (A) -- node {0.5} (DEAD);
\draw[->] (A) to [bend left=30] node {0.5} (B);
\draw[->] (B) to [bend left=30] node {0.5} (A);
\draw[->] (B) to [bend left=30] node {0.5} (C);
\draw[->] (C) to [bend left=30] node {0.5} (B);
\draw[->] (C) to [bend left=30] node {0.5} (D);
\draw[->] (D) to [bend left=30] node {0.5} (C);
\draw[->] (D) to [bend left=30] node {0.5} (E);
\draw[->] (E) to [bend left=30] node {0.5} (D);
\draw[->] (E) -- node {0.5} (DEAD2);
\draw[->] ([yshift=4ex]C.north) -- ([yshift=4.5ex]C.south);
\end{tikzpicture}
}
\caption{Random walk with absorbing states.}
\label{randomwalk}
\end{figure}
\begin{figure}[!t]
\centering
\scalebox{0.9}{
\begin{tikzpicture}
\node[draw, circle] (A) at (0,0) {A};
\node[draw, circle] (B) at (2,0) {B};
\node[draw, circle] (C) at (4,0) {C};
\node[draw, rectangle, fill=gray!50] (DEAD) at (0,0) {T$_1$};
\node[draw, rectangle, fill=gray!50] (DEAD2) at (9,0) {T$_2$};
\node[draw, circle] (A) at (1.5,0) {A};
\node[draw, circle] (B) at (3,0) {B};
\node[draw, circle] (C) at (4.5,0) {C};
\node[draw, circle] (D) at (6,0) {D};
\node[draw, circle] (E) at (8,0) {E};
\node[draw, circle] (E) at (7.5,0) {E};
\draw[->] (A.north) to [bend left=30] (C.north);
\draw[->] (B) -- (A);
\draw[->] (B) to [bend left=30] (C);
\draw[->] (C) to [bend left=30] (B);
\draw[->] (C) to [bend left=30] (D);
\draw[->] (D) to [bend left=30] (C);
\draw[->] (D) -- (E);
\draw[->] (E.south) to [bend left=30] (C.south);
\draw[->] (DEAD.south) to [bend right=30] node {1} (C.south);
\draw[->] (A) -- node {0.5} (DEAD);
\draw[->] (A) to [bend left=30] node {0.5} (B);
\draw[->] (B) to [bend left=30] node {0.5} (A);
\draw[->] (B) to [bend left=30] node {0.5} (C);
\draw[->] (C) to [bend left=30] node {0.5} (B);
\draw[->] (C) to [bend left=30] node {0.5} (D);
\draw[->] (D) to [bend left=30] node {0.5} (C);
\draw[->] (D) to [bend left=30] node {0.5} (E);
\draw[->] (E) to [bend left=30] node {0.5} (D);
\draw[->] (E) -- node {0.5} (DEAD2);
\draw[->] (DEAD2.south) to [bend left=30] node {1} (C.south);
\draw[->] ([yshift=4ex]C.north) -- ([yshift=4.5ex]C.south);
\end{tikzpicture}
}
\caption{Random walk with restarts.}
\label{randomwalkRestart}
\end{figure}
......@@ -19,6 +19,27 @@
year={1979},
publisher={IEEE}
}
@article{wiewiora2003potential,
title={Potential-based shaping and Q-value initialization are equivalent},
author={Wiewiora, Eric},
journal={Journal of Artificial Intelligence Research},
volume={19},
pages={205--208},
year={2003}
}
@inproceedings{devlin2012dynamic,
title={Dynamic potential-based reward shaping},
author={Devlin, Sam Michael and Kudenko, Daniel},
booktitle={11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2012)},
pages={433--440},
year={2012},
organization={IFAAMAS}
}
@misc{abdelkader20152048,
title={2048 is NP-Complete},
author={Abdelkader, Ahmed and Acharya, Aditya and Dasler, Philip},
year={2015},
}
@incollection{bangole2023game,
title={Game Playing (2048) Using Deep Neural Networks},
author={Bangole, Narendra Kumar Rao and Moulya, RB and Pranthi, R and Reddy, Sreelekha and Namratha, R},
......@@ -50,9 +71,17 @@
volume={14},
number={3},
pages={478--487},
year={2021},
year={2022},
publisher={IEEE}
}
@inproceedings{rodgers2014an,
title={An investigation into 2048 AI strategies},
author={Rodgers, Philip and Levine, John},
booktitle={2014 IEEE Conference on Computational Intelligence and Games},
pages={1--2},
year={2014},
organization={IEEE}
}
@inproceedings{szubert2014temporal,
title={Temporal difference learning of n-tuple networks for the game 2048},
author={Szubert, Marcin and Ja{\'s}kowski, Wojciech},
......@@ -131,7 +160,13 @@
year={2021},
publisher={Information Processing Society of Japan}
}
@book{Sutton2018book,
  title={Reinforcement Learning: An Introduction},
  author={Sutton, Richard S. and Barto, Andrew G.},
  edition={Second},
  publisher={The MIT Press},
  year={2018}
}
......