Commit 38145d79 by Lenovo

Stopped here for now; to be continued.

parent 033f0223
\documentclass[lettersize,journal]{IEEEtran}
\usepackage{amsmath,amsfonts}
\usepackage{nicematrix}
\usepackage{algorithmic}
\usepackage{algorithm}
\usepackage{array}
......@@ -9,6 +10,7 @@
\usepackage{url}
\usepackage{verbatim}
\usepackage{graphicx}
%\usepackage{natbib}
\newtheorem{theorem}{Theorem}
\newtheorem{proposition}[theorem]{Proposition}
......@@ -26,7 +28,7 @@
\usetikzlibrary{decorations.markings}
\hyphenation{op-tical net-works semi-conduc-tor IEEE-Xplore}
% updated with editorial comments 8/9/2021
\newcommand{\highlight}[1]{\textcolor{red}{#1}}
\begin{document}
\title{Non-ergodicity of Game 2048}
......@@ -55,7 +57,7 @@ wangwenhao11@nudt.edu.cn).
\markboth{IEEE Transactions on Games,~Vol.~14, No.~8, August~202X}%
{Shell \MakeLowercase{\textit{et al.}}: A Sample Article Using IEEEtran.cls for IEEE Journals}
\IEEEpubid{0000--0000/00\$00.00~\copyright~2024 IEEE}
%\IEEEpubid{0000--0000/00\$00.00~\copyright~2024 IEEE}
% Remember, if you use this you must call \IEEEpubidadjcol in the second
% column for its text to clear the IEEEpubid mark.
......@@ -69,12 +71,14 @@ wangwenhao11@nudt.edu.cn).
\end{IEEEkeywords}
\input{main/background}
\input{main/introduction}
\input{main/nonergodicity}
\input{main/paradox}
\input{main/theorem}
\input{main/2048prove}
\input{main/background}
%\input{main/nonergodicity}
%\input{main/paradox}
%\input{main/theorem}
%\input{main/2048prove}
......
......@@ -21,7 +21,7 @@ p=2^{64} \cdot \sum_{m=0}^{15} I(B_m \neq 0) \cdot 2^{B_m} + \sum_{m=0}^{15} (1
This paper places this sum above bit 64, i.e., in bit positions 64--84. The main idea of the encoding is to put the sum of all tiles on the board in the high bits, so that boards with a larger sum are ordered later,
and every state transition therefore moves from a smaller index to a larger one. The lower 64 bits are the board encoding itself, which guarantees uniqueness: each board corresponds to exactly one value.
\input{../pic/2048encode}
\input{pic/2048encode}
The encoding of the board shown in the figure above is $p=(164)30784+\texttt{0xFEDC543200000020}$.
The cells are ordered from bottom to top and from right to left: the bottom-right cell occupies the lowest bits and the top-left cell the highest bits.
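A minimal sketch of this encoding is given below (it assumes, consistently with the hexadecimal example above, that each exponent $B_m$ occupies 4 bits of the lower 64 bits; the function name is illustrative):
\begin{verbatim}
def encode_board(exponents):
    """Encode a 2048 board as a single integer p.

    `exponents` lists the 16 exponents B_m, indexed from the bottom-right cell
    (m = 0) to the top-left cell (m = 15); 0 denotes an empty cell.
    """
    assert len(exponents) == 16
    tile_sum = sum(2 ** b for b in exponents if b != 0)          # sum of tile values
    low64 = sum(b << (4 * m) for m, b in enumerate(exponents))   # 4 bits per cell
    return (tile_sum << 64) + low64                              # sum above bit 63
\end{verbatim}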
......@@ -39,7 +39,7 @@ p=2^{64} \cdot \sum_{m=0}^{15} I(B_m \neq 0) \cdot 2^{B_m} + \sum_{m=0}^{15} (1
According to the game rules, when two tiles with the same exponent collide they merge into a single tile whose exponent is one higher, and a new 2 or 4 tile is then spawned at a random empty cell. This paper denotes this process as $S_i\to S_{i'}\to S_j$.
\input{../pic/2048example-p}
\input{pic/2048example-p}
As shown in Fig. 3.5, our ordering guarantees that a later state is also ranked later: in the transition $S_i\to S_j$ we always have $p_i<p_j$, so in the non-terminal transition matrix every transition goes from a smaller index to a larger one. During $S_i\to S_{i'}$ the board sum does not change; the sum only increases when the new tile is spawned.
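A short check of this monotonicity (a sketch under the encoding above, assuming the lower 64 bits always lie in $[0,2^{64})$): spawning a 2 or a 4 raises the board sum by at least 2, so the high part of $p$ grows by at least $2\cdot 2^{64}$ while the low part changes by at most $2^{64}-1$, hence
\[
p_j - p_i \;\ge\; 2\cdot 2^{64} - (2^{64}-1) \;=\; 2^{64}+1 \;>\; 0 .
\]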
......
......@@ -16,20 +16,91 @@ $V^{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}r_t|s_0=s\right]$.
Given a steady policy $\pi$, MDP becomes a Markov chain on state space
$\mathcal{S}$ with a matrix
$P^{\pi}\in[0,1]^{n\times n}$, where
$P^{\pi}(s_1,s_2)=\sum_{a\in \mathcal{A}}\pi(a|s_1)\mathcal{T}(s_1,a,s_2)$
$P_{\pi}\in[0,1]^{n\times n}$, where
$P_{\pi}(s_1,s_2)=\sum_{a\in \mathcal{A}}\pi(a|s_1)\mathcal{T}(s_1,a,s_2)$
is the transition probability from $s_1$ to $s_2$,
$\forall s\in \mathcal{S}$, $\sum_{s'\in \mathcal{S}}P^{\pi}(s,s')=1$.
A stationary measure for $P$ is a distribution measure
$d$ on $\mathcal{S}$ such that
$\forall s\in \mathcal{S}$, $\sum_{s'\in \mathcal{S}}P_{\pi}(s,s')=1$.
A stationary measure for $P_{\pi}$ is a probability distribution
$d_{\pi}$ on $\mathcal{S}$ such that
\begin{equation}
d^{\top}=d^{\top}P^{\pi}.
d_{\pi}=P_{\pi}^{\top}d_{\pi}.
\label{invariance}
\end{equation}
That is, $\forall s\in \mathcal{S}$, we have
\begin{equation}
\sum_{s'\in \mathcal{S}}P^{\pi}(s',s)d(s')=d(s).
\sum_{s'\in \mathcal{S}}P_{\pi}(s',s)d_{\pi}(s')=d_{\pi}(s).
\end{equation}
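For concreteness, the construction of $P_{\pi}$ from $\pi$ and $\mathcal{T}$ can be written as the following sketch (the tensors below are random placeholders used only to illustrate the shapes):
\begin{verbatim}
import numpy as np

# Illustrative shapes: T[s, a, s'] is the MDP kernel, pi[s, a] the policy pi(a|s).
n_states, n_actions = 7, 4
rng = np.random.default_rng(0)
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)          # each T[s, a, :] is a distribution
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)        # each pi[s, :] is a distribution

# P_pi(s1, s2) = sum_a pi(a|s1) * T(s1, a, s2)
P_pi = np.einsum('sa,sat->st', pi, T)
assert np.allclose(P_pi.sum(axis=1), 1.0)  # rows of P_pi sum to one
\end{verbatim}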
\begin{definition}[Ergodicity]
The ergodicity assumption for an MDP states that
$d_{\pi}(s)$ exists for every policy $\pi$ and is independent of
the initial state \cite{Sutton2018book}.
\end{definition}
This means that, under any policy, all states remain reachable from the
current state after sufficiently many steps \cite{majeed2018q}.
A sufficient condition for this assumption is that
$1$ is a simple eigenvalue of the matrix $P_{\pi}$ and
all other eigenvalues of $P_{\pi}$ have modulus strictly less than $1$.
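This condition can be checked directly from the spectrum of $P_{\pi}$; the sketch below is one way to do so (the two-state chain is only an illustrative example):
\begin{verbatim}
import numpy as np

def satisfies_sufficient_condition(P, tol=1e-8):
    """True iff 1 is a simple eigenvalue of P and all others have modulus < 1."""
    eig = np.linalg.eigvals(np.asarray(P, dtype=float))
    near_one = np.isclose(eig, 1.0, atol=tol)
    return near_one.sum() == 1 and bool(np.all(np.abs(eig[~near_one]) < 1.0 - tol))

# Example: an aperiodic two-state chain (eigenvalues 1 and 0.7) satisfies it.
print(satisfies_sufficient_condition([[0.9, 0.1],
                                      [0.2, 0.8]]))   # True
\end{verbatim}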
\subsection{Ergodicity and Non-ergodicity of Markov Chains}
\input{pic/randomWalk}
The random walk shown in Figure \ref{randomwalk}
is a Markov chain in which the agent starts from node C and moves
left or right with probability 0.5 each, until it
reaches the leftmost or rightmost node, where the episode terminates.
The terminal states are usually called absorbing states.
The transition probability matrix
of the random walk with absorbing states,
$P_{\text{absorbing}}$, is defined as follows:
\[
P_{\text{absorbing}}\doteq\begin{array}{c|ccccccc}
&\text{T}_1 & \text{A} & \text{B} & \text{C} & \text{D} & \text{E} & \text{T}_2 \\\hline
\text{T}_1 & 1 & 0 & 0 & 0 & 0 & 0& 0 \\
\text{A} & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0 & 0\\
\text{B} & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0\\
\text{C} & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0\\
\text{D} & 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 \\
\text{E} & 0 & 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} \\
\text{T}_2 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}
\]
According to (\ref{invariance}),
starting from node C the chain settles into the stationary distribution
$d_{\text{absorbing}}=\{\frac{1}{2}, 0, 0, 0, 0, 0, \frac{1}{2}\}$.
Since the stationary probabilities of A, B, C, D, and E are all zero,
the random walk with absorbing states is non-ergodic.
\input{pic/randomWalkRestart}
However, reinforcement learning typically relies on the ergodicity assumption:
whenever an absorbing state is encountered, the environment is immediately
reset to the initial state. Figure \ref{randomwalkRestart}
shows the random walk with restarts.
The transition probability matrix
of the random walk with restarts,
$P_{\text{restart}}$, is defined as follows:
\[
P_{\text{restart}}\doteq\begin{array}{c|ccccccc}
&\text{T}_1 & \text{A} & \text{B} & \text{C} & \text{D} & \text{E} & \text{T}_2 \\\hline
\text{T}_1 & 0 & 0 & 0 & 1 & 0 & 0& 0 \\
\text{A} & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0 & 0\\
\text{B} & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0\\
\text{C} & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0\\
\text{D} & 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 \\
\text{E} & 0 & 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} \\
\text{T}_2 & 0 & 0 & 0 & 1 & 0 & 0 & 0
\end{array}
\]
According to (\ref{invariance}),
the stationary distribution is
$d_{\text{restart}}=\{0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05\}$.
Since the stationary probabilities of T$_1$, A, B, C, D, E, and T$_2$ are all non-zero,
the random walk with restarts is ergodic.
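These two distributions can be reproduced with a short numerical check (a minimal sketch, not part of the formal argument: it starts from node C and averages the state distribution over many steps):
\begin{verbatim}
import numpy as np

# Transition matrices from the text; state order: T1, A, B, C, D, E, T2.
P_absorbing = np.array([
    [1, 0, 0, 0, 0, 0, 0],
    [.5, 0, .5, 0, 0, 0, 0],
    [0, .5, 0, .5, 0, 0, 0],
    [0, 0, .5, 0, .5, 0, 0],
    [0, 0, 0, .5, 0, .5, 0],
    [0, 0, 0, 0, .5, 0, .5],
    [0, 0, 0, 0, 0, 0, 1],
])
P_restart = P_absorbing.copy()
P_restart[0] = [0, 0, 0, 1, 0, 0, 0]   # T1 restarts to C
P_restart[6] = [0, 0, 0, 1, 0, 0, 0]   # T2 restarts to C

def long_run_distribution(P, start=3, steps=20000):
    """Average the state distribution over `steps` steps, starting from C."""
    d = np.zeros(P.shape[0]); d[start] = 1.0
    avg = np.zeros_like(d)
    for _ in range(steps):
        avg += d
        d = P.T @ d
    return avg / steps

print(np.round(long_run_distribution(P_absorbing), 3))  # ~[0.5  0  0  0  0  0  0.5]
print(np.round(long_run_distribution(P_restart), 3))    # ~[0.05 0.1 0.2 0.3 0.2 0.1 0.05]
\end{verbatim}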
% Notes: give the definition of ergodicity for Markov chains together with a sufficient condition;
% use the random-walk example to show that the variant with absorbing states is non-ergodic,
% while the reinforcement-learning training setup with restarts is ergodic.
......
......@@ -5,42 +5,111 @@ can move the tiles in four directions - up, down, left, and right,
and the objective is to reach the 2048 tile or a higher one.
While the game is simple to understand, it requires strategic
thinking and planning to reach the 2048 tile.
Natural decision problems in 2048 have been proved to be NP-complete
\cite{abdelkader20152048}.
2048 has gained widespread popularity due to its addictive
gameplay and simple mechanics, making it a favorite
among puzzle game enthusiasts.
\cite{szubert2014temporal}
\cite{wu2014multi}
\cite{oka2016systematic}
\cite{matsuzaki2016systematic}
\cite{yeh2016multistage}
\cite{jaskowski2017mastering}
\cite{matsuzaki2017developing}
\cite{kondo2019playing}
\cite{matsuzaki2020further}
\cite{matsuzaki2021developing}
\cite{guei2021optimistic}
\cite{bangole2023game}
\includegraphics{pic/2048epsilon-greedy}
\includegraphics{pic/maze-eps-greedy}
In 2014, Rodgers and Levine investigated search methods for
2048 AI strategies including
mini-max search, expectimax search, Monte-Carlo tree search,
and averaged depth-limited search \cite{rodgers2014an}.
Szubert and Ja{\'s}kowski first employed temporal difference learning
to train the 2048 AI, where afterstate values were
approximated with N-tuple networks \cite{szubert2014temporal}.
Wu et al. proposed multi-stage TD learning incorporating shallow
expectimax search to improve the performance \cite{wu2014multi}.
In 2016, sets of N-tuple network combinations yielding better performance
were selected by local greedy search
\cite{oka2016systematic,matsuzaki2016systematic}.
In 2017, Ja{\'s}kowski employed temporal coherence learning
with multi-stage weight promotion, redundant encoding,
and carousel shaping to master the game \cite{jaskowski2017mastering}.
Matsuzaki developed backward temporal coherence learning
and restart strategies for fast training \cite{matsuzaki2017developing}.
In 2022, Guei et al. introduced optimistic initialization
to encourage exploration in 2048, improving the learning
quality \cite{guei2021optimistic}.
Optimistic initialization is equivalent to
potential-based reward shaping with a static potential function
\cite{wiewiora2003potential,devlin2012dynamic}.
In addition, neural network approximators have been developed
for training 2048 AI
\cite{kondo2019playing,matsuzaki2020further,matsuzaki2021developing,bangole2023game}.
We observe a notable phenomenon in 2048 AI studies:
none of these works employ explicit exploration strategies
such as softmax or $\epsilon$-greedy to prevent reinforcement learning
methods from getting stuck in local optima.
Szubert and Ja{\'s}kowski said
``The exploration is
not needed in this game, as the environment is inherently
stochastic and thus provides sufficiently diversified experience'',
and they found that experiments with $\epsilon$-greedy exploration
did not improve the performance \cite{szubert2014temporal}.
We argue that in 2048 AI training it is not that
exploration is unnecessary, but rather that exploration cannot be carried
out effectively with softmax or $\epsilon$-greedy strategies.
% \begin{figure}
% \centering
% \includegraphics[width=2.5in][Maze]{pic/maze-eps-greedy}
% \includegraphics[width=2.5in][2048 Game]{pic/2048epsilon-greedy}
% \caption{Comparison of returns of $\epsilon$-greedy strageties.}
% \label{fig1}
% \end{figure}
\begin{figure*}[!t]
\centering
\subfloat[2048 Game]{\includegraphics[width=3in]{pic/2048epsilon-greedy}%
\label{fig_second_case}}
\hfil
\subfloat[Maze]{\includegraphics[width=3in]{pic/maze-eps-greedy}%
\label{fig_first_case}}
\caption{Comparison of returns of $\epsilon$-greedy strategies.}
\label{fig_sim}
\end{figure*}
To validate the above point, we designed two sets of experiments,
one with 2048 and the other with a maze.
In the experiments, we used nearly optimal value functions
combined with an $\epsilon$-greedy exploration strategy,
testing the average score and standard deviation obtained
for each value of $\epsilon\in\{0, 0.001, 0.002, 0.004,
0.008, 0.016, 0.032, 0.064, 0.128, 0.256, 0.512\}$.
In the 2048 game, the value function is an N-tuple network
trained with optimistic initialization \cite{guei2021optimistic},
achieving an average score of \highlight{300,000}.
In the maze game, the optimal value function is used,
with the optimal policy achieving a score of \highlight{$-58$} points.
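For reference, the $\epsilon$-greedy behavior policy used in both experiments can be sketched as follows (a minimal sketch; the function and variable names are illustrative and not taken from the actual experiment code):
\begin{verbatim}
import numpy as np

def epsilon_greedy(action_values, epsilon, rng=np.random.default_rng()):
    """Return a random legal action with probability epsilon, else the greedy one.

    `action_values` maps each legal action to its estimated value, e.g. the
    N-tuple afterstate value in 2048 or the optimal action value in the maze.
    """
    actions = list(action_values)
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]     # explore
    return max(actions, key=action_values.get)         # exploit

# Values of epsilon swept in the experiments.
EPSILONS = [0, 0.001, 0.002, 0.004, 0.008, 0.016, 0.032,
            0.064, 0.128, 0.256, 0.512]
\end{verbatim}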
As shown in Figure \ref{fig_sim},
the x-axis represents $\epsilon$,
the y-axis represents the average score per game,
and the shaded area represents the standard deviation.
We find that in the 2048 game the average score
decreases sharply as $\epsilon$ increases,
whereas in the maze game the average score
decreases only gradually with increasing $\epsilon$.
The comparison in this set of experiments indicates that
the $\epsilon$-greedy exploration strategy in the maze game still
results in the behavioral policy being an $\epsilon$-greedy policy,
while in the 2048 game, the behavioral policy is no
longer an $\epsilon$-greedy policy.
This raises a natural question:
what are the fundamental differences between the
2048 game and maze games?
In a maze game, when the agent deviates from
the optimal path during exploration,
it can immediately return to the optimal path,
while in the 2048 game, when the agent deviates
from the optimal state, it may never have the
chance to return to the previous state.
This relates to the ergodicity of the game.
In this paper, we prove that the game 2048 is non-ergodic.
\begin{figure}[!t]
\centering
\scalebox{0.9}{
\begin{tikzpicture}
\node[draw, rectangle, fill=gray!50] (DEAD) at (-2,0) ;
\node[draw, rectangle, fill=gray!50] (DEAD2) at (10,0) ;
\node[draw, circle] (A) at (0,0) {A};
\node[draw, circle] (B) at (2,0) {B};
\node[draw, circle] (C) at (4,0) {C};
\node[draw, rectangle, fill=gray!50] (DEAD) at (0,0) {T$_1$};
\node[draw, rectangle, fill=gray!50] (DEAD2) at (9,0) {T$_2$};
\node[draw, circle] (A) at (1.5,0) {A};
\node[draw, circle] (B) at (3,0) {B};
\node[draw, circle] (C) at (4.5,0) {C};
\node[draw, circle] (D) at (6,0) {D};
\node[draw, circle] (E) at (8,0) {E};
\node[draw, circle] (E) at (7.5,0) {E};
\draw[->] (A) -- (DEAD);
\draw[->] (B) -- (A);
\draw[->] (B) to [bend left=30] (C);
\draw[->] (C) to [bend left=30] (B);
\draw[->] (C) to [bend left=30] (D);
\draw[->] (D) to [bend left=30] (C);
\draw[->] (D) -- (E);
\draw[->] (E) -- (DEAD2);
\draw[->] (A) -- node {0.5} (DEAD);
\draw[->] (A) to [bend left=30] node {0.5} (B);
\draw[->] (B) to [bend left=30] node {0.5} (A);
\draw[->] (B) to [bend left=30] node {0.5} (C);
\draw[->] (C) to [bend left=30] node {0.5} (B);
\draw[->] (C) to [bend left=30] node {0.5} (D);
\draw[->] (D) to [bend left=30] node {0.5} (C);
\draw[->] (D) to [bend left=30] node {0.5} (E);
\draw[->] (E) to [bend left=30] node {0.5} (D);
\draw[->] (E) -- node {0.5} (DEAD2);
\draw[->] ([yshift=4ex]C.north) -- ([yshift=4.5ex]C.south);
\end{tikzpicture}
}
\caption{Random walk with absorbing states.}
\label{randomwalk}
\end{figure}
\begin{figure}[!t]
\centering
\scalebox{0.9}{
\begin{tikzpicture}
\node[draw, circle] (A) at (0,0) {A};
\node[draw, circle] (B) at (2,0) {B};
\node[draw, circle] (C) at (4,0) {C};
\node[draw, rectangle, fill=gray!50] (DEAD) at (0,0) {T$_1$};
\node[draw, rectangle, fill=gray!50] (DEAD2) at (9,0) {T$_2$};
\node[draw, circle] (A) at (1.5,0) {A};
\node[draw, circle] (B) at (3,0) {B};
\node[draw, circle] (C) at (4.5,0) {C};
\node[draw, circle] (D) at (6,0) {D};
\node[draw, circle] (E) at (8,0) {E};
\node[draw, circle] (E) at (7.5,0) {E};
\draw[->] (A.north) to [bend left=30] (C.north);
\draw[->] (B) -- (A);
\draw[->] (B) to [bend left=30] (C);
\draw[->] (C) to [bend left=30] (B);
\draw[->] (C) to [bend left=30] (D);
\draw[->] (D) to [bend left=30] (C);
\draw[->] (D) -- (E);
\draw[->] (E.south) to [bend left=30] (C.south);
\draw[->] (DEAD.south) to [bend right=30] node {1} (C.south);
\draw[->] (A) -- node {0.5} (DEAD);
\draw[->] (A) to [bend left=30] node {0.5} (B);
\draw[->] (B) to [bend left=30] node {0.5} (A);
\draw[->] (B) to [bend left=30] node {0.5} (C);
\draw[->] (C) to [bend left=30] node {0.5} (B);
\draw[->] (C) to [bend left=30] node {0.5} (D);
\draw[->] (D) to [bend left=30] node {0.5} (C);
\draw[->] (D) to [bend left=30] node {0.5} (E);
\draw[->] (E) to [bend left=30] node {0.5} (D);
\draw[->] (E) -- node {0.5} (DEAD2);
\draw[->] (DEAD2.south) to [bend left=30] node {1} (C.south);
\draw[->] ([yshift=4ex]C.north) -- ([yshift=4.5ex]C.south);
\end{tikzpicture}
}
\caption{Random walk with restarts.}
\label{randomwalkRestart}
\end{figure}
......@@ -19,6 +19,27 @@
year={1979},
publisher={IEEE}
}
@article{wiewiora2003potential,
title={Potential-based shaping and Q-value initialization are equivalent},
author={Wiewiora, Eric},
journal={Journal of Artificial Intelligence Research},
volume={19},
pages={205--208},
year={2003}
}
@inproceedings{devlin2012dynamic,
title={Dynamic potential-based reward shaping},
author={Devlin, Sam Michael and Kudenko, Daniel},
booktitle={11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2012)},
pages={433--440},
year={2012},
organization={IFAAMAS}
}
@misc{abdelkader20152048,
title={2048 is NP-Complete},
author={Abdelkader, Ahmed and Acharya, Aditya and Dasler, Philip},
year={2015},
}
@incollection{bangole2023game,
title={Game Playing (2048) Using Deep Neural Networks},
author={Bangole, Narendra Kumar Rao and Moulya, RB and Pranthi, R and Reddy, Sreelekha and Namratha, R},
......@@ -50,9 +71,17 @@
volume={14},
number={3},
pages={478--487},
year={2021},
year={2022},
publisher={IEEE}
}
@inproceedings{rodgers2014an,
title={An investigation into 2048 AI strategies},
author={Rodgers, Philip and Levine, John},
booktitle={2014 IEEE Conference on Computational Intelligence and Games},
pages={1--2},
year={2014},
organization={IEEE}
}
@inproceedings{szubert2014temporal,
title={Temporal difference learning of n-tuple networks for the game 2048},
author={Szubert, Marcin and Ja{\'s}kowski, Wojciech},
......@@ -131,7 +160,13 @@
year={2021},
publisher={Information Processing Society of Japan}
}
@book{Sutton2018book,
  title={Reinforcement Learning: An Introduction},
  author={Sutton, Richard S. and Barto, Andrew G.},
  edition={Second},
  publisher={The MIT Press},
  year={2018}
}
......