diff --git a/document.tex b/document.tex
index 53cf121..3889fb1 100644
--- a/document.tex
+++ b/document.tex
@@ -26,7 +26,9 @@ \usetikzlibrary{automata, positioning}
 \usetikzlibrary{positioning}
 \usetikzlibrary{decorations.markings}
-\usepackage{cuted}
+\usepackage{multicol}
+% \usepackage{cuted}
+% \usepackage{widetext}
 \hyphenation{op-tical net-works semi-conduc-tor IEEE-Xplore}
 % updated with editorial comments 8/9/2021
 \newcommand{\highlight}[1]{\textcolor{red}{#1}}

@@ -65,11 +67,20 @@ wangwenhao11@nudt.edu.cn).
 \maketitle

 \begin{abstract}
-
+In reinforcement learning for the 2048 game,
+we are intrigued by the absence of successful cases
+involving explicit exploration, e.g., $\epsilon$-greedy or
+softmax.
+Through experiments comparing the 2048 game with a maze,
+we argue that explicit exploration strategies
+cannot be combined effectively with learning in the 2048 game,
+and demonstrate the acyclic nature of the 2048 game.
+The successful experience of 2048 game AI
+will thus contribute to solving acyclic MDPs.
 \end{abstract}

 \begin{IEEEkeywords}
-
+Acyclicity, 2048 game, ergodicity, backward learning.
 \end{IEEEkeywords}

diff --git a/main/acyclic.tex b/main/acyclic.tex
index 319617c..d81a21e 100644
--- a/main/acyclic.tex
+++ b/main/acyclic.tex
@@ -44,8 +44,8 @@ Q_{\text{bo}}\dot{=}\begin{tiny}\left[ \begin{array}{cccccccccccc}
 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
 \end{array}\right] \end{tiny}
 \]
-Then,
-\begin{strip}
+Then, $N_{\text{bo}}$ is given by (\ref{nbo}).
+\begin{figure*}
 \begin{equation}
 \begin{split}
 N_{\text{bo}}=&(I_{12}-Q_{\text{bo}})^{-1}\\
@@ -65,8 +65,9 @@ N_{\text{bo}}=&(I_{12}-Q_{\text{bo}})^{-1}\\
 \end{array}\right]
 \end{tiny}
 \end{split}
+\label{nbo}
 \end{equation}
-\end{strip}
+\end{figure*}
 Bases on Definition \ref{definition3},
 Boyan chain is acyclic between non-absorbing states.

diff --git a/main/discussion.tex b/main/discussion.tex
index 68952af..4a6628d 100644
--- a/main/discussion.tex
+++ b/main/discussion.tex
@@ -1,14 +1,45 @@
 \section{Discussions}
-
-\cite{boyan1996learning}
-
-
-
-
-
-
-
+The maze game is cyclic between non-absorbing states.
+When an agent deviates from the optimal
+path during exploration, it can quickly recover.
+Therefore, in a maze, a policy
+combined with $\epsilon$-greedy exploration
+remains an $\epsilon$-greedy policy.
+
+However, the 2048 game is acyclic between non-absorbing states.
+Any erroneous choice caused by exploration persists until the end of the game.
+Therefore, in the 2048 game, a policy
+combined with $\epsilon$-greedy exploration
+is no longer an $\epsilon$-greedy policy.
+This is why, in AI training for the 2048 game,
+explicit exploration strategies such as
+$\epsilon$-greedy and softmax do not work;
+exploration can only be encouraged through optimistic initialization.
+
+
+As early as 1996, for large acyclic domains,
+Boyan and Moore proposed ROUT,
+a backward algorithm
+with function approximation,
+to improve learning
+\cite{boyan1996learning}.
+
+In 2017, Matsuzaki pointed out that
+the 2048 game has two unique
+characteristics compared with conventional
+board games: (1) ``It has a long sequence of moves'';
+(2) ``The difficulty increases toward the end of the game''
+\cite{matsuzaki2017developing}.
+He then applied backward learning and restart to improve
+learning.
+
+We argue that the acyclic nature of
+the 2048 game is what makes backward learning effective.
+
+
+Finally, MDPs with acyclic structures can benefit from
+the algorithmic insights that have led to the success of 2048 game AI.
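Note on the acyclic.tex hunk above: the acyclicity claim can be checked numerically from the fundamental matrix N_bo = (I_12 - Q_bo)^{-1}. The Python sketch below is illustrative only and not the paper's code; it assumes the standard 13-state Boyan chain (from state i >= 2 the agent moves to i-1 or i-2 with probability 0.5 each, state 1 moves to the absorbing state 0) and reads the acyclicity criterion of Definition 3, which is not shown in this diff, as requiring every diagonal entry of N_bo to equal 1, i.e., no non-absorbing state is ever revisited.

    import numpy as np

    # Illustrative sketch (not from the paper): transient-state transition matrix Q
    # of a 13-state Boyan chain; states 1..12 are non-absorbing, state 0 is absorbing.
    n = 12
    Q = np.zeros((n, n))           # Q[i-1, j-1] = P(next state = j | current state = i)
    for i in range(2, n + 1):
        Q[i - 1, i - 2] = 0.5      # state i -> state i-1
        if i >= 3:
            Q[i - 1, i - 3] = 0.5  # state i -> state i-2
        # for i = 2 the second move lands in the absorbing state 0, so it is not in Q
    # the row of state 1 stays all zeros: it moves straight to the absorbing state,
    # matching the all-zero row of Q_bo shown in the hunk above

    # Fundamental matrix: N[i-1, j-1] is the expected number of visits to state j
    # before absorption when starting from state i.
    N = np.linalg.inv(np.eye(n) - Q)

    # Unit diagonal <=> no non-absorbing state is ever revisited,
    # i.e., the chain is acyclic between non-absorbing states.
    print(np.allclose(np.diag(N), 1.0))   # prints True for this chain

A cyclic chain such as a maze would typically fail this check, since some diagonal entries of its fundamental matrix would exceed 1; this is the quantitative counterpart of the discussion.tex argument that exploration errors in a maze can be undone while those in the 2048 game cannot.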