Commit cfdf78c5 by Lenovo

The content is still a bit short.

parent 5c6485a5
......@@ -26,7 +26,9 @@
\usetikzlibrary{automata, positioning}
\usetikzlibrary{positioning}
\usetikzlibrary{decorations.markings}
\usepackage{cuted}
\usepackage{multicol}
% \usepackage{cuted}
% \usepackage{widetext}
\hyphenation{op-tical net-works semi-conduc-tor IEEE-Xplore}
% updated with editorial comments 8/9/2021
\newcommand{\highlight}[1]{\textcolor{red}{#1}}
......@@ -65,11 +67,20 @@ wangwenhao11@nudt.edu.cn).
\maketitle
\begin{abstract}
In reinforcement learning for the 2048 game,
we are intrigued by the absence of successful cases
involving explicit exploration, e.g., $\epsilon$-greedy or
softmax.
Through experiments comparing the 2048 game and a maze,
we argue that explicit exploration strategies
cannot be effectively combined with learning in the 2048 game,
and demonstrate the acyclic nature of the 2048 game.
The successful experience of 2048 game AI
will contribute to solving acyclic MDPs.
\end{abstract}
\begin{IEEEkeywords}
Acyclicity, 2048 game, ergodicity, backward learning.
\end{IEEEkeywords}
......
......@@ -44,8 +44,8 @@ Q_{\text{bo}}\dot{=}\begin{tiny}\left[ \begin{array}{cccccccccccc}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}\right] \end{tiny}
\]
Then,
\begin{strip}
Then, $N_{\text{bo}}$ is given by (\ref{nbo}).
\begin{figure*}
\begin{equation}
\begin{split}
N_{\text{bo}}=&(I_{12}-Q_{\text{bo}})^{-1}\\
......@@ -65,8 +65,9 @@ N_{\text{bo}}=&(I_{12}-Q_{\text{bo}})^{-1}\\
\end{array}\right]
\end{tiny}
\end{split}
\label{nbo}
\end{equation}
\end{strip}
\end{figure*}
Based on Definition \ref{definition3},
the Boyan chain
is acyclic between non-absorbing states.
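This can also be read off from the fundamental matrix itself; as a sketch, assuming the transient states are ordered along the chain so that $Q_{\text{bo}}$ is strictly upper triangular and hence nilpotent ($Q_{\text{bo}}^{12}=0$),
\[
N_{\text{bo}}=(I_{12}-Q_{\text{bo}})^{-1}=\sum_{k=0}^{11} Q_{\text{bo}}^{k},
\]
so every entry of $N_{\text{bo}}$ is a finite sum over forward-only paths, and no non-absorbing state can be revisited.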
......
\section{Discussions}
The maze game is cyclic between non-absorbing states.
When an agent deviates from the optimal
path during exploration, it can quickly adjust back.
Therefore, in a maze, a policy
combined with $\epsilon$-greedy exploration
remains an $\epsilon$-greedy policy.
However, the 2048 game is acyclic between non-absorbing states.
Any erroneous choice made during exploration persists until the end of the game.
Therefore, in the 2048 game, a policy
combined with $\epsilon$-greedy exploration
is no longer an $\epsilon$-greedy policy.
This is why, in AI training for the 2048 game,
explicit exploration strategies such as
$\epsilon$-greedy and softmax do not work;
exploration can only be encouraged through optimistic initialization.
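For concreteness, the following is a minimal sketch of the two exploration mechanisms; the environment interface (its action, reward, and afterstate methods) is an illustrative assumption, not the agent used in our experiments.
\begin{verbatim}
import random
from collections import defaultdict

EPSILON = 0.1
OPTIMISTIC_INIT = 1e5  # large initial value makes unvisited afterstates attractive

# value table with optimistic initialization
V = defaultdict(lambda: OPTIMISTIC_INIT)

def greedy_action(env, s):
    # pick the move whose immediate reward plus afterstate value is highest
    return max(env.actions(s),
               key=lambda a: env.reward(s, a) + V[env.afterstate(s, a)])

def epsilon_greedy_action(env, s):
    # explicit exploration: with probability EPSILON take a random move;
    # in an acyclic game such as 2048 this error cannot be walked back
    if random.random() < EPSILON:
        return random.choice(env.actions(s))
    return greedy_action(env, s)
\end{verbatim}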
As early as 1996, for large acyclic domains,
Boyan and Moore proposed a
backward algorithm,
ROUT, with function approximation
to improve learning
\cite{boyan1996learning}.
In 2017, Matsuzaki pointed out that
the 2048 game has two important unique
characteristics compared with conventional
board games: (1) ``It has a long sequence of moves'';
(2) ``The difficulty increases toward the end of the game''
\cite{matsuzaki2017developing}.
He then applied backward learning and restart to improve
learning.
We argue that the acyclic nature of
the 2048 game leads to the efficient performance of backward learning.
Finally, MDPs with acyclic structures can benefit from
the algorithmic insights that have led to the success of 2048 AI.
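As a schematic illustration (our own sketch, not Boyan and Moore's ROUT or Matsuzaki's implementation), backward learning on an acyclic MDP can be organized as a single sweep in reverse topological order, so that every successor value is already final when a state is updated.
\begin{verbatim}
def backward_values(states, transitions, reward):
    # states: non-absorbing states in topological order (earliest first)
    # transitions[s][a]: list of (probability, next_state) pairs;
    # absorbing states are not in V and contribute value 0
    V = {}
    for s in reversed(states):  # late-game states first
        V[s] = max(
            sum(p * (reward(s, a, s2) + V.get(s2, 0.0))
                for p, s2 in transitions[s][a])
            for a in transitions[s]
        )
    return V
\end{verbatim}
Because each state is visited exactly once and all of its successors are evaluated beforehand, one pass suffices, which is the structural advantage that backward learning exploits in the 2048 game.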
......