Commit cfdf78c5 by Lenovo

The content is still a bit short.

parent 5c6485a5
......@@ -26,7 +26,9 @@
\usetikzlibrary{automata, positioning}
\usetikzlibrary{positioning}
\usetikzlibrary{decorations.markings}
\usepackage{cuted}
\usepackage{multicol}
% \usepackage{cuted}
% \usepackage{widetext}
\hyphenation{op-tical net-works semi-conduc-tor IEEE-Xplore}
% updated with editorial comments 8/9/2021
\newcommand{\highlight}[1]{\textcolor{red}{#1}}
......@@ -65,11 +67,20 @@ wangwenhao11@nudt.edu.cn).
\maketitle
\begin{abstract}
In reinforcement learning for the 2048 game,
we are intrigued by the absence of successful cases
involving explicit exploration, e.g., $\epsilon$-greedy or
softmax.
Through experiments comparing the 2048 game and a maze,
we argue that explicit exploration strategies
cannot be effectively combined with learning in the 2048 game,
and demonstrate the acyclic nature of the 2048 game.
The successful experience of 2048 game AI
will contribute to solving acyclic MDPs.
\end{abstract}
\begin{IEEEkeywords}
Acyclicity, 2048 game, ergodicity, backward learning.
\end{IEEEkeywords}
......
......@@ -44,8 +44,8 @@ Q_{\text{bo}}\dot{=}\begin{tiny}\left[ \begin{array}{cccccccccccc}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}\right] \end{tiny}
\]
Then,
\begin{strip}
Then, $N_{\text{bo}}$ is given by (\ref{nbo}).
\begin{figure*}
\begin{equation}
\begin{split}
N_{\text{bo}}=&(I_{12}-Q_{\text{bo}})^{-1}\\
......@@ -65,8 +65,9 @@ N_{\text{bo}}=&(I_{12}-Q_{\text{bo}})^{-1}\\
\end{array}\right]
\end{tiny}
\end{split}
\label{nbo}
\end{equation}
\end{strip}
\end{figure*}
Based on Definition \ref{definition3},
the Boyan chain
is acyclic between non-absorbing states.
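This can also be read off from the fundamental matrix itself; as a sketch, assuming the transient states are ordered along the chain so that $Q_{\text{bo}}$ is strictly upper triangular and hence nilpotent ($Q_{\text{bo}}^{12}=0$),
\[
N_{\text{bo}}=(I_{12}-Q_{\text{bo}})^{-1}=\sum_{k=0}^{11} Q_{\text{bo}}^{k},
\]
so every entry of $N_{\text{bo}}$ is a finite sum over forward-only paths, and no non-absorbing state can be revisited.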
......
\section{Discussions}
The maze game is cyclic between non-absorbing states.
When an agent deviates from the optimal
path during exploration, it can quickly adjust back.
Therefore, in a maze, a policy
combined with $\epsilon$-greedy exploration
remains an $\epsilon$-greedy policy.
However, the 2048 game is acyclic between non-absorbing states.
Any erroneous choice made during exploration persists until the end of the game.
Therefore, in the 2048 game, a policy
combined with $\epsilon$-greedy exploration
is no longer an $\epsilon$-greedy policy.
This is why, in AI training for the 2048 game,
explicit exploration strategies such as
$\epsilon$-greedy and softmax do not work;
exploration can only be encouraged through optimistic initialization.
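For concreteness, the following is a minimal sketch of the two exploration mechanisms; the environment interface (its action, reward, and afterstate methods) is an illustrative assumption, not the agent used in our experiments.
\begin{verbatim}
import random
from collections import defaultdict

EPSILON = 0.1
OPTIMISTIC_INIT = 1e5  # large initial value makes unvisited afterstates attractive

# value table with optimistic initialization
V = defaultdict(lambda: OPTIMISTIC_INIT)

def greedy_action(env, s):
    # pick the move whose immediate reward plus afterstate value is highest
    return max(env.actions(s),
               key=lambda a: env.reward(s, a) + V[env.afterstate(s, a)])

def epsilon_greedy_action(env, s):
    # explicit exploration: with probability EPSILON take a random move;
    # in an acyclic game such as 2048 this error cannot be walked back
    if random.random() < EPSILON:
        return random.choice(env.actions(s))
    return greedy_action(env, s)
\end{verbatim}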
As early as 1996, for large acyclic domains,
Boyan and Moore proposed a
backward algorithm,
ROUT, with function approximation
to improve learning
\cite{boyan1996learning}.
In 2017, Matsuzaki pointed out that
the 2048 game has two important unique
characteristics compared with conventional
board games: (1) ``It has a long sequence of moves'';
(2) ``The difficulty increases toward the end of the game''
\cite{matsuzaki2017developing}.
He then applied backward learning and restart to improve
learning.
We argue that the acyclic nature of
the 2048 game leads to the efficient performance of backward learning.
Finally, MDPs with acyclic structures can benefit from
the algorithmic insights that have led to the success of 2048 AI.
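As a schematic illustration (our own sketch, not Boyan and Moore's ROUT or Matsuzaki's implementation), backward learning on an acyclic MDP can be organized as a single sweep in reverse topological order, so that every successor value is already final when a state is updated.
\begin{verbatim}
def backward_values(states, transitions, reward):
    # states: non-absorbing states in topological order (earliest first)
    # transitions[s][a]: list of (probability, next_state) pairs;
    # absorbing states are not in V and contribute value 0
    V = {}
    for s in reversed(states):  # late-game states first
        V[s] = max(
            sum(p * (reward(s, a, s2) + V.get(s2, 0.0))
                for p, s2 in transitions[s][a])
            for a in transitions[s]
        )
    return V
\end{verbatim}
Because each state is visited exactly once and all of its successors are evaluated beforehand, one pass suffices, which is the structural advantage that backward learning exploits in the 2048 game.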
......