XingguoChen / 20240414IEEETG
Commit 00d7d8d4, authored Jun 03, 2024 by Lenovo
Ready to submit (准备投稿了)
parent 2e608808
Showing 7 changed files with 36 additions and 34 deletions (+36, -34):
  document.tex             +4  -3
  main/2048isAcyclic.tex   +5  -5
  main/acyclic.tex         +2  -2
  main/background.tex      +8  -8
  main/biography.tex       +4  -3
  main/discussion.tex      +4  -4
  main/introduction.tex    +9  -9
document.tex
...
...
@@ -71,12 +71,13 @@ In reinforcement learning of 2048 game,
we are intrigued by the absence of successful cases
involving explicit exploration, e.g., $\epsilon$-greedy, softmax.
-Through experiments comparing 2048 game and maze,
+Through experiments comparing the 2048 game and maze,
we argue that explicit exploration strategies
cannot be effectively combined to learn in the 2048 game,
and demonstrate the acyclic nature of the 2048 game.
-The successful experiences in 2048 game AI
-will contribute to solving acyclic MDPs.
+The successful experiences in the 2048 game AI
+will contribute to solving acyclic MDPs and
+MDPs with acyclic structures.
\end{abstract}
\begin{IEEEkeywords}
...
...
main/2048isAcyclic.tex
...
...
@@ -15,12 +15,12 @@ The 2048 game consists of a 4$\times$4 grid board, totaling 16 squares.
a tile with the sum of the original numbers.
Each tile can only participate in one merge operation per move.
After each move, a new tile appears on a random empty square.
-The new tile is 2 with
-probability 0.1, and 4 with probability 0.9.
+The new tile is 2 with probability 0.1, and 4 with probability 0.9.
The game ends when all squares are filled, and no valid merge operations can be made.
\begin{theorem}
-2048 game is acyclic between non-absorbing states.
+The 2048 game is acyclic between non-absorbing states.
\end{theorem}
\begin{IEEEproof}
To apply Theorem \ref{judgmentTheorem}, what we need
...
...
@@ -64,7 +64,7 @@ u(B) = 2^{64}\cdot sum(B)+long(B).
It is easy to verify that $\forall B_1, B_2 \in \mathcal{B}$,
if $B_1 \neq B_2$, then $u(B_1) \neq u(B_2)$.
-For all possible board,
+For all possible boards,
$\forall B \in \mathcal{B}$, calculate the utility value $u(B)$,
and sort $B$ by $u(B)$ in ascending order.
Let $I(B)$ be the index of the board $B$ after sorting,
...
...
@@ -77,7 +77,7 @@ For all possible board,
For any transition $\langle B_1, a, B_1', B_2 \rangle$ in the 2048 game,
we have $sum(B_1)=sum(B_1')$ regardless of whether at least two tiles merge.
-Due to a new generated 2-tile or 4-tile in board $B_2$,
+Due to a newly generated 2-tile or 4-tile in board $B_2$,
$sum(B_2)>sum(B_1')$, that is $sum(B_2)>sum(B_1)$.
Figure \ref{2048merge1} and Figure \ref{2048merge2} show transition
examples with and without a merge.
...
...
@@ -103,7 +103,7 @@ examples with and without a merge.
Based on (\ref{size}) and (\ref{utility}),
we have $u(B_2)>u(B_1)$.
That means $I(B_2)>I(B_1)$.
-The transition probability between non-absorbing state satisifies (\ref{condition}),
+The transition probability between non-absorbing state satisfies (\ref{condition}),
the claim follows by applying Theorem \ref{judgmentTheorem}.
\end{IEEEproof}
...
...
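The proof above rests on two facts about a transition $\langle B_1, a, B_1', B_2\rangle$: a move conserves the tile sum (two merged tiles become one tile carrying their sum), and the newly spawned 2- or 4-tile then strictly increases it, which is what forces $u(B_2)>u(B_1)$. A minimal Python sketch of that argument follows; it is illustrative only and not code from this repository, and merge_row_left is a hypothetical helper for a single leftward row move.

import random

def merge_row_left(row):
    # Slide one row to the left, merging equal neighbours at most once per move.
    tiles = [v for v in row if v != 0]
    merged, i = [], 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)   # two tiles fuse into their sum
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))

board = [[2, 2, 4, 0], [0, 4, 4, 0], [2, 0, 0, 2], [0, 0, 0, 8]]   # B_1
moved = [merge_row_left(r) for r in board]                          # afterstate B_1'
assert sum(map(sum, board)) == sum(map(sum, moved))                 # sum(B_1) == sum(B_1')

# Spawning a 2- or 4-tile on an empty square strictly increases the sum,
# so sum(B_2) > sum(B_1), and the utility/index ordering in the proof follows.
i, j = random.choice([(r, c) for r in range(4) for c in range(4) if moved[r][c] == 0])
moved[i][j] = random.choice([2, 4])                                 # next state B_2
assert sum(map(sum, moved)) > sum(map(sum, board))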
main/acyclic.tex
...
...
@@ -79,7 +79,7 @@ it is easy to provide a sufficient condition for acyclicity between non-absorbin
\label{judgmentTheorem}
Given a Markov chain with absorbing states,
suppose the size of the non-absorbing states $|S\setminus\{\text{T}\}| \geq 2$.
-If the transition matrix $Q$ between non-absorbing states satifies,
+If the transition matrix $Q$ between non-absorbing states satisfies,
\begin{equation}
\forall i,j \in S\setminus\{\text{T}\}, Q_{i,j} = \begin{cases}
\geq 0, & \text{if } i \leq j; \\
...
...
@@ -96,7 +96,7 @@ Furthermore, the sum of two upper triangular matrices
is still an upper triangular matrix.
Based on Definition \ref{definitionN}, $N\dot{=}\sum_{i=0}^{\infty}Q^i$,
-the $N$ matrix is product and sum of upper triangular matrices.
+the $N$ matrix is the product and sum of upper triangular matrices.
Then, the $N$ matrix is an upper triangular matrix.
The claim now follows based on Definition \ref{definition3}.
...
...
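The step above (products and sums of upper triangular matrices stay upper triangular, hence $N\dot{=}\sum_{i=0}^{\infty}Q^i$ is upper triangular) can be checked numerically. The sketch below is not repository code; the 4-state $Q$ and its probabilities are made up for illustration, with the leftover mass in each row going to the absorbing state.

import numpy as np

Q = np.array([[0.0, 0.3, 0.2, 0.1],    # hypothetical transition probabilities
              [0.0, 0.0, 0.4, 0.3],    # between non-absorbing states,
              [0.0, 0.0, 0.0, 0.5],    # zero below the diagonal
              [0.0, 0.0, 0.0, 0.0]])

N = np.linalg.inv(np.eye(4) - Q)       # equals the (here finite) series sum_i Q^i
assert np.allclose(N, np.triu(N))      # N stays upper triangular
print(np.round(N, 3))                  # N[i, j] = 0 for i > j: no path back to earlier states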
main/background.tex
\section{Background}
\subsection{Ergodicity and Non-ergodicity of Markov Chains}
-Consider Markov decision process (MDP)
+Consider a Markov decision process (MDP)
$\langle\mathcal{S}$, $\mathcal{A}$, $\mathcal{R}$, $\mathcal{T}$$\rangle$, where
$\mathcal{S}=\{1, 2, 3, \ldots\}$ is a finite state space, $|\mathcal{S}|=n$,
$\mathcal{A}$ is an action space,
$\mathcal{T}: \mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$
...
...
@@ -44,7 +44,7 @@ $d_{\pi}(s)>0$.
-This mean all states are reachable under any policy from the
+This means all states are reachable under any policy from the
current state after sufficiently many steps \cite{majeed2018q}.
A sufficient condition for this assumption is that
1 is a simple eigenvalue of the matrix $P_{\pi}$ and
...
...
@@ -56,11 +56,11 @@ all other eigenvalues of $P_{\pi}$ are of modulus <1.
\input{pic/randomWalk}
Random walk, see Figure \ref{randomwalk},
-is a Markov chain, where
-agent starts from node C,
-and takes
+is a Markov chain, where
+the agent starts from node C
+and takes
a probability of 0.5 to move left or right, until
reaching the leftmost or rightmost node where it terminates.
The terminal states are usually called absorbing states.
-The transition probobility matrix
+The transition probability matrix
of random walk with absorbing states $P^{\text{ab}}$
is defined as follows:
\[
...
...
@@ -86,7 +86,7 @@ the distribution $d^{\text{ab}}=\{1$,
When encountering an absorbing state, we immediately reset and
transition to the initial states. Figure \ref{randomwalkRestart}
is random walk with restarts.
-The transition probobility matrix
+The transition probability matrix
of random walk with restarts $P^{\text{restart}}$
is defined as follows:
\[
...
...
@@ -104,7 +104,7 @@ P^{\text{restart}}\dot{=}\begin{array}{c|ccccccc}
According to (\ref{invariance}),
the distribution $d^{\text{restart}}=\{0.1$, $0.1$, $0.2$, $0.3$, $0.2$, $0.1\}$.
-Since the probabilities of T, A, B, C, D, E are non-zeros,
+Since the probabilities of T, A, B, C, D, and E are non-zeros,
random walk with restarts is ergodic.
\subsection{Ergodicity between non-absorbing states}
...
...
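The quoted distribution can be re-derived numerically. The sketch below is not repository code; it assumes the layout suggested by Figure \ref{randomwalkRestart}: states ordered T, A, B, C, D, E, both ends of the walk feeding the single state T, and T restarting to C with probability 1.

import numpy as np

P_restart = np.array([
    # T    A    B    C    D    E
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # T restarts to C
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0],  # A -> T or B
    [0.0, 0.5, 0.0, 0.5, 0.0, 0.0],  # B -> A or C
    [0.0, 0.0, 0.5, 0.0, 0.5, 0.0],  # C -> B or D
    [0.0, 0.0, 0.0, 0.5, 0.0, 0.5],  # D -> C or E
    [0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # E -> D or T
])

# Stationary distribution d = dP: the left eigenvector for eigenvalue 1.
vals, vecs = np.linalg.eig(P_restart.T)
d = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
d /= d.sum()
print(np.round(d, 3))  # [0.1, 0.1, 0.2, 0.3, 0.2, 0.1]; every entry is non-zero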
@@ -133,7 +133,7 @@ where $Q$ is the matrix of transition probabilities between
\label{definitionN}
\end{equation}
where $I_{n-1}$ is the $(n-1)\times(n-1)$ identity matrix.
-$N$ is a transitive closure, and a reachability relation.
+$N$ is a transitive closure and a reachability relation.
From state $i$, it is possible to reach state $j$ in an
expected number of steps $N_{ij}$.
$N_{ij}=0$ means that state $i$ is not reachable to state $j$.
...
...
@@ -198,7 +198,7 @@ N^{\text{ab}}=(I_5-Q^{\text{ab}})^{-1}=\begin{array}{c|ccccc}
\end{array}
\]
-Bases on Definition \ref{definition2},
+Based on Definition \ref{definition2},
random walk with absorbing states
is ergodic between non-absorbing states.
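A matching check for the hunk above, again a sketch rather than repository code: for the five non-absorbing states A-E of the random walk, every entry of $N^{\text{ab}}=(I_5-Q^{\text{ab}})^{-1}$ comes out strictly positive, which is the mutual-reachability property the conclusion appeals to.

import numpy as np

# Q^{ab}: 0.5 left / 0.5 right between the non-absorbing states A..E;
# a move off either end goes to the absorbing state and leaves this block.
Q_ab = np.zeros((5, 5))
for i in range(5):
    if i > 0:
        Q_ab[i, i - 1] = 0.5
    if i < 4:
        Q_ab[i, i + 1] = 0.5

N_ab = np.linalg.inv(np.eye(5) - Q_ab)
print(np.round(N_ab, 2))   # expected visit counts N^{ab}_{ij}
assert (N_ab > 0).all()    # every non-absorbing state can reach every other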
main/biography.tex
...
...
@@ -35,11 +35,11 @@ Game AI.
\vspace{-33pt}
\begin{IEEEbiography}[{\includegraphics[width=1in,height=1.25in,clip,keepaspectratio]{pic/wangwh}}]{Wenhao Wang}
is a lecturer in the College of
-Electronic Engineering at National University of
+Electronic Engineering at the National University of
Defense Technology. He received the Ph.D degree
-in Military Science from National University of
+in Military Science from the National University of
Defense Technology, Hunan, China, in 2023.
His interested research directions include network security,
-penetration testing, game theory and reinforcement
+penetration testing, game theory, and reinforcement
learning.
\end{IEEEbiography}
\ No newline at end of file
main/discussion.tex
\section{Discussions}
-Maze game is cyclic between non-absorbing states.
+The maze game is cyclic between non-absorbing states.
When an agent deviates from the optimal
path during exploration, it can quickly adjust back.
Therefore, in a maze, a policy combined with an $\epsilon$-greedy exploration
remains an $\epsilon$-greedy policy.
-However, 2048 game is acyclic between non-absorbing states.
+However, the 2048 game is acyclic between non-absorbing states.
Any choice error caused by exploration will persist until the end of the game.
Therefore, in the 2048 game, a policy combined with an $\epsilon$-greedy exploration
...
...
@@ -25,8 +25,8 @@ ROUT with function approximations
to improve learning \cite{boyan1996learning}.
-In 2017, Matsuzaki point out that
-2048 game has two important unique
+In 2017, Matsuzaki pointed out that
+the 2048 game has two important unique
characteristics compared with conventional
board games: (1) ``It has a long sequence of moves'';
(2) ``The difficulty increases toward the end of the game''
...
...
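For concreteness, the exploration scheme being discussed is the standard $\epsilon$-greedy rule; the snippet below is a generic textbook sketch, not taken from this repository's training code. The point of the discussion is that the occasional random action it takes is cheap to recover from in the cyclic maze but persists to the end of the episode in the acyclic 2048 game.

import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    # With probability epsilon explore uniformly; otherwise act greedily.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values.get(a, 0.0))

q = {"up": 1.2, "down": 0.4, "left": 0.9, "right": 0.1}   # illustrative values
print(epsilon_greedy(q, list(q)))  # usually "up", occasionally a random move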
main/introduction.tex
...
...
@@ -2,10 +2,10 @@
\IEEEPARstart{G}{ame} 2048 is a popular single-player sliding block puzzle game,
where the game is played on a 4$\times$4 grid, the player
can move the tiles in four directions - up, down, left, and right,
-and the objective is to reach 2048 tile or higher tile.
+and the objective is to reach a 2048 tile or higher tile.
While the game is simple to understand, it requires strategic
thinking and planning to reach the 2048 tile.
-Natural decision problems in 2048 is proved to be NP-Complete
+Natural decision problems in 2048 are proved to be NP-Complete
\cite{abdelkader20152048}.
2048 has gained widespread popularity due to its addictive
gameplay and simple mechanics, making it a favorite
...
...
@@ -20,21 +20,21 @@ to train the 2048 AI, where afterstate values were
approximated with N-tuple networks \cite{szubert2014temporal}.
Wu et al. proposed multi-stage TD learning incorporating shallow
expectimax search to improve the performance \cite{wu2014multi}.
-In 2016, a set of N-tuple network combinations that yield a better performance
-is selected by local greedy search
+In 2016, a set of N-tuple network combinations that yielded a better performance
+was selected by a local greedy search
\cite{oka2016systematic,matsuzaki2016systematic}.
In 2017, Ja{\'s}kowski employed temporal coherence learning
-with multi-state weight promotion, redudant encoding
+with multi-state weight promotion, redundant encoding
and carousel shaping for \cite{jaskowski2017mastering}.
-Matsuzaki developped backward temporal coherence learning
+Matsuzaki developed backward temporal coherence learning
and restart for fast training \cite{matsuzaki2017developing}.
In 2022, Guei et al. introduced optimistic initialization
-to encourage exploration for 2048, and improved the learning
+to encourage exploration for 2048, and improve the learning
quality \cite{guei2021optimistic}.
The optimistic initialization is equivalent to
a static potential function in reward shaping
\cite{wiewiora2003potential,devlin2012dynamic}.
-Besides, neural network approximators are developped
+Besides, neural network approximators are developed
for training 2048 AI
\cite{kondo2019playing,matsuzaki2020further,matsuzaki2021developing,bangole2023game}.
...
...
@@ -50,7 +50,7 @@ and they found that experiments with $\epsilon$-greedy exploration
did not improve the performance \cite{szubert2014temporal}.
We argue that in 2048 AI training, it's not that
-exploaration is not needed, but rather that it cannot be explored with
+exploration is not needed, but rather that it cannot be explored with
softmax or $\epsilon$-greedy strategies.
% \begin{figure}
...
...