\section{Discussions}
The maze game is cyclic among non-absorbing states: when an agent deviates from the optimal path during exploration, it can quickly steer back. Therefore, in a maze, a policy combined with $\epsilon$-greedy exploration remains an $\epsilon$-greedy policy. The 2048 game, in contrast, is acyclic among non-absorbing states, so any erroneous choice made during exploration persists until the end of the game. Consequently, in the 2048 game, a policy combined with $\epsilon$-greedy exploration is no longer an $\epsilon$-greedy policy. This is why explicit exploration strategies such as $\epsilon$-greedy and soft-max do not work when training an AI for the 2048 game; exploration can only be encouraged through optimistic initialization.

As early as 1996, Boyan and Moore proposed ROUT, a backward algorithm with function approximation, to improve learning in large acyclic domains \cite{boyan1996learning}. In 2017, Matsuzaki pointed out that the 2048 game has two important characteristics that distinguish it from conventional board games: (1) ``It has a long sequence of moves''; (2) ``The difficulty increases toward the end of the game'' \cite{matsuzaki2017developing}. He then applied backward learning and restart to improve learning. We further argue that it is the acyclic nature of the 2048 game that makes backward learning efficient. Finally, acyclic MDPs (e.g., games like Connect-Four) and MDPs with acyclic structures can benefit from the algorithmic insights that have led to the success of the 2048 AI.
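
To make the role of optimistic initialization concrete, the following minimal sketch (not the implementation used in this work; the environment interface and the constant \texttt{OPTIMISTIC\_INIT} are assumptions for illustration) shows tabular learning that acts purely greedily: untried actions keep their large initial value and are therefore tried without any explicit randomization such as $\epsilon$-greedy or soft-max.

\begin{verbatim}
# Minimal sketch: tabular Q-learning where exploration comes solely
# from optimistic initial values. The env interface (reset/step/
# legal_actions) and OPTIMISTIC_INIT are illustrative assumptions.
from collections import defaultdict

OPTIMISTIC_INIT = 1e5   # larger than any achievable return, so every
                        # untried action looks attractive under greedy play
ALPHA, GAMMA = 0.1, 1.0

def greedy(Q, state, actions):
    # Pure greedy selection: no epsilon, no soft-max.
    return max(actions, key=lambda a: Q[(state, a)])

def train_episode(env, Q):
    state, done = env.reset(), False
    while not done:
        action = greedy(Q, state, env.legal_actions(state))
        next_state, reward, done = env.step(action)
        target = reward if done else reward + GAMMA * max(
            Q[(next_state, a)] for a in env.legal_actions(next_state))
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = next_state

Q = defaultdict(lambda: OPTIMISTIC_INIT)
\end{verbatim}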
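
Similarly, the benefit of backward learning in an acyclic domain can be illustrated by the sketch below, which assumes a hypothetical \texttt{transitions} table with deterministic moves (2048 itself has stochastic tile spawns, but the ordering argument is the same): because every successor of a state is processed before the state itself, a single reverse-topological sweep of the Bellman optimality update already yields exact values.

\begin{verbatim}
# Sketch of a single backward sweep over an acyclic MDP.
# transitions[s][a] = (reward, next_state), next_state = None if terminal.
def backward_values(states_topological, transitions):
    V = {}
    for s in reversed(states_topological):   # successors already solved
        if not transitions.get(s):           # absorbing/terminal state
            V[s] = 0.0
            continue
        V[s] = max(r + (0.0 if s2 is None else V[s2])
                   for r, s2 in transitions[s].values())
    return V
\end{verbatim}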