\section{Introduction}
\IEEEPARstart{G}{ame} 2048 is a popular single-player sliding block puzzle game. It is played on a 4$\times$4 grid, the player can move the tiles in four directions (up, down, left, and right), and the objective is to reach the 2048 tile or a higher one. While the game is simple to understand, it requires strategic thinking and planning to reach the 2048 tile. Natural decision problems in 2048 have been proved to be NP-complete \cite{abdelkader20152048}. 2048 has gained widespread popularity due to its addictive gameplay and simple mechanics, making it a favorite among puzzle game enthusiasts.

In 2014, Rodgers and Levine investigated search methods for 2048 AI strategies, including minimax search, expectimax search, Monte-Carlo tree search, and averaged depth-limited search \cite{rodgers2014an}. Szubert and Ja{\'s}kowski first employed temporal difference (TD) learning to train a 2048 AI, where afterstate values were approximated with N-tuple networks \cite{szubert2014temporal}. Wu et al. proposed multi-stage TD learning incorporating shallow expectimax search to further improve the performance \cite{wu2014multi}. In 2016, N-tuple network combinations that yield better performance were selected by local greedy search \cite{oka2016systematic,matsuzaki2016systematic}. In 2017, Ja{\'s}kowski employed temporal coherence learning with multi-stage weight promotion, redundant encoding, and carousel shaping for 2048 \cite{jaskowski2017mastering}, and Matsuzaki developed backward temporal coherence learning and restart strategies for fast training \cite{matsuzaki2017developing}. In 2022, Guei et al. introduced optimistic initialization to encourage exploration in 2048 and improved the learning quality \cite{guei2021optimistic}; such optimistic initialization is equivalent to a static potential function in reward shaping \cite{wiewiora2003potential,devlin2012dynamic}. In addition, neural network approximators have been developed for training 2048 AI \cite{kondo2019playing,matsuzaki2020further,matsuzaki2021developing,bangole2023game}.

We observe a notable phenomenon in these 2048 AI studies: none of them utilizes explicit exploration strategies, such as softmax or $\epsilon$-greedy, to prevent reinforcement learning methods from getting stuck in local optima. Szubert and Ja{\'s}kowski stated that ``The exploration is not needed in this game, as the environment is inherently stochastic and thus provides sufficiently diversified experience'', and they found that experiments with $\epsilon$-greedy exploration did not improve the performance \cite{szubert2014temporal}. We argue that in 2048 AI training, it is not that exploration is unnecessary, but rather that it cannot be carried out effectively with softmax or $\epsilon$-greedy strategies.

\begin{figure*}[!t]
\centering
\subfloat[2048 Game]{\includegraphics[width=3in]{pic/2048epsilon-greedy2}%
\label{fig_second_case}}
\hfil
\subfloat[Maze]{\includegraphics[width=3in]{pic/maze-eps-greedy2}%
\label{fig_first_case}}
\caption{Comparison of returns under $\epsilon$-greedy strategies in the 2048 game and the maze game.}
\label{fig_sim}
\end{figure*}

To validate the above point, we designed two sets of experiments, one with 2048 and the other with a maze.
In the experiments, we used nearly optimal value functions combined with an $\epsilon$-greedy exploration strategy, and measured the average score and its standard deviation for each $\epsilon\in\{0, 0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.064, 0.128, 0.256, 0.512, 0.6, 0.7, 0.8, 0.9, 1.0\}$. In the 2048 game, the value function is an N-tuple network trained with optimistic initialization \cite{guei2021optimistic}, which achieves an average score of 350,000. In the maze game, the optimal value function is used, and the optimal policy achieves a score of $-54$ points. As shown in Fig.~\ref{fig_sim}, the x-axis represents the exploration parameter $\epsilon$, the y-axis represents the average score per episode, and the shaded area represents the standard deviation. We find that in the 2048 game the total score drops sharply as $\epsilon$ increases, whereas in the maze game the total score decreases only gradually with increasing $\epsilon$. This comparison indicates that the $\epsilon$-greedy exploration strategy in the maze game still results in the behavioral policy being an $\epsilon$-greedy policy, while in the 2048 game the behavioral policy is no longer an $\epsilon$-greedy policy. This sparked our curiosity: what is the fundamental difference between the 2048 game and maze games? In a maze game, when the agent deviates from the optimal path during exploration, it can immediately return to that path; in the 2048 game, when the agent deviates from the optimal state, it may never have the chance to return to the previous state. This difference relates to the ergodicity of the game. In this paper, we prove that the game 2048 is acyclic between non-absorbing states.
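As an informal sketch of why such acyclicity holds, let $\sigma(s)$ denote the sum of all tile values on a board $s$ (the notation $\sigma$ is used here only for illustration). A legal move first slides and merges tiles, which preserves $\sigma$ because two tiles of value $v$ merge into a single tile of value $2v$, and the environment then spawns a new tile of value $2$ or $4$ on an empty cell. Hence, for every transition $s_t \rightarrow s_{t+1}$ between non-absorbing states,
\[
\sigma(s_{t+1}) = \sigma(s_t) + \delta_t, \qquad \delta_t \in \{2, 4\},
\]
so $\sigma$ strictly increases along every trajectory, no non-absorbing state can be revisited, and the transition graph restricted to non-absorbing states contains no cycle.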
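For concreteness, the $\epsilon$-greedy evaluation protocol used to produce Fig.~\ref{fig_sim} can be sketched in Python as follows. This is a minimal sketch rather than our actual implementation: the environment interface (\texttt{reset}, \texttt{step}, \texttt{legal\_actions}, \texttt{afterstate}, \texttt{reward}) and the value function \texttt{value} are hypothetical placeholders.
\begin{verbatim}
import random, statistics

# Hypothetical placeholders: `env` exposes reset/step/
# legal_actions/afterstate/reward; `value` maps an
# afterstate to its estimated value.

EPSILONS = [0, 0.001, 0.002, 0.004, 0.008, 0.016,
            0.032, 0.064, 0.128, 0.256, 0.512,
            0.6, 0.7, 0.8, 0.9, 1.0]

def select_action(env, state, value, eps):
    # Random legal action with probability eps,
    # otherwise greedy w.r.t. reward + afterstate value.
    actions = env.legal_actions(state)
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a:
               env.reward(state, a)
               + value(env.afterstate(state, a)))

def evaluate(env, value, eps, episodes=1000):
    # Mean and standard deviation of episode returns
    # under the eps-greedy behavioral policy.
    returns = []
    for _ in range(episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            a = select_action(env, state, value, eps)
            state, reward, done = env.step(a)
            total += reward
        returns.append(total)
    return (statistics.mean(returns),
            statistics.pstdev(returns))

# Sweeping eps over EPSILONS and plotting the returned
# mean and standard deviation yields curves in the
# format of the figure above:
# results = {eps: evaluate(env, value, eps)
#            for eps in EPSILONS}
\end{verbatim}
Here the greedy action is chosen with respect to the immediate reward plus the afterstate value, matching the afterstate formulation commonly used for 2048 \cite{szubert2014temporal}.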