\section{Introduction}
\IEEEPARstart{G}{ame} 2048 is a popular single-player sliding block puzzle game. It is played on a 4$\times$4 grid, the player can move the tiles in four directions (up, down, left, and right), and the objective is to reach the 2048 tile or a higher one. While the game is simple to understand, it requires strategic thinking and planning to reach the 2048 tile. Natural decision problems in 2048 have been proved to be NP-complete \cite{abdelkader20152048}. 2048 has gained widespread popularity due to its addictive gameplay and simple mechanics, making it a favorite among puzzle game enthusiasts.

In 2014, Rodgers and Levine investigated search methods for 2048 AI strategies, including minimax search, expectimax search, Monte-Carlo tree search, and averaged depth-limited search \cite{rodgers2014an}. Szubert and Ja{\'s}kowski first employed temporal difference (TD) learning to train a 2048 AI, where afterstate values were approximated with N-tuple networks \cite{szubert2014temporal}. Wu et al. proposed multi-stage TD learning incorporating shallow expectimax search to further improve the performance \cite{wu2014multi}. In 2016, N-tuple network combinations that yield better performance were selected by local greedy search \cite{oka2016systematic,matsuzaki2016systematic}. In 2017, Ja{\'s}kowski employed temporal coherence learning with multi-stage weight promotion, redundant encoding, and carousel shaping for 2048 \cite{jaskowski2017mastering}, and Matsuzaki developed backward temporal coherence learning and restart strategies for fast training \cite{matsuzaki2017developing}. In 2022, Guei et al. introduced optimistic initialization to encourage exploration in 2048 and improved the learning quality \cite{guei2021optimistic}; such optimistic initialization is equivalent to a static potential function in reward shaping \cite{wiewiora2003potential,devlin2012dynamic}. In addition, neural network approximators have been developed for training 2048 AI \cite{kondo2019playing,matsuzaki2020further,matsuzaki2021developing,bangole2023game}.

We observe a notable phenomenon in these 2048 AI studies: none of them utilizes explicit exploration strategies, such as softmax or $\epsilon$-greedy, to prevent reinforcement learning methods from getting stuck in local optima. Szubert and Ja{\'s}kowski stated that ``The exploration is not needed in this game, as the environment is inherently stochastic and thus provides sufficiently diversified experience'', and they found that experiments with $\epsilon$-greedy exploration did not improve the performance \cite{szubert2014temporal}. We argue that in 2048 AI training, it is not that exploration is unnecessary, but rather that it cannot be carried out effectively with softmax or $\epsilon$-greedy strategies.

\begin{figure*}[!t]
\centering
\subfloat[2048 Game]{\includegraphics[width=3in]{pic/2048epsilon-greedy2}%
\label{fig_second_case}}
\hfil
\subfloat[Maze]{\includegraphics[width=3in]{pic/maze-eps-greedy2}%
\label{fig_first_case}}
\caption{Comparison of returns under $\epsilon$-greedy strategies in the 2048 game and the maze game.}
\label{fig_sim}
\end{figure*}

To validate the above point, we designed two sets of experiments, one with 2048 and the other with a maze.
In the experiments, we used nearly optimal value functions combined with an $\epsilon$-greedy exploration strategy, and measured the average score and its standard deviation for each $\epsilon\in\{0, 0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.064, 0.128, 0.256, 0.512, 0.6, 0.7, 0.8, 0.9, 1.0\}$. In the 2048 game, the value function is an N-tuple network trained with optimistic initialization \cite{guei2021optimistic}, which achieves an average score of 350,000. In the maze game, the optimal value function is used, and the optimal policy achieves a score of $-54$ points. As shown in Fig.~\ref{fig_sim}, the x-axis represents the exploration parameter $\epsilon$, the y-axis represents the average score per episode, and the shaded area represents the standard deviation. We find that in the 2048 game the total score drops sharply as $\epsilon$ increases, whereas in the maze game the total score decreases only gradually with increasing $\epsilon$. This comparison indicates that the $\epsilon$-greedy exploration strategy in the maze game still results in the behavioral policy being an $\epsilon$-greedy policy, while in the 2048 game the behavioral policy is no longer an $\epsilon$-greedy policy. This sparked our curiosity: what is the fundamental difference between the 2048 game and maze games? In a maze game, when the agent deviates from the optimal path during exploration, it can immediately return to that path; in the 2048 game, when the agent deviates from the optimal state, it may never have the chance to return to the previous state. This difference relates to the ergodicity of the game. In this paper, we prove that the game 2048 is acyclic between non-absorbing states.
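As an informal sketch of why such acyclicity holds, let $\sigma(s)$ denote the sum of all tile values on a board $s$ (the notation $\sigma$ is used here only for illustration). A legal move first slides and merges tiles, which preserves $\sigma$ because two tiles of value $v$ merge into a single tile of value $2v$, and the environment then spawns a new tile of value $2$ or $4$ on an empty cell. Hence, for every transition $s_t \rightarrow s_{t+1}$ between non-absorbing states,
\[
\sigma(s_{t+1}) = \sigma(s_t) + \delta_t, \qquad \delta_t \in \{2, 4\},
\]
so $\sigma$ strictly increases along every trajectory, no non-absorbing state can be revisited, and the transition graph restricted to non-absorbing states contains no cycle.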
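For concreteness, the $\epsilon$-greedy evaluation protocol used to produce Fig.~\ref{fig_sim} can be sketched in Python as follows. This is a minimal sketch rather than our actual implementation: the environment interface (\texttt{reset}, \texttt{step}, \texttt{legal\_actions}, \texttt{afterstate}, \texttt{reward}) and the value function \texttt{value} are hypothetical placeholders.
\begin{verbatim}
import random, statistics

# Hypothetical placeholders: `env` exposes reset/step/
# legal_actions/afterstate/reward; `value` maps an
# afterstate to its estimated value.

EPSILONS = [0, 0.001, 0.002, 0.004, 0.008, 0.016,
            0.032, 0.064, 0.128, 0.256, 0.512,
            0.6, 0.7, 0.8, 0.9, 1.0]

def select_action(env, state, value, eps):
    # Random legal action with probability eps,
    # otherwise greedy w.r.t. reward + afterstate value.
    actions = env.legal_actions(state)
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a:
               env.reward(state, a)
               + value(env.afterstate(state, a)))

def evaluate(env, value, eps, episodes=1000):
    # Mean and standard deviation of episode returns
    # under the eps-greedy behavioral policy.
    returns = []
    for _ in range(episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            a = select_action(env, state, value, eps)
            state, reward, done = env.step(a)
            total += reward
        returns.append(total)
    return (statistics.mean(returns),
            statistics.pstdev(returns))

# Sweeping eps over EPSILONS and plotting the returned
# mean and standard deviation yields curves in the
# format of the figure above:
# results = {eps: evaluate(env, value, eps)
#            for eps in EPSILONS}
\end{verbatim}
Here the greedy action is chosen with respect to the immediate reward plus the afterstate value, matching the afterstate formulation commonly used for 2048 \cite{szubert2014temporal}.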