\begin{figure}[h]
\centering
\includegraphics[scale=0.2]{main/pic/maze_13_13.pdf}
\caption{Maze.}
\end{figure}
\begin{figure*}[tb]
\vskip 0.2in
\begin{center}
\subfigure[on-policy 2-state]{
\includegraphics[width=0.65\columnwidth, height=0.58\columnwidth]{main/pic/2-state-onpolicy.pdf}
\label{2-state}
}
\subfigure[off-policy 2-state]{
\includegraphics[width=0.65\columnwidth, height=0.58\columnwidth]{main/pic/2-state-offpolicy.pdf}
\label{7-state}
}
\subfigure[Maze]{
\includegraphics[width=0.65\columnwidth, height=0.58\columnwidth]{main/pic/maze.pdf}
\label{MazeFull}
}\\
\subfigure[Cliff Walking]{
\includegraphics[width=0.65\columnwidth, height=0.58\columnwidth]{main/pic/cl.pdf}
\label{CliffWalkingFull}
}
\subfigure[Mountain Car]{
\includegraphics[width=0.65\columnwidth, height=0.58\columnwidth]{main/pic/mt.pdf}
\label{MountainCarFull}
}
\subfigure[Acrobot]{
\includegraphics[width=0.65\columnwidth, height=0.58\columnwidth]{main/pic/acrobot.pdf}
\label{AcrobotFull}
}
\caption{Learning curves for one evaluation environment and four control environments.}
\label{Complete_full}
\end{center}
\vskip -0.2in
\end{figure*}
\section{Experimental Studies}
This section assesses algorithm performance through two groups of experiments: policy evaluation experiments and control experiments. The policy evaluation environment is the 2-state environment, in which we conduct both on-policy and off-policy experiments to verify the relationship between an algorithm's convergence speed and the smallest eigenvalue of its key matrix $\textbf{A}$. The control experiments, in which each algorithm interacts with the environment to optimize its policy, evaluate how well the algorithm learns the optimal policy and thus provide a more comprehensive assessment of its overall capabilities. The control environments are Maze, CliffWalking-v0, MountainCar-v0, and Acrobot-v1. The control algorithms derived from TDC, ETD, VMTDC, and VMETD are named GQ, EQ, VMGQ, and VMEQ, respectively; TD and VMTD each have two control variants, namely Sarsa and Q-learning for TD, and VMSarsa and VMQ for VMTD.
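The reason the smallest eigenvalue matters is that, in expectation and with a fixed step size $\alpha$, the parameter error of such linear algorithms evolves approximately as $\theta_{t+1}-\theta^{*}\approx(\textbf{I}-\alpha\textbf{A})(\theta_{t}-\theta^{*})$, so a positive smallest eigenvalue (real part) of $\textbf{A}$ yields contraction of the expected error, and a larger smallest eigenvalue permits faster contraction. For illustration only, the snippet below computes such an eigenvalue under the usual definition of the TD key matrix, $\textbf{A}=\Phi^{\top}\textbf{D}_{\mu}(\textbf{I}-\gamma \textbf{P}_{\pi})\Phi$; the transition matrix, state distribution, and features used here are placeholder values, not the 2-state specification used in our experiments.
\begin{verbatim}
import numpy as np

# Illustrative 2-state chain (placeholder values, not the
# specification used in the experiments).
P_pi  = np.array([[0.5, 0.5],     # target-policy transition matrix
                  [0.5, 0.5]])
d_mu  = np.array([0.5, 0.5])      # behaviour-policy state distribution
Phi   = np.array([[1.0],          # one feature per state
                  [2.0]])
gamma = 0.99

D = np.diag(d_mu)
A = Phi.T @ D @ (np.eye(2) - gamma * P_pi) @ Phi  # TD key matrix
print(np.min(np.linalg.eigvals(A).real))          # smallest eigenvalue (real part)
\end{verbatim}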
\subsection{Testing Tasks}
\textbf{Maze}: The learning agent must find a shortest path from the upper left corner to the lower right corner. In each state there are four actions, $up$, $down$, $left$, and $right$, which take the agent deterministically to the corresponding neighbouring state, except when the movement is blocked by an obstacle or the edge of the maze. The reward is $-1$ on every transition until the agent reaches the goal state. The discount factor is $\gamma=0.99$, states are represented by tabular features, and the maximum number of moves per episode is set to 1000.

\textbf{The other three control environments}: Cliff Walking, Mountain Car, and Acrobot are taken from the official gym website and correspond to the versions ``CliffWalking-v0'', ``MountainCar-v0'', and ``Acrobot-v1''; for further details, please refer to the gym documentation. The maximum number of steps in Mountain Car is set to 1000, while the default settings are used for the other two environments. In Mountain Car and Acrobot, features are generated by tile coding.

Each policy evaluation experiment is run independently 100 times, and each control experiment is run independently 50 times. For the specific experimental parameters, please refer to the appendix.

\subsection{Experimental Results and Analysis}
Figure \ref{2-state} shows the learning curves for the on-policy 2-state policy evaluation experiment. In this setting, the convergence speed of TD, VMTD, TDC, and VMTDC decreases in that order. Table \ref{tab:min_eigenvalues} shows that the smallest eigenvalues of the key matrices of these four algorithms are all greater than 0 and also decrease in that order, which is consistent with the learning curves.

Figure \ref{7-state} shows the learning curves for the off-policy 2-state policy evaluation experiment. In this setting, the convergence speed of ETD, VMETD, VMTD, VMTDC, and TDC decreases in that order, while TD diverges. Table \ref{tab:min_eigenvalues} shows that the smallest eigenvalues of the key matrices of ETD, VMETD, VMTD, VMTDC, and TDC are all greater than 0 and decrease in that order, whereas the smallest eigenvalue for TD is less than 0, again consistent with the learning curves. Remarkably, although VMTD is only guaranteed to converge under on-policy conditions, it still converges in the off-policy 2-state scenario. Its update rule is essentially a corrected form of the TD update: the auxiliary parameter $\omega$ stabilizes the variance of the gradient estimate and thereby stabilizes the update of $\theta$.
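Schematically, and omitting the details of the exact VMTD update given earlier, this $\omega$-centred correction can be sketched as follows (step sizes $\alpha_t$, $\beta_t$; features $\phi_t=\phi(s_t)$):
\begin{align*}
\delta_t &= r_{t+1} + \gamma\,\theta_t^{\top}\phi_{t+1} - \theta_t^{\top}\phi_t,\\
\omega_{t+1} &= \omega_t + \beta_t\left(\delta_t - \omega_t\right),\\
\theta_{t+1} &= \theta_t + \alpha_t\left(\delta_t - \omega_t\right)\phi_t.
\end{align*}
Here $\omega_t$ tracks a running estimate of the expected TD error, so the $\theta$ update is driven by the centred error $\delta_t-\omega_t$ rather than by $\delta_t$ itself.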
Figures \ref{MazeFull}, \ref{CliffWalkingFull}, \ref{MountainCarFull}, and \ref{AcrobotFull} show the learning curves of the four control experiments. A common feature across these experiments is that VMEQ outperforms EQ, VMGQ outperforms GQ, VMQ outperforms Q-learning, and VMSarsa outperforms Sarsa. In the Maze and Cliff Walking experiments, VMEQ performs best, with the fastest convergence speed. In the Mountain Car and Acrobot experiments, the four VM algorithms perform almost identically and all outperform the other algorithms. Overall, in both the policy evaluation and the control experiments, the VM algorithms demonstrate superior performance, excelling particularly in the control experiments.
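These consistent gains admit a simple schematic reading. Assuming the control variants carry the same $\omega$-centred correction over to action values (a sketch, with implementation details omitted), the corresponding action-value TD errors are
\begin{align*}
\delta_t^{\mathrm{Q}} &= r_{t+1} + \gamma\max_{a}\theta_t^{\top}\phi(s_{t+1},a) - \theta_t^{\top}\phi(s_t,a_t),\\
\delta_t^{\mathrm{Sarsa}} &= r_{t+1} + \gamma\,\theta_t^{\top}\phi(s_{t+1},a_{t+1}) - \theta_t^{\top}\phi(s_t,a_t),
\end{align*}
with $\omega$ and $\theta$ then updated as in the evaluation sketch above, using the state-action features $\phi(s_t,a_t)$ in place of $\phi_t$.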