\section{Experimental Studies}
This section assesses algorithm performance through experiments, divided into policy evaluation experiments and control experiments.

\subsection{Testing Tasks}
\textbf{Random walk:} as shown in Figure \ref{randomwalk}, all episodes start in the center state, $C$, and proceed either left or right by one state on each step, with equal probability. Episodes terminate at either the extreme left or the extreme right, with a reward of $+1$ for terminating on the right and $0$ otherwise. In this task, the true value of a state is the probability of terminating on the right when starting from that state \cite{Sutton2018book}. Thus, the true values of states $A$ to $E$ are $\frac{1}{6},\frac{2}{6},\frac{3}{6},\frac{4}{6},\frac{5}{6}$, respectively; a short derivation is given below. The discount factor is $\gamma=1.0$. There are three standard kinds of features for random-walk problems: tabular, inverted, and dependent features \cite{sutton2009fast}. The feature matrices corresponding to the three random walks are given in the Appendix. All random-walk experiments are conducted on-policy.

\begin{figure}
\begin{center}
\input{main/pic/randomwalk.tex}
\caption{Random walk.}
\label{randomwalk}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\input{main/pic/BairdExample.tex}
\caption{7-state version of Baird's off-policy counterexample.}
\label{bairdexample}
\end{center}
\end{figure}

\textbf{Baird's off-policy counterexample:} this task is a well-known counterexample on which TD diverges \cite{baird1995residual,sutton2009fast}. As shown in Figure \ref{bairdexample}, the reward on every transition is zero, so the true value of every state is zero under any policy. The behaviour policy chooses the actions represented by solid lines with probability $\frac{1}{7}$ and the actions represented by dotted lines with probability $\frac{6}{7}$. The target policy is expected to choose the solid-line action with probability greater than $\frac{1}{7}$; in this paper it chooses the solid-line action with probability $1$. The resulting importance-sampling ratios are given below. The discount factor is $\gamma=0.99$, and the feature matrix is defined in the Appendix \cite{baird1995residual,sutton2009fast,maei2011gradient}.

\textbf{Maze:} as shown in Figure \ref{maze}, the learning agent must find a shortest path from the upper left corner to the lower right corner. In each state there are four actions, $up$, $down$, $left$, and $right$, each of which moves the agent deterministically to the corresponding neighbouring state, except when the movement is blocked by an obstacle or the edge of the maze. The reward is $-1$ on every transition until the agent reaches the goal state. The discount factor is $\gamma=0.99$, and states are represented by tabular features. The maximum number of steps per episode is set to 1000.

\begin{figure}
\centering
\includegraphics[scale=0.20]{main/pic/maze_13_13.pdf}
\caption{Maze.}
\label{maze}
\end{figure}

\textbf{The other three control environments:} Cliff Walking, Mountain Car, and Acrobot are taken from the official Gym library and correspond to the versions ``CliffWalking-v0'', ``MountainCar-v0'', and ``Acrobot-v1''; for further details, please refer to the official Gym documentation. The maximum number of steps per episode in Mountain Car is set to 1000, while the default settings are used for the other two environments. In Mountain Car and Acrobot, features are generated by tile coding, as sketched below.
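For the random-walk task, the stated true values can be checked directly. With $\gamma=1$ and a reward of $+1$ only upon right termination, the value of a state equals its probability of terminating on the right. Writing $v(k)$ for this probability when starting $k$ states to the right of the left terminal state, the dynamics give
\begin{equation*}
v(0)=0,\qquad v(6)=1,\qquad v(k)=\tfrac{1}{2}\,v(k-1)+\tfrac{1}{2}\,v(k+1),\quad 1\le k\le 5,
\end{equation*}
whose unique solution is linear in $k$, namely $v(k)=k/6$, which yields $\frac{1}{6},\dots,\frac{5}{6}$ for states $A$ through $E$.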
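For Baird's counterexample, the mismatch between the two policies is what makes the task a stringent off-policy test. Off-policy methods of the gradient-TD family, such as TDC and GQ(0), typically weight their updates by the importance-sampling ratio
\begin{equation*}
\rho(s,a)=\frac{\pi(a\mid s)}{b(a\mid s)}=
\begin{cases}
1\,/\,\tfrac{1}{7}=7, & \text{if } a \text{ is the solid-line action},\\[4pt]
0\,/\,\tfrac{6}{7}=0, & \text{if } a \text{ is a dotted-line action},
\end{cases}
\end{equation*}
so that, under the behaviour policy, on average six out of every seven transitions receive zero weight while the remaining ones are weighted by a factor of seven.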
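The following is a minimal sketch of one possible tile-coded feature construction, assuming a standard Gym installation. The plain grid tile coder shown here (8 tilings of $8\times 8$ tiles, giving 512 binary features) and the example observation are illustrative placeholders; the settings actually used in the experiments are those reported in the Appendix.
\begin{verbatim}
import numpy as np
import gym  # "MountainCar-v0" / "Acrobot-v1" as named above

# Illustrative settings only; the paper's settings are in its Appendix.
NUM_TILINGS = 8
TILES_PER_DIM = 8

def tile_code(obs, low, high,
              num_tilings=NUM_TILINGS, tiles_per_dim=TILES_PER_DIM):
    """Binary feature vector with one active tile per tiling."""
    scaled = (np.asarray(obs) - low) / (high - low)   # rescale to [0, 1]
    dims = len(scaled)
    phi = np.zeros(num_tilings * tiles_per_dim ** dims)
    for t in range(num_tilings):
        offset = t / (num_tilings * tiles_per_dim)    # shift each tiling
        idx = np.minimum(((scaled + offset) * tiles_per_dim).astype(int),
                         tiles_per_dim - 1)
        flat = np.ravel_multi_index(idx, (tiles_per_dim,) * dims)
        phi[t * tiles_per_dim ** dims + flat] = 1.0
    return phi

env = gym.make("MountainCar-v0")   # observation: (position, velocity)
low, high = env.observation_space.low, env.observation_space.high
phi = tile_code((low + high) / 2.0, low, high)
print(phi.shape, int(phi.sum()))   # (512,) 8: one active tile per tiling
\end{verbatim}
Each observation is thus mapped to a sparse binary vector with exactly one active feature per tiling, which is the kind of representation linear TD-style methods typically operate on.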
The learning rates used in all experiments are listed in the Appendix.

\subsection{Experimental Results and Analysis}
For the policy evaluation experiments, we compare the performance of VMTD, VMTDC, TD, and TDC, with the vertical axis unified as RVBE. For the control experiments, we compare the performance of Sarsa, Q-learning, GQ(0), AC, VMSarsa, VMQ, and VMGQ(0) in the four control environments. All experiments in this paper were run independently 100 times. The learning curves for the policy evaluation experiments and the control experiments are shown in Figures \ref{Evaluation_full} and \ref{Complete_full}, respectively; the shaded areas represent the standard deviation (std).

\begin{figure}[htb]
\vskip 0.2in
\begin{center}
\subfigure[Dependent]{
\includegraphics[width=0.4\columnwidth, height=0.3\columnwidth]{main/pic/dependent_new.pdf}
\label{DependentFull}
}
\subfigure[Tabular]{
\includegraphics[width=0.4\columnwidth, height=0.3\columnwidth]{main/pic/tabular_new.pdf}
\label{TabularFull}
}
\\
\subfigure[Inverted]{
\includegraphics[width=0.4\columnwidth, height=0.3\columnwidth]{main/pic/inverted_new.pdf}
\label{InvertedFull}
}
\subfigure[Counterexample]{
\includegraphics[width=0.4\columnwidth, height=0.3\columnwidth]{main/pic/counterexample_quanju_new.pdf}
\label{CounterExampleFull}
}
\caption{Learning curves of the four evaluation environments.}
\label{Evaluation_full}
\end{center}
\vskip -0.2in
\end{figure}

\begin{figure*}[htb]
\vskip 0.2in
\begin{center}
\subfigure[Maze]{
\includegraphics[width=0.55\columnwidth, height=0.4\columnwidth]{main/pic/maze_complete.pdf}
\label{MazeFull}
}
\subfigure[Cliff Walking]{
\includegraphics[width=0.55\columnwidth, height=0.4\columnwidth]{main/pic/cw_complete.pdf}
\label{CliffWalkingFull}
}
\\
\subfigure[Mountain Car]{
\includegraphics[width=0.55\columnwidth, height=0.4\columnwidth]{main/pic/mt_complete.pdf}
\label{MountainCarFull}
}
\subfigure[Acrobot]{
\includegraphics[width=0.55\columnwidth, height=0.4\columnwidth]{main/pic/Acrobot_complete.pdf}
\label{AcrobotFull}
}
\caption{Learning curves of the four control environments.}
\label{Complete_full}
\end{center}
\vskip -0.2in
\end{figure*}
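As a minimal illustration of how the curves in Figures \ref{Evaluation_full} and \ref{Complete_full} aggregate the 100 independent runs, the snippet below plots a mean learning curve with a one-standard-deviation band; the file name, array layout, and axis labels are placeholders rather than the scripts actually used.
\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt

# Placeholder file: per-run learning curves, shape (100 runs, num_episodes).
curves = np.load("vmq_maze_returns.npy")

mean = curves.mean(axis=0)       # average over the independent runs
std = curves.std(axis=0)         # std shown as the shaded band
episodes = np.arange(len(mean))

plt.plot(episodes, mean, label="VMQ")
plt.fill_between(episodes, mean - std, mean + std, alpha=0.3)
plt.xlabel("Episodes")
plt.ylabel("Return")
plt.legend()
plt.show()
\end{verbatim}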
In the random-walk tasks, VMTD and VMTDC exhibit excellent performance, outperforming TD and TDC in the dependent-feature random walk. In the 7-state counterexample task, TD diverges, while VMTDC converges and performs better than TDC. From the update formula, it can be observed that the VMTD algorithm, like TDC, is also an adjustment or correction of the TD update. What is more surprising is that VMTD also remains convergent and demonstrates the best performance.

In Maze, Mountain Car, and Acrobot, the convergence speed of VMSarsa, VMQ, and VMGQ(0) is significantly improved compared to Sarsa, Q-learning, and GQ(0), respectively. The performance of the AC algorithm is at an intermediate level. The performances of VMSarsa, VMQ, and VMGQ(0) in these three environments show no significant differences. In Cliff Walking, Sarsa and VMSarsa converge to slightly worse solutions than the other algorithms, although VMSarsa converges significantly faster than Sarsa. The convergence speed of VMGQ(0) and VMQ is better than that of the other algorithms, and VMGQ(0) performs slightly better than VMQ.

In summary, VMSarsa, VMQ, and VMGQ(0) outperform the other algorithms. In the Cliff Walking environment, VMGQ(0) performs slightly better than VMSarsa and VMQ; in the other three environments, the performances of VMSarsa, VMQ, and VMGQ(0) are close.