\begin{figure}[h]
\centering
\includegraphics[scale=0.2]{main/pic/maze_13_13.pdf}
\caption{Maze.}
\end{figure}
\begin{figure*}[tb]
\vskip 0.2in
\begin{center}
\subfigure[on-policy 2-state]{
\includegraphics[width=0.65\columnwidth, height=0.58\columnwidth]{main/pic/2-state-onpolicy.pdf}
\label{2-state}
}
\subfigure[off-policy 2-state]{
\includegraphics[width=0.65\columnwidth, height=0.58\columnwidth]{main/pic/2-state-offpolicy.pdf}
\label{7-state}
}
\subfigure[Maze]{
\includegraphics[width=0.65\columnwidth, height=0.58\columnwidth]{main/pic/maze.pdf}
\label{MazeFull}
}\\
\subfigure[Cliff Walking]{
\includegraphics[width=0.65\columnwidth, height=0.58\columnwidth]{main/pic/cl.pdf}
\label{CliffWalkingFull}
}
\subfigure[Mountain Car]{
\includegraphics[width=0.65\columnwidth, height=0.58\columnwidth]{main/pic/mt.pdf}
\label{MountainCarFull}
}
\subfigure[Acrobot]{
\includegraphics[width=0.65\columnwidth, height=0.58\columnwidth]{main/pic/acrobot.pdf}
\label{AcrobotFull}
}
\caption{Learning curves for one evaluation environment and four control environments.}
\label{Complete_full}
\end{center}
\vskip -0.2in
\end{figure*}
\section{Experimental Studies}
This section assesses algorithm performance through two groups of experiments: policy evaluation experiments and control experiments. The policy evaluation environment is the 2-state environment, in which we conduct both on-policy and off-policy experiments to verify the relationship between an algorithm's convergence speed and the smallest eigenvalue of its key matrix $\textbf{A}$. The control experiments, in which each algorithm interacts with the environment to optimize its policy, evaluate how well the algorithm learns the optimal policy and thus provide a more comprehensive assessment of its overall capabilities. The control environments are Maze, CliffWalking-v0, MountainCar-v0, and Acrobot-v1. The control algorithms derived from TDC, ETD, VMTDC, and VMETD are named GQ, EQ, VMGQ, and VMEQ, respectively; TD and VMTD each have two control variants, namely Sarsa and Q-learning for TD, and VMSarsa and VMQ for VMTD.
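The reason the smallest eigenvalue matters is that, in expectation and with a fixed step size $\alpha$, the parameter error of such linear algorithms evolves approximately as $\theta_{t+1}-\theta^{*}\approx(\textbf{I}-\alpha\textbf{A})(\theta_{t}-\theta^{*})$, so a positive smallest eigenvalue (real part) of $\textbf{A}$ yields contraction of the expected error, and a larger smallest eigenvalue permits faster contraction. For illustration only, the snippet below computes such an eigenvalue under the usual definition of the TD key matrix, $\textbf{A}=\Phi^{\top}\textbf{D}_{\mu}(\textbf{I}-\gamma \textbf{P}_{\pi})\Phi$; the transition matrix, state distribution, and features used here are placeholder values, not the 2-state specification used in our experiments.
\begin{verbatim}
import numpy as np

# Illustrative 2-state chain (placeholder values, not the
# specification used in the experiments).
P_pi  = np.array([[0.5, 0.5],     # target-policy transition matrix
                  [0.5, 0.5]])
d_mu  = np.array([0.5, 0.5])      # behaviour-policy state distribution
Phi   = np.array([[1.0],          # one feature per state
                  [2.0]])
gamma = 0.99

D = np.diag(d_mu)
A = Phi.T @ D @ (np.eye(2) - gamma * P_pi) @ Phi  # TD key matrix
print(np.min(np.linalg.eigvals(A).real))          # smallest eigenvalue (real part)
\end{verbatim}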
\subsection{Testing Tasks}
\textbf{Maze}: The learning agent must find a shortest path from the upper left corner to the lower right corner. In each state there are four actions, $up$, $down$, $left$, and $right$, which take the agent deterministically to the corresponding neighbouring state, except when the movement is blocked by an obstacle or the edge of the maze. The reward is $-1$ on every transition until the agent reaches the goal state. The discount factor is $\gamma=0.99$, states are represented by tabular features, and the maximum number of moves per episode is set to 1000.

\textbf{The other three control environments}: Cliff Walking, Mountain Car, and Acrobot are taken from the official gym website and correspond to the versions ``CliffWalking-v0'', ``MountainCar-v0'', and ``Acrobot-v1''; for further details, please refer to the gym documentation. The maximum number of steps in Mountain Car is set to 1000, while the default settings are used for the other two environments. In Mountain Car and Acrobot, features are generated by tile coding.

Each policy evaluation experiment is run independently 100 times, and each control experiment is run independently 50 times. For the specific experimental parameters, please refer to the appendix.

\subsection{Experimental Results and Analysis}
Figure \ref{2-state} shows the learning curves for the on-policy 2-state policy evaluation experiment. In this setting, the convergence speed of TD, VMTD, TDC, and VMTDC decreases in that order. Table \ref{tab:min_eigenvalues} shows that the smallest eigenvalues of the key matrices of these four algorithms are all greater than 0 and also decrease in that order, which is consistent with the learning curves.

Figure \ref{7-state} shows the learning curves for the off-policy 2-state policy evaluation experiment. In this setting, the convergence speed of ETD, VMETD, VMTD, VMTDC, and TDC decreases in that order, while TD diverges. Table \ref{tab:min_eigenvalues} shows that the smallest eigenvalues of the key matrices of ETD, VMETD, VMTD, VMTDC, and TDC are all greater than 0 and decrease in that order, whereas the smallest eigenvalue for TD is less than 0, again consistent with the learning curves. Remarkably, although VMTD is only guaranteed to converge under on-policy conditions, it still converges in the off-policy 2-state scenario. Its update rule is essentially a corrected form of the TD update: the auxiliary parameter $\omega$ stabilizes the variance of the gradient estimate and thereby stabilizes the update of $\theta$.
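Schematically, and omitting the details of the exact VMTD update given earlier, this $\omega$-centred correction can be sketched as follows (step sizes $\alpha_t$, $\beta_t$; features $\phi_t=\phi(s_t)$):
\begin{align*}
\delta_t &= r_{t+1} + \gamma\,\theta_t^{\top}\phi_{t+1} - \theta_t^{\top}\phi_t,\\
\omega_{t+1} &= \omega_t + \beta_t\left(\delta_t - \omega_t\right),\\
\theta_{t+1} &= \theta_t + \alpha_t\left(\delta_t - \omega_t\right)\phi_t.
\end{align*}
Here $\omega_t$ tracks a running estimate of the expected TD error, so the $\theta$ update is driven by the centred error $\delta_t-\omega_t$ rather than by $\delta_t$ itself.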
Figures \ref{MazeFull}, \ref{CliffWalkingFull}, \ref{MountainCarFull}, and \ref{AcrobotFull} show the learning curves of the four control experiments. A common feature across these experiments is that VMEQ outperforms EQ, VMGQ outperforms GQ, VMQ outperforms Q-learning, and VMSarsa outperforms Sarsa. In the Maze and Cliff Walking experiments, VMEQ performs best, with the fastest convergence speed. In the Mountain Car and Acrobot experiments, the four VM algorithms perform almost identically and all outperform the other algorithms. Overall, in both the policy evaluation and the control experiments, the VM algorithms demonstrate superior performance, excelling particularly in the control experiments.
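These consistent gains admit a simple schematic reading. Assuming the control variants carry the same $\omega$-centred correction over to action values (a sketch, with implementation details omitted), the corresponding action-value TD errors are
\begin{align*}
\delta_t^{\mathrm{Q}} &= r_{t+1} + \gamma\max_{a}\theta_t^{\top}\phi(s_{t+1},a) - \theta_t^{\top}\phi(s_t,a_t),\\
\delta_t^{\mathrm{Sarsa}} &= r_{t+1} + \gamma\,\theta_t^{\top}\phi(s_{t+1},a_{t+1}) - \theta_t^{\top}\phi(s_t,a_t),
\end{align*}
with $\omega$ and $\theta$ then updated as in the evaluation sketch above, using the state-action features $\phi(s_t,a_t)$ in place of $\phi_t$.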