\section{Experimental Studies}
This section assesses the performance of the proposed algorithms through two groups of experiments: policy evaluation experiments and control experiments. The control counterparts of TDC, ETD, VMTDC, and VMETD are denoted GQ, EQ, VMGQ, and VMEQ, respectively. The evaluation environments are the 2-state and 7-state counterexamples; the control environments are Maze, CliffWalking-v0, MountainCar-v0, and Acrobot-v1. Specific experimental parameters are given in the appendix.

The evaluation results are consistent with the preceding analysis. In the 2-state counterexample, TDC has the smallest minimum eigenvalue of its key matrix and therefore converges most slowly, whereas the larger minimum eigenvalue of VMTDC leads to faster convergence (see the illustrative computation at the end of this section). Although the minimum eigenvalue of VMETD is larger than that of ETD, VMETD converges more slowly than ETD in the 2-state counterexample; however, its standard deviation (shaded area) is smaller than ETD's, indicating that VMETD converges more smoothly. In the 7-state counterexample, VMTDC converges faster than TDC, and both VMETD and ETD diverge.

For the control experiments, the results on Maze and CliffWalking-v0 are similar: VMGQ outperforms GQ, EQ outperforms VMGQ, and VMEQ performs best. On MountainCar-v0 and Acrobot-v1, VMGQ and VMEQ achieve comparable performance, and both outperform GQ and EQ. In summary, the VM algorithms outperform their non-VM counterparts in the control experiments.
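To make the eigenvalue argument from the evaluation experiments concrete, the sketch below computes the relevant minimum eigenvalues for a generic 2-state off-policy setup. The features, behavior distribution, target-policy transitions, and discount factor are illustrative assumptions (the classic ``$\theta \rightarrow 2\theta$'' construction), not necessarily the exact configuration used in our experiments; the TDC and ETD key matrices are taken in their standard forms, $A^{\top} C^{-1} A$ and $\Phi^{\top} F (I - \gamma P_{\pi}) \Phi$, and the key matrices of VMTDC and VMETD derived earlier can be compared in the same way.
\begin{verbatim}
import numpy as np

# Illustrative 2-state off-policy setup (the classic "theta -> 2*theta"
# construction); the exact features, policies, and discount factor used
# in our experiments may differ.
gamma = 0.99
Phi = np.array([[1.0], [2.0]])            # one feature per state
P_pi = np.array([[0.0, 1.0],
                 [0.0, 1.0]])             # target-policy transition matrix
d_mu = np.array([0.5, 0.5])               # behavior-policy state distribution
D = np.diag(d_mu)
I = np.eye(2)

# TD/TDC building blocks
A = Phi.T @ D @ (I - gamma * P_pi) @ Phi  # TD key matrix
C = Phi.T @ D @ Phi                       # feature covariance matrix
A_tdc = A.T @ np.linalg.inv(C) @ A        # TDC key matrix (slow timescale)

# ETD(0) key matrix with unit interest: follow-on weights f solve
# (I - gamma * P_pi^T) f = d_mu
f = np.linalg.solve(I - gamma * P_pi.T, d_mu)
A_etd = Phi.T @ np.diag(f) @ (I - gamma * P_pi) @ Phi

for name, M in [("TD", A), ("TDC", A_tdc), ("ETD", A_etd)]:
    eig_min = np.linalg.eigvals(M).real.min()
    print(f"{name}: minimum eigenvalue (real part) = {eig_min:.4f}")
\end{verbatim}
Under these assumptions the TD key matrix is not positive definite (which is why TD diverges on such counterexamples), the TDC key matrix is positive but has a small minimum eigenvalue, and the ETD key matrix has a substantially larger one, consistent with the observation that the algorithm with the smallest minimum eigenvalue converges most slowly.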