\section{Experimental Studies}
This section assesses the performance of the proposed algorithms through two groups of experiments: policy evaluation experiments and control experiments. The control counterparts of TDC, ETD, VMTDC, and VMETD are denoted GQ, EQ, VMGQ, and VMEQ, respectively. The evaluation environments are the 2-state and 7-state counterexamples; the control environments are Maze, CliffWalking-v0, MountainCar-v0, and Acrobot-v1. Specific experimental parameters are given in the appendix.

The evaluation results are consistent with the preceding analysis. In the 2-state counterexample, TDC has the smallest minimum eigenvalue of its key matrix and therefore converges most slowly, whereas the larger minimum eigenvalue of VMTDC leads to faster convergence (see the illustrative computation at the end of this section). Although the minimum eigenvalue of VMETD is larger than that of ETD, VMETD converges more slowly than ETD in the 2-state counterexample; however, its standard deviation (shaded area) is smaller than ETD's, indicating that VMETD converges more smoothly. In the 7-state counterexample, VMTDC converges faster than TDC, and both VMETD and ETD diverge.

For the control experiments, the results on Maze and CliffWalking-v0 are similar: VMGQ outperforms GQ, EQ outperforms VMGQ, and VMEQ performs best. On MountainCar-v0 and Acrobot-v1, VMGQ and VMEQ achieve comparable performance, and both outperform GQ and EQ. In summary, the VM algorithms outperform their non-VM counterparts in the control experiments.
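To make the eigenvalue argument from the evaluation experiments concrete, the sketch below computes the relevant minimum eigenvalues for a generic 2-state off-policy setup. The features, behavior distribution, target-policy transitions, and discount factor are illustrative assumptions (the classic ``$\theta \rightarrow 2\theta$'' construction), not necessarily the exact configuration used in our experiments; the TDC and ETD key matrices are taken in their standard forms, $A^{\top} C^{-1} A$ and $\Phi^{\top} F (I - \gamma P_{\pi}) \Phi$, and the key matrices of VMTDC and VMETD derived earlier can be compared in the same way.
\begin{verbatim}
import numpy as np

# Illustrative 2-state off-policy setup (the classic "theta -> 2*theta"
# construction); the exact features, policies, and discount factor used
# in our experiments may differ.
gamma = 0.99
Phi = np.array([[1.0], [2.0]])            # one feature per state
P_pi = np.array([[0.0, 1.0],
                 [0.0, 1.0]])             # target-policy transition matrix
d_mu = np.array([0.5, 0.5])               # behavior-policy state distribution
D = np.diag(d_mu)
I = np.eye(2)

# TD/TDC building blocks
A = Phi.T @ D @ (I - gamma * P_pi) @ Phi  # TD key matrix
C = Phi.T @ D @ Phi                       # feature covariance matrix
A_tdc = A.T @ np.linalg.inv(C) @ A        # TDC key matrix (slow timescale)

# ETD(0) key matrix with unit interest: follow-on weights f solve
# (I - gamma * P_pi^T) f = d_mu
f = np.linalg.solve(I - gamma * P_pi.T, d_mu)
A_etd = Phi.T @ np.diag(f) @ (I - gamma * P_pi) @ Phi

for name, M in [("TD", A), ("TDC", A_tdc), ("ETD", A_etd)]:
    eig_min = np.linalg.eigvals(M).real.min()
    print(f"{name}: minimum eigenvalue (real part) = {eig_min:.4f}")
\end{verbatim}
Under these assumptions the TD key matrix is not positive definite (which is why TD diverges on such counterexamples), the TDC key matrix is positive but has a small minimum eigenvalue, and the ETD key matrix has a substantially larger one, consistent with the observation that the algorithm with the smallest minimum eigenvalue converges most slowly.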