\begin{figure*}[htb]
\vskip 0.2in
\begin{center}
\subfigure[Maze]{
\includegraphics[width=0.3\columnwidth, height=0.2\columnwidth]{main/pic/maze_complete.pdf}
\label{MazeFull}
}
\subfigure[Cliff Walking]{
\includegraphics[width=0.3\columnwidth, height=0.2\columnwidth]{main/pic/cw_complete.pdf}
\label{CliffWalkingFull}
}
\\
\subfigure[Mountain Car]{
\includegraphics[width=0.3\columnwidth, height=0.2\columnwidth]{main/pic/mt_complete.pdf}
\label{MountainCarFull}
}
\subfigure[Acrobot]{
\includegraphics[width=0.3\columnwidth, height=0.2\columnwidth]{main/pic/Acrobot_complete.pdf}
\label{AcrobotFull}
}
\caption{Learning curves on four control environments.}
\label{Complete_full}
\end{center}
\vskip -0.2in
\end{figure*}
\section{Related Work}
\subsection{Difference between VMQ and R-learning}
\begin{table*}[htb]
\centering
\caption{Difference between R-learning and tabular VMQ.}
\vskip 0.15in
\begin{tabular}{c|c}
\hline
Algorithm & Update formula \\
\hline
R-learning & $Q_{k+1}(s,a)\leftarrow Q_{k}(s,a)+\alpha_k\big(r_{k+1}-m_{k}+\max_{b\in A}Q_{k}(s',b)-Q_{k}(s,a)\big)$ \\
 & $m_{k+1}\leftarrow m_{k}+\beta_k\big(r_{k+1}+\max_{b\in A}Q_{k}(s',b)-Q_{k}(s,a)-m_{k}\big)$ \\
tabular VMQ & $Q_{k+1}(s,a)\leftarrow Q_{k}(s,a)+\alpha_k\big(r_{k+1}+\gamma\max_{b\in A}Q_{k}(s',b)-Q_{k}(s,a)-\omega_k\big)$ \\
 & $\omega_{k+1}\leftarrow \omega_{k}+\beta_k\big(r_{k+1}+\gamma\max_{b\in A}Q_{k}(s',b)-Q_{k}(s,a)-\omega_{k}\big)$ \\
\hline
\end{tabular}
\label{differenceRandVMQ}
\vskip -0.1in
\end{table*}
The update formula of tabular VMQ bears some resemblance to that of R-learning. As shown in Table~\ref{differenceRandVMQ}, the two update rules differ in the following ways:
\\(1) The goal of R-learning \cite{schwartz1993reinforcement} is to maximize the average reward rather than the cumulative reward, so it learns an estimate $m$ of the average reward, which is then used to update the Q-values. In contrast, the quantity $\omega$ in the tabular VMQ update eventually converges to $\mathbb{E}[\delta]$, the expected TD error.
\\(2) When $\gamma=1$, the tabular VMQ update takes the same form as the R-learning update, so R-learning can be regarded, in form, as a special case of VMQ.
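For concreteness, the two update rules in Table~\ref{differenceRandVMQ} can be sketched in Python as follows. This is an illustrative sketch only, assuming a 2-D array \texttt{Q} indexed by state and action; names such as \texttt{alpha}, \texttt{beta}, and \texttt{s\_next} are ours and are not taken from any released implementation.
\begin{verbatim}
import numpy as np

def r_learning_update(Q, m, s, a, r, s_next,
                      alpha, beta):
    # R-learning: m estimates the average reward.
    # All terms use the pre-update table Q_k.
    target = np.max(Q[s_next])
    q_sa = Q[s, a]
    Q[s, a] += alpha * (r - m + target - q_sa)
    m += beta * (r + target - q_sa - m)
    return Q, m

def tabular_vmq_update(Q, omega, s, a, r, s_next,
                       alpha, beta, gamma):
    # Tabular VMQ: omega tracks E[delta],
    # the expected TD error.
    delta = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * (delta - omega)
    omega += beta * (delta - omega)
    return Q, omega
\end{verbatim}
With $\gamma=1$ the two updates coincide in form, matching point (2) above.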
\subsection{Variance Reduction for TD Learning}
The TD with centering (CTD) algorithm \cite{korda2015td} directly applies variance reduction techniques to the TD algorithm: it updates its parameters using the average gradient of a batch of Markovian samples together with a projection operator. Unfortunately, the authors' analysis of the CTD algorithm contains technical errors. The VRTD algorithm \cite{xu2020reanalysis} is also a variance-reduced algorithm; it updates its parameters with the average gradient of a batch of i.i.d.\ samples, and its authors provide a technically sound analysis demonstrating the advantages of variance reduction.
\subsection{Variance Reduction for Policy Gradient Algorithms}
Policy gradient algorithms are a class of reinforcement learning algorithms that directly optimize the cumulative reward. REINFORCE is a Monte Carlo algorithm that estimates the gradient through sampling, but its gradient estimates can have high variance. Baselines are introduced to reduce this variance and to accelerate learning \cite{Sutton2018book}. In Actor-Critic methods, using the value function as a baseline together with bootstrapping likewise reduces variance and accelerates convergence \cite{Sutton2018book}.
TRPO \cite{schulman2015trust} and PPO \cite{schulman2017proximal} use generalized advantage estimation, which combines multi-step bootstrapping with Monte Carlo estimation to reduce variance, making gradient estimation more stable and accelerating convergence. In Variance Minimization, the incorporation of $\omega \doteq \mathbb{E}[\delta]$ closely resembles the use of a baseline in policy gradient methods. Introducing a baseline does not change the expected value of the policy gradient update, but it can substantially reduce the variance of the gradient estimate. Likewise, the addition of $\omega \doteq \mathbb{E}[\delta]$ in Variance Minimization leaves the optimal policy unchanged while stabilizing the updates, reducing their variance, and accelerating convergence.
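The variance-reduction effect of a baseline can be illustrated with a toy numerical sketch, under the simplifying assumption that the score $\nabla\log\pi$ and the sampled return are independent; the variable names below are hypothetical.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Toy score-function estimator g = grad_log_pi * G.
# Subtracting a baseline b from the return G leaves
# the mean unchanged (E[grad_log_pi] = 0) but
# shrinks the variance, analogous to subtracting
# omega = E[delta] in Variance Minimization.
grad_log_pi = rng.normal(size=100_000)
returns = 5.0 + rng.normal(size=100_000)
baseline = returns.mean()

plain = grad_log_pi * returns
centered = grad_log_pi * (returns - baseline)

print(plain.mean(), centered.mean())  # both ~ 0
print(plain.var(), centered.var())    # ~26 vs ~1
\end{verbatim}
In this toy setting the baseline leaves the mean of the estimator unchanged while reducing its variance by more than an order of magnitude; the role of $\omega \doteq \mathbb{E}[\delta]$ in Variance Minimization is analogous.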