\begin{figure*}[htb]
\vskip 0.2in
\begin{center}
\subfigure[Maze]{
\includegraphics[width=0.3\columnwidth, height=0.2\columnwidth]{main/pic/maze_complete.pdf}
\label{MazeFull}
}
\subfigure[Cliff Walking]{
\includegraphics[width=0.3\columnwidth, height=0.2\columnwidth]{main/pic/cw_complete.pdf}
\label{CliffWalkingFull}
}
\\
\subfigure[Mountain Car]{
\includegraphics[width=0.3\columnwidth, height=0.2\columnwidth]{main/pic/mt_complete.pdf}
\label{MountainCarFull}
}
\subfigure[Acrobot]{
\includegraphics[width=0.3\columnwidth, height=0.2\columnwidth]{main/pic/Acrobot_complete.pdf}
\label{AcrobotFull}
}
\caption{Learning curves on four control environments.}
\label{Complete_full}
\end{center}
\vskip -0.2in
\end{figure*}
\section{Related Work}
\subsection{Difference between VMQ and R-learning}
\begin{table*}[htb]
\centering
\caption{Difference between R-learning and tabular VMQ.}
\vskip 0.15in
\begin{tabular}{c|c}
\hline
Algorithm & Update formula \\
\hline
R-learning & $Q_{k+1}(s,a)\leftarrow Q_{k}(s,a)+\alpha_k\big(r_{k+1}-m_{k}+\max_{b\in A}Q_{k}(s',b)-Q_{k}(s,a)\big)$ \\
 & $m_{k+1}\leftarrow m_{k}+\beta_k\big(r_{k+1}+\max_{b\in A}Q_{k}(s',b)-Q_{k}(s,a)-m_{k}\big)$ \\
tabular VMQ & $Q_{k+1}(s,a)\leftarrow Q_{k}(s,a)+\alpha_k\big(r_{k+1}+\gamma\max_{b\in A}Q_{k}(s',b)-Q_{k}(s,a)-\omega_k\big)$ \\
 & $\omega_{k+1}\leftarrow \omega_{k}+\beta_k\big(r_{k+1}+\gamma\max_{b\in A}Q_{k}(s',b)-Q_{k}(s,a)-\omega_{k}\big)$ \\
\hline
\end{tabular}
\label{differenceRandVMQ}
\vskip -0.1in
\end{table*}
The update formula of tabular VMQ bears some resemblance to that of R-learning. As shown in Table~\ref{differenceRandVMQ}, the two update rules differ in the following ways:
\\(1) The goal of R-learning \cite{schwartz1993reinforcement} is to maximize the average reward rather than the cumulative reward, so it learns an estimate $m$ of the average reward, which is then used to update the Q-values. In contrast, the quantity $\omega$ in the tabular VMQ update eventually converges to $\mathbb{E}[\delta]$, the expected TD error.
\\(2) When $\gamma=1$, the tabular VMQ update takes the same form as the R-learning update, so R-learning can be regarded, in form, as a special case of VMQ.
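For concreteness, the two update rules in Table~\ref{differenceRandVMQ} can be sketched in Python as follows. This is an illustrative sketch only, assuming a 2-D array \texttt{Q} indexed by state and action; names such as \texttt{alpha}, \texttt{beta}, and \texttt{s\_next} are ours and are not taken from any released implementation.
\begin{verbatim}
import numpy as np

def r_learning_update(Q, m, s, a, r, s_next,
                      alpha, beta):
    # R-learning: m estimates the average reward.
    # All terms use the pre-update table Q_k.
    target = np.max(Q[s_next])
    q_sa = Q[s, a]
    Q[s, a] += alpha * (r - m + target - q_sa)
    m += beta * (r + target - q_sa - m)
    return Q, m

def tabular_vmq_update(Q, omega, s, a, r, s_next,
                       alpha, beta, gamma):
    # Tabular VMQ: omega tracks E[delta],
    # the expected TD error.
    delta = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * (delta - omega)
    omega += beta * (delta - omega)
    return Q, omega
\end{verbatim}
With $\gamma=1$ the two updates coincide in form, matching point (2) above.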
\subsection{Variance Reduction for TD Learning}
The TD with centering (CTD) algorithm \cite{korda2015td} directly applies variance reduction techniques to the TD algorithm: it updates its parameters using the average gradient of a batch of Markovian samples together with a projection operator. Unfortunately, the authors' analysis of the CTD algorithm contains technical errors. The VRTD algorithm \cite{xu2020reanalysis} is also a variance-reduced algorithm; it updates its parameters with the average gradient of a batch of i.i.d.\ samples, and its authors provide a technically sound analysis demonstrating the advantages of variance reduction.
\subsection{Variance Reduction for Policy Gradient Algorithms}
Policy gradient algorithms are a class of reinforcement learning algorithms that directly optimize the cumulative reward. REINFORCE is a Monte Carlo algorithm that estimates the gradient through sampling, but its gradient estimates can have high variance. Baselines are introduced to reduce this variance and to accelerate learning \cite{Sutton2018book}. In Actor-Critic methods, using the value function as a baseline together with bootstrapping likewise reduces variance and accelerates convergence \cite{Sutton2018book}.
TRPO \cite{schulman2015trust} and PPO \cite{schulman2017proximal} use generalized advantage estimation, which combines multi-step bootstrapping with Monte Carlo estimation to reduce variance, making gradient estimation more stable and accelerating convergence. In Variance Minimization, the incorporation of $\omega \doteq \mathbb{E}[\delta]$ closely resembles the use of a baseline in policy gradient methods. Introducing a baseline does not change the expected value of the policy gradient update, but it can substantially reduce the variance of the gradient estimate. Likewise, the addition of $\omega \doteq \mathbb{E}[\delta]$ in Variance Minimization leaves the optimal policy unchanged while stabilizing the updates, reducing their variance, and accelerating convergence.
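The variance-reduction effect of a baseline can be illustrated with a toy numerical sketch, under the simplifying assumption that the score $\nabla\log\pi$ and the sampled return are independent; the variable names below are hypothetical.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Toy score-function estimator g = grad_log_pi * G.
# Subtracting a baseline b from the return G leaves
# the mean unchanged (E[grad_log_pi] = 0) but
# shrinks the variance, analogous to subtracting
# omega = E[delta] in Variance Minimization.
grad_log_pi = rng.normal(size=100_000)
returns = 5.0 + rng.normal(size=100_000)
baseline = returns.mean()

plain = grad_log_pi * returns
centered = grad_log_pi * (returns - baseline)

print(plain.mean(), centered.mean())  # both ~ 0
print(plain.var(), centered.var())    # ~26 vs ~1
\end{verbatim}
In this toy setting the baseline leaves the mean of the estimator unchanged while reducing its variance by more than an order of magnitude; the role of $\omega \doteq \mathbb{E}[\delta]$ in Variance Minimization is analogous.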