\section{Related Work}

\subsection{Difference between VMQ and R-learning}
The update formula of tabular VMQ bears some resemblance to that of R-learning. As shown in Table \ref{differenceRandVMQ}, the two update formulas differ in the following ways:
\\(1) The goal of R-learning \cite{schwartz1993reinforcement} is to maximize the average reward rather than the cumulative reward, by learning an estimate $m$ of the average reward and using this estimate to update the Q-values. In contrast, the $\omega$ in the tabular VMQ update formula eventually converges to $\mathbb{E}[\delta]$.
\\(2) When $\gamma=1$, the tabular VMQ update formula takes the same form as the R-learning update formula. In this formal sense, R-learning can be regarded as a special case of VMQ (a sketch of this correspondence is given at the end of this section).

\subsection{Variance Reduction for TD Learning}
The TD with centering (CTD) algorithm \cite{korda2015td} directly applies variance reduction techniques to TD learning: it updates its parameters using the average gradient of a batch of Markovian samples together with a projection operator. Unfortunately, the analysis of the CTD algorithm contains technical errors. The VRTD algorithm \cite{xu2020reanalysis} is also a variance-reduced algorithm, but it updates its parameters using the average gradient of a batch of i.i.d.\ samples, and its authors provide a technically sound analysis demonstrating the advantages of variance reduction.

\subsection{Variance Reduction for Policy Gradient Algorithms}
Policy gradient algorithms are a class of reinforcement learning algorithms that directly optimize the cumulative reward. REINFORCE is a Monte Carlo algorithm that estimates the gradient through sampling, but its gradient estimates may have high variance. Baselines were introduced to reduce this variance and accelerate learning \cite{Sutton2018book}. Actor-Critic methods use the value function as a baseline together with bootstrapping, which further reduces variance and accelerates convergence \cite{Sutton2018book}. TRPO \cite{schulman2015trust} and PPO \cite{schulman2017proximal} use generalized advantage estimation, which combines multi-step bootstrapping with Monte Carlo estimation to reduce variance, making gradient estimation more stable and convergence faster.

In Variance Minimization, the incorporation of $\omega \doteq \mathbb{E}[\delta]$ closely resembles the use of a baseline in policy gradient methods. Introducing a baseline in policy gradient methods does not change the expected value of the update, but it can substantially reduce the variance of the gradient estimate. Likewise, introducing $\omega \doteq \mathbb{E}[\delta]$ in Variance Minimization preserves the optimal policy while stabilizing gradient estimation, reducing its variance, and accelerating convergence.
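To make this analogy concrete, recall the standard argument that a baseline does not bias the policy gradient (a sketch following \cite{Sutton2018book}, with generic notation: $\pi_{\theta}$ is the parameterized policy and $b(s)$ an arbitrary state-dependent baseline):
\begin{align*}
\mathbb{E}_{a \sim \pi_{\theta}(\cdot \mid s)}\!\big[\nabla_{\theta}\log\pi_{\theta}(a \mid s)\, b(s)\big]
&= b(s)\sum_{a}\pi_{\theta}(a \mid s)\,\frac{\nabla_{\theta}\pi_{\theta}(a \mid s)}{\pi_{\theta}(a \mid s)} \\
&= b(s)\,\nabla_{\theta}\sum_{a}\pi_{\theta}(a \mid s)
= b(s)\,\nabla_{\theta} 1 = 0.
\end{align*}
Hence subtracting $b(s)$ from the return affects only the variance of the gradient estimate, not its expectation, which parallels the role of $\omega \doteq \mathbb{E}[\delta]$ in Variance Minimization described above.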
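For point (2) of the first subsection, the formal correspondence can be sketched as follows; this is a simplified restatement with generic step sizes $\alpha$ and $\beta$, and the authoritative forms are those in Table \ref{differenceRandVMQ}:
\begin{align*}
\text{R-learning:}\quad & Q(s,a) \leftarrow Q(s,a) + \alpha\big(r - m + \max_{a'}Q(s',a') - Q(s,a)\big), \\
& m \leftarrow m + \beta\big(r + \max_{a'}Q(s',a') - Q(s,a) - m\big), \\
\text{tabular VMQ:}\quad & Q(s,a) \leftarrow Q(s,a) + \alpha\,(\delta - \omega), \qquad
\omega \leftarrow \omega + \beta\,(\delta - \omega),
\end{align*}
where $\delta = r + \gamma\max_{a'}Q(s',a') - Q(s,a)$. Setting $\gamma = 1$ makes the VMQ updates identical in form to the R-learning updates, with $\omega$ playing the role of $m$ (in R-learning, the $m$-update is typically applied only after greedy actions).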
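The batch-averaging idea mentioned for CTD and VRTD can likewise be summarized by a generic batch semi-gradient TD(0) update; this sketch reflects only the batch-averaging aspect described above and omits CTD's projection operator and the further algorithmic details of both methods ($V_{\theta}$ denotes the parameterized value function and $M$ the batch size):
\begin{equation*}
\theta \leftarrow \theta + \alpha\,\frac{1}{M}\sum_{i=1}^{M}\big(r_i + \gamma V_{\theta}(s_i') - V_{\theta}(s_i)\big)\,\nabla_{\theta} V_{\theta}(s_i),
\end{equation*}
so that averaging over a batch reduces the variance of each parameter update relative to a single-sample TD step.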