\section{Related Work}

\subsection{Difference between VMQ and R-learning}
The update formula of tabular VMQ bears some resemblance to that of R-learning. As shown in Table \ref{differenceRandVMQ}, the two update formulas differ in the following ways:
\\(1) The goal of R-learning \cite{schwartz1993reinforcement} is to maximize the average reward rather than the cumulative reward, by learning an estimate $m$ of the average reward and using this estimate to update the Q-values. In contrast, the $\omega$ in the tabular VMQ update formula eventually converges to $\mathbb{E}[\delta]$.
\\(2) When $\gamma=1$, the tabular VMQ update formula takes the same form as the R-learning update formula. In this formal sense, R-learning can be regarded as a special case of VMQ (a sketch of this correspondence is given at the end of this section).

\subsection{Variance Reduction for TD Learning}
The TD with centering (CTD) algorithm \cite{korda2015td} directly applies variance reduction techniques to TD learning: it updates its parameters using the average gradient of a batch of Markovian samples together with a projection operator. Unfortunately, the analysis of the CTD algorithm contains technical errors. The VRTD algorithm \cite{xu2020reanalysis} is also a variance-reduced algorithm, but it updates its parameters using the average gradient of a batch of i.i.d.\ samples, and its authors provide a technically sound analysis demonstrating the advantages of variance reduction.

\subsection{Variance Reduction for Policy Gradient Algorithms}
Policy gradient algorithms are a class of reinforcement learning algorithms that directly optimize the cumulative reward. REINFORCE is a Monte Carlo algorithm that estimates the gradient through sampling, but its gradient estimates may have high variance. Baselines were introduced to reduce this variance and accelerate learning \cite{Sutton2018book}. Actor-Critic methods use the value function as a baseline together with bootstrapping, which further reduces variance and accelerates convergence \cite{Sutton2018book}. TRPO \cite{schulman2015trust} and PPO \cite{schulman2017proximal} use generalized advantage estimation, which combines multi-step bootstrapping with Monte Carlo estimation to reduce variance, making gradient estimation more stable and convergence faster.

In Variance Minimization, the incorporation of $\omega \doteq \mathbb{E}[\delta]$ closely resembles the use of a baseline in policy gradient methods. Introducing a baseline in policy gradient methods does not change the expected value of the update, but it can substantially reduce the variance of the gradient estimate. Likewise, introducing $\omega \doteq \mathbb{E}[\delta]$ in Variance Minimization preserves the optimal policy while stabilizing gradient estimation, reducing its variance, and accelerating convergence.
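To make this analogy concrete, recall the standard argument that a baseline does not bias the policy gradient (a sketch following \cite{Sutton2018book}, with generic notation: $\pi_{\theta}$ is the parameterized policy and $b(s)$ an arbitrary state-dependent baseline):
\begin{align*}
\mathbb{E}_{a \sim \pi_{\theta}(\cdot \mid s)}\!\big[\nabla_{\theta}\log\pi_{\theta}(a \mid s)\, b(s)\big]
&= b(s)\sum_{a}\pi_{\theta}(a \mid s)\,\frac{\nabla_{\theta}\pi_{\theta}(a \mid s)}{\pi_{\theta}(a \mid s)} \\
&= b(s)\,\nabla_{\theta}\sum_{a}\pi_{\theta}(a \mid s)
= b(s)\,\nabla_{\theta} 1 = 0.
\end{align*}
Hence subtracting $b(s)$ from the return affects only the variance of the gradient estimate, not its expectation, which parallels the role of $\omega \doteq \mathbb{E}[\delta]$ in Variance Minimization described above.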
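For point (2) of the first subsection, the formal correspondence can be sketched as follows; this is a simplified restatement with generic step sizes $\alpha$ and $\beta$, and the authoritative forms are those in Table \ref{differenceRandVMQ}:
\begin{align*}
\text{R-learning:}\quad & Q(s,a) \leftarrow Q(s,a) + \alpha\big(r - m + \max_{a'}Q(s',a') - Q(s,a)\big), \\
& m \leftarrow m + \beta\big(r + \max_{a'}Q(s',a') - Q(s,a) - m\big), \\
\text{tabular VMQ:}\quad & Q(s,a) \leftarrow Q(s,a) + \alpha\,(\delta - \omega), \qquad
\omega \leftarrow \omega + \beta\,(\delta - \omega),
\end{align*}
where $\delta = r + \gamma\max_{a'}Q(s',a') - Q(s,a)$. Setting $\gamma = 1$ makes the VMQ updates identical in form to the R-learning updates, with $\omega$ playing the role of $m$ (in R-learning, the $m$-update is typically applied only after greedy actions).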
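The batch-averaging idea mentioned for CTD and VRTD can likewise be summarized by a generic batch semi-gradient TD(0) update; this sketch reflects only the batch-averaging aspect described above and omits CTD's projection operator and the further algorithmic details of both methods ($V_{\theta}$ denotes the parameterized value function and $M$ the batch size):
\begin{equation*}
\theta \leftarrow \theta + \alpha\,\frac{1}{M}\sum_{i=1}^{M}\big(r_i + \gamma V_{\theta}(s_i') - V_{\theta}(s_i)\big)\,\nabla_{\theta} V_{\theta}(s_i),
\end{equation*}
so that averaging over a batch reduces the variance of each parameter update relative to a single-sample TD step.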