\section{Introduction} \label{introduction}
Reinforcement learning methods can be broadly divided into two categories: value-based methods and policy-gradient methods. This paper focuses on temporal difference (TD) learning with linear value function approximation. Research on such algorithms typically proceeds in two steps: first establishing the convergence of an algorithm, and then accelerating it.

In terms of stability, \citet{sutton1988learning} established the convergence of on-policy TD(0), and \citet{tsitsiklis1997analysis} established the convergence of on-policy TD($\lambda$). However, the ``deadly triad'' of off-policy learning, bootstrapping, and function approximation makes stability a difficult problem \citep{Sutton2018book}. To address it, convergent off-policy temporal difference learning algorithms have been proposed, e.g., BR \citep{baird1995residual}, GTD \citep{sutton2008convergent}, GTD2 and TDC \citep{sutton2009fast}, ETD \citep{sutton2016emphatic}, and MRetrace \citep{chen2023modified}.

In terms of acceleration, \citet{hackman2012faster} proposed the Hybrid TD algorithm with an on-policy matrix. \citet{liu2015finite,liu2016proximal,liu2018proximal} proposed true stochastic algorithms, i.e., GTD-MP and GTD2-MP, from a convex-concave saddle-point formulation. Second-order methods have also been used to accelerate TD learning, e.g., quasi-Newton TD \citep{givchi2015quasi} and accelerated TD (ATD) \citep{pan2017accelerated}. \citet{hallak2016generalized} introduced a new parameter to reduce the variance of ETD, and \citet{zhang2022truncated} proposed a truncated ETD with lower variance. Variance-reduced TD, based on the direct variance reduction technique of \citet{johnson2013accelerating}, was proposed by \citet{korda2015td} and analyzed by \citet{xu2019reanalysis}. How to further improve the convergence rate of reinforcement learning algorithms remains an open problem.

Algorithm stability is prominently reflected in the evolution of the objective function, from the mean squared error (MSE) \citep{Sutton2018book} to the mean squared Bellman error (MSBE) \citep{baird1995residual}, then to the norm of the expected TD update (NEU) \citep{sutton2009fast}, and further to the mean squared projected Bellman error (MSPBE) \citep{sutton2009fast}. Algorithm acceleration, in contrast, centers on optimizing the iterative update rule itself without altering the objective function, thereby speeding up convergence. New objective functions often lead to the development of novel algorithms, and new algorithms in turn inspire research on acceleration, resulting in an iterative cycle that produces increasingly effective algorithms.

The kernel loss can be optimized with standard gradient-based methods, addressing the double-sampling issue of the residual gradient algorithm \citep{feng2019kernel}; it ensures convergence in both on-policy and off-policy settings. The logistic Bellman error is convex and smooth in the action-value function parameters and has bounded gradients \citep{basserrano2021logistic}. In contrast, the squared Bellman error is not convex in the action-value function parameters, and RL algorithms based on recursively optimizing it are known to be unstable.
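For reference, under linear function approximation $V_\theta = \Phi\theta$ these error-based objectives take their standard forms (notation as in \citet{sutton2009fast}: $v^\pi$ is the true value function, $D$ the diagonal matrix of the state distribution, $T^\pi$ the Bellman operator, $\Pi$ the projection onto the span of the features, $\delta$ the TD error, and $\phi$ the feature vector):
\begin{align*}
\text{MSE}(\theta) &= \lVert V_\theta - v^\pi \rVert_D^2, \\
\text{MSBE}(\theta) &= \lVert V_\theta - T^\pi V_\theta \rVert_D^2, \\
\text{NEU}(\theta) &= \mathbb{E}[\delta\phi]^\top \mathbb{E}[\delta\phi], \\
\text{MSPBE}(\theta) &= \lVert V_\theta - \Pi T^\pi V_\theta \rVert_D^2
= \mathbb{E}[\delta\phi]^\top \mathbb{E}[\phi\phi^\top]^{-1} \mathbb{E}[\delta\phi].
\end{align*}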
It is therefore natural to seek a new objective function, yet the objective functions mentioned above are all some form of error. Is minimizing an error the only option for value-based reinforcement learning? For policy evaluation, differences in objective functions may lead to inconsistent fixed points, which makes it difficult to compare, on a common footing, algorithms derived from different objectives. For control, however, action selection depends only on the relative values of the Q estimates rather than their absolute values, so a bias in the solution is acceptable, as formalized below.

Based on this observation, we propose alternative objective functions instead of minimizing errors. We minimize the Variance of the Projected Bellman Error (VPBE) and derive Variance Minimization (VM) algorithms. These algorithms preserve the invariance of the optimal policy in control environments while significantly reducing the variance of the gradient estimates, thereby accelerating convergence.

The contributions of this paper are as follows: (1) introduction of novel objective functions based on the invariance of the optimal policy; (2) two off-policy variance minimization algorithms; (3) proofs of their convergence; and (4) experiments demonstrating the faster convergence of the proposed algorithms.
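The invariance of the optimal policy invoked above rests on an elementary identity: greedy action selection is unaffected by any action-independent shift of the action values,
\[
\arg\max_{a} \bigl( Q(s,a) + c(s) \bigr) = \arg\max_{a} Q(s,a) \quad \text{for any } c(s) \in \mathbb{R},
\]
so a value estimate that is biased by such a shift induces the same greedy policy as the unbiased estimate.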