\section{Introduction} \label{introduction}
Reinforcement learning methods can be broadly divided into two categories: value-based methods and policy-gradient methods. This paper focuses on temporal difference (TD) learning with linear value function approximation. Research on such algorithms typically proceeds in two steps: first establishing the convergence of an algorithm, and then accelerating it.

In terms of stability, \citet{sutton1988learning} established the convergence of on-policy TD(0), and \citet{tsitsiklis1997analysis} established the convergence of on-policy TD($\lambda$). However, the ``deadly triad'' of off-policy learning, bootstrapping, and function approximation makes stability a difficult problem \citep{Sutton2018book}. To address it, convergent off-policy temporal difference learning algorithms have been proposed, e.g., BR \citep{baird1995residual}, GTD \citep{sutton2008convergent}, GTD2 and TDC \citep{sutton2009fast}, ETD \citep{sutton2016emphatic}, and MRetrace \citep{chen2023modified}.

In terms of acceleration, \citet{hackman2012faster} proposed the Hybrid TD algorithm with an on-policy matrix. \citet{liu2015finite,liu2016proximal,liu2018proximal} proposed true stochastic algorithms, i.e., GTD-MP and GTD2-MP, from a convex-concave saddle-point formulation. Second-order methods have also been used to accelerate TD learning, e.g., quasi-Newton TD \citep{givchi2015quasi} and accelerated TD (ATD) \citep{pan2017accelerated}. \citet{hallak2016generalized} introduced a new parameter to reduce the variance of ETD, and \citet{zhang2022truncated} proposed a truncated ETD with lower variance. Variance-reduced TD, based on the direct variance reduction technique of \citet{johnson2013accelerating}, was proposed by \citet{korda2015td} and analyzed by \citet{xu2019reanalysis}. How to further improve the convergence rate of reinforcement learning algorithms remains an open problem.

Algorithm stability is prominently reflected in the evolution of the objective function, from the mean squared error (MSE) \citep{Sutton2018book} to the mean squared Bellman error (MSBE) \citep{baird1995residual}, then to the norm of the expected TD update (NEU) \citep{sutton2009fast}, and further to the mean squared projected Bellman error (MSPBE) \citep{sutton2009fast}. Algorithm acceleration, in contrast, centers on optimizing the iterative update rule itself without altering the objective function, thereby speeding up convergence. New objective functions often lead to the development of novel algorithms, and new algorithms in turn inspire research on acceleration, resulting in an iterative cycle that produces increasingly effective algorithms.

The kernel loss can be optimized with standard gradient-based methods, addressing the double-sampling issue of the residual gradient algorithm \citep{feng2019kernel}; it ensures convergence in both on-policy and off-policy settings. The logistic Bellman error is convex and smooth in the action-value function parameters and has bounded gradients \citep{basserrano2021logistic}. In contrast, the squared Bellman error is not convex in the action-value function parameters, and RL algorithms based on recursively optimizing it are known to be unstable.
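For reference, under linear function approximation $V_\theta = \Phi\theta$ these error-based objectives take their standard forms (notation as in \citet{sutton2009fast}: $v^\pi$ is the true value function, $D$ the diagonal matrix of the state distribution, $T^\pi$ the Bellman operator, $\Pi$ the projection onto the span of the features, $\delta$ the TD error, and $\phi$ the feature vector):
\begin{align*}
\text{MSE}(\theta) &= \lVert V_\theta - v^\pi \rVert_D^2, \\
\text{MSBE}(\theta) &= \lVert V_\theta - T^\pi V_\theta \rVert_D^2, \\
\text{NEU}(\theta) &= \mathbb{E}[\delta\phi]^\top \mathbb{E}[\delta\phi], \\
\text{MSPBE}(\theta) &= \lVert V_\theta - \Pi T^\pi V_\theta \rVert_D^2
= \mathbb{E}[\delta\phi]^\top \mathbb{E}[\phi\phi^\top]^{-1} \mathbb{E}[\delta\phi].
\end{align*}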
It is therefore natural to seek a new objective function, yet the objective functions mentioned above are all some form of error. Is minimizing an error the only option for value-based reinforcement learning? For policy evaluation, differences in objective functions may lead to inconsistent fixed points, which makes it difficult to compare, on a common footing, algorithms derived from different objectives. For control, however, action selection depends only on the relative values of the Q estimates rather than their absolute values, so a bias in the solution is acceptable, as formalized below.

Based on this observation, we propose alternative objective functions instead of minimizing errors. We minimize the Variance of the Projected Bellman Error (VPBE) and derive Variance Minimization (VM) algorithms. These algorithms preserve the invariance of the optimal policy in control environments while significantly reducing the variance of the gradient estimates, thereby accelerating convergence.

The contributions of this paper are as follows: (1) introduction of novel objective functions based on the invariance of the optimal policy; (2) two off-policy variance minimization algorithms; (3) proofs of their convergence; and (4) experiments demonstrating the faster convergence of the proposed algorithms.
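The invariance of the optimal policy invoked above rests on an elementary identity: greedy action selection is unaffected by any action-independent shift of the action values,
\[
\arg\max_{a} \bigl( Q(s,a) + c(s) \bigr) = \arg\max_{a} Q(s,a) \quad \text{for any } c(s) \in \mathbb{R},
\]
so a value estimate that is biased by such a shift induces the same greedy policy as the unbiased estimate.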