%%%%%%%% ICML 2024 EXAMPLE LATEX SUBMISSION FILE %%%%%%%%%%%%%%%%%

\documentclass{article}

% Recommended, but optional, packages for figures and better typesetting:
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{diagbox}
\usepackage{wrapfig}
\usepackage{booktabs} % for professional tables

% hyperref makes hyperlinks in the resulting PDF.
% If your build breaks (sometimes temporarily if a hyperlink spans a page)
% please comment out the following usepackage line and replace
% \usepackage{icml2024} with \usepackage[nohyperref]{icml2024} above.
\usepackage{hyperref}

% Attempt to make hyperref and algorithmic work together better:
\newcommand{\theHalgorithm}{\arabic{algorithm}}

% Use the following line for the initial blind version submitted for review:
\usepackage{icml2024}

% If accepted, instead use the following line for the camera-ready submission:
% \usepackage[accepted]{icml2024}

% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}

% if you use cleveref..
\usepackage[capitalize,noabbrev]{cleveref}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

% Todonotes is useful during development; simply uncomment the next line
% and comment out the line below the next line to turn off comments
%\usepackage[disable,textsize=tiny]{todonotes}
\usepackage[textsize=tiny]{todonotes}

% The \icmltitle you define below is probably too long as a header.
% Therefore, a short form for the running title is supplied here:
\icmltitlerunning{Is Minimizing Errors the Only Option for Value-based RL?}

\begin{document}

\twocolumn[
\icmltitle{Is Minimizing Errors the Only Option for Value-based Reinforcement Learning?}

% It is OKAY to include author information, even for blind
% submissions: the style file will automatically remove it for you
% unless you've provided the [accepted] option to the icml2024
% package.

% List of affiliations: The first argument should be a (short)
% identifier you will use later to specify author affiliations
% Academic affiliations should list Department, University, City, Region, Country
% Industry affiliations should list Company, City, Region, Country
% You can specify symbols, otherwise they are numbered in order.
% Ideally, you should not use this facility. Affiliations will be numbered
% in order of appearance and this is the preferred way.
\icmlsetsymbol{equal}{*}

\begin{icmlauthorlist}
\icmlauthor{Firstname1 Lastname1}{equal,yyy}
\icmlauthor{Firstname2 Lastname2}{equal,yyy,comp}
\icmlauthor{Firstname3 Lastname3}{comp}
\icmlauthor{Firstname4 Lastname4}{sch}
\icmlauthor{Firstname5 Lastname5}{yyy}
\icmlauthor{Firstname6 Lastname6}{sch,yyy,comp}
\icmlauthor{Firstname7 Lastname7}{comp}
%\icmlauthor{}{sch}
\icmlauthor{Firstname8 Lastname8}{sch}
\icmlauthor{Firstname8 Lastname8}{yyy,comp}
%\icmlauthor{}{sch}
%\icmlauthor{}{sch}
\end{icmlauthorlist}

\icmlaffiliation{yyy}{Department of XXX, University of YYY, Location, Country}
\icmlaffiliation{comp}{Company Name, Location, Country}
\icmlaffiliation{sch}{School of ZZZ, Institute of WWW, Location, Country}

\icmlcorrespondingauthor{Firstname1 Lastname1}{first1.last1@xxx.edu}
\icmlcorrespondingauthor{Firstname2 Lastname2}{first2.last2@www.uk}

% You may provide any keywords that you
% find helpful for describing your paper; these are used to populate
% the "keywords" metadata in the PDF but will not be shown in the document
\icmlkeywords{Machine Learning, ICML}

\vskip 0.3in
]

% this must go after the closing bracket ] following \twocolumn[ ...

% This command actually creates the footnote in the first column
% listing the affiliations and the copyright notice.
% The command takes one argument, which is text to display at the start of the footnote.
% The \icmlEqualContribution command is standard text for equal contribution.
% Remove it (just {}) if you do not need this facility.

%\printAffiliationsAndNotice{}  % leave blank if no need to mention equal contribution
\printAffiliationsAndNotice{\icmlEqualContribution} % otherwise use the standard text.

\begin{abstract}
In the regression setting of supervised learning, the goal is to minimize prediction error while trading off variance. Following this idea, existing work on value-based reinforcement learning also minimizes some form of error. However, is error minimization really the only option for value-based reinforcement learning? We observe that the action-selection probabilities of a policy typically depend only on the relative values of actions, not on their absolute values. Based on this observation, we propose variance minimization instead of error minimization as the objective, derive on-policy and off-policy algorithms, and provide an analysis of the convergence rate together with experiments. The experimental results show that the proposed variance-minimization algorithms converge much faster.
\end{abstract}

\section{Introduction}
\label{introduction}
Reinforcement learning methods can be broadly divided into two categories: value-based methods and policy-gradient methods. This paper focuses on temporal difference (TD) learning with linearly approximated value functions. Research in this area typically proceeds in two steps: first establish the convergence of an algorithm, and then accelerate it.

In terms of stability, \citet{sutton1988learning} established the convergence of on-policy TD(0), and \citet{tsitsiklis1997analysis} established the convergence of on-policy TD($\lambda$). However, the ``deadly triad'' of off-policy learning, bootstrapping, and function approximation makes stability a difficult problem \citep{Sutton2018book}.
To solve this problem, convergent off-policy temporal difference learning algorithms have been proposed, e.g., BR \cite{baird1995residual}, GTD \cite{sutton2008convergent}, GTD2 and TDC \cite{sutton2009fast}, ETD \cite{sutton2016emphatic}, and MRetrace \cite{chen2023modified}.

In terms of acceleration, \citet{hackman2012faster} proposed the Hybrid TD algorithm with an on-policy matrix. \citet{liu2015finite,liu2016proximal,liu2018proximal} proposed true stochastic algorithms, i.e., GTD-MP and GTD2-MP, from a convex-concave saddle-point formulation. Second-order methods have been used to accelerate TD learning, e.g., Quasi-Newton TD \cite{givchi2015quasi} and accelerated TD (ATD) \citep{pan2017accelerated}. \citet{hallak2016generalized} introduced a new parameter to reduce the variance of ETD. \citet{zhang2022truncated} proposed truncated ETD with lower variance. Variance-reduced TD with a direct variance reduction technique \citep{johnson2013accelerating} was proposed by \citet{korda2015td} and analyzed by \citet{xu2019reanalysis}. How to further improve the convergence rate of reinforcement learning algorithms remains an open problem.

Algorithmic stability is prominently reflected in changes to the objective function, transitioning from the mean squared error (MSE) \citep{Sutton2018book} to the mean squared Bellman error (MSBE) \cite{baird1995residual}, then to the norm of the expected TD update \cite{sutton2009fast}, and further to the mean squared projected Bellman error (MSPBE) \cite{sutton2009fast}. Acceleration, on the other hand, centers on optimizing the iterative update formula of an algorithm without altering its objective function, thereby speeding up convergence. New optimization objectives often lead to novel algorithms, and new algorithms in turn inspire methods for accelerating them, leading to the iterative creation of increasingly effective algorithms.

The kernel loss function can be optimized with standard gradient-based methods, addressing the double-sampling issue of the residual gradient algorithm \cite{feng2019kernel}, and it ensures convergence in both on-policy and off-policy scenarios. The logistic Bellman error is convex and smooth in the action-value function parameters and has bounded gradients \cite{basserrano2021logistic}. In contrast, the squared Bellman error is not convex in the action-value function parameters, and RL algorithms based on recursively optimizing it are known to be unstable.

It is therefore natural to seek a new objective function; however, all of the objective functions mentioned above are some form of error. Is minimizing error the only option for value-based reinforcement learning?

Error can be decomposed into bias, variance, and unavoidable noise. Bias measures the difference between the model's predictions and the true values, reflecting the model's fitting ability. Variance quantifies the model's sensitivity to different training data, indicating its stability and generalization ability.
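For reference, the standard decomposition behind this terminology (see, e.g., \citet{zhou2021machine}) writes the expected squared prediction error of an estimator $\hat{f}$ of a target $y=f(x)+\epsilon$ with noise variance $\sigma^2$ as
\begin{equation*}
\mathbb{E}\big[(\hat{f}(x)-y)^2\big]
=\underbrace{\big(\mathbb{E}[\hat{f}(x)]-f(x)\big)^2}_{\text{bias}^2}
+\underbrace{\mathbb{E}\big[(\hat{f}(x)-\mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+\underbrace{\sigma^2}_{\text{noise}},
\end{equation*}
where the expectation is taken over the randomness of the training data and the noise.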
Balancing bias and variance is important, as they are in a trade-off relationship \cite{zhou2021machine}. In the setting of this paper, where only a linear model is considered and the model complexity is not adjusted, it is difficult to reduce the bias. High bias indicates that the model fits the training data poorly, i.e., underfitting, which is generally unacceptable in supervised learning. In reinforcement learning, however, high bias may be acceptable in certain cases. This is because policies derived from value functions, such as greedy, $\epsilon$-greedy, and softmax policies, rely on the relative values of actions rather than their absolute values when selecting actions.

Based on this observation, we propose alternative objective functions instead of minimizing errors. We minimize the Variance of the Bellman Error (VBE) and the Variance of the Projected Bellman Error (VPBE), and derive Variance Minimization (VM) algorithms. These algorithms preserve the invariance of the optimal policy, but significantly reduce the variance of gradient estimation and thus accelerate convergence.

The contributions of this paper are as follows: (1) introduction of novel objective functions based on the invariance of the optimal policy; (2) derivation of two algorithms, one on-policy and one off-policy; (3) proof of their convergence; (4) analysis of the convergence rate of the on-policy algorithm; (5) experiments demonstrating the faster convergence of the proposed algorithms.

\section{Preliminaries}
\label{preliminaries}
A reinforcement learning agent interacts with an environment: it observes states, makes sequential decisions that influence the environment, and obtains rewards. Consider an infinite-horizon discounted Markov Decision Process (MDP), defined by a tuple $\langle S,A,R,P,\gamma \rangle$, where $S=\{1,2,\ldots,N\}$ is a finite set of states of the environment; $A$ is a finite set of actions of the agent; $R:S\times A \times S \rightarrow \mathbb{R}$ is a bounded deterministic reward function; $P:S\times A\times S \rightarrow [0,1]$ is the transition probability distribution; and $\gamma\in (0,1)$ is the discount factor \cite{Sutton2018book}. Because of the requirements of online learning, this paper considers value iteration based on sampling. In each sampling step, an experience (or transition) $\langle s, a, s', r\rangle$ is obtained.

A policy is a mapping $\pi:S\times A \rightarrow [0,1]$. The goal of the agent is to find an optimal policy $\pi^*$ that maximizes the expected discounted cumulative reward over the long run. The state value function $V^{\pi}(s)$ of a stationary policy $\pi$ is defined as:
\begin{equation*}
V^{\pi}(s)=\mathbb{E}_{\pi}[\sum_{k=0}^{\infty} \gamma^k R_{k}|s_0=s].
\label{valuefunction}
\end{equation*}
The linear value function for state $s\in S$ is defined as:
\begin{equation}
V_{{\theta}}(s):= {\theta}^{\top}{\phi}(s) = \sum_{i=1}^{m} \theta_i \phi_i(s),
\label{linearvaluefunction}
\end{equation}
where ${\theta}:=(\theta_1,\theta_2,\ldots,\theta_m)^{\top}\in \mathbb{R}^m$ is a parameter vector, ${\phi}:=(\phi_1,\phi_2,\ldots,\phi_m)^{\top}\in \mathbb{R}^m$ is a feature function defined on the state space $S$, and $m$ is the number of features.

Tabular temporal difference (TD) learning \cite{Sutton2018book} has been successfully applied to small-scale problems. To deal with the well-known curse of dimensionality in large-scale MDPs, the value function is usually approximated by a linear model, kernel methods, decision trees, or neural networks, etc.
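As a concrete illustration of the linear parameterization in (\ref{linearvaluefunction}), the following minimal sketch evaluates $V_{\theta}(s)=\theta^{\top}\phi(s)$; the one-hot (tabular) feature map and the dimensions are illustrative only, not the features used in our experiments:
\begin{verbatim}
import numpy as np

def phi(s, m=5):
    # Illustrative tabular (one-hot) feature
    # vector for state s.
    x = np.zeros(m)
    x[s] = 1.0
    return x

def v(theta, s):
    # Linear value estimate: theta^T phi(s).
    return float(np.dot(theta, phi(s)))

theta = np.zeros(5)  # one parameter per feature
print(v(theta, 2))   # 0.0 before any learning
\end{verbatim}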
This paper focuses on the linear model, where features are usually hand-coded by domain experts.

TD learning can also be used to find optimal policies; this is often called the control problem. Two popular TD control methods are Sarsa and Q-learning: the former is an on-policy method, while the latter is off-policy. It is well known that the TDC algorithm \cite{sutton2009fast} guarantees convergence under off-policy conditions, whereas off-policy TD may diverge. The objective function of TDC is the MSPBE; TDC is essentially an adjustment or correction of the TD update so that it follows the gradient of the MSPBE. The control algorithm built on TDC is known as Greedy-GQ($\lambda$) \cite{sutton2009fast}; when $\lambda$ is set to 0, it is denoted GQ(0).

\section{Variance Minimization Algorithms}
\subsection{Motivation}
In reinforcement learning, bias can be acceptable, while in supervised learning it typically is not. As shown in Table \ref{example_bias}, although there is a bias between the true values and the predicted values, action $a_3$ is still chosen under the greedy policy. In contrast, supervised learning is often used to predict quantities such as temperature, humidity, or morbidity, where a large bias can have serious consequences.

\begin{table}[t]
\caption{Greedy action selection is unchanged when a constant bias is added to all action values.}
\label{example_bias}
\vskip 0.15in
\begin{center}
\begin{small}
\begin{sc}
\begin{tabular}{lcc}
\toprule
action & $Q$ value & $Q$ value with bias \\
\midrule
$Q(s, a_0)$ & 1& 5 \\
$Q(s, a_1)$ & 2& 6 \\
$Q(s, a_2)$ & 3& 7 \\
$Q(s, a_3)$ & 4& 8 \\
$\arg \max_{a}Q(s,a)$  & $a_3$& $a_3$\\
\bottomrule
\end{tabular}
\end{sc}
\end{small}
\end{center}
\vskip -0.1in
\end{table}

In addition, reward shaping can significantly speed up learning by adding a shaping reward $F(s,s')$ to the original reward $r$, where $F(s,s')$ is the general form of any state-based shaping reward. Static potential-based reward shaping (static PBRS) maintains policy invariance if the shaping reward has the form $F(s,s')=\gamma f(s')-f(s)$ \cite{ng1999policy}. This means that we can modify the TD error $\delta = r+\gamma \theta^{\top}\phi'-\theta^{\top}\phi$ while still preserving the optimal policy,
\begin{equation*}
\delta - \omega= r+\gamma \theta^{\top}\phi'-\theta^{\top}\phi  - \omega,
\end{equation*}
where $\omega$ is a constant acting as a static PBRS. This also means that algorithms whose optimization goal is to minimize errors may, after introducing reward shaping, end up with a larger or smaller bias. Fortunately, as discussed above, bias is acceptable in reinforcement learning. The problem, however, is that selecting an appropriate $\omega$ requires expert knowledge, which forces us to learn $\omega$ dynamically, i.e., $\omega=\omega_t$. Dynamic PBRS can also maintain policy invariance if the shaping reward is $F(s,t,s',t')=\gamma f(s',t')-f(s,t)$, where $t$ is the time step at which the agent reaches state $s$ \cite{devlin2012dynamic}. However, this result requires the convergence of the dynamic potential function $f(s,t)$: if $f(s,t)$ does not converge as $t\rightarrow\infty$, the Q-values under dynamic PBRS are not guaranteed to converge. Let $f_{\omega_t}(s)=\frac{\omega_t}{\gamma-1}$. Then $F_{\omega_t}(s,s')=\gamma f_{\omega_t}(s')-f_{\omega_t}(s)= \omega_t$ is a dynamic PBRS.
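As a quick numerical check of the two observations above (the $Q$ values are the illustrative ones from Table \ref{example_bias}; the values of $\gamma$ and $\omega$ are arbitrary):
\begin{verbatim}
import numpy as np

# A constant shift of all action values leaves
# the greedy action unchanged.
q = np.array([1.0, 2.0, 3.0, 4.0])
assert np.argmax(q) == np.argmax(q + 4.0) == 3

# With f(s) = omega / (gamma - 1), the shaping
# reward F(s,s') = gamma*f(s') - f(s) equals the
# constant omega.
gamma, omega = 0.99, 0.5
f = omega / (gamma - 1.0)
assert abs((gamma * f - f) - omega) < 1e-12
\end{verbatim}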
If $\omega$ eventually converges, the dynamic potential function $f(s,t)$ converges as well.

Bias is the expected difference between the predicted value and the true value. Therefore, under the premise of bootstrapping, a natural choice is $\omega \doteq \mathbb{E}[\mathbb{E}[\delta|s]]=\mathbb{E}[\delta]$. Recall that the optimization problems solved by linear TD(0) (semi-gradient) and linear TDC are, respectively,
\begin{equation*}
\theta^{*}= \arg \min_{\theta} \mathbb{E}[(\mathbb{E}[\delta |s])^2],
\end{equation*}
and
\begin{equation*}
\theta^{*}=\arg \min_{\theta} \mathbb{E}[\delta \phi]^{\top} \mathbb{E}[\phi \phi^{\top}]^{-1} \mathbb{E}[\delta\phi].
\end{equation*}
As a result, we propose two novel objective functions and their corresponding algorithms, where $\omega$ is subsequently proven to converge, which means that both algorithms maintain the invariance of the optimal policy.

\subsection{Variance Minimization TD Learning: VMTD}
For on-policy learning, a novel objective function, the Variance of the Bellman Error (VBE), is proposed as follows:
\begin{equation}
\begin{array}{ccl}
\arg \min_{\theta}\text{VBE}(\theta)&=&\arg \min_{\theta}\mathbb{E}[(\mathbb{E}[\delta|s]-\mathbb{E}[\mathbb{E}[\delta|s]])^2]\\
&=&\arg \min_{\theta,\omega} \mathbb{E}[(\mathbb{E}[\delta|s]-\omega)^2].
\end{array}
\end{equation}
Clearly, the objective is no longer to minimize Bellman errors. First, the update of the parameter $\omega$ follows directly from stochastic gradient descent:
\begin{equation}
\omega_{k+1}\leftarrow \omega_{k}+\beta_k(\delta_k-\omega_k),
\label{omega}
\end{equation}
where $\delta_k$ is the TD error:
\begin{equation}
\delta_k = r+\gamma \theta_k^{\top}\phi_{k}'-\theta_k^{\top}\phi_k.
\label{delta}
\end{equation}
Then, based on stochastic semi-gradient descent, the update of the parameter $\theta$ is:
\begin{equation}
\theta_{k+1}\leftarrow \theta_{k}+\alpha_k(\delta_k-\omega_k)\phi_k.
\label{theta}
\end{equation}
The pseudocode of the VMTD algorithm is shown in Algorithm \ref{alg:algorithm 1}.

For control tasks, the two extensions of VMTD are named VMSarsa and VMQ, respectively, with the update formulas
\begin{equation}
\theta_{k+1}\leftarrow \theta_{k}+\alpha_k(\delta_k-\omega_k)\phi(s_k,a_k).
\end{equation}
and
\begin{equation}
\omega_{k+1}\leftarrow \omega_{k}+\beta_k(\delta_k-\omega_k),
\end{equation}
where the TD error $\delta_k$ in VMSarsa is
\begin{equation}
\delta_{k}=r_{k+1}+\gamma \theta_{k}^{\top}\phi(s_{k+1},a_{k+1}) - \theta_{k}^{\top}\phi(s_{k},a_{k}),
\label{deltaSarsa}
\end{equation}
and the TD error $\delta_k$ in VMQ is
\begin{equation}
\delta_{k}=r_{k+1}+\gamma \max_{a\in A}\theta_{k}^{\top}\phi(s_{k+1},a) - \theta_{k}^{\top}\phi(s_{k},a_{k}).
\label{deltaQ}
\end{equation}

\begin{algorithm}[t]
   \caption{VMTD algorithm with linear function approximation in the on-policy setting}
   \label{alg:algorithm 1}
\begin{algorithmic}
   \STATE {\bfseries Input:} $\theta_{0}$, $\omega_{0}$, $\gamma$, learning rates $\alpha_t$ and $\beta_t$
   \STATE Initialize $\theta_{0}$ arbitrarily and $\omega_{0}$ to $0$; $\gamma \in (0,1]$; $\alpha_t$ and $\beta_t$ are constant.
   \REPEAT
   \FOR{$t=0$ {\bfseries to} $T-1$}
   \STATE Take $A_t$ in $S_t$ according to policy $\mu$, and arrive at $S_{t+1}$
   \STATE Observe sample ($S_t$,$R_{t+1}$,$S_{t+1}$) at time step $t$ (with their corresponding state feature vectors)
   \STATE $\delta_t = R_{t+1}+\gamma\theta_t^{\top}\phi_{t}'-\theta_t^{\top}\phi_t$
   \STATE $\theta_{t+1}\leftarrow \theta_{t}+\alpha_t(\delta_t-\omega_t)\phi_t$
   \STATE $\omega_{t+1}\leftarrow \omega_{t}+\beta_t(\delta_t-\omega_t)$
   \STATE $S_t \leftarrow S_{t+1}$
   \ENDFOR
   \UNTIL{terminal episode}
\end{algorithmic}
\end{algorithm}

\subsection{Variance Minimization TDC Learning: VMTDC}
For off-policy learning, we employ a projection operator. The objective function, the Variance of the Projected Bellman Error (VPBE), is defined as
\begin{equation}
\begin{array}{ccl}
\text{VPBE}(\theta)&=&\mathbb{E}[(\delta-\mathbb{E}[\delta]) \phi]^{\top} \mathbb{E}[\phi \phi^{\top}]^{-1}\mathbb{E}[(\delta-\mathbb{E}[\delta])\phi]\\
&=&\mathbb{E}[(\delta-\omega) \phi]^{\top} \mathbb{E}[\phi \phi^{\top}]^{-1}\mathbb{E}[(\delta-\omega)\phi],
\end{array}
\end{equation}
where $\omega$ is used to estimate $\mathbb{E}[\delta]$, i.e., $\omega \doteq \mathbb{E}[\delta]$, and the corresponding algorithm is called VMTDC. The derivation of the VMTDC algorithm is the same as that of the TDC algorithm; the only difference is that the original $\delta$ is replaced by $\delta-\omega$.
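Before giving the VMTDC update formulas, the core VMTD recursion, i.e., (\ref{omega}) and (\ref{theta}), is sketched below in code; the environment interface, feature map, and constant step sizes are illustrative assumptions, not part of the algorithm specification:
\begin{verbatim}
import numpy as np

def vmtd_episode(env, phi, theta, omega,
                 alpha=0.05, beta=0.5, gamma=0.99):
    # One on-policy episode of VMTD: the TD error
    # is centered by a running estimate omega of
    # E[delta] before theta is updated.
    s, done = env.reset(), False
    while not done:
        a = env.sample_action(s)  # on-policy
        s_next, r, done = env.step(s, a)
        x, x_next = phi(s), phi(s_next)
        v_next = 0.0 if done else np.dot(theta, x_next)
        delta = r + gamma * v_next - np.dot(theta, x)
        theta = theta + alpha * (delta - omega) * x
        omega = omega + beta * (delta - omega)
        s = s_next
    return theta, omega
\end{verbatim}
At termination the bootstrap term is set to zero, matching the episodic setting of Algorithm \ref{alg:algorithm 1}; $\omega$ is exactly the quantity tracked on the faster time scale in the theoretical analysis below.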
Accordingly, the update formulas of VMTDC are as follows:
\begin{equation}
\theta_{k+1}\leftarrow\theta_{k}+\alpha_{k}[(\delta_{k}- \omega_k) \phi(s_k)\\
 - \gamma\phi(s_{k+1})(\phi^{\top} (s_k) u_k)],
 \label{thetavmtdc}
\end{equation}
\begin{equation}
u_{k+1}\leftarrow u_{k}+\zeta_{k}[\delta_{k}-\omega_k - \phi^{\top} (s_k) u_k]\phi(s_k),
\label{uvmtdc}
\end{equation}
and
\begin{equation}
\omega_{k+1}\leftarrow \omega_{k}+\beta_k (\delta_k- \omega_k).
\label{omegavmtdc}
\end{equation}
The pseudocode of the VMTDC algorithm for the importance-sampling scenario is shown in Algorithm \ref{alg:algorithm 2} of Appendix \ref{proofth2}.

We now introduce the improved version of the GQ(0) algorithm, named VMGQ(0):
\begin{equation}
\begin{array}{ccl}
\theta_{k+1}\leftarrow\theta_{k}&+&\alpha_{k}[(\delta_{k}- \omega_k) \phi(s_k,a_k)\\
&-& \gamma\phi(s_{k+1},A^{*}_{k+1})(\phi^{\top} (s_k,a_k) u_k)],
\end{array}
\end{equation}
\begin{equation}
u_{k+1}\leftarrow u_{k}+\zeta_{k}[(\delta_{k}-\omega_k) - \phi^{\top} (s_k,a_k) u_k]\phi(s_k,a_k),
\end{equation}
and
\begin{equation}
\omega_{k+1}\leftarrow \omega_{k}+\beta_k(\delta_k- \omega_k),
\end{equation}
where $\delta_{k}$ is given by (\ref{deltaQ}) and $A^{*}_{k+1}={\arg \max}_{a}(\theta_{k}^{\top}\phi(s_{k+1},a))$.

\section{Theoretical Analysis}
This section establishes the stability of the VMTD and VMTDC algorithms and presents a corollary on the convergence rate of VMTD.

\begin{theorem}
\label{theorem1}(Convergence of VMTD).
In the case of on-policy learning, consider the iterations (\ref{omega}) and (\ref{theta}) with (\ref{delta}) of VMTD. Let the step-size sequences $\alpha_k$ and $\beta_k$, $k\geq 0$, satisfy $\alpha_k,\beta_k>0$ for all $k$,
$ \sum_{k=0}^{\infty}\alpha_k=\sum_{k=0}^{\infty}\beta_k=\infty, $
$ \sum_{k=0}^{\infty}\alpha_k^2<\infty, $
$ \sum_{k=0}^{\infty}\beta_k^2<\infty, $
and
$ \alpha_k = o(\beta_k). $
Assume that $(\phi_k,r_k,\phi_k')$ is an i.i.d. sequence with uniformly bounded second moments, where $\phi_k$ and $\phi'_{k}$ are sampled from the same Markov chain. Let $A = \mathrm{Cov}(\phi,\phi-\gamma\phi')$ and $b=\mathrm{Cov}(r,\phi)$. Assume that the matrix $A$ is non-singular. Then the parameter vector $\theta_k$ converges with probability one to $A^{-1}b$.
\end{theorem}
\begin{proof}
\label{th1proof}
The proof is based on Borkar's theorem for general stochastic approximation recursions with two time scales \cite{borkar1997stochastic}.

A new one-step linear TD solution is defined as:
\begin{equation*}
0=\mathbb{E}[(\delta-\mathbb{E}[\delta]) \phi]=-A\theta+b.
\end{equation*}
Thus, the VMTD solution is $\theta_{\text{VMTD}}=A^{-1}b$.

First, note that recursion (\ref{theta}) can be rewritten as
\begin{equation*}
\theta_{k+1}\leftarrow \theta_k+\beta_k\xi(k),
\end{equation*}
where
\begin{equation*}
\xi(k)=\frac{\alpha_k}{\beta_k}(\delta_k-\omega_k)\phi_k.
\end{equation*}
Because of the step-size schedule $\alpha_k = o(\beta_k)$, $\xi(k)\rightarrow 0$ almost surely as $k\rightarrow\infty$. That is, the increments in iteration (\ref{omega}) are uniformly larger than those in (\ref{theta}), so (\ref{omega}) is the faster recursion.
Along the faster time scale, iterations (\ref{omega}) and (\ref{theta}) are associated with the following ODE system:
\begin{equation}
\dot{\theta}(t) = 0,
\label{thetaFast}
\end{equation}
\begin{equation}
\dot{\omega}(t)=\mathbb{E}[\delta_t|\theta(t)]-\omega(t).
\label{omegaFast}
\end{equation}
Based on the ODE (\ref{thetaFast}), $\theta(t)\equiv \theta$ when viewed from the faster timescale. By the Hirsch lemma \cite{hirsch1989convergent}, it follows that $||\theta_k-\theta||\rightarrow 0$ a.s. as $k\rightarrow \infty$ for some $\theta$ that depends on the initial condition $\theta_0$ of recursion (\ref{theta}). Thus, the ODE pair (\ref{thetaFast})-(\ref{omegaFast}) can be written as
\begin{equation}
\dot{\omega}(t)=\mathbb{E}[\delta_t|\theta]-\omega(t).
\label{omegaFastFinal}
\end{equation}
Consider the function $h(\omega)=\mathbb{E}[\delta|\theta]-\omega$, i.e., the driving vector field of the ODE (\ref{omegaFastFinal}). The function $h$ is Lipschitz continuous in $\omega$. Let $h_{\infty}(\cdot)$ be the function defined by $h_{\infty}(\omega)=\lim_{x\rightarrow \infty}\frac{h(x\omega)}{x}$. Then $h_{\infty}(\omega)= -\omega$ is well-defined. For (\ref{omegaFastFinal}), $\omega^*=\mathbb{E}[\delta|\theta]$ is the unique globally asymptotically stable equilibrium. For the ODE
\begin{equation}
\dot{\omega}(t) = h_{\infty}(\omega(t))= -\omega(t),
\label{omegaInfty}
\end{equation}
apply $\vec{V}(\omega)=(-\omega)^{\top}(-\omega)/2$ as its associated strict Lyapunov function. Then, the origin of (\ref{omegaInfty}) is a globally asymptotically stable equilibrium.

Consider now the recursion (\ref{omega}). Let $M_{k+1}=(\delta_k-\omega_k) -\mathbb{E}[(\delta_k-\omega_k)|\mathcal{F}(k)]$, where $\mathcal{F}(k)=\sigma(\omega_l,\theta_l,l\leq k;\phi_s,\phi_s',r_s,s<k)$ denotes the corresponding sigma field. Then $\{M_{k+1}\}$ is a martingale difference sequence, and, because $(\phi_k,r_k,\phi_k')$ has uniformly bounded second moments, there exists a constant $c_1>0$ such that, $\forall k\geq0$,
\begin{equation*}
\mathbb{E}[||M_{k+1}||^2|\mathcal{F}(k)]\leq c_1(1+||\omega_k||^2+||\theta_k||^2).
\end{equation*}
Now Assumptions (A1) and (A2) of \cite{borkar2000ode} are verified. Furthermore, Assumption (TS) of \cite{borkar2000ode} is satisfied by our conditions on the step-size sequences $\alpha_k$, $\beta_k$. Thus, by Theorem 2.2 of \cite{borkar2000ode} we obtain that $||\omega_k-\omega^*||\rightarrow 0$ almost surely as $k\rightarrow \infty$.

Consider now the slower time scale recursion (\ref{theta}). Based on the above analysis, (\ref{theta}) can be rewritten as
\begin{equation*}
\theta_{k+1}\leftarrow \theta_{k}+\alpha_k(\delta_k-\mathbb{E}[\delta_k|\theta_k])\phi_k.
\end{equation*}
Let $Z_{k+1}=(\delta_k-\mathbb{E}[\delta_k|\theta_k])\phi_k-\mathbb{E}[(\delta_k-\mathbb{E}[\delta_k|\theta_k])\phi_k|\mathcal{G}(k)]$, where $\mathcal{G}(k)=\sigma(\theta_l,l\leq k;\phi_s,\phi_s',r_s,s<k)$ denotes the corresponding sigma field. Then $\{Z_{k+1}\}$ is a martingale difference sequence, and there exists a constant $c_2>0$ such that, $\forall k\geq0$,
\begin{equation*}
\mathbb{E}[||Z_{k+1}||^2|\mathcal{G}(k)]\leq c_2(1+||\theta_k||^2).
\end{equation*}
Consider now the following ODE associated with (\ref{theta}):
\begin{equation}
\begin{array}{ccl}
\dot{\theta}(t)&=&\mathrm{Cov}(\delta|\theta(t),\phi)\\
&=&\mathrm{Cov}(r+(\gamma\phi'-\phi)^{\top}\theta(t),\phi)\\
&=&\mathrm{Cov}(r,\phi)-\mathrm{Cov}(\theta(t)^{\top}(\phi-\gamma\phi'),\phi)\\
&=&\mathrm{Cov}(r,\phi)-\theta(t)^{\top}\mathrm{Cov}(\phi-\gamma\phi',\phi)\\
&=&\mathrm{Cov}(r,\phi)-\mathrm{Cov}(\phi-\gamma\phi',\phi)^{\top}\theta(t)\\
&=&\mathrm{Cov}(r,\phi)-\mathrm{Cov}(\phi,\phi-\gamma\phi')\theta(t)\\
&=&-A\theta(t)+b.
\end{array}
\label{odetheta}
\end{equation}
Let $\vec{h}(\theta(t))$ be the driving vector field of the ODE (\ref{odetheta}):
\begin{equation*}
\vec{h}(\theta(t))=-A\theta(t)+b.
\end{equation*}
Consider the cross-covariance matrix,
\begin{equation}
\begin{array}{ccl}
A &=& \mathrm{Cov}(\phi,\phi-\gamma\phi')\\
&=&\frac{\mathrm{Cov}(\phi,\phi)+\mathrm{Cov}(\phi-\gamma\phi',\phi-\gamma\phi')-\mathrm{Cov}(\gamma\phi',\gamma\phi')}{2}\\
&=&\frac{\mathrm{Cov}(\phi,\phi)+\mathrm{Cov}(\phi-\gamma\phi',\phi-\gamma\phi')-\gamma^2\mathrm{Cov}(\phi',\phi')}{2}\\
&=&\frac{(1-\gamma^2)\mathrm{Cov}(\phi,\phi)+\mathrm{Cov}(\phi-\gamma\phi',\phi-\gamma\phi')}{2},\\
\end{array}
\label{covariance}
\end{equation}
where we used $\mathrm{Cov}(\phi',\phi')=\mathrm{Cov}(\phi,\phi)$ in the last step\footnote{The covariance matrix $\mathrm{Cov}(\phi',\phi')$ equals $\mathrm{Cov}(\phi,\phi)$ if, for on-policy updates, the initial state of the Markov chain is re-reachable or is initialized randomly.}. Note that the covariance matrices $\mathrm{Cov}(\phi,\phi)$ and $\mathrm{Cov}(\phi-\gamma\phi',\phi-\gamma\phi')$ are positive semi-definite. The matrix $A$ is therefore positive semi-definite, since by (\ref{covariance}) it is a positively weighted combination of two positive semi-definite matrices. Furthermore, $A$ is non-singular by assumption. Hence, the cross-covariance matrix $A$ is positive definite. Therefore, $\theta^*=A^{-1}b$ can be seen to be the unique globally asymptotically stable equilibrium of the ODE (\ref{odetheta}).

Let $\vec{h}_{\infty}(\theta)=\lim_{r\rightarrow \infty}\frac{\vec{h}(r\theta)}{r}$. Then $\vec{h}_{\infty}(\theta)=-A\theta$ is well-defined. Consider now the ODE
\begin{equation}
\dot{\theta}(t)=-A\theta(t).
\label{odethetafinal}
\end{equation}
The ODE (\ref{odethetafinal}) has the origin as its unique globally asymptotically stable equilibrium. Thus, Assumptions (A1) and (A2) are verified.
\end{proof}

Theorem 3 in \cite{dalal2020tale} provides a general conclusion on the convergence rate of all linear two-timescale algorithms. VMTD satisfies the assumptions of this theorem, which leads to the following corollary.

\begin{corollary}
\label{corollary4_2}
Consider the sparsely projected variant of VMTD. Then, for $\alpha_k = 1/(k+1)^{\alpha}$, $\beta_k = 1/(k+1)^{\beta}$, $0<\beta<\alpha<1$, $p>1$, with probability larger than $1- \tau$, for all $k\geq N_3$, we have
\begin{equation}
||\theta'_{k} - \theta^{*}|| \le C_{3,\theta}  \frac{\sqrt{\ln (4d_{1}^{2}(k+1)^{p}/\tau)} }{(k+1)^{\alpha / 2}},
\end{equation}
\begin{equation}
||\omega'_{k} - \omega^{*}|| \le C_{3,\omega}  \frac{\sqrt{\ln (4d_{2}^{2}(k+1)^{p}/\tau)} }{(k+1)^{\beta / 2}},
\end{equation}
\end{corollary}
where $d_1$ and $d_2$ are the dimensions of $\theta$ and $\omega$, respectively; for VMTD, $d_2 =1$. The meanings of $N_3$, $C_{3,\theta}$ and $C_{3,\omega}$ are explained in \cite{dalal2020tale}. The formulas for the sparsely projected iterates $\theta'_{k}$ and $\omega'_{k}$ can be found in (\ref{sparseprojectiontheta}) and (\ref{sparseprojectionomega}).

\begin{theorem}
\label{theorem2}(Convergence of VMTDC).
In the case of off-policy learning, consider the iterations (\ref{omegavmtdc}), (\ref{uvmtdc}) and (\ref{thetavmtdc}) of VMTDC. Let the step-size sequences $\alpha_k$, $\zeta_k$ and $\beta_k$, $k\geq 0$, satisfy $\alpha_k,\zeta_k,\beta_k>0$ for all $k$,
$ \sum_{k=0}^{\infty}\alpha_k=\sum_{k=0}^{\infty}\beta_k=\sum_{k=0}^{\infty}\zeta_k=\infty, $
$ \sum_{k=0}^{\infty}\alpha_k^2<\infty, $
$ \sum_{k=0}^{\infty}\zeta_k^2<\infty, $
$ \sum_{k=0}^{\infty}\beta_k^2<\infty, $
and
$ \alpha_k = o(\zeta_k), $
$ \zeta_k = o(\beta_k). $
Assume that $(\phi_k,r_k,\phi_k')$ is an i.i.d.
sequence with uniformly bounded second moments. Let $A = \mathrm{Cov}(\phi,\phi-\gamma\phi')$, $b=\mathrm{Cov}(r,\phi)$, and $C=\mathbb{E}[\phi\phi^{\top}]$. Assume that $A$ and $C$ are non-singular matrices. Then the parameter vector $\theta_k$ converges with probability one to $A^{-1}b$.
\end{theorem}
Please refer to Appendix \ref{proofth2} for the detailed proof.

\section{Experimental Studies}
This section assesses algorithm performance through experiments, which are divided into policy evaluation experiments and control experiments.

\subsection{Testing Tasks}
\textbf{Random-walk:} as shown in Figure \ref{randomwalk}, all episodes start in the center state, $C$, and proceed either left or right by one state on each step, with equal probability. Episodes terminate at the extreme left or the extreme right, with a reward of $+1$ on termination at the right and $0$ otherwise. In this task, the true value of each state is the probability of terminating on the right when starting from that state \cite{Sutton2018book}. Thus, the true values of states $A$ to $E$ are $\frac{1}{6},\frac{2}{6},\frac{3}{6},\frac{4}{6},\frac{5}{6}$, respectively. The discount factor is $\gamma=1.0$. There are three standard kinds of features for random-walk problems: tabular features, inverted features, and dependent features \cite{sutton2009fast}. The feature matrices corresponding to the three random walks are shown in Appendix \ref{experimentaldetails}. Experiments in the Random-walk environment are conducted in the on-policy setting.

\begin{figure}
  \begin{center}
    \input{pic/randomwalk.tex}
  \caption{Random walk.}
  \label{randomwalk}
  \end{center}
\end{figure}

\begin{figure}
  \begin{center}
    \input{pic/BairdExample.tex}
  \caption{7-state version of Baird's off-policy counterexample.}
  \label{bairdexample}
  \end{center}
\end{figure}

\textbf{Baird's off-policy counterexample:} this task is well known as a counterexample on which TD diverges \cite{baird1995residual,sutton2009fast}. As shown in Figure \ref{bairdexample}, the reward for each transition is zero; thus the true values are zero for all states under any policy. The behavior policy chooses the actions represented by solid lines with probability $\frac{1}{7}$ and the actions represented by dotted lines with probability $\frac{6}{7}$. The target policy is expected to choose the solid line with probability greater than $\frac{1}{7}$; in this paper it chooses the solid line with probability $1$. The discount factor is $\gamma =0.99$, and the feature matrix is defined in Appendix \ref{experimentaldetails} \cite{baird1995residual,sutton2009fast,maei2011gradient}.

\textbf{Maze:} the learning agent should find the shortest path from the upper left corner to the lower right corner.
\begin{wrapfigure}{r}{4cm}
\centering
\includegraphics[scale=0.2]{pic/maze_13_13.pdf}
\end{wrapfigure}
In each state, there are four possible actions: $up$, $down$, $left$, and $right$, each of which moves the agent deterministically to the corresponding neighboring state, except when the movement is blocked by an obstacle or the edge of the maze. Rewards are $-1$ on all transitions until the agent reaches the goal state. The discount factor is $\gamma=0.99$, states are represented by tabular features, and the maximum number of moves per episode is set to 1000.
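As a small sanity check for the Random-walk task described above, the stated true values $\frac{1}{6},\ldots,\frac{5}{6}$ can be reproduced with a few lines of iterative policy evaluation; the state encoding below is illustrative:
\begin{verbatim}
import numpy as np

# States 0..4 correspond to A..E; stepping left of A or
# right of E terminates, with reward +1 only at the right.
V = np.zeros(5)
for _ in range(10000):          # iterative policy evaluation
    for s in range(5):
        left = 0.0 if s == 0 else V[s - 1]
        right = 1.0 if s == 4 else V[s + 1]
        V[s] = 0.5 * left + 0.5 * right
print(np.round(V, 3))  # ~ [0.167 0.333 0.5 0.667 0.833]
\end{verbatim}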
\textbf{The other three control environments:} Cliff Walking, Mountain Car, and Acrobot are taken from the official Gym website, in the versions ``CliffWalking-v0'', ``MountainCar-v0'' and ``Acrobot-v1''; please refer to the Gym documentation for details. The maximum number of steps per episode in Mountain Car is set to 1000, while the default settings are used for the other two environments. In Mountain Car and Acrobot, features are generated by tile coding. Please refer to Appendix \ref{experimentaldetails} for the selection of learning rates in all experiments.

\subsection{Experimental Results and Analysis}

\begin{figure}[htb]
\vskip 0.2in
\begin{center}
\subfigure[Dependent]{
\includegraphics[width=0.46\columnwidth, height=0.46\columnwidth]{pic/dependent_new.pdf}
\label{DependentFull}
}
\subfigure[Tabular]{
\includegraphics[width=0.46\columnwidth, height=0.46\columnwidth]{pic/tabular_new.pdf}
\label{TabularFull}
}
\\
\subfigure[Inverted]{
\includegraphics[width=0.46\columnwidth, height=0.46\columnwidth]{pic/inverted_new.pdf}
\label{InvertedFull}
}
\subfigure[Counterexample]{
\includegraphics[width=0.46\columnwidth, height=0.46\columnwidth]{pic/counterexample_quanju_new.pdf}
\label{CounterExampleFull}
}
\caption{Learning curves for the four evaluation environments.}
\label{Evaluation_full}
\end{center}
\vskip -0.2in
\end{figure}

\begin{figure*}[htb]
\vskip 0.2in
\begin{center}
\subfigure[Maze]{
\includegraphics[width=0.9\columnwidth, height=0.64\columnwidth]{pic/maze_complete.pdf}
\label{MazeFull}
}
\subfigure[Cliff Walking]{
\includegraphics[width=0.9\columnwidth, height=0.64\columnwidth]{pic/cw_complete.pdf}
\label{CliffWalkingFull}
}
\\
\subfigure[Mountain Car]{
\includegraphics[width=0.9\columnwidth, height=0.64\columnwidth]{pic/mt_complete.pdf}
\label{MountainCarFull}
}
\subfigure[Acrobot]{
\includegraphics[width=0.9\columnwidth, height=0.64\columnwidth]{pic/Acrobot_complete.pdf}
\label{AcrobotFull}
}
\caption{Learning curves for the four control environments.}
\label{Complete_full}
\end{center}
\vskip -0.2in
\end{figure*}

\begin{table*}
	\centering
	\caption{Difference between R-learning and tabular VMQ.}
    \vskip 0.15in
	\begin{tabular}{c|c}
		\hline
		Algorithm&Update formulas \\
		\hline
		R-learning&$Q_{k+1}(s,a)\leftarrow Q_{k}(s,a)+\alpha_k(r_{k+1}-m_{k}+ \max_{b\in A}Q_{k}(s,b) - Q_{k}(s,a))$\\
		&$m_{k+1}\leftarrow m_{k}+\beta_k(r_{k+1}+\max_{b\in A}Q_{k}(s,b) - Q_{k}(s,a)-m_{k})$\\
		tabular VMQ&$Q_{k+1}(s,a)\leftarrow Q_{k}(s,a)+\alpha_k(r_{k+1}+\gamma \max_{b\in A}Q_{k}(s,b) - Q_{k}(s,a)-\omega_k)$\\
		&$\omega_{k+1}\leftarrow \omega_{k}+\beta_k(r_{k+1}+\gamma \max_{b\in A}Q_{k}(s,b) - Q_{k}(s,a)-\omega_{k})$\\
		\hline
	\end{tabular}
 \label{differenceRandVMQ}
 \vskip -0.1in
\end{table*}

For the policy evaluation experiments, we compare the performance of the VMTD, VMTDC, TD, and TDC algorithms, with the vertical axis unified as the RVBE. Evaluation criteria differ across algorithms, because the objective function minimized by the proposed algorithms differs from that of the other algorithms; therefore, to ensure fair comparisons, head-to-head contrasts are drawn mainly in the control tasks. The control experiments compare the performance of Sarsa, Q-learning, GQ(0), AC, VMSarsa, VMQ, and VMGQ(0) in four control environments.
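As a compact reference for how the compared value-based updates differ, the sketch below contrasts one linear Q-learning step with the corresponding VMQ step; the step sizes and the state-action feature inputs are illustrative assumptions:
\begin{verbatim}
import numpy as np

def q_learning_step(theta, x_sa, x_next_all, r,
                    alpha=0.05, gamma=0.99):
    # Standard linear Q-learning step.
    q_next = max(np.dot(theta, x) for x in x_next_all)
    delta = r + gamma * q_next - np.dot(theta, x_sa)
    return theta + alpha * delta * x_sa

def vmq_step(theta, omega, x_sa, x_next_all, r,
             alpha=0.05, beta=0.5, gamma=0.99):
    # VMQ step: same TD error, but the update is
    # centered by omega, a running estimate of E[delta].
    q_next = max(np.dot(theta, x) for x in x_next_all)
    delta = r + gamma * q_next - np.dot(theta, x_sa)
    theta = theta + alpha * (delta - omega) * x_sa
    omega = omega + beta * (delta - omega)
    return theta, omega
\end{verbatim}
Here \texttt{x\_sa} stands for $\phi(s_k,a_k)$ and \texttt{x\_next\_all} collects $\phi(s_{k+1},a)$ over all actions $a$; VMSarsa differs only in using the feature of the next on-policy action instead of the maximum.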
% All experiments involved in this paper were run independently for 100 times.
The learning curves corresponding to the policy evaluation experiments and the control experiments are shown in Figures \ref{Evaluation_full} and \ref{Complete_full}, respectively. The shaded areas in Figures \ref{Evaluation_full} and \ref{Complete_full} represent the standard deviation (std).

In the random-walk tasks, VMTD and VMTDC exhibit excellent performance, outperforming TD and TDC in the dependent random walk. In the 7-state counterexample task, TD diverges, while VMTDC converges and performs better than TDC. From the update formula, it can be observed that the VMTD algorithm, like TDC, is also an adjustment or correction of the TD update. More surprisingly, VMTD also remains convergent and achieves the best performance.

In Maze, Mountain Car, and Acrobot, the convergence speed of VMSarsa, VMQ, and VMGQ(0) is significantly improved compared with Sarsa, Q-learning, and GQ(0), respectively. The performance of the AC algorithm is at an intermediate level, and the performances of VMSarsa, VMQ, and VMGQ(0) in these three environments show no significant differences. In Cliff Walking, Sarsa and VMSarsa converge to slightly worse solutions than the other algorithms, but VMSarsa converges significantly faster than Sarsa; VMGQ(0) and VMQ converge faster than the other algorithms, with VMGQ(0) performing slightly better than VMQ.

In summary, VMSarsa, VMQ, and VMGQ(0) outperform the other algorithms. In the Cliff Walking environment, VMGQ(0) performs slightly better than VMSarsa and VMQ, while in the other three environments the performances of VMSarsa, VMQ, and VMGQ(0) are close.

\section{Related Work}
\subsection{Difference between VMQ and R-learning}
The update formula of tabular VMQ bears some resemblance to that of R-learning. As shown in Table \ref{differenceRandVMQ}, the two update formulas differ as follows:
\\(1) The goal of the R-learning algorithm \cite{schwartz1993reinforcement} is to maximize the average reward rather than the cumulative reward, by learning an estimate $m$ of the average reward, which is then used to update the Q-values. In contrast, the $\omega$ in the tabular VMQ update eventually converges to $\mathbb{E}[\delta]$.
\\(2) When $\gamma=1$ in the tabular VMQ update, the R-learning update has the same form as the tabular VMQ update. Therefore, R-learning can formally be regarded as a special case of tabular VMQ.

\subsection{Variance Reduction for TD Learning}
The TD with centering (CTD) algorithm \cite{korda2015td} directly applies variance reduction techniques to the TD algorithm: it updates its parameters using the average gradient of a batch of Markovian samples and a projection operator. Unfortunately, the authors' analysis of the CTD algorithm contains technical errors. The VRTD algorithm \cite{xu2020reanalysis} is also a variance-reduced algorithm, updating its parameters with the average gradient of a batch of i.i.d. samples, and its authors provide a technically sound analysis that demonstrates the advantages of variance reduction.
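Schematically (this is a simplified illustration, not the exact CTD or VRTD procedure), such batch-style variance reduction replaces the single-sample TD update with an average over a batch of transitions:
\begin{verbatim}
import numpy as np

def batch_td_update(theta, batch, alpha=0.05, gamma=0.99):
    # Average the semi-gradient TD update direction over a
    # batch of (phi, r, phi_next) transitions; averaging
    # lowers the variance of the update direction.
    g = np.zeros_like(theta)
    for x, r, x_next in batch:
        delta = (r + gamma * np.dot(theta, x_next)
                 - np.dot(theta, x))
        g += delta * x
    return theta + alpha * g / len(batch)
\end{verbatim}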
\subsection{Variance Reduction for Policy Gradient Algorithms}
Policy gradient algorithms are a class of reinforcement learning algorithms that directly optimize the cumulative reward. REINFORCE is a Monte Carlo algorithm that estimates gradients through sampling, but it may suffer from high variance. Baselines are introduced to reduce variance and accelerate learning \cite{Sutton2018book}. In Actor-Critic methods, a value function serves as the baseline, and bootstrapping is used to further reduce variance and accelerate convergence \cite{Sutton2018book}. TRPO \cite{schulman2015trust} and PPO \cite{schulman2017proximal} use generalized advantage estimation, which combines multi-step bootstrapping and Monte Carlo estimation to reduce variance, making gradient estimation more stable and accelerating convergence.

In Variance Minimization, the incorporation of $\omega \doteq \mathbb{E}[\delta]$ bears a striking resemblance to the use of a baseline in policy gradient methods. The introduction of a baseline in policy gradient techniques does not alter the expected value of the update; rather, it significantly reduces the variance of gradient estimation. Likewise, the addition of $\omega \doteq \mathbb{E}[\delta]$ in Variance Minimization preserves the invariance of the optimal policy while stabilizing gradient estimation, reducing its variance, and hastening convergence.

\section{Conclusion and Future Work}
Value-based reinforcement learning typically takes error minimization as its optimization objective. As an alternative, this study proposes two new objective functions, the VBE and the VPBE, and derives an on-policy algorithm, VMTD, and an off-policy algorithm, VMTDC. Both algorithms demonstrated superior performance in policy evaluation and control experiments. Future work may include, but is not limited to, (1) analysis of the convergence rate of VMTDC, (2) extensions of the VBE and VPBE to multi-step returns, and (3) extensions to nonlinear approximations, such as neural networks.
\bibliography{example_paper}
\bibliographystyle{icml2024}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% APPENDIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\appendix
\onecolumn

\section{Relevant proofs}
\subsection{Proof of Corollary \ref{corollary4_2}}
\label{proofcorollary4_2}
The update formulas of linear two-timescale algorithms are as follows:
\begin{equation}
\theta_{k+1}=\theta_{k} + \alpha_{k}[h_1(\theta_{k},\omega_{k})+M^{(1)}_{k+1}],
\end{equation}
\begin{equation}
\omega_{k+1}=\omega_{k} + \beta_{k}[h_2(\theta_{k},\omega_{k})+M^{(2)}_{k+1}],
\end{equation}
where $\alpha_k, \beta_k \in \mathbb{R}$ are step sizes and $M^{(1)} \in \mathbb{R}^{d_1}, M^{(2)} \in \mathbb{R}^{d_2}$ denote noise.
$h_1 : \mathbb{R}^{d_{1}}\times \mathbb{R}^{d_{2}}\rightarrow \mathbb{R}^{d_{1}}$ and $h_2 : \mathbb{R}^{d_{1}}\times \mathbb{R}^{d_{2}}\rightarrow \mathbb{R}^{d_{2}}$ have the form, respectively,
\begin{equation}
h_{1}(\theta,\omega)=v_1 - \Gamma_1 \theta - W_1\omega,
\end{equation}
\begin{equation}
h_{2}(\theta,\omega)=v_2 - \Gamma_2 \theta - W_2\omega,
\end{equation}
where $v_1 \in \mathbb{R}^{d_1}$, $v_2 \in \mathbb{R}^{d_2}$, $\Gamma_1 \in \mathbb{R}^{d_1 \times d_1}$, $\Gamma_2 \in \mathbb{R}^{d_2 \times d_1}$, $W_1 \in \mathbb{R}^{d_1 \times d_2}$ and $W_2 \in \mathbb{R}^{d_2 \times d_2}$. Here $d_1$ and $d_2$ are the dimensions of the vectors $\theta$ and $\omega$, respectively. Theorem 3 in \cite{dalal2020tale} still holds even when $d_1$ is not equal to $d_2$; for the VMTD algorithm, $d_2$ equals 1.

Before proving Corollary \ref{corollary4_2}, we recall from \cite{dalal2020tale} the matrix assumption, the step-size assumption, and the sparse projection.
\begin{assumption}
\label{matrixassumption}
(Matrix Assumption). $W_2$ and $X_1 = \Gamma_1 - W_1 W_{2}^{-1}\Gamma_2$ are positive definite (not necessarily symmetric).
\end{assumption}
\begin{assumption}
\label{stepsizeassumption}
(Step Size Assumption). $\alpha_k = (k+1)^{-\alpha}$ and $\beta_k = (k+1)^{-\beta}$, where $1>\alpha > \beta > 0$.
\end{assumption}
\begin{definition}
\label{sparseprojection}
(Sparse Projection). For $R>0$, let $\Pi_{R}(x)=\min \{1, R/||x||\}\, x$ be the projection onto the ball of radius $R$ around the origin. The sparse projection operator is
\begin{equation*}
\Pi_{k, R} =
\begin{cases}
\Pi_{R}, & \text{if } k = n^{n} - 1 \text{ for some } n \in \mathbb{Z}_{>0}, \\
I, & \text{otherwise}.
\end{cases}
\end{equation*}
We call it sparse because it projects only at specific indices that are exponentially far apart. Pick an arbitrary $p>1$. Fix constants $R^{\theta}_{\text{proj}}>0$ and $R^{\omega}_{\text{proj}}>0$ for the radii of the projection balls. Further, let
\begin{equation*}
\theta^{*}=X^{-1}_{1}b_{1}, \quad \omega^{*}=W^{-1}_{2}(v_2 - \Gamma_2 \theta^{*})
\end{equation*}
with $b_1=v_1 - W_1 W_2^{-1}v_2$. The sparsely projected update of a linear two-timescale algorithm is then
\begin{equation}
\label{sparseprojectiontheta}
\theta'_{k+1}=\Pi_{k+1,R^{\theta}_{\text{proj}}}(\theta'_{k} + \alpha_{k}[h_1(\theta'_{k},\omega'_{k})+M^{(1')}_{k+1}]),
\end{equation}
\begin{equation}
\label{sparseprojectionomega}
\omega'_{k+1}=\Pi_{k+1,R^{\omega}_{\text{proj}}}(\omega'_{k} + \beta_{k}[h_2(\theta'_{k},\omega'_{k})+M^{(2')}_{k+1}]).
\end{equation}
\end{definition}

\begin{proof}
As long as the VMTD algorithm satisfies Assumption \ref{matrixassumption}, the convergence rate of the VMTD algorithm follows. VMTD's update rules are
\begin{equation*}
\theta_{k+1}=\theta_{k}+\alpha_k(\delta_k-\omega_k)\phi_k,
\end{equation*}
\begin{equation*}
\omega_{k+1}=\omega_{k}+\beta_k(\delta_k-\omega_k).
\end{equation*}
Thus, $h_1(\theta, \omega)=\mathrm{Cov}(r,\phi)-\mathrm{Cov}(\phi,\phi - \gamma\phi')\theta$, $h_2(\theta, \omega)=\mathbb{E}[r]+\mathbb{E}[\gamma \phi'^{\top}-\phi^{\top}]\theta -\omega$, $\Gamma_1 =\mathrm{Cov}(\phi,\phi - \gamma\phi')$, $W_1 = 0$, $\Gamma_2 = -\mathbb{E}[\gamma \phi'^{\top}-\phi^{\top}]$, $W_2 = 1$, and $v_2 = \mathbb{E}[r]$. Additionally, $X_1=\Gamma_1 - W_1 W^{-1}_2 \Gamma_2 = \mathrm{Cov}(\phi,\phi - \gamma\phi')$. It can be deduced from the proof of Theorem \ref{theorem1} that $X_1$ is a positive definite matrix.
Hence, the VMTD algorithm satisfies Assumption \ref{matrixassumption}. By the proof of Theorem \ref{theorem1}, Definition 1 in \cite{dalal2020tale} is also satisfied. We can therefore apply Theorem 3 in \cite{dalal2020tale} to obtain Corollary \ref{corollary4_2}.
\end{proof}

\subsection{Proof of Theorem \ref{theorem2}}
\label{proofth2}
\begin{proof}
The proof is similar to that given by \cite{sutton2009fast} for TDC, but it is based on multi-time-scale stochastic approximation.

For the VMTDC algorithm, a new one-step linear TD solution is defined as:
\begin{equation*}
0=\mathbb{E}[(\phi - \gamma \phi' - \mathbb{E}[\phi - \gamma \phi'])\phi^\top]\mathbb{E}[\phi \phi^{\top}]^{-1}\mathbb{E}[(\delta -\mathbb{E}[\delta])\phi]=A^{\top}C^{-1}(-A\theta+b).
\end{equation*}
The matrix $A^{\top}C^{-1}A$ is positive definite. Thus, the VMTDC solution is $\theta_{\text{VMTDC}}=\theta_{\text{VMTD}}=A^{-1}b$.

First, note that recursions (\ref{thetavmtdc}) and (\ref{uvmtdc}) can be rewritten as, respectively,
\begin{equation*}
\theta_{k+1}\leftarrow \theta_k+\zeta_k x(k),
\end{equation*}
\begin{equation*}
u_{k+1}\leftarrow u_k+\beta_k y(k),
\end{equation*}
where
\begin{equation*}
x(k)=\frac{\alpha_k}{\zeta_k}[(\delta_{k}- \omega_k) \phi_k - \gamma\phi'_{k}(\phi^{\top}_k u_k)],
\end{equation*}
\begin{equation*}
y(k)=\frac{\zeta_k}{\beta_k}[\delta_{k}-\omega_k - \phi^{\top}_k u_k]\phi_k.
\end{equation*}
Recursion (\ref{thetavmtdc}) can also be rewritten as
\begin{equation*}
\theta_{k+1}\leftarrow \theta_k+\beta_k z(k),
\end{equation*}
where
\begin{equation*}
z(k)=\frac{\alpha_k}{\beta_k}[(\delta_{k}- \omega_k) \phi_k - \gamma\phi'_{k}(\phi^{\top}_k u_k)].
\end{equation*}
Because of the step-size schedule $\alpha_k = o(\zeta_k)$, $\zeta_k = o(\beta_k)$, we have $x(k)\rightarrow 0$, $y(k)\rightarrow 0$ and $z(k)\rightarrow 0$ almost surely as $k\rightarrow \infty$. That is, the increments in iteration (\ref{omegavmtdc}) are uniformly larger than those in (\ref{uvmtdc}), and the increments in iteration (\ref{uvmtdc}) are uniformly larger than those in (\ref{thetavmtdc}); thus (\ref{omegavmtdc}) is the fastest recursion, (\ref{uvmtdc}) is the second fastest, and (\ref{thetavmtdc}) is the slowest.

Along the fastest time scale, iterations (\ref{thetavmtdc}), (\ref{uvmtdc}) and (\ref{omegavmtdc}) are associated with the following ODE system:
\begin{equation}
\dot{\theta}(t) = 0,
\label{thetavmtdcFastest}
\end{equation}
\begin{equation}
\dot{u}(t) = 0,
\label{uvmtdcFastest}
\end{equation}
\begin{equation}
\dot{\omega}(t)=\mathbb{E}[\delta_t|u(t),\theta(t)]-\omega(t).
\label{omegavmtdcFastest}
\end{equation}
Based on the ODEs (\ref{thetavmtdcFastest}) and (\ref{uvmtdcFastest}), both $\theta(t)\equiv \theta$ and $u(t)\equiv u$ when viewed from the fastest timescale. By the Hirsch lemma \cite{hirsch1989convergent}, it follows that $||\theta_k-\theta||\rightarrow 0$ a.s. as $k\rightarrow \infty$ for some $\theta$ that depends on the initial condition $\theta_0$ of recursion (\ref{thetavmtdc}), and $||u_k-u||\rightarrow 0$ a.s. as $k\rightarrow \infty$ for some $u$ that depends on the initial condition $u_0$ of recursion (\ref{uvmtdc}). Thus, the ODE system (\ref{thetavmtdcFastest})-(\ref{omegavmtdcFastest}) can be written as
\begin{equation}
\dot{\omega}(t)=\mathbb{E}[\delta_t|u,\theta]-\omega(t).
\label{omegavmtdcFastestFinal}
\end{equation}
Consider the function $h(\omega)=\mathbb{E}[\delta|\theta,u]-\omega$, i.e., the driving vector field of the ODE (\ref{omegavmtdcFastestFinal}).
The function $h$ is Lipschitz continuous in $\omega$. Let $h_{\infty}(\cdot)$ be the function defined by $h_{\infty}(\omega)=\lim_{r\rightarrow \infty}\frac{h(r\omega)}{r}$. Then $h_{\infty}(\omega)= -\omega$ is well-defined. For (\ref{omegavmtdcFastestFinal}), $\omega^*=\mathbb{E}[\delta|\theta,u]$ is the unique globally asymptotically stable equilibrium. For the ODE
\begin{equation}
\dot{\omega}(t) = h_{\infty}(\omega(t))= -\omega(t),
\label{omegavmtdcInfty}
\end{equation}
apply $\vec{V}(\omega)=(-\omega)^{\top}(-\omega)/2$ as its associated strict Lyapunov function. Then, the origin of (\ref{omegavmtdcInfty}) is a globally asymptotically stable equilibrium.

Consider now the recursion (\ref{omegavmtdc}). Let $M_{k+1}=(\delta_k-\omega_k) -\mathbb{E}[(\delta_k-\omega_k)|\mathcal{F}(k)]$, where $\mathcal{F}(k)=\sigma(\omega_l,u_l,\theta_l,l\leq k;\phi_s,\phi_s',r_s,s<k)$ denotes the corresponding sigma field. Then $\{M_{k+1}\}$ is a martingale difference sequence, and there exists a constant $c_1>0$ such that, $\forall k\geq0$,
\begin{equation*}
\mathbb{E}[||M_{k+1}||^2|\mathcal{F}(k)]\leq c_1(1+||\omega_k||^2+||u_k||^2+||\theta_k||^2).
\end{equation*}
Now Assumptions (A1) and (A2) of \cite{borkar2000ode} are verified. Furthermore, Assumption (TS) of \cite{borkar2000ode} is satisfied by our conditions on the step-size sequences $\alpha_k$, $\zeta_k$, $\beta_k$. Thus, by Theorem 2.2 of \cite{borkar2000ode} we obtain that $||\omega_k-\omega^*||\rightarrow 0$ almost surely as $k\rightarrow \infty$.

Consider now the second time scale. Along this timescale, iterations (\ref{thetavmtdc}) and (\ref{uvmtdc}) are associated with the following ODE system:
\begin{equation}
\dot{\theta}(t) = 0,
\label{thetavmtdcFaster}
\end{equation}
\begin{equation}
\dot{u}(t) = \mathbb{E}[(\delta_t-\mathbb{E}[\delta_t|u(t),\theta(t)])\phi_t|\theta(t)] - Cu(t).
\label{uvmtdcFaster}
\end{equation}
The ODE (\ref{thetavmtdcFaster}) suggests that $\theta(t)\equiv \theta$ (i.e., a time-invariant parameter) when viewed from the second fastest timescale. By the Hirsch lemma \cite{hirsch1989convergent}, it follows that $||\theta_k-\theta||\rightarrow 0$ a.s. as $k\rightarrow \infty$ for some $\theta$ that depends on the initial condition $\theta_0$ of recursion (\ref{thetavmtdc}).

Consider now the recursion (\ref{uvmtdc}). Let $N_{k+1}=((\delta_k-\mathbb{E}[\delta_k])\phi_k - \phi_k \phi^{\top}_k u_k) -\mathbb{E}[(\delta_k-\mathbb{E}[\delta_k])\phi_k - \phi_k \phi^{\top}_k u_k|\mathcal{I}(k)]$, where $\mathcal{I}(k)=\sigma(u_l,\theta_l,l\leq k;\phi_s,\phi_s',r_s,s<k)$ denotes the corresponding sigma field. Then $\{N_{k+1}\}$ is a martingale difference sequence, and there exists a constant $c_2>0$ such that, $\forall k\geq0$,
\begin{equation*}
\mathbb{E}[||N_{k+1}||^2|\mathcal{I}(k)]\leq c_2(1+||u_k||^2+||\theta_k||^2).
\end{equation*}
Because $\theta(t)\equiv \theta$ from (\ref{thetavmtdcFaster}), the ODE pair (\ref{thetavmtdcFaster})-(\ref{uvmtdcFaster}) can be written as
\begin{equation}
\dot{u}(t) = \mathbb{E}[(\delta_t-\mathbb{E}[\delta_t|\theta])\phi_t|\theta] - Cu(t).
\label{uvmtdcFasterFinal}
\end{equation}
Now consider the function $h(u)=\mathbb{E}[(\delta_t-\mathbb{E}[\delta_t|\theta])\phi_t|\theta] -Cu$, i.e., the driving vector field of the ODE (\ref{uvmtdcFasterFinal}). For (\ref{uvmtdcFasterFinal}), $u^* = C^{-1}\mathbb{E}[(\delta-\mathbb{E}[\delta|\theta])\phi|\theta]$ is the unique globally asymptotically stable equilibrium. Let $h_{\infty}(u)=-Cu$.
For the ODE
\begin{equation}
\dot{u}(t) = h_{\infty}(u(t))= -Cu(t),
\label{uvmtdcInfty}
\end{equation}
the origin of (\ref{uvmtdcInfty}) is a globally asymptotically stable equilibrium because $C$ is a positive definite matrix (it is nonnegative definite and nonsingular). Now Assumptions (A1) and (A2) of \cite{borkar2000ode} are verified. Furthermore, Assumption (TS) of \cite{borkar2000ode} is satisfied by our conditions on the step-size sequences $\alpha_k$, $\zeta_k$, $\beta_k$. Thus, by Theorem 2.2 of \cite{borkar2000ode} we obtain that $||u_k-u^*||\rightarrow 0$ almost surely as $k\rightarrow \infty$.

Consider now the slower timescale recursion (\ref{thetavmtdc}). In light of the above, (\ref{thetavmtdc}) can be rewritten as
\begin{equation}
\theta_{k+1} \leftarrow \theta_{k} + \alpha_k (\delta_k -\mathbb{E}[\delta_k|\theta_k]) \phi_k\\
- \alpha_k \gamma\phi'_{k}(\phi^{\top}_k C^{-1}\mathbb{E}[(\delta_k -\mathbb{E}[\delta_k|\theta_k])\phi|\theta_k]).
\end{equation}
Let $\mathcal{G}(k)=\sigma(\theta_l,l\leq k;\phi_s,\phi_s',r_s,s