%%%%%%%% ICML 2024 EXAMPLE LATEX SUBMISSION FILE %%%%%%%%%%%%%%%%%

\documentclass{article}

% Recommended, but optional, packages for figures and better typesetting:
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{diagbox}
\usepackage{wrapfig}
\usepackage{booktabs} % for professional tables

% hyperref makes hyperlinks in the resulting PDF.
% If your build breaks (sometimes temporarily if a hyperlink spans a page)
% please comment out the following usepackage line and replace
% \usepackage{icml2024} with \usepackage[nohyperref]{icml2024} above.
\usepackage{hyperref}

% Attempt to make hyperref and algorithmic work together better:
\newcommand{\theHalgorithm}{\arabic{algorithm}}

% Use the following line for the initial blind version submitted for review:
\usepackage{icml2024}

% If accepted, instead use the following line for the camera-ready submission:
% \usepackage[accepted]{icml2024}

% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}

% if you use cleveref..
\usepackage[capitalize,noabbrev]{cleveref}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

% Todonotes is useful during development; simply uncomment the next line
% and comment out the line below the next line to turn off comments
%\usepackage[disable,textsize=tiny]{todonotes}
\usepackage[textsize=tiny]{todonotes}

% The \icmltitle you define below is probably too long as a header.
% Therefore, a short form for the running title is supplied here:
\icmltitlerunning{Is Minimizing Errors the Only Option for Value-based RL?}

\begin{document}

\twocolumn[
\icmltitle{Is Minimizing Errors the Only Option for Value-based Reinforcement Learning?}

% It is OKAY to include author information, even for blind
% submissions: the style file will automatically remove it for you
% unless you've provided the [accepted] option to the icml2024
% package.

% List of affiliations: The first argument should be a (short)
% identifier you will use later to specify author affiliations
% Academic affiliations should list Department, University, City, Region, Country
% Industry affiliations should list Company, City, Region, Country
% You can specify symbols, otherwise they are numbered in order.
% Ideally, you should not use this facility. Affiliations will be numbered
% in order of appearance and this is the preferred way.
\icmlsetsymbol{equal}{*}

\begin{icmlauthorlist}
\icmlauthor{Firstname1 Lastname1}{equal,yyy}
\icmlauthor{Firstname2 Lastname2}{equal,yyy,comp}
\icmlauthor{Firstname3 Lastname3}{comp}
\icmlauthor{Firstname4 Lastname4}{sch}
\icmlauthor{Firstname5 Lastname5}{yyy}
\icmlauthor{Firstname6 Lastname6}{sch,yyy,comp}
\icmlauthor{Firstname7 Lastname7}{comp}
%\icmlauthor{}{sch}
\icmlauthor{Firstname8 Lastname8}{sch}
\icmlauthor{Firstname8 Lastname8}{yyy,comp}
%\icmlauthor{}{sch}
%\icmlauthor{}{sch}
\end{icmlauthorlist}

\icmlaffiliation{yyy}{Department of XXX, University of YYY, Location, Country}
\icmlaffiliation{comp}{Company Name, Location, Country}
\icmlaffiliation{sch}{School of ZZZ, Institute of WWW, Location, Country}

\icmlcorrespondingauthor{Firstname1 Lastname1}{first1.last1@xxx.edu}
\icmlcorrespondingauthor{Firstname2 Lastname2}{first2.last2@www.uk}

% You may provide any keywords that you
% find helpful for describing your paper; these are used to populate
% the "keywords" metadata in the PDF but will not be shown in the document
\icmlkeywords{Machine Learning, ICML}

\vskip 0.3in
]

% this must go after the closing bracket ] following \twocolumn[ ...

% This command actually creates the footnote in the first column
% listing the affiliations and the copyright notice.
% The command takes one argument, which is text to display at the start of the footnote.
% The \icmlEqualContribution command is standard text for equal contribution.
% Remove it (just {}) if you do not need this facility.

%\printAffiliationsAndNotice{}  % leave blank if no need to mention equal contribution
\printAffiliationsAndNotice{\icmlEqualContribution} % otherwise use the standard text.

\begin{abstract}
In the regression setting of supervised learning, the goal is to minimize prediction error while trading off variance. Following this idea, existing work on value-based reinforcement learning also minimizes some form of error. However, is error minimization really the only option for value-based reinforcement learning? We observe that the action-selection probabilities of a policy typically depend only on the relative values of actions, not on their absolute values. Based on this observation, we propose variance minimization instead of error minimization as the objective, derive on-policy and off-policy algorithms, and provide an analysis of the convergence rate together with experiments. The experimental results show that the proposed variance-minimization algorithms converge much faster.
\end{abstract}

\section{Introduction}
\label{introduction}
Reinforcement learning methods can be broadly divided into two categories: value-based methods and policy-gradient methods. This paper focuses on temporal difference (TD) learning with linearly approximated value functions. Research in this area typically proceeds in two steps: first establish the convergence of an algorithm, and then accelerate it.

In terms of stability, \citet{sutton1988learning} established the convergence of on-policy TD(0), and \citet{tsitsiklis1997analysis} established the convergence of on-policy TD($\lambda$). However, the ``deadly triad'' of off-policy learning, bootstrapping, and function approximation makes stability a difficult problem \citep{Sutton2018book}.
To solve this problem, convergent off-policy temporal difference learning algorithms have been proposed, e.g., BR \cite{baird1995residual}, GTD \cite{sutton2008convergent}, GTD2 and TDC \cite{sutton2009fast}, ETD \cite{sutton2016emphatic}, and MRetrace \cite{chen2023modified}.

In terms of acceleration, \citet{hackman2012faster} proposed the Hybrid TD algorithm with an on-policy matrix. \citet{liu2015finite,liu2016proximal,liu2018proximal} proposed true stochastic algorithms, i.e., GTD-MP and GTD2-MP, from a convex-concave saddle-point formulation. Second-order methods have been used to accelerate TD learning, e.g., Quasi-Newton TD \cite{givchi2015quasi} and accelerated TD (ATD) \citep{pan2017accelerated}. \citet{hallak2016generalized} introduced a new parameter to reduce the variance of ETD. \citet{zhang2022truncated} proposed truncated ETD with lower variance. Variance-reduced TD with a direct variance reduction technique \citep{johnson2013accelerating} was proposed by \citet{korda2015td} and analyzed by \citet{xu2019reanalysis}. How to further improve the convergence rate of reinforcement learning algorithms remains an open problem.

Algorithmic stability is prominently reflected in changes to the objective function, transitioning from the mean squared error (MSE) \citep{Sutton2018book} to the mean squared Bellman error (MSBE) \cite{baird1995residual}, then to the norm of the expected TD update \cite{sutton2009fast}, and further to the mean squared projected Bellman error (MSPBE) \cite{sutton2009fast}. Acceleration, on the other hand, centers on optimizing the iterative update formula of an algorithm without altering its objective function, thereby speeding up convergence. New optimization objectives often lead to novel algorithms, and new algorithms in turn inspire methods for accelerating them, leading to the iterative creation of increasingly effective algorithms.

The kernel loss function can be optimized with standard gradient-based methods, addressing the double-sampling issue of the residual gradient algorithm \cite{feng2019kernel}, and it ensures convergence in both on-policy and off-policy scenarios. The logistic Bellman error is convex and smooth in the action-value function parameters and has bounded gradients \cite{basserrano2021logistic}. In contrast, the squared Bellman error is not convex in the action-value function parameters, and RL algorithms based on recursively optimizing it are known to be unstable.

It is therefore natural to seek a new objective function; however, all of the objective functions mentioned above are some form of error. Is minimizing error the only option for value-based reinforcement learning?

Error can be decomposed into bias, variance, and unavoidable noise. Bias measures the difference between the model's predictions and the true values, reflecting the model's fitting ability. Variance quantifies the model's sensitivity to different training data, indicating its stability and generalization ability.
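For reference, the standard decomposition behind this terminology (see, e.g., \citet{zhou2021machine}) writes the expected squared prediction error of an estimator $\hat{f}$ of a target $y=f(x)+\epsilon$ with noise variance $\sigma^2$ as
\begin{equation*}
\mathbb{E}\big[(\hat{f}(x)-y)^2\big]
=\underbrace{\big(\mathbb{E}[\hat{f}(x)]-f(x)\big)^2}_{\text{bias}^2}
+\underbrace{\mathbb{E}\big[(\hat{f}(x)-\mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+\underbrace{\sigma^2}_{\text{noise}},
\end{equation*}
where the expectation is taken over the randomness of the training data and the noise.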
Balancing bias and variance is important, as they are in a trade-off relationship \cite{zhou2021machine}. In the setting of this paper, where only a linear model is considered and the model complexity is not adjusted, it is difficult to reduce the bias. High bias indicates that the model fits the training data poorly, i.e., underfitting, which is generally unacceptable in supervised learning. In reinforcement learning, however, high bias may be acceptable in certain cases. This is because policies derived from value functions, such as greedy, $\epsilon$-greedy, and softmax policies, rely on the relative values of actions rather than their absolute values when selecting actions.

Based on this observation, we propose alternative objective functions instead of minimizing errors. We minimize the Variance of the Bellman Error (VBE) and the Variance of the Projected Bellman Error (VPBE), and derive Variance Minimization (VM) algorithms. These algorithms preserve the invariance of the optimal policy, but significantly reduce the variance of gradient estimation and thus accelerate convergence.

The contributions of this paper are as follows: (1) introduction of novel objective functions based on the invariance of the optimal policy; (2) derivation of two algorithms, one on-policy and one off-policy; (3) proof of their convergence; (4) analysis of the convergence rate of the on-policy algorithm; (5) experiments demonstrating the faster convergence of the proposed algorithms.

\section{Preliminaries}
\label{preliminaries}
A reinforcement learning agent interacts with an environment: it observes states, makes sequential decisions that influence the environment, and obtains rewards. Consider an infinite-horizon discounted Markov Decision Process (MDP), defined by a tuple $\langle S,A,R,P,\gamma \rangle$, where $S=\{1,2,\ldots,N\}$ is a finite set of states of the environment; $A$ is a finite set of actions of the agent; $R:S\times A \times S \rightarrow \mathbb{R}$ is a bounded deterministic reward function; $P:S\times A\times S \rightarrow [0,1]$ is the transition probability distribution; and $\gamma\in (0,1)$ is the discount factor \cite{Sutton2018book}. Because of the requirements of online learning, this paper considers value iteration based on sampling. In each sampling step, an experience (or transition) $\langle s, a, s', r\rangle$ is obtained.

A policy is a mapping $\pi:S\times A \rightarrow [0,1]$. The goal of the agent is to find an optimal policy $\pi^*$ that maximizes the expected discounted cumulative reward over the long run. The state value function $V^{\pi}(s)$ of a stationary policy $\pi$ is defined as:
\begin{equation*}
V^{\pi}(s)=\mathbb{E}_{\pi}[\sum_{k=0}^{\infty} \gamma^k R_{k}|s_0=s].
\label{valuefunction}
\end{equation*}
The linear value function for state $s\in S$ is defined as:
\begin{equation}
V_{{\theta}}(s):= {\theta}^{\top}{\phi}(s) = \sum_{i=1}^{m} \theta_i \phi_i(s),
\label{linearvaluefunction}
\end{equation}
where ${\theta}:=(\theta_1,\theta_2,\ldots,\theta_m)^{\top}\in \mathbb{R}^m$ is a parameter vector, ${\phi}:=(\phi_1,\phi_2,\ldots,\phi_m)^{\top}\in \mathbb{R}^m$ is a feature function defined on the state space $S$, and $m$ is the number of features.

Tabular temporal difference (TD) learning \cite{Sutton2018book} has been successfully applied to small-scale problems. To deal with the well-known curse of dimensionality in large-scale MDPs, the value function is usually approximated by a linear model, kernel methods, decision trees, or neural networks, etc.
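As a concrete illustration of the linear parameterization in (\ref{linearvaluefunction}), the following minimal sketch evaluates $V_{\theta}(s)=\theta^{\top}\phi(s)$; the one-hot (tabular) feature map and the dimensions are illustrative only, not the features used in our experiments:
\begin{verbatim}
import numpy as np

def phi(s, m=5):
    # Illustrative tabular (one-hot) feature
    # vector for state s.
    x = np.zeros(m)
    x[s] = 1.0
    return x

def v(theta, s):
    # Linear value estimate: theta^T phi(s).
    return float(np.dot(theta, phi(s)))

theta = np.zeros(5)  # one parameter per feature
print(v(theta, 2))   # 0.0 before any learning
\end{verbatim}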
This paper focuses on the linear model, where features are usually hand-coded by domain experts.

TD learning can also be used to find optimal policies; this is often called the control problem. Two popular TD control methods are Sarsa and Q-learning: the former is an on-policy method, while the latter is off-policy. It is well known that the TDC algorithm \cite{sutton2009fast} guarantees convergence under off-policy conditions, whereas off-policy TD may diverge. The objective function of TDC is the MSPBE; TDC is essentially an adjustment or correction of the TD update so that it follows the gradient of the MSPBE. The control algorithm built on TDC is known as Greedy-GQ($\lambda$) \cite{sutton2009fast}; when $\lambda$ is set to 0, it is denoted GQ(0).

\section{Variance Minimization Algorithms}
\subsection{Motivation}
In reinforcement learning, bias can be acceptable, while in supervised learning it typically is not. As shown in Table \ref{example_bias}, although there is a bias between the true values and the predicted values, action $a_3$ is still chosen under the greedy policy. In contrast, supervised learning is often used to predict quantities such as temperature, humidity, or morbidity, where a large bias can have serious consequences.

\begin{table}[t]
\caption{Greedy action selection is unchanged when a constant bias is added to all action values.}
\label{example_bias}
\vskip 0.15in
\begin{center}
\begin{small}
\begin{sc}
\begin{tabular}{lcc}
\toprule
action & $Q$ value & $Q$ value with bias \\
\midrule
$Q(s, a_0)$ & 1& 5 \\
$Q(s, a_1)$ & 2& 6 \\
$Q(s, a_2)$ & 3& 7 \\
$Q(s, a_3)$ & 4& 8 \\
$\arg \max_{a}Q(s,a)$  & $a_3$& $a_3$\\
\bottomrule
\end{tabular}
\end{sc}
\end{small}
\end{center}
\vskip -0.1in
\end{table}

In addition, reward shaping can significantly speed up learning by adding a shaping reward $F(s,s')$ to the original reward $r$, where $F(s,s')$ is the general form of any state-based shaping reward. Static potential-based reward shaping (static PBRS) maintains policy invariance if the shaping reward has the form $F(s,s')=\gamma f(s')-f(s)$ \cite{ng1999policy}. This means that we can modify the TD error $\delta = r+\gamma \theta^{\top}\phi'-\theta^{\top}\phi$ while still preserving the optimal policy,
\begin{equation*}
\delta - \omega= r+\gamma \theta^{\top}\phi'-\theta^{\top}\phi  - \omega,
\end{equation*}
where $\omega$ is a constant acting as a static PBRS. This also means that algorithms whose optimization goal is to minimize errors may, after introducing reward shaping, end up with a larger or smaller bias. Fortunately, as discussed above, bias is acceptable in reinforcement learning. The problem, however, is that selecting an appropriate $\omega$ requires expert knowledge, which forces us to learn $\omega$ dynamically, i.e., $\omega=\omega_t$. Dynamic PBRS can also maintain policy invariance if the shaping reward is $F(s,t,s',t')=\gamma f(s',t')-f(s,t)$, where $t$ is the time step at which the agent reaches state $s$ \cite{devlin2012dynamic}. However, this result requires the convergence of the dynamic potential function $f(s,t)$: if $f(s,t)$ does not converge as $t\rightarrow\infty$, the Q-values under dynamic PBRS are not guaranteed to converge. Let $f_{\omega_t}(s)=\frac{\omega_t}{\gamma-1}$. Then $F_{\omega_t}(s,s')=\gamma f_{\omega_t}(s')-f_{\omega_t}(s)= \omega_t$ is a dynamic PBRS.
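As a quick numerical check of the two observations above (the $Q$ values are the illustrative ones from Table \ref{example_bias}; the values of $\gamma$ and $\omega$ are arbitrary):
\begin{verbatim}
import numpy as np

# A constant shift of all action values leaves
# the greedy action unchanged.
q = np.array([1.0, 2.0, 3.0, 4.0])
assert np.argmax(q) == np.argmax(q + 4.0) == 3

# With f(s) = omega / (gamma - 1), the shaping
# reward F(s,s') = gamma*f(s') - f(s) equals the
# constant omega.
gamma, omega = 0.99, 0.5
f = omega / (gamma - 1.0)
assert abs((gamma * f - f) - omega) < 1e-12
\end{verbatim}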
If $\omega$ eventually converges, the dynamic potential function $f(s,t)$ converges as well.

Bias is the expected difference between the predicted value and the true value. Therefore, under the premise of bootstrapping, a natural choice is $\omega \doteq \mathbb{E}[\mathbb{E}[\delta|s]]=\mathbb{E}[\delta]$. Recall that the optimization problems solved by linear TD(0) (semi-gradient) and linear TDC are, respectively,
\begin{equation*}
\theta^{*}= \arg \min_{\theta} \mathbb{E}[(\mathbb{E}[\delta |s])^2],
\end{equation*}
and
\begin{equation*}
\theta^{*}=\arg \min_{\theta} \mathbb{E}[\delta \phi]^{\top} \mathbb{E}[\phi \phi^{\top}]^{-1} \mathbb{E}[\delta\phi].
\end{equation*}
As a result, we propose two novel objective functions and their corresponding algorithms, where $\omega$ is subsequently proven to converge, which means that both algorithms maintain the invariance of the optimal policy.

\subsection{Variance Minimization TD Learning: VMTD}
For on-policy learning, a novel objective function, the Variance of the Bellman Error (VBE), is proposed as follows:
\begin{equation}
\begin{array}{ccl}
\arg \min_{\theta}\text{VBE}(\theta)&=&\arg \min_{\theta}\mathbb{E}[(\mathbb{E}[\delta|s]-\mathbb{E}[\mathbb{E}[\delta|s]])^2]\\
&=&\arg \min_{\theta,\omega} \mathbb{E}[(\mathbb{E}[\delta|s]-\omega)^2].
\end{array}
\end{equation}
Clearly, the objective is no longer to minimize Bellman errors. First, the update of the parameter $\omega$ follows directly from stochastic gradient descent:
\begin{equation}
\omega_{k+1}\leftarrow \omega_{k}+\beta_k(\delta_k-\omega_k),
\label{omega}
\end{equation}
where $\delta_k$ is the TD error:
\begin{equation}
\delta_k = r+\gamma \theta_k^{\top}\phi_{k}'-\theta_k^{\top}\phi_k.
\label{delta}
\end{equation}
Then, based on stochastic semi-gradient descent, the update of the parameter $\theta$ is:
\begin{equation}
\theta_{k+1}\leftarrow \theta_{k}+\alpha_k(\delta_k-\omega_k)\phi_k.
\label{theta}
\end{equation}
The pseudocode of the VMTD algorithm is shown in Algorithm \ref{alg:algorithm 1}.

For control tasks, the two extensions of VMTD are named VMSarsa and VMQ, respectively, with the update formulas
\begin{equation}
\theta_{k+1}\leftarrow \theta_{k}+\alpha_k(\delta_k-\omega_k)\phi(s_k,a_k).
\end{equation}
and
\begin{equation}
\omega_{k+1}\leftarrow \omega_{k}+\beta_k(\delta_k-\omega_k),
\end{equation}
where the TD error $\delta_k$ in VMSarsa is
\begin{equation}
\delta_{k}=r_{k+1}+\gamma \theta_{k}^{\top}\phi(s_{k+1},a_{k+1}) - \theta_{k}^{\top}\phi(s_{k},a_{k}),
\label{deltaSarsa}
\end{equation}
and the TD error $\delta_k$ in VMQ is
\begin{equation}
\delta_{k}=r_{k+1}+\gamma \max_{a\in A}\theta_{k}^{\top}\phi(s_{k+1},a) - \theta_{k}^{\top}\phi(s_{k},a_{k}).
\label{deltaQ}
\end{equation}

\begin{algorithm}[t]
   \caption{VMTD algorithm with linear function approximation in the on-policy setting}
   \label{alg:algorithm 1}
\begin{algorithmic}
   \STATE {\bfseries Input:} $\theta_{0}$, $\omega_{0}$, $\gamma$, learning rates $\alpha_t$ and $\beta_t$
   \STATE Initialize $\theta_{0}$ arbitrarily and $\omega_{0}$ to $0$; $\gamma \in (0,1]$; $\alpha_t$ and $\beta_t$ are constant.
   \REPEAT
   \FOR{$t=0$ {\bfseries to} $T-1$}
   \STATE Take $A_t$ in $S_t$ according to policy $\mu$, and arrive at $S_{t+1}$
   \STATE Observe sample ($S_t$,$R_{t+1}$,$S_{t+1}$) at time step $t$ (with their corresponding state feature vectors)
   \STATE $\delta_t = R_{t+1}+\gamma\theta_t^{\top}\phi_{t}'-\theta_t^{\top}\phi_t$
   \STATE $\theta_{t+1}\leftarrow \theta_{t}+\alpha_t(\delta_t-\omega_t)\phi_t$
   \STATE $\omega_{t+1}\leftarrow \omega_{t}+\beta_t(\delta_t-\omega_t)$
   \STATE $S_t \leftarrow S_{t+1}$
   \ENDFOR
   \UNTIL{terminal episode}
\end{algorithmic}
\end{algorithm}

\subsection{Variance Minimization TDC Learning: VMTDC}
For off-policy learning, we employ a projection operator. The objective function, the Variance of the Projected Bellman Error (VPBE), is defined as
\begin{equation}
\begin{array}{ccl}
\text{VPBE}(\theta)&=&\mathbb{E}[(\delta-\mathbb{E}[\delta]) \phi]^{\top} \mathbb{E}[\phi \phi^{\top}]^{-1}\mathbb{E}[(\delta-\mathbb{E}[\delta])\phi]\\
&=&\mathbb{E}[(\delta-\omega) \phi]^{\top} \mathbb{E}[\phi \phi^{\top}]^{-1}\mathbb{E}[(\delta-\omega)\phi],
\end{array}
\end{equation}
where $\omega$ is used to estimate $\mathbb{E}[\delta]$, i.e., $\omega \doteq \mathbb{E}[\delta]$, and the corresponding algorithm is called VMTDC. The derivation of the VMTDC algorithm is the same as that of the TDC algorithm; the only difference is that the original $\delta$ is replaced by $\delta-\omega$.
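Before giving the VMTDC update formulas, the core VMTD recursion, i.e., (\ref{omega}) and (\ref{theta}), is sketched below in code; the environment interface, feature map, and constant step sizes are illustrative assumptions, not part of the algorithm specification:
\begin{verbatim}
import numpy as np

def vmtd_episode(env, phi, theta, omega,
                 alpha=0.05, beta=0.5, gamma=0.99):
    # One on-policy episode of VMTD: the TD error
    # is centered by a running estimate omega of
    # E[delta] before theta is updated.
    s, done = env.reset(), False
    while not done:
        a = env.sample_action(s)  # on-policy
        s_next, r, done = env.step(s, a)
        x, x_next = phi(s), phi(s_next)
        v_next = 0.0 if done else np.dot(theta, x_next)
        delta = r + gamma * v_next - np.dot(theta, x)
        theta = theta + alpha * (delta - omega) * x
        omega = omega + beta * (delta - omega)
        s = s_next
    return theta, omega
\end{verbatim}
At termination the bootstrap term is set to zero, matching the episodic setting of Algorithm \ref{alg:algorithm 1}; $\omega$ is exactly the quantity tracked on the faster time scale in the theoretical analysis below.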
Accordingly, the update formulas of VMTDC are as follows:
\begin{equation}
\theta_{k+1}\leftarrow\theta_{k}+\alpha_{k}[(\delta_{k}- \omega_k) \phi(s_k)\\
 - \gamma\phi(s_{k+1})(\phi^{\top} (s_k) u_k)],
 \label{thetavmtdc}
\end{equation}
\begin{equation}
u_{k+1}\leftarrow u_{k}+\zeta_{k}[\delta_{k}-\omega_k - \phi^{\top} (s_k) u_k]\phi(s_k),
\label{uvmtdc}
\end{equation}
and
\begin{equation}
\omega_{k+1}\leftarrow \omega_{k}+\beta_k (\delta_k- \omega_k).
\label{omegavmtdc}
\end{equation}
The pseudocode of the VMTDC algorithm for the importance-sampling scenario is shown in Algorithm \ref{alg:algorithm 2} of Appendix \ref{proofth2}.

We now introduce the improved version of the GQ(0) algorithm, named VMGQ(0):
\begin{equation}
\begin{array}{ccl}
\theta_{k+1}\leftarrow\theta_{k}&+&\alpha_{k}[(\delta_{k}- \omega_k) \phi(s_k,a_k)\\
&-& \gamma\phi(s_{k+1},A^{*}_{k+1})(\phi^{\top} (s_k,a_k) u_k)],
\end{array}
\end{equation}
\begin{equation}
u_{k+1}\leftarrow u_{k}+\zeta_{k}[(\delta_{k}-\omega_k) - \phi^{\top} (s_k,a_k) u_k]\phi(s_k,a_k),
\end{equation}
and
\begin{equation}
\omega_{k+1}\leftarrow \omega_{k}+\beta_k(\delta_k- \omega_k),
\end{equation}
where $\delta_{k}$ is given by (\ref{deltaQ}) and $A^{*}_{k+1}={\arg \max}_{a}(\theta_{k}^{\top}\phi(s_{k+1},a))$.

\section{Theoretical Analysis}
This section establishes the stability of the VMTD and VMTDC algorithms and presents a corollary on the convergence rate of VMTD.

\begin{theorem}
\label{theorem1}(Convergence of VMTD).
In the case of on-policy learning, consider the iterations (\ref{omega}) and (\ref{theta}) with (\ref{delta}) of VMTD. Let the step-size sequences $\alpha_k$ and $\beta_k$, $k\geq 0$, satisfy $\alpha_k,\beta_k>0$ for all $k$,
$ \sum_{k=0}^{\infty}\alpha_k=\sum_{k=0}^{\infty}\beta_k=\infty, $
$ \sum_{k=0}^{\infty}\alpha_k^2<\infty, $
$ \sum_{k=0}^{\infty}\beta_k^2<\infty, $
and
$ \alpha_k = o(\beta_k). $
Assume that $(\phi_k,r_k,\phi_k')$ is an i.i.d. sequence with uniformly bounded second moments, where $\phi_k$ and $\phi'_{k}$ are sampled from the same Markov chain. Let $A = \mathrm{Cov}(\phi,\phi-\gamma\phi')$ and $b=\mathrm{Cov}(r,\phi)$. Assume that the matrix $A$ is non-singular. Then the parameter vector $\theta_k$ converges with probability one to $A^{-1}b$.
\end{theorem}
\begin{proof}
\label{th1proof}
The proof is based on Borkar's theorem for general stochastic approximation recursions with two time scales \cite{borkar1997stochastic}.

A new one-step linear TD solution is defined as:
\begin{equation*}
0=\mathbb{E}[(\delta-\mathbb{E}[\delta]) \phi]=-A\theta+b.
\end{equation*}
Thus, the VMTD solution is $\theta_{\text{VMTD}}=A^{-1}b$.

First, note that recursion (\ref{theta}) can be rewritten as
\begin{equation*}
\theta_{k+1}\leftarrow \theta_k+\beta_k\xi(k),
\end{equation*}
where
\begin{equation*}
\xi(k)=\frac{\alpha_k}{\beta_k}(\delta_k-\omega_k)\phi_k.
\end{equation*}
Because of the step-size schedule $\alpha_k = o(\beta_k)$, $\xi(k)\rightarrow 0$ almost surely as $k\rightarrow\infty$. That is, the increments in iteration (\ref{omega}) are uniformly larger than those in (\ref{theta}), so (\ref{omega}) is the faster recursion.
Along the faster time scale, iterations (\ref{omega}) and (\ref{theta}) are associated with the following ODE system:
\begin{equation}
\dot{\theta}(t) = 0,
\label{thetaFast}
\end{equation}
\begin{equation}
\dot{\omega}(t)=\mathbb{E}[\delta_t|\theta(t)]-\omega(t).
\label{omegaFast}
\end{equation}
Based on the ODE (\ref{thetaFast}), $\theta(t)\equiv \theta$ when viewed from the faster timescale. By the Hirsch lemma \cite{hirsch1989convergent}, it follows that $||\theta_k-\theta||\rightarrow 0$ a.s. as $k\rightarrow \infty$ for some $\theta$ that depends on the initial condition $\theta_0$ of recursion (\ref{theta}). Thus, the ODE pair (\ref{thetaFast})-(\ref{omegaFast}) can be written as
\begin{equation}
\dot{\omega}(t)=\mathbb{E}[\delta_t|\theta]-\omega(t).
\label{omegaFastFinal}
\end{equation}
Consider the function $h(\omega)=\mathbb{E}[\delta|\theta]-\omega$, i.e., the driving vector field of the ODE (\ref{omegaFastFinal}). The function $h$ is Lipschitz continuous in $\omega$. Let $h_{\infty}(\cdot)$ be the function defined by $h_{\infty}(\omega)=\lim_{x\rightarrow \infty}\frac{h(x\omega)}{x}$. Then $h_{\infty}(\omega)= -\omega$ is well-defined. For (\ref{omegaFastFinal}), $\omega^*=\mathbb{E}[\delta|\theta]$ is the unique globally asymptotically stable equilibrium. For the ODE
\begin{equation}
\dot{\omega}(t) = h_{\infty}(\omega(t))= -\omega(t),
\label{omegaInfty}
\end{equation}
apply $\vec{V}(\omega)=(-\omega)^{\top}(-\omega)/2$ as its associated strict Lyapunov function. Then, the origin of (\ref{omegaInfty}) is a globally asymptotically stable equilibrium.

Consider now the recursion (\ref{omega}). Let $M_{k+1}=(\delta_k-\omega_k) -\mathbb{E}[(\delta_k-\omega_k)|\mathcal{F}(k)]$, where $\mathcal{F}(k)=\sigma(\omega_l,\theta_l,l\leq k;\phi_s,\phi_s',r_s,s<k)$ denotes the corresponding sigma field. Then $\{M_{k+1}\}$ is a martingale difference sequence, and, because $(\phi_k,r_k,\phi_k')$ has uniformly bounded second moments, there exists a constant $c_1>0$ such that, $\forall k\geq0$,
\begin{equation*}
\mathbb{E}[||M_{k+1}||^2|\mathcal{F}(k)]\leq c_1(1+||\omega_k||^2+||\theta_k||^2).
\end{equation*}
Now Assumptions (A1) and (A2) of \cite{borkar2000ode} are verified. Furthermore, Assumption (TS) of \cite{borkar2000ode} is satisfied by our conditions on the step-size sequences $\alpha_k$, $\beta_k$. Thus, by Theorem 2.2 of \cite{borkar2000ode} we obtain that $||\omega_k-\omega^*||\rightarrow 0$ almost surely as $k\rightarrow \infty$.

Consider now the slower time scale recursion (\ref{theta}). Based on the above analysis, (\ref{theta}) can be rewritten as
\begin{equation*}
\theta_{k+1}\leftarrow \theta_{k}+\alpha_k(\delta_k-\mathbb{E}[\delta_k|\theta_k])\phi_k.
\end{equation*}
Let $Z_{k+1}=(\delta_k-\mathbb{E}[\delta_k|\theta_k])\phi_k-\mathbb{E}[(\delta_k-\mathbb{E}[\delta_k|\theta_k])\phi_k|\mathcal{G}(k)]$, where $\mathcal{G}(k)=\sigma(\theta_l,l\leq k;\phi_s,\phi_s',r_s,s<k)$ denotes the corresponding sigma field. Then $\{Z_{k+1}\}$ is a martingale difference sequence, and there exists a constant $c_2>0$ such that, $\forall k\geq0$,
\begin{equation*}
\mathbb{E}[||Z_{k+1}||^2|\mathcal{G}(k)]\leq c_2(1+||\theta_k||^2).
\end{equation*}
Consider now the following ODE associated with (\ref{theta}):
\begin{equation}
\begin{array}{ccl}
\dot{\theta}(t)&=&\mathrm{Cov}(\delta|\theta(t),\phi)\\
&=&\mathrm{Cov}(r+(\gamma\phi'-\phi)^{\top}\theta(t),\phi)\\
&=&\mathrm{Cov}(r,\phi)-\mathrm{Cov}(\theta(t)^{\top}(\phi-\gamma\phi'),\phi)\\
&=&\mathrm{Cov}(r,\phi)-\theta(t)^{\top}\mathrm{Cov}(\phi-\gamma\phi',\phi)\\
&=&\mathrm{Cov}(r,\phi)-\mathrm{Cov}(\phi-\gamma\phi',\phi)^{\top}\theta(t)\\
&=&\mathrm{Cov}(r,\phi)-\mathrm{Cov}(\phi,\phi-\gamma\phi')\theta(t)\\
&=&-A\theta(t)+b.
\end{array}
\label{odetheta}
\end{equation}
Let $\vec{h}(\theta(t))$ be the driving vector field of the ODE (\ref{odetheta}):
\begin{equation*}
\vec{h}(\theta(t))=-A\theta(t)+b.
\end{equation*}
Consider the cross-covariance matrix,
\begin{equation}
\begin{array}{ccl}
A &=& \mathrm{Cov}(\phi,\phi-\gamma\phi')\\
&=&\frac{\mathrm{Cov}(\phi,\phi)+\mathrm{Cov}(\phi-\gamma\phi',\phi-\gamma\phi')-\mathrm{Cov}(\gamma\phi',\gamma\phi')}{2}\\
&=&\frac{\mathrm{Cov}(\phi,\phi)+\mathrm{Cov}(\phi-\gamma\phi',\phi-\gamma\phi')-\gamma^2\mathrm{Cov}(\phi',\phi')}{2}\\
&=&\frac{(1-\gamma^2)\mathrm{Cov}(\phi,\phi)+\mathrm{Cov}(\phi-\gamma\phi',\phi-\gamma\phi')}{2},\\
\end{array}
\label{covariance}
\end{equation}
where we used $\mathrm{Cov}(\phi',\phi')=\mathrm{Cov}(\phi,\phi)$ in the last step\footnote{The covariance matrix $\mathrm{Cov}(\phi',\phi')$ equals $\mathrm{Cov}(\phi,\phi)$ if, for on-policy updates, the initial state of the Markov chain is re-reachable or is initialized randomly.}. Note that the covariance matrices $\mathrm{Cov}(\phi,\phi)$ and $\mathrm{Cov}(\phi-\gamma\phi',\phi-\gamma\phi')$ are positive semi-definite. The matrix $A$ is therefore positive semi-definite, since by (\ref{covariance}) it is a positively weighted combination of two positive semi-definite matrices. Furthermore, $A$ is non-singular by assumption. Hence, the cross-covariance matrix $A$ is positive definite. Therefore, $\theta^*=A^{-1}b$ can be seen to be the unique globally asymptotically stable equilibrium of the ODE (\ref{odetheta}).

Let $\vec{h}_{\infty}(\theta)=\lim_{r\rightarrow \infty}\frac{\vec{h}(r\theta)}{r}$. Then $\vec{h}_{\infty}(\theta)=-A\theta$ is well-defined. Consider now the ODE
\begin{equation}
\dot{\theta}(t)=-A\theta(t).
\label{odethetafinal}
\end{equation}
The ODE (\ref{odethetafinal}) has the origin as its unique globally asymptotically stable equilibrium. Thus, Assumptions (A1) and (A2) are verified.
\end{proof}

Theorem 3 in \cite{dalal2020tale} provides a general conclusion on the convergence rate of all linear two-timescale algorithms. VMTD satisfies the assumptions of this theorem, which leads to the following corollary.

\begin{corollary}
\label{corollary4_2}
Consider the sparsely projected variant of VMTD. Then, for $\alpha_k = 1/(k+1)^{\alpha}$, $\beta_k = 1/(k+1)^{\beta}$, $0<\beta<\alpha<1$, $p>1$, with probability larger than $1- \tau$, for all $k\geq N_3$, we have
\begin{equation}
||\theta'_{k} - \theta^{*}|| \le C_{3,\theta}  \frac{\sqrt{\ln (4d_{1}^{2}(k+1)^{p}/\tau)} }{(k+1)^{\alpha / 2}},
\end{equation}
\begin{equation}
||\omega'_{k} - \omega^{*}|| \le C_{3,\omega}  \frac{\sqrt{\ln (4d_{2}^{2}(k+1)^{p}/\tau)} }{(k+1)^{\beta / 2}},
\end{equation}
\end{corollary}
where $d_1$ and $d_2$ are the dimensions of $\theta$ and $\omega$, respectively; for VMTD, $d_2 =1$. The meanings of $N_3$, $C_{3,\theta}$ and $C_{3,\omega}$ are explained in \cite{dalal2020tale}. The formulas for the sparsely projected iterates $\theta'_{k}$ and $\omega'_{k}$ can be found in (\ref{sparseprojectiontheta}) and (\ref{sparseprojectionomega}).

\begin{theorem}
\label{theorem2}(Convergence of VMTDC).
In the case of off-policy learning, consider the iterations (\ref{omegavmtdc}), (\ref{uvmtdc}) and (\ref{thetavmtdc}) of VMTDC. Let the step-size sequences $\alpha_k$, $\zeta_k$ and $\beta_k$, $k\geq 0$, satisfy $\alpha_k,\zeta_k,\beta_k>0$ for all $k$,
$ \sum_{k=0}^{\infty}\alpha_k=\sum_{k=0}^{\infty}\beta_k=\sum_{k=0}^{\infty}\zeta_k=\infty, $
$ \sum_{k=0}^{\infty}\alpha_k^2<\infty, $
$ \sum_{k=0}^{\infty}\zeta_k^2<\infty, $
$ \sum_{k=0}^{\infty}\beta_k^2<\infty, $
and
$ \alpha_k = o(\zeta_k), $
$ \zeta_k = o(\beta_k). $
Assume that $(\phi_k,r_k,\phi_k')$ is an i.i.d.
sequence with uniformly bounded second moments. Let $A = \mathrm{Cov}(\phi,\phi-\gamma\phi')$, $b=\mathrm{Cov}(r,\phi)$, and $C=\mathbb{E}[\phi\phi^{\top}]$. Assume that $A$ and $C$ are non-singular matrices. Then the parameter vector $\theta_k$ converges with probability one to $A^{-1}b$.
\end{theorem}
Please refer to Appendix \ref{proofth2} for the detailed proof.

\section{Experimental Studies}
This section assesses algorithm performance through experiments, which are divided into policy evaluation experiments and control experiments.

\subsection{Testing Tasks}
\textbf{Random-walk:} as shown in Figure \ref{randomwalk}, all episodes start in the center state, $C$, and proceed either left or right by one state on each step, with equal probability. Episodes terminate at the extreme left or the extreme right, with a reward of $+1$ on termination at the right and $0$ otherwise. In this task, the true value of each state is the probability of terminating on the right when starting from that state \cite{Sutton2018book}. Thus, the true values of states $A$ to $E$ are $\frac{1}{6},\frac{2}{6},\frac{3}{6},\frac{4}{6},\frac{5}{6}$, respectively. The discount factor is $\gamma=1.0$. There are three standard kinds of features for random-walk problems: tabular features, inverted features, and dependent features \cite{sutton2009fast}. The feature matrices corresponding to the three random walks are shown in Appendix \ref{experimentaldetails}. Experiments in the Random-walk environment are conducted in the on-policy setting.

\begin{figure}
  \begin{center}
    \input{pic/randomwalk.tex}
  \caption{Random walk.}
  \label{randomwalk}
  \end{center}
\end{figure}

\begin{figure}
  \begin{center}
    \input{pic/BairdExample.tex}
  \caption{7-state version of Baird's off-policy counterexample.}
  \label{bairdexample}
  \end{center}
\end{figure}

\textbf{Baird's off-policy counterexample:} this task is well known as a counterexample on which TD diverges \cite{baird1995residual,sutton2009fast}. As shown in Figure \ref{bairdexample}, the reward for each transition is zero; thus the true values are zero for all states under any policy. The behavior policy chooses the actions represented by solid lines with probability $\frac{1}{7}$ and the actions represented by dotted lines with probability $\frac{6}{7}$. The target policy is expected to choose the solid line with probability greater than $\frac{1}{7}$; in this paper it chooses the solid line with probability $1$. The discount factor is $\gamma =0.99$, and the feature matrix is defined in Appendix \ref{experimentaldetails} \cite{baird1995residual,sutton2009fast,maei2011gradient}.

\textbf{Maze:} the learning agent should find the shortest path from the upper left corner to the lower right corner.
\begin{wrapfigure}{r}{4cm}
\centering
\includegraphics[scale=0.2]{pic/maze_13_13.pdf}
\end{wrapfigure}
In each state, there are four possible actions: $up$, $down$, $left$, and $right$, each of which moves the agent deterministically to the corresponding neighboring state, except when the movement is blocked by an obstacle or the edge of the maze. Rewards are $-1$ on all transitions until the agent reaches the goal state. The discount factor is $\gamma=0.99$, states are represented by tabular features, and the maximum number of moves per episode is set to 1000.
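As a small sanity check for the Random-walk task described above, the stated true values $\frac{1}{6},\ldots,\frac{5}{6}$ can be reproduced with a few lines of iterative policy evaluation; the state encoding below is illustrative:
\begin{verbatim}
import numpy as np

# States 0..4 correspond to A..E; stepping left of A or
# right of E terminates, with reward +1 only at the right.
V = np.zeros(5)
for _ in range(10000):          # iterative policy evaluation
    for s in range(5):
        left = 0.0 if s == 0 else V[s - 1]
        right = 1.0 if s == 4 else V[s + 1]
        V[s] = 0.5 * left + 0.5 * right
print(np.round(V, 3))  # ~ [0.167 0.333 0.5 0.667 0.833]
\end{verbatim}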
\textbf{The other three control environments:} Cliff Walking, Mountain Car, and Acrobot are taken from the official Gym website, in the versions ``CliffWalking-v0'', ``MountainCar-v0'' and ``Acrobot-v1''; please refer to the Gym documentation for details. The maximum number of steps per episode in Mountain Car is set to 1000, while the default settings are used for the other two environments. In Mountain Car and Acrobot, features are generated by tile coding. Please refer to Appendix \ref{experimentaldetails} for the selection of learning rates in all experiments.

\subsection{Experimental Results and Analysis}

\begin{figure}[htb]
\vskip 0.2in
\begin{center}
\subfigure[Dependent]{
\includegraphics[width=0.46\columnwidth, height=0.46\columnwidth]{pic/dependent_new.pdf}
\label{DependentFull}
}
\subfigure[Tabular]{
\includegraphics[width=0.46\columnwidth, height=0.46\columnwidth]{pic/tabular_new.pdf}
\label{TabularFull}
}
\\
\subfigure[Inverted]{
\includegraphics[width=0.46\columnwidth, height=0.46\columnwidth]{pic/inverted_new.pdf}
\label{InvertedFull}
}
\subfigure[Counterexample]{
\includegraphics[width=0.46\columnwidth, height=0.46\columnwidth]{pic/counterexample_quanju_new.pdf}
\label{CounterExampleFull}
}
\caption{Learning curves for the four evaluation environments.}
\label{Evaluation_full}
\end{center}
\vskip -0.2in
\end{figure}

\begin{figure*}[htb]
\vskip 0.2in
\begin{center}
\subfigure[Maze]{
\includegraphics[width=0.9\columnwidth, height=0.64\columnwidth]{pic/maze_complete.pdf}
\label{MazeFull}
}
\subfigure[Cliff Walking]{
\includegraphics[width=0.9\columnwidth, height=0.64\columnwidth]{pic/cw_complete.pdf}
\label{CliffWalkingFull}
}
\\
\subfigure[Mountain Car]{
\includegraphics[width=0.9\columnwidth, height=0.64\columnwidth]{pic/mt_complete.pdf}
\label{MountainCarFull}
}
\subfigure[Acrobot]{
\includegraphics[width=0.9\columnwidth, height=0.64\columnwidth]{pic/Acrobot_complete.pdf}
\label{AcrobotFull}
}
\caption{Learning curves for the four control environments.}
\label{Complete_full}
\end{center}
\vskip -0.2in
\end{figure*}

\begin{table*}
	\centering
	\caption{Difference between R-learning and tabular VMQ.}
    \vskip 0.15in
	\begin{tabular}{c|c}
		\hline
		Algorithm&Update formulas \\
		\hline
		R-learning&$Q_{k+1}(s,a)\leftarrow Q_{k}(s,a)+\alpha_k(r_{k+1}-m_{k}+ \max_{b\in A}Q_{k}(s,b) - Q_{k}(s,a))$\\
		&$m_{k+1}\leftarrow m_{k}+\beta_k(r_{k+1}+\max_{b\in A}Q_{k}(s,b) - Q_{k}(s,a)-m_{k})$\\
		tabular VMQ&$Q_{k+1}(s,a)\leftarrow Q_{k}(s,a)+\alpha_k(r_{k+1}+\gamma \max_{b\in A}Q_{k}(s,b) - Q_{k}(s,a)-\omega_k)$\\
		&$\omega_{k+1}\leftarrow \omega_{k}+\beta_k(r_{k+1}+\gamma \max_{b\in A}Q_{k}(s,b) - Q_{k}(s,a)-\omega_{k})$\\
		\hline
	\end{tabular}
 \label{differenceRandVMQ}
 \vskip -0.1in
\end{table*}

For the policy evaluation experiments, we compare the performance of the VMTD, VMTDC, TD, and TDC algorithms, with the vertical axis unified as the RVBE. Evaluation criteria differ across algorithms, because the objective function minimized by the proposed algorithms differs from that of the other algorithms; therefore, to ensure fair comparisons, head-to-head contrasts are drawn mainly in the control tasks. The control experiments compare the performance of Sarsa, Q-learning, GQ(0), AC, VMSarsa, VMQ, and VMGQ(0) in four control environments.
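As a compact reference for how the compared value-based updates differ, the sketch below contrasts one linear Q-learning step with the corresponding VMQ step; the step sizes and the state-action feature inputs are illustrative assumptions:
\begin{verbatim}
import numpy as np

def q_learning_step(theta, x_sa, x_next_all, r,
                    alpha=0.05, gamma=0.99):
    # Standard linear Q-learning step.
    q_next = max(np.dot(theta, x) for x in x_next_all)
    delta = r + gamma * q_next - np.dot(theta, x_sa)
    return theta + alpha * delta * x_sa

def vmq_step(theta, omega, x_sa, x_next_all, r,
             alpha=0.05, beta=0.5, gamma=0.99):
    # VMQ step: same TD error, but the update is
    # centered by omega, a running estimate of E[delta].
    q_next = max(np.dot(theta, x) for x in x_next_all)
    delta = r + gamma * q_next - np.dot(theta, x_sa)
    theta = theta + alpha * (delta - omega) * x_sa
    omega = omega + beta * (delta - omega)
    return theta, omega
\end{verbatim}
Here \texttt{x\_sa} stands for $\phi(s_k,a_k)$ and \texttt{x\_next\_all} collects $\phi(s_{k+1},a)$ over all actions $a$; VMSarsa differs only in using the feature of the next on-policy action instead of the maximum.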
% All experiments involved in this paper were run independently for 100 times.
The learning curves corresponding to the policy evaluation experiments and the control experiments are shown in Figures \ref{Evaluation_full} and \ref{Complete_full}, respectively. The shaded areas in Figures \ref{Evaluation_full} and \ref{Complete_full} represent the standard deviation (std).

In the random-walk tasks, VMTD and VMTDC exhibit excellent performance, outperforming TD and TDC in the dependent random walk. In the 7-state counterexample task, TD diverges, while VMTDC converges and performs better than TDC. From the update formula, it can be observed that the VMTD algorithm, like TDC, is also an adjustment or correction of the TD update. More surprisingly, VMTD also remains convergent and achieves the best performance.

In Maze, Mountain Car, and Acrobot, the convergence speed of VMSarsa, VMQ, and VMGQ(0) is significantly improved compared with Sarsa, Q-learning, and GQ(0), respectively. The performance of the AC algorithm is at an intermediate level, and the performances of VMSarsa, VMQ, and VMGQ(0) in these three environments show no significant differences. In Cliff Walking, Sarsa and VMSarsa converge to slightly worse solutions than the other algorithms, but VMSarsa converges significantly faster than Sarsa; VMGQ(0) and VMQ converge faster than the other algorithms, with VMGQ(0) performing slightly better than VMQ.

In summary, VMSarsa, VMQ, and VMGQ(0) outperform the other algorithms. In the Cliff Walking environment, VMGQ(0) performs slightly better than VMSarsa and VMQ, while in the other three environments the performances of VMSarsa, VMQ, and VMGQ(0) are close.

\section{Related Work}
\subsection{Difference between VMQ and R-learning}
The update formula of tabular VMQ bears some resemblance to that of R-learning. As shown in Table \ref{differenceRandVMQ}, the two update formulas differ as follows:
\\(1) The goal of the R-learning algorithm \cite{schwartz1993reinforcement} is to maximize the average reward rather than the cumulative reward, by learning an estimate $m$ of the average reward, which is then used to update the Q-values. In contrast, the $\omega$ in the tabular VMQ update eventually converges to $\mathbb{E}[\delta]$.
\\(2) When $\gamma=1$ in the tabular VMQ update, the R-learning update has the same form as the tabular VMQ update. Therefore, R-learning can formally be regarded as a special case of tabular VMQ.

\subsection{Variance Reduction for TD Learning}
The TD with centering (CTD) algorithm \cite{korda2015td} directly applies variance reduction techniques to the TD algorithm: it updates its parameters using the average gradient of a batch of Markovian samples and a projection operator. Unfortunately, the authors' analysis of the CTD algorithm contains technical errors. The VRTD algorithm \cite{xu2020reanalysis} is also a variance-reduced algorithm, updating its parameters with the average gradient of a batch of i.i.d. samples, and its authors provide a technically sound analysis that demonstrates the advantages of variance reduction.
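Schematically (this is a simplified illustration, not the exact CTD or VRTD procedure), such batch-style variance reduction replaces the single-sample TD update with an average over a batch of transitions:
\begin{verbatim}
import numpy as np

def batch_td_update(theta, batch, alpha=0.05, gamma=0.99):
    # Average the semi-gradient TD update direction over a
    # batch of (phi, r, phi_next) transitions; averaging
    # lowers the variance of the update direction.
    g = np.zeros_like(theta)
    for x, r, x_next in batch:
        delta = (r + gamma * np.dot(theta, x_next)
                 - np.dot(theta, x))
        g += delta * x
    return theta + alpha * g / len(batch)
\end{verbatim}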
\subsection{Variance Reduction for Policy Gradient Algorithms}
Policy gradient algorithms are a class of reinforcement learning algorithms that directly optimize the cumulative reward. REINFORCE is a Monte Carlo algorithm that estimates gradients through sampling, but it may suffer from high variance. Baselines are introduced to reduce variance and accelerate learning \cite{Sutton2018book}. In Actor-Critic methods, a value function serves as the baseline, and bootstrapping is used to further reduce variance and accelerate convergence \cite{Sutton2018book}. TRPO \cite{schulman2015trust} and PPO \cite{schulman2017proximal} use generalized advantage estimation, which combines multi-step bootstrapping and Monte Carlo estimation to reduce variance, making gradient estimation more stable and accelerating convergence.

In Variance Minimization, the incorporation of $\omega \doteq \mathbb{E}[\delta]$ bears a striking resemblance to the use of a baseline in policy gradient methods. The introduction of a baseline in policy gradient techniques does not alter the expected value of the update; rather, it significantly reduces the variance of gradient estimation. Likewise, the addition of $\omega \doteq \mathbb{E}[\delta]$ in Variance Minimization preserves the invariance of the optimal policy while stabilizing gradient estimation, reducing its variance, and hastening convergence.

\section{Conclusion and Future Work}
Value-based reinforcement learning typically takes error minimization as its optimization objective. As an alternative, this study proposes two new objective functions, the VBE and the VPBE, and derives an on-policy algorithm, VMTD, and an off-policy algorithm, VMTDC. Both algorithms demonstrated superior performance in policy evaluation and control experiments. Future work may include, but is not limited to, (1) analysis of the convergence rate of VMTDC, (2) extensions of the VBE and VPBE to multi-step returns, and (3) extensions to nonlinear approximations, such as neural networks.
\bibliography{example_paper}
\bibliographystyle{icml2024}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% APPENDIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\appendix
\onecolumn

\section{Relevant proofs}
\subsection{Proof of Corollary \ref{corollary4_2}}
\label{proofcorollary4_2}
The update formulas of linear two-timescale algorithms are as follows:
\begin{equation}
\theta_{k+1}=\theta_{k} + \alpha_{k}[h_1(\theta_{k},\omega_{k})+M^{(1)}_{k+1}],
\end{equation}
\begin{equation}
\omega_{k+1}=\omega_{k} + \beta_{k}[h_2(\theta_{k},\omega_{k})+M^{(2)}_{k+1}],
\end{equation}
where $\alpha_k, \beta_k \in \mathbb{R}$ are step sizes and $M^{(1)} \in \mathbb{R}^{d_1}, M^{(2)} \in \mathbb{R}^{d_2}$ denote noise.
$h_1 : \mathbb{R}^{d_{1}}\times \mathbb{R}^{d_{2}}\rightarrow \mathbb{R}^{d_{1}}$ and $h_2 : \mathbb{R}^{d_{1}}\times \mathbb{R}^{d_{2}}\rightarrow \mathbb{R}^{d_{2}}$ have the form, respectively,
\begin{equation}
h_{1}(\theta,\omega)=v_1 - \Gamma_1 \theta - W_1\omega,
\end{equation}
\begin{equation}
h_{2}(\theta,\omega)=v_2 - \Gamma_2 \theta - W_2\omega,
\end{equation}
where $v_1 \in \mathbb{R}^{d_1}$, $v_2 \in \mathbb{R}^{d_2}$, $\Gamma_1 \in \mathbb{R}^{d_1 \times d_1}$, $\Gamma_2 \in \mathbb{R}^{d_2 \times d_1}$, $W_1 \in \mathbb{R}^{d_1 \times d_2}$ and $W_2 \in \mathbb{R}^{d_2 \times d_2}$. Here $d_1$ and $d_2$ are the dimensions of the vectors $\theta$ and $\omega$, respectively. Theorem 3 in \cite{dalal2020tale} still holds even when $d_1$ is not equal to $d_2$; for the VMTD algorithm, $d_2$ equals 1.

Before proving Corollary \ref{corollary4_2}, we recall from \cite{dalal2020tale} the matrix assumption, the step-size assumption, and the sparse projection.
\begin{assumption}
\label{matrixassumption}
(Matrix Assumption). $W_2$ and $X_1 = \Gamma_1 - W_1 W_{2}^{-1}\Gamma_2$ are positive definite (not necessarily symmetric).
\end{assumption}
\begin{assumption}
\label{stepsizeassumption}
(Step Size Assumption). $\alpha_k = (k+1)^{-\alpha}$ and $\beta_k = (k+1)^{-\beta}$, where $1>\alpha > \beta > 0$.
\end{assumption}
\begin{definition}
\label{sparseprojection}
(Sparse Projection). For $R>0$, let $\Pi_{R}(x)=\min \{1, R/||x||\}\, x$ be the projection onto the ball of radius $R$ around the origin. The sparse projection operator is
\begin{equation*}
\Pi_{k, R} =
\begin{cases}
\Pi_{R}, & \text{if } k = n^{n} - 1 \text{ for some } n \in \mathbb{Z}_{>0}, \\
I, & \text{otherwise}.
\end{cases}
\end{equation*}
We call it sparse because it projects only at specific indices that are exponentially far apart. Pick an arbitrary $p>1$. Fix constants $R^{\theta}_{\text{proj}}>0$ and $R^{\omega}_{\text{proj}}>0$ for the radii of the projection balls. Further, let
\begin{equation*}
\theta^{*}=X^{-1}_{1}b_{1}, \quad \omega^{*}=W^{-1}_{2}(v_2 - \Gamma_2 \theta^{*})
\end{equation*}
with $b_1=v_1 - W_1 W_2^{-1}v_2$. The sparsely projected update of a linear two-timescale algorithm is then
\begin{equation}
\label{sparseprojectiontheta}
\theta'_{k+1}=\Pi_{k+1,R^{\theta}_{\text{proj}}}(\theta'_{k} + \alpha_{k}[h_1(\theta'_{k},\omega'_{k})+M^{(1')}_{k+1}]),
\end{equation}
\begin{equation}
\label{sparseprojectionomega}
\omega'_{k+1}=\Pi_{k+1,R^{\omega}_{\text{proj}}}(\omega'_{k} + \beta_{k}[h_2(\theta'_{k},\omega'_{k})+M^{(2')}_{k+1}]).
\end{equation}
\end{definition}

\begin{proof}
As long as the VMTD algorithm satisfies Assumption \ref{matrixassumption}, the convergence rate of the VMTD algorithm follows. VMTD's update rules are
\begin{equation*}
\theta_{k+1}=\theta_{k}+\alpha_k(\delta_k-\omega_k)\phi_k,
\end{equation*}
\begin{equation*}
\omega_{k+1}=\omega_{k}+\beta_k(\delta_k-\omega_k).
\end{equation*}
Thus, $h_1(\theta, \omega)=\mathrm{Cov}(r,\phi)-\mathrm{Cov}(\phi,\phi - \gamma\phi')\theta$, $h_2(\theta, \omega)=\mathbb{E}[r]+\mathbb{E}[\gamma \phi'^{\top}-\phi^{\top}]\theta -\omega$, $\Gamma_1 =\mathrm{Cov}(\phi,\phi - \gamma\phi')$, $W_1 = 0$, $\Gamma_2 = -\mathbb{E}[\gamma \phi'^{\top}-\phi^{\top}]$, $W_2 = 1$, and $v_2 = \mathbb{E}[r]$. Additionally, $X_1=\Gamma_1 - W_1 W^{-1}_2 \Gamma_2 = \mathrm{Cov}(\phi,\phi - \gamma\phi')$. It can be deduced from the proof of Theorem \ref{theorem1} that $X_1$ is a positive definite matrix.
Hence, the VMTD algorithm satisfies Assumption \ref{matrixassumption}. By the proof of Theorem \ref{theorem1}, Definition 1 in \cite{dalal2020tale} is also satisfied. We can therefore apply Theorem 3 in \cite{dalal2020tale} to obtain Corollary \ref{corollary4_2}.
\end{proof}

\subsection{Proof of Theorem \ref{theorem2}}
\label{proofth2}
\begin{proof}
The proof is similar to that given by \cite{sutton2009fast} for TDC, but it is based on multi-time-scale stochastic approximation.

For the VMTDC algorithm, a new one-step linear TD solution is defined as:
\begin{equation*}
0=\mathbb{E}[(\phi - \gamma \phi' - \mathbb{E}[\phi - \gamma \phi'])\phi^\top]\mathbb{E}[\phi \phi^{\top}]^{-1}\mathbb{E}[(\delta -\mathbb{E}[\delta])\phi]=A^{\top}C^{-1}(-A\theta+b).
\end{equation*}
The matrix $A^{\top}C^{-1}A$ is positive definite. Thus, the VMTDC solution is $\theta_{\text{VMTDC}}=\theta_{\text{VMTD}}=A^{-1}b$.

First, note that recursions (\ref{thetavmtdc}) and (\ref{uvmtdc}) can be rewritten as, respectively,
\begin{equation*}
\theta_{k+1}\leftarrow \theta_k+\zeta_k x(k),
\end{equation*}
\begin{equation*}
u_{k+1}\leftarrow u_k+\beta_k y(k),
\end{equation*}
where
\begin{equation*}
x(k)=\frac{\alpha_k}{\zeta_k}[(\delta_{k}- \omega_k) \phi_k - \gamma\phi'_{k}(\phi^{\top}_k u_k)],
\end{equation*}
\begin{equation*}
y(k)=\frac{\zeta_k}{\beta_k}[\delta_{k}-\omega_k - \phi^{\top}_k u_k]\phi_k.
\end{equation*}
Recursion (\ref{thetavmtdc}) can also be rewritten as
\begin{equation*}
\theta_{k+1}\leftarrow \theta_k+\beta_k z(k),
\end{equation*}
where
\begin{equation*}
z(k)=\frac{\alpha_k}{\beta_k}[(\delta_{k}- \omega_k) \phi_k - \gamma\phi'_{k}(\phi^{\top}_k u_k)].
\end{equation*}
Because of the step-size schedule $\alpha_k = o(\zeta_k)$, $\zeta_k = o(\beta_k)$, we have $x(k)\rightarrow 0$, $y(k)\rightarrow 0$ and $z(k)\rightarrow 0$ almost surely as $k\rightarrow \infty$. That is, the increments in iteration (\ref{omegavmtdc}) are uniformly larger than those in (\ref{uvmtdc}), and the increments in iteration (\ref{uvmtdc}) are uniformly larger than those in (\ref{thetavmtdc}); thus (\ref{omegavmtdc}) is the fastest recursion, (\ref{uvmtdc}) is the second fastest, and (\ref{thetavmtdc}) is the slowest.

Along the fastest time scale, iterations (\ref{thetavmtdc}), (\ref{uvmtdc}) and (\ref{omegavmtdc}) are associated with the following ODE system:
\begin{equation}
\dot{\theta}(t) = 0,
\label{thetavmtdcFastest}
\end{equation}
\begin{equation}
\dot{u}(t) = 0,
\label{uvmtdcFastest}
\end{equation}
\begin{equation}
\dot{\omega}(t)=\mathbb{E}[\delta_t|u(t),\theta(t)]-\omega(t).
\label{omegavmtdcFastest}
\end{equation}
Based on the ODEs (\ref{thetavmtdcFastest}) and (\ref{uvmtdcFastest}), both $\theta(t)\equiv \theta$ and $u(t)\equiv u$ when viewed from the fastest timescale. By the Hirsch lemma \cite{hirsch1989convergent}, it follows that $||\theta_k-\theta||\rightarrow 0$ a.s. as $k\rightarrow \infty$ for some $\theta$ that depends on the initial condition $\theta_0$ of recursion (\ref{thetavmtdc}), and $||u_k-u||\rightarrow 0$ a.s. as $k\rightarrow \infty$ for some $u$ that depends on the initial condition $u_0$ of recursion (\ref{uvmtdc}). Thus, the ODE system (\ref{thetavmtdcFastest})-(\ref{omegavmtdcFastest}) can be written as
\begin{equation}
\dot{\omega}(t)=\mathbb{E}[\delta_t|u,\theta]-\omega(t).
\label{omegavmtdcFastestFinal}
\end{equation}
Consider the function $h(\omega)=\mathbb{E}[\delta|\theta,u]-\omega$, i.e., the driving vector field of the ODE (\ref{omegavmtdcFastestFinal}).
The function $h$ is Lipschitz continuous in $\omega$. Let $h_{\infty}(\cdot)$ be the function defined by $h_{\infty}(\omega)=\lim_{r\rightarrow \infty}\frac{h(r\omega)}{r}$. Then $h_{\infty}(\omega)= -\omega$ is well-defined. For (\ref{omegavmtdcFastestFinal}), $\omega^*=\mathbb{E}[\delta|\theta,u]$ is the unique globally asymptotically stable equilibrium. For the ODE
\begin{equation}
\dot{\omega}(t) = h_{\infty}(\omega(t))= -\omega(t),
\label{omegavmtdcInfty}
\end{equation}
apply $\vec{V}(\omega)=(-\omega)^{\top}(-\omega)/2$ as its associated strict Lyapunov function. Then, the origin of (\ref{omegavmtdcInfty}) is a globally asymptotically stable equilibrium.

Consider now the recursion (\ref{omegavmtdc}). Let $M_{k+1}=(\delta_k-\omega_k) -\mathbb{E}[(\delta_k-\omega_k)|\mathcal{F}(k)]$, where $\mathcal{F}(k)=\sigma(\omega_l,u_l,\theta_l,l\leq k;\phi_s,\phi_s',r_s,s<k)$ denotes the corresponding sigma field. Then $\{M_{k+1}\}$ is a martingale difference sequence, and there exists a constant $c_1>0$ such that, $\forall k\geq0$,
\begin{equation*}
\mathbb{E}[||M_{k+1}||^2|\mathcal{F}(k)]\leq c_1(1+||\omega_k||^2+||u_k||^2+||\theta_k||^2).
\end{equation*}
Now Assumptions (A1) and (A2) of \cite{borkar2000ode} are verified. Furthermore, Assumption (TS) of \cite{borkar2000ode} is satisfied by our conditions on the step-size sequences $\alpha_k$, $\zeta_k$, $\beta_k$. Thus, by Theorem 2.2 of \cite{borkar2000ode} we obtain that $||\omega_k-\omega^*||\rightarrow 0$ almost surely as $k\rightarrow \infty$.

Consider now the second time scale. Along this timescale, iterations (\ref{thetavmtdc}) and (\ref{uvmtdc}) are associated with the following ODE system:
\begin{equation}
\dot{\theta}(t) = 0,
\label{thetavmtdcFaster}
\end{equation}
\begin{equation}
\dot{u}(t) = \mathbb{E}[(\delta_t-\mathbb{E}[\delta_t|u(t),\theta(t)])\phi_t|\theta(t)] - Cu(t).
\label{uvmtdcFaster}
\end{equation}
The ODE (\ref{thetavmtdcFaster}) suggests that $\theta(t)\equiv \theta$ (i.e., a time-invariant parameter) when viewed from the second fastest timescale. By the Hirsch lemma \cite{hirsch1989convergent}, it follows that $||\theta_k-\theta||\rightarrow 0$ a.s. as $k\rightarrow \infty$ for some $\theta$ that depends on the initial condition $\theta_0$ of recursion (\ref{thetavmtdc}).

Consider now the recursion (\ref{uvmtdc}). Let $N_{k+1}=((\delta_k-\mathbb{E}[\delta_k])\phi_k - \phi_k \phi^{\top}_k u_k) -\mathbb{E}[(\delta_k-\mathbb{E}[\delta_k])\phi_k - \phi_k \phi^{\top}_k u_k|\mathcal{I}(k)]$, where $\mathcal{I}(k)=\sigma(u_l,\theta_l,l\leq k;\phi_s,\phi_s',r_s,s<k)$ denotes the corresponding sigma field. Then $\{N_{k+1}\}$ is a martingale difference sequence, and there exists a constant $c_2>0$ such that, $\forall k\geq0$,
\begin{equation*}
\mathbb{E}[||N_{k+1}||^2|\mathcal{I}(k)]\leq c_2(1+||u_k||^2+||\theta_k||^2).
\end{equation*}
Because $\theta(t)\equiv \theta$ from (\ref{thetavmtdcFaster}), the ODE pair (\ref{thetavmtdcFaster})-(\ref{uvmtdcFaster}) can be written as
\begin{equation}
\dot{u}(t) = \mathbb{E}[(\delta_t-\mathbb{E}[\delta_t|\theta])\phi_t|\theta] - Cu(t).
\label{uvmtdcFasterFinal}
\end{equation}
Now consider the function $h(u)=\mathbb{E}[(\delta_t-\mathbb{E}[\delta_t|\theta])\phi_t|\theta] -Cu$, i.e., the driving vector field of the ODE (\ref{uvmtdcFasterFinal}). For (\ref{uvmtdcFasterFinal}), $u^* = C^{-1}\mathbb{E}[(\delta-\mathbb{E}[\delta|\theta])\phi|\theta]$ is the unique globally asymptotically stable equilibrium. Let $h_{\infty}(u)=-Cu$.
For the ODE
\begin{equation}
\dot{u}(t) = h_{\infty}(u(t))= -Cu(t),
\label{uvmtdcInfty}
\end{equation}
the origin of (\ref{uvmtdcInfty}) is a globally asymptotically stable equilibrium because $C$ is a positive definite matrix (it is nonnegative definite and nonsingular). Now Assumptions (A1) and (A2) of \cite{borkar2000ode} are verified. Furthermore, Assumption (TS) of \cite{borkar2000ode} is satisfied by our conditions on the step-size sequences $\alpha_k$, $\zeta_k$, $\beta_k$. Thus, by Theorem 2.2 of \cite{borkar2000ode} we obtain that $||u_k-u^*||\rightarrow 0$ almost surely as $k\rightarrow \infty$.

Consider now the slower timescale recursion (\ref{thetavmtdc}). In light of the above, (\ref{thetavmtdc}) can be rewritten as
\begin{equation}
\theta_{k+1} \leftarrow \theta_{k} + \alpha_k (\delta_k -\mathbb{E}[\delta_k|\theta_k]) \phi_k\\
- \alpha_k \gamma\phi'_{k}(\phi^{\top}_k C^{-1}\mathbb{E}[(\delta_k -\mathbb{E}[\delta_k|\theta_k])\phi|\theta_k]).
\end{equation}
Let $\mathcal{G}(k)=\sigma(\theta_l,l\leq k;\phi_s,\phi_s',r_s,s