\section{Background} \label{preliminaries}
\subsection{Markov Decision Process}
A reinforcement learning agent interacts with an environment: it observes states, makes sequential decisions that influence the environment, and receives rewards. Consider an infinite-horizon discounted Markov Decision Process (MDP), defined by a tuple $\langle S,A,R,P,\gamma \rangle$, where $S=\{1,2,\ldots,N\}$ is a finite set of states of the environment; $A$ is a finite set of actions of the agent; $R:S\times A \times S \rightarrow \mathbb{R}$ is a bounded deterministic reward function; $P:S\times A\times S \rightarrow [0,1]$ is the transition probability distribution; and $\gamma\in (0,1)$ is the discount factor \cite{Sutton2018book}. Because of the requirements of online learning, this paper considers sampling-based value iteration: each sample yields an experience (or transition) $\langle s, a, s', r\rangle$. A policy is a mapping $\pi:S\times A \rightarrow [0,1]$. The goal of the agent is to find an optimal policy $\pi^*$ that maximizes the expected discounted cumulative reward. For each discrete time step $t=0,1,2,3,\ldots$, the state value function $V^{\pi}(s)$ of a stationary policy $\pi$ is defined as
\begin{equation*}
V^{\pi}(s)=\mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\,\Big|\,S_t=s\Big].
\label{valuefunction}
\end{equation*}
The linear value function for a state $s\in S$ is defined as
\begin{equation}
V_{{\theta}}(s):= {{\theta}}^{\top}{{\phi}}(s) = \sum_{i=1}^{m} \theta_i \phi_i(s),
\label{linearvaluefunction}
\end{equation}
where ${{\theta}}:=(\theta_1,\theta_2,\ldots,\theta_m)^{\top}\in \mathbb{R}^m$ is a parameter vector, ${{\phi}}:=(\phi_1,\phi_2,\ldots,\phi_m)^{\top}\in \mathbb{R}^m$ is a feature function defined on the state space $S$, and $m$ is the number of features.
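For example, in the 2-state problem used later in this section, each state is described by a single feature ($m=1$), with $\phi(1)=1$ and $\phi(2)=2$, so the approximate values are simply
\begin{equation*}
V_{\theta}(1)=\theta, \qquad V_{\theta}(2)=2\theta,
\end{equation*}
for a scalar parameter $\theta\in\mathbb{R}$.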
Tabular temporal difference (TD) learning \cite{Sutton2018book} has been successfully applied to small-scale problems. To deal with the well-known curse of dimensionality in large-scale MDPs, the value function is usually approximated by a linear model (the focus of this paper), kernel methods, decision trees, or neural networks.

\subsection{On-policy and Off-policy}
\begin{table*}[t]
\caption{Minimum eigenvalues of various algorithms in the 2-state counterexample.}
\label{tab:min_eigenvalues}
\vskip 0.15in
\begin{center}
\begin{small}
\begin{sc}
\begin{tabular}{lcccccc}
\toprule
Algorithm & TD & TDC & ETD & VMTD & VMTDC & VMETD \\
\midrule
on-policy 2-state & $0.475$ & $0.09025$ & \text{\textbackslash} & $0.25$ & $0.025$ & \text{\textbackslash} \\
off-policy 2-state & $-0.2$ & $0.016$ & $3.4$ & $0.25$ & $0.025$ & $1.15$ \\
\bottomrule
\end{tabular}
\end{sc}
\end{small}
\end{center}
\vskip -0.1in
\end{table*}
On-policy and off-policy algorithms are currently active research topics. The main difference between the two is that in on-policy algorithms the behavior policy $\mu$ and the target policy $\pi$ are the same during learning, whereas in off-policy algorithms they differ: data generated by the behavior policy are used to optimize the target policy, which can improve sample efficiency but raises more complex stability issues.

From the theory of stochastic approximation, the convergence point of a linear TD algorithm is a parameter vector $\theta$ that satisfies
\begin{equation*}
b - \textbf{A}{\theta}=0,
\end{equation*}
where $\textbf{A}\in \mathbb{R}^{m \times m}$ and $b\in \mathbb{R}^{m}$. If the matrix $\textbf{A}$ is positive definite, then the algorithm converges, and the convergence rate is determined by $\textbf{A}$: the larger the minimum eigenvalue of $\textbf{A}$, the faster the convergence. Next, we compute the minimum eigenvalue of $\textbf{A}$ for TD(0), TDC, and ETD in the on-policy and off-policy settings of a 2-state environment. First, we introduce the 2-state environment and its on-policy and off-policy settings.
\begin{figure}[h]
\begin{center}
\includegraphics[width=0.3\columnwidth, height=0.15\columnwidth]{main/pic/2StateExample.pdf}
\caption{The 2-state counterexample.}
\end{center}
\end{figure}
The ``1''$\rightarrow$``2'' problem has only two states. From each state, there are two actions, left and right, which take the agent to the left or the right state. All rewards are zero. The features $\bm{\Phi}=(1,2)^{\top}$ are assigned to the left and the right state, respectively. The first policy selects left and right with equal probability in both states, i.e.,
$
\textbf{P}_{1}=
\begin{bmatrix}
0.5 & 0.5 \\
0.5 & 0.5
\end{bmatrix}
$. The second policy selects only the action right in both states, i.e.,
$
\textbf{P}_{2}=
\begin{bmatrix}
0 & 1 \\
0 & 1
\end{bmatrix}
$. The steady-state distribution of the first policy is $d_1 =(0.5,0.5)^{\top}$, and that of the second policy is $d_2 =(0,1)^{\top}$. The discount factor is $\gamma=0.9$. In the on-policy setting, the behavior policy and the target policy are the same, so let $\textbf{P}_{\mu}=\textbf{P}_{\pi}=\textbf{P}_{1}$. In the off-policy setting, let $\textbf{P}_{\mu}=\textbf{P}_{1}$ and $\textbf{P}_{\pi}=\textbf{P}_{2}$.

The key matrix $\textbf{A}_{\text{on}}$ of on-policy TD(0) is
\begin{equation*}
\textbf{A}_{\text{on}} = \bm{\Phi}^{\top}\textbf{D}_{\pi}(\textbf{I}-\gamma \textbf{P}_{\pi})\bm{\Phi},
\end{equation*}
where $\bm{\Phi}$ is the $|S| \times m$ matrix with the feature vectors $\phi(s)$ as its rows, $\textbf{D}_{\pi}$ is the $|S| \times |S|$ diagonal matrix with $d_{\pi}$ on its diagonal, $d_{\pi}$ is the steady-state distribution under policy $\pi$, and $\textbf{P}_{\pi}$ denotes the $|S| \times |S|$ matrix of transition probabilities under $\pi$, which satisfies $\textbf{P}_{\pi}^{\top}d_{\pi}=d_{\pi}$.
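In the on-policy 2-state problem, $\textbf{A}_{\text{on}}$ can be evaluated directly by substituting the quantities defined above, $\textbf{D}_{\pi}=\mathrm{diag}(0.5,0.5)$, $\textbf{P}_{\pi}=\textbf{P}_{1}$, $\bm{\Phi}=(1,2)^{\top}$, and $\gamma=0.9$:
\begin{equation*}
\textbf{A}_{\text{on}}
=\begin{bmatrix} 1 & 2 \end{bmatrix}
\begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}
\begin{bmatrix} 0.55 & -0.45 \\ -0.45 & 0.55 \end{bmatrix}
\begin{bmatrix} 1 \\ 2 \end{bmatrix}
=0.475,
\end{equation*}
a positive scalar.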
The key matrix $\textbf{A}_{\text{off}}$ of off-policy TD(0) is
\begin{equation*}
\textbf{A}_{\text{off}} = \bm{\Phi}^{\top}\textbf{D}_{\mu}(\textbf{I}-\gamma \textbf{P}_{\pi})\bm{\Phi},
\end{equation*}
where $\textbf{D}_{\mu}$ is the $|S| \times |S|$ diagonal matrix with $d_{\mu}$ on its diagonal and $d_{\mu}$ is the steady-state distribution under the behavior policy $\mu$. Hence, in the on-policy 2-state problem the key matrix is positive ($\textbf{A}_{\text{on}}=0.475$) and on-policy TD(0) converges stably, whereas in the off-policy 2-state problem $\textbf{A}_{\text{off}}=-0.2$ and off-policy TD(0) cannot converge stably.
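Indeed, substituting $\textbf{D}_{\mu}=\mathrm{diag}(0.5,0.5)$ and $\textbf{P}_{\pi}=\textbf{P}_{2}$ gives
\begin{equation*}
\textbf{A}_{\text{off}}
=\begin{bmatrix} 1 & 2 \end{bmatrix}
\begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}
\begin{bmatrix} 1 & -0.9 \\ 0 & 0.1 \end{bmatrix}
\begin{bmatrix} 1 \\ 2 \end{bmatrix}
=-0.2,
\end{equation*}
so the key matrix is not positive definite in this setting.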
TDC and ETD are two well-known off-policy algorithms: the former is derived from the Mean Squared Projected Bellman Error (MSPBE) objective, while the latter employs a technique that transforms the key matrix of off-policy TD(0) from non-positive definite to positive definite, thereby ensuring convergence under off-policy conditions. The key matrix of on-policy TDC is
\begin{equation*}
\textbf{A}_{\text{TDC}} = \textbf{A}^{\top}_{\text{on}}\textbf{C}^{-1}\textbf{A}_{\text{on}},
\end{equation*}
and the key matrix of off-policy TDC is
\begin{equation*}
\textbf{A}_{\text{TDC}} = \textbf{A}^{\top}_{\text{off}}\textbf{C}^{-1}\textbf{A}_{\text{off}},
\end{equation*}
where $\textbf{C}=\mathbb{E}[\bm{\phi}\bm{\phi}^{\top}]$. In the 2-state counterexample, $\textbf{A}_{\text{TDC}}=0.09025$ in the on-policy setting and $\textbf{A}_{\text{TDC}}=0.016$ in the off-policy setting, so TDC converges stably in both settings.

To address the issue that the key matrix $\textbf{A}_{\text{off}}$ of off-policy TD(0) is not positive definite, ETD introduces a scalar variable $F_t$, updated as $F_t=\gamma \rho_{t-1}F_{t-1}+1$ with $F_0=1$, where $\rho_t=\frac{\pi(A_t | S_t)}{\mu(A_t | S_t)}$ is the importance sampling ratio; this ensures convergence under off-policy conditions. The key matrix $\textbf{A}_{\text{ETD}}$ is
\begin{equation*}
\textbf{A}_{\text{ETD}} = \bm{\Phi}^{\top}\textbf{F}(\textbf{I}-\gamma \textbf{P}_{\pi})\bm{\Phi},
\end{equation*}
where $\textbf{F}$ is a diagonal matrix with diagonal elements $f(s)\dot{=}d_{\mu}(s)\lim_{t\rightarrow \infty}\mathbb{E}_{\mu}[F_t|S_t=s]$, which we assume exists. The vector $\textbf{f}\in \mathbb{R}^N$ with components $[\textbf{f}]_s\dot{=}f(s)$ can be written as
\begin{equation*}
\begin{split}
\textbf{f}&=\textbf{d}_{\mu}+\gamma \textbf{P}_{\pi}^{\top}\textbf{d}_{\mu}+(\gamma \textbf{P}_{\pi}^{\top})^2\textbf{d}_{\mu}+\ldots\\
&=(\textbf{I}-\gamma\textbf{P}_{\pi}^{\top})^{-1}\textbf{d}_{\mu}.
\end{split}
\end{equation*}
In the off-policy 2-state problem, $\textbf{A}_{\text{ETD}}=3.4$, which means that ETD converges stably.
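These values can be verified directly for the 2-state problem. Since the behavior policy's state distribution is $d_{1}=(0.5,0.5)^{\top}$ in both settings, $\textbf{C}=\mathbb{E}[\bm{\phi}\bm{\phi}^{\top}]=0.5\cdot 1^{2}+0.5\cdot 2^{2}=2.5$, so $\textbf{A}_{\text{TDC}}=0.475^{2}/2.5=0.09025$ on-policy and $\textbf{A}_{\text{TDC}}=(-0.2)^{2}/2.5=0.016$ off-policy. For ETD, with $\textbf{P}_{\pi}=\textbf{P}_{2}$,
\begin{equation*}
\textbf{f}=(\textbf{I}-\gamma\textbf{P}_{2}^{\top})^{-1}\textbf{d}_{\mu}
=\begin{bmatrix} 1 & 0 \\ -0.9 & 0.1 \end{bmatrix}^{-1}
\begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}
=\begin{bmatrix} 0.5 \\ 9.5 \end{bmatrix},
\end{equation*}
and therefore
\begin{equation*}
\textbf{A}_{\text{ETD}}
=\begin{bmatrix} 1 & 2 \end{bmatrix}
\begin{bmatrix} 0.5 & 0 \\ 0 & 9.5 \end{bmatrix}
\begin{bmatrix} 1 & -0.9 \\ 0 & 0.1 \end{bmatrix}
\begin{bmatrix} 1 \\ 2 \end{bmatrix}
=3.4,
\end{equation*}
which matches the ETD entry in Table \ref{tab:min_eigenvalues}.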
Table \ref{tab:min_eigenvalues} lists the minimum eigenvalues of the key matrices of the various algorithms in the 2-state counterexample. In the on-policy 2-state environment, the minimum eigenvalue of the key matrix of TD(0) is larger than that of TDC, indicating that TD(0) converges faster than TDC in this environment. In the off-policy 2-state environment, the minimum eigenvalue of the key matrix of ETD is the largest, suggesting that ETD has the fastest convergence rate. In short, the larger the minimum eigenvalue, the faster the algorithm converges. To derive an algorithm whose matrix $\textbf{A}$ has a larger minimum eigenvalue, new objective functions must be proposed. The objective functions mentioned in the Introduction are all forms of error. Is minimizing an error the only option for value-based reinforcement learning? Motivated by this question, we propose alternative objective functions instead of minimizing errors.