\section{Background} \label{preliminaries}
\subsection{Markov Decision Process}
A reinforcement learning agent interacts with an environment: it observes the state, makes sequential decisions that influence the environment, and obtains rewards. Consider an infinite-horizon discounted Markov Decision Process (MDP), defined by a tuple $\langle S,A,R,P,\gamma \rangle$, where $S=\{1,2,\ldots,N\}$ is a finite set of states of the environment; $A$ is a finite set of actions of the agent; $R:S\times A \times S \rightarrow \mathbb{R}$ is a bounded deterministic reward function; $P:S\times A\times S \rightarrow [0,1]$ is the transition probability distribution; and $\gamma\in (0,1)$ is the discount factor \cite{Sutton2018book}. Because of the requirements of online learning, this paper considers sampling-based value iteration. Each sample yields an experience (or transition) $\langle s, a, s', r\rangle$. A policy is a mapping $\pi:S\times A \rightarrow [0,1]$. The goal of the agent is to find an optimal policy $\pi^*$ that maximizes the expected discounted cumulative reward. The state value function $V^{\pi}(s)$ of a stationary policy $\pi$ is defined as:
\begin{equation*}
V^{\pi}(s)=\mathbb{E}_{\pi}[\sum_{k=0}^{\infty} \gamma^k R_{k}|s_0=s].
\label{valuefunction}
\end{equation*}
The linear value function for state $s\in S$ is defined as:
\begin{equation}
V_{{\theta}}(s):= {\bm{\theta}}^{\top}{\bm{\phi}}(s) = \sum_{i=1}^{m} \theta_i \phi_i(s),
\label{linearvaluefunction}
\end{equation}
where ${\bm{\theta}}:=(\theta_1,\theta_2,\ldots,\theta_m)^{\top}\in \mathbb{R}^m$ is a parameter vector, ${\bm{\phi}}(s):=(\phi_1(s),\phi_2(s),\ldots,\phi_m(s))^{\top}\in \mathbb{R}^m$ is the feature vector of state $s$, and $m$ is the number of features.

Tabular temporal difference (TD) learning \cite{Sutton2018book} has been successfully applied to small-scale problems. To deal with the well-known curse of dimensionality of large-scale MDPs, the value function is usually approximated by a linear model (the focus of this paper), kernel methods, decision trees, or neural networks.
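For concreteness, the linear approximation in Eq.~(\ref{linearvaluefunction}) can be sketched in a few lines of Python; the feature matrix and parameter vector below are hypothetical placeholders, not quantities used elsewhere in this paper.
\begin{verbatim}
import numpy as np

# Hypothetical feature matrix Phi (N = 4 states, m = 2 features):
# row s holds the feature vector phi(s).
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0],
                [2.0, 1.0]])
theta = np.array([0.5, -0.3])      # parameter vector in R^m

def v_linear(s, theta, Phi):
    """Linear value estimate V_theta(s) = theta^T phi(s)."""
    return theta @ Phi[s]

values = Phi @ theta               # V_theta for all states at once
\end{verbatim}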
\subsection{On-policy and Off-policy}
On-policy and off-policy algorithms are both active research topics. Off-policy algorithms are the more challenging of the two because their convergence is harder to guarantee. The main difference between them is that in on-policy algorithms the behavior policy $\mu$ and the target policy $\pi$ coincide during learning: the algorithm generates data directly from the current policy and optimizes that same policy. In off-policy algorithms the behavior policy and the target policy differ: the algorithm uses data generated by the behavior policy to optimize the target policy, which yields higher sample efficiency but raises more complex stability issues.

The TD(0) algorithm illustrates how differently the two settings behave. In on-policy TD(0), the behavior policy and the target policy are the same, and the algorithm updates its value estimates with data generated by the current policy. Because the two policies are consistent, convergence is easier to establish: each update is based on the actual behavior of the current policy, and the value function estimate gradually converges to the true value of the target policy. The on-policy TD(0) update is
\begin{equation*}
\label{thetatd_onpolicy}
\begin{array}{ccl}
\bm{\theta}_{k+1}&\leftarrow&\bm{\theta}_k+\alpha_k \delta_k\bm{\phi}_k,
\end{array}
\end{equation*}
where $\delta_k = r_{k+1}+\gamma \bm{\theta}_k^{\top}\bm{\phi}_{k+1}-\bm{\theta}_k^{\top}\bm{\phi}_k$, and the key matrix $\textbf{A}_{\text{on}}$ of on-policy TD(0) is
\begin{equation*}
\textbf{A}_{\text{on}} = \bm{\Phi}^{\top}\textbf{D}_{\pi}(\textbf{I}-\gamma \textbf{P}_{\pi})\bm{\Phi},
\end{equation*}
where $\bm{\Phi}$ is the $N \times m$ matrix with the feature vectors $\bm{\phi}(s)^{\top}$ as its rows, $\textbf{D}_{\pi}$ is the $N \times N$ diagonal matrix with $\textbf{d}_{\pi}$ on its diagonal, $\textbf{d}_{\pi}$ is the steady-state distribution under $\pi$, and $\textbf{P}_{\pi}$ denotes the $N \times N$ matrix of transition probabilities under $\pi$, so that $\textbf{P}_{\pi}^{\top}\textbf{d}_{\pi}=\textbf{d}_{\pi}$.

A matrix of the form $\bm{\Phi}^{\top}\bm{\text{X}}\bm{\Phi}$ is positive definite whenever the matrix $\bm{\text{X}}$ is positive definite (assuming the columns of $\bm{\Phi}$ are linearly independent). Any matrix $\bm{\text{X}}$ is positive definite if and only if the symmetric matrix $\bm{\text{S}}=\bm{\text{X}}+\bm{\text{X}}^{\top}$ is positive definite, and any symmetric real matrix $\bm{\text{S}}$ is positive definite if all of its diagonal entries are positive and greater than the sum of the absolute values of the corresponding off-diagonal entries \cite{sutton2016emphatic}. The diagonal entries of $\textbf{D}_{\pi}(\textbf{I}-\gamma \textbf{P}_{\pi})$ are positive and its off-diagonal entries are negative, so it suffices to show that its row sums and column sums are all positive. The row sums are positive because $\textbf{P}_{\pi}$ is a stochastic matrix and $\gamma<1$. The column sums are
\begin{equation*}
\begin{array}{ccl}
\textbf{1}^{\top}\textbf{D}_{\pi}(\textbf{I}-\gamma \textbf{P}_{\pi})&=&\textbf{d}_{\pi}^{\top}(\textbf{I}-\gamma \textbf{P}_{\pi})\\
&=& \textbf{d}_{\pi}^{\top} - \gamma \textbf{d}_{\pi}^{\top}\textbf{P}_{\pi}\\
&=& \textbf{d}_{\pi}^{\top} - \gamma \textbf{d}_{\pi}^{\top}\\
&=& (1-\gamma)\textbf{d}_{\pi}^{\top},
\end{array}
\end{equation*}
all components of which are positive. Thus the inner matrix $\textbf{D}_{\pi}(\textbf{I}-\gamma \textbf{P}_{\pi})$ and the key matrix $\textbf{A}_{\text{on}}$ are positive definite, and on-policy TD(0) is stable.

The off-policy TD(0) update is
\begin{equation*}
\label{thetatd_offpolicy}
\begin{array}{ccl}
\bm{\theta}_{k+1}&\leftarrow&\bm{\theta}_k+\alpha_k \rho_k \delta_k\bm{\phi}_k,
\end{array}
\end{equation*}
where $\rho_k =\frac{\pi(A_k | S_k)}{\mu(A_k | S_k)}$ is the importance sampling ratio, and the key matrix $\textbf{A}_{\text{off}}$ of off-policy TD(0) is
\begin{equation*}
\textbf{A}_{\text{off}} = \bm{\Phi}^{\top}\textbf{D}_{\mu}(\textbf{I}-\gamma \textbf{P}_{\pi})\bm{\Phi},
\end{equation*}
where $\textbf{D}_{\mu}$ is the $N \times N$ diagonal matrix with $\textbf{d}_{\mu}$ on its diagonal and $\textbf{d}_{\mu}$ is the steady-state distribution under $\mu$.
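The two update rules differ only in the factor $\rho_k$. The following is a minimal sketch of a single semi-gradient TD(0) step under linear function approximation; the features, reward, step size, and the value of $\rho$ in the usage example are hypothetical inputs, not quantities taken from this paper.
\begin{verbatim}
import numpy as np

def td0_step(theta, phi_s, phi_s2, r, alpha, gamma, rho=1.0):
    """One semi-gradient TD(0) step with linear features.

    rho = 1 recovers on-policy TD(0); rho = pi(a|s)/mu(a|s) gives
    the off-policy TD(0) update with importance sampling.
    """
    delta = r + gamma * theta @ phi_s2 - theta @ phi_s  # TD error
    return theta + alpha * rho * delta * phi_s

# Hypothetical usage on one transition <s, a, s', r>:
theta = np.zeros(2)
phi_s, phi_s2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
rho = 0.8 / 0.5                      # pi(a|s) / mu(a|s), hypothetical
theta = td0_step(theta, phi_s, phi_s2, r=1.0, alpha=0.1,
                 gamma=0.9, rho=rho)
\end{verbatim}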
If the key matrix $\textbf{A}$ of an algorithm is positive definite, then the algorithm is stable and converges. In the off-policy TD(0) algorithm, however, $\textbf{A}$ is not guaranteed to be positive definite. In the 2-state counterexample, $\textbf{A}_{\text{off}}=-0.2$, which means that off-policy TD(0) cannot converge stably.

TDC and ETD are two well-known off-policy algorithms. The former is derived from the Mean Squared Projected Bellman Error (MSPBE) objective, while the latter employs a technique that transforms the key matrix $\textbf{A}$ of off-policy TD(0) from non-positive definite to positive definite, thereby ensuring convergence under off-policy conditions. The MSPBE with importance sampling is
\begin{equation*}
\begin{array}{ccl}
\text{MSPBE}(\bm{\theta})&=&||\textbf{V}_{\bm{\theta}} - \bm{\Pi} \textbf{T}^{\pi}\textbf{V}_{\bm{\theta}}||^{2}_{\mu}\\
&=&||\bm{\Pi}(\textbf{V}_{\bm{\theta}} - \textbf{T}^{\pi}\textbf{V}_{\bm{\theta}})||^{2}_{\mu}\\
&=&\mathbb{E}[\rho \delta \bm{\phi}]^{\top} \mathbb{E}[\bm{\phi} \bm{\phi}^{\top}]^{-1}\mathbb{E}[\rho \delta \bm{\phi}],
\end{array}
\end{equation*}
where $\textbf{V}_{\bm{\theta}}$ is viewed as a vector with one element per state, the norm is $||\bm{v}||^{2}_{\mu}=\sum_{s}d_{\mu}(s)\bm{v}^{2}(s)$, $\textbf{T}^{\pi}$ (simplified to $\textbf{T}$ in the following text) is the Bellman operator, and $\bm{\Pi}=\bm{\Phi}(\bm{\Phi}^{\top}\textbf{D}_{\mu}\bm{\Phi})^{-1}\bm{\Phi}^{\top}\textbf{D}_{\mu}$ is the projection matrix. The TDC update with importance sampling is
\begin{equation*}
\bm{\theta}_{k+1}\leftarrow\bm{\theta}_{k}+\alpha_{k} \rho_{k}[\delta_{k} \bm{\phi}_k- \gamma\bm{\phi}_{k+1}(\bm{\phi}^{\top}_k \bm{u}_{k})],
\label{thetatdc}
\end{equation*}
\begin{equation*}
\bm{u}_{k+1}\leftarrow \bm{u}_{k}+\zeta_{k}[\rho_k \delta_{k} - \bm{\phi}^{\top}_k \bm{u}_{k}]\bm{\phi}_k.
\label{utdc}
\end{equation*}
The key matrix is $\textbf{A}_{\text{TDC}}= \textbf{A}^{\top}_{\text{off}}\textbf{C}^{-1}\textbf{A}_{\text{off}}$, where $\textbf{C}=\mathbb{E}[\bm{\phi}\bm{\phi}^{\top}]$. In the 2-state counterexample, $\textbf{A}_{\text{TDC}}=0.016$, which means that TDC converges stably. The ETD update is
\begin{equation}
\label{fvmetd}
F_k \leftarrow \gamma \rho_{k-1}F_{k-1}+1,
\end{equation}
\begin{equation*}
\label{thetaetd}
\bm{\theta}_{k+1}\leftarrow \bm{\theta}_k+\alpha_k F_k \rho_k\delta_k\bm{\phi}_k,
\end{equation*}
where $F_k$ is a scalar follow-on trace with $F_0=1$.
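For illustration, the TDC and ETD updates above can be written as short functions. The sketch below is a minimal implementation under linear function approximation; the features, rewards, importance-sampling ratios, and step sizes $\alpha_k$ and $\zeta_k$ are assumed to be supplied by the caller.
\begin{verbatim}
import numpy as np

def tdc_step(theta, u, phi_s, phi_s2, r, rho, alpha, zeta, gamma):
    """One TDC update with importance sampling (theta, u in R^m)."""
    delta = r + gamma * theta @ phi_s2 - theta @ phi_s
    theta = theta + alpha * rho * (delta * phi_s
                                   - gamma * phi_s2 * (phi_s @ u))
    u = u + zeta * (rho * delta - phi_s @ u) * phi_s
    return theta, u

def etd_step(theta, F, phi_s, phi_s2, r, rho_prev, rho, alpha, gamma):
    """One ETD update; F is the scalar follow-on trace (F_0 = 1)."""
    F = gamma * rho_prev * F + 1.0
    delta = r + gamma * theta @ phi_s2 - theta @ phi_s
    theta = theta + alpha * F * rho * delta * phi_s
    return theta, F
\end{verbatim}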
The key matrix is $\textbf{A}_{\text{ETD}}= \bm{\Phi}^{\top}\textbf{F}(\textbf{I}-\gamma \textbf{P}_{\pi})\bm{\Phi}$, where $\textbf{F}$ is a diagonal matrix with diagonal elements $f(s)\dot{=}d_{\mu}(s)\lim_{k\rightarrow \infty}\mathbb{E}_{\mu}[F_k|S_k=s]$, which we assume exists. The vector $\textbf{f}\in \mathbb{R}^N$ with components $[\textbf{f}]_s\dot{=}f(s)$ can be written as
\begin{equation*}
\begin{split}
\textbf{f}&=\textbf{d}_{\mu}+\gamma \textbf{P}_{\pi}^{\top}\textbf{d}_{\mu}+(\gamma \textbf{P}_{\pi}^{\top})^2\textbf{d}_{\mu}+\ldots\\
&=(\textbf{I}-\gamma\textbf{P}_{\pi}^{\top})^{-1}\textbf{d}_{\mu}.
\end{split}
\end{equation*}
The column sums of $\textbf{F}(\textbf{I}-\gamma \textbf{P}_{\pi})$ are
\begin{equation*}
\begin{array}{ccl}
\textbf{1}^{\top}\textbf{F}(\textbf{I}-\gamma \textbf{P}_{\pi})&=&\textbf{f}^{\top}(\textbf{I}-\gamma \textbf{P}_{\pi})\\
&=& \textbf{d}_{\mu}^{\top}(\textbf{I}-\gamma \textbf{P}_{\pi})^{-1}(\textbf{I}-\gamma \textbf{P}_{\pi})\\
&=& \textbf{d}_{\mu}^{\top},
\end{array}
\end{equation*}
all components of which are positive, and the row sums, $\textbf{F}(\textbf{I}-\gamma \textbf{P}_{\pi})\textbf{1}=(1-\gamma)\textbf{f}$, are also positive. In the 2-state counterexample, $\textbf{A}_{\text{ETD}}=3.4$, which means that ETD converges stably.

The convergence rate of an algorithm is governed by its key matrix $\textbf{A}$: the larger the minimum eigenvalue of $\textbf{A}$, the faster the convergence. In the 2-state counterexample, the key matrix of ETD has the largest minimum eigenvalue, so ETD converges fastest. Based on this observation, can we derive an algorithm whose key matrix $\textbf{A}$ has an even larger minimum eigenvalue?
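The key-matrix values quoted above for the 2-state counterexample can be checked numerically. The sketch below assumes the standard setting of that counterexample (features $1$ and $2$, $\gamma=0.9$, a uniform steady-state distribution under the behavior policy, and a target policy that always moves to the second state); under these assumptions it reproduces $\textbf{A}_{\text{off}}=-0.2$, $\textbf{A}_{\text{TDC}}=0.016$, and $\textbf{A}_{\text{ETD}}=3.4$.
\begin{verbatim}
import numpy as np

gamma = 0.9
Phi   = np.array([[1.0], [2.0]])         # features of states 1 and 2
d_mu  = np.array([0.5, 0.5])             # steady-state dist. under mu (assumed)
P_pi  = np.array([[0.0, 1.0],
                  [0.0, 1.0]])           # target policy always reaches state 2
D_mu  = np.diag(d_mu)
I     = np.eye(2)

A_off = Phi.T @ D_mu @ (I - gamma * P_pi) @ Phi        # approx -0.2
C     = Phi.T @ D_mu @ Phi                             # E[phi phi^T] = 2.5
A_tdc = A_off.T @ np.linalg.inv(C) @ A_off             # approx 0.016
f     = np.linalg.inv(I - gamma * P_pi.T) @ d_mu       # follow-on weights
A_etd = Phi.T @ np.diag(f) @ (I - gamma * P_pi) @ Phi  # approx 3.4
\end{verbatim}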