\section{Background} \label{preliminaries}
A reinforcement learning agent interacts with an environment: it observes states, makes sequential decisions that influence the environment, and receives rewards. Consider an infinite-horizon discounted Markov Decision Process (MDP) defined by a tuple $\langle S,A,R,P,\gamma \rangle$, where $S=\{1,2,\ldots,N\}$ is a finite set of environment states; $A$ is a finite set of agent actions; $R:S\times A \times S \rightarrow \mathbb{R}$ is a bounded deterministic reward function; $P:S\times A\times S \rightarrow [0,1]$ is the transition probability distribution; and $\gamma\in (0,1)$ is the discount factor \cite{Sutton2018book}. Because of the requirements of online learning, this paper considers value iteration based on sampling, where each sample yields an experience (or transition) $\langle s, a, s', r\rangle$. A policy is a mapping $\pi:S\times A \rightarrow [0,1]$. The goal of the agent is to find an optimal policy $\pi^*$ that maximizes the expected discounted cumulative reward. The state value function $V^{\pi}(s)$ of a stationary policy $\pi$ is defined as:
\begin{equation*}
V^{\pi}(s)=\mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^k R_{k} \,\Big|\, s_0=s\Big].
\label{valuefunction}
\end{equation*}
The linear value function for a state $s\in S$ is defined as:
\begin{equation}
V_{{\theta}}(s):= {\theta}^{\top}{\phi}(s) = \sum_{i=1}^{m} \theta_i \phi_i(s),
\label{linearvaluefunction}
\end{equation}
where ${\theta}:=(\theta_1,\theta_2,\ldots,\theta_m)^{\top}\in \mathbb{R}^m$ is a parameter vector, ${\phi}:=(\phi_1,\phi_2,\ldots,\phi_m)^{\top}\in \mathbb{R}^m$ is a feature function defined on the state space $S$, and $m$ is the number of features.

Tabular temporal difference (TD) learning \cite{Sutton2018book} has been successfully applied to small-scale problems. To cope with the well-known curse of dimensionality in large-scale MDPs, the value function is usually approximated by a linear model, kernel methods, decision trees, or neural networks. This paper focuses on the linear model, whose features are usually hand-coded by domain experts. TD learning can also be used to find optimal policies; the problem of finding an optimal policy is often called the control problem. Two popular TD control methods are Sarsa and Q-learning: the former is an on-policy TD control method, while the latter is off-policy. It is well known that the TDC algorithm \cite{sutton2009fast} guarantees convergence under off-policy conditions, whereas the off-policy TD algorithm may diverge. The objective function of TDC is the mean squared projected Bellman error (MSPBE); TDC is essentially an adjustment, or correction, of the TD update so that it follows the gradient of the MSPBE objective. The control algorithm built on TDC is known as Greedy-GQ($\lambda$) \cite{sutton2009fast}; when $\lambda$ is set to 0, it is denoted GQ(0).
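
To make the linear value function approximation and the TD/TDC updates above concrete, the following Python sketch shows one semi-gradient TD(0) step and one TDC step for the linear model in Eq.~(\ref{linearvaluefunction}). This is a minimal illustration only: the feature map, the step sizes, and the function names are assumptions for the sketch, not the exact implementation used in this paper.
\begin{verbatim}
# Minimal sketch of linear TD(0) and TDC updates.
# The feature vectors phi_s, phi_s_next and the step sizes
# alpha, beta are illustrative assumptions.
import numpy as np

def td0_update(theta, phi_s, phi_s_next, r, alpha=0.01, gamma=0.99):
    # Semi-gradient TD(0) step for V_theta(s) = theta^T phi(s).
    delta = r + gamma * theta @ phi_s_next - theta @ phi_s  # TD error
    return theta + alpha * delta * phi_s

def tdc_update(theta, w, phi_s, phi_s_next, r,
               alpha=0.01, beta=0.05, gamma=0.99):
    # TDC step (stochastic gradient of the MSPBE) with a
    # secondary weight vector w.
    delta = r + gamma * theta @ phi_s_next - theta @ phi_s  # TD error
    theta = theta + alpha * (delta * phi_s
                             - gamma * phi_s_next * (w @ phi_s))
    w = w + beta * (delta - w @ phi_s) * phi_s
    return theta, w
\end{verbatim}
The secondary weight vector $w$ tracks an estimate of the expected TD error given the features; the resulting term $-\gamma \phi(s') (w^{\top}\phi(s))$ is the correction that distinguishes TDC from the plain off-policy TD update.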