\section{Background} \label{preliminaries}
A reinforcement learning agent interacts with an environment: it observes states, makes sequential decisions that influence the environment, and receives rewards. Consider an infinite-horizon discounted Markov Decision Process (MDP) defined by a tuple $\langle S,A,R,P,\gamma \rangle$, where $S=\{1,2,\ldots,N\}$ is a finite set of environment states; $A$ is a finite set of agent actions; $R:S\times A \times S \rightarrow \mathbb{R}$ is a bounded deterministic reward function; $P:S\times A\times S \rightarrow [0,1]$ is the transition probability distribution; and $\gamma\in (0,1)$ is the discount factor \cite{Sutton2018book}. Because of the requirements of online learning, this paper considers value iteration based on sampling, where each sample yields an experience (or transition) $\langle s, a, s', r\rangle$. A policy is a mapping $\pi:S\times A \rightarrow [0,1]$. The goal of the agent is to find an optimal policy $\pi^*$ that maximizes the expected discounted cumulative reward. The state value function $V^{\pi}(s)$ of a stationary policy $\pi$ is defined as:
\begin{equation*}
V^{\pi}(s)=\mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^k R_{k} \,\Big|\, s_0=s\Big].
\label{valuefunction}
\end{equation*}
The linear value function for a state $s\in S$ is defined as:
\begin{equation}
V_{{\theta}}(s):= {\theta}^{\top}{\phi}(s) = \sum_{i=1}^{m} \theta_i \phi_i(s),
\label{linearvaluefunction}
\end{equation}
where ${\theta}:=(\theta_1,\theta_2,\ldots,\theta_m)^{\top}\in \mathbb{R}^m$ is a parameter vector, ${\phi}:=(\phi_1,\phi_2,\ldots,\phi_m)^{\top}\in \mathbb{R}^m$ is a feature function defined on the state space $S$, and $m$ is the number of features.

Tabular temporal difference (TD) learning \cite{Sutton2018book} has been successfully applied to small-scale problems. To cope with the well-known curse of dimensionality in large-scale MDPs, the value function is usually approximated by a linear model, kernel methods, decision trees, or neural networks. This paper focuses on the linear model, whose features are usually hand-coded by domain experts. TD learning can also be used to find optimal policies; the problem of finding an optimal policy is often called the control problem. Two popular TD control methods are Sarsa and Q-learning: the former is an on-policy TD control method, while the latter is off-policy. It is well known that the TDC algorithm \cite{sutton2009fast} guarantees convergence under off-policy conditions, whereas the off-policy TD algorithm may diverge. The objective function of TDC is the mean squared projected Bellman error (MSPBE); TDC is essentially an adjustment, or correction, of the TD update so that it follows the gradient of the MSPBE objective. The control algorithm built on TDC is known as Greedy-GQ($\lambda$) \cite{sutton2009fast}; when $\lambda$ is set to 0, it is denoted GQ(0).
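
To make the linear value function approximation and the TD/TDC updates above concrete, the following Python sketch shows one semi-gradient TD(0) step and one TDC step for the linear model in Eq.~(\ref{linearvaluefunction}). This is a minimal illustration only: the feature map, the step sizes, and the function names are assumptions for the sketch, not the exact implementation used in this paper.
\begin{verbatim}
# Minimal sketch of linear TD(0) and TDC updates.
# The feature vectors phi_s, phi_s_next and the step sizes
# alpha, beta are illustrative assumptions.
import numpy as np

def td0_update(theta, phi_s, phi_s_next, r, alpha=0.01, gamma=0.99):
    # Semi-gradient TD(0) step for V_theta(s) = theta^T phi(s).
    delta = r + gamma * theta @ phi_s_next - theta @ phi_s  # TD error
    return theta + alpha * delta * phi_s

def tdc_update(theta, w, phi_s, phi_s_next, r,
               alpha=0.01, beta=0.05, gamma=0.99):
    # TDC step (stochastic gradient of the MSPBE) with a
    # secondary weight vector w.
    delta = r + gamma * theta @ phi_s_next - theta @ phi_s  # TD error
    theta = theta + alpha * (delta * phi_s
                             - gamma * phi_s_next * (w @ phi_s))
    w = w + beta * (delta - w @ phi_s) * phi_s
    return theta, w
\end{verbatim}
The secondary weight vector $w$ tracks an estimate of the expected TD error given the features; the resulting term $-\gamma \phi(s') (w^{\top}\phi(s))$ is the correction that distinguishes TDC from the plain off-policy TD update.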