\section{Background} \label{preliminaries}
A reinforcement learning agent interacts with its environment: it observes the state, makes sequential decisions that influence the environment, and receives rewards. Consider an infinite-horizon discounted Markov Decision Process (MDP) defined by a tuple $\langle S,A,R,P,\gamma \rangle$, where $S=\{1,2,\ldots,N\}$ is a finite set of states of the environment; $A$ is a finite set of actions of the agent; $R:S\times A \times S \rightarrow \mathbb{R}$ is a bounded deterministic reward function; $P:S\times A\times S \rightarrow [0,1]$ is the transition probability distribution; and $\gamma\in (0,1)$ is the discount factor \cite{Sutton2018book}. Because online learning is required, this paper considers value iteration based on sampling, where each sample yields an experience (or transition) $\langle s, a, s', r\rangle$. A policy is a mapping $\pi:S\times A \rightarrow [0,1]$. The goal of the agent is to find an optimal policy $\pi^*$ that maximizes the expected discounted cumulative reward. The state value function $V^{\pi}(s)$ of a stationary policy $\pi$ is defined as
\begin{equation*}
V^{\pi}(s)=\mathbb{E}_{\pi}\big[\sum_{k=0}^{\infty} \gamma^k R_{k}\,\big|\,s_0=s\big].
\label{valuefunction}
\end{equation*}
The linear value function of a state $s\in S$ is defined as
\begin{equation}
V_{{\theta}}(s):= {\bm{\theta}}^{\top}{\bm{\phi}}(s) = \sum_{i=1}^{m} \theta_i \phi_i(s),
\label{linearvaluefunction}
\end{equation}
where ${\bm{\theta}}:=(\theta_1,\theta_2,\ldots,\theta_m)^{\top}\in \mathbb{R}^m$ is a parameter vector, ${\bm{\phi}}:=(\phi_1,\phi_2,\ldots,\phi_m)^{\top}$ is a feature function defined on the state space $S$ with ${\bm{\phi}}(s)\in \mathbb{R}^m$, and $m$ is the number of features.

Tabular temporal difference (TD) learning \cite{Sutton2018book} has been successfully applied to small-scale problems. To cope with the well-known curse of dimensionality in large-scale MDPs, the value function is usually approximated by a linear model, kernel methods, decision trees, or neural networks. This paper focuses on the linear model, in which features are typically hand-coded by domain experts.

% TD learning can also be used to find optimal strategies. The problem of finding an optimal policy is
% often called the control problem. Two popular TD methods are Sarsa and Q-learning. The former is an on-policy
% TD control method, while the latter is an off-policy control method.
% It is well known that the TDC algorithm \cite{sutton2009fast} guarantees
% convergence under off-policy conditions, while the off-policy TD algorithm may diverge. The
% objective function of TDC is the MSPBE.
% TDC is essentially an adjustment or correction of the TD update so that it
% follows the gradient of the MSPBE objective function. In the context of the TDC algorithm, the control algorithm
% is known as Greedy-GQ($\lambda$) \cite{sutton2009fast}. When $\lambda$ is set to 0, it is denoted
% as GQ(0).
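For concreteness, we recall how the parameter vector ${\bm{\theta}}$ in (\ref{linearvaluefunction}) is typically adjusted online from a sampled experience $\langle s, a, s', r\rangle$. The following is the standard TD(0) update with linear function approximation from \cite{Sutton2018book}, stated here only as a reference point; the step size $\alpha>0$ and the TD error $\delta$ are symbols local to this remark:
\begin{equation*}
{\bm{\theta}} \leftarrow {\bm{\theta}} + \alpha\, \delta\, {\bm{\phi}}(s),
\qquad
\delta := r + \gamma\, {\bm{\theta}}^{\top}{\bm{\phi}}(s') - {\bm{\theta}}^{\top}{\bm{\phi}}(s).
\end{equation*}
The update moves ${\bm{\theta}}$ along the feature vector of the visited state, scaled by how much the sampled one-step return $r + \gamma V_{\theta}(s')$ deviates from the current estimate $V_{\theta}(s)$.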