\section{Background} \label{preliminaries}
A reinforcement learning agent interacts with its environment: it observes the state, makes sequential decisions that influence the environment, and receives rewards. Consider an infinite-horizon discounted Markov Decision Process (MDP) defined by a tuple $\langle S,A,R,P,\gamma \rangle$, where $S=\{1,2,\ldots,N\}$ is a finite set of states of the environment; $A$ is a finite set of actions of the agent; $R:S\times A \times S \rightarrow \mathbb{R}$ is a bounded deterministic reward function; $P:S\times A\times S \rightarrow [0,1]$ is the transition probability distribution; and $\gamma\in (0,1)$ is the discount factor \cite{Sutton2018book}. Because online learning is required, this paper considers value iteration based on sampling, where each sample yields an experience (or transition) $\langle s, a, s', r\rangle$. A policy is a mapping $\pi:S\times A \rightarrow [0,1]$. The goal of the agent is to find an optimal policy $\pi^*$ that maximizes the expected discounted cumulative reward. The state value function $V^{\pi}(s)$ of a stationary policy $\pi$ is defined as
\begin{equation*}
V^{\pi}(s)=\mathbb{E}_{\pi}\big[\sum_{k=0}^{\infty} \gamma^k R_{k}\,\big|\,s_0=s\big].
\label{valuefunction}
\end{equation*}
The linear value function of a state $s\in S$ is defined as
\begin{equation}
V_{{\theta}}(s):= {\bm{\theta}}^{\top}{\bm{\phi}}(s) = \sum_{i=1}^{m} \theta_i \phi_i(s),
\label{linearvaluefunction}
\end{equation}
where ${\bm{\theta}}:=(\theta_1,\theta_2,\ldots,\theta_m)^{\top}\in \mathbb{R}^m$ is a parameter vector, ${\bm{\phi}}:=(\phi_1,\phi_2,\ldots,\phi_m)^{\top}$ is a feature function defined on the state space $S$ with ${\bm{\phi}}(s)\in \mathbb{R}^m$, and $m$ is the number of features.

Tabular temporal difference (TD) learning \cite{Sutton2018book} has been successfully applied to small-scale problems. To cope with the well-known curse of dimensionality in large-scale MDPs, the value function is usually approximated by a linear model, kernel methods, decision trees, or neural networks. This paper focuses on the linear model, in which features are typically hand-coded by domain experts.

% TD learning can also be used to find optimal strategies. The problem of finding an optimal policy is
% often called the control problem. Two popular TD methods are Sarsa and Q-learning. The former is an on-policy
% TD control method, while the latter is an off-policy control method.
% It is well known that the TDC algorithm \cite{sutton2009fast} guarantees
% convergence under off-policy conditions, while the off-policy TD algorithm may diverge. The
% objective function of TDC is the MSPBE.
% TDC is essentially an adjustment or correction of the TD update so that it
% follows the gradient of the MSPBE objective function. In the context of the TDC algorithm, the control algorithm
% is known as Greedy-GQ($\lambda$) \cite{sutton2009fast}. When $\lambda$ is set to 0, it is denoted
% as GQ(0).
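For concreteness, we recall how the parameter vector ${\bm{\theta}}$ in (\ref{linearvaluefunction}) is typically adjusted online from a sampled experience $\langle s, a, s', r\rangle$. The following is the standard TD(0) update with linear function approximation from \cite{Sutton2018book}, stated here only as a reference point; the step size $\alpha>0$ and the TD error $\delta$ are symbols local to this remark:
\begin{equation*}
{\bm{\theta}} \leftarrow {\bm{\theta}} + \alpha\, \delta\, {\bm{\phi}}(s),
\qquad
\delta := r + \gamma\, {\bm{\theta}}^{\top}{\bm{\phi}}(s') - {\bm{\theta}}^{\top}{\bm{\phi}}(s).
\end{equation*}
The update moves ${\bm{\theta}}$ along the feature vector of the visited state, scaled by how much the sampled one-step return $r + \gamma V_{\theta}(s')$ deviates from the current estimate $V_{\theta}(s)$.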