Commit 9c95a8b3 by GongYu

New version

parent ff4dbddd
This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) (preloaded format=pdflatex 2023.3.31) 30 JUN 2024 03:27
This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) (preloaded format=pdflatex 2023.3.31) 3 AUG 2024 19:11
entering extended mode
restricted \write18 enabled.
file:line:error style messages enabled.
......@@ -641,7 +641,7 @@ Here is how much of TeX's memory you used:
1141 hyphenation exceptions out of 8191
84i,17n,89p,423b,1058s stack positions out of 10000i,1000n,20000p,200000b,200000s
<d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmbx10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmex10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmmi10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmmi5.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmmi6.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmmi7.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmmi9.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmmib10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cmextra/cmmib7.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cmextra/cmmib9.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmr10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmr5.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmr6.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmr7.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmr9.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmsy10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmsy5.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmsy6.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmsy7.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/symbols/msbm10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/urw/times/utmb8a.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/urw/times/utmbi8a.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/urw/times/utmr8a.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/urw/times/utmri8a.pfb>
Output written on anonymous-submission-latex-2025.pdf (7 pages, 295290 bytes).
Output written on anonymous-submission-latex-2025.pdf (7 pages, 295485 bytes).
PDF statistics:
155 PDF objects out of 1000 (max. 8388607)
96 compressed objects within 1 object stream
......
......@@ -302,7 +302,7 @@ and the VMTDC algorithm, and also presents a corollary on the convergence rate o
Assume that $(\bm{\phi}_k,r_k,\bm{\phi}_k')$ is an i.i.d. sequence with
uniformly bounded second moments, where $\bm{\phi}_k$ and $\bm{\phi}'_{k}$ are sampled from the same Markov chain.
Let $\textbf{A}_{\textbf{VMETD}} ={\bm{\Phi}}^{\top} (\textbf{F} (\textbf{I} - \gamma \textbf{P}_{\pi})-\textbf{d}_{\mu} \textbf{d}_{\mu}^{\top} ){\bm{\Phi}}$,
$\bm{b}_{\textbf{VMETD}}=\bm{\Phi}^{\top}(\textbf{F}-\textbf{d}_{\mu} \textbf{d}_{\mu}^{\top})\textbf{r}_{\pi}$.
$\bm{b}_{\textbf{VMETD}}=\bm{\Phi}^{\top}(\textbf{F}-\textbf{d}_{\mu} \textbf{f}^{\top})\textbf{r}_{\pi}$.
Assume that the matrix $\textbf{A}_{\textbf{VMETD}}$ is non-singular.
Then the parameter vector $\bm{\theta}_k$ converges with probability one
to $\textbf{A}_{\textbf{VMETD}}^{-1}\bm{b}_{\textbf{VMETD}}$.
......
......@@ -35,21 +35,20 @@
\citation{sutton2016emphatic}
\newlabel{odetheta}{{A-17}{4}}
\newlabel{rowsum}{{A-20}{4}}
\citation{baird1995residual,sutton2009fast}
\citation{baird1995residual,sutton2009fast,maei2011gradient}
\providecommand*\caption@xref[2]{\@setref\relax\@undefined{#1}}
\newlabel{bairdexample}{{1}{5}}
\newlabel{columnsum}{{A-21}{5}}
\newlabel{odethetafinal}{{A-22}{5}}
\newlabel{mathematicalanalysis}{{B}{5}}
\providecommand*\caption@xref[2]{\@setref\relax\@undefined{#1}}
\newlabel{keymatrices}{{1}{5}}
\newlabel{minimumeigenvalues}{{2}{5}}
\newlabel{experimentaldetails}{{C}{5}}
\newlabel{bairdcounterexample}{{\caption@xref {bairdcounterexample}{ on input line 731}}{6}}
\newlabel{randomwalk}{{\caption@xref {randomwalk}{ on input line 754}}{6}}
\newlabel{boyanchain}{{\caption@xref {boyanchain}{ on input line 777}}{6}}
\newlabel{experimentaldetails}{{B}{5}}
\bibdata{aaai24}
\bibcite{borkar1997stochastic}{{1}{1997}{{Borkar}}{{}}}
\bibcite{borkar2000ode}{{2}{2000}{{Borkar and Meyn}}{{}}}
\bibcite{hirsch1989convergent}{{3}{1989}{{Hirsch}}{{}}}
\bibcite{sutton2009fast}{{4}{2009}{{Sutton et~al.}}{{Sutton, Maei, Precup, Bhatnagar, Silver, Szepesv{\'a}ri, and Wiewiora}}}
\bibcite{sutton2016emphatic}{{5}{2016}{{Sutton, Mahmood, and White}}{{}}}
\newlabel{lrofways}{{6}{7}}
\gdef \@abspage@last{7}
\bibcite{baird1995residual}{{1}{1995}{{Baird et~al.}}{{}}}
\bibcite{borkar1997stochastic}{{2}{1997}{{Borkar}}{{}}}
\bibcite{borkar2000ode}{{3}{2000}{{Borkar and Meyn}}{{}}}
\bibcite{hirsch1989convergent}{{4}{1989}{{Hirsch}}{{}}}
\bibcite{maei2011gradient}{{5}{2011}{{Maei}}{{}}}
\bibcite{sutton2009fast}{{6}{2009}{{Sutton et~al.}}{{Sutton, Maei, Precup, Bhatnagar, Silver, Szepesv{\'a}ri, and Wiewiora}}}
\bibcite{sutton2016emphatic}{{7}{2016}{{Sutton, Mahmood, and White}}{{}}}
\newlabel{lrofways}{{1}{6}}
\gdef \@abspage@last{6}
\begin{thebibliography}{5}
\begin{thebibliography}{7}
\providecommand{\natexlab}[1]{#1}
\bibitem[{Baird et~al.(1995)}]{baird1995residual}
Baird, L.; et~al. 1995.
\newblock Residual algorithms: Reinforcement learning with function approximation.
\newblock In \emph{Proc. 12th Int. Conf. Mach. Learn.}, 30--37.
\bibitem[{Borkar(1997)}]{borkar1997stochastic}
Borkar, V.~S. 1997.
\newblock Stochastic approximation with two time scales.
......@@ -16,6 +21,11 @@ Hirsch, M.~W. 1989.
\newblock Convergent activation dynamics in continuous time networks.
\newblock \emph{Neural Netw.}, 2(5): 331--349.
\bibitem[{Maei(2011)}]{maei2011gradient}
Maei, H.~R. 2011.
\newblock \emph{Gradient temporal-difference learning algorithms}.
\newblock Ph.D. thesis, University of Alberta.
\bibitem[{Sutton et~al.(2009)Sutton, Maei, Precup, Bhatnagar, Silver, Szepesv{\'a}ri, and Wiewiora}]{sutton2009fast}
Sutton, R.; Maei, H.; Precup, D.; Bhatnagar, S.; Silver, D.; Szepesv{\'a}ri, C.; and Wiewiora, E. 2009.
\newblock Fast gradient-descent methods for temporal-difference learning with linear function approximation.
......
......@@ -3,44 +3,44 @@ Capacity: max_strings=200000, hash_size=200000, hash_prime=170003
The top-level auxiliary file: anonymous-submission-latex-2024.aux
The style file: aaai24.bst
Database file #1: aaai24.bib
You've used 5 entries,
You've used 7 entries,
2840 wiz_defined-function locations,
619 strings with 5446 characters,
and the built_in function-call counts, 3370 in all, are:
= -- 277
> -- 153
630 strings with 5707 characters,
and the built_in function-call counts, 4424 in all, are:
= -- 372
> -- 189
< -- 0
+ -- 60
- -- 52
* -- 242
:= -- 547
add.period$ -- 20
call.type$ -- 5
change.case$ -- 36
chr.to.int$ -- 6
cite$ -- 5
duplicate$ -- 223
empty$ -- 240
format.name$ -- 60
if$ -- 649
+ -- 74
- -- 64
* -- 295
:= -- 731
add.period$ -- 28
call.type$ -- 7
change.case$ -- 49
chr.to.int$ -- 8
cite$ -- 7
duplicate$ -- 302
empty$ -- 320
format.name$ -- 75
if$ -- 861
int.to.chr$ -- 1
int.to.str$ -- 1
missing$ -- 49
newline$ -- 29
num.names$ -- 20
pop$ -- 92
missing$ -- 63
newline$ -- 39
num.names$ -- 28
pop$ -- 125
preamble$ -- 1
purify$ -- 34
purify$ -- 45
quote$ -- 0
skip$ -- 96
skip$ -- 134
stack$ -- 0
substring$ -- 200
swap$ -- 128
substring$ -- 246
swap$ -- 160
text.length$ -- 0
text.prefix$ -- 0
top$ -- 0
type$ -- 45
type$ -- 63
warning$ -- 0
while$ -- 31
while$ -- 42
width$ -- 0
write$ -- 68
write$ -- 94
This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) (preloaded format=pdflatex 2023.3.31) 30 JUN 2024 03:07
This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) (preloaded format=pdflatex 2023.3.31) 12 AUG 2024 17:11
entering extended mode
restricted \write18 enabled.
file:line:error style messages enabled.
......@@ -627,67 +627,41 @@ Package caption Info: listings package is loaded.
Package caption Info: End \AtBeginDocument code.
Package newfloat Info: `float' package detected.
\c@lstlisting=\count342
LaTeX Font Info: Trying to load font information for U+msa on input line 196.
LaTeX Font Info: Trying to load font information for U+msa on input line 234.
(d:/software/texlive/2023/texmf-dist/tex/latex/amsfonts/umsa.fd
File: umsa.fd 2013/01/14 v3.01 AMS symbols A
)
LaTeX Font Info: Trying to load font information for U+msb on input line 196.
LaTeX Font Info: Trying to load font information for U+msb on input line 234.
(d:/software/texlive/2023/texmf-dist/tex/latex/amsfonts/umsb.fd
File: umsb.fd 2013/01/14 v3.01 AMS symbols B
)
LaTeX Font Info: Trying to load font information for U+esvect on input line 196.
LaTeX Font Info: Trying to load font information for U+esvect on input line 234.
(d:/software/texlive/2023/texmf-dist/tex/latex/esvect/uesvect.fd
File: uesvect.fd
) [1
{d:/software/texlive/2023/texmf-var/fonts/map/pdftex/updmap/pdftex.map}{d:/software/texlive/2023/texmf-dist/fonts/enc/dvips/base/8r.enc}] [2] [3] [4] [5]
LaTeX Warning: Reference `Evaluation_full' on page 6 undefined on input line 843.
[6]
LaTeX Warning: Reference `Complete_full' on page 7 undefined on input line 875.
Underfull \hbox (badness 10000) in paragraph at lines 861--878
[]\OT1/ptm/m/n/10 7-state ver-sion of Baird's off-policy coun-terex-am-ple: for TD al-go-rithm, $\OML/cmm/m/it/10 $ \OT1/ptm/m/n/10 is set to 0.1. For the
[]
Underfull \hbox (badness 10000) in paragraph at lines 861--878
\OT1/ptm/m/n/10 TDC al-go-rithm, the range of $\OML/cmm/m/it/10 $ \OT1/ptm/m/n/10 is $\OMS/cmsy/m/n/10 f\OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 05\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 1\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 2\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 3\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 4\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 5\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 6\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 7\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 8\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 9\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 1\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 0\OMS/cmsy/m/n/10 g$\OT1/ptm/m/n/10 , and the range
[]
Underfull \hbox (badness 10000) in paragraph at lines 861--878
\OT1/ptm/m/n/10 of $\OML/cmm/m/it/10 ^^P$ \OT1/ptm/m/n/10 is $\OMS/cmsy/m/n/10 f\OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 05\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 1\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 2\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 3\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 4\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 5\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 6\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 7\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 8\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 9\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 1\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 0\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 1\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 1\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 1\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 2\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 1\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 3\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 1\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 4\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 1\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 5\OMS/cmsy/m/n/10 g$\OT1/ptm/m/n/10 . For the VMTD al-go-
[]
Underfull \hbox (badness 10000) in paragraph at lines 861--878
\OT1/ptm/m/n/10 rithm, the range of $\OML/cmm/m/it/10 $ \OT1/ptm/m/n/10 is $\OMS/cmsy/m/n/10 f\OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 05\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 1\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 2\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 3\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 4\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 5\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 6\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 7\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 8\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 0\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 9\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 1\OML/cmm/m/it/10 :\OT1/cmr/m/n/10 0\OMS/cmsy/m/n/10 g$\OT1/ptm/m/n/10 , and the range of $\OML/cmm/m/it/10 ^^L$ \OT1/ptm/m/n/10 is
[]
(./anonymous-submission-latex-2024.bbl) [7] (./anonymous-submission-latex-2024.aux)
LaTeX Warning: There were undefined references.
)
{d:/software/texlive/2023/texmf-var/fonts/map/pdftex/updmap/pdftex.map}{d:/software/texlive/2023/texmf-dist/fonts/enc/dvips/base/8r.enc}] [2] [3] [4] (./pic/BairdExample.tex)
<pic/maze_13_13.pdf, id=34, 493.1646pt x 387.62602pt>
File: pic/maze_13_13.pdf Graphic file (type pdf)
<use pic/maze_13_13.pdf>
Package pdftex.def Info: pic/maze_13_13.pdf used on input line 902.
(pdftex.def) Requested size: 172.61018pt x 135.67113pt.
[5] (./anonymous-submission-latex-2024.bbl) [6 <./pic/maze_13_13.pdf>] (./anonymous-submission-latex-2024.aux) )
Here is how much of TeX's memory you used:
22572 strings out of 476025
476134 string characters out of 5789524
1887382 words of memory out of 5000000
42651 multiletter control sequences out of 15000+600000
22926 strings out of 476025
482831 string characters out of 5789524
1878382 words of memory out of 5000000
43000 multiletter control sequences out of 15000+600000
531474 words of font info for 71 fonts, out of 8000000 for 9000
1141 hyphenation exceptions out of 8191
84i,22n,89p,423b,526s stack positions out of 10000i,1000n,20000p,200000b,200000s
<d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmbx10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmex10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmmi10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmmi5.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmmi7.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmmib10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cmextra/cmmib7.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmr10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmr7.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmsy10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmsy5.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmsy7.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/symbols/msam10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/symbols/msbm10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/urw/times/utmb8a.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/urw/times/utmr8a.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/urw/times/utmri8a.pfb>
Output written on anonymous-submission-latex-2024.pdf (7 pages, 213463 bytes).
84i,22n,89p,423b,789s stack positions out of 10000i,1000n,20000p,200000b,200000s
<d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmbx10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmex10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmmi10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmmi5.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmmi7.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmmib10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cmextra/cmmib7.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmr10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmr7.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmsy10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmsy5.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/cm/cmsy7.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/public/amsfonts/symbols/msbm10.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/urw/times/utmb8a.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/urw/times/utmr8a.pfb><d:/software/texlive/2023/texmf-dist/fonts/type1/urw/times/utmri8a.pfb>
Output written on anonymous-submission-latex-2024.pdf (6 pages, 200712 bytes).
PDF statistics:
117 PDF objects out of 1000 (max. 8388607)
73 compressed objects within 1 object stream
110 PDF objects out of 1000 (max. 8388607)
68 compressed objects within 1 object stream
0 named destinations out of 1000 (max. 500000)
13 words of extra memory for PDF output out of 10000 (max. 10000000)
18 words of extra memory for PDF output out of 10000 (max. 10000000)
\resizebox{7cm}{4.4cm}{
\begin{tikzpicture}[smooth]
\node[coordinate] (origin) at (0.3,0) {};
\node[coordinate] (num7) at (3,0) {};
\node[coordinate] (num1) at (1,2.5) {};
\path (num7) ++ (-10:0.5cm) node (num7_bright1) [coordinate] {};
\path (num7) ++ (-30:0.7cm) node (num7_bright2) [coordinate] {};
\path (num7) ++ (-60:0.35cm) node (num7_bright3) [coordinate] {};
\path (num7) ++ (-60:0.6cm) node (num7_bright4) [coordinate] {};
\path (origin) ++ (90:3cm) node (origin_above) [coordinate] {};
\path (origin_above) ++ (0:5.7cm) node (origin_aright) [coordinate] {};
\path (num1) ++ (90:0.5cm) node (num1_a) [coordinate] {};
\path (num1) ++ (-90:0.3cm) node (num1_b) [coordinate] {};
\path (num1) ++ (0:1cm) node (num2) [coordinate] {};
\path (num1_a) ++ (0:1cm) node (num2_a) [coordinate] {};
\path (num1_b) ++ (0:1cm) node (num2_b) [coordinate] {};
\path (num2) ++ (0:1cm) node (num3) [coordinate] {};
\path (num2_a) ++ (0:1cm) node (num3_a) [coordinate] {};
\path (num2_b) ++ (0:1cm) node (num3_b) [coordinate] {};
\path (num3) ++ (0:1cm) node (num4) [coordinate] {};
\path (num3_a) ++ (0:1cm) node (num4_a) [coordinate] {};
\path (num3_b) ++ (0:1cm) node (num4_b) [coordinate] {};
\path (num4) ++ (0:1cm) node (num5) [coordinate] {};
\path (num4_a) ++ (0:1cm) node (num5_a) [coordinate] {};
\path (num4_b) ++ (0:1cm) node (num5_b) [coordinate] {};
\path (num5) ++ (0:1cm) node (num6) [coordinate] {};
\path (num5_a) ++ (0:1cm) node (num6_a) [coordinate] {};
\path (num5_b) ++ (0:1cm) node (num6_b) [coordinate] {};
%\draw[->](0,0) -- (1,1);
%\draw[dashed,line width = 0.03cm] (0,0) -- (1,1);
%\fill (0.5,0.5) circle (0.5);
%\draw[shape=circle,fill=white,draw=black] (a) at (num7) {7};
\draw[dashed,line width = 0.03cm,xshift=3cm] plot[tension=0.06]
coordinates{(num7) (origin) (origin_above) (origin_aright)};
\draw[->,>=stealth,line width = 0.02cm,xshift=3cm] plot[tension=0.5]
coordinates{(num7) (num7_bright1) (num7_bright2)(num7_bright4) (num7_bright3)};
\node[line width = 0.02cm,shape=circle,fill=white,draw=black] (g) at (num7) {7};
\draw[<->,>=stealth,dashed,line width = 0.03cm,] (num1) -- (num1_a) ;
\node[line width = 0.02cm,shape=circle,fill=white,draw=black] (a) at (num1_b) {1};
\draw[<->,>=stealth,dashed,line width = 0.03cm,] (num2) -- (num2_a) ;
\node[line width = 0.02cm,shape=circle,fill=white,draw=black] (b) at (num2_b) {2};
\draw[<->,>=stealth,dashed,line width = 0.03cm,] (num3) -- (num3_a) ;
\node[line width = 0.02cm,shape=circle,fill=white,draw=black] (c) at (num3_b) {3};
\draw[<->,>=stealth,dashed,line width = 0.03cm,] (num4) -- (num4_a) ;
\node[line width = 0.02cm,shape=circle,fill=white,draw=black] (d) at (num4_b) {4};
\draw[<->,>=stealth,dashed,line width = 0.03cm,] (num5) -- (num5_a) ;
\node[line width = 0.02cm,shape=circle,fill=white,draw=black] (e) at (num5_b) {5};
\draw[<->,>=stealth,dashed,line width = 0.03cm,] (num6) -- (num6_a) ;
\node[line width = 0.02cm,shape=circle,fill=white,draw=black] (f) at (num6_b) {6};
\draw[->,>=stealth,line width = 0.02cm] (a)--(g);
\draw[->,>=stealth,line width = 0.02cm] (b)--(g);
\draw[->,>=stealth,line width = 0.02cm] (c)--(g);
\draw[->,>=stealth,line width = 0.02cm] (d)--(g);
\draw[->,>=stealth,line width = 0.02cm] (e)--(g);
\draw[->,>=stealth,line width = 0.02cm] (f)--(g);
\end{tikzpicture}
}
\relax
\bibstyle{aaai25}
\citation{sutton1988learning}
\citation{tsitsiklis1997analysis}
\citation{Sutton2018book}
\citation{baird1995residual}
\citation{sutton2008convergent}
\citation{sutton2009fast}
\citation{sutton2016emphatic}
\citation{chen2023modified}
\citation{hackman2012faster}
\citation{liu2015finite,liu2016proximal,liu2018proximal}
\citation{givchi2015quasi}
\citation{pan2017accelerated}
\citation{hallak2016generalized}
\citation{zhang2022truncated}
\citation{johnson2013accelerating}
\citation{korda2015td}
\citation{xu2019reanalysis}
\citation{Sutton2018book}
\citation{baird1995residual}
\citation{sutton2009fast}
\citation{sutton2009fast}
\citation{feng2019kernel}
\citation{basserrano2021logistic}
\newlabel{introduction}{{}{1}}
\citation{Sutton2018book}
\citation{Sutton2018book}
\citation{sutton2016emphatic}
\newlabel{preliminaries}{{}{2}}
\newlabel{valuefunction}{{}{2}}
\newlabel{linearvaluefunction}{{1}{2}}
\newlabel{thetatd_onpolicy}{{}{2}}
\newlabel{thetatd_offpolicy}{{}{2}}
\newlabel{thetatdc}{{}{3}}
\newlabel{utdc}{{}{3}}
\newlabel{fvmetd}{{2}{3}}
\newlabel{thetaetd}{{}{3}}
\providecommand*\caption@xref[2]{\@setref\relax\@undefined{#1}}
\newlabel{alg:algorithm 2}{{1}{3}}
\newlabel{alg:algorithm 5}{{2}{4}}
\newlabel{thetavmtdc}{{5}{4}}
\newlabel{uvmtdc}{{6}{4}}
\newlabel{omegavmtdc}{{7}{4}}
\newlabel{rho_VPBE}{{8}{4}}
\newlabel{thetavmetd}{{12}{4}}
\newlabel{omegavmetd}{{13}{4}}
\citation{sutton2009fast}
\citation{hirsch1989convergent}
\newlabel{theorem2}{{1}{5}}
\newlabel{thetavmtdcFastest}{{14}{5}}
\newlabel{uvmtdcFastest}{{15}{5}}
\newlabel{omegavmtdcFastest}{{16}{5}}
\newlabel{omegavmtdcFastestFinal}{{17}{5}}
\newlabel{omegavmtdcInfty}{{18}{5}}
\citation{borkar2000ode}
\citation{borkar2000ode}
\citation{borkar2000ode}
\citation{borkar1997stochastic}
\citation{ng1999policy}
\citation{devlin2012dynamic}
\newlabel{theorem3}{{2}{6}}
\newlabel{rowsum}{{19}{6}}
\newlabel{example_bias}{{2}{6}}
\newlabel{columnsum}{{20}{6}}
\bibdata{aaai25}
\bibcite{baird1995residual}{{1}{1995}{{Baird et~al.}}{{}}}
\newlabel{2-state}{{1(a)}{7}}
\newlabel{sub@2-state}{{(a)}{7}}
\newlabel{7-state}{{1(b)}{7}}
\newlabel{sub@7-state}{{(b)}{7}}
\newlabel{MazeFull}{{1(c)}{7}}
\newlabel{sub@MazeFull}{{(c)}{7}}
\newlabel{CliffWalkingFull}{{1(d)}{7}}
\newlabel{sub@CliffWalkingFull}{{(d)}{7}}
\newlabel{MountainCarFull}{{1(e)}{7}}
\newlabel{sub@MountainCarFull}{{(e)}{7}}
\newlabel{AcrobotFull}{{1(f)}{7}}
\newlabel{sub@AcrobotFull}{{(f)}{7}}
\newlabel{Complete_full}{{1}{7}}
\bibcite{basserrano2021logistic}{{2}{2021}{{Bas-Serrano et~al.}}{{Bas-Serrano, Curi, Krause, and Neu}}}
\bibcite{borkar1997stochastic}{{3}{1997}{{Borkar}}{{}}}
\bibcite{borkar2000ode}{{4}{2000}{{Borkar and Meyn}}{{}}}
\bibcite{chen2023modified}{{5}{2023}{{Chen et~al.}}{{Chen, Ma, Li, Yang, Yang, and Gao}}}
\bibcite{devlin2012dynamic}{{6}{2012}{{Devlin and Kudenko}}{{}}}
\bibcite{feng2019kernel}{{7}{2019}{{Feng, Li, and Liu}}{{}}}
\bibcite{givchi2015quasi}{{8}{2015}{{Givchi and Palhang}}{{}}}
\bibcite{hackman2012faster}{{9}{2012}{{Hackman}}{{}}}
\bibcite{hallak2016generalized}{{10}{2016}{{Hallak et~al.}}{{Hallak, Tamar, Munos, and Mannor}}}
\bibcite{hirsch1989convergent}{{11}{1989}{{Hirsch}}{{}}}
\bibcite{johnson2013accelerating}{{12}{2013}{{Johnson and Zhang}}{{}}}
\bibcite{korda2015td}{{13}{2015}{{Korda and La}}{{}}}
\bibcite{liu2018proximal}{{14}{2018}{{Liu et~al.}}{{Liu, Gemp, Ghavamzadeh, Liu, Mahadevan, and Petrik}}}
\bibcite{liu2015finite}{{15}{2015}{{Liu et~al.}}{{Liu, Liu, Ghavamzadeh, Mahadevan, and Petrik}}}
\bibcite{liu2016proximal}{{16}{2016}{{Liu et~al.}}{{Liu, Liu, Ghavamzadeh, Mahadevan, and Petrik}}}
\bibcite{ng1999policy}{{17}{1999}{{Ng, Harada, and Russell}}{{}}}
\bibcite{pan2017accelerated}{{18}{2017}{{Pan, White, and White}}{{}}}
\bibcite{sutton2009fast}{{19}{2009}{{Sutton et~al.}}{{Sutton, Maei, Precup, Bhatnagar, Silver, Szepesv{\'a}ri, and Wiewiora}}}
\bibcite{sutton1988learning}{{20}{1988}{{Sutton}}{{}}}
\bibcite{Sutton2018book}{{21}{2018}{{Sutton and Barto}}{{}}}
\bibcite{sutton2008convergent}{{22}{2008}{{Sutton, Maei, and Szepesv{\'a}ri}}{{}}}
\bibcite{sutton2016emphatic}{{23}{2016}{{Sutton, Mahmood, and White}}{{}}}
\bibcite{tsitsiklis1997analysis}{{24}{1997}{{Tsitsiklis and Van~Roy}}{{}}}
\bibcite{xu2019reanalysis}{{25}{2019}{{Xu et~al.}}{{Xu, Wang, Zhou, and Liang}}}
\bibcite{zhang2022truncated}{{26}{2022}{{Zhang and Whiteson}}{{}}}
\gdef \@abspage@last{8}
\begin{thebibliography}{26}
\providecommand{\natexlab}[1]{#1}
\bibitem[{Baird et~al.(1995)}]{baird1995residual}
Baird, L.; et~al. 1995.
\newblock Residual algorithms: Reinforcement learning with function approximation.
\newblock In \emph{Proc. 12th Int. Conf. Mach. Learn.}, 30--37.
\bibitem[{Bas-Serrano et~al.(2021)Bas-Serrano, Curi, Krause, and Neu}]{basserrano2021logistic}
Bas-Serrano, J.; Curi, S.; Krause, A.; and Neu, G. 2021.
\newblock Logistic Q-Learning.
\newblock In \emph{International Conference on Artificial Intelligence and Statistics}, 3610--3618.
\bibitem[{Borkar(1997)}]{borkar1997stochastic}
Borkar, V.~S. 1997.
\newblock Stochastic approximation with two time scales.
\newblock \emph{Syst. \& Control Letters}, 29(5): 291--294.
\bibitem[{Borkar and Meyn(2000)}]{borkar2000ode}
Borkar, V.~S.; and Meyn, S.~P. 2000.
\newblock The ODE method for convergence of stochastic approximation and reinforcement learning.
\newblock \emph{SIAM J. Control Optim.}, 38(2): 447--469.
\bibitem[{Chen et~al.(2023)Chen, Ma, Li, Yang, Yang, and Gao}]{chen2023modified}
Chen, X.; Ma, X.; Li, Y.; Yang, G.; Yang, S.; and Gao, Y. 2023.
\newblock Modified Retrace for Off-Policy Temporal Difference Learning.
\newblock In \emph{Uncertainty in Artificial Intelligence}, 303--312. PMLR.
\bibitem[{Devlin and Kudenko(2012)}]{devlin2012dynamic}
Devlin, S.; and Kudenko, D. 2012.
\newblock Dynamic potential-based reward shaping.
\newblock In \emph{Proc. 11th Int. Conf. Autonomous Agents and Multiagent Systems}, 433--440.
\bibitem[{Feng, Li, and Liu(2019)}]{feng2019kernel}
Feng, Y.; Li, L.; and Liu, Q. 2019.
\newblock A kernel loss for solving the Bellman equation.
\newblock In \emph{Advances in Neural Information Processing Systems}, 15430--15441.
\bibitem[{Givchi and Palhang(2015)}]{givchi2015quasi}
Givchi, A.; and Palhang, M. 2015.
\newblock Quasi newton temporal difference learning.
\newblock In \emph{Asian Conference on Machine Learning}, 159--172.
\bibitem[{Hackman(2012)}]{hackman2012faster}
Hackman, L. 2012.
\newblock \emph{Faster Gradient-TD Algorithms}.
\newblock Ph.D. thesis, University of Alberta.
\bibitem[{Hallak et~al.(2016)Hallak, Tamar, Munos, and Mannor}]{hallak2016generalized}
Hallak, A.; Tamar, A.; Munos, R.; and Mannor, S. 2016.
\newblock Generalized emphatic temporal difference learning: bias-variance analysis.
\newblock In \emph{Proceedings of the 30th AAAI Conference on Artificial Intelligence}, 1631--1637.
\bibitem[{Hirsch(1989)}]{hirsch1989convergent}
Hirsch, M.~W. 1989.
\newblock Convergent activation dynamics in continuous time networks.
\newblock \emph{Neural Netw.}, 2(5): 331--349.
\bibitem[{Johnson and Zhang(2013)}]{johnson2013accelerating}
Johnson, R.; and Zhang, T. 2013.
\newblock Accelerating stochastic gradient descent using predictive variance reduction.
\newblock In \emph{Advances in Neural Information Processing Systems}, 315--323.
\bibitem[{Korda and La(2015)}]{korda2015td}
Korda, N.; and La, P. 2015.
\newblock On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence.
\newblock In \emph{International conference on machine learning}, 626--634. PMLR.
\bibitem[{Liu et~al.(2018)Liu, Gemp, Ghavamzadeh, Liu, Mahadevan, and Petrik}]{liu2018proximal}
Liu, B.; Gemp, I.; Ghavamzadeh, M.; Liu, J.; Mahadevan, S.; and Petrik, M. 2018.
\newblock Proximal gradient temporal difference learning: Stable reinforcement learning with polynomial sample complexity.
\newblock \emph{Journal of Artificial Intelligence Research}, 63: 461--494.
\bibitem[{Liu et~al.(2015)Liu, Liu, Ghavamzadeh, Mahadevan, and Petrik}]{liu2015finite}
Liu, B.; Liu, J.; Ghavamzadeh, M.; Mahadevan, S.; and Petrik, M. 2015.
\newblock Finite-sample analysis of proximal gradient TD algorithms.
\newblock In \emph{Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence}, 504--513.
\bibitem[{Liu et~al.(2016)Liu, Liu, Ghavamzadeh, Mahadevan, and Petrik}]{liu2016proximal}
Liu, B.; Liu, J.; Ghavamzadeh, M.; Mahadevan, S.; and Petrik, M. 2016.
\newblock Proximal Gradient Temporal Difference Learning Algorithms.
\newblock In \emph{Proceedings of the International Joint Conference on Artificial Intelligence}, 4195--4199.
\bibitem[{Ng, Harada, and Russell(1999)}]{ng1999policy}
Ng, A.~Y.; Harada, D.; and Russell, S. 1999.
\newblock Policy invariance under reward transformations: Theory and application to reward shaping.
\newblock In \emph{Proc. 16th Int. Conf. Mach. Learn.}, 278--287.
\bibitem[{Pan, White, and White(2017)}]{pan2017accelerated}
Pan, Y.; White, A.; and White, M. 2017.
\newblock Accelerated gradient temporal difference learning.
\newblock In \emph{Proceedings of the 21st AAAI Conference on Artificial Intelligence}, 2464--2470.
\bibitem[{Sutton et~al.(2009)Sutton, Maei, Precup, Bhatnagar, Silver, Szepesv{\'a}ri, and Wiewiora}]{sutton2009fast}
Sutton, R.; Maei, H.; Precup, D.; Bhatnagar, S.; Silver, D.; Szepesv{\'a}ri, C.; and Wiewiora, E. 2009.
\newblock Fast gradient-descent methods for temporal-difference learning with linear function approximation.
\newblock In \emph{Proc. 26th Int. Conf. Mach. Learn.}, 993--1000.
\bibitem[{Sutton(1988)}]{sutton1988learning}
Sutton, R.~S. 1988.
\newblock Learning to predict by the methods of temporal differences.
\newblock \emph{Machine learning}, 3(1): 9--44.
\bibitem[{Sutton and Barto(2018)}]{Sutton2018book}
Sutton, R.~S.; and Barto, A.~G. 2018.
\newblock \emph{Reinforcement Learning: An Introduction}.
\newblock The MIT Press, second edition.
\bibitem[{Sutton, Maei, and Szepesv{\'a}ri(2008)}]{sutton2008convergent}
Sutton, R.~S.; Maei, H.~R.; and Szepesv{\'a}ri, C. 2008.
\newblock A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation.
\newblock In \emph{Advances in Neural Information Processing Systems}, 1609--1616. Cambridge, MA: MIT Press.
\bibitem[{Sutton, Mahmood, and White(2016)}]{sutton2016emphatic}
Sutton, R.~S.; Mahmood, A.~R.; and White, M. 2016.
\newblock An emphatic approach to the problem of off-policy temporal-difference learning.
\newblock \emph{The Journal of Machine Learning Research}, 17(1): 2603--2631.
\bibitem[{Tsitsiklis and Van~Roy(1997)}]{tsitsiklis1997analysis}
Tsitsiklis, J.~N.; and Van~Roy, B. 1997.
\newblock Analysis of temporal-difference learning with function approximation.
\newblock In \emph{Advances in Neural Information Processing Systems}, 1075--1081.
\bibitem[{Xu et~al.(2019)Xu, Wang, Zhou, and Liang}]{xu2019reanalysis}
Xu, T.; Wang, Z.; Zhou, Y.; and Liang, Y. 2019.
\newblock Reanalysis of Variance Reduced Temporal Difference Learning.
\newblock In \emph{International Conference on Learning Representations}.
\bibitem[{Zhang and Whiteson(2022)}]{zhang2022truncated}
Zhang, S.; and Whiteson, S. 2022.
\newblock Truncated emphatic temporal difference methods for prediction and control.
\newblock \emph{The Journal of Machine Learning Research}, 23(1): 6859--6917.
\end{thebibliography}
This is BibTeX, Version 0.99d (TeX Live 2023)
Capacity: max_strings=200000, hash_size=200000, hash_prime=170003
The top-level auxiliary file: anonymous-submission-latex-2025.aux
The style file: aaai25.bst
Database file #1: aaai25.bib
You've used 26 entries,
2840 wiz_defined-function locations,
737 strings with 9168 characters,
and the built_in function-call counts, 19179 in all, are:
= -- 1644
> -- 870
< -- 0
+ -- 321
- -- 288
* -- 1273
:= -- 2961
add.period$ -- 107
call.type$ -- 26
change.case$ -- 217
chr.to.int$ -- 27
cite$ -- 26
duplicate$ -- 1316
empty$ -- 1372
format.name$ -- 353
if$ -- 3900
int.to.chr$ -- 1
int.to.str$ -- 1
missing$ -- 261
newline$ -- 134
num.names$ -- 104
pop$ -- 614
preamble$ -- 1
purify$ -- 182
quote$ -- 0
skip$ -- 694
stack$ -- 0
substring$ -- 1043
swap$ -- 703
text.length$ -- 0
text.prefix$ -- 0
top$ -- 0
type$ -- 231
warning$ -- 0
while$ -- 166
width$ -- 0
write$ -- 343
%File: anonymous-submission-latex-2025.tex
\documentclass[letterpaper]{article} % DO NOT CHANGE THIS
\usepackage[submission]{aaai25} % DO NOT CHANGE THIS
\usepackage{times} % DO NOT CHANGE THIS
\usepackage{helvet} % DO NOT CHANGE THIS
\usepackage{courier} % DO NOT CHANGE THIS
\usepackage[hyphens]{url} % DO NOT CHANGE THIS
\usepackage{graphicx} % DO NOT CHANGE THIS
\urlstyle{rm} % DO NOT CHANGE THIS
\def\UrlFont{\rm} % DO NOT CHANGE THIS
\usepackage{natbib} % DO NOT CHANGE THIS AND DO NOT ADD ANY OPTIONS TO IT
\usepackage{caption} % DO NOT CHANGE THIS AND DO NOT ADD ANY OPTIONS TO IT
\frenchspacing % DO NOT CHANGE THIS
\setlength{\pdfpagewidth}{8.5in} % DO NOT CHANGE THIS
\setlength{\pdfpageheight}{11in} % DO NOT CHANGE THIS
%
% These are recommended to typeset algorithms but not required. See the subsubsection on algorithms. Remove them if you don't have algorithms in your paper.
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{subfigure}
\usepackage{diagbox}
\usepackage{booktabs}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}
\usepackage{tikz}
\usepackage{bm}
\usepackage{esvect}
\usepackage{multirow}
\theoremstyle{plain}
% \newtheorem{theorem}{Theorem}[section]
\newtheorem{theorem}{Theorem}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}
%
% These are are recommended to typeset listings but not required. See the subsubsection on listing. Remove this block if you don't have listings in your paper.
\usepackage{newfloat}
\usepackage{listings}
\DeclareCaptionStyle{ruled}{labelfont=normalfont,labelsep=colon,strut=off} % DO NOT CHANGE THIS
\lstset{%
basicstyle={\footnotesize\ttfamily},% footnotesize acceptable for monospace
numbers=left,numberstyle=\footnotesize,xleftmargin=2em,% show line numbers, remove this entire line if you don't want the numbers.
aboveskip=0pt,belowskip=0pt,%
showstringspaces=false,tabsize=2,breaklines=true}
\floatstyle{ruled}
\newfloat{listing}{tb}{lst}{}
\floatname{listing}{Listing}
%
% Keep the \pdfinfo as shown here. There's no need
% for you to add the /Title and /Author tags.
\pdfinfo{
/TemplateVersion (2025.1)
}
% DISALLOWED PACKAGES
% \usepackage{authblk} -- This package is specifically forbidden
% \usepackage{balance} -- This package is specifically forbidden
% \usepackage{color (if used in text)
% \usepackage{CJK} -- This package is specifically forbidden
% \usepackage{float} -- This package is specifically forbidden
% \usepackage{flushend} -- This package is specifically forbidden
% \usepackage{fontenc} -- This package is specifically forbidden
% \usepackage{fullpage} -- This package is specifically forbidden
% \usepackage{geometry} -- This package is specifically forbidden
% \usepackage{grffile} -- This package is specifically forbidden
% \usepackage{hyperref} -- This package is specifically forbidden
% \usepackage{navigator} -- This package is specifically forbidden
% (or any other package that embeds links such as navigator or hyperref)
% \indentfirst} -- This package is specifically forbidden
% \layout} -- This package is specifically forbidden
% \multicol} -- This package is specifically forbidden
% \nameref} -- This package is specifically forbidden
% \usepackage{savetrees} -- This package is specifically forbidden
% \usepackage{setspace} -- This package is specifically forbidden
% \usepackage{stfloats} -- This package is specifically forbidden
% \usepackage{tabu} -- This package is specifically forbidden
% \usepackage{titlesec} -- This package is specifically forbidden
% \usepackage{tocbibind} -- This package is specifically forbidden
% \usepackage{ulem} -- This package is specifically forbidden
% \usepackage{wrapfig} -- This package is specifically forbidden
% DISALLOWED COMMANDS
% \nocopyright -- Your paper will not be published if you use this command
% \addtolength -- This command may not be used
% \balance -- This command may not be used
% \baselinestretch -- Your paper will not be published if you use this command
% \clearpage -- No page breaks of any kind may be used for the final version of your paper
% \columnsep -- This command may not be used
% \newpage -- No page breaks of any kind may be used for the final version of your paper
% \pagebreak -- No page breaks of any kind may be used for the final version of your paperr
% \pagestyle -- This command may not be used
% \tiny -- This is not an acceptable font size.
% \vspace{- -- No negative value may be used in proximity of a caption, figure, table, section, subsection, subsubsection, or reference
% \vskip{- -- No negative value may be used to alter spacing above or below a caption, figure, table, section, subsection, subsubsection, or reference
\setcounter{secnumdepth}{0} %May be changed to 1 or 2 if section numbers are desired.
% The file aaai25.sty is the style file for AAAI Press
% proceedings, working notes, and technical reports.
%
% Title
% Your title must be in mixed case, not sentence case.
% That means all verbs (including short verbs like be, is, using,and go),
% nouns, adverbs, adjectives should be capitalized, including both words in hyphenated terms, while
% articles, conjunctions, and prepositions are lower case unless they
% directly follow a colon or long dash
\title{A Variance Minimization Approach to Off-policy Temporal-Difference Learning}
\author{
%Authors
% All authors must be in the same font size and format.
Written by AAAI Press Staff\textsuperscript{\rm 1}\thanks{With help from the AAAI Publications Committee.}\\
AAAI Style Contributions by Pater Patel Schneider,
Sunil Issar,\\
J. Scott Penberthy,
George Ferguson,
Hans Guesgen,
Francisco Cruz\equalcontrib,
Marc Pujol-Gonzalez\equalcontrib
}
\affiliations{
%Afiliations
\textsuperscript{\rm 1}Association for the Advancement of Artificial Intelligence\\
% If you have multiple authors and multiple affiliations
% use superscripts in text and roman font to identify them.
% For example,
% Sunil Issar\textsuperscript{\rm 2},
% J. Scott Penberthy\textsuperscript{\rm 3},
% George Ferguson\textsuperscript{\rm 4},
% Hans Guesgen\textsuperscript{\rm 5}
% Note that the comma should be placed after the superscript
1101 Pennsylvania Ave, NW Suite 300\\
Washington, DC 20004 USA\\
% email address must be in roman text type, not monospace or sans serif
proceedings-questions@aaai.org
%
% See more examples next
}
%Example, Single Author, ->> remove \iffalse,\fi and place them surrounding AAAI title to use it
\iffalse
\title{My Publication Title --- Single Author}
\author {
Author Name
}
\affiliations{
Affiliation\\
Affiliation Line 2\\
name@example.com
}
\fi
\iffalse
%Example, Multiple Authors, ->> remove \iffalse,\fi and place them surrounding AAAI title to use it
\title{My Publication Title --- Multiple Authors}
\author {
% Authors
First Author Name\textsuperscript{\rm 1},
Second Author Name\textsuperscript{\rm 2},
Third Author Name\textsuperscript{\rm 1}
}
\affiliations {
% Affiliations
\textsuperscript{\rm 1}Affiliation 1\\
\textsuperscript{\rm 2}Affiliation 2\\
firstAuthor@affiliation1.com, secondAuthor@affilation2.com, thirdAuthor@affiliation1.com
}
\fi
% REMOVE THIS: bibentry
% This is only needed to show inline citations in the guidelines document. You should not need it and can safely delete it.
\usepackage{bibentry}
% END REMOVE bibentry
\begin{document}
\setcounter{theorem}{0}
\maketitle
% \setcounter{theorem}{0}
\begin{abstract}
In this paper, we introduce the idea of improving the performance of parametric
Temporal-Difference (TD) learning algorithms through a Variance Minimization (VM) parameter, $\omega$,
which is updated dynamically at each time step. Specifically, we incorporate the VM parameter into off-policy linear algorithms such as TDC and ETD, resulting in the
Variance Minimization TDC (VMTDC) and Variance Minimization ETD (VMETD) algorithms. In the two-state counterexample,
we analyze
the convergence speed of these algorithms by calculating the minimum eigenvalues of their key
matrices and find that VMTDC converges faster than TDC, while our experiments show that VMETD converges more stably than ETD.
In control experiments, the VM algorithms demonstrate
superior performance.
\end{abstract}
% Uncomment the following to link to your code, datasets, an extended version or similar.
%
% \begin{links}
% \link{Code}{https://aaai.org/example/code}
% \link{Datasets}{https://aaai.org/example/datasets}
% \link{Extended version}{https://aaai.org/example/extended-version}
% \end{links}
\input{main/introduction.tex}
\input{main/preliminaries.tex}
\input{main/motivation.tex}
\input{main/theory.tex}
\input{main/experiment.tex}
% \input{main/relatedwork.tex}
\input{main/conclusion.tex}
\bibliography{aaai25}
\end{document}
\section{Conclusion and Future Work}
Value-based reinforcement learning typically aims
to minimize error as an optimization objective.
As an alternative, this study proposes new objective
functions, VBE and VPBE, and derives several variance minimization algorithms, including VMTD,
VMTDC, and VMETD.
All of these algorithms demonstrated superior performance in policy
evaluation and control experiments.
Future work may include, but is not limited
to, (1) analysis of the convergence rates of VMTDC and VMETD;
(2) extensions of VBE and VPBE to multi-step returns; and
(3) extensions to nonlinear approximations, such as neural networks.
\ No newline at end of file
\section{Experimental Studies}
This section assesses algorithm performance through experiments,
which are divided into policy evaluation experiments and control experiments.
The control algorithms for TDC, ETD, VMTDC, and VMETD are named GQ, EQ, VMGQ, and VMEQ, respectively.
The evaluation experimental environments are the 2-state and 7-state counterexamples.
The control experimental environments are Maze, CliffWalking-v0, MountainCar-v0, and Acrobot-v1.
For specific experimental parameters, please refer to the appendix.
For the evaluation experiment, the experimental results
align with our previous analysis. In the 2-state counterexample
environment, the TDC algorithm has the smallest minimum
eigenvalue of the key matrix, resulting in the slowest
convergence speed. In contrast, the minimum eigenvalue
of VMTDC is larger, leading to faster convergence.
Although VMETD's minimum eigenvalue is larger than ETD's,
causing VMETD to converge more slowly than ETD in the
2-state counterexample, the standard deviation (shaded area)
of VMETD is smaller than that of ETD, indicating that VMETD
converges more smoothly. In the 7-state counterexample
environment, VMTDC converges faster than TDC, and both VMETD and ETD diverge.
For the control experiments, the results for the maze and
cliff walking environments are similar: VMGQ
outperforms GQ, EQ outperforms VMGQ, and VMEQ performs
the best. In the mountain car and Acrobot experiments,
VMGQ and VMEQ show comparable performance, both outperforming
GQ and EQ. In summary, for control experiments, VM algorithms
outperform non-VM algorithms.
In summary, the performance of VMSarsa,
VMQ, and VMGQ(0) is better than that of other algorithms.
In the Cliff Walking environment,
the performance of VMGQ(0) is slightly better than that of
VMSarsa and VMQ. In the other three experimental environments,
the performances of VMSarsa, VMQ, and VMGQ(0) are close.
\ No newline at end of file
\section{Introduction}
\label{introduction}
Reinforcement learning can be mainly divided into two
categories: value-based reinforcement learning
and policy gradient-based reinforcement learning. This
paper focuses on temporal difference learning based on
linearly approximated value functions. Research in this area is
usually divided into two steps: the first step is to establish the convergence of the algorithm, and the second
step is to accelerate the algorithm.
In terms of stability, \citet{sutton1988learning} established the
convergence of on-policy TD(0), and \citet{tsitsiklis1997analysis}
established the convergence of on-policy TD($\lambda$).
However, ``the deadly triad'' of off-policy learning,
bootstrapping, and function approximation makes
stability a difficult problem \citep{Sutton2018book}.
To solve this problem, convergent off-policy temporal difference
learning algorithms are proposed, e.g., BR \cite{baird1995residual},
GTD \cite{sutton2008convergent}, GTD2 and TDC \cite{sutton2009fast},
ETD \cite{sutton2016emphatic}, and MRetrace \cite{chen2023modified}.
In terms of acceleration, \citet{hackman2012faster}
proposed the Hybrid TD algorithm with an on-policy matrix.
\citet{liu2015finite,liu2016proximal,liu2018proximal} proposed
true stochastic algorithms, i.e., GTD-MP and GTD2-MP, from
a convex-concave saddle-point formulation.
Second-order methods are used to accelerate TD learning,
e.g., Quasi Newton TD \cite{givchi2015quasi} and
accelerated TD (ATD) \citep{pan2017accelerated}.
\citet{hallak2016generalized} introduced a new parameter
to reduce the variance of ETD.
\citet{zhang2022truncated} proposed truncated ETD with a lower variance.
Variance Reduced TD, which applies a direct variance reduction technique \citep{johnson2013accelerating}, was proposed by \citet{korda2015td}
and analyzed by \citet{xu2019reanalysis}.
How to further improve the convergence rates of reinforcement learning
algorithms is currently still an open problem.
Algorithm stability is prominently reflected in the changes
to the objective function, transitioning from mean squared
errors (MSE) \citep{Sutton2018book} to mean squared Bellman errors (MSBE) \cite{baird1995residual}, then to the
norm of the expected TD update \cite{sutton2009fast}, and further to
mean squared projected Bellman errors (MSPBE) \cite{sutton2009fast}. On the other hand, algorithm
acceleration is more centered around optimizing the iterative
update formula of the algorithm itself without altering the
objective function, thereby speeding up the convergence rate
of the algorithm. The emergence of new optimization objective
functions often leads to the development of novel algorithms.
The introduction of new algorithms, in turn, tends to inspire
researchers to explore methods for accelerating algorithms,
leading to the iterative creation of increasingly superior algorithms.
The kernel loss function can be optimized using standard
gradient-based methods, addressing the issue of double
sampling in the residual gradient algorithm \cite{feng2019kernel}. It ensures convergence
in both on-policy and off-policy scenarios. The logistic Bellman
error is convex and smooth in the action-value function parameters,
with bounded gradients \cite{basserrano2021logistic}. In contrast, the squared Bellman error is
not convex in the action-value function parameters, and RL algorithms
based on recursive optimization using it are known to be unstable.
% The value-based algorithms mentioned above aim to
% minimize some errors, e.g., mean squared errors \citep{Sutton2018book},
% mean squared Bellman errors \cite{baird1995residual}, norm
% of the expected TD update \cite{sutton2009fast},
% mean squared projected Bellman errors (MSPBE) \cite{sutton2009fast}, etc.
It is necessary to propose a new objective function, but the objective functions mentioned above are all some form of error.
Is minimizing error the only option for value-based reinforcement learning?
For policy evaluation experiments,
differences in objective functions may result
in inconsistent fixed points. This inconsistency
makes it difficult to uniformly compare the superiority
of algorithms derived from different objective functions.
However, for control experiments, since the choice of actions
depends on the relative values of the Q values rather than their
absolute values, the presence of solution bias is acceptable.
Based on this observation, we propose alternate objective functions
instead of minimizing errors. We minimize
Variance of Projected Bellman Error (VPBE)
and derive Variance Minimization (VM) algorithms.
These algorithms preserve the invariance of the optimal policy in control environments
but significantly reduce the variance of gradient estimation,
thus hastening convergence.
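To make this concrete, the display below sketches one plausible form of such a variance-style objective together with the auxiliary scalar that tracks the expected TD error; it is reconstructed only from the facts stated in this paper (namely $\omega \doteq \mathbb{E}[\delta]$) and should be read as an illustrative assumption rather than the paper's formal definition of VBE or VPBE.
% Illustrative sketch (assumption): a variance-of-TD-error style objective and the
% auxiliary scalar \omega that tracks \mathbb{E}[\delta]; here \delta denotes the TD error.
\begin{align}
  \mathbb{E}\big[(\delta - \mathbb{E}[\delta])^2\big]
    &= \mathbb{E}[\delta^2] - \big(\mathbb{E}[\delta]\big)^2, \nonumber\\
  \omega_{k+1} &= \omega_k + \beta_k\,(\delta_k - \omega_k),
    \qquad \omega_k \to \mathbb{E}[\delta]. \nonumber
\end{align}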
The contributions of this paper are as follows:
(1) Introduction of novel objective functions based on
the invariance of the optimal policy.
(2) Proposal of two off-policy variance minimization algorithms.
(3) Proof of their convergence.
(4) Experiments demonstrating the faster convergence speed of the proposed algorithms.
\resizebox{5cm}{3cm}{
\begin{tikzpicture}[smooth]
\node[coordinate] (origin) at (0.3,0) {};
\node[coordinate] (num7) at (3,0) {};
\node[coordinate] (num1) at (1,2.5) {};
\path (num7) ++ (-10:0.5cm) node (num7_bright1) [coordinate] {};
\path (num7) ++ (-30:0.7cm) node (num7_bright2) [coordinate] {};
\path (num7) ++ (-60:0.35cm) node (num7_bright3) [coordinate] {};
\path (num7) ++ (-60:0.6cm) node (num7_bright4) [coordinate] {};
\path (origin) ++ (90:3cm) node (origin_above) [coordinate] {};
\path (origin_above) ++ (0:5.7cm) node (origin_aright) [coordinate] {};
\path (num1) ++ (90:0.5cm) node (num1_a) [coordinate] {};
\path (num1) ++ (-90:0.3cm) node (num1_b) [coordinate] {};
\path (num1) ++ (0:1cm) node (num2) [coordinate] {};
\path (num1_a) ++ (0:1cm) node (num2_a) [coordinate] {};
\path (num1_b) ++ (0:1cm) node (num2_b) [coordinate] {};
\path (num2) ++ (0:1cm) node (num3) [coordinate] {};
\path (num2_a) ++ (0:1cm) node (num3_a) [coordinate] {};
\path (num2_b) ++ (0:1cm) node (num3_b) [coordinate] {};
\path (num3) ++ (0:1cm) node (num4) [coordinate] {};
\path (num3_a) ++ (0:1cm) node (num4_a) [coordinate] {};
\path (num3_b) ++ (0:1cm) node (num4_b) [coordinate] {};
\path (num4) ++ (0:1cm) node (num5) [coordinate] {};
\path (num4_a) ++ (0:1cm) node (num5_a) [coordinate] {};
\path (num4_b) ++ (0:1cm) node (num5_b) [coordinate] {};
\path (num5) ++ (0:1cm) node (num6) [coordinate] {};
\path (num5_a) ++ (0:1cm) node (num6_a) [coordinate] {};
\path (num5_b) ++ (0:1cm) node (num6_b) [coordinate] {};
%\draw[->](0,0) -- (1,1);
%\draw[dashed,line width = 0.03cm] (0,0) -- (1,1);
%\fill (0.5,0.5) circle (0.5);
%\draw[shape=circle,fill=white,draw=black] (a) at (num7) {7};
\draw[dashed,line width = 0.03cm,xshift=3cm] plot[tension=0.06]
coordinates{(num7) (origin) (origin_above) (origin_aright)};
\draw[->,>=stealth,line width = 0.02cm,xshift=3cm] plot[tension=0.5]
coordinates{(num7) (num7_bright1) (num7_bright2)(num7_bright4) (num7_bright3)};
\node[line width = 0.02cm,shape=circle,fill=white,draw=black] (g) at (num7) {7};
\draw[<->,>=stealth,dashed,line width = 0.03cm,] (num1) -- (num1_a) ;
\node[line width = 0.02cm,shape=circle,fill=white,draw=black] (a) at (num1_b) {1};
\draw[<->,>=stealth,dashed,line width = 0.03cm,] (num2) -- (num2_a) ;
\node[line width = 0.02cm,shape=circle,fill=white,draw=black] (b) at (num2_b) {2};
\draw[<->,>=stealth,dashed,line width = 0.03cm,] (num3) -- (num3_a) ;
\node[line width = 0.02cm,shape=circle,fill=white,draw=black] (c) at (num3_b) {3};
\draw[<->,>=stealth,dashed,line width = 0.03cm,] (num4) -- (num4_a) ;
\node[line width = 0.02cm,shape=circle,fill=white,draw=black] (d) at (num4_b) {4};
\draw[<->,>=stealth,dashed,line width = 0.03cm,] (num5) -- (num5_a) ;
\node[line width = 0.02cm,shape=circle,fill=white,draw=black] (e) at (num5_b) {5};
\draw[<->,>=stealth,dashed,line width = 0.03cm,] (num6) -- (num6_a) ;
\node[line width = 0.02cm,shape=circle,fill=white,draw=black] (f) at (num6_b) {6};
\draw[->,>=stealth,line width = 0.02cm] (a)--(g);
\draw[->,>=stealth,line width = 0.02cm] (b)--(g);
\draw[->,>=stealth,line width = 0.02cm] (c)--(g);
\draw[->,>=stealth,line width = 0.02cm] (d)--(g);
\draw[->,>=stealth,line width = 0.02cm] (e)--(g);
\draw[->,>=stealth,line width = 0.02cm] (f)--(g);
\end{tikzpicture}
}
% \tikzstyle{int}=[draw, fill=blue!20, minimum size=2em]
% \tikzstyle{block}=[draw, fill=gray, minimum size=1.5em]
% \tikzstyle{init} = [pin edge={to-,thin,black}]
% \resizebox{8cm}{1.2cm}{
% \begin{tikzpicture}[node distance=1.5cm,auto,>=latex']
% \node [block] (o) {};
% \node (p) [left of=o,node distance=0.5cm, coordinate] {o};
% \node [shape=circle,int] (a) [right of=o]{$A$};
% \node (b) [left of=a,node distance=1.5cm, coordinate] {a};
% \node [shape=circle,int] (c) [right of=a] {$B$};
% \node (d) [left of=c,node distance=1.5cm, coordinate] {c};
% \node [shape=circle,int, pin={[init]above:$$}] (e) [right of=c]{$C$};
% \node (f) [left of=e,node distance=1.5cm, coordinate] {e};
% \node [shape=circle,int] (g) [right of=e] {$D$};
% \node (h) [left of=g,node distance=1.5cm, coordinate] {g};
% \node [shape=circle,int] (i) [right of=g] {$E$};
% \node (j) [left of=i,node distance=1.5cm, coordinate] {i};
% \node [block] (k) [right of=i] {};
% \node (l) [left of=k,node distance=0.5cm, coordinate] {k};
% \path[<-] (o) edge node {$0$} (a);
% \path[<->] (a) edge node {$0$} (c);
% \path[<->] (c) edge node {$0$} (e);
% \path[<->] (e) edge node {$0$} (g);
% \path[<->] (g) edge node {$0$} (i);
% \draw[->] (i) edge node {$1$} (k);
% \end{tikzpicture}
% }
\tikzstyle{int}=[draw, fill=blue!20, minimum size=2em]
\tikzstyle{block}=[draw, fill=gray, minimum size=1.5em]
\tikzstyle{init} = [pin edge={to-,thin,black}]
\resizebox{5cm}{1cm}{
\begin{tikzpicture}[node distance=1.5cm, auto, >=latex]
\node [block] (o) {};
\node (p) [left of=o, node distance=0.5cm, coordinate] {o};
\node [shape=circle, int] (a) [right of=o] {$A$};
\node (b) [left of=a, node distance=1.5cm, coordinate] {a};
\node [shape=circle, int] (c) [right of=a] {$B$};
\node (d) [left of=c, node distance=1.5cm, coordinate] {c};
\node [shape=circle, int, pin={[init]above:$ $}] (e) [right of=c] {$C$};
\node (f) [left of=e, node distance=1.5cm, coordinate] {e};
\node [shape=circle, int] (g) [right of=e] {$D$};
\node (h) [left of=g, node distance=1.5cm, coordinate] {g};
\node [shape=circle, int] (i) [right of=g] {$E$};
\node (j) [left of=i, node distance=1.5cm, coordinate] {i};
\node [block] (k) [right of=i] {};
\node (l) [left of=k, node distance=0.5cm, coordinate] {k};
\path[->] (o) edge node {$0$} (a);
\path[<->] (a) edge node {$0$} (c);
\path[<->] (c) edge node {$0$} (e);
\path[<->] (e) edge node {$0$} (g);
\path[<->] (g) edge node {$0$} (i);
\draw[->] (i) edge node {$1$} (k);
\end{tikzpicture}
}
\ No newline at end of file
\section{Related Work}
\subsection{Difference between VMQ and R-learning}
Tabular VMQ's update formula bears some resemblance
to R-learning's update formula. As shown in Table \ref{differenceRandVMQ}, the update formulas of the two algorithms have the following differences:
\\(1) The goal of the R-learning algorithm \cite{schwartz1993reinforcement} is to maximize the average
reward, rather than the cumulative reward, by learning an estimate
of the average reward. This estimate $m$ is then used to update the Q-values.
In contrast, the $\omega$ in the tabular VMQ update formula eventually converges to $\mathbb{E}[\delta]$.
\\(2) When $\gamma=1$ in the tabular VMQ update formula, the
R-learning update formula is formally
the same as the tabular VMQ update formula.
Therefore, the R-learning algorithm can be
considered, in form, a special case of the VMQ algorithm.
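For concreteness, the display below puts the two update rules side by side. The R-learning form follows \cite{schwartz1993reinforcement} (with the usual convention that the average-reward estimate $m$ is updated only after greedy actions), while the tabular VMQ form is reconstructed purely from the properties stated above ($\omega$ tracks $\mathbb{E}[\delta]$, and the two rules coincide in form when $\gamma=1$); it is therefore an assumption, not the paper's exact pseudocode.
% Side-by-side sketch: standard R-learning vs. a tabular VMQ update reconstructed
% from the description above (assumption). Here \delta is the TD error.
\begin{align}
  &\text{R-learning:} && Q(s,a) \leftarrow Q(s,a) + \alpha\big[r - m + \max_{a'}Q(s',a') - Q(s,a)\big], \nonumber\\
  &                   && m \leftarrow m + \beta\big[r - m + \max_{a'}Q(s',a') - \max_{a}Q(s,a)\big]; \nonumber\\
  &\text{tabular VMQ:} && Q(s,a) \leftarrow Q(s,a) + \alpha\,(\delta - \omega),
      \quad \delta \doteq r + \gamma\max_{a'}Q(s',a') - Q(s,a), \nonumber\\
  &                   && \omega \leftarrow \omega + \beta\,(\delta - \omega). \nonumber
\end{align}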
\subsection{Variance Reduction for TD Learning}
The TD with centering (CTD) algorithm \cite{korda2015td}
directly applies variance reduction techniques to
the TD algorithm. The CTD algorithm updates its parameters using the
average gradient of a batch of Markovian samples and a projection operator.
Unfortunately, the authors' analysis of the CTD algorithm contains technical
errors. The VRTD algorithm \cite{xu2019reanalysis} is also a variance-reduced algorithm that updates
its parameters using the average gradient of a batch of i.i.d. samples. The
authors of VRTD provide a technically sound analysis to demonstrate the
advantages of variance reduction.
\subsection{Variance Reduction for Policy Gradient Algorithms}
Policy gradient algorithms are a class of reinforcement
learning algorithms that directly optimize cumulative rewards.
REINFORCE is a Monte Carlo algorithm that estimates
gradients through sampling, but may have high variance.
Baselines are introduced to reduce variance and to
accelerate learning \cite{Sutton2018book}. In Actor-Critic methods,
the value function is used as a baseline and bootstrapping
is employed to reduce variance, which also accelerates convergence \cite{Sutton2018book}.
TRPO \cite{schulman2015trust} and PPO \cite{schulman2017proximal}
use generalized advantage
estimation, which combines multi-step bootstrapping and Monte Carlo
estimation to reduce variance, making gradient estimation more stable and
accelerating convergence.
In Variance Minimization,
the incorporation of $\omega \doteq \mathbb{E}[\delta]$
bears a striking resemblance to the use of a baseline
in policy gradient methods. The introduction of a baseline
in policy gradient techniques does not alter
the expected value of the update;
rather, it significantly impacts the variance of gradient estimation.
The addition of $\omega \doteq \mathbb{E}[\delta]$ in Variance Minimization
preserves the invariance of the optimal
policy while stabilizing gradient estimation,
reducing the variance of gradient estimation,
and hastening convergence.
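The baseline analogy rests on a standard identity, reproduced below for clarity; the baseline notation $b(s)$ is introduced here only for illustration and is not the paper's.
% Standard identity: a state-dependent baseline does not change the expected policy
% gradient, because the score function has zero mean under the policy.
\begin{equation*}
  \mathbb{E}_{a\sim\pi_{\bm{\theta}}(\cdot\mid s)}\!\left[\nabla_{\bm{\theta}}\log\pi_{\bm{\theta}}(a\mid s)\,b(s)\right]
  = b(s)\,\nabla_{\bm{\theta}}\sum_{a}\pi_{\bm{\theta}}(a\mid s)
  = b(s)\,\nabla_{\bm{\theta}}1 = \bm{0}.
\end{equation*}
Hence replacing the return $G_t$ by $G_t - b(S_t)$ changes only the variance of the gradient estimate, which mirrors the role that subtracting $\omega \doteq \mathbb{E}[\delta]$ from $\delta$ plays in the VM updates described above.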
\ No newline at end of file
Title: A Variance Minimization Approach to Off-policy Temporal-Difference Learning
Abstract:
In this paper, we introduce the concept of improving the performance of parametric Temporal-Difference (TD) learning algorithms by a variance minimization parameter, which is dynamically updated at each time step. Specifically, we incorporate the variance minimization parameter into off-policy algorithms such as TDC and ETD, resulting in the Variance Minimization TDC algorithm and the Variance Minimization ETD algorithm. We analyze the convergence speed of these algorithms by calculating the minimum eigenvalue of the key matrices in a specific evaluation environment and validate the results through experiments. In the control experiments, the variance minimization algorithms demonstrate superior performance.
Background:
1) Briefly introduce how to go from on-policy to off-policy
Compute the minimum eigenvalue for the two-state example
2) Introduce TDC and ETD
Off-policy TD diverges
TDC resolves the divergence from the MSPBE perspective
ETD resolves the divergence by constructing a positive-definite matrix
Compute the minimum eigenvalue for the two-state example
3) Cite the result on the minimum eigenvalue of the key matrix
Show that ETD is the fastest
How to find algorithms that converge faster
End
On-policy and off-policy algorithms are a hot topic in current research. Off-policy algorithms in particular are more challenging to study because their convergence is harder to guarantee. The main difference between the two is that, in on-policy algorithms, the behavior policy is the same as the target policy during learning, and the algorithm generates data directly from the current policy and optimizes against it. In off-policy algorithms, the behavior policy differs from the target policy, and the algorithm uses data generated by the behavior policy to optimize the target policy, which brings higher sample efficiency but also more complicated stability issues.
Taking the TD(0) algorithm as an example helps to illustrate the different behaviors of on-policy and off-policy learning:
In on-policy TD(0), the behavior policy and the target policy are identical. The algorithm updates its value estimates with data generated by the current policy. Because the behavior policy matches the target policy, the convergence of TD(0) is comparatively well guaranteed. Every update is based on the actual behavior of the current policy, so the value-function estimate gradually converges to the true value of the target policy.
The on-policy TD(0) update formula is as follows:
(on-policy TD(0) update formula)
(give the derivation of the key matrix A for on-policy TD(0))
From a mathematical point of view, if the row sums plus the column sums are greater than 0, then A is positive definite, i.e., on-policy TD(0) converges stably.
In off-policy TD(0), the behavior policy differs from the target policy. The algorithm uses data generated by the behavior policy to estimate the value of the target policy. Because of this mismatch, TD(0) faces additional challenges:
Mismatched data distribution: the data generated by the behavior policy follows a different distribution from that of the target policy, which leads to estimation bias and increased variance, and thus affects convergence.
High variance of importance sampling: off-policy TD(0) usually requires importance sampling to correct the data distribution. If the importance-sampling weights become too large, the variance can be high, making it hard for the algorithm to converge.
The off-policy TD(0) update formula is as follows:
(off-policy TD(0) update formula)
(give the derivation of the key matrix A for off-policy TD(0))
From a mathematical point of view, if the row sums plus the column sums are less than 0, then A is not positive definite, i.e., the convergence of off-policy TD(0) is unstable.
In the 2-state counterexample, A = [-0.2].
TDC and ETD are two well-known off-policy algorithms: the former is derived from the MSPBE objective, while the latter uses a technique that turns the non-positive-definite A of off-policy TD(0) into a positive-definite matrix, so that the algorithm converges in the off-policy setting.
The MSPBE with importance sampling is as follows:
(formula for the MSPBE with importance sampling; see the PhD thesis)
The TDC update with importance sampling is as follows:
(TDC update formula with importance sampling; see the PhD thesis)
(give the derivation of the key matrix A for TDC)
In the 2-state counterexample, A = [0.016].
The ETD update formula is as follows:
(ETD update formula)
(give the derivation of the key matrix A for ETD)
From a mathematical point of view, the row sums plus the column sums are greater than 0, so A is positive definite, i.e., ETD converges stably.
In the 2-state counterexample, A = [??].
By the theorem stated in Chen's arXiv paper, the convergence speed of an algorithm is related to its matrix A: the larger the minimum eigenvalue of A, the faster the convergence.
Since ETD's matrix A has the largest minimum eigenvalue in the 2-state example, it converges fastest. Based on this theorem, can we derive algorithms whose matrix A has a larger minimum eigenvalue?
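As a quick sanity check of these numbers, the sketch below computes the key matrix A = Phi^T D_mu (I - gamma * P_pi) Phi and its minimum eigenvalue for the classical 2-state counterexample. The concrete modelling choices (features phi(s1)=1 and phi(s2)=2, a target policy that always moves to state 2, a uniform behavior-policy state distribution, gamma = 0.9) are assumptions made only for illustration; with them, off-policy TD(0) reproduces the A = [-0.2] quoted above.

import numpy as np

# Assumed classical 2-state counterexample ("theta -> 2*theta"): one feature per state.
Phi = np.array([[1.0],
                [2.0]])                      # feature matrix, shape (states, features)
P_pi = np.array([[0.0, 1.0],                 # target policy: always transition to state 2
                 [0.0, 1.0]])
gamma = 0.9
D_mu = np.diag([0.5, 0.5])                   # behavior-policy state distribution (assumed uniform)

# Key matrix of off-policy TD(0): A = Phi^T D_mu (I - gamma * P_pi) Phi
A = Phi.T @ D_mu @ (np.eye(2) - gamma * P_pi) @ Phi

# Minimum eigenvalue of the symmetric part of A (equal to A itself in this 1x1 case).
min_eig = float(np.min(np.linalg.eigvalsh((A + A.T) / 2)))

print("A =", A)                              # [[-0.2]]
print("minimum eigenvalue =", min_eig)       # -0.2 < 0, so off-policy TD(0) may diverge

Swapping in the key matrices of TDC, ETD, VMTDC, and VMETD in the same way would populate the summary table of minimum eigenvalues mentioned below.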
Variance Minimization Algorithms
All of the objectives above are error minimization;
we propose variance minimization instead.
Remove the control algorithms
(they can be mentioned briefly in the experiment section);
the pseudocode of the algorithms may also need to be removed.
We need the derivation of A, instantiated with the concrete computation for the 2-state example,
and a table summarizing the minimum eigenvalues of the four algorithms.
To derive an algorithm whose matrix A has a larger minimum eigenvalue, it is necessary to propose a new objective function.
The objective functions mentioned in the Introduction are all some form
of error. Is minimizing error the only option for value-based
reinforcement learning?
Based on this observation, we propose alternate objective functions instead of minimizing errors. We minimize
the Variance of Projected Bellman Error (VPBE) and derive the VMTDC algorithm.
We then apply the same idea to ETD, obtaining the VMETD algorithm.
Table summary: Table 1 shows the minimum eigenvalues of the key matrix A for the four algorithms. VMTDC's minimum eigenvalue is larger than TDC's, and the experiments later in the paper confirm that VMTDC converges faster than TDC. ETD's minimum eigenvalue is larger than VMETD's, but the later experiments show that VMETD converges more stably than ETD.
Theoretical Analysis
1) Convergence of VMTDC: state the proof briefly, with the details in the appendix.
2) Convergence of VMETD: state the proof briefly, with the details in the appendix.
3) Policy invariance.
Experiments:
Evaluation: 2-state, 7-state.
Control: Maze, CliffWalking, MountainCar, Acrobot.
Detailed descriptions of the experimental environments: placed in the appendix.
Reproducibility Checklist
Copy it over from the NeurIPS template.
Conclusion:
This paper mainly studies off-policy algorithms under linear function approximation. Based on Chen's theorem that the larger the minimum eigenvalue of an algorithm's key matrix, the faster the algorithm converges, we propose a new objective function, the VPBE, and derive VMTDC. VMTDC has one more dynamically updated scalar parameter than TDC, and we bring the same idea to ETD to obtain VMETD. Numerical analysis and experiments show that the VM algorithms achieve better performance or more stable convergence.
Future work may include, but is not limited
to, (1) introducing the scalar parameter into more TD algorithms;
(2) extensions of VMTDC and VMETD to multi-step returns; and
(3) extensions to nonlinear approximations, such as neural networks.
For the evaluation experiments, the results agree with the analysis above. In the 2-state counterexample environment, TDC's key matrix has the smallest minimum eigenvalue and therefore the slowest convergence; in contrast, VMTDC's minimum eigenvalue is larger, so it converges faster. Although VMETD's minimum eigenvalue is larger than ETD's, so that it converges more slowly than ETD in the 2-state counterexample, VMETD's shaded region (standard deviation) is smaller than ETD's, meaning that VMETD converges more smoothly.
For the control experiments, the results for the maze and cliff walking are similar: VMGQ outperforms GQ, EQ outperforms VMGQ, and VMEQ performs best.
The results for mountain car and Acrobot are similar: VMGQ and VMEQ perform comparably and both outperform GQ and EQ. In summary, for the control experiments, the VM algorithms outperform the non-VM algorithms.