XingguoChen / 20240414IEEETG
Commit cfdf78c5 authored Jun 01, 2024 by Lenovo
The content is still a bit short.
parent 5c6485a5
Showing 3 changed files with 58 additions and 15 deletions (+58 -15):
document.tex: +14 -3
main/acyclic.tex: +4 -3
main/discussion.tex: +40 -9
document.tex
...
...
@@ -26,7 +26,9 @@
\usetikzlibrary{automata, positioning}
\usetikzlibrary{positioning}
\usetikzlibrary{decorations.markings}
\usepackage{cuted}
\usepackage{multicol}
% \usepackage{cuted}
% \usepackage{widetext}
\hyphenation{op-tical net-works semi-conduc-tor IEEE-Xplore}
% updated with editorial comments 8/9/2021
\newcommand{\highlight}[1]{\textcolor{red}{#1}}
...
...
@@ -65,11 +67,20 @@ wangwenhao11@nudt.edu.cn).
\maketitle
\begin{abstract}
In reinforcement learning of the 2048 game, we are intrigued by the absence of
successful cases involving explicit exploration, e.g., $\epsilon$-greedy or
softmax. Through experiments comparing the 2048 game with a maze, we argue
that explicit exploration strategies cannot be effectively combined with
learning in the 2048 game, and we demonstrate the acyclic nature of the 2048
game. The successful experience of 2048 game AI will contribute to solving
acyclic MDPs.
\end{abstract}
\begin{IEEEkeywords}
Acyclicity, 2048 game, ergodicity, backward learning.
\end{IEEEkeywords}
...
...
main/acyclic.tex
...
...
@@ -44,8 +44,8 @@ Q_{\text{bo}}\dot{=}\begin{tiny}\left[ \begin{array}{cccccccccccc}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{array} \right] \end{tiny}
\]
Then,
\begin{strip}
Then, $N_{\text{bo}}$ is as (\ref{nbo}).
\begin{figure*}
\begin{equation}
\begin{split}
N_{\text{bo}} = & (I_{12} - Q_{\text{bo}})^{-1} \\
...
...
@@ -65,8 +65,9 @@ N_{\text{bo}}=&(I_{12}-Q_{\text{bo}})^{-1}\\
\end{array} \right] \end{tiny}
\end{split}
\label{nbo}
\end{equation}
\end{strip}
\end{figure*}
Based on Definition \ref{definition3}, the Boyan chain is acyclic between non-absorbing states.
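For readers checking (\ref{nbo}), a minimal sketch of the fundamental-matrix computation $N = (I - Q)^{-1}$, assuming NumPy; since the 12-state $Q_{\text{bo}}$ is abbreviated in the hunk above, a hypothetical 3-state strictly upper-triangular $Q$ stands in for it.

import numpy as np

# Hypothetical 3-state transient matrix standing in for the abbreviated Q_bo;
# strictly upper triangular, so the chain is acyclic between non-absorbing states.
Q = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])

# Fundamental matrix of an absorbing Markov chain: N[i, j] is the expected
# number of visits to transient state j when starting from transient state i.
N = np.linalg.inv(np.eye(3) - Q)
print(N)  # diagonal entries are all 1: each state is visited at most once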
...
...
main/discussion.tex
\section{Discussions}
\cite{boyan1996learning}
The maze game is cyclic between non-absorbing states. When an agent deviates
from the optimal path during exploration, it can quickly adjust back.
Therefore, in a maze, a policy combined with $\epsilon$-greedy exploration
remains an $\epsilon$-greedy policy. However, the 2048 game is acyclic between
non-absorbing states: any choice error caused by exploration persists until
the end of the game. Therefore, in the 2048 game, a policy combined with
$\epsilon$-greedy exploration is no longer an $\epsilon$-greedy policy. This
is why, in AI training for the 2048 game, explicit exploration strategies such
as $\epsilon$-greedy and softmax do not work; exploration can only be
encouraged through optimistic initialization.
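To make the contrast concrete, a minimal sketch of $\epsilon$-greedy selection (Python; q_values and actions are illustrative names, not from the paper). In a cyclic maze, the rare exploratory action can be undone by later greedy moves; in the acyclic 2048 game the pre-move board can never recur, so one exploratory move redirects the entire remainder of the episode.

import random

def epsilon_greedy(q_values, actions, eps=0.1):
    # With probability eps pick a uniformly random action (explore),
    # otherwise pick the action with the highest estimated value (exploit).
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values[a])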
As early as 1996, for large acyclic domains, Boyan and Moore proposed a
backward algorithm, ROUT, with function approximation to improve learning
\cite{boyan1996learning}. In 2017, Matsuzaki pointed out that the 2048 game
has two important characteristics that distinguish it from conventional board
games: (1) ``It has a long sequence of moves''; (2) ``The difficulty increases
toward the end of the game'' \cite{matsuzaki2017developing}. He then applied
backward learning and restarts to improve learning.
We argue that the acyclic nature of the 2048 game leads to the efficient
performance of backward learning. Finally, MDPs with acyclic structures can
benefit from the algorithmic insights that have led to the success of 2048 AI.
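For concreteness, a minimal sketch of backward learning on one episode (Python; tabular, undiscounted, and not the exact ROUT or Matsuzaki procedure; backward_update, episode, and V are hypothetical names): because an acyclic game visits each state at most once, a single reverse sweep propagates returns from the hard late-game states back toward the start.

def backward_update(episode, V, alpha=0.1):
    # episode: list of (state, reward) pairs in play order, ending at the final move.
    # V: dict mapping state -> estimated value.
    G = 0.0
    for state, reward in reversed(episode):
        G += reward  # undiscounted return from `state` to the end of the game
        V[state] = V.get(state, 0.0) + alpha * (G - V.get(state, 0.0))
    return V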
...
...