XingguoChen / 20240414IEEETG · Commits

Commit 00d7d8d4
authored Jun 03, 2024 by Lenovo
parent 2e608808

Commit message: Ready for submission ("准备投稿了")

Showing 7 changed files with 36 additions and 34 deletions:

  document.tex            +4  -3
  main/2048isAcyclic.tex  +5  -5
  main/acyclic.tex        +2  -2
  main/background.tex     +8  -8
  main/biography.tex      +4  -3
  main/discussion.tex     +4  -4
  main/introduction.tex   +9  -9
document.tex
@@ -71,12 +71,13 @@ In reinforcement learning of 2048 game,
 we are intrigued by the absence of successful cases
 involving explicit exploration, e.g., $\epsilon-$greedy,
 softmax.
-Through experiments comparing 2048 game and maze,
+Through experiments comparing the 2048 game and maze,
 we argue that explicit exploration strategies
 cannot be effectively combined to learn in the 2048 game,
 and demonstrate the acyclic nature of the 2048 game.
-The successful experiences in 2048 game AI
-will contribute to solving acyclic MDPs.
+The successful experiences in the 2048 game AI
+will contribute to solving acyclic MDPs and
+MDPs with acyclic structures.
 \end{abstract}
 \begin{IEEEkeywords}
main/2048isAcyclic.tex
@@ -15,12 +15,12 @@ The 2048 game consists of a 4$\times$4 grid board, totaling 16 squares.
 a tile with the sum of the original numbers.
 Each tile can only participate in one merge operation per move.
 After each move, a new tile appears on a random empty square.
-The new tile is 2 with
-probability 0.1, and 4 with probability 0.9.
+The new tile is 2 with probability 0.1, and 4 with probability 0.9.
 The game ends when all squares are filled, and no valid merge operations can be made.
 \begin{theorem}
-2048 game is acyclic between non-absorbing states.
+The 2048 game is acyclic between non-absorbing states.
 \end{theorem}
 \begin{IEEEproof}
 To apply Theorem \ref{judgmentTheorem}, what we need

@@ -64,7 +64,7 @@ u(B) = 2^{64}\cdot sum(B)+long(B).
 It is easy to verify that $\forall B_1, B_2 \in \mathcal{B}$,
 if $B_1 \neq B_2$, then $u(B_1) \neq u(B_2)$.
-For all possible board,
+For all possible boards,
 $\forall B \in \mathcal{B}$, calculate the utility value
 $u(B)$, and sort $B$ by $u(B)$ in ascending order.
 Let $I(B)$ be the index of the board $B$ after sorting,

@@ -77,7 +77,7 @@ For all possible board,
 For any transition $\langle B_1, a, B_1', B_2 \rangle$ in the 2048 game,
 we have $sum(B_1)=sum(B_1')$ regardless of whether at least two tiles merge.
-Due to a new generated 2-tile or 4-tile in board $B_2$,
+Due to a newly generated 2-tile or 4-tile in board $B_2$,
 $sum(B_2)>sum(B_1')$, that is $sum(B_2)>sum(B_1)$.
 Figure \ref{2048merge1} and Figure \ref{2048merge2} show transition
 examples with and without a merge.

@@ -103,7 +103,7 @@ examples with and without a merge.
 Based on (\ref{size}) and (\ref{utility}),
 we have $u(B_2)>u(B_1)$.
 That means $I(B_2)>I(B_1)$.
-The transition probability between non-absorbing state satisifies (\ref{condition}),
+The transition probability between non-absorbing state satisfies (\ref{condition}),
 the claim follows by applying Theorem \ref{judgmentTheorem}.
 \end{IEEEproof}
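The proof's ordering argument is easy to sanity-check numerically. Below is a minimal Python sketch, not the paper's code: it assumes `long(B)` packs the 16 tile exponents into one 64-bit integer (the exact definition sits in the elided context around line 64 of this file), and verifies on one hand-built transition that a merge preserves $sum(B)$ while the spawned tile strictly increases it, so $u(B)$ and hence the sorted index $I(B)$ can only grow.

```python
# Sketch of u(B) = 2^64 * sum(B) + long(B); long(B) is a hypothetical
# packing of the 16 tile exponents, 4 bits each, into a 64-bit integer.

def board_sum(board):                 # board: 16 tile values, 0 for empty
    return sum(board)

def board_long(board):                # assumed encoding, for illustration
    code = 0
    for v in board:
        exp = v.bit_length() - 1 if v else 0   # 2 -> 1, 4 -> 2, ...
        code = (code << 4) | exp
    return code

def utility(board):
    return (board_sum(board) << 64) + board_long(board)

# One transition <B1, a, B1', B2>: sliding/merging keeps the total sum,
# spawning a new tile strictly increases it, so no board can ever repeat.
B1  = [2, 2] + [0] * 14               # before the move
B1p = [4] + [0] * 15                  # afterstate: the two 2-tiles merged
B2  = [4, 2] + [0] * 14               # a new tile spawned on an empty square

assert board_sum(B1) == board_sum(B1p)   # sum(B1) = sum(B1')
assert board_sum(B2) > board_sum(B1)     # sum(B2) > sum(B1)
assert utility(B2) > utility(B1)         # hence I(B2) > I(B1)
```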
main/acyclic.tex
@@ -79,7 +79,7 @@ it is easy to provide a sufficient condition for acyclicity between non-absorbing
 \label{judgmentTheorem}
 Given a Markov chain with absorbing states,
 suppose the size of the non-absorbing states $|S\setminus\{\text{T}\}| \geq 2$.
-If the transition matrix $Q$ between non-absorbing states satifies,
+If the transition matrix $Q$ between non-absorbing states satisfies,
 \begin{equation}
 \forall i,j \in S\setminus\{\text{T}\}, Q_{i,j} = \begin{cases}
 \geq 0, & \text{if } i \leq j; \\

@@ -96,7 +96,7 @@ Furthermore, the sum of two upper triangular matrices
 is still an upper triangular matrix.
 Based on Definition \ref{definitionN}, $N \dot{=} \sum_{i=0}^{\infty} Q^i$,
-the $N$ matrix is product and sum of upper triangular matrices.
+the $N$ matrix is the product and sum of upper triangular matrices.
 Then, the $N$ matrix is an upper triangular matrix.
 The claim now follows based on Definition \ref{definition3}.
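The matrix fact this theorem rests on can be checked directly. A small numpy sketch (the example matrix is mine, chosen to satisfy the theorem's hypothesis, not taken from the paper): for a non-negative upper triangular $Q$ whose rows sum to less than 1, the fundamental matrix $N=\sum_{i=0}^{\infty}Q^i=(I-Q)^{-1}$ is again upper triangular, so no state is reachable from a higher-indexed state.

```python
# If Q (transitions between non-absorbing states) is upper triangular,
# then N = (I - Q)^{-1} is upper triangular too: the reachability
# relation only points "forward", i.e., the chain is acyclic.
import numpy as np

Q = np.array([[0.0, 0.5, 0.3],    # illustrative 3-state example;
              [0.0, 0.0, 0.6],    # row sums < 1, the remaining mass
              [0.0, 0.0, 0.2]])   # flows to the absorbing state

N = np.linalg.inv(np.eye(3) - Q)  # fundamental matrix of Definition definitionN
assert np.allclose(N, np.triu(N)) # N[i, j] = 0 whenever i > j
```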
main/background.tex
 \section{Background}
 \subsection{Ergodicity and Non-ergodicity of Markov Chains}
-Consider Markov decision process (MDP)
+Consider a Markov decision process (MDP)
 $\langle\mathcal{S}$, $\mathcal{A}$, $\mathcal{R}$, $\mathcal{T}$$\rangle$, where
 $\mathcal{S}=\{1, 2, 3, \ldots\}$ is a finite state space, $|\mathcal{S}|=n$,
 $\mathcal{A}$ is an action space,
 $\mathcal{T}: \mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$

@@ -44,7 +44,7 @@ $d_{\pi}(s)>0$.
-This mean all states are reachable under any policy from the
+This means all states are reachable under any policy from the
 current state after sufficiently many steps \cite{majeed2018q}.
 A sufficient condition for this assumption is that
 1 is a simple eigenvalue of the matrix $P_{\pi}$ and

@@ -56,11 +56,11 @@ all other eigenvalues of $P_{\pi}$ are of modulus <1.
 \input{pic/randomWalk}
 Random walk, see Figure \ref{randomwalk},
 is a Markov chain, where
-agent starts from node C,
+the agent starts from node C
 and takes
 a probability of 0.5 to move left or right, until
 reaching the leftmost or rightmost node where it terminates.
 The terminal states are usually called absorbing states.
-The transition probobility matrix
+The transition probability matrix
 of random walk with absorbing states
 $P^{\text{ab}}$ is defined as follows:
 \[

@@ -86,7 +86,7 @@ the distribution $d^{\text{ab}}=\{1$,
 When encountering an absorbing state, we immediately reset and
 transition to the initial states. Figure \ref{randomwalkRestart}
 is random walk with restarts.
-The transition probobility matrix
+The transition probability matrix
 of random walk with restarts
 $P^{\text{restart}}$ is defined as follows:
 \[

@@ -104,7 +104,7 @@ P^{\text{restart}}\dot{=}\begin{array}{c|ccccccc}
 According to (\ref{invariance}),
 the distribution $d^{\text{restart}}=\{0.1$, $0.1$, $0.2$, $0.3$, $0.2$, $0.1\}$.
-Since the probabilities of T, A, B, C, D, E are non-zeros,
+Since the probabilities of T, A, B, C, D, and E are non-zeros,
 random walk with restarts is ergodic.
 \subsection{Ergodicity between non-absorbing states}

@@ -133,7 +133,7 @@ where $Q$ is the matrix of transition probabilities between
 \label{definitionN}
 \end{equation}
 where $I_{n-1}$ is the $(n-1)\times(n-1)$ identity matrix.
-$N$ is a transitive closure, and a reachability relation.
+$N$ is a transitive closure and a reachability relation.
 From state $i$, it is possible to reach state $j$ in an
 expected number of steps $N_{ij}$.
 $N_{ij}=0$ means that state $i$ is not reachable to state $j$.

@@ -198,7 +198,7 @@ N^{\text{ab}}=(I_5-Q^{\text{ab}})^{-1}=\begin{array}{c|ccccc}
 \end{array}
 \]
-Bases on Definition \ref{definition2},
+Based on Definition \ref{definition2},
 random walk with absorbing states
 is ergodic between non-absorbing states.
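The invariant distribution quoted in this file checks out numerically. A short numpy sketch, with $P^{\text{restart}}$ reconstructed from the figure description (my assumptions: states ordered T, A, B, C, D, E; the absorbing state T restarts at the start node C; every interior node moves left or right with probability 0.5):

```python
# Verify that d^restart = {0.1, 0.1, 0.2, 0.3, 0.2, 0.1} is invariant
# under the restart chain, i.e., d P = d, so every state has positive
# long-run probability and the chain is ergodic.
import numpy as np

P = np.array([
    #  T    A    B    C    D    E
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # T: restart at the start node C
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0],  # A: left terminates at T
    [0.0, 0.5, 0.0, 0.5, 0.0, 0.0],  # B
    [0.0, 0.0, 0.5, 0.0, 0.5, 0.0],  # C
    [0.0, 0.0, 0.0, 0.5, 0.0, 0.5],  # D
    [0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # E: right terminates at T
])

d = np.array([0.1, 0.1, 0.2, 0.3, 0.2, 0.1])
assert np.allclose(d @ P, d)         # d is the stationary distribution
```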
main/biography.tex
@@ -35,11 +35,11 @@ Game AI.
 \vspace{-33pt}
 \begin{IEEEbiography}[{\includegraphics[width=1in,height=1.25in,clip,keepaspectratio]{pic/wangwh}}]{Wenhao Wang}
 is a lecturer in the College of
-Electronic Engineering at National University of
+Electronic Engineering at the National University of
 Defense Technology. He received the Ph.D degree
-in Military Science from National University of
+in Military Science from the National University of
 Defense Technology, Hunan, China, in 2023.
 His interested research directions include network security,
-penetration testing, game theory and reinforcement
+penetration testing, game theory, and reinforcement
 learning.
 \end{IEEEbiography}
\ No newline at end of file
main/discussion.tex
 \section{Discussions}
-Maze game is cyclic between non-absorbing states.
+The maze game is cyclic between non-absorbing states.
 When an agent deviates from the optimal
 path during exploration, it can quickly adjust back.
 Therefore, in a maze, a policy
 combined with an $\epsilon$-greedy exploration
 remains an $\epsilon$-greedy policy.
-However, 2048 game is acyclic between non-absorbing states.
+However, the 2048 game is acyclic between non-absorbing states.
 Any choice error caused by exploration will persist until the end of the game.
 Therefore, in the 2048 game, a policy
 combined with an $\epsilon$-greedy exploration

@@ -25,8 +25,8 @@ ROUT with function approximations
 to improve learning \cite{boyan1996learning}.
-In 2017, Matsuzaki point out that
-2048 game has two important unique
+In 2017, Matsuzaki pointed out that
+the 2048 game has two important unique
 characteristics compared with conventional
 board games: (1) ``It has a long sequence of moves'';
 (2) ``The difficulty increases toward the end of the game''
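For reference, the exploration rule this section contrasts across the two games is the standard one; a minimal sketch (illustrative, not the paper's implementation):

```python
# epsilon-greedy: with probability epsilon take a uniformly random
# action, otherwise the greedy one. In a cyclic maze a mistaken random
# move can later be undone; in the acyclic 2048 game it persists to
# the end of the episode, which is the point made above.
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                # explore
    return max(actions, key=lambda a: q_values[a])   # exploit

# e.g. epsilon_greedy({"up": 1.2, "left": 0.4}, ["up", "left"])
```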
main/introduction.tex
@@ -2,10 +2,10 @@
 \IEEEPARstart{G}{ame} 2048 is a popular single-player sliding block puzzle game,
 where the game is played on a 4$\times$4 grid, the player
 can move the tiles in four directions - up, down, left, and right,
-and the objective is to reach 2048 tile or higher tile.
+and the objective is to reach a 2048 tile or higher tile.
 While the game is simple to understand, it requires strategic
 thinking and planning to reach the 2048 tile.
-Natural decision problems in 2048 is proved to be NP-Complete
+Natural decision problems in 2048 are proved to be NP-Complete
 \cite{abdelkader20152048}.
 2048 has gained widespread popularity due to its addictive
 gameplay and simple mechanics, making it a favorite

@@ -20,21 +20,21 @@ to train the 2048 AI, where afterstate values were
 approximated with N-tuple networks \cite{szubert2014temporal}.
 Wu et al. proposed multi-stage TD learning incorporating shallow
 expectimax search to improve the performance \cite{wu2014multi}.
-In 2016, a set of N-tuple network combinations that yield a better performance
-is selected by local greedy search
+In 2016, a set of N-tuple network combinations that yielded a better performance
+was selected by a local greedy search
 \cite{oka2016systematic,matsuzaki2016systematic}.
 In 2017, Ja{\'s}kowski employed temporal coherence learning
-with multi-state weight promotion, redudant encoding
+with multi-state weight promotion, redundant encoding
 and carousel shaping for \cite{jaskowski2017mastering}.
-Matsuzaki developped backward temporal coherence learning
+Matsuzaki developed backward temporal coherence learning
 and restart for fast training \cite{matsuzaki2017developing}.
 In 2022, Guei et al. introduced optimistic initialization
-to encourage exploration for 2048, and improved the learning
+to encourage exploration for 2048, and improve the learning
 quality \cite{guei2021optimistic}.
 The optimistic initialization is equivalent to
 a static potential function in reward shaping
 \cite{wiewiora2003potential,devlin2012dynamic}.
-Besides, neural network approximators are developped
+Besides, neural network approximators are developed
 for training 2048 AI
 \cite{kondo2019playing,matsuzaki2020further,matsuzaki2021developing,bangole2023game}.

@@ -50,7 +50,7 @@ and they found that experiments with $\epsilon$-greedy exploration
 did not improve the performance \cite{szubert2014temporal}.
 We argue that in 2048 AI training, it's not that
-exploaration is not needed, but rather that it cannot be explored with
+exploration is not needed, but rather that it cannot be explored with
 softmax or $\epsilon$-greedy strategies.
 % \begin{figure}