Needs supplementary material from Xinwen

Commit 691fa138 authored May 16, 2024 by Lenovo

Showing with 84 additions and 1 deletions

document.tex
+2 -1

main/background.tex
+45 -0

main/theorem.tex
+37 -0

--- a/document.tex
+++ b/document.tex
@@ -69,10 +69,11 @@ wangwenhao11@nudt.edu.cn).
 \end{IEEEkeywords}
+\input{main/background}
 %\input{main/introduction}
 %\input{main/nonergodicity}
 %\input{main/paradox}
-\input{main/theorem}
+%\input{main/theorem}

--- a/main/background.tex
+++ b/main/background.tex
+\section{Background}
+Consider a Markov decision process (MDP)
+$\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{T}\rangle$, where
+$\mathcal{S}=\{1,2,\ldots,n\}$ is a finite state space with $|\mathcal{S}|=n$, $\mathcal{A}$ is an action space,
+$\mathcal{T}:\mathcal{S}\times \mathcal{A}\times \mathcal{S}\rightarrow [0,1]$ 
+is a transition function,
+$\mathcal{R}:\mathcal{S}\times \mathcal{A}\times \mathcal{S}\rightarrow \mathbb{R}$ is a reward function.
+A policy $\pi:\mathcal{S}\times \mathcal{A}\rightarrow [0,1]$
+selects an action $a$ in state $s$ 
+with probability $\pi(a|s)$.
+The state value function under policy $\pi$, denoted $V^{\pi}:\mathcal{S}\rightarrow
+\mathbb{R}$, represents the expected sum of rewards in
+the MDP under policy $\pi$:
+$V^{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}r_t|s_0=s\right]$.
+Given a stationary policy $\pi$, the MDP becomes a Markov chain on the state space
+$\mathcal{S}$ with a transition matrix
+$P^{\pi}\in[0,1]^{n\times n}$, where
+$P^{\pi}(s_1,s_2)=\sum_{a\in \mathcal{A}}\pi(a|s_1)\mathcal{T}(s_1,a,s_2)$
+is the transition probability from $s_1$ to $s_2$, and
+$\sum_{s'\in \mathcal{S}}P^{\pi}(s,s')=1$ for all $s\in \mathcal{S}$.
+A stationary distribution for $P^{\pi}$ is a probability distribution
+$d$ on $\mathcal{S}$ such that
+\begin{equation}
+d^{\top}=d^{\top}P^{\pi}.
+\end{equation}
+That is, for all $s\in \mathcal{S}$, we have
+\begin{equation}
+\sum_{s'\in \mathcal{S}}P^{\pi}(s',s)d(s')=d(s).
+\end{equation}
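+% Added illustration: a hypothetical two-state chain; $p$ and $q$ are not part of the setting above, and pmatrix assumes amsmath is loaded.
+As a small illustration, assume a hypothetical two-state chain with parameters $p,q\in(0,1]$ introduced only for this example:
+\begin{equation}
+P^{\pi}=\begin{pmatrix}1-p & p\\ q & 1-q\end{pmatrix},\qquad
+d=\left(\frac{q}{p+q},\ \frac{p}{p+q}\right)^{\top}.
+\end{equation}
+One can check directly that $d^{\top}P^{\pi}=d^{\top}$; for example, the first component is $\frac{q(1-p)+pq}{p+q}=\frac{q}{p+q}$.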
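+Give the definition of ergodicity for a Markov chain, together with sufficient conditions.
+Use the random-walk example to show that a chain with an absorbing state is not ergodic,
+whereas the reinforcement learning training setting with restarts is ergodic.
+This paper focuses on the ergodicity among the non-absorbing states once the absorbing states are removed.
+Use the St.~Petersburg example to show that it is not ergodic among its non-absorbing states.
+State a theorem and likewise prove that the game 2048 is not ergodic among its non-absorbing states.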
--- a/main/theorem.tex
+++ b/main/theorem.tex
@@ -48,4 +48,41 @@ $\{X_n\}$ is not ergodic:
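 a matrix; its entries need to be defined explicitly, and then, based on the two theorems, it should be derived explicitly whether the two formulas are satisfied}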
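+\section{Discussion with Li Xinwen, 10:49 p.m., May 4, 2024}
+Ergodicity means that every pair of states is mutually reachable, i.e., the multi-step transition probability between any two states is positive, and that the chain has a stationary distribution.
+A Markov chain with absorbing states is not ergodic, because its stationary distribution eventually puts probability 1 on the absorbing states and 0 on the non-absorbing states.
+Our reinforcement learning examples, including the maze, the random walk, and 2048, all contain absorbing states, so none of them is ergodic.
+This does not affect reinforcement learning, however, because every finished game is followed by a restart, so the process starts over from the absorbing state.
+With restarts, the chain is therefore ergodic.
+From the point of view of what we need, however, the real question is whether the non-absorbing states can all reach one another once the absorbing states are removed,
+that is, whether the remaining states are ergodic after the absorbing states are removed. Clearly, the maze and the random walk are ergodic in this sense,
+whereas the St.~Petersburg paradox and 2048 are not.
+By the definition of ergodicity,
+$P$ can be decomposed into the blocks $Q$, $R$, $I$, and $0$, and then $N=(I-Q)^{-1}$ describes the reachability relations among the non-absorbing states:
+if any entry is 0, the corresponding pair of states is not reachable; if all entries are positive, every pair is reachable.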
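+% Added sketch: the standard canonical form of a finite absorbing chain, assuming the non-absorbing states are ordered first and amsmath is loaded for pmatrix.
+One way to write the decomposition explicitly, assuming a finite chain whose non-absorbing states are listed first and from each of which absorption is reachable, is
+\begin{equation}
+P=\begin{pmatrix} Q & R\\ 0 & I \end{pmatrix},\qquad
+N=(I-Q)^{-1}=\sum_{k=0}^{\infty}Q^{k},
+\end{equation}
+where $Q$ collects the transitions among non-absorbing states and $R$ the transitions into absorbing states.
+Since $Q^{k}(i,j)$ is the probability of moving from $i$ to $j$ in exactly $k$ steps without being absorbed, $N(i,j)>0$ holds exactly when $j$ can be reached from $i$ through non-absorbing states.
+The $k=0$ term gives $N(i,i)\geq 1$, so only off-diagonal entries of $N$ can vanish.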
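+\textcolor{red}{If this approach works, Li Xinwen should compare the concepts carefully: should this be called quasi-ergodicity, or is there another established notion?
+The distinction must be made clearly!}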
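+With this, we can compute the matrix $N$ for the random walk and for the St.~Petersburg paradox and check whether they are ergodic.
+The expectation is that every entry is positive for the random walk, whereas the matrix for the St.~Petersburg paradox is upper triangular, with even the diagonal of $Q$ being zero (the diagonal of $N$ itself cannot vanish, since $N=I+Q+Q^{2}+\cdots$).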
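+% Added illustration with two hypothetical small instances (not computed in the original notes); pmatrix assumes amsmath is loaded.
+As a quick sanity check, consider two small hypothetical instances. For a symmetric random walk on $\{0,1,2,3\}$ with absorbing endpoints $0$ and $3$, the non-absorbing states are $\{1,2\}$ and
+\begin{equation}
+Q=\begin{pmatrix}0 & \tfrac12\\ \tfrac12 & 0\end{pmatrix},\qquad
+N=(I-Q)^{-1}=\begin{pmatrix}\tfrac43 & \tfrac23\\ \tfrac23 & \tfrac43\end{pmatrix},
+\end{equation}
+so every entry is positive. For a St.~Petersburg-type chain truncated to three non-absorbing states (a simplifying assumption), in which each state moves to the next with probability $\tfrac12$ and is absorbed otherwise, and the last state is always absorbed,
+\begin{equation}
+Q=\begin{pmatrix}0&\tfrac12&0\\ 0&0&\tfrac12\\ 0&0&0\end{pmatrix},\qquad
+N=I+Q+Q^{2}=\begin{pmatrix}1&\tfrac12&\tfrac14\\ 0&1&\tfrac12\\ 0&0&1\end{pmatrix}.
+\end{equation}
+Here $Q^{3}=0$, so the series terminates, and the zeros below the diagonal of $N$ show that no state can be reached from a later one.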
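+Based on this observation, how can we prove that 2048 is ``non-ergodic''?
+Would it suffice to define the states $i$ and $j$ and the transition probability from $i$ to $j$, and then give a constructive proof?
+Does the matrix again turn out to be upper triangular, with a zero diagonal?
+If so, we would in effect be proposing a sufficient condition for non-ergodicity.
+This seems to be a promising angle for the paper!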