Info
Shenzhen, Futian District, Jingtian Road, China Tea Palace, 6th Floor
Markov Decision Process (MDP)
State Space Form, with Controller
Consider a state-space system with perturbation
$$x_{t+1}=f(x_t,u_t,w_t),$$
where $w_t\sim P$.
The system $x_{t+1}=f(x_t,u_t,w_t)$ is equivalent to the stochastic system $x_{t+1}\sim D(x_t,u_t)$, where $D$ is the induced transition kernel.
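A minimal sketch of this equivalence, using a hypothetical scalar system (not from the notes): $f(x,u,w)=x+u+w$ with $w$ uniform on $\{-1,+1\}$ induces a transition kernel $D$, and the empirical next-state frequencies of the noise-driven form match it.

```python
import random

# Hypothetical scalar example: f(x, u, w) = x + u + w,
# with perturbation w drawn uniformly from {-1, +1}.
def f(x, u, w):
    return x + u + w

def step_noise_form(x, u):
    """Sample the next state by drawing the perturbation w ~ P."""
    w = random.choice([-1, +1])
    return f(x, u, w)

def D(x, u):
    """Induced transition kernel: a distribution over next states."""
    return {f(x, u, -1): 0.5, f(x, u, +1): 0.5}

random.seed(0)
samples = [step_noise_form(0, 1) for _ in range(10_000)]
freq = {s: samples.count(s) / len(samples) for s in set(samples)}
# freq is approximately D(0, 1) = {0: 0.5, 2: 0.5}
```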
MDP
Assumption. $|S|,|A|< \infty$
An infinite-horizon Markov Decision Process is a tuple $(S,A,r,\gamma,\mathbb P)$, where $S$ is the state space, $A$ is the action space, $r: S\times A\rightarrow [0,1]$ is the reward function, $\gamma\in(0,1)$ is the discount factor, and $\mathbb P: S\times A\rightarrow \Delta(S)$ is the transition kernel (a.k.a. model/dynamics).
State-Value Function
$V^\pi:S\rightarrow \left[0,\tfrac{1}{1-\gamma}\right]$, defined as
$$V^\pi(s)\triangleq \mathbb E\left[\sum_{t=1}^{\infty}\gamma^{t-1} r(s_t,a_t)\ \middle|\ s_1=s,\ a_t\sim\pi_t(s_t)\right],$$
where $\pi = (\pi_1,\pi_2,\cdots)$ is a (possibly nonstationary) policy.
State-Action Value Function
$Q^\pi(s,a) \triangleq r(s,a) + \gamma\mathbb E_{s'\sim\mathbb P(s,a)}\left[V^\pi(s')\right]$
Value Iteration
$\{Q_1,Q_2,\cdots,Q_\infty\}$. Goal: $Q_\infty=Q^\star$, where $Q^\star\triangleq Q^{\pi^\star}$ and $Q^\star(s,a)\triangleq r(s,a) + \gamma\mathbb E_{s'\sim\mathbb P(s,a)}\left[V^{\pi^\star}(s')\right]$.
Bellman Optimality Operator
$\text T^\star: Q\rightarrow Q$, defined as
$$(\text T^\star Q)(s,a)\triangleq r(s,a)+\gamma\,\mathbb E_{s'\sim\mathbb P(s,a)}\left[\max_{a'\in A} Q(s',a')\right]$$
for all $(s,a)\in S\times A$.
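The operator can be sketched in code. Below is a minimal value-iteration example on a made-up two-state, two-action MDP (all numbers are assumptions for illustration), checking numerically that the iterates approach a fixed point of $\text T^\star$:

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative numbers): r[s, a] in [0, 1],
# P[s, a] is a probability distribution over next states s'.
r = np.array([[0.0, 1.0],
              [0.5, 0.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [1.0, 0.0]]])
gamma = 0.9

def bellman_optimality(Q):
    """(T* Q)(s, a) = r(s, a) + gamma * E_{s'~P(s,a)}[max_a' Q(s', a')]."""
    V = Q.max(axis=1)            # V(s') = max_a' Q(s', a')
    return r + gamma * (P @ V)   # P @ V gives E_{s'}[V(s')] for each (s, a)

Q = np.zeros_like(r)
for _ in range(1000):
    Q = bellman_optimality(Q)

# At convergence, Q is (numerically) a fixed point of T*.
assert np.allclose(Q, bellman_optimality(Q), atol=1e-6)
pi_star = Q.argmax(axis=1)       # greedy (stationary, deterministic) policy
```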
Theorem. $Q_\infty$ is the fixed point of $\text T^\star$, i.e., $Q_\infty=\text T^\star Q_\infty$
Proof sketch: $\text T^\star$ is a contraction mapping on the Banach space of $Q$-functions, so the Banach Fixed Point Theorem applies.
Appendix
$r(s,a) = \int R(s,a)d D(s,a) = \mathbb E(R(s,a))$
$\Delta_n$: Probability simplex
$\Delta_2 = \{x\in \mathbb R^3 : \mathbb 1^T x=1,x\succeq 0\}$
Policy: $\pi: S\rightarrow \Delta(A)$
Deterministic policy $\pi:S\rightarrow A$
Nonstationary policy $\pi: S\times T\rightarrow \Delta(A)$
Optimal policy $\pi^\star=\mathop{\arg\max}\limits_{\pi} V^\pi(s_1)$
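The policy classes above can be encoded concretely. A minimal sketch (the encodings and numbers are illustrative assumptions): a stochastic policy $\pi:S\rightarrow\Delta(A)$ as a row-stochastic matrix whose rows lie in the probability simplex, and a deterministic policy $\pi:S\rightarrow A$ as its one-hot special case.

```python
import numpy as np

# Illustrative encodings for |S| = 3 states and |A| = 2 actions.

# Stochastic policy pi: S -> Delta(A), stored as a row-stochastic matrix:
# each row is a point of the probability simplex Delta(A).
pi_stochastic = np.array([[0.5, 0.5],
                          [1.0, 0.0],
                          [0.2, 0.8]])
assert np.allclose(pi_stochastic.sum(axis=1), 1.0)  # rows sum to 1
assert np.all(pi_stochastic >= 0)                   # nonnegative entries

# Deterministic policy pi: S -> A, stored as one action index per state.
pi_deterministic = np.array([1, 0, 1])

# A deterministic policy is the special case whose rows are one-hot.
as_stochastic = np.eye(2)[pi_deterministic]
assert np.allclose(as_stochastic.sum(axis=1), 1.0)
```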
Banach space
- Complete: Any Cauchy Sequence converges
- Cauchy Sequence: $\{x_k\}_{k=1}^\infty$ such that $\forall \epsilon>0\ \exists N$ s.t. $\forall m,n>N$, $\|x_m-x_n\|<\epsilon$
- Normed: a norm $\|\cdot\|$ is defined, which induces the metric $d(x,y)=\|x-y\|$
- Metric: distance between two points; norm: distance to the origin $O$
- Vector space
Contraction mapping
$\text T: V\rightarrow V$ such that
$\|\text T v_1-\text T v_2\|\le\gamma\|v_1-v_2\|$ for all $v_1,v_2\in V$, for some $\gamma\in(0,1)$.
Theorem. Banach Space Fixed Point Theorem
If $\text T$ is a contraction mapping on a Banach space $\mathscr B$, then $\text T$ has a unique fixed point $t_{fix}\in\mathscr B$, and
$\forall x\in \mathscr B,\ \lim_{n\rightarrow\infty}\text T^n x=t_{fix}$,
where $\text T t_{fix}=t_{fix}$.
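A minimal numerical illustration of the theorem, using an assumed contraction $T(x)=\gamma x + c$ on $(\mathbb R, |\cdot|)$: since $|T(x)-T(y)|=\gamma|x-y|$ with $\gamma<1$, iterating $T$ from any starting point converges to the unique fixed point $c/(1-\gamma)$.

```python
# Illustrative contraction (assumed for this demo): T(x) = gamma * x + c
# with gamma in (0, 1); its unique fixed point is c / (1 - gamma).
gamma, c = 0.5, 1.0

def T(x):
    return gamma * x + c

for x0 in (-100.0, 0.0, 42.0):
    x = x0
    for _ in range(200):          # T^200 x0, close enough to the limit
        x = T(x)
    assert abs(x - c / (1 - gamma)) < 1e-9  # converges to 2.0 from any start

assert abs(T(2.0) - 2.0) < 1e-12  # T(t_fix) = t_fix
```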
Assignment
- Prove that the system $x_{t+1}=f(x_t,u_t,w_t)$ with $w_t\sim P$ is equivalent to the stochastic system $x_{t+1}\sim D(x_t,u_t)$.
- Prove that an infinite horizon MDP with a stationary policy $\pi$ is a Markov process. (Hint: prove $s'\sim\mathbb P(s,\pi(s))$ is Markovian.)
- Prove that for an infinite horizon MDP with discount factor $\gamma\in (0,1)$, the optimal policy is stationary and deterministic, i.e., $\pi^\star = (\pi^\star,\pi^\star,\cdots,\pi^\star)$ with $\pi^\star:S\rightarrow A$.
- State and prove the Banach Space Fixed Point Theorem.
- Prove that $Q_\infty=\text T^\star Q_\infty$ using the Banach Space Fixed Point Theorem. (Hint: prove that $\text T^\star$ is a contraction mapping.)
Epilogue
Unfortunately, the seminar was discontinued, and Yansong Li returned to his university abroad.