\usepackage{amsmath,amssymb}
\usepackage{pifont}

\usepackage{bigints}
\usepackage{bm}
\usepackage{siunitx}

\newcommand{\hs}[1]{\hspace{#1zw}}
\newcommand{\vs}[1]{\vspace{#1zh}}
\newcommand{\hsh}{\hs{0.5}}
\newcommand{\vsh}{\vs{0.5}}

\newcommand{\eref}[1]{\text{Eq.~\eqref{eq:#1}}}

\newcommand{\Nset}{\mathbf{N}}
\newcommand{\Zset}{\mathbf{Z}}
\newcommand{\Qset}{\mathbf{Q}}
\newcommand{\Rset}{\mathbf{R}}
\newcommand{\Cset}{\mathbf{C}}

\DeclareMathOperator*{\argmax}{\mathrm{arg\,max}}
\DeclareMathOperator*{\argmin}{\mathrm{arg\,min}}

\mathchardef\ordinarycolon\mathcode`\:
\mathcode`\:=\string"8000
\begingroup \catcode`\:=\active
  \gdef:{\mathrel{\mathop\ordinarycolon}}
\endgroup

\newcommand{\cond}[2]{#1\,|\,#2}

\newcommand{\expct}{\mathbb{E}}
\newcommand{\expctpi}{\mathbb{E}_{\pi}}

\newcommand{\Scal}{\mathcal{S}}
\newcommand{\Scalp}{\mathcal{S}^+}
\newcommand{\Acal}{\mathcal{A}}
\newcommand{\Rcal}{\mathcal{R}}

\newcommand{\sums}{\sum_s}
\newcommand{\sumsp}{\sum_{s'}}
\newcommand{\suma}{\sum_a}
\newcommand{\sumap}{\sum_{a'}}
\newcommand{\sumr}{\sum_r}

\newcommand{\pias}{\pi(\cond{a}s)}
\newcommand{\piapsp}{\pi(\cond{a'}{s'})}
\newcommand{\pitheta}{\pi_{\bm\theta}}
\newcommand{\pithetaas}{\pi_{\bm\theta}(\cond{a}s)}
\newcommand{\Ppi}{P_{\pi}}
\newcommand{\Ppin}{P_{\pi}^{~n}}
\newcommand{\Pcond}[2]{P(\cond{#1}{#2})}
\newcommand{\Ppicond}[2]{\Ppi(\cond{#1}{#2})}
\newcommand{\Ppincond}[2]{\Ppin(\cond{#1}{#2})}
\newcommand{\Psrsa}{\Pcond{s', r}{s, a}}
\newcommand{\Pssa}{\Pcond{s'}{s, a}}
\newcommand{\Prsa}{\Pcond{r}{s, a}}
\newcommand{\Ppisrs}{\Ppicond{s', r}s}
\newcommand{\Ppiss}{\Ppicond{s'}s}
\newcommand{\Ppirs}{\Ppicond{r}s}
\newcommand{\Ppinss}{\Ppincond{s'}s}
\newcommand{\Dpigamma}{D_{\pi}^{\,\gamma}}

\newcommand{\follow}{\sim}
\newcommand{\followpis}{\follow\pi(\cond{\cdot}s)}
\newcommand{\followpisp}{\follow\pi(\cond{\cdot}s')}
\newcommand{\twofollowPsa}{\follow\Pcond{\cdot, \cdot}{s, a}}
\newcommand{\onefollowPsa}{\follow\Pcond{\cdot}{s, a}}
\newcommand{\twofollowPpis}{\follow\Ppicond{\cdot, \cdot}s}
\newcommand{\onefollowPpis}{\follow\Ppicond{\cdot}s}

\newcommand{\Rsa}{R(s, a)}
\newcommand{\Rpis}{R_{\pi}(s)}

\newcommand{\prd}[1]{\bar{#1}}
\newcommand{\prds}{\prd{s}}
\newcommand{\prda}{\prd{a}}

\newcommand{\Vpi}{V_{\pi}}
\newcommand{\Vpis}{\Vpi(s)}
\newcommand{\Vpisp}{\Vpi(s')}
\newcommand{\Qpi}{Q_{\pi}}
\newcommand{\Qpisa}{\Qpi(s, a)}
\newcommand{\Qpispap}{\Qpi(s', a')}

\newcommand{\Vpihat}{\hat{V}_{\pi}}
\newcommand{\Vpihats}{\hat{V}_{\pi}(s)}
\newcommand{\Qpihat}{\hat{Q}_{\pi}}
\newcommand{\Qpihatsa}{\hat{Q}_{\pi}(s, a)}

\newcommand{\QED}{\blacksquare}

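This is a continuation of the previous post.

The previous post covered five equations (reproduced below). All of them except $\eref{C}$ contain sampled variables, which makes them rather dubious as equations. In fact, taken literally they are simply wrong: an equality cannot be expected to hold when a randomly sampled variable appears in it. (Last time I wanted the equations in the simplest possible form so that the intuition would be easy to grasp, so I deliberately left them imprecise.)

So in this post I remove the sampling step and present the equations in their complete form.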

\begin{align}
C_t &=r_{t+1}+\gamma C_{t+1} \label{eq:C} \\
\Vpis &=\Rpis+\gamma\Vpisp & \text{($s'\onefollowPpis$)} \label{eq:V} \\
\Vpis &=\Qpisa & \text{($a\followpis$)} \label{eq:VQ} \\
\Qpisa &=\Rsa+\gamma\Vpisp & \text{($s'\onefollowPsa$)} \label{eq:QV} \\
\Qpisa &=\Rsa+\gamma\Qpispap & \text{($s'\onefollowPsa$, $a'\followpisp$)} \label{eq:Q}
\end{align}

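Approach

As noted above, the problem is that random variables appear in the equations as they are. In $\eref{V}$, for example, $s'$ is sampled, so $\Vpisp$ is itself a random variable. Instead of using $\Vpisp$ directly, we use its expectation $\expct\bigl[\Vpisp\bigr]$; this absorbs the sampling step into the equation and makes it exact.

After that, we simply expand the expectation according to its definition, into an explicit form written with probabilities and sums.

Following this approach, we start from the equations shown at the top of this post and rewrite four of them.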
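
As a reminder of the definition being applied, here is a minimal sketch for a discrete random variable. The symbols $x$, $p$, and $f$ are placeholders introduced only for this illustration, and a finite (or countable) set of values is assumed.

\begin{align*}
\expct\bigl[\cond{f(x)}{x\follow p}\bigr] = \sum_x p(x)\,f(x)
\end{align*}

For instance, if $x$ takes the value $0$ with probability $0.25$ and the value $2$ with probability $0.75$, and $f(x)=x$, the expectation is $0.25\cdot 0+0.75\cdot 2=1.5$.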

Bellman equation for the state value

\begin{align}
\Vpis &=\Rpis+\gamma\Vpisp\hs1\text{($s'\onefollowPpis$)} \notag \\
      &=\Rpis+\gamma\expct\bigl[\cond{\Vpisp}{s'\onefollowPpis}\bigr] \notag \\
      &=\Rpis+\gamma\suma\pias\sumsp\Pssa\Vpisp
\end{align}
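
The last step can be split into smaller ones. The sketch below assumes that $\Ppiss$ is the state-transition probability marginalized over the actions chosen by $\pi$, that is, $\Ppiss=\suma\pias\Pssa$. This definition is taken to be the one from the previous post and is not restated here. Finite state and action sets are also assumed, so the sums can be exchanged.

\begin{align*}
\expct\bigl[\cond{\Vpisp}{s'\onefollowPpis}\bigr]
  &=\sumsp\Ppiss\Vpisp \\
  &=\sumsp\Bigl(\suma\pias\Pssa\Bigr)\Vpisp \\
  &=\suma\pias\sumsp\Pssa\Vpisp
\end{align*}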

Relation between the state value and the action value, part 1

\begin{align}
\Vpis &=\Qpisa\hs1\text{($a\followpis$)} \notag \\
      &=\expct\bigl[\cond{\Qpisa}{a\followpis}\bigr] \notag \\
      &=\suma\pias\Qpisa
\end{align}

Relation between the state value and the action value, part 2

\begin{align}
\Qpisa &=\Rsa+\gamma\Vpisp\hs1\text{($s'\onefollowPsa$)} \notag \\
       &=\Rsa+\gamma\expct\bigl[\cond{\Vpisp}{s'\onefollowPsa}\bigr] \notag \\
       &=\Rsa+\gamma\sumsp\Pssa\Vpisp
\end{align}

Bellman equation for the action value

\begin{align}
\Qpisa &=\Rsa+\gamma\Qpispap\hs1\text{($s'\onefollowPsa$, $a'\followpisp$)} \notag \\
       &=\Rsa+\gamma\expct\bigl[\cond{\Qpispap}{s'\onefollowPsa, a'\followpisp}\bigr] \notag \\
       &=\Rsa+\gamma\sumsp\Pssa\sumap\piapsp\Qpispap
\end{align}
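
The expectation here is over two variables sampled in sequence: first $s'$, then $a'$ given $s'$. The intermediate step can be sketched as a nested expectation, again assuming finite state and action sets.

\begin{align*}
\expct\bigl[\cond{\Qpispap}{s'\onefollowPsa, a'\followpisp}\bigr]
  &=\expct\Bigl[\cond{\expct\bigl[\cond{\Qpispap}{a'\followpisp}\bigr]}{s'\onefollowPsa}\Bigr] \\
  &=\sumsp\Pssa\sumap\piapsp\Qpispap
\end{align*}

The inner sum $\sumap\piapsp\Qpispap$ is the relation $\Vpis=\suma\pias\Qpisa$ evaluated at $s'$, so it equals $\Vpisp$. This agrees with the expression for $\Qpisa$ in terms of $\Vpisp$ derived in part 2.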

Summary

This time we obtained exact, proper ``equations''. The four resulting equations are collected below.

\begin{align}
\Vpis &=\Rpis+\gamma\suma\pias\sumsp\Pssa\Vpisp \\
\Vpis &=\suma\pias\Qpisa \\
\Qpisa &=\Rsa+\gamma\sumsp\Pssa\Vpisp \\
\Qpisa &=\Rsa+\gamma\sumsp\Pssa\sumap\piapsp\Qpispap
\end{align}
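
As a quick sanity check, the four equations can be seen to agree with one another. Substituting the expression for $\Qpisa$ in terms of $\Vpisp$ into $\Vpis=\suma\pias\Qpisa$ recovers the Bellman equation for the state value, provided that $\Rpis=\suma\pias\Rsa$. That identity is an assumption here about how $\Rpis$ was defined in the previous post.

\begin{align*}
\Vpis &=\suma\pias\Qpisa \\
      &=\suma\pias\Bigl(\Rsa+\gamma\sumsp\Pssa\Vpisp\Bigr) \\
      &=\suma\pias\Rsa+\gamma\suma\pias\sumsp\Pssa\Vpisp
\end{align*}

A hypothetical toy case makes this concrete. Take a single state $s$ and a single action $a$ with $\Pcond{s}{s, a}=1$, $\Rsa=1$, and $\gamma=0.9$. The state-value equation reduces to $\Vpis=1+0.9\,\Vpis$, so $\Vpis=10$. The action-value equation then gives $\Qpisa=1+0.9\cdot 10=10$, matching $\Vpis=\suma\pias\Qpisa$.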

Next time I will cover the policy gradient theorem.

読むとGPAが上がるブログ(仮)

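A blog where a GPA entertainer writes about whatever comes to mind

Reinforcement Learning Without Shying Away from Math (Part 2: Bellman Equations and More, Continued)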
