From: A reinforcement learning approach for protein–ligand binding pose prediction
Notation | Meaning |
---|---|
\(s \in {\mathcal{S}}\) | State |
\(a \in {\mathcal{A}}\) | Action |
\(r \in {\mathcal{R}}\) | Immediate reward |
\(\gamma\) | Discount factor |
\(G_{t}\) | The long-term reward: \(G_{t} = \sum_{k = 0}^{\infty} \gamma^{k} R_{t + k + 1}\) |
\(\pi_{\theta}(a|s)\) | Actor model with parameters \(\theta\); a distribution over actions given the state |
\(V_{\omega}^{\pi}(s)\) | Critic model with parameters \(\omega\); it depends on the policy model and outputs a score for state \(s\) |
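
As an illustration of the long-term reward \(G_{t}\) defined in the table, the following is a minimal sketch that evaluates the sum for a finite episode. The function name `discounted_returns` is hypothetical, and truncating the infinite sum at the end of the episode is an assumption made for this example; it is not taken from the paper. Here `rewards[t]` plays the role of \(R_{t+1}\).

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for every step t.

    Assumption: the episode is finite, so the infinite sum is
    truncated at the last reward. rewards[t] stands for R_{t+1}.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Three steps with reward 1.0 each and discount factor 0.5.
    print(discounted_returns([1.0, 1.0, 1.0], 0.5))
```

The backward pass relies on the recursion \(G_{t} = R_{t+1} + \gamma G_{t+1}\), which follows directly from the definition and avoids recomputing the sum for each \(t\).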