Model-based RL
Professur für Künstliche Intelligenz - Fakultät für Informatik
In model-free (MF) RL, the transition dynamics p(s_{t+1} | s_t, a_t) and the reward function r(s_t, a_t, s_{t+1}) are not learned explicitly.
We just sample transitions (s, a, r, s') and update Q-values or a policy network.
The main advantage is that the agent does not need to “think” when acting: it just selects the action with the highest Q-value (reflexive behavior).
The other advantage is that MF methods can be applied to any MDP: you do not need to know anything about its dynamics.
Learning the world model is not complicated in theory.
We just need to collect enough transitions (s_t, a_t, s_{t+1}, r_{t+1}) using a random agent (or during learning) and train a supervised model to predict the next state and reward.
Such a model is called the dynamics model, the transition model or the forward model.
The model can be deterministic (e.g. a neural network) or stochastic (e.g. a Gaussian process).
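As a minimal sketch of this supervised-learning step, one could fit a deterministic forward model on the collected transitions with an off-the-shelf regressor. The choice of scikit-learn's MLPRegressor and the helper names (fit_forward_model, predict_step) are illustrative assumptions, not a prescribed implementation.

```python
# Minimal sketch: fitting a deterministic forward model M(s, a) = (s', r)
# by supervised regression on collected transitions.
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_forward_model(states, actions, next_states, rewards):
    """Fit a regressor mapping (s, a) to (s', r) on a dataset of transitions."""
    X = np.concatenate([states, actions], axis=1)                 # inputs: (s, a)
    Y = np.concatenate([next_states, rewards[:, None]], axis=1)   # targets: (s', r)
    model = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=500)
    model.fit(X, Y)
    return model

def predict_step(model, s, a):
    """Predict the next state and reward for a single (s, a) pair."""
    out = model.predict(np.concatenate([s, a])[None, :])[0]
    return out[:-1], out[-1]    # (s', r)
```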
Given an initial state s_0 and a policy \pi, you can unroll the future using the learned model.
\tau = (s_0, a_0, r_1, s_1, a_1, \ldots, s_T)
\mathcal{J}(\theta) = \mathbb{E}_{\tau}[R(\tau)]
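A sketch of how imagined rollouts estimate this objective, assuming hypothetical helpers model_step(s, a) -> (s', r) (the learned model) and policy(s) -> a (the current policy):

```python
# Sketch: unrolling the learned model from s0 under the policy to estimate R(tau).
# `model_step(s, a) -> (s_next, r)` and `policy(s) -> a` are assumed to exist.
import numpy as np

def imagined_return(s0, policy, model_step, horizon=50, gamma=0.99):
    """Discounted return of one imagined trajectory (no real environment interaction)."""
    s, ret, discount = s0, 0.0, 1.0
    for t in range(horizon):
        a = policy(s)
        s, r = model_step(s, a)     # predicted next state and reward
        ret += discount * r
        discount *= gamma
    return ret

def estimate_objective(s0, policy, model_step, n_rollouts=20):
    """J(theta) estimated as the average return over several imagined rollouts."""
    return np.mean([imagined_return(s0, policy, model_step) for _ in range(n_rollouts)])
```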
The only sample complexity is the one needed to train the model: the rest is emulated.
Drawback: This can only work when the model is close to perfect, especially for long trajectories or probabilistic MDPs.
Kurutach et al. (2018) Model-Ensemble Trust-Region Policy Optimization. arXiv:1802.10592.
The emulated trajectories have a biased return, so the algorithm does not converge to the optimal policy.
If you have a perfect model, you should not be using RL anyway, as classical control methods would be much faster (but see AlphaGo).
Collect transitions (s, a, r, s') using a (random/expert) policy b and create an initial dataset \mathcal{D} = \{(s_k, a_k, r_k, s'_k)\}_{k}.
while not converged:
(Re)Train the dynamics model M(s, a) = (s', r) on \mathcal{D} using supervised learning.
foreach step t in the trajectory:
Plan a sequence of actions a_{t}, a_{t+1}, \ldots, a_T from the current state s_t using the model M.
Take the first action a_t, observe the next state s_{t+1}.
Append the transition (s_t, a_t, r_{t+1}, s_{t+1}) to the dataset.
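Putting the loop above into code, a hedged sketch could look as follows; collect_random_transitions, train_model and plan are assumed helpers (the planner can be, e.g., the random-shooting procedure described below), and the environment is assumed to follow the Gym/gymnasium API.

```python
# Sketch of the MPC training loop. `env` follows the gymnasium API (reset/step);
# `collect_random_transitions`, `train_model` and `plan` are assumed helpers.
def mpc_loop(env, collect_random_transitions, train_model, plan,
             n_iterations=10, horizon=200):
    dataset = collect_random_transitions(env)       # initial dataset D (list of transitions)
    for it in range(n_iterations):
        model = train_model(dataset)                # (re)train M(s, a) = (s', r) on D
        s, _ = env.reset()
        for t in range(horizon):
            actions = plan(model, s)                # planned sequence a_t, ..., a_T
            a = actions[0]                          # take only the first planned action
            s_next, r, terminated, truncated, _ = env.step(a)
            dataset.append((s, a, r, s_next))       # grow D with the real transition
            s = s_next
            if terminated or truncated:
                break
    return model
```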
Nagabandi et al. (2017). Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning. arXiv:1708.02596.
The planner can actually be anything; it does not have to be an RL algorithm.
For example, it can be iLQR (Iterative Linear Quadratic Regulator), a non-linear optimization method.
https://jonathan-hui.medium.com/rl-lqr-ilqr-linear-quadratic-regulator-a5de5104c750.
Alternatively, one can use random-sampling shooting (a minimal sketch follows this list):
in the current state, sample a set of candidate action sequences.
generate rollouts with these action sequences and compute their returns using the model.
execute the first action of the sequence whose rollout has the highest return.
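A minimal sketch of random-sampling shooting under the same assumptions (a learned model_step(s, a) -> (s', r) and a continuous action space of dimension action_dim; all names are illustrative):

```python
# Sketch of random-sampling shooting: evaluate random action sequences in the model
# and return the first action of the best one.
import numpy as np

def random_shooting(model_step, s0, action_dim, n_candidates=1000, horizon=15,
                    gamma=0.99, action_low=-1.0, action_high=1.0):
    """Pick the first action of the candidate sequence with the highest imagined return."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        # sample a random open-loop action sequence
        actions = np.random.uniform(action_low, action_high, size=(horizon, action_dim))
        s, ret, discount = s0, 0.0, 1.0
        for a in actions:                   # roll it out in the model, not the real env
            s, r = model_step(s, a)
            ret += discount * r
            discount *= gamma
        if ret > best_return:
            best_return, best_first_action = ret, actions[0]
    return best_first_action
```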
The main advantage of MPC is that you can change the reward function (the goal) on the fly: what you learn is the model, but planning is just an optimization procedure.
You can set intermediate goals for the agent very flexibly: there is no need for a well-defined reward function.
Model imperfection is less of a problem, as you replan all the time. The model can also adapt to changes in the environment (slippery terrain, sim-to-real transfer).
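To illustrate the "change the goal on the fly" point, the sketch below scores candidate action sequences with a user-supplied reward function at planning time; here the assumed model_step(s, a) predicts only the next state, and all names are hypothetical.

```python
# Sketch: planning is just optimization, so the reward/goal can be swapped at
# planning time without retraining the model.
import numpy as np

def plan_with_goal(model_step, s0, candidate_sequences, reward_fn):
    """Score each candidate action sequence with an arbitrary (possibly new) reward function."""
    def score(actions):
        s, ret = s0, 0.0
        for a in actions:
            s = model_step(s, a)
            ret += reward_fn(s, a)    # e.g. -np.linalg.norm(s - goal) for a new goal
        return ret
    return max(candidate_sequences, key=score)[0]   # first action of the best sequence
```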
Another approach to MB RL is to augment MF methods with MB rollouts.
The MF algorithm (e.g. Q-learning) learns from transitions (s, a, r, s') sampled either with:
real experience: interaction with the environment.
simulated experience: simulation by the model.
If the simulated transitions are good enough, the MF algorithm can converge using far fewer real transitions, thereby reducing its sample complexity.
The Dyna-Q algorithm is an extension of Q-learning that integrates a learned model M(s, a) = (s', r).
The model can be tabular or approximated with a NN.
Sutton (1990) Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. Machine Learning Proceedings.
Initialize values Q(s, a) and model M(s, a).
for t \in [0, T_\text{total}]:
Select a_t using Q, take it in the real environment and observe s_{t+1} and r_{t+1}.
Update the Q-value of the real action:
\Delta Q(s_t, a_t) = \alpha \, (r_{t+1} + \gamma \, \max_a Q(s_{t+1}, a) - Q(s_t, a_t))
Update the model:
M(s_t, a_t) \leftarrow (s_{t+1}, r_{t+1})
for K steps:
Sample a state s_k from a list of visited states.
Select a_k using Q, predict s_{k+1} and r_{k+1} using the model M(s_k, a_k).
Update the Q-value of the imagined action:
\Delta Q(s_k, a_k) = \alpha \, (r_{k+1} + \gamma \, \max_a Q(s_{k+1}, a) - Q(s_k, a_k))
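A compact sketch of tabular Dyna-Q, assuming a gymnasium-style discrete environment; for simplicity the planning phase samples a previously stored (s, a) pair at random (as in Sutton's original formulation) instead of selecting a_k with Q, and bootstrapping at terminal states is ignored.

```python
# Sketch of tabular Dyna-Q: Q-learning on real transitions plus K planning steps
# on imagined transitions stored in a dictionary model M[(s, a)] = (s', r).
import random
from collections import defaultdict

def dyna_q(env, n_steps=10000, K=10, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                     # Q[(s, a)]
    M = {}                                     # model M[(s, a)] = (s_next, r)
    n_actions = env.action_space.n
    s, _ = env.reset()
    for t in range(n_steps):
        # epsilon-greedy action selection in the real environment
        if random.random() < epsilon:
            a = env.action_space.sample()
        else:
            a = max(range(n_actions), key=lambda u: Q[(s, u)])
        s_next, r, terminated, truncated, _ = env.step(a)
        # Q-learning update on the real transition
        target = r + gamma * max(Q[(s_next, u)] for u in range(n_actions))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        # update the model with the observed transition
        M[(s, a)] = (s_next, r)
        # K planning steps on imagined transitions
        for _ in range(K):
            (sk, ak), (sk_next, rk) = random.choice(list(M.items()))
            target = rk + gamma * max(Q[(sk_next, u)] for u in range(n_actions))
            Q[(sk, ak)] += alpha * (target - Q[(sk, ak)])
        s = s_next if not (terminated or truncated) else env.reset()[0]
    return Q
```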
It is interesting to note that Dyna-Q is very similar to DQN with its experience replay memory (ERM).
In DQN, the ERM stores real transitions generated in the past.
In Dyna-Q, the model generates imagined transitions based on past real transitions.