Proximal Policy Optimization (PPO)

PPO is a model-free, stochastic, on-policy policy gradient algorithm that alternates between sampling data through interaction with the environment and optimizing a surrogate objective function, while preventing the new policy from moving too far away from the old one

Paper: Proximal Policy Optimization Algorithms

Algorithm

For each iteration do:
\(\bullet \;\) Collect, in a rollout memory, a set of states \(s\), actions \(a\), rewards \(r\), dones \(d\), log probabilities \(logp\) and values \(V\) on policy using \(\pi_\theta\) and \(V_\phi\)
\(\bullet \;\) Estimate returns \(R\) and advantages \(A\) using Generalized Advantage Estimation (GAE(\(\lambda\))) from the collected data [\(r, d, V\)]
\(\bullet \;\) Compute the entropy loss \({L}_{entropy}\)
\(\bullet \;\) Compute the clipped surrogate objective (policy loss), where \(ratio\) is the ratio of the probability of the action under the current policy to its probability under the previous policy: \(L^{clip}_{\pi_\theta} = \mathbb{E}[\min(A \; ratio, A \; \text{clip}(ratio, 1-c, 1+c))]\)
\(\bullet \;\) Compute the value loss \(L_{V_\phi}\) as the mean squared error (MSE) between the predicted values \(V_{_{predicted}}\) and the estimated returns \(R\)
\(\bullet \;\) Optimize the total loss \(L = L^{clip}_{\pi_\theta} - c_1 \, L_{V_\phi} + c_2 \, {L}_{entropy}\)
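
As a minimal illustration of these steps, the following PyTorch sketch (written for this page, not taken from the implementation) computes the clipped surrogate, entropy and value losses for one mini-batch; the tensor names and the helper function ppo_losses are assumptions:

# minimal sketch (not the skrl implementation): PPO losses for one mini-batch
# assumed inputs: new_log_prob / old_log_prob are the log-probabilities of the
# sampled actions, advantages / returns come from GAE, predicted_values and
# entropy come from the current networks -- all torch tensors of matching shape
import torch
import torch.nn.functional as F

def ppo_losses(new_log_prob, old_log_prob, advantages, returns,
               predicted_values, entropy,
               ratio_clip=0.2, value_loss_scale=1.0, entropy_loss_scale=0.0):
    # probability ratio between the current policy and the previous policy
    ratio = torch.exp(new_log_prob - old_log_prob)
    # clipped surrogate objective (negated, since optimizers minimize)
    surrogate = advantages * ratio
    clipped_surrogate = advantages * torch.clamp(ratio, 1.0 - ratio_clip, 1.0 + ratio_clip)
    policy_loss = -torch.min(surrogate, clipped_surrogate).mean()
    # value loss: MSE between predicted values and estimated returns
    value_loss = value_loss_scale * F.mse_loss(predicted_values, returns)
    # entropy bonus (negated so that higher entropy lowers the total loss)
    entropy_loss = -entropy_loss_scale * entropy.mean()
    # total loss to be minimized in a single optimization step
    return policy_loss + entropy_loss + value_loss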

Algorithm implementation

Main notation/symbols:
- policy function approximator (\(\pi_\theta\)), value function approximator (\(V_\phi\))
- states (\(s\)), actions (\(a\)), rewards (\(r\)), next states (\(s'\)), dones (\(d\))
- values (\(V\)), advantages (\(A\)), returns (\(R\))
- log probabilities (\(logp\))
- loss (\(L\))

Learning algorithm (_update(...))

compute_gae(...)
def \(\;f_{GAE} (r, d, V, V_{_{last}}') \;\rightarrow\; R, A:\)
    \(adv \leftarrow 0\)
    \(A \leftarrow \text{zeros}(r)\)
    # advantages computation
    FOR each reverse iteration \(i\) up to the number of rows in \(r\) DO
        IF \(i\) is not the last row of \(r\) THEN
            \(V_i' \leftarrow V_{i+1}\)
        ELSE
            \(V_i' \leftarrow V_{_{last}}'\)
        \(adv \leftarrow r_i - V_i \, +\) discount_factor \(\neg d_i \; (V_i' \, +\) lambda \(adv)\)
        \(A_i \leftarrow adv\)
    # returns computation
    \(R \leftarrow A + V\)
    # normalize advantages
    \(A \leftarrow \dfrac{A - \bar{A}}{A_\sigma + 10^{-8}}\)
# compute returns and advantages
\(V_{_{last}}' \leftarrow V_\phi(s')\)
\(R, A \leftarrow f_{GAE}(r, d, V, V_{_{last}}')\)
# sample mini-batches from memory
[[\(s, a, logp, V, R, A\)]] \(\leftarrow\) states, actions, log_prob, values, returns, advantages
# learning epochs
FOR each learning epoch up to learning_epochs DO
    # mini-batches loop
    FOR each mini-batch [\(s, a, logp, V, R, A\)] up to mini_batches DO
        \(logp' \leftarrow \pi_\theta(s, a)\)
        # compute approximate KL divergence
        \(ratio \leftarrow logp' - logp\)
        \(KL_{_{divergence}} \leftarrow \frac{1}{N} \sum_{i=1}^N ((e^{ratio} - 1) - ratio)\)
        # early stopping with KL divergence
        IF \(KL_{_{divergence}} >\) kl_threshold THEN
            BREAK LOOP
        # compute entropy loss
        IF entropy computation is enabled THEN
            \({L}_{entropy} \leftarrow \, -\) entropy_loss_scale \(\frac{1}{N} \sum_{i=1}^N \pi_{\theta_{entropy}}\)
        ELSE
            \({L}_{entropy} \leftarrow 0\)
        # compute policy loss
        \(ratio \leftarrow e^{logp' - logp}\)
        \(L_{_{surrogate}} \leftarrow A \; ratio\)
        \(L_{_{clipped\,surrogate}} \leftarrow A \; \text{clip}(ratio, 1 - c, 1 + c) \qquad\) with \(c\) as ratio_clip
        \(L^{clip}_{\pi_\theta} \leftarrow - \frac{1}{N} \sum_{i=1}^N \min(L_{_{surrogate}}, L_{_{clipped\,surrogate}})\)
        # compute value loss
        \(V_{_{predicted}} \leftarrow V_\phi(s)\)
        IF clip_predicted_values is enabled THEN
            \(V_{_{predicted}} \leftarrow V + \text{clip}(V_{_{predicted}} - V, -c, c) \qquad\) with \(c\) as value_clip
        \(L_{V_\phi} \leftarrow\) value_loss_scale \(\frac{1}{N} \sum_{i=1}^N (R - V_{_{predicted}})^2\)
        # optimization step
        reset \(\text{optimizer}_{\theta, \phi}\)
        \(\nabla_{\theta, \, \phi} (L^{clip}_{\pi_\theta} + {L}_{entropy} + L_{V_\phi})\)
        \(\text{clip}(\lVert \nabla_{\theta, \, \phi} \rVert)\) with grad_norm_clip
        step \(\text{optimizer}_{\theta, \phi}\)
    # update learning rate
    IF there is a learning_rate_scheduler THEN
        step \(\text{scheduler}_{\theta, \phi} (\text{optimizer}_{\theta, \phi})\)
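
The compute_gae(...) step above can be sketched in PyTorch roughly as follows; this is a simplified stand-alone version written for this page (the function name compute_gae and the tensor shapes are assumptions, not skrl internals):

# minimal GAE(lambda) sketch (not the skrl implementation)
# rewards, dones, values: tensors of shape (T, 1) collected during the rollout
# last_value: value of the state following the rollout, i.e. V_phi(s')
import torch

def compute_gae(rewards, dones, values, last_value, discount_factor=0.99, lambda_=0.95):
    advantages = torch.zeros_like(rewards)
    adv = 0
    # iterate the rollout backwards, stopping the recursion at episode ends
    for i in reversed(range(rewards.shape[0])):
        next_value = values[i + 1] if i < rewards.shape[0] - 1 else last_value
        not_done = 1.0 - dones[i].float()
        adv = rewards[i] - values[i] + discount_factor * not_done * (next_value + lambda_ * adv)
        advantages[i] = adv
    # returns are the advantages plus the value baseline
    returns = advantages + values
    # normalize advantages before using them in the surrogate objective
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return returns, advantages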

Configuration and hyperparameters

skrl.agents.torch.ppo.ppo.PPO_DEFAULT_CONFIG
PPO_DEFAULT_CONFIG = {
    "rollouts": 16,                 # number of rollouts before updating
    "learning_epochs": 8,           # number of learning epochs during each update
    "mini_batches": 2,              # number of mini batches during each learning epoch

    "discount_factor": 0.99,        # discount factor (gamma)
    "lambda": 0.95,                 # TD(lambda) coefficient (lam) for computing returns and advantages

    "learning_rate": 1e-3,                  # learning rate
    "learning_rate_scheduler": None,        # learning rate scheduler class (see torch.optim.lr_scheduler)
    "learning_rate_scheduler_kwargs": {},   # learning rate scheduler's kwargs (e.g. {"step_size": 1e-3})

    "state_preprocessor": None,             # state preprocessor class (see skrl.resources.preprocessors)
    "state_preprocessor_kwargs": {},        # state preprocessor's kwargs (e.g. {"size": env.observation_space})
    "value_preprocessor": None,             # value preprocessor class (see skrl.resources.preprocessors)
    "value_preprocessor_kwargs": {},        # value preprocessor's kwargs (e.g. {"size": 1})

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "grad_norm_clip": 0.5,              # clipping coefficient for the norm of the gradients
    "ratio_clip": 0.2,                  # clipping coefficient for computing the clipped surrogate objective
    "value_clip": 0.2,                  # clipping coefficient for computing the value loss (if clip_predicted_values is True)
    "clip_predicted_values": False,     # clip predicted values during value loss computation

    "entropy_loss_scale": 0.0,      # entropy loss scaling factor
    "value_loss_scale": 1.0,        # value loss scaling factor

    "kl_threshold": 0,              # KL divergence threshold for early stopping

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",            # experiment's parent directory
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,        # interval for checkpoints (timesteps)
        "store_separately": False,          # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
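
A common usage pattern is to copy the default configuration and override only the desired entries before creating the agent. The sketch below assumes that models, memory, env and device have already been created as described elsewhere in the documentation:

# sketch: override some of the defaults before instantiating the agent
# (models, memory, env and device are assumed to exist already)
from skrl.agents.torch.ppo import PPO, PPO_DEFAULT_CONFIG

cfg = PPO_DEFAULT_CONFIG.copy()
cfg["rollouts"] = 32                # collect more steps per update
cfg["learning_epochs"] = 10
cfg["mini_batches"] = 4
cfg["entropy_loss_scale"] = 0.01    # enable the entropy bonus
cfg["kl_threshold"] = 0.02          # enable early stopping on KL divergence

agent = PPO(models=models,
            memory=memory,
            cfg=cfg,
            observation_space=env.observation_space,
            action_space=env.action_space,
            device=device)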

Spaces and models

The implementation supports the following Gym spaces / Gymnasium spaces

Gym/Gymnasium spaces | Observation | Action
Discrete | \(\square\) | \(\blacksquare\)
Box | \(\blacksquare\) | \(\blacksquare\)
Dict | \(\blacksquare\) | \(\square\)

(\(\blacksquare\): supported, \(\square\): not supported)
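
For example, Gymnasium's CartPole-v1 combines a Box observation space with a Discrete action space, which is a supported combination according to the table above (a quick check, assuming Gymnasium is installed):

import gymnasium as gym

# CartPole exposes a Box observation space and a Discrete action space,
# a combination marked as supported in the table above
env = gym.make("CartPole-v1")
print(env.observation_space)   # Box(..., (4,), float32)
print(env.action_space)        # Discrete(2)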

The implementation uses 1 stochastic (discrete or continuous) and 1 deterministic function approximator. These function approximators (models) must be collected in a dictionary and passed to the constructor of the class under the argument models

Notation | Concept | Key | Input shape | Output shape | Type
\(\pi_\theta(s)\) | Policy | "policy" | observation | action | Categorical / Gaussian / MultivariateGaussian
\(V_\phi(s)\) | Value | "value" | observation | 1 | Deterministic
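
A sketch of how such a dictionary could be built for a continuous-action environment, using skrl's Model and mixin classes (class names, network sizes and the env/device variables are illustrative assumptions):

import torch
import torch.nn as nn

from skrl.models.torch import Model, GaussianMixin, DeterministicMixin

# stochastic policy ("policy" key): maps observations to a Gaussian over actions
class Policy(GaussianMixin, Model):
    def __init__(self, observation_space, action_space, device, clip_actions=False):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(self, clip_actions)
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.ELU(),
                                 nn.Linear(64, self.num_actions))
        self.log_std_parameter = nn.Parameter(torch.zeros(self.num_actions))

    def compute(self, inputs, role):
        return self.net(inputs["states"]), self.log_std_parameter, {}

# deterministic value function ("value" key): maps observations to a single value
class Value(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device, clip_actions=False):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self, clip_actions)
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.ELU(),
                                 nn.Linear(64, 1))

    def compute(self, inputs, role):
        return self.net(inputs["states"]), {}

# collect the models in a dictionary with the keys expected by the agent
# (env and device are assumed to exist, e.g. a wrapped environment and a torch device)
models = {"policy": Policy(env.observation_space, env.action_space, device),
          "value": Value(env.observation_space, env.action_space, device)}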

Support for advanced features is described in the next table

Feature | Support and remarks
Shared model | for Policy and Value
RNN support | RNN, LSTM, GRU and any other variant
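
For the shared-model case, a single network can serve both the policy and value roles by dispatching on the role name. The sketch below follows the mixin pattern used in skrl's examples; exact signatures may differ between versions, and env/device are assumed to exist:

import torch
import torch.nn as nn

from skrl.models.torch import Model, GaussianMixin, DeterministicMixin

class SharedModel(GaussianMixin, DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device, clip_actions=False):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(self, clip_actions, role="policy")
        DeterministicMixin.__init__(self, clip_actions, role="value")
        # common feature extractor with separate policy and value heads
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.ELU())
        self.mean_layer = nn.Linear(64, self.num_actions)
        self.value_layer = nn.Linear(64, 1)
        self.log_std_parameter = nn.Parameter(torch.zeros(self.num_actions))

    def act(self, inputs, role):
        # dispatch to the mixin that matches the requested role
        if role == "policy":
            return GaussianMixin.act(self, inputs, role)
        elif role == "value":
            return DeterministicMixin.act(self, inputs, role)

    def compute(self, inputs, role):
        features = self.net(inputs["states"])
        if role == "policy":
            return self.mean_layer(features), self.log_std_parameter, {}
        elif role == "value":
            return self.value_layer(features), {}

# the same instance is passed under both keys (env and device are assumed to exist)
shared = SharedModel(env.observation_space, env.action_space, device)
models = {"policy": shared, "value": shared}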

API