Proximal Policy Optimization (PPO)
PPO is a model-free, on-policy, stochastic policy gradient algorithm that alternates between sampling data through interaction with the environment and optimizing a surrogate objective function, while preventing the new policy from moving too far away from the old one.
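The update optimizes the clipped surrogate objective introduced in the paper, where \(r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\) is the probability ratio, \(\hat{A}_t\) is the advantage estimate, and \(\epsilon\) is the clipping coefficient (the ratio_clip hyperparameter below):

\[
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right]
\]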
Paper: Proximal Policy Optimization Algorithms
Algorithm
Algorithm implementation
Learning algorithm (_update(...))
compute_gae(...)
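As a reference, the following is a minimal sketch of Generalized Advantage Estimation (GAE) computed backwards over a rollout; the function name mirrors the method above, but the standalone signature and tensor names (rewards, dones, values, next_values) are illustrative assumptions rather than skrl's exact interface.

import torch

def compute_gae(rewards, dones, values, next_values,
                discount_factor=0.99, lambda_coefficient=0.95):
    # rewards, dones, values: tensors of shape (memory_size, num_envs, 1)
    # next_values: value estimates for the states following the last rollout step
    advantages = torch.zeros_like(rewards)
    advantage = 0
    not_dones = dones.logical_not()
    memory_size = rewards.shape[0]

    # accumulate advantages backwards through the rollout
    for i in reversed(range(memory_size)):
        next_value = values[i + 1] if i < memory_size - 1 else next_values
        # TD error, masked when the episode terminated at step i
        delta = rewards[i] + discount_factor * not_dones[i] * next_value - values[i]
        advantage = delta + discount_factor * lambda_coefficient * not_dones[i] * advantage
        advantages[i] = advantage

    # returns are used as value-function targets; advantages are standardized for the policy loss
    returns = advantages + values
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return returns, advantages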
Configuration and hyperparameters
- skrl.agents.torch.ppo.ppo.PPO_DEFAULT_CONFIG
PPO_DEFAULT_CONFIG = {
    "rollouts": 16,                 # number of rollouts before updating
    "learning_epochs": 8,           # number of learning epochs during each update
    "mini_batches": 2,              # number of mini batches during each learning epoch

    "discount_factor": 0.99,        # discount factor (gamma)
    "lambda": 0.95,                 # TD(lambda) coefficient (lam) for computing returns and advantages

    "learning_rate": 1e-3,                  # learning rate
    "learning_rate_scheduler": None,        # learning rate scheduler class (see torch.optim.lr_scheduler)
    "learning_rate_scheduler_kwargs": {},   # learning rate scheduler's kwargs (e.g. {"step_size": 1e-3})

    "state_preprocessor": None,             # state preprocessor class (see skrl.resources.preprocessors)
    "state_preprocessor_kwargs": {},        # state preprocessor's kwargs (e.g. {"size": env.observation_space})
    "value_preprocessor": None,             # value preprocessor class (see skrl.resources.preprocessors)
    "value_preprocessor_kwargs": {},        # value preprocessor's kwargs (e.g. {"size": 1})

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "grad_norm_clip": 0.5,              # clipping coefficient for the norm of the gradients
    "ratio_clip": 0.2,                  # clipping coefficient for computing the clipped surrogate objective
    "value_clip": 0.2,                  # clipping coefficient for computing the value loss (if clip_predicted_values is True)
    "clip_predicted_values": False,     # clip predicted values during value loss computation

    "entropy_loss_scale": 0.0,      # entropy loss scaling factor
    "value_loss_scale": 1.0,        # value loss scaling factor

    "kl_threshold": 0,              # KL divergence threshold for early stopping

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",            # experiment's parent directory
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,        # interval for checkpoints (timesteps)
        "store_separately": False,          # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
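A minimal usage sketch: start from the default configuration, override selected hyperparameters, and pass the result to the agent's constructor. The models dictionary, the memory and the wrapped environment (env) are assumed to have been created beforehand, and the overridden values are purely illustrative.

from skrl.agents.torch.ppo import PPO, PPO_DEFAULT_CONFIG

# copy the default configuration and override a few hyperparameters
cfg = PPO_DEFAULT_CONFIG.copy()
cfg["rollouts"] = 32
cfg["learning_epochs"] = 10
cfg["entropy_loss_scale"] = 0.01

agent = PPO(models=models,            # dictionary with the "policy" and "value" models
            memory=memory,            # rollout memory (e.g. a RandomMemory instance)
            cfg=cfg,
            observation_space=env.observation_space,
            action_space=env.action_space,
            device=env.device)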
Spaces and models
The implementation supports the following Gym / Gymnasium spaces:

| Gym/Gymnasium spaces | Observation | Action |
|---|---|---|
| Discrete | \(\square\) | \(\blacksquare\) |
| Box | \(\blacksquare\) | \(\blacksquare\) |
| Dict | \(\blacksquare\) | \(\square\) |
The implementation uses 1 stochastic (discrete or continuous) and 1 deterministic function approximator. These function approximators (models) must be collected in a dictionary and passed to the constructor of the class under the argument models (a definition sketch is shown after the table).
| Notation | Concept | Key | Input shape | Output shape | Type |
|---|---|---|---|---|---|
| \(\pi_\theta(s)\) | Policy | "policy" | observation | action | Categorical / Gaussian / MultivariateGaussian |
| \(V_\phi(s)\) | Value | "value" | observation | 1 | Deterministic |
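A minimal sketch of how the two models might be defined with skrl's mixin-based model API and collected into the dictionary; the network sizes, the activation functions and the env variable are illustrative assumptions.

import torch
import torch.nn as nn

from skrl.models.torch import Model, GaussianMixin, DeterministicMixin


class Policy(GaussianMixin, Model):
    def __init__(self, observation_space, action_space, device,
                 clip_actions=False, clip_log_std=True, min_log_std=-20, max_log_std=2):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(self, clip_actions, clip_log_std, min_log_std, max_log_std)

        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.ELU(),
                                 nn.Linear(64, 64), nn.ELU(),
                                 nn.Linear(64, self.num_actions))
        self.log_std_parameter = nn.Parameter(torch.zeros(self.num_actions))

    def compute(self, inputs, role):
        # mean actions, log standard deviation and an (empty) outputs dictionary
        return self.net(inputs["states"]), self.log_std_parameter, {}


class Value(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device, clip_actions=False):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self, clip_actions)

        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.ELU(),
                                 nn.Linear(64, 64), nn.ELU(),
                                 nn.Linear(64, 1))

    def compute(self, inputs, role):
        # state-value estimate and an (empty) outputs dictionary
        return self.net(inputs["states"]), {}


models = {"policy": Policy(env.observation_space, env.action_space, env.device),
          "value": Value(env.observation_space, env.action_space, env.device)}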
Support for advanced features is described in the next table:

| Feature | Support and remarks |
|---|---|
| Shared model | for Policy and Value |
| RNN support | RNN, LSTM, GRU and any other variant |
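Regarding the shared model, a minimal sketch is shown below (reusing the imports and the env variable from the previous sketch): a single network body serves both roles, and act / compute dispatch on the role string. Layer sizes and names are illustrative assumptions.

class SharedModel(GaussianMixin, DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device,
                 clip_actions=False, clip_log_std=True, min_log_std=-20, max_log_std=2):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(self, clip_actions, clip_log_std, min_log_std, max_log_std, role="policy")
        DeterministicMixin.__init__(self, clip_actions, role="value")

        # shared body, separate heads for the policy mean and the value
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.ELU(),
                                 nn.Linear(64, 64), nn.ELU())
        self.mean_layer = nn.Linear(64, self.num_actions)
        self.value_layer = nn.Linear(64, 1)
        self.log_std_parameter = nn.Parameter(torch.zeros(self.num_actions))

    def act(self, inputs, role):
        # dispatch to the mixin that matches the requested role
        if role == "policy":
            return GaussianMixin.act(self, inputs, role)
        elif role == "value":
            return DeterministicMixin.act(self, inputs, role)

    def compute(self, inputs, role):
        features = self.net(inputs["states"])
        if role == "policy":
            return self.mean_layer(features), self.log_std_parameter, {}
        elif role == "value":
            return self.value_layer(features), {}


# the same instance is passed under both keys
shared = SharedModel(env.observation_space, env.action_space, env.device)
models = {"policy": shared, "value": shared}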