Trust Region Policy Optimization (TRPO)
TRPO is a model-free, on-policy, stochastic policy gradient algorithm that optimizes the policy through an iterative trust-region procedure with guaranteed monotonic improvement.
Paper: Trust Region Policy Optimization
Algorithm
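In outline, each TRPO iteration maximizes a surrogate objective over a trust region defined by the KL divergence between the old and the new policy:

```latex
\begin{aligned}
\max_{\theta} \quad & \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}
    \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, A^{\pi_{\theta_{\text{old}}}}(s, a) \right] \\
\text{subject to} \quad & \mathbb{E}_{s}
    \left[ D_{\mathrm{KL}} \big( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s) \big) \right] \leq \delta
\end{aligned}
```

where \(A\) is the advantage function and \(\delta\) is the trust-region radius (the max_kl_divergence hyperparameter in the configuration below).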
Algorithm implementation
Learning algorithm (_update(...))
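The helper functions listed below fit together roughly as follows. This is an illustrative NumPy sketch on a toy quadratic model, not skrl's implementation: the real code works on torch tensors and computes Fisher-vector products by differentiating the KL divergence, whereas here the "Fisher" matrix is explicit for demonstration only.

```python
# Illustrative sketch of the core of one TRPO update: solve F x = g with
# conjugate gradient, accessing F only through Fisher-vector products, then
# backtrack the step until the KL constraint holds. Not skrl's actual code.
import numpy as np

def fisher_vector_product(F, v, damping=0.1):
    # In practice F @ v is obtained via automatic differentiation of the KL;
    # here F is an explicit matrix purely for illustration
    return F @ v + damping * v

def conjugate_gradient(fvp, g, steps=10, tol=1e-12):
    # solve fvp(x) = g without ever forming the matrix
    x = np.zeros_like(g)
    r = g.copy()          # residual g - fvp(x), with x = 0
    p = r.copy()
    rr = r @ r
    for _ in range(steps):
        Ap = fvp(p)
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# toy problem: known symmetric positive definite "Fisher" matrix and gradient
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
F = A @ A.T + 5 * np.eye(5)       # stand-in for the Fisher information matrix
g = rng.standard_normal(5)        # stand-in for the policy gradient

fvp = lambda v: fisher_vector_product(F, v)
x = conjugate_gradient(fvp, g)

# scale the full step so the quadratic KL estimate 0.5 x^T F x equals delta
max_kl = 0.01
step = np.sqrt(2 * max_kl / (x @ fvp(x))) * x

# backtracking line search: shrink the step until the KL constraint is met
# (the real line search also checks the surrogate-loss improvement ratio)
for _ in range(10):
    kl = 0.5 * step @ F @ step    # stand-in for the measured KL divergence
    if kl <= 1.5 * max_kl:
        break
    step *= 0.5
```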
compute_gae(...)
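Generalized advantage estimation can be sketched as follows (a minimal illustration, not skrl's implementation; the argument names are assumptions):

```python
# Illustrative generalized advantage estimation (GAE):
#   delta_t = r_t + gamma * (1 - done_t) * V(s_{t+1}) - V(s_t)
#   A_t     = delta_t + gamma * lambda * (1 - done_t) * A_{t+1}
import numpy as np

def compute_gae(rewards, values, dones, next_value, discount=0.99, lam=0.95):
    advantages = np.zeros_like(rewards)
    gae = 0.0
    # traverse the rollout backwards, accumulating the exponentially
    # weighted sum of TD errors
    for t in reversed(range(len(rewards))):
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + discount * v_next * (1 - dones[t]) - values[t]
        gae = delta + discount * lam * (1 - dones[t]) * gae
        advantages[t] = gae
    returns = advantages + values  # regression targets for the value function
    return advantages, returns
```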
surrogate_loss(...)
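The surrogate objective is the importance-sampling ratio between the new and old policies times the advantage; a sketch (illustrative, not skrl's implementation):

```python
# Illustrative surrogate objective: E[ pi_new(a|s) / pi_old(a|s) * A(s, a) ],
# with the ratio computed from log-probabilities for numerical stability
import numpy as np

def surrogate_loss(new_log_prob, old_log_prob, advantages):
    ratio = np.exp(new_log_prob - old_log_prob)
    return np.mean(advantages * ratio)
```

At the old parameters the ratio is 1, so the gradient of this surrogate equals the ordinary policy gradient, which is what makes it a valid local objective for the trust-region step.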
conjugate_gradient(...) (see conjugate gradient method)
fisher_vector_product(...) (see Fisher-vector product in TRPO)
kl_divergence(...) (see Kullback–Leibler divergence for normal distributions)

Configuration and hyperparameters
- skrl.agents.torch.trpo.trpo.TRPO_DEFAULT_CONFIG
TRPO_DEFAULT_CONFIG = {
    "rollouts": 16,                 # number of rollouts before updating
    "learning_epochs": 8,           # number of learning epochs during each update
    "mini_batches": 2,              # number of mini batches during each learning epoch

    "discount_factor": 0.99,        # discount factor (gamma)
    "lambda": 0.95,                 # TD(lambda) coefficient (lam) for computing returns and advantages

    "value_learning_rate": 1e-3,    # value learning rate
    "learning_rate_scheduler": None,        # learning rate scheduler class (see torch.optim.lr_scheduler)
    "learning_rate_scheduler_kwargs": {},   # learning rate scheduler's kwargs (e.g. {"step_size": 1e-3})

    "state_preprocessor": None,             # state preprocessor class (see skrl.resources.preprocessors)
    "state_preprocessor_kwargs": {},        # state preprocessor's kwargs (e.g. {"size": env.observation_space})
    "value_preprocessor": None,             # value preprocessor class (see skrl.resources.preprocessors)
    "value_preprocessor_kwargs": {},        # value preprocessor's kwargs (e.g. {"size": 1})

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "grad_norm_clip": 0.5,          # clipping coefficient for the norm of the gradients
    "value_loss_scale": 1.0,        # value loss scaling factor

    "damping": 0.1,                 # damping coefficient for computing the Hessian-vector product
    "max_kl_divergence": 0.01,      # maximum KL divergence between old and new policy
    "conjugate_gradient_steps": 10, # maximum number of iterations for the conjugate gradient algorithm
    "max_backtrack_steps": 10,      # maximum number of backtracking steps during line search
    "accept_ratio": 0.5,            # accept ratio for the line search loss improvement
    "step_fraction": 1.0,           # fraction of the step size for the line search

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",            # experiment's parent directory
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,    # interval for checkpoints (timesteps)
        "store_separately": False,      # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
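A typical pattern is to copy the default configuration and override selected entries; the keys below are taken from the dictionary above, but the values are only an example, not recommendations:

```python
from skrl.agents.torch.trpo import TRPO_DEFAULT_CONFIG

cfg = TRPO_DEFAULT_CONFIG.copy()    # note: a shallow copy, so nested dicts are shared
cfg["rollouts"] = 32                # example values only
cfg["max_kl_divergence"] = 0.005    # tighter trust region
cfg["experiment"] = {**cfg["experiment"], "write_interval": 500}
```

The resulting dictionary is then passed to the TRPO constructor through its cfg argument.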
Spaces and models
The implementation supports the following Gym/Gymnasium spaces (\(\blacksquare\) supported, \(\square\) not supported):

Gym/Gymnasium spaces | Observation | Action
---|---|---
Discrete | \(\square\) | \(\square\)
Box | \(\blacksquare\) | \(\blacksquare\)
Dict | \(\blacksquare\) | \(\square\)
The implementation uses one stochastic function approximator (the policy) and one deterministic function approximator (the value). These function approximators (models) must be collected in a dictionary and passed to the constructor of the class under the argument models.
Notation | Concept | Key | Input shape | Output shape | Type
---|---|---|---|---|---
\(\pi_\theta(s)\) | Policy | "policy" | observation | action | Gaussian / MultivariateGaussian
\(V_\phi(s)\) | Value | "value" | observation | 1 | Deterministic
Support for advanced features is described in the next table.

Feature | Support and remarks
---|---
Shared model | -
RNN support | RNN, LSTM, GRU and any other variant