Advantage Actor Critic (A2C)
A2C (the synchronous version of A3C) is a model-free, on-policy, stochastic policy gradient algorithm.
Paper: Asynchronous Methods for Deep Reinforcement Learning
Algorithm
Note
This implementation relies on parallel environments rather than parallel actor-learners.
Algorithm implementation
Learning algorithm (_update(...))
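The update combines a policy-gradient loss weighted by the advantages, a value (critic) regression loss toward the computed returns, and an optional entropy bonus scaled by the `entropy_loss_scale` hyperparameter. A minimal plain-Python sketch of the loss terms follows; the function name and signature are illustrative, not the library API (skrl operates on torch tensors):

```python
def a2c_loss_terms(log_probs, advantages, values, returns, entropies,
                   entropy_loss_scale=0.0):
    """Sketch of the A2C loss terms for a batch of transitions.

    log_probs: log pi(a_t | s_t) for the taken actions
    advantages / returns: from GAE; values: current critic predictions
    """
    n = len(log_probs)
    # policy loss: maximize E[log pi(a|s) * A(s, a)] -> minimize its negative
    policy_loss = -sum(lp * a for lp, a in zip(log_probs, advantages)) / n
    # entropy bonus encourages exploration (subtracted so it is maximized)
    entropy_loss = -entropy_loss_scale * sum(entropies) / n
    # value loss: mean squared error between critic predictions and returns
    value_loss = sum((r - v) ** 2 for r, v in zip(returns, values)) / n
    return policy_loss + entropy_loss, value_loss
```

In the actual implementation both losses are backpropagated and the gradient norm is clipped with `grad_norm_clip` before the optimizer step.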
compute_gae(...)
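Returns and advantages are computed with Generalized Advantage Estimation (GAE), driven by the `discount_factor` (gamma) and `lambda` hyperparameters from the configuration. A minimal pure-Python sketch of the backward recursion (the in-library version works on torch tensors and may additionally standardize the advantages):

```python
def compute_gae(rewards, dones, values, next_value,
                discount_factor=0.99, lam=0.95):
    """Compute returns and advantages via GAE (plain-Python sketch).

    rewards, dones, values: per-timestep lists for one rollout
    next_value: value estimate for the state after the last timestep
    """
    advantages = [0.0] * len(rewards)
    advantage = 0.0
    # iterate backwards over the collected rollout
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        # bootstrap with next_value at the last timestep
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + discount_factor * v_next * not_done - values[t]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        advantage = delta + discount_factor * lam * not_done * advantage
        advantages[t] = advantage
    # returns serve as regression targets for the value function
    returns = [a + v for a, v in zip(advantages, values)]
    return returns, advantages
```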
Configuration and hyperparameters
- skrl.agents.torch.a2c.a2c.A2C_DEFAULT_CONFIG
```python
A2C_DEFAULT_CONFIG = {
    "rollouts": 16,                 # number of rollouts before updating
    "mini_batches": 1,              # number of mini batches to use for updating

    "discount_factor": 0.99,        # discount factor (gamma)
    "lambda": 0.95,                 # TD(lambda) coefficient (lam) for computing returns and advantages

    "learning_rate": 1e-3,                  # learning rate
    "learning_rate_scheduler": None,        # learning rate scheduler class (see torch.optim.lr_scheduler)
    "learning_rate_scheduler_kwargs": {},   # learning rate scheduler's kwargs (e.g. {"step_size": 1e-3})

    "state_preprocessor": None,             # state preprocessor class (see skrl.resources.preprocessors)
    "state_preprocessor_kwargs": {},        # state preprocessor's kwargs (e.g. {"size": env.observation_space})
    "value_preprocessor": None,             # value preprocessor class (see skrl.resources.preprocessors)
    "value_preprocessor_kwargs": {},        # value preprocessor's kwargs (e.g. {"size": 1})

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "grad_norm_clip": 0.5,          # clipping coefficient for the norm of the gradients

    "entropy_loss_scale": 0.0,      # entropy loss scaling factor

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",            # experiment's parent directory
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,    # interval for checkpoints (timesteps)
        "store_separately": False,      # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
```
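Any of these defaults can be overridden by passing a modified dictionary to the agent's constructor. A minimal sketch of how such an override can be prepared; the `deep_update` helper and the trimmed-down defaults dict here are illustrative, not part of skrl:

```python
import copy

# subset of the A2C defaults, copied here so the sketch is self-contained
A2C_DEFAULT_CONFIG = {
    "rollouts": 16,
    "learning_rate": 1e-3,
    "experiment": {"directory": "", "write_interval": 250},
}

def deep_update(base, overrides):
    """Recursively merge overrides into a copy of base.

    Nested dicts are merged key by key; scalar values are replaced.
    """
    result = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_update(result[key], value)
        else:
            result[key] = value
    return result

# override only the keys that differ from the defaults
cfg = deep_update(A2C_DEFAULT_CONFIG,
                  {"rollouts": 32, "experiment": {"write_interval": 500}})
```

Merging into a copy keeps the module-level defaults untouched, so several agents with different configurations can coexist.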
Spaces and models
The implementation supports the following Gym/Gymnasium spaces (\(\blacksquare\): supported, \(\square\): not supported):

| Gym/Gymnasium spaces | Observation | Action |
|---|---|---|
| Discrete | \(\square\) | \(\blacksquare\) |
| Box | \(\blacksquare\) | \(\blacksquare\) |
| Dict | \(\blacksquare\) | \(\square\) |
The implementation uses one stochastic (discrete or continuous) and one deterministic function approximator. These function approximators (models) must be collected in a dictionary and passed to the constructor of the class under the argument `models`.
| Notation | Concept | Key | Input shape | Output shape | Type |
|---|---|---|---|---|---|
| \(\pi_\theta(s)\) | Policy | `"policy"` | observation | action | Categorical / Gaussian / MultivariateGaussian |
| \(V_\phi(s)\) | Value | `"value"` | observation | 1 | Deterministic |
Support for advanced features is described in the next table.
| Feature | Support and remarks |
|---|---|
| Shared model | for Policy and Value |
| RNN support | RNN, LSTM, GRU and any other variant |