Advantage Actor Critic (A2C)

A2C (the synchronous version of A3C) is a model-free, on-policy, stochastic policy gradient algorithm

Paper: Asynchronous Methods for Deep Reinforcement Learning

Algorithm

Note

This algorithm implementation relies on parallel (vectorized) environments rather than on parallel actor-learners
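
As a hedged illustration (not part of this page), a vectorized Gymnasium environment providing the parallel environments this implementation expects can be created as follows; the environment id and the number of environments are arbitrary choices:

import gymnasium as gym

# create 4 parallel (vectorized) environments; the agent interacts with all of them
# synchronously instead of relying on parallel actor-learners
envs = gym.vector.SyncVectorEnv([lambda: gym.make("Pendulum-v1") for _ in range(4)])

observations, infos = envs.reset()
print(observations.shape)  # (4, 3) -> one observation per parallel environment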

Algorithm implementation

Main notation/symbols:
- policy function approximator (\(\pi_\theta\)), value function approximator (\(V_\phi\))
- states (\(s\)), actions (\(a\)), rewards (\(r\)), next states (\(s'\)), dones (\(d\))
- values (\(V\)), advantages (\(A\)), returns (\(R\))
- log probabilities (\(logp\))
- loss (\(L\))

Learning algorithm (_update(...))

compute_gae(...)
def \(f_{GAE} (r, d, V, V_{_{last}}') \;\rightarrow\; R, A:\)
    \(adv \leftarrow 0\)
    \(A \leftarrow \text{zeros}(r)\)
    # advantages computation
    FOR each reverse iteration \(i\) up to the number of rows in \(r\) DO
        IF \(i\) is not the last row of \(r\) THEN
            \(V_i' \leftarrow V_{i+1}\)
        ELSE
            \(V_i' \leftarrow V_{_{last}}'\)
        \(adv \leftarrow r_i - V_i + \text{discount\_factor} \; \neg d_i \, (V_i' - \text{lambda} \; adv)\)
        \(A_i \leftarrow adv\)
    # returns computation
    \(R \leftarrow A + V\)
    # normalize advantages
    \(A \leftarrow \dfrac{A - \bar{A}}{A_\sigma + 10^{-8}}\)
# compute returns and advantages
\(V_{_{last}}' \leftarrow V_\phi(s')\)
\(R, A \leftarrow f_{GAE}(r, d, V, V_{_{last}}')\)
# sample mini-batches from memory
[[\(s, a, logp, V, R, A\)]] \(\leftarrow\) states, actions, log_prob, values, returns, advantages
# mini-batches loop
FOR each mini-batch [\(s, a, logp, V, R, A\)] up to mini_batches DO
    \(logp' \leftarrow \pi_\theta(s, a)\)
    # compute entropy loss
    IF entropy computation is enabled THEN
        \(L_{entropy} \leftarrow -\text{entropy\_loss\_scale} \; \frac{1}{N} \sum_{i=1}^N \pi_{\theta_{entropy}}\)
    ELSE
        \(L_{entropy} \leftarrow 0\)
    # compute policy loss
    \(L_{\pi_\theta} \leftarrow -\frac{1}{N} \sum_{i=1}^N A \; logp'\)
    # compute value loss
    \(V_{_{predicted}} \leftarrow V_\phi(s)\)
    \(L_{V_\phi} \leftarrow \frac{1}{N} \sum_{i=1}^N (R - V_{_{predicted}})^2\)
    # optimization step
    reset \(\text{optimizer}_{\theta, \phi}\)
    \(\nabla_{\theta, \, \phi} (L_{\pi_\theta} + L_{entropy} + L_{V_\phi})\)
    \(\text{clip}(\lVert \nabla_{\theta, \, \phi} \rVert)\) with grad_norm_clip
    step \(\text{optimizer}_{\theta, \phi}\)
# update learning rate
IF there is a learning_rate_scheduler THEN
    step \(\text{scheduler}_{\theta, \phi} (\text{optimizer}_{\theta, \phi})\)
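
To make the pseudocode concrete, here is a hedged, self-contained PyTorch sketch of one update on a synthetic single-environment rollout. It follows the standard GAE recursion (\(A_t = \delta_t + \gamma \lambda (1 - d_t) A_{t+1}\)) and the loss structure above; the toy networks, tensor shapes and hyperparameter values are illustrative assumptions, and this is not skrl's actual implementation.

import torch
import torch.nn as nn

# --- GAE: A_t = delta_t + gamma * lambda * (1 - d_t) * A_{t+1},
#     with delta_t = r_t + gamma * (1 - d_t) * V_{t+1} - V_t ---
def compute_gae(rewards, dones, values, last_value, discount_factor=0.99, lam=0.95):
    advantages = torch.zeros_like(rewards)
    adv = 0.0
    not_dones = 1.0 - dones
    for i in reversed(range(rewards.shape[0])):
        next_value = values[i + 1] if i < rewards.shape[0] - 1 else last_value
        delta = rewards[i] + discount_factor * not_dones[i] * next_value - values[i]
        adv = delta + discount_factor * lam * not_dones[i] * adv
        advantages[i] = adv
    returns = advantages + values                                      # R = A + V
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return returns, advantages

# --- toy networks and a synthetic single-environment rollout of T timesteps ---
obs_dim, act_dim, T = 3, 1, 16
policy_net = nn.Linear(obs_dim, act_dim)        # outputs the Gaussian mean
log_std = nn.Parameter(torch.zeros(act_dim))    # state-independent log standard deviation
value_net = nn.Linear(obs_dim, 1)
params = list(policy_net.parameters()) + [log_std] + list(value_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

states = torch.randn(T, obs_dim)
actions = torch.randn(T, act_dim)
rewards = torch.randn(T)
dones = torch.zeros(T)
with torch.no_grad():
    values = value_net(states).squeeze(-1)                             # V
    last_value = value_net(torch.randn(1, obs_dim)).squeeze()          # V'_last (state after the rollout)

# compute returns and advantages
returns, advantages = compute_gae(rewards, dones, values, last_value)

# --- one update step on a single mini-batch (the whole rollout here) ---
dist = torch.distributions.Normal(policy_net(states), log_std.exp())
logp = dist.log_prob(actions).sum(-1)                                  # logp'
entropy_loss = -0.01 * dist.entropy().mean()                           # entropy_loss_scale = 0.01
policy_loss = -(advantages * logp).mean()                              # L_pi
value_loss = ((returns - value_net(states).squeeze(-1)) ** 2).mean()   # L_V
optimizer.zero_grad()
(policy_loss + entropy_loss + value_loss).backward()
torch.nn.utils.clip_grad_norm_(params, 0.5)                            # grad_norm_clip = 0.5
optimizer.step()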

Configuration and hyperparameters

skrl.agents.torch.a2c.a2c.A2C_DEFAULT_CONFIG
A2C_DEFAULT_CONFIG = {
    "rollouts": 16,                 # number of rollouts before updating
    "mini_batches": 1,              # number of mini batches to use for updating

    "discount_factor": 0.99,        # discount factor (gamma)
    "lambda": 0.95,                 # TD(lambda) coefficient (lam) for computing returns and advantages

    "learning_rate": 1e-3,          # learning rate
    "learning_rate_scheduler": None,        # learning rate scheduler class (see torch.optim.lr_scheduler)
    "learning_rate_scheduler_kwargs": {},   # learning rate scheduler's kwargs (e.g. {"step_size": 1e-3})

    "state_preprocessor": None,             # state preprocessor class (see skrl.resources.preprocessors)
    "state_preprocessor_kwargs": {},        # state preprocessor's kwargs (e.g. {"size": env.observation_space})
    "value_preprocessor": None,             # value preprocessor class (see skrl.resources.preprocessors)
    "value_preprocessor_kwargs": {},        # value preprocessor's kwargs (e.g. {"size": 1})

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "grad_norm_clip": 0.5,          # clipping coefficient for the norm of the gradients

    "entropy_loss_scale": 0.0,      # entropy loss scaling factor

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",            # experiment's parent directory
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,        # interval for checkpoints (timesteps)
        "store_separately": False,          # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}

Spaces and models

The implementation supports the following Gym spaces / Gymnasium spaces

Gym/Gymnasium spaces    Observation    Action
Discrete                □              ■
Box                     ■              ■
Dict                    ■              □

(■ supported, □ not supported)

The implementation uses one stochastic (discrete or continuous) and one deterministic function approximator. These function approximators (models) must be collected in a dictionary and passed to the constructor of the class under the argument models (see the sketch after the following table)

Notation             Concept    Key         Input shape    Output shape    Type
\(\pi_\theta(s)\)    Policy     "policy"    observation    action          Categorical / Gaussian / MultivariateGaussian
\(V_\phi(s)\)        Value      "value"     observation    1               Deterministic
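
A hedged sketch of how these two models can be defined and collected, assuming a recent skrl version with the Model class and the GaussianMixin / DeterministicMixin mixins; the class names, layer sizes and the env, device, memory and cfg handles are illustrative assumptions:

import torch
import torch.nn as nn

from skrl.agents.torch.a2c import A2C
from skrl.models.torch import Model, GaussianMixin, DeterministicMixin

# stochastic policy pi_theta(s): Gaussian over a Box action space
class Policy(GaussianMixin, Model):
    def __init__(self, observation_space, action_space, device):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(self, clip_actions=False)
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.Tanh(),
                                 nn.Linear(64, self.num_actions))
        self.log_std_parameter = nn.Parameter(torch.zeros(self.num_actions))

    def compute(self, inputs, role):
        return self.net(inputs["states"]), self.log_std_parameter, {}

# deterministic value function V_phi(s) with output shape 1
class Value(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self)
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.Tanh(),
                                 nn.Linear(64, 1))

    def compute(self, inputs, role):
        return self.net(inputs["states"]), {}

# collect the models in a dictionary with the expected keys and pass it to the agent
models = {"policy": Policy(env.observation_space, env.action_space, device),
          "value": Value(env.observation_space, env.action_space, device)}

agent = A2C(models=models, memory=memory, cfg=cfg,   # env, device, memory and cfg assumed defined elsewhere
            observation_space=env.observation_space, action_space=env.action_space, device=device)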

Support for advanced features is described in the next table

Feature         Support and remarks
Shared model    for Policy and Value
RNN support     RNN, LSTM, GRU and any other variant
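
As a hedged sketch of the shared-model feature (again assuming a recent skrl version; the layer sizes and the env / device handles are illustrative), a single network body with two heads can act as both the Policy and the Value model by being registered under both dictionary keys:

import torch
import torch.nn as nn

from skrl.models.torch import Model, GaussianMixin, DeterministicMixin

# one network body with two heads, used as policy ("policy" role) and value ("value" role)
class SharedModel(GaussianMixin, DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(self, clip_actions=False, role="policy")
        DeterministicMixin.__init__(self, role="value")
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.Tanh())
        self.mean_layer = nn.Linear(64, self.num_actions)
        self.value_layer = nn.Linear(64, 1)
        self.log_std_parameter = nn.Parameter(torch.zeros(self.num_actions))

    def act(self, inputs, role):
        # dispatch to the mixin that matches the requested role
        if role == "policy":
            return GaussianMixin.act(self, inputs, role)
        elif role == "value":
            return DeterministicMixin.act(self, inputs, role)

    def compute(self, inputs, role):
        features = self.net(inputs["states"])
        if role == "policy":
            return self.mean_layer(features), self.log_std_parameter, {}
        elif role == "value":
            return self.value_layer(features), {}

# the same instance is used for both entries of the models dictionary
shared = SharedModel(env.observation_space, env.action_space, device)  # env and device assumed defined
models = {"policy": shared, "value": shared}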

API