Advantage Actor Critic (A2C)

A2C (the synchronous version of A3C) is a model-free, on-policy, stochastic policy gradient algorithm

Paper: Asynchronous Methods for Deep Reinforcement Learning

Algorithm

Note

This algorithm implementation relies on parallel (vectorized) environments rather than on parallel actor-learners
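
As a hedged illustration (not part of this page), a vectorized Gymnasium environment providing the parallel environments this implementation expects can be created as follows; the environment id and the number of environments are arbitrary choices:

import gymnasium as gym

# create 4 parallel (vectorized) environments; the agent interacts with all of them
# synchronously instead of relying on parallel actor-learners
envs = gym.vector.SyncVectorEnv([lambda: gym.make("Pendulum-v1") for _ in range(4)])

observations, infos = envs.reset()
print(observations.shape)  # (4, 3) -> one observation per parallel environment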

Algorithm implementation

Main notation/symbols:
- policy function approximator (\(\pi_\theta\)), value function approximator (\(V_\phi\))
- states (\(s\)), actions (\(a\)), rewards (\(r\)), next states (\(s'\)), dones (\(d\))
- values (\(V\)), advantages (\(A\)), returns (\(R\))
- log probabilities (\(logp\))
- loss (\(L\))

Learning algorithm (_update(...))

compute_gae(...)
def \(f_{GAE} (r, d, V, V_{_{last}}') \;\rightarrow\; R, A:\)
    \(adv \leftarrow 0\)
    \(A \leftarrow \text{zeros}(r)\)
    # advantages computation
    FOR each reverse iteration \(i\) up to the number of rows in \(r\) DO
        IF \(i\) is not the last row of \(r\) THEN
            \(V_i' \leftarrow V_{i+1}\)
        ELSE
            \(V_i' \leftarrow V_{_{last}}'\)
        \(adv \leftarrow r_i - V_i + \text{discount\_factor} \; \neg d_i \, (V_i' - \text{lambda} \; adv)\)
        \(A_i \leftarrow adv\)
    # returns computation
    \(R \leftarrow A + V\)
    # normalize advantages
    \(A \leftarrow \dfrac{A - \bar{A}}{A_\sigma + 10^{-8}}\)
# compute returns and advantages
\(V_{_{last}}' \leftarrow V_\phi(s')\)
\(R, A \leftarrow f_{GAE}(r, d, V, V_{_{last}}')\)
# sample mini-batches from memory
[[\(s, a, logp, V, R, A\)]] \(\leftarrow\) states, actions, log_prob, values, returns, advantages
# mini-batches loop
FOR each mini-batch [\(s, a, logp, V, R, A\)] up to mini_batches DO
    \(logp' \leftarrow \pi_\theta(s, a)\)
    # compute entropy loss
    IF entropy computation is enabled THEN
        \(L_{entropy} \leftarrow -\text{entropy\_loss\_scale} \; \frac{1}{N} \sum_{i=1}^N \pi_{\theta_{entropy}}\)
    ELSE
        \(L_{entropy} \leftarrow 0\)
    # compute policy loss
    \(L_{\pi_\theta} \leftarrow -\frac{1}{N} \sum_{i=1}^N A \; logp'\)
    # compute value loss
    \(V_{_{predicted}} \leftarrow V_\phi(s)\)
    \(L_{V_\phi} \leftarrow \frac{1}{N} \sum_{i=1}^N (R - V_{_{predicted}})^2\)
    # optimization step
    reset \(\text{optimizer}_{\theta, \phi}\)
    \(\nabla_{\theta, \, \phi} (L_{\pi_\theta} + L_{entropy} + L_{V_\phi})\)
    \(\text{clip}(\lVert \nabla_{\theta, \, \phi} \rVert)\) with grad_norm_clip
    step \(\text{optimizer}_{\theta, \phi}\)
# update learning rate
IF there is a learning_rate_scheduler THEN
    step \(\text{scheduler}_{\theta, \phi} (\text{optimizer}_{\theta, \phi})\)
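
To make the pseudocode concrete, here is a hedged, self-contained PyTorch sketch of one update on a synthetic single-environment rollout. It follows the standard GAE recursion (\(A_t = \delta_t + \gamma \lambda (1 - d_t) A_{t+1}\)) and the loss structure above; the toy networks, tensor shapes and hyperparameter values are illustrative assumptions, and this is not skrl's actual implementation.

import torch
import torch.nn as nn

# --- GAE: A_t = delta_t + gamma * lambda * (1 - d_t) * A_{t+1},
#     with delta_t = r_t + gamma * (1 - d_t) * V_{t+1} - V_t ---
def compute_gae(rewards, dones, values, last_value, discount_factor=0.99, lam=0.95):
    advantages = torch.zeros_like(rewards)
    adv = 0.0
    not_dones = 1.0 - dones
    for i in reversed(range(rewards.shape[0])):
        next_value = values[i + 1] if i < rewards.shape[0] - 1 else last_value
        delta = rewards[i] + discount_factor * not_dones[i] * next_value - values[i]
        adv = delta + discount_factor * lam * not_dones[i] * adv
        advantages[i] = adv
    returns = advantages + values                                      # R = A + V
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return returns, advantages

# --- toy networks and a synthetic single-environment rollout of T timesteps ---
obs_dim, act_dim, T = 3, 1, 16
policy_net = nn.Linear(obs_dim, act_dim)        # outputs the Gaussian mean
log_std = nn.Parameter(torch.zeros(act_dim))    # state-independent log standard deviation
value_net = nn.Linear(obs_dim, 1)
params = list(policy_net.parameters()) + [log_std] + list(value_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

states = torch.randn(T, obs_dim)
actions = torch.randn(T, act_dim)
rewards = torch.randn(T)
dones = torch.zeros(T)
with torch.no_grad():
    values = value_net(states).squeeze(-1)                             # V
    last_value = value_net(torch.randn(1, obs_dim)).squeeze()          # V'_last (state after the rollout)

# compute returns and advantages
returns, advantages = compute_gae(rewards, dones, values, last_value)

# --- one update step on a single mini-batch (the whole rollout here) ---
dist = torch.distributions.Normal(policy_net(states), log_std.exp())
logp = dist.log_prob(actions).sum(-1)                                  # logp'
entropy_loss = -0.01 * dist.entropy().mean()                           # entropy_loss_scale = 0.01
policy_loss = -(advantages * logp).mean()                              # L_pi
value_loss = ((returns - value_net(states).squeeze(-1)) ** 2).mean()   # L_V
optimizer.zero_grad()
(policy_loss + entropy_loss + value_loss).backward()
torch.nn.utils.clip_grad_norm_(params, 0.5)                            # grad_norm_clip = 0.5
optimizer.step()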

Configuration and hyperparameters

skrl.agents.torch.a2c.a2c.A2C_DEFAULT_CONFIG
A2C_DEFAULT_CONFIG = {
    "rollouts": 16,                 # number of rollouts before updating
    "mini_batches": 1,              # number of mini batches to use for updating

    "discount_factor": 0.99,        # discount factor (gamma)
    "lambda": 0.95,                 # TD(lambda) coefficient (lam) for computing returns and advantages

    "learning_rate": 1e-3,          # learning rate
    "learning_rate_scheduler": None,        # learning rate scheduler class (see torch.optim.lr_scheduler)
    "learning_rate_scheduler_kwargs": {},   # learning rate scheduler's kwargs (e.g. {"step_size": 1e-3})

    "state_preprocessor": None,             # state preprocessor class (see skrl.resources.preprocessors)
    "state_preprocessor_kwargs": {},        # state preprocessor's kwargs (e.g. {"size": env.observation_space})
    "value_preprocessor": None,             # value preprocessor class (see skrl.resources.preprocessors)
    "value_preprocessor_kwargs": {},        # value preprocessor's kwargs (e.g. {"size": 1})

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "grad_norm_clip": 0.5,          # clipping coefficient for the norm of the gradients

    "entropy_loss_scale": 0.0,      # entropy loss scaling factor

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",            # experiment's parent directory
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,        # interval for checkpoints (timesteps)
        "store_separately": False,          # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}

Spaces and models

The implementation supports the following Gym spaces / Gymnasium spaces

Gym/Gymnasium spaces    Observation    Action
Discrete                □              ■
Box                     ■              ■
Dict                    ■              □

(■ supported, □ not supported)

The implementation uses one stochastic (discrete or continuous) and one deterministic function approximator. These function approximators (models) must be collected in a dictionary and passed to the constructor of the class under the argument models (see the sketch after the following table)

Notation             Concept    Key         Input shape    Output shape    Type
\(\pi_\theta(s)\)    Policy     "policy"    observation    action          Categorical / Gaussian / MultivariateGaussian
\(V_\phi(s)\)        Value      "value"     observation    1               Deterministic
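
A hedged sketch of how these two models can be defined and collected, assuming a recent skrl version with the Model class and the GaussianMixin / DeterministicMixin mixins; the class names, layer sizes and the env, device, memory and cfg handles are illustrative assumptions:

import torch
import torch.nn as nn

from skrl.agents.torch.a2c import A2C
from skrl.models.torch import Model, GaussianMixin, DeterministicMixin

# stochastic policy pi_theta(s): Gaussian over a Box action space
class Policy(GaussianMixin, Model):
    def __init__(self, observation_space, action_space, device):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(self, clip_actions=False)
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.Tanh(),
                                 nn.Linear(64, self.num_actions))
        self.log_std_parameter = nn.Parameter(torch.zeros(self.num_actions))

    def compute(self, inputs, role):
        return self.net(inputs["states"]), self.log_std_parameter, {}

# deterministic value function V_phi(s) with output shape 1
class Value(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self)
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.Tanh(),
                                 nn.Linear(64, 1))

    def compute(self, inputs, role):
        return self.net(inputs["states"]), {}

# collect the models in a dictionary with the expected keys and pass it to the agent
models = {"policy": Policy(env.observation_space, env.action_space, device),
          "value": Value(env.observation_space, env.action_space, device)}

agent = A2C(models=models, memory=memory, cfg=cfg,   # env, device, memory and cfg assumed defined elsewhere
            observation_space=env.observation_space, action_space=env.action_space, device=device)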

Support for advanced features is described in the next table

Feature         Support and remarks
Shared model    for Policy and Value
RNN support     RNN, LSTM, GRU and any other variant
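
As a hedged sketch of the shared-model feature (again assuming a recent skrl version; the layer sizes and the env / device handles are illustrative), a single network body with two heads can act as both the Policy and the Value model by being registered under both dictionary keys:

import torch
import torch.nn as nn

from skrl.models.torch import Model, GaussianMixin, DeterministicMixin

# one network body with two heads, used as policy ("policy" role) and value ("value" role)
class SharedModel(GaussianMixin, DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(self, clip_actions=False, role="policy")
        DeterministicMixin.__init__(self, role="value")
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.Tanh())
        self.mean_layer = nn.Linear(64, self.num_actions)
        self.value_layer = nn.Linear(64, 1)
        self.log_std_parameter = nn.Parameter(torch.zeros(self.num_actions))

    def act(self, inputs, role):
        # dispatch to the mixin that matches the requested role
        if role == "policy":
            return GaussianMixin.act(self, inputs, role)
        elif role == "value":
            return DeterministicMixin.act(self, inputs, role)

    def compute(self, inputs, role):
        features = self.net(inputs["states"])
        if role == "policy":
            return self.mean_layer(features), self.log_std_parameter, {}
        elif role == "value":
            return self.value_layer(features), {}

# the same instance is used for both entries of the models dictionary
shared = SharedModel(env.observation_space, env.action_space, device)  # env and device assumed defined
models = {"policy": shared, "value": shared}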

API