Soft Actor-Critic (SAC)

SAC is a model-free, stochastic off-policy actor-critic algorithm that uses double Q-learning (like TD3) and entropy regularization to maximize a trade-off between expected return (exploitation) and policy entropy (exploration).

Paper: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
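
For orientation, the maximum-entropy objective from the paper can be written as below. The notation follows the paper rather than this page: \(\rho_\pi\) is the state-action distribution induced by the policy and \(\mathcal{H}\) is the entropy. The second line shows why the log probabilities (\(logp\)) appear next to \(\alpha\) in the losses of the learning algorithm.

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]

\mathcal{H}\big(\pi(\cdot \mid s_t)\big) = - \mathbb{E}_{a \sim \pi(\cdot \mid s_t)} \left[ \log \pi(a \mid s_t) \right]
```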

Algorithm implementation

Main notation/symbols:
- policy function approximator (\(\pi_\theta\)), critic function approximator (\(Q_\phi\))
- states (\(s\)), actions (\(a\)), rewards (\(r\)), next states (\(s'\)), dones (\(d\))
- log probabilities (\(logp\)), entropy coefficient (\(\alpha\))
- loss (\(L\))

Learning algorithm (_update(...))

# sample a batch from memory
[\(s, a, r, s', d\)] \(\leftarrow\) states, actions, rewards, next_states, dones of size batch_size
# gradient steps
FOR each gradient step up to gradient_steps DO
    # compute target values
    \(a',\; logp' \leftarrow \pi_\theta(s')\)
    \(Q_{1_{target}} \leftarrow Q_{{\phi 1}_{target}}(s', a')\)
    \(Q_{2_{target}} \leftarrow Q_{{\phi 2}_{target}}(s', a')\)
    \(Q_{_{target}} \leftarrow \text{min}(Q_{1_{target}}, Q_{2_{target}}) - \alpha \; logp'\)
    \(y \leftarrow r \;+\) discount_factor \(\neg d \; Q_{_{target}}\)
    # compute critic loss
    \(Q_1 \leftarrow Q_{\phi 1}(s, a)\)
    \(Q_2 \leftarrow Q_{\phi 2}(s, a)\)
    \(L_{Q_\phi} \leftarrow 0.5 \; (\frac{1}{N} \sum_{i=1}^N (Q_1 - y)^2 + \frac{1}{N} \sum_{i=1}^N (Q_2 - y)^2)\)
    # optimization step (critic)
    reset \(\text{optimizer}_\phi\)
    \(\nabla_{\phi} L_{Q_\phi}\)
    \(\text{clip}(\lVert \nabla_{\phi} \rVert)\) with grad_norm_clip
    step \(\text{optimizer}_\phi\)
    # compute policy (actor) loss
    \(a,\; logp \leftarrow \pi_\theta(s)\)
    \(Q_1 \leftarrow Q_{\phi 1}(s, a)\)
    \(Q_2 \leftarrow Q_{\phi 2}(s, a)\)
    \(L_{\pi_\theta} \leftarrow \frac{1}{N} \sum_{i=1}^N (\alpha \; logp - \text{min}(Q_1, Q_2))\)
    # optimization step (policy)
    reset \(\text{optimizer}_\theta\)
    \(\nabla_{\theta} L_{\pi_\theta}\)
    \(\text{clip}(\lVert \nabla_{\theta} \rVert)\) with grad_norm_clip
    step \(\text{optimizer}_\theta\)
    # entropy learning
    IF learn_entropy is enabled THEN
        # compute entropy loss
        \({L}_{entropy} \leftarrow - \frac{1}{N} \sum_{i=1}^N (\log(\alpha) \; (logp + \alpha_{Target}))\)
        # optimization step (entropy)
        reset \(\text{optimizer}_\alpha\)
        \(\nabla_{\alpha} {L}_{entropy}\)
        step \(\text{optimizer}_\alpha\)
        # compute entropy coefficient
        \(\alpha \leftarrow e^{\log(\alpha)}\)
    # update target networks
    \({\phi 1}_{target} \leftarrow\) polyak \({\phi 1} + (1 \;-\) polyak \() {\phi 1}_{target}\)
    \({\phi 2}_{target} \leftarrow\) polyak \({\phi 2} + (1 \;-\) polyak \() {\phi 2}_{target}\)
    # update learning rate
    IF there is a learning_rate_scheduler THEN
        step \(\text{scheduler}_\theta (\text{optimizer}_\theta)\)
        step \(\text{scheduler}_\phi (\text{optimizer}_\phi)\)
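
The arithmetic of one gradient step can be checked by hand with a small sketch. This is not the library's code: it uses plain Python lists instead of tensors, all function names are hypothetical, the batch values are made up, and gradients, clipping and optimizer steps are not modeled. It only evaluates the target, the three losses and the target-network update exactly as written in the pseudocode above.

```python
import math

def soft_targets(r, d, q1_t, q2_t, logp_next, alpha, discount_factor):
    # y = r + discount_factor * (not d) * (min(Q1_target, Q2_target) - alpha * logp')
    return [
        r_i + discount_factor * (0.0 if d_i else 1.0) * (min(a_i, b_i) - alpha * lp_i)
        for r_i, d_i, a_i, b_i, lp_i in zip(r, d, q1_t, q2_t, logp_next)
    ]

def mse(pred, target):
    # mean squared error over the batch
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def critic_loss(q1, q2, y):
    # 0.5 * (mean((Q1 - y)^2) + mean((Q2 - y)^2))
    return 0.5 * (mse(q1, y) + mse(q2, y))

def policy_loss(logp, q1, q2, alpha):
    # mean(alpha * logp - min(Q1, Q2))
    return sum(alpha * lp - min(a, b) for lp, a, b in zip(logp, q1, q2)) / len(logp)

def entropy_loss(log_alpha, logp, target_entropy):
    # -mean(log(alpha) * (logp + target_entropy))
    return -sum(log_alpha * (lp + target_entropy) for lp in logp) / len(logp)

def polyak_update(target, source, polyak):
    # target <- polyak * source + (1 - polyak) * target, element by element
    return [polyak * s + (1.0 - polyak) * t for s, t in zip(source, target)]

# toy batch of N = 2 transitions (made-up numbers)
r = [1.0, 0.0]
d = [False, True]                       # the second transition is terminal
q1_t, q2_t = [2.0, 5.0], [3.0, 4.0]     # target critic outputs for (s', a')
logp_next = [-1.0, -2.0]
alpha = 0.2

y = soft_targets(r, d, q1_t, q2_t, logp_next, alpha, discount_factor=0.99)

q1, q2 = [3.0, 1.0], [3.5, -1.0]        # critic outputs for (s, a)
logp = [-1.0, -2.0]

print("y:", y)
print("critic loss:", critic_loss(q1, q2, y))
print("policy loss:", policy_loss(logp, q1, q2, alpha))
print("entropy loss:", entropy_loss(math.log(alpha), logp, target_entropy=-1.0))
print("target params:", polyak_update([0.0, 1.0], [1.0, 1.0], polyak=0.005))
```

Note that the terminal transition contributes only its reward to the target, and that a small polyak value moves the target parameters only slightly toward the online ones.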

Configuration and hyperparameters

skrl.agents.torch.sac.sac.SAC_DEFAULT_CONFIG
SAC_DEFAULT_CONFIG = {
    "gradient_steps": 1,            # gradient steps
    "batch_size": 64,               # training batch size

    "discount_factor": 0.99,        # discount factor (gamma)
    "polyak": 0.005,                # soft update hyperparameter (tau)

    "actor_learning_rate": 1e-3,    # actor learning rate
    "critic_learning_rate": 1e-3,   # critic learning rate
    "learning_rate_scheduler": None,        # learning rate scheduler class (see torch.optim.lr_scheduler)
    "learning_rate_scheduler_kwargs": {},   # learning rate scheduler's kwargs (e.g. {"step_size": 1e-3})

    "state_preprocessor": None,             # state preprocessor class (see skrl.resources.preprocessors)
    "state_preprocessor_kwargs": {},        # state preprocessor's kwargs (e.g. {"size": env.observation_space})

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "grad_norm_clip": 0,            # clipping coefficient for the norm of the gradients

    "learn_entropy": True,          # learn entropy
    "entropy_learning_rate": 1e-3,  # entropy learning rate
    "initial_entropy_value": 0.2,   # initial entropy value
    "target_entropy": None,         # target entropy

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "base_directory": "",       # base directory for the experiment
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,        # interval for checkpoints (timesteps)
        "store_separately": False,          # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
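
One way to override these defaults is sketched below. To keep the example self-contained, it defines a stand-in excerpt of the dictionary rather than importing it; with skrl installed, the full dictionary would be taken from skrl.agents.torch.sac.sac.SAC_DEFAULT_CONFIG instead. The overridden values are arbitrary and not recommendations.

```python
import copy

# Stand-in excerpt of the defaults listed above (assumption: used here only so
# that the example runs without skrl installed).
SAC_DEFAULT_CONFIG = {
    "gradient_steps": 1,
    "batch_size": 64,
    "discount_factor": 0.99,
    "polyak": 0.005,
    "learn_entropy": True,
    "initial_entropy_value": 0.2,
    "experiment": {
        "base_directory": "",
        "experiment_name": "",
        "write_interval": 250,
    },
}

# deep copy so that edits to the nested "experiment" dictionary
# do not leak back into the shared defaults
cfg = copy.deepcopy(SAC_DEFAULT_CONFIG)
cfg["batch_size"] = 256
cfg["experiment"]["write_interval"] = 1000

print(cfg["batch_size"], cfg["experiment"]["write_interval"])
print(SAC_DEFAULT_CONFIG["batch_size"], SAC_DEFAULT_CONFIG["experiment"]["write_interval"])
```

The deep copy is a design choice: `dict.copy()` is shallow, so with it the nested "experiment" dictionary would still be shared with the defaults and the last assignment would modify both.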

Spaces and models

The implementation supports the following Gym spaces / Gymnasium spaces:

| Gym/Gymnasium spaces | Observation | Action |
|---|---|---|
| Discrete | \(\square\) | \(\square\) |
| Box | \(\blacksquare\) | \(\blacksquare\) |
| Dict | \(\blacksquare\) | \(\square\) |

The implementation uses 1 stochastic and 4 deterministic function approximators. These function approximators (models) must be collected in a dictionary and passed to the constructor of the class under the argument models.

| Notation | Concept | Key | Input shape | Output shape | Type |
|---|---|---|---|---|---|
| \(\pi_\theta(s)\) | Policy (actor) | "policy" | observation | action | Gaussian / MultivariateGaussian |
| \(Q_{\phi 1}(s, a)\) | Q1-network (critic 1) | "critic_1" | observation + action | 1 | Deterministic |
| \(Q_{\phi 2}(s, a)\) | Q2-network (critic 2) | "critic_2" | observation + action | 1 | Deterministic |
| \(Q_{{\phi 1}_{target}}(s, a)\) | Target Q1-network | "target_critic_1" | observation + action | 1 | Deterministic |
| \(Q_{{\phi 2}_{target}}(s, a)\) | Target Q2-network | "target_critic_2" | observation + action | 1 | Deterministic |
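
Before handing the dictionary to the agent, it can help to verify that every key from the table is present. The helper below is a hypothetical convenience, not part of skrl, and the `object()` placeholders stand in for real model instances, which would normally be skrl model subclasses of the types listed in the table.

```python
# keys taken from the "Key" column of the table above
REQUIRED_MODEL_KEYS = (
    "policy",
    "critic_1",
    "critic_2",
    "target_critic_1",
    "target_critic_2",
)

def missing_model_keys(models):
    """Return the required keys that are absent from `models` or set to None."""
    return [key for key in REQUIRED_MODEL_KEYS if models.get(key) is None]

# placeholders only: replace each object() with an actual model instance
models = {
    "policy": object(),            # stochastic (Gaussian / MultivariateGaussian)
    "critic_1": object(),          # deterministic
    "critic_2": object(),          # deterministic
    "target_critic_1": object(),   # deterministic
    "target_critic_2": object(),   # deterministic
}

print(missing_model_keys(models))
print(missing_model_keys({"policy": object()}))
```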

Support for advanced features is described in the following table:

| Feature | Support and remarks |
|---|---|
| Shared model | - |
| RNN support | RNN, LSTM, GRU and any other variant |

API