Soft Actor-Critic (SAC)
SAC is a model-free, off-policy actor-critic algorithm with a stochastic policy. Like TD3, it learns two Q-functions (double Q-learning), and it additionally regularizes the objective with the policy's entropy, so the agent maximizes a trade-off between expected return (exploitation) and entropy (exploration).
Paper: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
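Concretely, the policy is trained to maximize the expected discounted return augmented by the policy's entropy, weighted by the temperature \(\alpha\) (the entropy value in the configuration below):

\[
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t} \gamma^{t} \Big( r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right]
\]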
Algorithm implementation
Learning algorithm (_update(...))
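In outline (a sketch of the standard SAC update from the paper above; the hyperparameters referenced here appear in the configuration below), each gradient step samples a batch \((s, a, r, s', d)\) from the replay memory and computes:

\[
\begin{aligned}
y &= r + \gamma \, (1 - d) \Big( \min_{i=1,2} Q_{{\phi i}_{target}}(s', a') - \alpha \log \pi_\theta(a' \mid s') \Big), \qquad a' \sim \pi_\theta(\cdot \mid s') \\
L_{Q_{\phi i}} &= \mathbb{E} \big[ (Q_{\phi i}(s, a) - y)^2 \big], \qquad i = 1, 2 \\
L_{\pi_\theta} &= \mathbb{E} \Big[ \alpha \log \pi_\theta(\tilde{a} \mid s) - \min_{i=1,2} Q_{\phi i}(s, \tilde{a}) \Big], \qquad \tilde{a} \sim \pi_\theta(\cdot \mid s) \\
L_{\alpha} &= \mathbb{E} \big[ -\alpha \, \big( \log \pi_\theta(\tilde{a} \mid s) + \bar{\mathcal{H}} \big) \big]
\end{aligned}
\]

where \(\bar{\mathcal{H}}\) is the target entropy (the temperature loss \(L_{\alpha}\) applies only when learn_entropy is enabled). The target networks are then soft-updated with the polyak hyperparameter \(\tau\): \(\phi_{i_{target}} \leftarrow \tau \phi_i + (1 - \tau) \phi_{i_{target}}\).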
Configuration and hyperparameters
- skrl.agents.torch.sac.sac.SAC_DEFAULT_CONFIG
```python
SAC_DEFAULT_CONFIG = {
    "gradient_steps": 1,            # gradient steps
    "batch_size": 64,               # training batch size

    "discount_factor": 0.99,        # discount factor (gamma)
    "polyak": 0.005,                # soft update hyperparameter (tau)

    "actor_learning_rate": 1e-3,    # actor learning rate
    "critic_learning_rate": 1e-3,   # critic learning rate
    "learning_rate_scheduler": None,        # learning rate scheduler class (see torch.optim.lr_scheduler)
    "learning_rate_scheduler_kwargs": {},   # learning rate scheduler's kwargs (e.g. {"step_size": 1e-3})

    "state_preprocessor": None,             # state preprocessor class (see skrl.resources.preprocessors)
    "state_preprocessor_kwargs": {},        # state preprocessor's kwargs (e.g. {"size": env.observation_space})

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "grad_norm_clip": 0,            # clipping coefficient for the norm of the gradients

    "learn_entropy": True,          # learn entropy
    "entropy_learning_rate": 1e-3,  # entropy learning rate
    "initial_entropy_value": 0.2,   # initial entropy value
    "target_entropy": None,         # target entropy

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "base_directory": "",       # base directory for the experiment
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,        # interval for checkpoints (timesteps)
        "store_separately": False,          # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
```
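As a brief usage sketch, the defaults can be deep-copied and selectively overridden before instantiating the agent. Here, env is assumed to be a skrl-wrapped environment, and models and memory are assumed to have been built beforehand (see the following sections):

```python
# a minimal sketch: override selected defaults and instantiate the agent
# (assumes `env` is a skrl-wrapped environment and that `models` and
# `memory` were created beforehand -- see the following sections)
import copy

from skrl.agents.torch.sac import SAC, SAC_DEFAULT_CONFIG

cfg = copy.deepcopy(SAC_DEFAULT_CONFIG)  # deep copy: "experiment" is a nested dict
cfg["batch_size"] = 256          # larger training batches
cfg["learning_starts"] = 1000    # fill the replay memory before learning starts
cfg["experiment"]["write_interval"] = 500

agent = SAC(models=models,
            memory=memory,
            cfg=cfg,
            observation_space=env.observation_space,
            action_space=env.action_space,
            device=env.device)
```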
Spaces and models
The implementation supports the following Gym/Gymnasium spaces (\(\blacksquare\): supported, \(\square\): not supported):
| Gym/Gymnasium spaces | Observation | Action |
|---|---|---|
| Discrete | \(\square\) | \(\square\) |
| Box | \(\blacksquare\) | \(\blacksquare\) |
| Dict | \(\blacksquare\) | \(\square\) |
The implementation uses one stochastic and four deterministic function approximators. These function approximators (models) must be collected in a dictionary and passed to the constructor of the class under the argument models, using the keys listed in the following table (a definition sketch follows the table).
| Notation | Concept | Key | Input shape | Output shape | Type |
|---|---|---|---|---|---|
| \(\pi_\theta(s)\) | Policy (actor) | "policy" | observation | action | Gaussian / MultivariateGaussian |
| \(Q_{\phi 1}(s, a)\) | Q1-network (critic 1) | "critic_1" | observation + action | 1 | Deterministic |
| \(Q_{\phi 2}(s, a)\) | Q2-network (critic 2) | "critic_2" | observation + action | 1 | Deterministic |
| \(Q_{{\phi 1}_{target}}(s, a)\) | Target Q1-network | "target_critic_1" | observation + action | 1 | Deterministic |
| \(Q_{{\phi 2}_{target}}(s, a)\) | Target Q2-network | "target_critic_2" | observation + action | 1 | Deterministic |
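As a minimal sketch of how these five models might be defined with skrl's mixin-based model API (network sizes are illustrative, and env is again assumed to be an already-wrapped environment):

```python
# a minimal sketch of the five SAC models using skrl's mixin-based model API
# (network sizes are illustrative, not prescriptive; `env` is assumed)
import torch
import torch.nn as nn

from skrl.models.torch import Model, GaussianMixin, DeterministicMixin

class Policy(GaussianMixin, Model):
    def __init__(self, observation_space, action_space, device):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(self)
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.ReLU(),
                                 nn.Linear(64, self.num_actions))
        self.log_std_parameter = nn.Parameter(torch.zeros(self.num_actions))

    def compute(self, inputs, role):
        # mean actions and a (state-independent) log standard deviation
        return self.net(inputs["states"]), self.log_std_parameter, {}

class Critic(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self)
        self.net = nn.Sequential(nn.Linear(self.num_observations + self.num_actions, 64),
                                 nn.ReLU(), nn.Linear(64, 1))

    def compute(self, inputs, role):
        # Q-value for the concatenated observation-action pair
        return self.net(torch.cat([inputs["states"], inputs["taken_actions"]], dim=1)), {}

# collect the models in a dictionary using the keys from the table above
models = {"policy": Policy(env.observation_space, env.action_space, env.device),
          "critic_1": Critic(env.observation_space, env.action_space, env.device),
          "critic_2": Critic(env.observation_space, env.action_space, env.device),
          "target_critic_1": Critic(env.observation_space, env.action_space, env.device),
          "target_critic_2": Critic(env.observation_space, env.action_space, env.device)}
```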
Support for advanced features is described in the table below.
| Feature | Support and remarks |
|---|---|
| Shared model | - (not supported) |
| RNN support | RNN, LSTM, GRU and any other variant |