Deep Deterministic Policy Gradient (DDPG)

DDPG is a model-free, deterministic, off-policy actor-critic algorithm that uses deep function approximators to learn a policy (and to estimate the action-value function) in high-dimensional, continuous action spaces.

Paper: Continuous control with deep reinforcement learning

Algorithm implementation

Main notation/symbols:
- policy function approximator (\(\mu_\theta\)), critic function approximator (\(Q_\phi\))
- states (\(s\)), actions (\(a\)), rewards (\(r\)), next states (\(s'\)), dones (\(d\))
- loss (\(L\))

Decision making (act(...))

\(a \leftarrow \mu_\theta(s)\)
\(noise \leftarrow\) sample noise
\(scale \leftarrow \left(1 - \frac{\text{timestep}}{\text{timesteps}}\right) (\text{initial\_scale} - \text{final\_scale}) + \text{final\_scale}\)
\(a \leftarrow \text{clip}(a + noise * scale, {a}_{Low}, {a}_{High})\)
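
Read as code, the decision-making step is a deterministic forward pass followed by additive, linearly decayed exploration noise. The following is a minimal PyTorch sketch (not skrl's internal act(...) code) assuming a callable policy model, a noise object exposing a sample(shape) method, and action bounds passed in explicitly; all of these names are illustrative.

import torch

def act(policy, noise, states, timestep, timesteps,
        initial_scale=1.0, final_scale=1e-3, action_low=-1.0, action_high=1.0):
    with torch.no_grad():
        # a <- mu_theta(s)
        actions = policy(states)
        # sample exploration noise and linearly decay its scale over the configured timesteps
        sampled_noise = noise.sample(actions.shape)
        scale = (1 - timestep / timesteps) * (initial_scale - final_scale) + final_scale
        # perturb the actions and clip them to the valid action range
        return torch.clamp(actions + sampled_noise * scale, action_low, action_high)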

Learning algorithm (_update(...))

# sample a batch from memory
[\(s, a, r, s', d\)] \(\leftarrow\) states, actions, rewards, next_states, dones of size batch_size
# gradient steps
FOR each gradient step up to gradient_steps DO
# compute target values
\(a' \leftarrow \mu_{\theta_{target}}(s')\)
\(Q_{target} \leftarrow Q_{\phi_{target}}(s', a')\)
\(y \leftarrow r + \text{discount\_factor} \; \neg d \; Q_{target}\)
# compute critic loss
\(Q \leftarrow Q_\phi(s, a)\)
\(L_{Q_\phi} \leftarrow \frac{1}{N} \sum_{i=1}^N (Q_i - y_i)^2\)
# optimization step (critic)
reset \(\text{optimizer}_\phi\)
\(\nabla_{\phi} L_{Q_\phi}\)
\(\text{clip}(\lVert \nabla_{\phi} \rVert)\) with grad_norm_clip
step \(\text{optimizer}_\phi\)
# compute policy (actor) loss
\(a \leftarrow \mu_\theta(s)\)
\(Q \leftarrow Q_\phi(s, a)\)
\(L_{\mu_\theta} \leftarrow - \frac{1}{N} \sum_{i=1}^N Q_i\)
# optimization step (policy)
reset \(\text{optimizer}_\theta\)
\(\nabla_{\theta} L_{\mu_\theta}\)
\(\text{clip}(\lVert \nabla_{\theta} \rVert)\) with grad_norm_clip
step \(\text{optimizer}_\theta\)
# update target networks
\(\theta_{target} \leftarrow \text{polyak} \; \theta + (1 - \text{polyak}) \; \theta_{target}\)
\(\phi_{target} \leftarrow \text{polyak} \; \phi + (1 - \text{polyak}) \; \phi_{target}\)
# update learning rate
IF there is a learning_rate_scheduler THEN
step \(\text{scheduler}_\theta (\text{optimizer}_\theta)\)
step \(\text{scheduler}_\phi (\text{optimizer}_\phi)\)
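
The update loop above maps one-to-one onto standard PyTorch operations. Below is a condensed sketch of a single gradient step that mirrors the pseudocode rather than skrl's internal _update(...) method; it assumes plain torch.nn.Module networks (with the critic called as critic(states, actions)), preconfigured optimizers, and an already sampled batch.

import torch
import torch.nn.functional as F

def ddpg_gradient_step(batch, policy, target_policy, critic, target_critic,
                       policy_optimizer, critic_optimizer,
                       discount_factor=0.99, polyak=0.005, grad_norm_clip=0.0):
    s, a, r, s_next, d = batch  # states, actions, rewards, next_states, dones

    # compute target values: y = r + discount_factor * (1 - d) * Q_target(s', mu_target(s'))
    with torch.no_grad():
        q_target = target_critic(s_next, target_policy(s_next))
        y = r + discount_factor * (1 - d.float()) * q_target

    # critic loss (mean squared TD error) and optimization step
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_optimizer.zero_grad()
    critic_loss.backward()
    if grad_norm_clip > 0:
        torch.nn.utils.clip_grad_norm_(critic.parameters(), grad_norm_clip)
    critic_optimizer.step()

    # policy (actor) loss: maximize Q(s, mu(s)) by minimizing its negative mean
    policy_loss = -critic(s, policy(s)).mean()
    policy_optimizer.zero_grad()
    policy_loss.backward()
    if grad_norm_clip > 0:
        torch.nn.utils.clip_grad_norm_(policy.parameters(), grad_norm_clip)
    policy_optimizer.step()

    # soft (polyak) update of the target networks
    with torch.no_grad():
        for params, target_params in ((policy.parameters(), target_policy.parameters()),
                                      (critic.parameters(), target_critic.parameters())):
            for p, tp in zip(params, target_params):
                tp.data.mul_(1 - polyak).add_(polyak * p.data)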

Configuration and hyperparameters

skrl.agents.torch.ddpg.ddpg.DDPG_DEFAULT_CONFIG
DDPG_DEFAULT_CONFIG = {
    "gradient_steps": 1,            # gradient steps
    "batch_size": 64,               # training batch size

    "discount_factor": 0.99,        # discount factor (gamma)
    "polyak": 0.005,                # soft update hyperparameter (tau)

    "actor_learning_rate": 1e-3,    # actor learning rate
    "critic_learning_rate": 1e-3,   # critic learning rate
    "learning_rate_scheduler": None,        # learning rate scheduler class (see torch.optim.lr_scheduler)
    "learning_rate_scheduler_kwargs": {},   # learning rate scheduler's kwargs (e.g. {"step_size": 1e-3})

    "state_preprocessor": None,             # state preprocessor class (see skrl.resources.preprocessors)
    "state_preprocessor_kwargs": {},        # state preprocessor's kwargs (e.g. {"size": env.observation_space})

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "grad_norm_clip": 0,            # clipping coefficient for the norm of the gradients

    "exploration": {
        "noise": None,              # exploration noise
        "initial_scale": 1.0,       # initial scale for the noise
        "final_scale": 1e-3,        # final scale for the noise
        "timesteps": None,          # timesteps for the noise decay
    },

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",            # experiment's parent directory
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,        # interval for checkpoints (timesteps)
        "store_separately": False,          # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
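
A common pattern is to copy the default dictionary, override the entries of interest and pass the result to the agent's constructor. The sketch below assumes that env, the models dictionary, the replay memory and the device have already been created; the overridden values and the noise parameters are illustrative, not recommendations.

from copy import deepcopy

from skrl.agents.torch.ddpg import DDPG, DDPG_DEFAULT_CONFIG
from skrl.resources.noises.torch import OrnsteinUhlenbeckNoise

# copy the default configuration and override selected hyperparameters
cfg = deepcopy(DDPG_DEFAULT_CONFIG)
cfg["batch_size"] = 128
cfg["discount_factor"] = 0.98
cfg["exploration"]["noise"] = OrnsteinUhlenbeckNoise(theta=0.15, sigma=0.2, base_scale=1.0, device=device)

agent = DDPG(models=models,                              # models dictionary (see "Spaces and models")
             memory=memory,                              # replay memory instance
             cfg=cfg,
             observation_space=env.observation_space,
             action_space=env.action_space,
             device=device)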

Spaces and models

The implementation supports the following Gym spaces / Gymnasium spaces:

Gym/Gymnasium spaces    Observation         Action
Discrete                \(\square\)         \(\square\)
Box                     \(\blacksquare\)    \(\blacksquare\)
Dict                    \(\blacksquare\)    \(\square\)

The implementation uses four deterministic function approximators. These function approximators (models) must be collected in a dictionary and passed to the constructor of the class under the argument models (see the sketch after the following table).

Notation                         Concept              Key                Input shape             Output shape    Type
\(\mu_\theta(s)\)                Policy (actor)       "policy"           observation             action          Deterministic
\(\mu_{\theta_{target}}(s)\)     Target policy        "target_policy"    observation             action          Deterministic
\(Q_\phi(s, a)\)                 Q-network (critic)   "critic"           observation + action    1               Deterministic
\(Q_{\phi_{target}}(s, a)\)      Target Q-network     "target_critic"    observation + action    1               Deterministic
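
As a concrete example of how such a dictionary can be assembled, the sketch below defines a deterministic actor and a deterministic critic with skrl's Model and DeterministicMixin classes and instantiates them under the four keys listed above. The network architectures are arbitrary placeholders, and env and device are assumed to exist already.

import torch
import torch.nn as nn

from skrl.models.torch import Model, DeterministicMixin


class Actor(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device, clip_actions=False):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self, clip_actions)
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.ReLU(),
                                 nn.Linear(64, self.num_actions), nn.Tanh())

    def compute(self, inputs, role):
        # map observations to a deterministic action
        return self.net(inputs["states"]), {}


class Critic(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device, clip_actions=False):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self, clip_actions)
        self.net = nn.Sequential(nn.Linear(self.num_observations + self.num_actions, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def compute(self, inputs, role):
        # concatenate observations and taken actions to estimate Q(s, a)
        return self.net(torch.cat([inputs["states"], inputs["taken_actions"]], dim=-1)), {}


models = {"policy": Actor(env.observation_space, env.action_space, device),
          "target_policy": Actor(env.observation_space, env.action_space, device),
          "critic": Critic(env.observation_space, env.action_space, device),
          "target_critic": Critic(env.observation_space, env.action_space, device)}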

Support for advanced features is described in the next table.

Feature          Support and remarks
Shared model     -
RNN support      RNN, LSTM, GRU and any other variant

API