Double Deep Q-Network (DDQN)

DDQN is a model-free, off-policy algorithm that relies on double Q-learning to mitigate the overestimation of action values exhibited by DQN.
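
Concretely, both variants bootstrap from the target network, but they differ in how the next action is chosen: DQN selects and evaluates the next action with the target network, while double Q-learning selects it with the online network and evaluates it with the target network (the notation below mirrors the pseudocode in the learning algorithm section):

\(y_{DQN} = r + \gamma \; Q_{\phi_{target}}(s', \underset{a'}{\arg\max} \; Q_{\phi_{target}}(s', a'))\)
\(y_{DDQN} = r + \gamma \; Q_{\phi_{target}}(s', \underset{a'}{\arg\max} \; Q_{\phi}(s', a'))\)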

Paper: Deep Reinforcement Learning with Double Q-Learning

Algorithm implementation

Decision making (act(...))

\(\epsilon \leftarrow \epsilon_{final} + (\epsilon_{initial} - \epsilon_{final}) \; e^{-\frac{\text{timestep}}{\epsilon_{timesteps}}}\)
\(a \leftarrow \begin{cases} a \in_R A & x < \epsilon \\ \underset{a}{\arg\max} \; Q_\phi(s) & x \geq \epsilon \end{cases} \qquad\) for \(\; x \leftarrow U(0,1)\)
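
As a rough illustration, the epsilon-greedy selection above can be written as follows. This is a minimal sketch for a single environment; epsilon_greedy_act and its arguments are hypothetical names, not part of the skrl API, and the agent's actual act(...) method also handles random exploration steps and batched environments.

import math
import torch

def epsilon_greedy_act(q_network, state, timestep,
                       initial_epsilon=1.0, final_epsilon=0.05, epsilon_timesteps=1000):
    # exponential decay of epsilon from initial_epsilon towards final_epsilon
    epsilon = final_epsilon + (initial_epsilon - final_epsilon) \
        * math.exp(-1.0 * timestep / epsilon_timesteps)

    q_values = q_network(state)                      # shape: (1, num_actions)
    if torch.rand(1).item() < epsilon:
        # explore: uniformly random action
        return torch.randint(q_values.shape[-1], (1,))
    # exploit: greedy action with respect to the current Q-network
    return q_values.argmax(dim=-1)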

Learning algorithm (_update(...))

# sample a batch from memory
[\(s, a, r, s', d\)] \(\leftarrow\) states, actions, rewards, next_states, dones of size batch_size
# gradient steps
FOR each gradient step up to gradient_steps DO
    # compute target values
    \(Q' \leftarrow Q_{\phi_{target}}(s')\)
    \(Q_{target} \leftarrow Q'[\underset{a}{\arg\max} \; Q_\phi(s')] \qquad\) # the only difference with DQN
    \(y \leftarrow r \;+\) discount_factor \(\; \neg d \; Q_{target}\)
    # compute Q-network loss
    \(Q \leftarrow Q_\phi(s)[a]\)
    \({Loss}_{Q_\phi} \leftarrow \frac{1}{N} \sum_{i=1}^N (Q_i - y_i)^2\)
    # optimize Q-network
    \(\nabla_{\phi} {Loss}_{Q_\phi}\)
    # update target network
    IF it's time to update target network THEN
        \(\phi_{target} \leftarrow\) polyak \(\phi + (1 \;-\) polyak \() \phi_{target}\)
    # update learning rate
    IF there is a learning_rate_scheduler THEN
        step \(\text{scheduler}_\phi (\text{optimizer}_\phi)\)
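
The update loop above corresponds roughly to the following PyTorch sketch. ddqn_update and the batch layout are illustrative assumptions, not skrl's internal code; looping over gradient_steps, state preprocessing and learning rate scheduling are omitted.

import torch
import torch.nn.functional as F

def ddqn_update(q_network, target_q_network, optimizer, batch,
                discount_factor=0.99, polyak=0.005):
    # batch tensors are assumed to have shape (batch_size, 1), except states/next_states
    states, actions, rewards, next_states, dones = batch

    with torch.no_grad():
        # select next actions with the online network (the only difference with DQN) ...
        next_actions = q_network(next_states).argmax(dim=-1, keepdim=True)
        # ... but evaluate them with the target network
        next_q = target_q_network(next_states).gather(1, next_actions)
        y = rewards + discount_factor * (1 - dones.float()) * next_q

    # Q-values of the actions taken in the batch
    q = q_network(states).gather(1, actions.long())
    loss = F.mse_loss(q, y)

    # optimize the Q-network
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # soft (Polyak) update of the target network
    with torch.no_grad():
        for p, p_target in zip(q_network.parameters(), target_q_network.parameters()):
            p_target.mul_(1 - polyak).add_(polyak * p)

    return loss.item()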

Configuration and hyperparameters

skrl.agents.torch.dqn.ddqn.DDQN_DEFAULT_CONFIG
DDQN_DEFAULT_CONFIG = {
    "gradient_steps": 1,            # gradient steps
    "batch_size": 64,               # training batch size

    "discount_factor": 0.99,        # discount factor (gamma)
    "polyak": 0.005,                # soft update hyperparameter (tau)

    "learning_rate": 1e-3,          # learning rate
    "learning_rate_scheduler": None,        # learning rate scheduler class (see torch.optim.lr_scheduler)
    "learning_rate_scheduler_kwargs": {},   # learning rate scheduler's kwargs (e.g. {"step_size": 1e-3})

    "state_preprocessor": None,             # state preprocessor class (see skrl.resources.preprocessors)
    "state_preprocessor_kwargs": {},        # state preprocessor's kwargs (e.g. {"size": env.observation_space})

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "update_interval": 1,           # agent update interval
    "target_update_interval": 10,   # target network update interval

    "exploration": {
        "initial_epsilon": 1.0,       # initial epsilon for epsilon-greedy exploration
        "final_epsilon": 0.05,        # final epsilon for epsilon-greedy exploration
        "timesteps": 1000,            # timesteps for epsilon-greedy decay
    },

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",            # experiment's parent directory
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,        # interval for checkpoints (timesteps)
        "store_separately": False,          # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
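
To change any of these values, it is usually enough to copy the default dictionary and override the desired entries before instantiating the agent. The snippet below is a sketch: the overridden values are arbitrary, and models, memory, env and device are assumed to have been created beforehand.

import copy

from skrl.agents.torch.dqn import DDQN, DDQN_DEFAULT_CONFIG

# copy the default configuration (deepcopy because it contains nested dictionaries)
cfg = copy.deepcopy(DDQN_DEFAULT_CONFIG)
cfg["batch_size"] = 128
cfg["exploration"]["timesteps"] = 5000
cfg["experiment"]["write_interval"] = 1000

# agent instantiation (models, memory, env and device are assumed to exist)
# agent = DDQN(models=models,
#              memory=memory,
#              cfg=cfg,
#              observation_space=env.observation_space,
#              action_space=env.action_space,
#              device=device)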

Spaces and models

The implementation supports the following Gym / Gymnasium spaces:

Gym/Gymnasium spaces    Observation         Action
Discrete                \(\square\)         \(\blacksquare\)
Box                     \(\blacksquare\)    \(\square\)
Dict                    \(\blacksquare\)    \(\square\)

(\(\blacksquare\) supported, \(\square\) not supported)
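
For example, a classic control task such as CartPole-v1 falls in the supported combination (Box observation space, Discrete action space); the snippet below is just a quick way to inspect an environment's spaces.

import gymnasium as gym

# CartPole-v1 exposes a Box observation space and a Discrete action space,
# which matches the supported combination in the table above
env = gym.make("CartPole-v1")
print(env.observation_space)   # Box(...)
print(env.action_space)        # Discrete(2)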

The implementation uses two deterministic function approximators. These function approximators (models) must be collected in a dictionary and passed to the constructor of the class under the argument models (a minimal model definition is sketched after the table below).

Notation                        Concept             Key                   Input shape    Output shape    Type
\(Q_\phi(s, a)\)                Q-network           "q_network"           observation    action          Deterministic
\(Q_{\phi_{target}}(s, a)\)     Target Q-network    "target_q_network"    observation    action          Deterministic
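
A minimal sketch of such a dictionary, using skrl's Model and DeterministicMixin classes; the MLP architecture and layer sizes are arbitrary choices made for illustration, not a recommendation.

import gymnasium as gym
import torch
import torch.nn as nn

from skrl.models.torch import DeterministicMixin, Model

# a minimal Q-network: a plain MLP mapping observations to one Q-value per action
class QNetwork(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device, clip_actions=False):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self, clip_actions)

        self.net = nn.Sequential(nn.Linear(self.num_observations, 64),
                                 nn.ReLU(),
                                 nn.Linear(64, 64),
                                 nn.ReLU(),
                                 nn.Linear(64, self.num_actions))

    def compute(self, inputs, role):
        # return the Q-values for all actions given the observations
        return self.net(inputs["states"]), {}

env = gym.make("CartPole-v1")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# collect the models in a dictionary using the keys from the table above
models = {"q_network": QNetwork(env.observation_space, env.action_space, device),
          "target_q_network": QNetwork(env.observation_space, env.action_space, device)}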

Support for advanced features is described in the next table.

Feature           Support and remarks
Shared model      -
RNN support       -

API