Double Deep Q-Network (DDQN)

DDQN is a model-free, off-policy algorithm that relies on double Q-learning to mitigate the overestimation of action values exhibited by DQN.
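
Concretely, both variants bootstrap from the target network, but they differ in how the next action is chosen: DQN selects and evaluates the next action with the target network, while double Q-learning selects it with the online network and evaluates it with the target network (the notation below mirrors the pseudocode in the learning algorithm section):

\(y_{DQN} = r + \gamma \; Q_{\phi_{target}}(s', \underset{a'}{\arg\max} \; Q_{\phi_{target}}(s', a'))\)
\(y_{DDQN} = r + \gamma \; Q_{\phi_{target}}(s', \underset{a'}{\arg\max} \; Q_{\phi}(s', a'))\)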

Paper: Deep Reinforcement Learning with Double Q-Learning

Algorithm implementation

Decision making (act(...))

\(\epsilon \leftarrow \epsilon_{final} + (\epsilon_{initial} - \epsilon_{final}) \; e^{-\frac{\text{timestep}}{\epsilon_{timesteps}}}\)
\(a \leftarrow \begin{cases} a \in_R A & x < \epsilon \\ \underset{a}{\arg\max} \; Q_\phi(s) & x \geq \epsilon \end{cases} \qquad\) for \(\; x \leftarrow U(0,1)\)
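
As a rough illustration, the epsilon-greedy selection above can be written as follows. This is a minimal sketch for a single environment; epsilon_greedy_act and its arguments are hypothetical names, not part of the skrl API, and the agent's actual act(...) method also handles random exploration steps and batched environments.

import math
import torch

def epsilon_greedy_act(q_network, state, timestep,
                       initial_epsilon=1.0, final_epsilon=0.05, epsilon_timesteps=1000):
    # exponential decay of epsilon from initial_epsilon towards final_epsilon
    epsilon = final_epsilon + (initial_epsilon - final_epsilon) \
        * math.exp(-1.0 * timestep / epsilon_timesteps)

    q_values = q_network(state)                      # shape: (1, num_actions)
    if torch.rand(1).item() < epsilon:
        # explore: uniformly random action
        return torch.randint(q_values.shape[-1], (1,))
    # exploit: greedy action with respect to the current Q-network
    return q_values.argmax(dim=-1)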

Learning algorithm (_update(...))

# sample a batch from memory
[\(s, a, r, s', d\)] \(\leftarrow\) states, actions, rewards, next_states, dones of size batch_size
# gradient steps
FOR each gradient step up to gradient_steps DO
    # compute target values
    \(Q' \leftarrow Q_{\phi_{target}}(s')\)
    \(Q_{target} \leftarrow Q'[\underset{a}{\arg\max} \; Q_\phi(s')] \qquad\) # the only difference with DQN
    \(y \leftarrow r \;+\) discount_factor \(\; \neg d \; Q_{target}\)
    # compute Q-network loss
    \(Q \leftarrow Q_\phi(s)[a]\)
    \({Loss}_{Q_\phi} \leftarrow \frac{1}{N} \sum_{i=1}^N (Q_i - y_i)^2\)
    # optimize Q-network
    \(\nabla_{\phi} {Loss}_{Q_\phi}\)
    # update target network
    IF it's time to update target network THEN
        \(\phi_{target} \leftarrow\) polyak \(\phi + (1 \;-\) polyak \() \phi_{target}\)
    # update learning rate
    IF there is a learning_rate_scheduler THEN
        step \(\text{scheduler}_\phi (\text{optimizer}_\phi)\)
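
The update loop above corresponds roughly to the following PyTorch sketch. ddqn_update and the batch layout are illustrative assumptions, not skrl's internal code; looping over gradient_steps, state preprocessing and learning rate scheduling are omitted.

import torch
import torch.nn.functional as F

def ddqn_update(q_network, target_q_network, optimizer, batch,
                discount_factor=0.99, polyak=0.005):
    # batch tensors are assumed to have shape (batch_size, 1), except states/next_states
    states, actions, rewards, next_states, dones = batch

    with torch.no_grad():
        # select next actions with the online network (the only difference with DQN) ...
        next_actions = q_network(next_states).argmax(dim=-1, keepdim=True)
        # ... but evaluate them with the target network
        next_q = target_q_network(next_states).gather(1, next_actions)
        y = rewards + discount_factor * (1 - dones.float()) * next_q

    # Q-values of the actions taken in the batch
    q = q_network(states).gather(1, actions.long())
    loss = F.mse_loss(q, y)

    # optimize the Q-network
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # soft (Polyak) update of the target network
    with torch.no_grad():
        for p, p_target in zip(q_network.parameters(), target_q_network.parameters()):
            p_target.mul_(1 - polyak).add_(polyak * p)

    return loss.item()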

Configuration and hyperparameters

skrl.agents.torch.dqn.ddqn.DDQN_DEFAULT_CONFIG
DDQN_DEFAULT_CONFIG = {
    "gradient_steps": 1,            # gradient steps
    "batch_size": 64,               # training batch size

    "discount_factor": 0.99,        # discount factor (gamma)
    "polyak": 0.005,                # soft update hyperparameter (tau)

    "learning_rate": 1e-3,          # learning rate
    "learning_rate_scheduler": None,        # learning rate scheduler class (see torch.optim.lr_scheduler)
    "learning_rate_scheduler_kwargs": {},   # learning rate scheduler's kwargs (e.g. {"step_size": 1e-3})

    "state_preprocessor": None,             # state preprocessor class (see skrl.resources.preprocessors)
    "state_preprocessor_kwargs": {},        # state preprocessor's kwargs (e.g. {"size": env.observation_space})

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "update_interval": 1,           # agent update interval
    "target_update_interval": 10,   # target network update interval

    "exploration": {
        "initial_epsilon": 1.0,       # initial epsilon for epsilon-greedy exploration
        "final_epsilon": 0.05,        # final epsilon for epsilon-greedy exploration
        "timesteps": 1000,            # timesteps for epsilon-greedy decay
    },

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",            # experiment's parent directory
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,        # interval for checkpoints (timesteps)
        "store_separately": False,          # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
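
To change any of these values, it is usually enough to copy the default dictionary and override the desired entries before instantiating the agent. The snippet below is a sketch: the overridden values are arbitrary, and models, memory, env and device are assumed to have been created beforehand.

import copy

from skrl.agents.torch.dqn import DDQN, DDQN_DEFAULT_CONFIG

# copy the default configuration (deepcopy because it contains nested dictionaries)
cfg = copy.deepcopy(DDQN_DEFAULT_CONFIG)
cfg["batch_size"] = 128
cfg["exploration"]["timesteps"] = 5000
cfg["experiment"]["write_interval"] = 1000

# agent instantiation (models, memory, env and device are assumed to exist)
# agent = DDQN(models=models,
#              memory=memory,
#              cfg=cfg,
#              observation_space=env.observation_space,
#              action_space=env.action_space,
#              device=device)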

Spaces and models

The implementation supports the following Gym / Gymnasium spaces:

Gym/Gymnasium spaces    Observation         Action
Discrete                \(\square\)         \(\blacksquare\)
Box                     \(\blacksquare\)    \(\square\)
Dict                    \(\blacksquare\)    \(\square\)

(\(\blacksquare\) supported, \(\square\) not supported)
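
For example, a classic control task such as CartPole-v1 falls in the supported combination (Box observation space, Discrete action space); the snippet below is just a quick way to inspect an environment's spaces.

import gymnasium as gym

# CartPole-v1 exposes a Box observation space and a Discrete action space,
# which matches the supported combination in the table above
env = gym.make("CartPole-v1")
print(env.observation_space)   # Box(...)
print(env.action_space)        # Discrete(2)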

The implementation uses two deterministic function approximators. These function approximators (models) must be collected in a dictionary and passed to the constructor of the class under the argument models (a minimal model definition is sketched after the table below).

Notation                        Concept             Key                   Input shape    Output shape    Type
\(Q_\phi(s, a)\)                Q-network           "q_network"           observation    action          Deterministic
\(Q_{\phi_{target}}(s, a)\)     Target Q-network    "target_q_network"    observation    action          Deterministic
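
A minimal sketch of such a dictionary, using skrl's Model and DeterministicMixin classes; the MLP architecture and layer sizes are arbitrary choices made for illustration, not a recommendation.

import gymnasium as gym
import torch
import torch.nn as nn

from skrl.models.torch import DeterministicMixin, Model

# a minimal Q-network: a plain MLP mapping observations to one Q-value per action
class QNetwork(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device, clip_actions=False):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self, clip_actions)

        self.net = nn.Sequential(nn.Linear(self.num_observations, 64),
                                 nn.ReLU(),
                                 nn.Linear(64, 64),
                                 nn.ReLU(),
                                 nn.Linear(64, self.num_actions))

    def compute(self, inputs, role):
        # return the Q-values for all actions given the observations
        return self.net(inputs["states"]), {}

env = gym.make("CartPole-v1")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# collect the models in a dictionary using the keys from the table above
models = {"q_network": QNetwork(env.observation_space, env.action_space, device),
          "target_q_network": QNetwork(env.observation_space, env.action_space, device)}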

Support for advanced features is described in the next table.

Feature           Support and remarks
Shared model      -
RNN support       -

API