Deep Q-Network (DQN)
DQN is a model-free, off-policy algorithm that learns control policies directly from high-dimensional sensory inputs, using a deep neural network as a function approximator to represent the Q-value function
Paper: Playing Atari with Deep Reinforcement Learning
Algorithm implementation
- Decision making (act(...)): epsilon-greedy action selection over the Q-network's outputs (a sketch follows this list)
- Learning algorithm (_update(...)): temporal-difference update of the Q-network, with soft (Polyak) updates of the target network every target_update_interval timesteps
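The learning step regresses the Q-network toward the TD target \(y = r + \gamma \, (1 - d) \max_{a'} Q_{\phi_{target}}(s', a')\). Below is a minimal, self-contained PyTorch sketch of the two steps. It is not skrl's actual implementation: the function names, the batch layout, and the soft_update helper are illustrative assumptions.

import torch
import torch.nn.functional as F

def act(q_network, states, epsilon, num_actions):
    # Epsilon-greedy decision making: with probability epsilon take a random
    # action, otherwise take the greedy action argmax_a Q(s, a)
    if torch.rand(1).item() < epsilon:
        return torch.randint(num_actions, (states.shape[0], 1))
    with torch.no_grad():
        return q_network(states).argmax(dim=1, keepdim=True)

def update(q_network, target_q_network, optimizer, batch, discount_factor=0.99):
    # batch layout (an assumption): states, long actions (N, 1), rewards (N, 1),
    # next_states, and dones as float 0./1. termination flags (N, 1)
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        # TD target: y = r + gamma * (1 - done) * max_a' Q_target(s', a')
        target_q = target_q_network(next_states).max(dim=1, keepdim=True).values
        y = rewards + discount_factor * (1 - dones) * target_q
    q = q_network(states).gather(1, actions)
    loss = F.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def soft_update(target_q_network, q_network, tau=0.005):
    # Polyak averaging of the target network (tau is "polyak" in the config)
    for tp, p in zip(target_q_network.parameters(), q_network.parameters()):
        tp.data.mul_(1 - tau).add_(p.data, alpha=tau)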
Configuration and hyperparameters
- skrl.agents.torch.dqn.dqn.DQN_DEFAULT_CONFIG
DQN_DEFAULT_CONFIG = {
    "gradient_steps": 1,            # gradient steps
    "batch_size": 64,               # training batch size

    "discount_factor": 0.99,        # discount factor (gamma)
    "polyak": 0.005,                # soft update hyperparameter (tau)

    "learning_rate": 1e-3,                  # learning rate
    "learning_rate_scheduler": None,        # learning rate scheduler class (see torch.optim.lr_scheduler)
    "learning_rate_scheduler_kwargs": {},   # learning rate scheduler's kwargs (e.g. {"step_size": 1e-3})

    "state_preprocessor": None,             # state preprocessor class (see skrl.resources.preprocessors)
    "state_preprocessor_kwargs": {},        # state preprocessor's kwargs (e.g. {"size": env.observation_space})

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "update_interval": 1,           # agent update interval
    "target_update_interval": 10,   # target network update interval

    "exploration": {
        "initial_epsilon": 1.0,     # initial epsilon for epsilon-greedy exploration
        "final_epsilon": 0.05,      # final epsilon for epsilon-greedy exploration
        "timesteps": 1000,          # timesteps for epsilon-greedy decay
    },

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",            # experiment's parent directory
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,    # interval for checkpoints (timesteps)
        "store_separately": False,      # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
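To override any of these values, copy the default dictionary, modify the entries of interest, and pass it to the agent's constructor. A minimal sketch, assuming models, memory, env, and device have been set up elsewhere:

import copy
from skrl.agents.torch.dqn import DQN, DQN_DEFAULT_CONFIG

# deep copy so nested dicts (e.g. "exploration") are not shared with the default
cfg = copy.deepcopy(DQN_DEFAULT_CONFIG)
cfg["exploration"]["timesteps"] = 5000   # slower epsilon-greedy decay
cfg["learning_starts"] = 100             # collect some transitions before learning

agent = DQN(models=models,               # models dictionary (see "Spaces and models" below)
            memory=memory,               # a replay memory, e.g. skrl's RandomMemory
            cfg=cfg,
            observation_space=env.observation_space,
            action_space=env.action_space,
            device=device)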
Spaces and models
The implementation supports the following Gym/Gymnasium spaces (\(\blacksquare\): supported, \(\square\): not supported)

| Gym/Gymnasium spaces | Observation | Action |
|---|---|---|
| Discrete | \(\square\) | \(\blacksquare\) |
| Box | \(\blacksquare\) | \(\square\) |
| Dict | \(\blacksquare\) | \(\square\) |
The implementation uses two deterministic function approximators. These function approximators (models) must be collected in a dictionary and passed to the constructor of the class under the argument models (a sketch of building such a dictionary follows the table below)
| Notation | Concept | Key | Input shape | Output shape | Type |
|---|---|---|---|---|---|
| \(Q_\phi(s, a)\) | Q-network | "q_network" | observation | action | Deterministic |
| \(Q_{\phi_{target}}(s, a)\) | Target Q-network | "target_q_network" | observation | action | Deterministic |
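A minimal sketch of building this dictionary with skrl's Model and DeterministicMixin classes. The MLP architecture is an arbitrary choice for illustration, and env and device are assumed to be defined elsewhere:

import torch.nn as nn
from skrl.models.torch import Model, DeterministicMixin

class QNetwork(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self)
        # arbitrary MLP: observation in, one Q-value per discrete action out
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, self.num_actions))

    def compute(self, inputs, role):
        return self.net(inputs["states"]), {}

models = {"q_network": QNetwork(env.observation_space, env.action_space, device),
          "target_q_network": QNetwork(env.observation_space, env.action_space, device)}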
Support for advanced features is described in the next table

| Feature | Support and remarks |
|---|---|
| Shared model | - |
| RNN support | - |