Q-learning#

Q-learning is a model-free, off-policy algorithm that uses a tabular Q-function to handle discrete observation and action spaces.

Paper: Learning from delayed rewards



Algorithm#


Algorithm implementation#

Main notation/symbols:
- action-value function (\(Q\))
- states (\(s\)), actions (\(a\)), rewards (\(r\)), next states (\(s'\)), dones (\(d\))

Decision making#


act(...)
\(a \leftarrow \pi_{Q[s,a]}(s) \qquad\) where \(\; a \leftarrow \begin{cases} a \in_R A & x < \epsilon \\ \underset{a}{\arg\max} \; Q[s] & x \geq \epsilon \end{cases} \qquad\) for \(\; x \leftarrow U(0,1)\)
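
The same \(\epsilon\)-greedy rule can be written as a short Python sketch (illustrative only; q_table, state and epsilon are stand-in names, not skrl attributes):

import torch

def epsilon_greedy(q_table: torch.Tensor, state: int, epsilon: float) -> int:
    # with probability epsilon, explore: sample a random action uniformly
    if torch.rand(1).item() < epsilon:
        return int(torch.randint(q_table.shape[1], (1,)).item())
    # otherwise exploit: take the greedy action for the current state
    return int(torch.argmax(q_table[state]).item())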

Learning algorithm#


_update(...)
# compute next actions
\(a' \leftarrow \underset{a}{\arg\max} \; Q[s'] \qquad\) # the only difference from SARSA
# update Q-table
\(Q[s,a] \leftarrow Q[s,a] + \alpha \, (r + \gamma \; \neg d \; Q[s',a'] - Q[s,a]) \qquad\) where \(\alpha\) is the learning_rate and \(\gamma\) is the discount_factor
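
A minimal sketch of this update in Python, using the learning_rate and discount_factor hyperparameters from the configuration (the function and argument names are illustrative):

import torch

def q_update(q_table: torch.Tensor, s: int, a: int, r: float, s_next: int, done: bool,
             learning_rate: float = 0.5, discount_factor: float = 0.99) -> None:
    # greedy next action from the Q-table (off-policy target, unlike SARSA)
    a_next = torch.argmax(q_table[s_next]).item()
    # TD target: bootstrap from Q[s', a'] only if the episode has not ended
    td_target = r + discount_factor * (0.0 if done else 1.0) * q_table[s_next, a_next]
    # move Q[s, a] towards the TD target by a step of size learning_rate
    q_table[s, a] = q_table[s, a] + learning_rate * (td_target - q_table[s, a])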

Usage#

# import the agent and its default configuration
from skrl.agents.torch.q_learning import Q_LEARNING, Q_LEARNING_DEFAULT_CONFIG

# instantiate the agent's models
models = {}
models["policy"] = ...

# adjust some configuration if necessary
cfg_agent = Q_LEARNING_DEFAULT_CONFIG.copy()
cfg_agent["<KEY>"] = ...

# instantiate the agent
# (assuming a defined environment <env>)
agent = Q_LEARNING(models=models,
                   memory=None,
                   cfg=cfg_agent,
                   observation_space=env.observation_space,
                   action_space=env.action_space,
                   device=env.device)
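
The configured agent can then be trained with one of skrl's trainers, for example the sequential trainer (a minimal sketch; the number of timesteps shown is illustrative):

# train the agent using the sequential trainer
from skrl.trainers.torch import SequentialTrainer

trainer = SequentialTrainer(env=env, agents=agent, cfg={"timesteps": 10000})
trainer.train()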

Configuration and hyperparameters#

Q_LEARNING_DEFAULT_CONFIG = {
    "discount_factor": 0.99,        # discount factor (gamma)

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "learning_rate": 0.5,           # learning rate (alpha)

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",            # experiment's parent directory
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,        # interval for checkpoints (timesteps)
        "store_separately": False,          # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
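
For example, individual entries can be adjusted before instantiating the agent (the values shown are illustrative):

cfg_agent = Q_LEARNING_DEFAULT_CONFIG.copy()
cfg_agent["learning_rate"] = 0.1        # smaller steps for noisier environments
cfg_agent["discount_factor"] = 0.95     # shorter effective horizon
cfg_agent["learning_starts"] = 100      # collect some transitions before updating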

Spaces#

The implementation supports the following Gym/Gymnasium spaces (■ = supported, □ = not supported):

Gym/Gymnasium spaces | Observation | Action
-------------------- | ----------- | ------
Discrete             | ■           | ■
MultiDiscrete        | □           | □
Box                  | □           | □
Dict                 | □           | □


Models#

The implementation uses a single table. This table (model) must be collected in a dictionary and passed to the constructor of the class under the models argument.

Notation            | Concept                      | Key      | Input shape | Output shape | Type
------------------- | ---------------------------- | -------- | ----------- | ------------ | -------
\(\pi_{Q[s,a]}(s)\) | Policy (\(\epsilon\)-greedy) | "policy" | observation | action       | Tabular
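
For this agent, the "policy" model is typically defined with skrl's tabular mixin. Below is a minimal sketch of an \(\epsilon\)-greedy tabular policy (the epsilon value, the table initialization and the attribute names are assumptions; check the skrl models documentation for the exact interface):

import torch
from skrl.models.torch import Model, TabularMixin

# epsilon-greedy policy backed by a Q-table (sketch)
class EpsilonGreedyPolicy(TabularMixin, Model):
    def __init__(self, observation_space, action_space, device, num_envs=1, epsilon=0.1):
        Model.__init__(self, observation_space, action_space, device)
        TabularMixin.__init__(self, num_envs)

        self.epsilon = epsilon
        # one Q-table per environment: shape (num_envs, num_observations, num_actions)
        self.table = torch.ones((num_envs, self.num_observations, self.num_actions),
                                dtype=torch.float32, device=self.device)

    def compute(self, inputs, role):
        # greedy actions for the current states
        actions = torch.argmax(self.table[torch.arange(self.num_envs).view(-1, 1), inputs["states"]],
                               dim=-1, keepdim=True).view(-1, 1)
        # replace some actions with random ones for exploration (epsilon-greedy)
        indexes = (torch.rand(inputs["states"].shape[0], device=self.device) < self.epsilon).nonzero().view(-1)
        if indexes.numel():
            actions[indexes] = torch.randint(self.num_actions, (indexes.numel(), 1), device=self.device)
        return actions, {}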


API (PyTorch)#

skrl.agents.torch.q_learning.Q_LEARNING_DEFAULT_CONFIG#

alias of {'discount_factor': 0.99, 'experiment': {'checkpoint_interval': 1000, 'directory': '', 'experiment_name': '', 'store_separately': False, 'wandb': False, 'wandb_kwargs': {}, 'write_interval': 250}, 'learning_rate': 0.5, 'learning_starts': 0, 'random_timesteps': 0, 'rewards_shaper': None}

class skrl.agents.torch.q_learning.Q_LEARNING(models: Mapping[str, Model], memory: Memory | Tuple[Memory] | None = None, observation_space: int | Tuple[int] | gym.Space | gymnasium.Space | None = None, action_space: int | Tuple[int] | gym.Space | gymnasium.Space | None = None, device: str | torch.device | None = None, cfg: dict | None = None)#

Bases: Agent

__init__(models: Mapping[str, Model], memory: Memory | Tuple[Memory] | None = None, observation_space: int | Tuple[int] | gym.Space | gymnasium.Space | None = None, action_space: int | Tuple[int] | gym.Space | gymnasium.Space | None = None, device: str | torch.device | None = None, cfg: dict | None = None) None#

Q-learning

https://www.academia.edu/3294050/Learning_from_delayed_rewards

Parameters:
  • models (dictionary of skrl.models.torch.Model) – Models used by the agent

  • memory (skrl.memory.torch.Memory, list of skrl.memory.torch.Memory or None) – Memory to store the transitions. If it is a tuple, the first element will be used for training and, for the rest, only the environment transitions will be added

  • observation_space (int, tuple or list of int, gym.Space, gymnasium.Space or None, optional) – Observation/state space or shape (default: None)

  • action_space (int, tuple or list of int, gym.Space, gymnasium.Space or None, optional) – Action space or shape (default: None)

  • device (str or torch.device, optional) – Device on which a tensor/array is or will be allocated (default: None). If None, the device will be either "cuda" if available or "cpu"

  • cfg (dict) – Configuration dictionary

Raises:

KeyError – If the models dictionary is missing a required key

_update(timestep: int, timesteps: int) None#

Algorithm’s main update step

Parameters:
  • timestep (int) – Current timestep

  • timesteps (int) – Number of timesteps

act(states: torch.Tensor, timestep: int, timesteps: int) torch.Tensor#

Process the environment’s states to make a decision (actions) using the main policy

Parameters:
  • states (torch.Tensor) – Environment’s states

  • timestep (int) – Current timestep

  • timesteps (int) – Number of timesteps

Returns:

Actions

Return type:

torch.Tensor

init(trainer_cfg: Mapping[str, Any] | None = None) None#

Initialize the agent

post_interaction(timestep: int, timesteps: int) None#

Callback called after the interaction with the environment

Parameters:
  • timestep (int) – Current timestep

  • timesteps (int) – Number of timesteps

pre_interaction(timestep: int, timesteps: int) None#

Callback called before the interaction with the environment

Parameters:
  • timestep (int) – Current timestep

  • timesteps (int) – Number of timesteps

record_transition(states: torch.Tensor, actions: torch.Tensor, rewards: torch.Tensor, next_states: torch.Tensor, terminated: torch.Tensor, truncated: torch.Tensor, infos: Any, timestep: int, timesteps: int) None#

Record an environment transition in memory

Parameters:
  • states (torch.Tensor) – Observations/states of the environment used to make the decision

  • actions (torch.Tensor) – Actions taken by the agent

  • rewards (torch.Tensor) – Instant rewards achieved by the current actions

  • next_states (torch.Tensor) – Next observations/states of the environment

  • terminated (torch.Tensor) – Signals to indicate that episodes have terminated

  • truncated (torch.Tensor) – Signals to indicate that episodes have been truncated

  • infos (Any type supported by the environment) – Additional information about the environment

  • timestep (int) – Current timestep

  • timesteps (int) – Number of timesteps