Q-learning¶

Q-learning is a model-free, off-policy algorithm that uses a tabular Q-function to handle discrete observation and action spaces.

Paper: Learning from delayed rewards.

Algorithm¶

Algorithm implementation¶

Main notation/symbols:

- action-value function (\(Q\))

- states (\(s\)), actions (\(a\)), rewards (\(r\)), next states (\(s'\)), terminated (\(d_{_{end}}\)), truncated (\(d_{_{timeout}}\))

Decision making¶

act(...)
\(a \leftarrow \pi_{Q[s,a]}(s) \qquad\) where \(\; a \leftarrow \begin{cases} a \in_R A & x < \epsilon \\ \underset{a}{\arg\max} \; Q[s] & x \geq \epsilon \end{cases} \qquad\) for \(\; x \leftarrow U(0,1)\)
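
The \(\epsilon\)-greedy rule above can be sketched in plain Python. This is only an illustration, not the library's implementation: the function name `epsilon_greedy`, the list-of-lists Q-table and the tie-breaking rule (first maximum) are assumptions made for the example.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one row of the Q-table, Q[s] (hypothetical helper)."""
    x = rng.random()  # x <- U(0, 1)
    if x < epsilon:
        # explore: sample an action uniformly at random
        return rng.randrange(len(q_row))
    # exploit: greedy action (assumption: first index wins on ties)
    return max(range(len(q_row)), key=q_row.__getitem__)


# toy Q-table with 2 states and 3 actions
q_table = [[0.0, 1.0, 0.5], [2.0, 0.0, 0.0]]

# with epsilon = 0 the rule is purely greedy
greedy_action = epsilon_greedy(q_table[0], epsilon=0.0)
# with epsilon = 1 every action is sampled at random
random_action = epsilon_greedy(q_table[0], epsilon=1.0)
```

Since `random.random()` returns values in [0, 1), `epsilon=0.0` never explores and `epsilon=1.0` always does.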

Learning algorithm¶

_update(...)
# compute next actions
\(a' \leftarrow \underset{a}{\arg\max} \; Q[s'] \qquad\) # the only difference from SARSA
# update Q-table
\(Q[s,a] \leftarrow Q[s,a] \;+\) learning_rate \((r \;+\) discount_factor \(\neg (d_{_{end}} \lor d_{_{timeout}}) \; Q[s',a'] - Q[s,a])\)
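
As a minimal sketch of the update above for a single transition, assuming a list-of-lists Q-table and a hypothetical helper named `q_learning_update` (neither is part of the library):

```python
def q_learning_update(q_table, s, a, r, s_next, terminated, truncated,
                      learning_rate=0.001, discount_factor=0.99):
    """Apply one tabular Q-learning update in place and return the new Q[s, a]."""
    # compute next action: a' <- argmax_a Q[s']
    next_row = q_table[s_next]
    a_next = max(range(len(next_row)), key=next_row.__getitem__)
    # bootstrap from Q[s', a'] only if the episode neither terminated nor was truncated
    not_done = 0.0 if (terminated or truncated) else 1.0
    td_target = r + discount_factor * not_done * q_table[s_next][a_next]
    # update Q-table
    q_table[s][a] += learning_rate * (td_target - q_table[s][a])
    return q_table[s][a]


q = [[0.0, 0.0], [1.0, 3.0]]

# ongoing episode: target = 1.0 + 0.9 * 3.0 = 3.7, so Q[0][1] moves halfway to it
ongoing = q_learning_update(q, s=0, a=1, r=1.0, s_next=1,
                            terminated=False, truncated=False,
                            learning_rate=0.5, discount_factor=0.9)  # ≈ 1.85

# finished episode: the bootstrap term is masked out, so target = r = 1.0
finished = q_learning_update(q, s=0, a=0, r=1.0, s_next=1,
                             terminated=True, truncated=False,
                             learning_rate=0.5, discount_factor=0.9)  # 0.5
```

The mask mirrors the \(\neg (d_{_{end}} \lor d_{_{timeout}})\) factor: when either flag is set, the value of the next state does not contribute to the target.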

Usage¶

# import the agent and its default configuration
from skrl.agents.torch.q_learning import Q_LEARNING, Q_LEARNING_CFG

# instantiate the agent's models
models = {}
models["policy"] = ...

# adjust some configuration if necessary
cfg_agent = Q_LEARNING_CFG()
cfg_agent.KEY = ...

# instantiate the agent
# (assuming a defined environment <env>)
agent = Q_LEARNING(
    models=models,
    memory=None,
    cfg=cfg_agent,
    observation_space=env.observation_space,
    state_space=env.state_space,
    action_space=env.action_space,
    device=env.device,
)

Configuration and hyperparameters¶

Dataclass
`Q_LEARNING_CFG`	`Q_LEARNING_CFG`

Spaces¶

The implementation supports the following Gymnasium spaces:

Gymnasium spaces	Observation	Action
Discrete	\(\blacksquare\)	\(\blacksquare\)
MultiDiscrete	\(\square\)	\(\square\)
Box	\(\square\)	\(\square\)
Dict	\(\square\)	\(\square\)

Models¶

The implementation uses one table. This table (model) must be placed in a dictionary and passed to the class constructor through the models argument.

Notation	Concept	Key	Input shape	Output shape	Type
\(\pi_{Q[s,a]}(s)\)	Policy (\(\epsilon\)-greedy)	`"policy"`	observation	action	Tabular

API¶

PyTorch¶

`Q_LEARNING_CFG`	Configuration for the Q_LEARNING agent.
`Q_LEARNING`	Q-learning.

class skrl.agents.torch.q_learning.Q_LEARNING_CFG(*, experiment: ExperimentCfg = <factory>, discount_factor: float = 0.99, learning_rate: float = 0.001, random_timesteps: int = 0, learning_starts: int = 0, rewards_shaper: Callable | None = None)[source]¶

Bases: AgentCfg

Configuration for the Q_LEARNING agent.

Methods:

`expand`()	Expand the configuration.
`validate`()	Validate the configuration.

Attributes:

`discount_factor`	Parameter that balances the importance of future rewards (close to 1.0) versus immediate rewards (close to 0.0).
`experiment`	Experiment settings.
`learning_rate`	Learning rate.
`learning_starts`	Number of steps to perform before calling the algorithm update function.
`random_timesteps`	Number of random exploration (sampling random actions) steps to perform before sampling actions from the policy.
`rewards_shaper`	Rewards shaping function.

expand() → None[source]¶: Expand the configuration.

validate() → bool[source]¶: Validate the configuration.

discount_factor: float = 0.99¶

Parameter that balances the importance of future rewards (close to 1.0) versus immediate rewards (close to 0.0).

Range: [0.0, 1.0].
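
To illustrate the effect of the two extremes of this range, the following sketch computes the discounted return of a short reward sequence. The helper `discounted_return` is hypothetical and exists only for this example.

```python
def discounted_return(rewards, discount_factor):
    """Return r_0 + g * r_1 + g^2 * r_2 + ... for g = discount_factor."""
    g = 0.0
    # accumulate from the last reward backwards
    for r in reversed(rewards):
        g = r + discount_factor * g
    return g


rewards = [1.0, 1.0, 1.0]

only_immediate = discounted_return(rewards, 0.0)  # 1.0: future rewards are ignored
all_equal = discounted_return(rewards, 1.0)       # 3.0: every reward counts fully
in_between = discounted_return(rewards, 0.5)      # 1.75 = 1 + 0.5 + 0.25
```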

experiment: ExperimentCfg¶: Experiment settings.

learning_rate: float = 0.001¶: Learning rate.

learning_starts: int = 0¶: Number of steps to perform before calling the algorithm update function.

random_timesteps: int = 0¶: Number of random exploration (sampling random actions) steps to perform before sampling actions from the policy.

rewards_shaper: Callable | None = None¶: Rewards shaping function.
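
The sketch below shows one plausible reading of the three settings above. The shaper signature `(rewards, timestep, timesteps)` and the exact comparisons (`<` and `>=`) are assumptions for illustration and should be checked against the library source; all function names are hypothetical.

```python
def scale_rewards(rewards, timestep, timesteps):
    """Hypothetical rewards shaper: scale every reward by a constant factor."""
    return [0.5 * r for r in rewards]


def use_random_action(timestep, random_timesteps):
    """Assumption: random actions are sampled during the first `random_timesteps` steps."""
    return timestep < random_timesteps


def should_update(timestep, learning_starts):
    """Assumption: the update function is called once `learning_starts` steps have passed."""
    return timestep >= learning_starts


shaped = scale_rewards([10.0, -20.0], timestep=0, timesteps=100)  # [5.0, -10.0]
```

With the defaults (`random_timesteps = 0`, `learning_starts = 0`, `rewards_shaper = None`), none of these gates has any effect under this reading: actions come from the policy and updates start from the first step.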

class skrl.agents.torch.q_learning.Q_LEARNING(*, models: dict[str, Model], memory: Memory | None = None, observation_space: gymnasium.Space | None = None, state_space: gymnasium.Space | None = None, action_space: gymnasium.Space | None = None, device: str | torch.device | None = None, cfg: Q_LEARNING_CFG | dict = {})[source]¶

Bases: Agent

Q-learning.

https://www.academia.edu/3294050/Learning_from_delayed_rewards

Parameters:

models – Agent’s models.
memory – Memory to store the agent’s data and environment transitions.
observation_space – Observation space.
state_space – State space.
action_space – Action space.
device – Data allocation and computation device. If not specified, the default device will be used.
cfg – Agent’s configuration.

Raises:

KeyError – If a configuration key is missing.

Methods:

`act`(observations, states, *, timestep, timesteps)	Process the environment's observations/states to make a decision (actions) using the main policy.
`enable_models_training_mode`([enabled])	Set the training mode of all the agent's models: enabled (training) or disabled (evaluation).
`enable_training_mode`([enabled, apply_to_models])	Set the training mode of the agent: enabled (training) or disabled (evaluation).
`init`(*[, trainer_cfg])	Initialize the agent.
`load`(path)	Load the agent from the specified path.
`post_interaction`(*, timestep, timesteps)	Method called after the interaction with the environment.
`pre_interaction`(*, timestep, timesteps)	Method called before the interaction with the environment.
`record_transition`(*, observations, states, ...)	Record an environment transition in memory.
`save`(path)	Save the agent to the specified path.
`track_data`(tag, value)	Track data to TensorBoard.
`update`(*, timestep, timesteps)	Algorithm's main update step.
`write_checkpoint`(*, timestep, timesteps)	Write checkpoint (modules) to persistent storage.
`write_tracking_data`(*, timestep, timesteps)	Write tracking data to TensorBoard.
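
As a rough sketch of how the interaction methods listed above might be called during training, the following loop uses a stand-in agent that only records call names. The call order, the toy environment step and the `LoggingAgent` class are assumptions for illustration, not the library's trainer.

```python
class LoggingAgent:
    """Hypothetical stand-in that only records which methods are called."""

    def __init__(self):
        self.calls = []

    def pre_interaction(self, *, timestep, timesteps):
        self.calls.append("pre_interaction")

    def act(self, observations, states, *, timestep, timesteps):
        self.calls.append("act")
        return 0, {}  # (actions, extra outputs)

    def record_transition(self, **transition):
        self.calls.append("record_transition")

    def post_interaction(self, *, timestep, timesteps):
        self.calls.append("post_interaction")


def run(agent, timesteps):
    observation = 0
    for timestep in range(timesteps):
        agent.pre_interaction(timestep=timestep, timesteps=timesteps)
        action, _ = agent.act(observation, None, timestep=timestep, timesteps=timesteps)
        # toy environment step (assumption): the observation toggles between 0 and 1
        next_observation, reward = 1 - observation, 1.0
        agent.record_transition(
            observations=observation, states=None, actions=action, rewards=reward,
            next_observations=next_observation, next_states=None,
            terminated=False, truncated=False, infos={},
            timestep=timestep, timesteps=timesteps,
        )
        agent.post_interaction(timestep=timestep, timesteps=timesteps)
        observation = next_observation


agent = LoggingAgent()
run(agent, timesteps=2)
```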

act(observations: torch.Tensor, states: torch.Tensor | None, *, timestep: int, timesteps: int) → tuple[torch.Tensor, dict[str, Any]][source]¶

Process the environment’s observations/states to make a decision (actions) using the main policy.

Parameters:

observations – Environment observations.
states – Environment states.
timestep – Current timestep.
timesteps – Number of timesteps.

Returns:

Agent output. The first component is the expected action/value returned by the agent. The second component is a dictionary containing extra output values according to the model.

enable_models_training_mode(enabled: bool = True) → None[source]¶

Set the training mode of all the agent’s models: enabled (training) or disabled (evaluation).

Parameters: enabled – True to enable the training mode, False to enable the evaluation mode.

enable_training_mode(enabled: bool = True, *, apply_to_models: bool = False) → None[source]¶

Set the training mode of the agent: enabled (training) or disabled (evaluation).

The training mode can be queried by the training property.

Parameters:

enabled – True to enable the training mode, False to enable the evaluation mode.
apply_to_models – Whether to apply the training mode to all the agent’s models.

init(*, trainer_cfg: dict[str, Any] | None = None) → None[source]¶

Initialize the agent.

Parameters: trainer_cfg – Trainer configuration.

load(path: str) → None[source]¶

Load the agent from the specified path.

Note

The final storage device is determined by the constructor of the agent.

Parameters: path – Path to load the agent from.

post_interaction(*, timestep: int, timesteps: int) → None[source]¶

Method called after the interaction with the environment.

Parameters:

timestep – Current timestep.
timesteps – Number of timesteps.

pre_interaction(*, timestep: int, timesteps: int) → None[source]¶

Method called before the interaction with the environment.

Parameters:

timestep – Current timestep.
timesteps – Number of timesteps.

record_transition(*, observations: torch.Tensor, states: torch.Tensor, actions: torch.Tensor, rewards: torch.Tensor, next_observations: torch.Tensor, next_states: torch.Tensor, terminated: torch.Tensor, truncated: torch.Tensor, infos: Any, timestep: int, timesteps: int) → None[source]¶

Record an environment transition in memory.

Parameters:

observations – Environment observations.
states – Environment states.
actions – Actions taken by the agent.
rewards – Instant rewards achieved by the current actions.
next_observations – Next environment observations.
next_states – Next environment states.
terminated – Signals that indicate episodes have terminated.
truncated – Signals that indicate episodes have been truncated.
infos – Additional information about the environment.
timestep – Current timestep.
timesteps – Number of timesteps.

save(path: str) → None[source]¶

Save the agent to the specified path.

Parameters: path – Path to save the agent to.

track_data(tag: str, value: float) → None[source]¶

Track data to TensorBoard.

Note

Currently only scalar data is supported.

Parameters:

tag – Data identifier (e.g. ‘Loss/Policy loss’).
value – Value to track.

update(*, timestep: int, timesteps: int) → None[source]¶

Algorithm’s main update step.

Parameters:

timestep – Current timestep.
timesteps – Number of timesteps.

write_checkpoint(*, timestep: int, timesteps: int) → None[source]¶

Write checkpoint (modules) to persistent storage.

Note

The checkpoints are stored in the subdirectory checkpoints within the experiment directory. The checkpoint name is the timestep argument value (if it is not None), or the current system date-time otherwise.

Parameters:

timestep – Current timestep.
timesteps – Number of timesteps.
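
The naming rule described in the note can be sketched as follows. The helper `checkpoint_tag` and the date-time format string are assumptions for illustration; the format the library actually uses may differ.

```python
import datetime


def checkpoint_tag(timestep=None):
    """Hypothetical helper: name a checkpoint after the timestep, else the current date-time."""
    if timestep is not None:
        return str(timestep)
    # assumed format; the library's own date-time format is not stated in this page
    return datetime.datetime.now().strftime("%y-%m-%d_%H-%M-%S-%f")


by_timestep = checkpoint_tag(1000)  # "1000"
by_datetime = checkpoint_tag()      # depends on the current system date-time
```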

write_tracking_data(*, timestep: int, timesteps: int) → None[source]¶

Write tracking data to TensorBoard.

Parameters:

timestep – Current timestep.
timesteps – Number of timesteps.