State Action Reward State Action (SARSA)

SARSA is a model-free, on-policy algorithm that uses a tabular Q-function to handle discrete observation and action spaces.

Paper: On-Line Q-Learning Using Connectionist Systems (Rummery & Niranjan, 1994)

Algorithm implementation

Main notation/symbols:
- action-value function (\(Q\))
- states (\(s\)), actions (\(a\)), rewards (\(r\)), next states (\(s'\)), dones (\(d\))

Decision making (act(...))

\(a \leftarrow \pi_{Q[s,a]}(s) = \begin{cases} a \in_R A & x < \epsilon \\ \underset{a}{\arg\max} \; Q[s] & x \geq \epsilon \end{cases} \qquad\) for \(\; x \sim U(0,1)\)

where \(a \in_R A\) denotes an action drawn uniformly at random from the action space \(A\).
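As a plain NumPy sketch of this rule (a standalone function over an explicit q_table array, illustrative rather than skrl's internal act implementation):

import numpy as np

def act(q_table: np.ndarray, state: int, epsilon: float, rng: np.random.Generator) -> int:
    """Epsilon-greedy action selection over a tabular Q-function."""
    if rng.random() < epsilon:                      # x < epsilon: explore
        return int(rng.integers(q_table.shape[1]))  # random action from A
    return int(np.argmax(q_table[state]))           # x >= epsilon: exploit (argmax_a Q[s])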

Learning algorithm (_update(...))

# compute next actions
\(a' \leftarrow \pi_{Q[s,a]}(s') \qquad\) # the only difference from Q-learning
# update Q-table
\(Q[s,a] \leftarrow Q[s,a] \;+\) learning_rate \((r \;+\) discount_factor \(\; \neg d \; Q[s',a'] - Q[s,a])\)

where \(\neg d\) (not done) zeroes the bootstrapped term \(Q[s',a']\) on terminal transitions.
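The same update as a standalone sketch (a hypothetical helper, not the agent's actual _update method):

def sarsa_update(q_table, s, a, r, s_next, a_next, done,
                 learning_rate=0.5, discount_factor=0.99):
    """One tabular SARSA update: the TD target bootstraps from the action
    actually selected in s' (on-policy), unlike Q-learning's max over a'."""
    not_done = 0.0 if done else 1.0  # the "not d" mask for terminal transitions
    td_target = r + discount_factor * not_done * q_table[s_next, a_next]
    q_table[s, a] += learning_rate * (td_target - q_table[s, a])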

Configuration and hyperparameters

skrl.agents.torch.sarsa.sarsa.SARSA_DEFAULT_CONFIG
SARSA_DEFAULT_CONFIG = {
    "discount_factor": 0.99,        # discount factor (gamma)

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "learning_rate": 0.5,           # learning rate (alpha)

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",            # experiment's parent directory
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,        # interval for checkpoints (timesteps)
        "store_separately": False,          # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
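A common pattern is to deep-copy the default configuration, override the entries of interest, and pass the result to the agent constructor. A minimal sketch, assuming env, device and the models dictionary (see "Spaces and models" below) are defined elsewhere:

import copy

from skrl.agents.torch.sarsa import SARSA, SARSA_DEFAULT_CONFIG

cfg = copy.deepcopy(SARSA_DEFAULT_CONFIG)  # deep copy so nested dicts are not shared
cfg["learning_rate"] = 0.1
cfg["experiment"]["write_interval"] = 500

agent = SARSA(models=models,   # models dictionary, see "Spaces and models" below
              memory=None,     # SARSA does not require a replay memory
              cfg=cfg,
              observation_space=env.observation_space,
              action_space=env.action_space,
              device=device)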

Spaces and models

The implementation supports the following Gym/Gymnasium spaces:

Gym/Gymnasium spaces    Observation    Action
Discrete                ■              ■
Box                     □              □
Dict                    □              □

(■ = supported, □ = not supported)
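For example, Gymnasium's FrozenLake-v1 exposes Discrete observation and action spaces and is therefore compatible with this agent:

import gymnasium as gym

env = gym.make("FrozenLake-v1")
print(env.observation_space)  # Discrete(16)
print(env.action_space)       # Discrete(4)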

The implementation uses a single table. This table (model) must be collected in a dictionary and passed to the constructor of the class under the argument models, as in the sketch after the table below.

Notation                 Concept                         Key         Input shape    Output shape    Type
\(\pi_{Q[s,a]}(s)\)      Policy (\(\epsilon\)-greedy)    "policy"    observation    action          Tabular
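A minimal \(\epsilon\)-greedy tabular policy, adapted from the pattern used in skrl's tabular model examples (the class name, epsilon attribute and Q-table initialization here are illustrative, not prescribed by the agent):

import torch

from skrl.models.torch import Model, TabularMixin

class EpsilonGreedyPolicy(TabularMixin, Model):
    def __init__(self, observation_space, action_space, device, num_envs=1, epsilon=0.1):
        Model.__init__(self, observation_space, action_space, device)
        TabularMixin.__init__(self, num_envs)
        self.epsilon = epsilon
        # Q-table: one entry per (environment, state, action) triple
        self.q_table = torch.ones((num_envs, self.num_observations, self.num_actions),
                                  dtype=torch.float32, device=self.device)

    def compute(self, inputs, role):
        states = inputs["states"].long()
        # greedy actions from the Q-table
        actions = torch.argmax(self.q_table[torch.arange(self.num_envs, device=self.device).view(-1, 1), states],
                               dim=-1)
        # with probability epsilon, replace them with random actions
        indexes = (torch.rand(states.shape[0], device=self.device) < self.epsilon).nonzero().view(-1)
        if indexes.numel():
            actions[indexes] = torch.randint(self.num_actions, (indexes.numel(), 1), device=self.device)
        return actions, {}

models = {"policy": EpsilonGreedyPolicy(env.observation_space, env.action_space, device)}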

API