State Action Reward State Action (SARSA)
SARSA is a model-free, on-policy algorithm that uses a tabular Q-function to handle discrete observation and action spaces.
Paper: On-Line Q-Learning Using Connectionist Systems
Algorithm implementation
Main notation/symbols:
- action-value function (\(Q\))
- states (\(s\)), actions (\(a\)), rewards (\(r\)), next states (\(s'\)), dones (\(d\))
Decision making (act(...))
\(a \leftarrow \pi_{Q[s,a]}(s) \qquad\) where \(\; a \leftarrow \begin{cases} a \in_R A & x < \epsilon \\ \underset{a}{\arg\max} \; Q[s] & x \geq \epsilon \end{cases} \qquad\) for \(\; x \leftarrow U(0,1)\)
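As an illustration of this decision rule, here is a minimal standalone sketch (not skrl's internal implementation; the function name act and the NumPy-based Q-table are assumptions made for the example):

```python
import numpy as np

def act(Q: np.ndarray, state: int, epsilon: float, rng: np.random.Generator) -> int:
    """Epsilon-greedy action selection over a tabular Q-function of shape [n_states, n_actions]."""
    x = rng.random()                          # x ~ U(0, 1)
    if x < epsilon:
        return int(rng.integers(Q.shape[1]))  # random action from A (exploration)
    return int(np.argmax(Q[state]))           # argmax_a Q[s] (exploitation)
```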
Learning algorithm (_update(...))
# compute next actions
\(a' \leftarrow \pi_{Q[s,a]}(s') \qquad\) # the only difference from Q-learning
# update Q-table
\(Q[s,a] \leftarrow Q[s,a] \;+\) learning_rate \((r \;+\) discount_factor \(\neg d \; Q[s',a'] - Q[s,a])\)
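A minimal standalone sketch of this update follows; the \(\neg d\) factor is written as (1 - done), which masks the bootstrap term on terminal transitions. The function name and the NumPy table are assumptions for illustration, not skrl's code:

```python
import numpy as np

def sarsa_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int, a_next: int,
                 done: bool, learning_rate: float = 0.5, discount_factor: float = 0.99) -> None:
    """In-place tabular SARSA update; a' is the action actually selected by the current policy."""
    target = r + discount_factor * (1.0 - float(done)) * Q[s_next, a_next]
    Q[s, a] += learning_rate * (target - Q[s, a])
```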
Configuration and hyperparameters
- skrl.agents.torch.sarsa.sarsa.SARSA_DEFAULT_CONFIG
SARSA_DEFAULT_CONFIG = {
    "discount_factor": 0.99,        # discount factor (gamma)

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "learning_rate": 0.5,           # learning rate (alpha)

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",                # experiment's parent directory
        "experiment_name": "",          # experiment name
        "write_interval": 250,          # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,    # interval for checkpoints (timesteps)
        "store_separately": False,      # whether to store checkpoints separately

        "wandb": False,                 # whether to use Weights & Biases
        "wandb_kwargs": {}              # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
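To customize these hyperparameters, a common pattern is to copy the default dictionary and override the desired keys before passing it to the agent; the specific override values below are illustrative only:

```python
import copy

from skrl.agents.torch.sarsa.sarsa import SARSA_DEFAULT_CONFIG

# deep-copy so nested dictionaries such as "experiment" are not shared with the defaults
cfg = copy.deepcopy(SARSA_DEFAULT_CONFIG)
cfg["learning_rate"] = 0.1                  # example override
cfg["discount_factor"] = 0.95               # example override
cfg["experiment"]["write_interval"] = 1000  # example override
```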
Spaces and models
The implementation supports the following Gym / Gymnasium spaces:
| Gym/Gymnasium spaces | Observation | Action |
|---|---|---|
| Discrete | \(\blacksquare\) | \(\blacksquare\) |
| Box | \(\square\) | \(\square\) |
| Dict | \(\square\) | \(\square\) |
The implementation uses a single table. This table (model) must be collected in a dictionary and passed to the constructor of the class under the argument models, as shown in the sketch after the following table.
| Notation | Concept | Key | Input shape | Output shape | Type |
|---|---|---|---|---|---|
| \(\pi_{Q[s,a]}(s)\) | Policy (\(\epsilon\)-greedy) | "policy" | observation | action | Tabular |
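A sketch of how the table can be defined and collected is shown below, following skrl's Tabular model pattern. The class name, the epsilon handling inside compute() and the exact mixin/method signatures are assumptions and may differ between skrl versions:

```python
import torch
from skrl.models.torch import Model, TabularMixin


class EpsilonGreedyPolicy(TabularMixin, Model):
    def __init__(self, observation_space, action_space, device, num_envs=1, epsilon=0.1):
        Model.__init__(self, observation_space, action_space, device)
        TabularMixin.__init__(self, num_envs)

        self.epsilon = epsilon
        # one Q-table per environment: [num_envs, num_observations, num_actions]
        self.q_table = torch.ones((num_envs, self.num_observations, self.num_actions),
                                  dtype=torch.float32, device=self.device)

    def compute(self, inputs, role):
        states = inputs["states"]
        # greedy actions from the Q-table
        actions = torch.argmax(self.q_table[torch.arange(self.num_envs).view(-1, 1), states],
                               dim=-1, keepdim=True).view(-1, 1)
        # epsilon-greedy exploration: replace some actions with random ones
        indexes = (torch.rand(states.shape[0], device=self.device) < self.epsilon).nonzero().view(-1)
        if indexes.numel():
            actions[indexes] = torch.randint(self.num_actions, (indexes.numel(), 1), device=self.device)
        return actions, {}


# the single table is passed to the agent under the "policy" key, e.g.:
# models = {"policy": EpsilonGreedyPolicy(env.observation_space, env.action_space, device)}
```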