State Action Reward State Action (SARSA)
SARSA is a model-free, on-policy algorithm that uses a tabular Q-function to handle discrete observation and action spaces.
Paper: On-Line Q-Learning Using Connectionist Systems
Algorithm implementation
Main notation/symbols:
- action-value function (\(Q\))
- states (\(s\)), actions (\(a\)), rewards (\(r\)), next states (\(s'\)), dones (\(d\))
Decision making (act(...))
\(a \leftarrow \pi_{Q[s,a]}(s) \qquad\) where \(\; a \leftarrow \begin{cases} a \in_R A & x < \epsilon \\ \underset{a}{\arg\max} \; Q[s] & x \geq \epsilon \end{cases} \qquad\) for \(\; x \leftarrow U(0,1)\)
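As an illustration of this decision rule, here is a minimal standalone sketch (not skrl's internal implementation; the function name act and the NumPy-based Q-table are assumptions made for the example):

```python
import numpy as np

def act(Q: np.ndarray, state: int, epsilon: float, rng: np.random.Generator) -> int:
    """Epsilon-greedy action selection over a tabular Q-function of shape [n_states, n_actions]."""
    x = rng.random()                          # x ~ U(0, 1)
    if x < epsilon:
        return int(rng.integers(Q.shape[1]))  # random action from A (exploration)
    return int(np.argmax(Q[state]))           # argmax_a Q[s] (exploitation)
```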
Learning algorithm (_update(...))
# compute next actions
\(a' \leftarrow \pi_{Q[s,a]}(s') \qquad\) # the only difference from Q-learning
# update Q-table
\(Q[s,a] \leftarrow Q[s,a] \;+\) learning_rate \((r \;+\) discount_factor \(\neg d \; Q[s',a'] - Q[s,a])\)
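A minimal standalone sketch of this update follows; the \(\neg d\) factor is written as (1 - done), which masks the bootstrap term on terminal transitions. The function name and the NumPy table are assumptions for illustration, not skrl's code:

```python
import numpy as np

def sarsa_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int, a_next: int,
                 done: bool, learning_rate: float = 0.5, discount_factor: float = 0.99) -> None:
    """In-place tabular SARSA update; a' is the action actually selected by the current policy."""
    target = r + discount_factor * (1.0 - float(done)) * Q[s_next, a_next]
    Q[s, a] += learning_rate * (target - Q[s, a])
```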
Configuration and hyperparameters
- skrl.agents.torch.sarsa.sarsa.SARSA_DEFAULT_CONFIG
SARSA_DEFAULT_CONFIG = {
    "discount_factor": 0.99,        # discount factor (gamma)

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "learning_rate": 0.5,           # learning rate (alpha)

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",                # experiment's parent directory
        "experiment_name": "",          # experiment name
        "write_interval": 250,          # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,    # interval for checkpoints (timesteps)
        "store_separately": False,      # whether to store checkpoints separately

        "wandb": False,                 # whether to use Weights & Biases
        "wandb_kwargs": {}              # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
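To customize these hyperparameters, a common pattern is to copy the default dictionary and override the desired keys before passing it to the agent; the specific override values below are illustrative only:

```python
import copy

from skrl.agents.torch.sarsa.sarsa import SARSA_DEFAULT_CONFIG

# deep-copy so nested dictionaries such as "experiment" are not shared with the defaults
cfg = copy.deepcopy(SARSA_DEFAULT_CONFIG)
cfg["learning_rate"] = 0.1                  # example override
cfg["discount_factor"] = 0.95               # example override
cfg["experiment"]["write_interval"] = 1000  # example override
```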
Spaces and models
The implementation supports the following Gym / Gymnasium spaces:
| Gym/Gymnasium spaces | Observation | Action |
|---|---|---|
| Discrete | \(\blacksquare\) | \(\blacksquare\) |
| Box | \(\square\) | \(\square\) |
| Dict | \(\square\) | \(\square\) |
The implementation uses a single table. This table (model) must be collected in a dictionary and passed to the constructor of the class under the argument models, as shown in the sketch after the following table.
| Notation | Concept | Key | Input shape | Output shape | Type |
|---|---|---|---|---|---|
| \(\pi_{Q[s,a]}(s)\) | Policy (\(\epsilon\)-greedy) | "policy" | observation | action | Tabular |
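A sketch of how the table can be defined and collected is shown below, following skrl's Tabular model pattern. The class name, the epsilon handling inside compute() and the exact mixin/method signatures are assumptions and may differ between skrl versions:

```python
import torch
from skrl.models.torch import Model, TabularMixin


class EpsilonGreedyPolicy(TabularMixin, Model):
    def __init__(self, observation_space, action_space, device, num_envs=1, epsilon=0.1):
        Model.__init__(self, observation_space, action_space, device)
        TabularMixin.__init__(self, num_envs)

        self.epsilon = epsilon
        # one Q-table per environment: [num_envs, num_observations, num_actions]
        self.q_table = torch.ones((num_envs, self.num_observations, self.num_actions),
                                  dtype=torch.float32, device=self.device)

    def compute(self, inputs, role):
        states = inputs["states"]
        # greedy actions from the Q-table
        actions = torch.argmax(self.q_table[torch.arange(self.num_envs).view(-1, 1), states],
                               dim=-1, keepdim=True).view(-1, 1)
        # epsilon-greedy exploration: replace some actions with random ones
        indexes = (torch.rand(states.shape[0], device=self.device) < self.epsilon).nonzero().view(-1)
        if indexes.numel():
            actions[indexes] = torch.randint(self.num_actions, (indexes.numel(), 1), device=self.device)
        return actions, {}


# the single table is passed to the agent under the "policy" key, e.g.:
# models = {"policy": EpsilonGreedyPolicy(env.observation_space, env.action_space, device)}
```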