State Action Reward State Action (SARSA)#
SARSA is a model-free, on-policy algorithm that uses a tabular Q-function to handle discrete observation and action spaces.
Paper: On-Line Q-Learning Using Connectionist Systems
Algorithm#
Algorithm implementation#
Decision making#
act(...)
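The policy acts \(\epsilon\)-greedily over the tabular Q-function: with probability \(\epsilon\) a random action is sampled, otherwise the action with the largest Q-value for the current state is taken. A minimal sketch of this selection rule (illustrative only, not skrl's internal code; the table layout and the fixed \(\epsilon\) are assumptions):
import torch

# illustrative epsilon-greedy selection over a tabular Q-function
# q_table: tensor of shape (num_states, num_actions); epsilon: exploration rate (assumed constant)
def epsilon_greedy(q_table: torch.Tensor, state: int, epsilon: float = 0.1) -> int:
    if torch.rand(1).item() < epsilon:
        return int(torch.randint(q_table.shape[1], (1,)).item())  # explore: random action
    return int(torch.argmax(q_table[state]).item())               # exploit: greedy action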
Learning algorithm#
_update(...)
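The backup performed in _update(...) is the standard tabular SARSA update, which uses the action actually taken in the next state (on-policy):

\(Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]\)

where \(\alpha\) is the learning_rate and \(\gamma\) the discount_factor listed in the configuration below.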
Usage#
# import the agent and its default configuration
from skrl.agents.torch.sarsa import SARSA, SARSA_DEFAULT_CONFIG
# instantiate the agent's models
models = {}
models["policy"] = ...
# adjust some configuration if necessary
cfg_agent = SARSA_DEFAULT_CONFIG.copy()
cfg_agent["<KEY>"] = ...
# instantiate the agent
# (assuming a defined environment <env>)
agent = SARSA(models=models,
              memory=None,
              cfg=cfg_agent,
              observation_space=env.observation_space,
              action_space=env.action_space,
              device=env.device)
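The agent can then be run with one of skrl's trainers. A minimal sketch, assuming the environment <env> has already been wrapped for skrl (the timestep count is an arbitrary example):
# import and configure a trainer, then start training
from skrl.trainers.torch import SequentialTrainer
cfg_trainer = {"timesteps": 10000, "headless": True}
trainer = SequentialTrainer(cfg=cfg_trainer, env=env, agents=agent)
trainer.train()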
Configuration and hyperparameters#
SARSA_DEFAULT_CONFIG = {
    "discount_factor": 0.99,            # discount factor (gamma)
    "random_timesteps": 0,              # random exploration steps
    "learning_starts": 0,               # learning starts after this many steps
    "learning_rate": 0.5,               # learning rate (alpha)
    "rewards_shaper": None,             # rewards shaping function: Callable(reward, timestep, timesteps) -> reward
    "experiment": {
        "directory": "",                # experiment's parent directory
        "experiment_name": "",          # experiment name
        "write_interval": 250,          # TensorBoard writing interval (timesteps)
        "checkpoint_interval": 1000,    # interval for checkpoints (timesteps)
        "store_separately": False,      # whether to store checkpoints separately
        "wandb": False,                 # whether to use Weights & Biases
        "wandb_kwargs": {}              # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
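As an illustration, rewards_shaper accepts any callable with the signature noted above; a hypothetical shaper that scales all rewards by 0.1 could be set on the cfg_agent dictionary from the usage snippet:
# hypothetical rewards shaping function: Callable(reward, timestep, timesteps) -> reward
cfg_agent["rewards_shaper"] = lambda reward, timestep, timesteps: reward * 0.1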
Spaces#
The implementation supports the following Gym spaces / Gymnasium spaces (\(\blacksquare\): supported, \(\square\): not supported):

Gym/Gymnasium spaces | Observation | Action
---|---|---
Discrete | \(\blacksquare\) | \(\blacksquare\)
Box | \(\square\) | \(\square\)
Dict | \(\square\) | \(\square\)
Models#
The implementation uses a single table. This table (model) must be collected in a dictionary and passed to the constructor of the class under the argument models.

Notation | Concept | Key | Input shape | Output shape | Type
---|---|---|---|---|---
\(\pi_{Q[s,a]}(s)\) | Policy (\(\epsilon\)-greedy) | "policy" | observation | action | Tabular
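skrl's TabularMixin can be combined with the base Model class to define such a table. The following \(\epsilon\)-greedy policy is a sketch in the spirit of skrl's tabular examples (the mixin arguments, the q_table layout and the fixed \(\epsilon\) are assumptions and may differ between skrl versions):
import torch
from skrl.models.torch import Model, TabularMixin

class EpsilonGreedyPolicy(TabularMixin, Model):
    def __init__(self, observation_space, action_space, device, num_envs=1, epsilon=0.1):
        Model.__init__(self, observation_space, action_space, device)
        TabularMixin.__init__(self, num_envs)

        self.epsilon = epsilon
        # one Q-table per environment: (num_envs, num_observations, num_actions)
        self.q_table = torch.ones((num_envs, self.num_observations, self.num_actions),
                                  dtype=torch.float32, device=self.device)

    def compute(self, inputs, role):
        states = inputs["states"]
        # greedy actions from the Q-table
        actions = torch.argmax(self.q_table[torch.arange(self.num_envs).view(-1, 1), states],
                               dim=-1, keepdim=True).view(-1, 1)
        # with probability epsilon, replace the greedy action with a random one
        indexes = (torch.rand(states.shape[0], device=self.device) < self.epsilon).nonzero().view(-1)
        if indexes.numel():
            actions[indexes] = torch.randint(self.num_actions, (indexes.numel(), 1), device=self.device)
        return actions, {}
An instance of such a class is what would be assigned to models["policy"] in the usage snippet above.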
API (PyTorch)#
- skrl.agents.torch.sarsa.SARSA_DEFAULT_CONFIG#
alias of {'discount_factor': 0.99, 'experiment': {'checkpoint_interval': 1000, 'directory': '', 'experiment_name': '', 'store_separately': False, 'wandb': False, 'wandb_kwargs': {}, 'write_interval': 250}, 'learning_rate': 0.5, 'learning_starts': 0, 'random_timesteps': 0, 'rewards_shaper': None}
- class skrl.agents.torch.sarsa.SARSA(models: Dict[str, Model], memory: Memory | Tuple[Memory] | None = None, observation_space: int | Tuple[int] | gym.Space | gymnasium.Space | None = None, action_space: int | Tuple[int] | gym.Space | gymnasium.Space | None = None, device: str | torch.device | None = None, cfg: dict | None = None)#
Bases:
Agent
- __init__(models: Dict[str, Model], memory: Memory | Tuple[Memory] | None = None, observation_space: int | Tuple[int] | gym.Space | gymnasium.Space | None = None, action_space: int | Tuple[int] | gym.Space | gymnasium.Space | None = None, device: str | torch.device | None = None, cfg: dict | None = None) → None #
State Action Reward State Action (SARSA)
https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.2539
- Parameters:
models (dictionary of skrl.models.torch.Model) – Models used by the agent
memory (skrl.memory.torch.Memory, list of skrl.memory.torch.Memory or None) – Memory to store the transitions. If it is a tuple, the first element will be used for training and, for the rest, only the environment transitions will be added
observation_space (int, tuple or list of int, gym.Space, gymnasium.Space or None, optional) – Observation/state space or shape (default: None)
action_space (int, tuple or list of int, gym.Space, gymnasium.Space or None, optional) – Action space or shape (default: None)
device (str or torch.device, optional) – Device on which a tensor/array is or will be allocated (default: None). If None, the device will be either "cuda" if available or "cpu"
cfg (dict) – Configuration dictionary
- Raises:
KeyError – If the models dictionary is missing a required key
- act(states: torch.Tensor, timestep: int, timesteps: int) → torch.Tensor #
Process the environment’s states to make a decision (actions) using the main policy
- Parameters:
states (torch.Tensor) – Environment’s states
timestep (int) – Current timestep
timesteps (int) – Number of timesteps
- Returns:
Actions
- Return type:
torch.Tensor
- post_interaction(timestep: int, timesteps: int) → None #
Callback called after the interaction with the environment
- pre_interaction(timestep: int, timesteps: int) → None #
Callback called before the interaction with the environment
- record_transition(states: torch.Tensor, actions: torch.Tensor, rewards: torch.Tensor, next_states: torch.Tensor, terminated: torch.Tensor, truncated: torch.Tensor, infos: Any, timestep: int, timesteps: int) → None #
Record an environment transition in memory
- Parameters:
states (torch.Tensor) – Observations/states of the environment used to make the decision
actions (torch.Tensor) – Actions taken by the agent
rewards (torch.Tensor) – Instant rewards achieved by the current actions
next_states (torch.Tensor) – Next observations/states of the environment
terminated (torch.Tensor) – Signals to indicate that episodes have terminated
truncated (torch.Tensor) – Signals to indicate that episodes have been truncated
infos (Any type supported by the environment) – Additional information about the environment
timestep (int) – Current timestep
timesteps (int) – Number of timesteps