Multi-agents

Multi-agents are autonomous entities that interact with the environment to learn and improve their behavior. Their goal is to learn optimal policies: mappings from states to actions that maximize the cumulative reward received from the environment over time.



Implemented multi-agents

The following table lists the implemented multi-agents and their support for different frameworks.

Multi-agents                                     | pytorch          | jax              | warp
Independent Proximal Policy Optimization (IPPO)  | \(\blacksquare\) | \(\blacksquare\) | \(\square\)
Multi-Agent Proximal Policy Optimization (MAPPO) | \(\blacksquare\) | \(\blacksquare\) | \(\square\)



Base class / configuration

Base class and configuration for multi-agent implementations.

API


PyTorch

MultiAgentCfg

Base class for the agent's configuration.

ExperimentCfg

Configuration for the experiment (saving checkpoints and logging data).

MultiAgent

Base class that represents an RL multi-agent/algorithm.

class skrl.multi_agents.torch.MultiAgentCfg(*, experiment: ExperimentCfg = <factory>)

Bases: ABC

Base class for the agent’s configuration.

Methods:

expand(*, possible_agents[, immutable])

Expand the configuration.

validate()

Validate the configuration.

Attributes:

experiment

Experiment settings.

expand(*, possible_agents: list[str], immutable: list[str] = []) -> None

Expand the configuration.

validate() -> bool

Validate the configuration.

experiment: ExperimentCfg

Experiment settings.

class skrl.multi_agents.torch.ExperimentCfg(*, directory: str = '', experiment_name: str = '', write_interval: int | Literal['auto'] = 'auto', checkpoint_interval: int | Literal['auto'] = 'auto', store_separately: bool = False, wandb: bool = False, wandb_kwargs: dict = <factory>)

Bases: object

Configuration for the experiment (saving checkpoints and logging data).

Attributes:

checkpoint_interval

Interval (in timesteps) for writing checkpoints.

directory

Directory path where the data generated by the different runs (experiments) are stored.

experiment_name

Name of the experiment (training/evaluation run).

store_separately

Whether to store checkpoints separately.

wandb

Whether to enable the use of Weights & Biases for logging and visualization.

wandb_kwargs

Keyword arguments for the Weights & Biases' setup.

write_interval

Interval (in timesteps) for writing data to TensorBoard.

checkpoint_interval: int | Literal['auto'] = 'auto'

Interval (in timesteps) for writing checkpoints.

  • A value less than or equal to 0 disables the writing of checkpoints.

  • If set to "auto", the interval will be defined to collect 10 samples throughout training/evaluation (timesteps / 10).

directory: str = ''

Directory path where the data generated by the different runs (experiments) are stored.

experiment_name: str = ''

Name of the experiment (training/evaluation run).

If not specified, the format YY-MM-DD_HH-MM-SS-SSSSSS_{agent_name} will be used.
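As an illustration of the default name format, the sketch below builds a string in the YY-MM-DD_HH-MM-SS-SSSSSS_{agent_name} layout with the standard `datetime` module. The helper name `default_experiment_name` is hypothetical, and reading the trailing six digits as microseconds is an assumption.

```python
import datetime
from typing import Optional


def default_experiment_name(agent_name: str, now: Optional[datetime.datetime] = None) -> str:
    """Hypothetical helper: YY-MM-DD_HH-MM-SS-SSSSSS_{agent_name}."""
    now = now or datetime.datetime.now()
    # %y: two-digit year, %f: microseconds zero-padded to six digits
    return f"{now.strftime('%y-%m-%d_%H-%M-%S-%f')}_{agent_name}"


stamp = datetime.datetime(2024, 3, 9, 14, 5, 7, 123)
print(default_experiment_name("IPPO", stamp))  # 24-03-09_14-05-07-000123_IPPO
```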

store_separately: bool = False

Whether to store checkpoints separately.

If set to True, all of an agent’s modules (models, optimizers, preprocessors, etc.) will be saved in separate files. By default (False), the modules are grouped in a dictionary and stored in the same file.
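The difference between the two storage layouts can be sketched without touching the file system. The helper `plan_checkpoint_files` and the file-name patterns are hypothetical. The sketch only shows which payload would end up in which file.

```python
def plan_checkpoint_files(modules: dict, tag: str, store_separately: bool) -> dict:
    """Hypothetical helper: map file names to the payload each file would hold."""
    if store_separately:
        # one file per module
        return {f"{name}_{tag}": module for name, module in modules.items()}
    # a single file holding a dictionary with every module
    return {f"agent_{tag}": dict(modules)}


modules = {"policy": "policy-weights", "value": "value-weights", "optimizer": "optimizer-state"}
grouped = plan_checkpoint_files(modules, "1000", store_separately=False)
separate = plan_checkpoint_files(modules, "1000", store_separately=True)
```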

wandb: bool = False

Whether to enable the use of Weights & Biases for logging and visualization.

wandb_kwargs: dict

Keyword arguments for the Weights & Biases’ setup.

Visit the Weights & Biases documentation for more details.

write_interval: int | Literal['auto'] = 'auto'

Interval (in timesteps) for writing data to TensorBoard.

  • A value less than or equal to 0 disables the writing of data to TensorBoard.

  • If set to "auto", the interval will be defined to collect 100 samples throughout training/evaluation (timesteps / 100).
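The "auto" rule for both intervals reduces to a division of the total number of timesteps. The sketch below is a minimal illustration. `resolve_interval` and `is_enabled` are hypothetical helpers, and truncating the quotient to an integer is an assumption.

```python
def resolve_interval(interval, timesteps: int, samples: int) -> int:
    """Hypothetical helper: turn an interval setting into a number of timesteps."""
    if interval == "auto":
        # spread `samples` writes evenly over the run
        return int(timesteps / samples)
    return interval


def is_enabled(interval: int) -> bool:
    # values less than or equal to 0 disable writing
    return interval > 0


timesteps = 50_000
write_interval = resolve_interval("auto", timesteps, samples=100)
checkpoint_interval = resolve_interval("auto", timesteps, samples=10)
print(write_interval, checkpoint_interval)  # 500 5000
```

Under this reading, a run shorter than the sample count resolves to 0, which disables writing.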

class skrl.multi_agents.torch.MultiAgent(*, cfg: MultiAgentCfg, possible_agents: list[str], models: dict[str, dict[str, Model]], memories: dict[str, Memory] | None = None, observation_spaces: dict[str, gymnasium.Space] | None = None, state_spaces: dict[str, gymnasium.Space] | None = None, action_spaces: dict[str, gymnasium.Space] | None = None, device: str | torch.device | None = None)

Bases: ABC

Base class that represents an RL multi-agent/algorithm.

Parameters:
  • cfg – Multi-agent’s configuration.

  • possible_agents – Names of all possible agents the environment could generate.

  • models – Agents’ models.

  • memories – Memories to store agents’ data and environment transitions.

  • observation_spaces – Observation spaces.

  • state_spaces – State spaces.

  • action_spaces – Action spaces.

  • device – Data allocation and computation device. If not specified, the default device will be used.
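The type hints above imply that most constructor arguments are dictionaries keyed by agent name, with `models` nested one level deeper. The sketch below shows only that shape, using plain placeholders instead of real models, memories, and spaces. The agent names and the role names `"policy"` and `"value"` are assumptions.

```python
possible_agents = ["agent_0", "agent_1"]

# models: agent name -> role -> model (role names here are assumptions)
models = {name: {"policy": object(), "value": object()} for name in possible_agents}

# memories and spaces: one entry per agent (placeholders instead of real objects)
memories = {name: [] for name in possible_agents}
observation_spaces = {name: ("Box", (8,)) for name in possible_agents}
action_spaces = {name: ("Discrete", 4) for name in possible_agents}

# every per-agent dictionary should cover exactly the possible agents
for mapping in (models, memories, observation_spaces, action_spaces):
    assert set(mapping) == set(possible_agents)
```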

Methods:

act(observations, states, *, timestep, timesteps)

Process the environment's observations/states to make a decision (actions) using the main policy.

enable_models_training_mode([enabled])

Set the training mode of all the agent's models: enabled (training) or disabled (evaluation).

enable_training_mode([enabled, apply_to_models])

Set the training mode of the agent: enabled (training) or disabled (evaluation).

init(*[, trainer_cfg])

Initialize the agent.

load(path)

Load the agent from the specified path.

post_interaction(*, timestep, timesteps)

Method called after the interaction with the environment.

pre_interaction(*, timestep, timesteps)

Method called before the interaction with the environment.

record_transition(*, observations, states, ...)

Record an environment transition in memory.

save(path)

Save the agent to the specified path.

track_data(tag, value)

Track data to TensorBoard.

update(*, timestep, timesteps, uid)

Algorithm's main update step.

write_checkpoint(*, timestep, timesteps)

Write checkpoint (modules) to persistent storage.

write_tracking_data(*, timestep, timesteps)

Write tracking data to TensorBoard.

abstractmethod act(observations: dict[str, torch.Tensor], states: dict[str, torch.Tensor | None], *, timestep: int, timesteps: int) -> tuple[dict[str, torch.Tensor], dict[str, Any]]

Process the environment’s observations/states to make a decision (actions) using the main policy.

Parameters:
  • observations – Environment observations.

  • states – Environment states.

  • timestep – Current timestep.

  • timesteps – Number of timesteps.

Returns:

Agent output. The first component is the expected action/value returned by the agent. The second component is a dictionary containing extra output values according to the model.

enable_models_training_mode(enabled: bool = True) -> None

Set the training mode of all the agent’s models: enabled (training) or disabled (evaluation).

Parameters:

enabled – True to enable the training mode, False to enable the evaluation mode.

enable_training_mode(enabled: bool = True, *, apply_to_models: bool = False) -> None

Set the training mode of the agent: enabled (training) or disabled (evaluation).

The training mode can be queried by the training property.

Parameters:
  • enabled – True to enable the training mode, False to enable the evaluation mode.

  • apply_to_models – Whether to apply the training mode to all the agent’s models.
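How `enable_training_mode` and `enable_models_training_mode` relate can be shown with two small stand-ins. `ModelStub` and `AgentStub` are hypothetical classes, and representing a model's mode as a boolean `training` attribute is an assumption made for the sketch.

```python
class ModelStub:
    """Hypothetical model that only remembers its mode."""

    def __init__(self):
        self.training = True


class AgentStub:
    """Hypothetical agent exposing the two documented switches."""

    def __init__(self, models):
        self.models = models  # agent name -> role -> model
        self.training = False

    def enable_models_training_mode(self, enabled=True):
        for roles in self.models.values():
            for model in roles.values():
                model.training = enabled

    def enable_training_mode(self, enabled=True, *, apply_to_models=False):
        self.training = enabled
        if apply_to_models:
            self.enable_models_training_mode(enabled)


agent = AgentStub({"agent_0": {"policy": ModelStub()}})
agent.enable_training_mode(False)  # the models keep their own mode
models_untouched = agent.models["agent_0"]["policy"].training
agent.enable_training_mode(False, apply_to_models=True)  # the models follow
models_after = agent.models["agent_0"]["policy"].training
```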

init(*, trainer_cfg: dict[str, Any] | None = None) -> None

Initialize the agent.

Warning

This method must be called before the agent is used. It will initialize the TensorBoard writer (and optionally Weights & Biases) and create the checkpoints directory.

Parameters:

trainer_cfg – Trainer configuration.

load(path: str) -> None

Load the agent from the specified path.

Note

The final storage device is determined by the constructor of the agent.

Parameters:

path – Path to load the agent from.

abstractmethod post_interaction(*, timestep: int, timesteps: int) -> None

Method called after the interaction with the environment.

Parameters:
  • timestep – Current timestep.

  • timesteps – Number of timesteps.

abstractmethod pre_interaction(*, timestep: int, timesteps: int) -> None

Method called before the interaction with the environment.

Parameters:
  • timestep – Current timestep.

  • timesteps – Number of timesteps.

record_transition(*, observations: dict[str, torch.Tensor], states: dict[str, torch.Tensor | None], actions: dict[str, torch.Tensor], rewards: dict[str, torch.Tensor], next_observations: dict[str, torch.Tensor], next_states: dict[str, torch.Tensor], terminated: dict[str, torch.Tensor], truncated: dict[str, torch.Tensor], infos: dict[str, Any], timestep: int, timesteps: int) -> None

Record an environment transition in memory.

Note

This method keeps track of the episode rewards (instantaneous and cumulative) and timesteps when the experiment.write_interval configuration resolves to a positive value. Inheriting classes must call this method to record such information.

Parameters:
  • observations – Environment observations.

  • states – Environment states.

  • actions – Actions taken by the agent.

  • rewards – Instant rewards achieved by the current actions.

  • next_observations – Next environment observations.

  • next_states – Next environment states.

  • terminated – Signals that indicate episodes have terminated.

  • truncated – Signals that indicate episodes have been truncated.

  • infos – Additional information about the environment.

  • timestep – Current timestep.

  • timesteps – Number of timesteps.
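The methods documented so far suggest a per-timestep call order: `pre_interaction`, `act`, an environment step, `record_transition`, then `post_interaction`, with `init` called once beforehand. The sketch below replays that order with a stand-in agent that only logs calls. `RecordingAgent` and `toy_env_step` are hypothetical, and the exact ordering used by a real trainer is an assumption.

```python
class RecordingAgent:
    """Hypothetical stand-in that only logs which method was called."""

    def __init__(self, possible_agents):
        self.possible_agents = possible_agents
        self.calls = []

    def init(self, *, trainer_cfg=None):
        self.calls.append("init")

    def pre_interaction(self, *, timestep, timesteps):
        self.calls.append("pre_interaction")

    def act(self, observations, states, *, timestep, timesteps):
        self.calls.append("act")
        # one action per agent, plus a dictionary of extra outputs
        return {name: 0 for name in observations}, {}

    def record_transition(self, **transition):
        self.calls.append("record_transition")

    def post_interaction(self, *, timestep, timesteps):
        self.calls.append("post_interaction")


def toy_env_step(actions):
    """Toy environment step (assumption): constant observations and rewards."""
    names = list(actions)
    flags = {n: False for n in names}
    return {n: 0.0 for n in names}, {n: 1.0 for n in names}, flags, flags, {}


agent = RecordingAgent(["agent_0", "agent_1"])
agent.init()
observations = {n: 0.0 for n in agent.possible_agents}
states = {n: None for n in agent.possible_agents}
timesteps = 2
for timestep in range(timesteps):
    agent.pre_interaction(timestep=timestep, timesteps=timesteps)
    actions, _ = agent.act(observations, states, timestep=timestep, timesteps=timesteps)
    next_observations, rewards, terminated, truncated, infos = toy_env_step(actions)
    agent.record_transition(
        observations=observations, states=states, actions=actions, rewards=rewards,
        next_observations=next_observations, next_states=states,
        terminated=terminated, truncated=truncated, infos=infos,
        timestep=timestep, timesteps=timesteps,
    )
    agent.post_interaction(timestep=timestep, timesteps=timesteps)
    observations = next_observations
```

All data passed between the calls are dictionaries keyed by agent name, matching the type hints in the signatures.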

save(path: str) -> None

Save the agent to the specified path.

Parameters:

path – Path to save the agent to.

track_data(tag: str, value: float) -> None

Track data to TensorBoard.

Note

Currently only scalar data is supported.

Parameters:
  • tag – Data identifier (e.g. ‘Loss/Policy loss’).

  • value – Value to track.
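One way to picture the split between `track_data` and `write_tracking_data` is a buffer that collects scalars per tag and is flushed at the write interval. The sketch below uses a hypothetical `ScalarTracker` that appends to a list instead of a real TensorBoard writer. Reducing the buffered values by their mean is an assumption, not documented behavior.

```python
from collections import defaultdict
from statistics import mean


class ScalarTracker:
    """Hypothetical stand-in: buffer scalars per tag, then flush one value per tag."""

    def __init__(self):
        self.buffer = defaultdict(list)
        self.written = []  # (tag, value, timestep) tuples instead of a real writer

    def track_data(self, tag: str, value: float) -> None:
        self.buffer[tag].append(value)

    def write_tracking_data(self, *, timestep: int, timesteps: int) -> None:
        for tag, values in self.buffer.items():
            # averaging is an assumption about how buffered values are reduced
            self.written.append((tag, mean(values), timestep))
        self.buffer.clear()


tracker = ScalarTracker()
tracker.track_data("Loss/Policy loss", 0.5)
tracker.track_data("Loss/Policy loss", 0.25)
tracker.write_tracking_data(timestep=100, timesteps=1000)
```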

abstractmethod update(*, timestep: int, timesteps: int, uid: str) -> None

Algorithm’s main update step.

Warning

This method should not be called directly, but rather by the agent itself when the algorithm is needed for learning.

Parameters:
  • timestep – Current timestep.

  • timesteps – Number of timesteps.

  • uid – Agent ID.

write_checkpoint(*, timestep: int, timesteps: int) -> None

Write checkpoint (modules) to persistent storage.

Note

The checkpoints are stored in the subdirectory checkpoints within the experiment directory. The checkpoint name is the timestep argument value (if it is not None), or the current system date-time otherwise.

Parameters:
  • timestep – Current timestep.

  • timesteps – Number of timesteps.
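The naming rule in the note above can be sketched as follows. `checkpoint_tag` and `checkpoint_path` are hypothetical helpers. The date-time format and the absence of a file extension are assumptions, and the real file name may carry a prefix or suffix.

```python
import datetime
import os
from typing import Optional


def checkpoint_tag(timestep: Optional[int]) -> str:
    """Hypothetical helper: use the timestep if given, else the current date-time."""
    if timestep is not None:
        return str(timestep)
    return datetime.datetime.now().strftime("%y-%m-%d_%H-%M-%S-%f")


def checkpoint_path(experiment_dir: str, timestep: Optional[int]) -> str:
    # checkpoints live in the "checkpoints" subdirectory of the experiment directory
    return os.path.join(experiment_dir, "checkpoints", checkpoint_tag(timestep))


path = checkpoint_path(os.path.join("runs", "demo"), 5000)
```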

write_tracking_data(*, timestep: int, timesteps: int) -> None

Write tracking data to TensorBoard.

Parameters:
  • timestep – Current timestep.

  • timesteps – Number of timesteps.


JAX

MultiAgentCfg

Base class for the agent's configuration.

ExperimentCfg

Configuration for the experiment (saving checkpoints and logging data).

MultiAgent

Base class that represents an RL multi-agent/algorithm.

class skrl.multi_agents.jax.MultiAgentCfg(*, experiment: ExperimentCfg = <factory>)

Bases: ABC

Base class for the agent’s configuration.

Methods:

expand(*, possible_agents[, immutable])

Expand the configuration.

validate()

Validate the configuration.

Attributes:

experiment

Experiment settings.

expand(*, possible_agents: list[str], immutable: list[str] = []) -> None

Expand the configuration.

validate() -> bool

Validate the configuration.

experiment: ExperimentCfg

Experiment settings.

class skrl.multi_agents.jax.ExperimentCfg(*, directory: str = '', experiment_name: str = '', write_interval: int | Literal['auto'] = 'auto', checkpoint_interval: int | Literal['auto'] = 'auto', store_separately: bool = False, wandb: bool = False, wandb_kwargs: dict = <factory>)

Bases: object

Configuration for the experiment (saving checkpoints and logging data).

Attributes:

checkpoint_interval

Interval (in timesteps) for writing checkpoints.

directory

Directory path where the data generated by the different runs (experiments) are stored.

experiment_name

Name of the experiment (training/evaluation run).

store_separately

Whether to store checkpoints separately.

wandb

Whether to enable the use of Weights & Biases for logging and visualization.

wandb_kwargs

Keyword arguments for the Weights & Biases' setup.

write_interval

Interval (in timesteps) for writing data to TensorBoard.

checkpoint_interval: int | Literal['auto'] = 'auto'

Interval (in timesteps) for writing checkpoints.

  • A value less than or equal to 0 disables the writing of checkpoints.

  • If set to "auto", the interval will be defined to collect 10 samples throughout training/evaluation (timesteps / 10).

directory: str = ''

Directory path where the data generated by the different runs (experiments) are stored.

experiment_name: str = ''

Name of the experiment (training/evaluation run).

If not specified, the format YY-MM-DD_HH-MM-SS-SSSSSS_{agent_name} will be used.

store_separately: bool = False

Whether to store checkpoints separately.

If set to True, all of an agent’s modules (models, optimizers, preprocessors, etc.) will be saved in separate files. By default (False), the modules are grouped in a dictionary and stored in the same file.

wandb: bool = False

Whether to enable the use of Weights & Biases for logging and visualization.

wandb_kwargs: dict

Keyword arguments for the Weights & Biases’ setup.

Visit the Weights & Biases documentation for more details.

write_interval: int | Literal['auto'] = 'auto'

Interval (in timesteps) for writing data to TensorBoard.

  • A value less than or equal to 0 disables the writing of data to TensorBoard.

  • If set to "auto", the interval will be defined to collect 100 samples throughout training/evaluation (timesteps / 100).

class skrl.multi_agents.jax.MultiAgent(*, cfg: MultiAgentCfg, possible_agents: list[str], models: dict[str, dict[str, Model]], memories: dict[str, Memory] | None = None, observation_spaces: dict[str, gymnasium.Space] | None = None, state_spaces: dict[str, gymnasium.Space] | None = None, action_spaces: dict[str, gymnasium.Space] | None = None, device: str | jax.Device | None = None)

Bases: ABC

Base class that represents an RL multi-agent/algorithm.

Parameters:
  • cfg – Multi-agent’s configuration.

  • possible_agents – Names of all possible agents the environment could generate.

  • models – Agents’ models.

  • memories – Memories to store agents’ data and environment transitions.

  • observation_spaces – Observation spaces.

  • state_spaces – State spaces.

  • action_spaces – Action spaces.

  • device – Data allocation and computation device. If not specified, the default device will be used.

Methods:

act(observations, states, *, timestep, timesteps)

Process the environment's observations/states to make a decision (actions) using the main policy.

enable_models_training_mode([enabled])

Set the training mode of all the agent's models: enabled (training) or disabled (evaluation).

enable_training_mode([enabled])

Set the training mode of the agent: enabled (training) or disabled (evaluation).

init(*[, trainer_cfg])

Initialize the agent.

load(path)

Load the agent from the specified path.

post_interaction(*, timestep, timesteps)

Method called after the interaction with the environment.

pre_interaction(*, timestep, timesteps)

Method called before the interaction with the environment.

record_transition(*, observations, states, ...)

Record an environment transition in memory.

save(path)

Save the agent to the specified path.

track_data(tag, value)

Track data to TensorBoard.

update(*, timestep, timesteps, uid)

Algorithm's main update step.

write_checkpoint(*, timestep, timesteps)

Write checkpoint (modules) to persistent storage.

write_tracking_data(*, timestep, timesteps)

Write tracking data to TensorBoard.

abstractmethod act(observations: dict[str, jax.Array], states: dict[str, jax.Array | None], *, timestep: int, timesteps: int) -> tuple[dict[str, jax.Array], dict[str, Any]]

Process the environment’s observations/states to make a decision (actions) using the main policy.

Parameters:
  • observations – Environment observations.

  • states – Environment states.

  • timestep – Current timestep.

  • timesteps – Number of timesteps.

Returns:

Agent output. The first component is the expected action/value returned by the agent. The second component is a dictionary containing extra output values according to the model.

enable_models_training_mode(enabled: bool = True) -> None

Set the training mode of all the agent’s models: enabled (training) or disabled (evaluation).

Parameters:

enabled – True to enable the training mode, False to enable the evaluation mode.

enable_training_mode(enabled: bool = True) -> None

Set the training mode of the agent: enabled (training) or disabled (evaluation).

The training mode can be queried by the training property.

Parameters:

enabled – True to enable the training mode, False to enable the evaluation mode.

init(*, trainer_cfg: dict[str, Any] | None = None) -> None

Initialize the agent.

Warning

This method must be called before the agent is used. It will initialize the TensorBoard writer (and optionally Weights & Biases) and create the checkpoints directory.

Parameters:

trainer_cfg – Trainer configuration.

load(path: str) -> None

Load the agent from the specified path.

Note

The final storage device is determined by the constructor of the agent.

Parameters:

path – Path to load the agent from.

abstractmethod post_interaction(*, timestep: int, timesteps: int) -> None

Method called after the interaction with the environment.

Parameters:
  • timestep – Current timestep.

  • timesteps – Number of timesteps.

abstractmethod pre_interaction(*, timestep: int, timesteps: int) -> None

Method called before the interaction with the environment.

Parameters:
  • timestep – Current timestep.

  • timesteps – Number of timesteps.

record_transition(*, observations: dict[str, jax.Array], states: dict[str, jax.Array | None], actions: dict[str, jax.Array], rewards: dict[str, jax.Array], next_observations: dict[str, jax.Array], next_states: dict[str, jax.Array], terminated: dict[str, jax.Array], truncated: dict[str, jax.Array], infos: dict[str, Any], timestep: int, timesteps: int) -> None

Record an environment transition in memory.

Note

This method keeps track of the episode rewards (instantaneous and cumulative) and timesteps when the experiment.write_interval configuration resolves to a positive value. Inheriting classes must call this method to record such information.

Parameters:
  • observations – Environment observations.

  • states – Environment states.

  • actions – Actions taken by the agent.

  • rewards – Instant rewards achieved by the current actions.

  • next_observations – Next environment observations.

  • next_states – Next environment states.

  • terminated – Signals that indicate episodes have terminated.

  • truncated – Signals that indicate episodes have been truncated.

  • infos – Additional information about the environment.

  • timestep – Current timestep.

  • timesteps – Number of timesteps.

save(path: str) -> None

Save the agent to the specified path.

Parameters:

path – Path to save the agent to.

track_data(tag: str, value: float) -> None

Track data to TensorBoard.

Note

Currently only scalar data is supported.

Parameters:
  • tag – Data identifier (e.g. ‘Loss/Policy loss’).

  • value – Value to track.

abstractmethod update(*, timestep: int, timesteps: int, uid: str) -> None

Algorithm’s main update step.

Warning

This method should not be called directly, but rather by the agent itself when the algorithm is needed for learning.

Parameters:
  • timestep – Current timestep.

  • timesteps – Number of timesteps.

  • uid – Agent ID.

write_checkpoint(*, timestep: int, timesteps: int) -> None

Write checkpoint (modules) to persistent storage.

Note

The checkpoints are stored in the subdirectory checkpoints within the experiment directory. The checkpoint name is the timestep argument value (if it is not None), or the current system date-time otherwise.

Parameters:
  • timestep – Current timestep.

  • timesteps – Number of timesteps.

write_tracking_data(*, timestep: int, timesteps: int) -> None

Write tracking data to TensorBoard.

Parameters:
  • timestep – Current timestep.

  • timesteps – Number of timesteps.