Getting Started

In this section, you will learn how to use the various components of the skrl library to create reinforcement learning tasks. Whether you are a beginner or an experienced researcher, we hope this section will provide you with a solid foundation to build upon.

We recommend visiting the Examples to see how the components can be integrated and applied in practice. Let’s get started!



Reinforcement Learning schema

Reinforcement Learning (RL) is a Machine Learning sub-field for decision making that allows an agent to learn from its interaction with the environment as shown in the following schema:

Reinforcement Learning schema

At each step (also called a timestep) of interaction with the environment, the agent sees an observation \(o_t\) of the complete description of the state \(s_t \in S\) of the environment. Then, it decides which action \(a_t \in A\) to take from the action space using a policy. The environment, which changes in response to the agent's action (or by itself), returns a reward signal \(r_t = R(s_t, a_t, s_{t+1})\) as a measure of how good or bad the action that moved it to its new state \(s_{t+1}\) was. The agent aims to maximize the cumulative reward (discounted or not by a factor \(\gamma \in (0,1]\)) by adjusting the policy's behaviour via some optimization algorithm.
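
For reference, the (discounted) cumulative reward the agent seeks to maximize can be written as the return \(G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k}\) (with the sum truncated at the episode horizon when \(\gamma = 1\)).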

Following this schema, the rest of this section guides you through the creation of an RL system using skrl.


1. Environments

The environment plays a fundamental role in the definition of the RL schema. For example, the selection of the agent depends strongly on the nature of the observation and action spaces. There are several interfaces for interacting with environments, such as OpenAI Gym / Farama Gymnasium or DeepMind. However, each of them has a different API and works with non-compatible data types.

  • For single-agent environments, skrl offers a function to wrap environments based on the Gym/Gymnasium, DeepMind, NVIDIA Isaac Gym, Isaac Orbit and Omniverse Isaac Gym interfaces, among others. The wrapped environments provide, to the library components, a common interface (based on Gym/Gymnasium) as shown in the following figure. Refer to the Wrapping (single-agent) section for more information.

  • For multi-agent environments, skrl offers a function to wrap environments based on the PettingZoo and Bi-DexHands interfaces. The wrapped environments provide, to the library components, a common interface (based on PettingZoo) as shown in the following figure. Refer to the Wrapping (multi-agents) section for more information.

Environment wrapping

Among the methods and properties defined in the wrapped environment, the observation and action spaces are some of the most relevant for instantiating other library components. The following code snippets show how to load and wrap environments based on the supported interfaces:

# import the environment wrapper and loader
from skrl.envs.wrappers.torch import wrap_env
from skrl.envs.loaders.torch import load_omniverse_isaacgym_env

# load the environment
env = load_omniverse_isaacgym_env(task_name="Cartpole")

# wrap the environment
env = wrap_env(env)  # or 'env = wrap_env(env, wrapper="omniverse-isaacgym")'
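
For environments that already expose the Gym/Gymnasium API, the environment is loaded with the corresponding library and only needs to be wrapped. A minimal sketch, assuming the Gymnasium package is installed and using the Pendulum-v1 task only as an example:

# import the environment wrapper
from skrl.envs.wrappers.torch import wrap_env

# load the environment with Gymnasium
import gymnasium as gym
env = gym.make("Pendulum-v1")

# wrap the environment (the wrapper type can also be inferred automatically)
env = wrap_env(env)  # or 'env = wrap_env(env, wrapper="gymnasium")'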

Once the environment is known (and instantiated), it is time to configure and instantiate the agent. Apart from the optimization algorithm, agents are composed of several components, such as memories, models or noises, according to their nature. The following subsections focus on those components.


2. Memories

Memories are storage components that allow agents to collect and use/reuse recent or past experiences or other types of information. These can be large in size (such as replay buffers used by off-policy algorithms like DDPG, TD3 or SAC) or small in size (such as rollout buffers used by on-policy algorithms like PPO or TRPO to store batches that are discarded after use).

skrl provides generic memory definitions that are not tied to the agent implementation and can be used for any role, such as rollout or replay buffers. They are empty shells when they are instantiated and the agents are in charge of defining the tensors according to their needs. The total space occupied is the product of the memory size (memory_size), the number of environments (num_envs) obtained from the wrapped environment and the data size for each defined tensor.

The following code snippet shows how to instantiate a memory:

# import the memory class
from skrl.memories.torch import RandomMemory

# instantiate the memory (assumes there is a wrapped environment: env)
memory = RandomMemory(memory_size=1000, num_envs=env.num_envs, device=env.device)

Memories are passed directly to the agent constructor, if required (not all agents require a memory; Q-learning and SARSA, for example, do not), during its instantiation under the argument memory (or memories).
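
Although agents define the tensors they need by themselves, the memory can also be populated manually. A minimal sketch, where the tensor names and sizes are only illustrative:

import torch

# define tensors in the memory (agents do this automatically for the tensors they need)
memory.create_tensor(name="states", size=env.observation_space, dtype=torch.float32)
memory.create_tensor(name="actions", size=env.action_space, dtype=torch.float32)
memory.create_tensor(name="rewards", size=1, dtype=torch.float32)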


3. Models

Models are the agents’ brains. Agents can have one or several models and their parameters are adjusted via the optimization algorithms.

In contrast to other libraries, skrl does not provide predefined models or fixed templates (this practice tends to hide and reduce the flexibility of the system, forcing developers to deeply inspect the code to make even small changes). Nevertheless, helper mixins are provided to create discrete and continuous (stochastic or deterministic) models with the library. In this way, the user/researcher only needs to be concerned with the definition of the approximation functions (tables or artificial neural networks), while retaining full control. The following diagrams show the concept of the provided mixins.

Categorical model

For snippets, refer to the Categorical model section.

Models must be collected in a dictionary and passed to the agent constructor during its instantiation under the argument models. The dictionary keys are specific to each agent; visit the respective agent documentation for more details (under the Spaces and models section). For example, the PPO agent requires the policy and value models as shown below:

models = {}
models["policy"] = Policy(env.observation_space, env.action_space, env.device)
models["value"] = Value(env.observation_space, env.action_space, env.device)

Models can be saved and loaded to and from the file system. However, the recommended practice for loading checkpoints to perform evaluations or continue an interrupted training is through the agents (they include, in addition to the models, other components and internal instances such as preprocessors or optimizers). Refer to Saving, loading and logging (under Checkpoints section) for more information.


4. Noises

Noise plays a fundamental role in the exploration stage, especially in agents of a deterministic nature, such as DDPG or TD3.

skrl provides, as part of its resources, classes for instantiating noises as shown in the following code snippets. Refer to Noises documentation for more information. Noise instances are passed to the agents in their respective configuration dictionaries.

# import the noise class and an agent configuration (DDPG is used here only as an example)
from skrl.agents.torch.ddpg import DDPG_DEFAULT_CONFIG
from skrl.resources.noises.torch import GaussianNoise

cfg = DDPG_DEFAULT_CONFIG.copy()
cfg["exploration"]["noise"] = GaussianNoise(mean=0, std=0.2, device="cuda:0")

5. Learning rate schedulers

Learning rate schedulers help the RL system converge faster and improve accuracy.

skrl supports all PyTorch and JAX (Optax) learning rate schedulers and provides, as part of its resources, additional schedulers. Refer to Learning rate schedulers documentation for more information.

Learning rate scheduler classes and their respective arguments (except the optimizer argument) are passed to the agents in their respective configuration dictionaries. For example, for the PPO agent, one of the schedulers can be configured as shown below:

from skrl.agents.torch.ppo import PPO, PPO_DEFAULT_CONFIG
from skrl.resources.schedulers.torch import KLAdaptiveRL

agent_cfg = PPO_DEFAULT_CONFIG.copy()
agent_cfg["learning_rate_scheduler"] = KLAdaptiveRL
agent_cfg["learning_rate_scheduler_kwargs"] = {"kl_threshold": 0.008}

6. Preprocessors

Data preprocessing can help increase the accuracy and efficiency of training by cleaning the data or making it suitable for machine learning models.

skrl provides, as part of its resources, preprocessor classes. Refer to the Preprocessors documentation for more information.

Preprocessor classes and their respective arguments are passed to the agents in their respective configuration dictionaries. For example, for the PPO agent, one of the preprocessors can be configured as shown below:

from skrl.agents.torch.ppo import PPO, PPO_DEFAULT_CONFIG
from skrl.resources.preprocessors.torch import RunningStandardScaler

agent_cfg = PPO_DEFAULT_CONFIG.copy()
agent_cfg["state_preprocessor"] = RunningStandardScaler
agent_cfg["state_preprocessor_kwargs"] = {"size": env.observation_space, "device": env.device}
agent_cfg["value_preprocessor"] = RunningStandardScaler
agent_cfg["value_preprocessor_kwargs"] = {"size": 1, "device": env.device}

7. Agents

Agents are the components in charge of decision making. They are much more than models (neural networks, for example) and include the optimization algorithms that compute the optimal policy.

skrl provides state-of-the-art agents. Their implementations are focused on readability, simplicity and code transparency. Each agent is implemented independently even when two or more agents may share common code. Refer to each agent's documentation for more information about the models and spaces they support, their respective configurations, algorithm details and more.

Agents generally expect, as arguments, the models and memories components, as well as the observation and action spaces, the device where their logic is executed, and a configuration dictionary with hyperparameters and other values. The remaining components, mentioned above, are collected through the configuration dictionary. For example, the PPO agent can be instantiated as follows:

from skrl.agents.torch.ppo import PPO

agent = PPO(models=models,  # models dict
            memory=memory,  # memory instance, or None if not required
            cfg=agent_cfg,  # configuration dict (preprocessors, learning rate schedulers, etc.)
            observation_space=env.observation_space,
            action_space=env.action_space,
            device=env.device)

Agents can be saved and loaded to and from the file system. This is the recommended practice for loading checkpoints to perform evaluations or to continue interrupted training (since they include, in addition to models, other internal components and instances such as preprocessors or optimizers). Refer to Saving, loading and logging (under Checkpoints section) for more information.
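
A minimal sketch of checkpoint handling through the agent (the path below is hypothetical; during training, checkpoints are also written periodically to the run's folder):

# save a checkpoint manually
agent.save("./agent_checkpoint.pt")

# load a checkpoint to evaluate the agent or to continue an interrupted training
agent.load("./agent_checkpoint.pt")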


8. Trainers

Now that both actors, the environment and the agent, are instantiated, it is time to put the RL system in motion.

skrl offers classes (called Trainers) that manage the interaction cycle between the environment and the agent(s) for both training and evaluation. These classes also enable the simultaneous training and evaluation of several agents by scope (subsets of environments among all available environments), which may or may not share resources, in the same run.

The following code snippets show how to train/evaluate RL systems using the available trainers:

from skrl.trainers.torch import SequentialTrainer

# assuming there is an environment called 'env'
# and an agent or a list of agents called 'agents'

# create a sequential trainer
cfg = {"timesteps": 50000, "headless": False}
trainer = SequentialTrainer(env=env, agents=agents, cfg=cfg)

# train the agent(s)
trainer.train()

# evaluate the agent(s)
trainer.eval()
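
When training several agents simultaneously by scope, as mentioned above, the subsets of environments assigned to each agent can be indicated through the agents_scope argument. A minimal sketch, where the agent instances and scope sizes are only illustrative:

# assign the first 256 environments to 'agent_a' and the remaining 256 to 'agent_b'
trainer = SequentialTrainer(env=env,
                            agents=[agent_a, agent_b],
                            agents_scope=[256, 256],
                            cfg=cfg)
trainer.train()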


What’s next?

Visit the Examples section for training and evaluation demonstrations with different environment interfaces and highlighted practices, among others.