Cross-Entropy Method (CEM)
Algorithm implementation
Main notation/symbols:
- policy function approximator (\(\pi_\theta\))
- states (\(s\)), actions (\(a\)), rewards (\(r\)), next states (\(s'\)), dones (\(d\))
- loss (\(L\))
Decision making (act(...))
\(a \leftarrow \pi_\theta(s)\)
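In PyTorch terms, the act step amounts to drawing a discrete action from the categorical distribution induced by the policy's scores. A minimal sketch, assuming a hypothetical policy network `policy_net` and state tensor `state` (neither name is part of the skrl API):

```python
import torch

# hypothetical placeholders: policy_net maps states to unnormalized action scores
logits = policy_net(state)                               # scores over the discrete actions
dist = torch.distributions.Categorical(logits=logits)    # pi_theta(. | s)
action = dist.sample()                                   # a <- pi_theta(s)
```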
Learning algorithm (_update(...))
# sample all memory
\(s, a, r, s', d \leftarrow\) states, actions, rewards, next_states, dones
# compute discounted return threshold
\([G] \leftarrow \sum_{t=0}^{E-1} \text{discount\_factor}^{\,t} \, r_t\) for each episode
\(G_{_{bound}} \leftarrow q_{th_{quantile}}([G])\) at the given percentile
# get elite states and actions
\(s_{_{elite}} \leftarrow s[G \geq G_{_{bound}}]\)
\(a_{_{elite}} \leftarrow a[G \geq G_{_{bound}}]\)
# compute scores for the elite states
\(scores \leftarrow \pi_\theta(s_{_{elite}})\)
# compute policy loss
\(L_{\pi_\theta} \leftarrow -\frac{1}{N} \sum_{i=1}^{N} a_{_{elite,i}} \, \log(scores_i)\)
# optimization step
reset \(\text{optimizer}_\theta\)
\(\nabla_{\theta} L_{\pi_\theta}\)
step \(\text{optimizer}_\theta\)
# update learning rate
IF there is a learning_rate_scheduler THEN
step \(\text{scheduler}_\theta (\text{optimizer}_\theta)\)
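Putting the update steps together, a rough PyTorch sketch looks as follows. It is illustrative only, not the skrl implementation: `states`, `actions`, `rewards`, `dones`, `policy_net`, `optimizer`, `discount_factor` and `percentile` are assumed to exist with the obvious meanings.

```python
import torch
import torch.nn.functional as F

# split the sampled memory into episodes (dones mark episode boundaries)
# and compute the discounted return G of each episode
returns, episode_states, episode_actions = [], [], []
start = 0
for end in torch.nonzero(dones.flatten()).flatten().tolist():
    r = rewards.flatten()[start:end + 1]
    discounts = discount_factor ** torch.arange(len(r), dtype=torch.float32)
    returns.append((discounts * r).sum())
    episode_states.append(states[start:end + 1])
    episode_actions.append(actions[start:end + 1])
    start = end + 1
returns = torch.stack(returns)

# discounted return threshold at the given percentile
return_bound = torch.quantile(returns, percentile)

# keep only states and actions from elite episodes
elite = returns >= return_bound
elite_states = torch.cat([s for s, keep in zip(episode_states, elite) if keep])
elite_actions = torch.cat([a for a, keep in zip(episode_actions, elite) if keep])

# cross-entropy between the policy scores and the elite actions
scores = policy_net(elite_states)
loss = F.cross_entropy(scores, elite_actions.flatten().long())

# optimization step
optimizer.zero_grad()
loss.backward()
optimizer.step()
```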
Configuration and hyperparameters
- skrl.agents.torch.cem.cem.CEM_DEFAULT_CONFIG
CEM_DEFAULT_CONFIG = {
    "rollouts": 16,                 # number of rollouts before updating
    "percentile": 0.70,             # percentile to compute the reward bound [0, 1]

    "discount_factor": 0.99,        # discount factor (gamma)

    "learning_rate": 1e-2,          # learning rate
    "learning_rate_scheduler": None,        # learning rate scheduler class (see torch.optim.lr_scheduler)
    "learning_rate_scheduler_kwargs": {},   # learning rate scheduler's kwargs (e.g. {"step_size": 1e-3})

    "state_preprocessor": None,             # state preprocessor class (see skrl.resources.preprocessors)
    "state_preprocessor_kwargs": {},        # state preprocessor's kwargs (e.g. {"size": env.observation_space})

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",            # experiment's parent directory
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,        # interval for checkpoints (timesteps)
        "store_separately": False,          # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
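To adjust these values, copy the default dictionary and override the desired entries before passing it to the agent's constructor. A minimal sketch, assuming `models`, `memory` and a wrapped `env` have already been created elsewhere:

```python
import copy

from skrl.agents.torch.cem import CEM, CEM_DEFAULT_CONFIG

# start from the defaults and override only what is needed
cfg = copy.deepcopy(CEM_DEFAULT_CONFIG)
cfg["rollouts"] = 32
cfg["percentile"] = 0.80
cfg["learning_rate"] = 5e-3

agent = CEM(models=models,                            # models dictionary (see "Spaces and models" below)
            memory=memory,                            # memory instance
            cfg=cfg,
            observation_space=env.observation_space,
            action_space=env.action_space,
            device="cpu")                             # or e.g. "cuda:0"
```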
Spaces and models
The implementation supports the following Gym / Gymnasium spaces:
| Gym/Gymnasium spaces | Observation | Action |
|---|---|---|
| Discrete | \(\square\) | \(\blacksquare\) |
| Box | \(\blacksquare\) | \(\square\) |
| Dict | \(\blacksquare\) | \(\square\) |
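For example, a classic-control task such as CartPole (Box observation space, Discrete action space) matches the supported layout; the environment id below is only an illustration:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")

print(env.observation_space)    # Box(...) with shape (4,): supported as observation
print(env.action_space)         # Discrete(2): supported as action
```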
The implementation uses 1 discrete function approximator. This function approximator (model) must be collected in a dictionary and passed to the constructor of the class under the argument models.
| Notation | Concept | Key | Input shape | Output shape | Type |
|---|---|---|---|---|---|
| \(\pi(s)\) | Policy | "policy" | observation | action | Categorical |
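A categorical policy suitable for this agent can be sketched with skrl's model mixins as shown below. The network architecture is arbitrary, `env` and `device` are assumed to be defined elsewhere, and the `compute(inputs, role)` signature assumes a recent skrl version:

```python
import torch.nn as nn

from skrl.models.torch import CategoricalMixin, Model


class Policy(CategoricalMixin, Model):
    def __init__(self, observation_space, action_space, device, unnormalized_log_prob=True):
        Model.__init__(self, observation_space, action_space, device)
        CategoricalMixin.__init__(self, unnormalized_log_prob)

        # simple MLP mapping observations to unnormalized action scores (logits)
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64),
                                 nn.ReLU(),
                                 nn.Linear(64, self.num_actions))

    def compute(self, inputs, role):
        return self.net(inputs["states"]), {}


# collect the model in a dictionary under the "policy" key
models = {"policy": Policy(env.observation_space, env.action_space, device)}
```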
Support for advanced features is described in the next table:
| Feature | Support and remarks |
|---|---|
| RNN support | - |