Cross-Entropy Method (CEM)

Algorithm implementation

Main notation/symbols:
- policy function approximator (\(\pi_\theta\))
- states (\(s\)), actions (\(a\)), rewards (\(r\)), next states (\(s'\)), dones (\(d\))
- loss (\(L\))

Decision making (act(...))

\(a \leftarrow \pi_\theta(s)\)
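A minimal, self-contained sketch of this step only (not skrl's code): a plain nn.Module stands in for \(\pi_\theta\), and the dimensions and variable names are illustrative. The action is sampled from the categorical distribution defined by the policy scores.

# Illustrative sketch of the decision step: sample an action from the
# categorical distribution defined by the policy scores (logits)
import torch
import torch.nn as nn

obs_dim, num_actions = 4, 2                          # assumed dimensions, for illustration
policy_net = nn.Sequential(nn.Linear(obs_dim, 32),   # stand-in for pi_theta
                           nn.ReLU(),
                           nn.Linear(32, num_actions))

state = torch.randn(1, obs_dim)                      # dummy observation s
with torch.no_grad():
    scores = policy_net(state)                       # unnormalized action scores (logits)
    action = torch.distributions.Categorical(logits=scores).sample()   # a ~ pi_theta(.|s)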

Learning algorithm (_update(...))

# sample all memory
\(s, a, r, s', d \leftarrow\) states, actions, rewards, next_states, dones
# compute discounted return threshold
\([G] \leftarrow \sum_{t=0}^{E-1} \gamma^{\,t} \, r_t\) for each episode, where \(\gamma\) is the discount_factor
\(G_{\text{bound}} \leftarrow q_{\text{percentile}}([G])\) (quantile of the returns at the given percentile)
# get elite states and actions
\(s_{\text{elite}} \leftarrow s[G \geq G_{\text{bound}}]\)
\(a_{\text{elite}} \leftarrow a[G \geq G_{\text{bound}}]\)
# compute scores for the elite states
\(scores \leftarrow \pi_\theta(s_{\text{elite}})\)
# compute policy loss (cross-entropy between elite actions and scores)
\(L_{\pi_\theta} \leftarrow -\sum_{i=1}^{N} {a_{\text{elite}}}_{i} \, \log(scores_{i})\)
# optimization step
reset \(\text{optimizer}_\theta\)
\(\nabla_{\theta} L_{\pi_\theta}\)
step \(\text{optimizer}_\theta\)
# update learning rate
IF there is a learning_rate_scheduler THEN
step \(\text{scheduler}_\theta (\text{optimizer}_\theta)\)
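The update can be sketched end to end as follows. This is an illustrative, self-contained example, not skrl's implementation: a plain nn.Module stands in for the policy, random tensors stand in for the sampled memory, and the hyperparameter values mirror the defaults shown below.

# Illustrative sketch of the CEM update (not skrl's implementation)
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
obs_dim, num_actions = 4, 2
discount_factor, percentile, learning_rate = 0.99, 0.70, 1e-2

policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=learning_rate)

# dummy memory: T transitions of (state, action, reward, done)
T = 64
states = torch.randn(T, obs_dim)
actions = torch.randint(0, num_actions, (T,))
rewards = torch.randn(T)
dones = torch.zeros(T, dtype=torch.bool)
dones[15::16] = True                                  # four episodes of 16 steps each

# [G]: discounted return of each episode
returns, slices, start, G = [], [], 0, 0.0
for t in range(T):
    G += (discount_factor ** (t - start)) * rewards[t].item()
    if dones[t]:
        returns.append(G)
        slices.append(slice(start, t + 1))
        start, G = t + 1, 0.0

# G_bound: return threshold at the configured percentile
return_bound = torch.quantile(torch.tensor(returns), percentile).item()

# elite states/actions: transitions of the episodes whose return reaches the threshold
elite = [i for i, g in enumerate(returns) if g >= return_bound]
elite_states = torch.cat([states[slices[i]] for i in elite])
elite_actions = torch.cat([actions[slices[i]] for i in elite])

# policy loss: cross-entropy between the policy scores and the elite actions
scores = policy_net(elite_states)
loss = F.cross_entropy(scores, elite_actions)

# optimization step
optimizer.zero_grad()
loss.backward()
optimizer.step()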

Configuration and hyperparameters

skrl.agents.torch.cem.cem.CEM_DEFAULT_CONFIG
CEM_DEFAULT_CONFIG = {
    "rollouts": 16,                 # number of rollouts before updating
    "percentile": 0.70,             # percentile to compute the reward bound [0, 1]

    "discount_factor": 0.99,        # discount factor (gamma)

    "learning_rate": 1e-2,          # learning rate
    "learning_rate_scheduler": None,        # learning rate scheduler class (see torch.optim.lr_scheduler)
    "learning_rate_scheduler_kwargs": {},   # learning rate scheduler's kwargs (e.g. {"step_size": 1e-3})

    "state_preprocessor": None,             # state preprocessor class (see skrl.resources.preprocessors)
    "state_preprocessor_kwargs": {},        # state preprocessor's kwargs (e.g. {"size": env.observation_space})

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",            # experiment's parent directory
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,        # interval for checkpoints (timesteps)
        "store_separately": False,          # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
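For instance, the defaults can be copied and selectively overridden before instantiating the agent. The snippet below is a sketch following skrl's usual usage pattern; it assumes that the environment wrapper (env), the memory and the models dictionary have already been created elsewhere.

# Sketch of overriding the default configuration (env, memory and models are
# assumed to have been created beforehand with skrl's wrappers, memories and models)
from copy import deepcopy

from skrl.agents.torch.cem import CEM, CEM_DEFAULT_CONFIG

cfg = deepcopy(CEM_DEFAULT_CONFIG)      # deep copy so the nested "experiment" dict is not shared
cfg["rollouts"] = 32
cfg["percentile"] = 0.80
cfg["learning_rate"] = 5e-3

agent = CEM(models=models,
            memory=memory,
            cfg=cfg,
            observation_space=env.observation_space,
            action_space=env.action_space,
            device=env.device)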

Spaces and models

The implementation supports the following Gym spaces / Gymnasium spaces

(\(\blacksquare\): supported, \(\square\): not supported)

Gym/Gymnasium spaces | Observation      | Action
Discrete             | \(\square\)      | \(\blacksquare\)
Box                  | \(\blacksquare\) | \(\square\)
Dict                 | \(\blacksquare\) | \(\square\)

The implementation uses one discrete function approximator. This function approximator (model) must be collected in a dictionary and passed to the constructor of the class under the argument models.

Notation   | Concept | Key      | Input shape | Output shape | Type
\(\pi(s)\) | Policy  | "policy" | observation | action       | Categorical
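A policy matching this table (Categorical type, collected under the "policy" key) could be defined with skrl's categorical mixin as sketched below. The class name CategoricalPolicy and the network architecture are illustrative, and a flat (Box) observation space with a Discrete action space is assumed.

import torch.nn as nn

from skrl.models.torch import CategoricalMixin, Model


# illustrative categorical policy: maps observations to one score (logit) per discrete action
class CategoricalPolicy(CategoricalMixin, Model):
    def __init__(self, observation_space, action_space, device, unnormalized_log_prob=True):
        Model.__init__(self, observation_space, action_space, device)
        CategoricalMixin.__init__(self, unnormalized_log_prob)

        self.net = nn.Sequential(nn.Linear(self.num_observations, 64),
                                 nn.ReLU(),
                                 nn.Linear(64, 64),
                                 nn.ReLU(),
                                 nn.Linear(64, self.num_actions))

    def compute(self, inputs, role):
        return self.net(inputs["states"]), {}


# the model is passed to the agent under the "policy" key, e.g.:
# models = {"policy": CategoricalPolicy(env.observation_space, env.action_space, env.device)}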

Support for advanced features is described in the next table

Feature     | Support and remarks
RNN support | -

API