Cross-Entropy Method (CEM)
Algorithm implementation
Main notation/symbols:
- policy function approximator (\(\pi_\theta\))
- states (\(s\)), actions (\(a\)), rewards (\(r\)), next states (\(s'\)), dones (\(d\))
- loss (\(L\))
Decision making (act(...))
\(a \leftarrow \pi_\theta(s)\)
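In PyTorch terms, the act step amounts to drawing a discrete action from the categorical distribution induced by the policy's scores. A minimal sketch, assuming a hypothetical policy network `policy_net` and state tensor `state` (neither name is part of the skrl API):

```python
import torch

# hypothetical placeholders: policy_net maps states to unnormalized action scores
logits = policy_net(state)                               # scores over the discrete actions
dist = torch.distributions.Categorical(logits=logits)    # pi_theta(. | s)
action = dist.sample()                                   # a <- pi_theta(s)
```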
Learning algorithm (_update(...))
# sample all memory
\(s, a, r, s', d \leftarrow\) states, actions, rewards, next_states, dones
# compute discounted return threshold
\([G] \leftarrow \sum_{t=0}^{E-1} \text{discount\_factor}^{\,t} \, r_t\) for each episode
\(G_{_{bound}} \leftarrow q_{th_{quantile}}([G])\) at the given percentile
# get elite states and actions
\(s_{_{elite}} \leftarrow s[G \geq G_{_{bound}}]\)
\(a_{_{elite}} \leftarrow a[G \geq G_{_{bound}}]\)
# compute scores for the elite states
\(scores \leftarrow \pi_\theta(s_{_{elite}})\)
# compute policy loss
\(L_{\pi_\theta} \leftarrow -\frac{1}{N} \sum_{i=1}^{N} a_{_{elite,i}} \, \log(scores_i)\)
# optimization step
reset \(\text{optimizer}_\theta\)
\(\nabla_{\theta} L_{\pi_\theta}\)
step \(\text{optimizer}_\theta\)
# update learning rate
IF there is a learning_rate_scheduler THEN
step \(\text{scheduler}_\theta (\text{optimizer}_\theta)\)
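Putting the update steps together, a rough PyTorch sketch looks as follows. It is illustrative only, not the skrl implementation: `states`, `actions`, `rewards`, `dones`, `policy_net`, `optimizer`, `discount_factor` and `percentile` are assumed to exist with the obvious meanings.

```python
import torch
import torch.nn.functional as F

# split the sampled memory into episodes (dones mark episode boundaries)
# and compute the discounted return G of each episode
returns, episode_states, episode_actions = [], [], []
start = 0
for end in torch.nonzero(dones.flatten()).flatten().tolist():
    r = rewards.flatten()[start:end + 1]
    discounts = discount_factor ** torch.arange(len(r), dtype=torch.float32)
    returns.append((discounts * r).sum())
    episode_states.append(states[start:end + 1])
    episode_actions.append(actions[start:end + 1])
    start = end + 1
returns = torch.stack(returns)

# discounted return threshold at the given percentile
return_bound = torch.quantile(returns, percentile)

# keep only states and actions from elite episodes
elite = returns >= return_bound
elite_states = torch.cat([s for s, keep in zip(episode_states, elite) if keep])
elite_actions = torch.cat([a for a, keep in zip(episode_actions, elite) if keep])

# cross-entropy between the policy scores and the elite actions
scores = policy_net(elite_states)
loss = F.cross_entropy(scores, elite_actions.flatten().long())

# optimization step
optimizer.zero_grad()
loss.backward()
optimizer.step()
```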
Configuration and hyperparameters
- skrl.agents.torch.cem.cem.CEM_DEFAULT_CONFIG
CEM_DEFAULT_CONFIG = {
    "rollouts": 16,                 # number of rollouts before updating
    "percentile": 0.70,             # percentile to compute the reward bound [0, 1]

    "discount_factor": 0.99,        # discount factor (gamma)

    "learning_rate": 1e-2,          # learning rate
    "learning_rate_scheduler": None,        # learning rate scheduler class (see torch.optim.lr_scheduler)
    "learning_rate_scheduler_kwargs": {},   # learning rate scheduler's kwargs (e.g. {"step_size": 1e-3})

    "state_preprocessor": None,             # state preprocessor class (see skrl.resources.preprocessors)
    "state_preprocessor_kwargs": {},        # state preprocessor's kwargs (e.g. {"size": env.observation_space})

    "random_timesteps": 0,          # random exploration steps
    "learning_starts": 0,           # learning starts after this many steps

    "rewards_shaper": None,         # rewards shaping function: Callable(reward, timestep, timesteps) -> reward

    "experiment": {
        "directory": "",            # experiment's parent directory
        "experiment_name": "",      # experiment name
        "write_interval": 250,      # TensorBoard writing interval (timesteps)

        "checkpoint_interval": 1000,        # interval for checkpoints (timesteps)
        "store_separately": False,          # whether to store checkpoints separately

        "wandb": False,             # whether to use Weights & Biases
        "wandb_kwargs": {}          # wandb kwargs (see https://docs.wandb.ai/ref/python/init)
    }
}
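To adjust these values, copy the default dictionary and override the desired entries before passing it to the agent's constructor. A minimal sketch, assuming `models`, `memory` and a wrapped `env` have already been created elsewhere:

```python
import copy

from skrl.agents.torch.cem import CEM, CEM_DEFAULT_CONFIG

# start from the defaults and override only what is needed
cfg = copy.deepcopy(CEM_DEFAULT_CONFIG)
cfg["rollouts"] = 32
cfg["percentile"] = 0.80
cfg["learning_rate"] = 5e-3

agent = CEM(models=models,                            # models dictionary (see "Spaces and models" below)
            memory=memory,                            # memory instance
            cfg=cfg,
            observation_space=env.observation_space,
            action_space=env.action_space,
            device="cpu")                             # or e.g. "cuda:0"
```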
Spaces and models
The implementation supports the following Gym / Gymnasium spaces:
| Gym/Gymnasium spaces | Observation | Action |
|---|---|---|
| Discrete | \(\square\) | \(\blacksquare\) |
| Box | \(\blacksquare\) | \(\square\) |
| Dict | \(\blacksquare\) | \(\square\) |
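For example, a classic-control task such as CartPole (Box observation space, Discrete action space) matches the supported layout; the environment id below is only an illustration:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")

print(env.observation_space)    # Box(...) with shape (4,): supported as observation
print(env.action_space)         # Discrete(2): supported as action
```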
The implementation uses 1 discrete function approximator. This function approximator (model) must be collected in a dictionary and passed to the constructor of the class under the argument models.
| Notation | Concept | Key | Input shape | Output shape | Type |
|---|---|---|---|---|---|
| \(\pi(s)\) | Policy | "policy" | observation | action | Categorical |
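A categorical policy suitable for this agent can be sketched with skrl's model mixins as shown below. The network architecture is arbitrary, `env` and `device` are assumed to be defined elsewhere, and the `compute(inputs, role)` signature assumes a recent skrl version:

```python
import torch.nn as nn

from skrl.models.torch import CategoricalMixin, Model


class Policy(CategoricalMixin, Model):
    def __init__(self, observation_space, action_space, device, unnormalized_log_prob=True):
        Model.__init__(self, observation_space, action_space, device)
        CategoricalMixin.__init__(self, unnormalized_log_prob)

        # simple MLP mapping observations to unnormalized action scores (logits)
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64),
                                 nn.ReLU(),
                                 nn.Linear(64, self.num_actions))

    def compute(self, inputs, role):
        return self.net(inputs["states"]), {}


# collect the model in a dictionary under the "policy" key
models = {"policy": Policy(env.observation_space, env.action_space, device)}
```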
Support for advanced features is described in the next table:
| Feature | Support and remarks |
|---|---|
| RNN support | - |