csle_agents.agents.ppo_clean package

Submodules

csle_agents.agents.ppo_clean.ppo_clean_agent module

MIT License

Copyright (c) 2019 CleanRL developers https://github.com/vwxyzjn/cleanrl

class csle_agents.agents.ppo_clean.ppo_clean_agent.PPOCleanAgent(simulation_env_config: SimulationEnvConfig, emulation_env_config: Union[None, EmulationEnvConfig], experiment_config: ExperimentConfig, training_job: Optional[TrainingJobConfig] = None, save_to_metastore: bool = True)[source]

Bases: BaseAgent

A PPO agent using the implementation from CleanRL
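
A minimal construction sketch (not a runnable end-to-end example): it assumes that simulation_env_config and experiment_config have already been created, e.g. loaded from the CSLE metastore or built programmatically, and that no emulation environment is needed.

    from csle_agents.agents.ppo_clean.ppo_clean_agent import PPOCleanAgent

    agent = PPOCleanAgent(
        simulation_env_config=simulation_env_config,  # SimulationEnvConfig, assumed to exist
        emulation_env_config=None,                    # optional EmulationEnvConfig
        experiment_config=experiment_config,          # ExperimentConfig with the PPO hyperparameters, assumed to exist
        save_to_metastore=False,                      # do not persist results to the metastore
    )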

generalized_advantage_estimation(model: PPONetwork, next_obs: Tensor, rewards: Tensor, device: device, next_done: Tensor, dones: Tensor, values: Tensor) Tuple[Tensor, Tensor][source]

Computes generalized advantage estimates (i.e., exponentially weighted averages of n-step returns). See Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation", ICLR 2016.

Parameters
  • model – the neural network model

  • next_obs – the next observation

  • rewards – tensor of rewards

  • device – the device (CPU or GPU) to run the computation on

  • next_done – whether the next state is done (logical OR of terminations and truncations)

  • dones – tensor of done flags

  • values – tensor of value estimates

Returns

returns, advantages
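
For reference, a self-contained sketch of the standard GAE backward recursion that this method follows (CleanRL-style). The gamma and gae_lambda defaults and the explicit next_value argument are assumptions made for brevity; in the agent's method the value of the next observation is computed from next_obs with the model.

    import torch

    def gae_sketch(rewards: torch.Tensor, values: torch.Tensor, dones: torch.Tensor,
                   next_value: torch.Tensor, next_done: torch.Tensor,
                   gamma: float = 0.99, gae_lambda: float = 0.95):
        """Backward recursion: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_{t+1}) - V(s_t),
        A_t = delta_t + gamma * lambda * (1 - done_{t+1}) * A_{t+1}, returns = A + V."""
        num_steps = rewards.shape[0]
        advantages = torch.zeros_like(rewards)
        lastgaelam = torch.zeros_like(next_value)
        for t in reversed(range(num_steps)):
            if t == num_steps - 1:
                nextnonterminal = 1.0 - next_done
                nextvalues = next_value
            else:
                nextnonterminal = 1.0 - dones[t + 1]
                nextvalues = values[t + 1]
            delta = rewards[t] + gamma * nextvalues * nextnonterminal - values[t]
            lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
            advantages[t] = lastgaelam
        returns = advantages + values
        return returns, advantages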

hparam_names() List[str][source]

Gets the hyperparameter names of the agent

Returns

a list with the hyperparameter names

make_env() Callable[[], RecordEpisodeStatistics[Any, Any]][source]

Helper function for creating the environment to use for training

Returns

a function that creates the environment
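
The pattern is CleanRL's environment-factory idiom: the returned thunk creates the environment and wraps it in RecordEpisodeStatistics so that episodic returns and lengths can be logged. A generic Gymnasium sketch of this idiom (the environment id is a stand-in; in CSLE the environment is created from the simulation configuration):

    import gymnasium as gym

    def make_env_sketch(env_id: str, seed: int):
        def thunk():
            env = gym.make(env_id)
            env = gym.wrappers.RecordEpisodeStatistics(env)  # tracks episodic return/length
            env.action_space.seed(seed)
            return env
        return thunk

    # The thunks are typically passed to a vectorized environment:
    envs = gym.vector.SyncVectorEnv([make_env_sketch("CartPole-v1", seed=i) for i in range(4)])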

run_ppo(exp_result: ExperimentResult, seed: int) Tuple[ExperimentResult, BaseEnv, PPONetwork][source]

Runs PPO with a given seed

Parameters
  • exp_result – the object in which to store the experiment results

  • seed – the random seed

Returns

the updated experiment results, the environment, and the trained model
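
A hedged sketch of how run_ppo is typically driven, with one call per random seed; the ExperimentResult import path, its default constructor, and experiment_config.random_seeds are assumptions, not verified API details.

    from csle_common.dao.training.experiment_result import ExperimentResult  # import path assumed

    exp_result = ExperimentResult()  # assumed default-constructible results container
    for seed in experiment_config.random_seeds:  # assumed attribute on ExperimentConfig
        exp_result, env, model = agent.run_ppo(exp_result=exp_result, seed=seed)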

train() ExperimentExecution[source]

Runs the training process

Returns

the experiment execution containing the results
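
A minimal call sketch, assuming an agent constructed as shown above; whether the returned execution is also persisted to the metastore depends on the save_to_metastore flag passed to the constructor.

    execution = agent.train()  # ExperimentExecution with the results of the training run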

update_trajectory_buffers(global_step: int, envs: SyncVectorEnv, obs: Tensor, dones: Tensor, actions: Tensor, rewards: Tensor, device: device, logprobs: Tensor, values: Tensor, model: PPONetwork, next_obs: Tensor, next_done: Tensor, horizons: List[int]) Tuple[Tensor, Tensor, int, List[int]][source]

Updates the buffers of trajectories collected from the environment

Parameters
  • global_step – the current global step of the training iteration

  • envs – the vectorized environments (SyncVectorEnv)

  • obs – tensor of observations

  • dones – tensor of done flags

  • actions – tensor of actions

  • rewards – tensor of rewards

  • device – the device (CPU or GPU) to run the computation on

  • logprobs – tensor of log-probabilities of the actions

  • values – tensor of value estimates

  • model – the neural network model

  • next_obs – the next observation

  • next_done – whether the next state is done (logical OR of terminations and truncations)

  • horizons – list of time horizons

Returns

next_obs, next_done, global_step, horizons
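
A self-contained sketch of the CleanRL-style rollout step that this method performs; the buffer layout, num_steps, and model.get_action_and_value are assumptions based on the CleanRL reference implementation rather than the verbatim CSLE code.

    import numpy as np
    import torch

    def collect_rollout_sketch(envs, model, obs_buf, actions_buf, logprobs_buf, rewards_buf,
                               dones_buf, values_buf, next_obs, next_done, global_step,
                               num_steps, device):
        """Fills one rollout of length num_steps into the pre-allocated buffers."""
        for step in range(num_steps):
            global_step += envs.num_envs
            obs_buf[step] = next_obs
            dones_buf[step] = next_done
            with torch.no_grad():
                # CleanRL-style joint policy/value query (method name assumed)
                action, logprob, _, value = model.get_action_and_value(next_obs)
                values_buf[step] = value.flatten()
            actions_buf[step] = action
            logprobs_buf[step] = logprob
            obs_np, reward, terminations, truncations, infos = envs.step(action.cpu().numpy())
            next_done_np = np.logical_or(terminations, truncations)  # done = terminated or truncated
            rewards_buf[step] = torch.as_tensor(reward, dtype=torch.float32, device=device).view(-1)
            next_obs = torch.as_tensor(obs_np, dtype=torch.float32, device=device)
            next_done = torch.as_tensor(next_done_np, dtype=torch.float32, device=device)
        return next_obs, next_done, global_step

Note that the method in this module additionally takes and returns a list of time horizons as the fourth element of the tuple; that bookkeeping is omitted from the sketch.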

Module contents