csle_agents.agents.ppo_clean package
Submodules
csle_agents.agents.ppo_clean.ppo_clean_agent module
MIT License, Copyright (c) 2019 CleanRL developers (https://github.com/vwxyzjn/cleanrl)
- class csle_agents.agents.ppo_clean.ppo_clean_agent.PPOCleanAgent(simulation_env_config: csle_common.dao.simulation_config.simulation_env_config.SimulationEnvConfig, emulation_env_config: Union[None, csle_common.dao.emulation_config.emulation_env_config.EmulationEnvConfig], experiment_config: csle_common.dao.training.experiment_config.ExperimentConfig, training_job: Optional[csle_common.dao.jobs.training_job_config.TrainingJobConfig] = None, save_to_metastore: bool = True)[source]
Bases:
csle_agents.agents.base.base_agent.BaseAgent
A PPO agent using the implementation from CleanRL
- generalized_advantage_estimation(model: csle_common.models.ppo_network.PPONetwork, next_obs: torch.Tensor, rewards: torch.Tensor, device: torch.device, next_done: torch.Tensor, dones: torch.Tensor, values: torch.Tensor) Tuple[torch.Tensor, torch.Tensor] [source]
Computes the generalized advantage estimate (GAE), i.e., an exponentially weighted average of n-step returns. See "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (Schulman et al., ICLR 2016).
- Parameters
model – the neural network model
next_obs – the next observation
rewards – tensor of rewards collected from the environment
device – the torch device on which to run the computation
next_done – logical OR of the termination and truncation flags for the next step
dones – tensor of done flags
values – tensor of value function estimates
- Returns
returns, advantages
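For illustration, a minimal standalone sketch of the GAE recursion described above, written in plain PyTorch. This is not the agent's actual method; the gamma and gae_lambda defaults and the model.get_value bootstrap follow the standard CleanRL recipe and are assumptions here:

    import torch

    def gae_sketch(model, next_obs, rewards, device, next_done, dones, values,
                   gamma: float = 0.99, gae_lambda: float = 0.95):
        """Illustrative GAE computation (Schulman et al., 2016); not the agent's implementation."""
        num_steps = rewards.shape[0]
        with torch.no_grad():
            # Bootstrap the value of the state reached after the last collected step.
            # `model.get_value` is an assumed interface, following the CleanRL convention.
            next_value = model.get_value(next_obs).reshape(1, -1)
            advantages = torch.zeros_like(rewards).to(device)
            last_gae_lam = 0
            for t in reversed(range(num_steps)):
                if t == num_steps - 1:
                    next_non_terminal = 1.0 - next_done
                    next_values = next_value
                else:
                    next_non_terminal = 1.0 - dones[t + 1]
                    next_values = values[t + 1]
                # TD residual: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_{t+1}) - V(s_t)
                delta = rewards[t] + gamma * next_values * next_non_terminal - values[t]
                # Exponentially weighted average of n-step advantages (the GAE recursion)
                last_gae_lam = delta + gamma * gae_lambda * next_non_terminal * last_gae_lam
                advantages[t] = last_gae_lam
            returns = advantages + values
        return returns, advantages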
- make_env() Callable[[], gymnasium.wrappers.record_episode_statistics.RecordEpisodeStatistics] [source]
Helper function for creating the environment to use for training
- Returns
a function that creates the environment
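A minimal sketch of such an environment factory, using a standard gymnasium environment as a stand-in ("CartPole-v1" is a placeholder; the agent builds its CSLE simulation environment instead):

    from typing import Callable
    import gymnasium as gym
    from gymnasium.wrappers import RecordEpisodeStatistics

    def make_env_sketch(env_id: str = "CartPole-v1") -> Callable[[], RecordEpisodeStatistics]:
        """Returns a thunk that builds a wrapped environment, as expected by vectorized envs."""
        def thunk() -> RecordEpisodeStatistics:
            env = gym.make(env_id)
            # Record episodic returns and lengths in the `info` dict at episode end.
            env = RecordEpisodeStatistics(env)
            return env
        return thunk

    # The thunks can then be passed to gymnasium's SyncVectorEnv:
    envs = gym.vector.SyncVectorEnv([make_env_sketch() for _ in range(4)])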
- run_ppo(exp_result: csle_common.dao.training.experiment_result.ExperimentResult, seed: int) Tuple[csle_common.dao.training.experiment_result.ExperimentResult, csle_common.dao.simulation_config.base_env.BaseEnv, csle_common.models.ppo_network.PPONetwork] [source]
Runs PPO with a given seed
- Parameters
exp_result – the object to save the experiment results
seed – the random seed
- Returns
the updated experiment results, the environment, and the trained model
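A hedged sketch of how the method might be invoked across seeds; agent and exp_result are assumed to have been constructed elsewhere:

    # Hypothetical invocation; `agent` (PPOCleanAgent) and `exp_result` (ExperimentResult)
    # are assumed to be constructed elsewhere.
    for seed in [0, 1, 2]:
        exp_result, env, trained_model = agent.run_ppo(exp_result=exp_result, seed=seed)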
- train() csle_common.dao.training.experiment_execution.ExperimentExecution [source]
Runs the training process
- Returns
the results
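A minimal usage sketch, assuming the configuration objects have already been built (simulation_env_config and experiment_config are placeholders, not constructed here):

    from csle_agents.agents.ppo_clean.ppo_clean_agent import PPOCleanAgent

    # `simulation_env_config` and `experiment_config` are assumed to be built elsewhere.
    agent = PPOCleanAgent(
        simulation_env_config=simulation_env_config,
        emulation_env_config=None,           # no emulation environment in this sketch
        experiment_config=experiment_config,
        save_to_metastore=False,             # keep the results local in this sketch
    )
    execution = agent.train()  # returns an ExperimentExecution with the training results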
- update_trajectory_buffers(global_step: int, envs: gymnasium.vector.sync_vector_env.SyncVectorEnv, obs: torch.Tensor, dones: torch.Tensor, actions: torch.Tensor, rewards: torch.Tensor, device: torch.device, logprobs: torch.Tensor, values: torch.Tensor, model: csle_common.models.ppo_network.PPONetwork, next_obs: torch.Tensor, next_done: torch.Tensor, horizons: List[int]) Tuple[torch.Tensor, torch.Tensor, int, List[int]] [source]
Updates the buffers of trajectories collected from the environment
- Parameters
global_step – the global step counter of the training iteration
envs – the vectorized training environments
obs – tensor of observations
dones – tensor of done flags
actions – tensor of actions taken in the environments
rewards – tensor of rewards collected from the environments
device – the torch device on which to run the computation
logprobs – tensor of log-probabilities of the taken actions
values – tensor of value function estimates
model – the neural network model
next_obs – the next observation
next_done – logical OR of the termination and truncation flags for the next step
horizons – list of time horizons
- Returns
next_obs, next_done, global_step, and the updated list of horizons
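A simplified sketch of the rollout-collection step that this method performs, written against a generic gymnasium SyncVectorEnv. The model.get_action_and_value interface follows the CleanRL convention and is an assumption here; the horizon bookkeeping of the actual method is omitted:

    import numpy as np
    import torch

    def collect_rollout_sketch(envs, model, obs, dones, actions, rewards, logprobs, values,
                               next_obs, next_done, global_step, device, num_steps: int = 128):
        """Illustrative CleanRL-style rollout loop; not the agent's actual implementation."""
        for step in range(num_steps):
            global_step += envs.num_envs
            obs[step] = next_obs
            dones[step] = next_done
            with torch.no_grad():
                # `get_action_and_value` is an assumed interface, following CleanRL's network API.
                action, logprob, _, value = model.get_action_and_value(next_obs)
                values[step] = value.flatten()
            actions[step] = action
            logprobs[step] = logprob
            # Step the vectorized environments with the sampled actions.
            next_obs_np, reward, terminations, truncations, infos = envs.step(action.cpu().numpy())
            # next_done is the logical OR of the termination and truncation flags.
            next_done_np = np.logical_or(terminations, truncations)
            rewards[step] = torch.tensor(reward, dtype=torch.float32).to(device).view(-1)
            next_obs = torch.tensor(next_obs_np, dtype=torch.float32).to(device)
            next_done = torch.tensor(next_done_np, dtype=torch.float32).to(device)
        return next_obs, next_done, global_step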