csle_agents.agents.pi package
Submodules
csle_agents.agents.pi.pi_agent module
- class csle_agents.agents.pi.pi_agent.PIAgent(simulation_env_config: csle_common.dao.simulation_config.simulation_env_config.SimulationEnvConfig, experiment_config: csle_common.dao.training.experiment_config.ExperimentConfig, training_job: Optional[csle_common.dao.jobs.training_job_config.TrainingJobConfig] = None, save_to_metastore: bool = True, env: Optional[csle_common.dao.simulation_config.base_env.BaseEnv] = None)[source]
Bases:
csle_agents.agents.base.base_agent.BaseAgent
Policy Iteration Agent
- evaluate_policy(policy: numpy.ndarray[Any, numpy.dtype[Any]], eval_batch_size: int) float [source]
Evaluates a tabular policy
- Parameters
policy – the tabular policy to evaluate
eval_batch_size – the batch size
- Returns
the average reward of the evaluated policy
- expected_reward_under_policy(P: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], policy: numpy.ndarray[Any, numpy.dtype[Any]], num_states: int, num_actions: int) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Utility function for computing the expected immediate reward for each state in the MDP given a policy.
- Parameters
P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)
R – the reward function in the MDP (tensor num_actions x num_states x num_states)
policy – the policy (matrix num_states x num_actions)
num_states – the number of states
num_actions – the number of actions
- Returns
r: a vector of dimension <num_states> with the expected immediate reward for each state.
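The computation below is a minimal NumPy sketch of this step, assuming the documented tensor shapes (P and R are num_actions x num_states x num_states, policy is num_states x num_actions); the helper name is illustrative and not part of the agent's API:

```python
import numpy as np

def expected_reward_under_policy_sketch(P, R, policy, num_states, num_actions):
    # r[s] = sum_a policy[s, a] * sum_{s'} P[a, s, s'] * R[a, s, s']
    r = np.zeros(num_states)
    for s in range(num_states):
        for a in range(num_actions):
            r[s] += policy[s, a] * np.dot(P[a, s, :], R[a, s, :])
    return r
```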
- pi(P: numpy.ndarray[Any, numpy.dtype[Any]], policy: numpy.ndarray[Any, numpy.dtype[Any]], N: int, gamma: float, R: numpy.ndarray[Any, numpy.dtype[Any]], num_states: int, num_actions: int) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]], List[float], List[float]] [source]
The policy iteration algorithm, interleaves policy evaluation and policy improvement for N iterations. Guaranteed to converge to the optimal policy and value function.
- Parameters
P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)
policy – the policy (matrix num_states x num_actions)
N – the number of iterations (scalar)
gamma – the discount factor
R – the reward function in the MDP (tensor num_actions x num_states x num_states)
num_states – the number of states
num_actions – the number of actions
- Returns
a tuple whose first two elements are v (the state values after N iterations) and policy (the policy after N iterations), followed by two lists of floats collected during the iterations.
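As a hedged sketch of the overall loop (the evaluation and improvement helpers are the illustrative sketches shown under policy_evaluation and policy_improvement below, not the agent's actual methods):

```python
import numpy as np

def policy_iteration_sketch(P, R, policy, N, gamma, num_states, num_actions):
    # Interleave policy evaluation and greedy policy improvement for N iterations
    v = np.zeros(num_states)
    for _ in range(N):
        v = policy_evaluation_sketch(P, policy, R, gamma, num_states, num_actions)
        policy = policy_improvement_sketch(P, R, gamma, v, policy, num_states, num_actions)
    return v, policy
```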
- policy_evaluation(P: numpy.ndarray[Any, numpy.dtype[Any]], policy: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, num_states: int, num_actions: int) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Implements the policy evaluation step in the policy iteration dynamic programming algorithm. Uses the linear algebra interpretation of policy evaluation, solving it as a linear system.
- Parameters
P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)
policy – the policy (matrix num_states x num_actions)
R – the reward function in the MDP (tensor num_actions x num_states x num_states)
gamma – the discount factor
num_states – the number of states
num_actions – the number of actions
- Returns
v: the state values, a vector of dimension NUM_STATES
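A minimal sketch of the linear-algebra formulation, assuming the documented shapes: under a fixed policy the Bellman equation v = r_pi + gamma * P_pi * v is linear in v, so it can be solved directly as (I - gamma * P_pi) v = r_pi. The helper name is illustrative:

```python
import numpy as np

def policy_evaluation_sketch(P, policy, R, gamma, num_states, num_actions):
    # Transition matrix and expected immediate reward under the policy
    P_pi = np.einsum('sa,ast->st', policy, P)        # num_states x num_states
    r_pi = np.einsum('sa,ast,ast->s', policy, P, R)  # num_states
    # Solve the linear system (I - gamma * P_pi) v = r_pi
    return np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)
```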
- policy_improvement(P: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, v: numpy.ndarray[Any, numpy.dtype[Any]], pi: numpy.ndarray[Any, numpy.dtype[Any]], num_states: int, num_actions: int) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Implements the policy improvement step in the policy iteration dynamic programming algorithm.
- Parameters
P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)
R – the reward function in the MDP (tensor num_actions x num_states x num_states)
gamma – the discount factor
v – the state values (dimension NUM_STATES)
pi – the old policy (matrix num_states x num_actions)
num_states – the number of states
num_actions – the number of actions
- Returns
pi_prime: a new updated policy (dimensions num_states x num_actions)
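A minimal sketch of the greedy improvement step under the documented shapes; the helper name is illustrative, and the returned policy is deterministic (one-hot rows):

```python
import numpy as np

def policy_improvement_sketch(P, R, gamma, v, pi, num_states, num_actions):
    # Q[s, a] = sum_{s'} P[a, s, s'] * (R[a, s, s'] + gamma * v[s'])
    # (the old policy `pi` mirrors the documented signature but is not needed
    #  for the greedy update itself)
    Q = np.einsum('ast,ast->sa', P, R) + gamma * np.einsum('ast,t->sa', P, v)
    # Greedy, deterministic improved policy (one-hot rows)
    pi_prime = np.zeros((num_states, num_actions))
    pi_prime[np.arange(num_states), np.argmax(Q, axis=1)] = 1.0
    return pi_prime
```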
- policy_iteration(exp_result: csle_common.dao.training.experiment_result.ExperimentResult, seed: int) csle_common.dao.training.experiment_result.ExperimentResult [source]
Runs the policy iteration algorithm
- Parameters
exp_result – the experiment result object
seed – the random seed
- Returns
the updated experiment result
- train() csle_common.dao.training.experiment_execution.ExperimentExecution [source]
Runs the policy iteration algorithm to compute V*
- Returns
the results
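A hedged usage sketch, assuming valid SimulationEnvConfig and ExperimentConfig objects are already available (constructing them is environment-specific and omitted here):

```python
from csle_agents.agents.pi.pi_agent import PIAgent

# simulation_env_config and experiment_config are assumed to be pre-built
# SimulationEnvConfig and ExperimentConfig instances for the target environment
agent = PIAgent(simulation_env_config=simulation_env_config,
                experiment_config=experiment_config,
                save_to_metastore=False)
execution = agent.train()  # ExperimentExecution with the computed results
```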
- transition_probability_under_policy(P: numpy.ndarray[Any, numpy.dtype[Any]], policy: numpy.ndarray[Any, numpy.dtype[Any]], num_states: int) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Utility function for computing the state transition probabilities under the current policy. Assumes a deterministic policy (probability 1 of selecting an action in a state)
- Parameters
P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)
policy – the policy (matrix num_states x num_actions)
num_states – the number of states
- Returns
P_pi: the transition probabilities in the MDP under the given policy (dimensions num_states x num_states)
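A minimal sketch under the stated deterministic-policy assumption (one-hot rows in the policy matrix); the helper name is illustrative:

```python
import numpy as np

def transition_probability_under_policy_sketch(P, policy, num_states):
    # For a deterministic policy, take each state's selected action
    # and keep that action's row of transition probabilities
    actions = np.argmax(policy, axis=1)
    P_pi = np.stack([P[actions[s], s, :] for s in range(num_states)])
    return P_pi  # num_states x num_states
```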