csle_agents.agents.pi package
Submodules
csle_agents.agents.pi.pi_agent module
- class csle_agents.agents.pi.pi_agent.PIAgent(simulation_env_config: SimulationEnvConfig, experiment_config: ExperimentConfig, training_job: Optional[TrainingJobConfig] = None, save_to_metastore: bool = True, env: Optional[BaseEnv] = None, create_log_dir: bool = True, env_eval: bool = True, max_eval_length: int = 100, initial_eval_state: int = 0)[source]
Bases: BaseAgent
Policy Iteration Agent
- evaluate_policy(policy: ndarray[Any, dtype[Any]], eval_batch_size: int, P: ndarray[Any, dtype[Any]], R: ndarray[Any, dtype[Any]], num_states: int, num_actions: int) float [source]
Evaluates a tabular policy
- Parameters
policy – the tabular policy to evaluate
eval_batch_size – the batch size
P – the transition tensor
R – the reward tensor
num_states – the number of states
num_actions – the number of actions
- Returns
the average return of the policy
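Conceptually, a tabular policy can be evaluated by simulating a batch of episodes against the transition tensor P and reward tensor R and averaging the episode returns. The numpy sketch below illustrates this idea; the episode length and initial state mirror the constructor's max_eval_length and initial_eval_state defaults and are illustrative assumptions, not the class's exact implementation.
```python
import numpy as np

def evaluate_tabular_policy_sketch(policy, eval_batch_size, P, R, num_states,
                                   num_actions, max_eval_length=100,
                                   initial_state=0, seed=0):
    """Estimates the average return of a tabular policy by simulation.

    policy: (num_states x num_actions) matrix of action probabilities
    P, R: (num_actions x num_states x num_states) transition and reward tensors
    """
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(eval_batch_size):
        s = initial_state
        episode_return = 0.0
        for _ in range(max_eval_length):
            a = rng.choice(num_actions, p=policy[s])     # sample an action from the policy
            s_prime = rng.choice(num_states, p=P[a, s])  # sample the next state from P
            episode_return += R[a, s, s_prime]           # accumulate the reward
            s = s_prime
        returns.append(episode_return)
    return float(np.mean(returns))
```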
- expected_reward_under_policy(R: ndarray[Any, dtype[Any]], policy: ndarray[Any, dtype[Any]], num_states: int, num_actions: int) ndarray[Any, dtype[Any]] [source]
Utility function for computing the expected immediate reward for each state in the MDP given a policy.
- Parameters
R – the reward function in the MDP (tensor num_actions x num_states x num_states)
policy – the policy (matrix num_states x num_actions)
num_states – the number of states
num_actions – the number of actions
- Returns
r: a vector of dimension <num_states> with the expected immediate reward for each state.
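In formula form, the expected immediate reward of state s under a policy pi is r(s) = sum_a pi(s, a) * sum_s' P(a, s, s') * R(a, s, s'). The numpy sketch below illustrates this formula and makes the transition tensor P explicit as an argument for clarity; it is not necessarily the method's exact implementation, whose signature takes only R, the policy, and the dimensions.
```python
import numpy as np

def expected_reward_under_policy_sketch(R, policy, P):
    """r(s) = sum_a policy(s, a) * sum_s' P(a, s, s') * R(a, s, s').

    P, R: (num_actions x num_states x num_states); policy: (num_states x num_actions).
    Returns a vector of dimension num_states.
    """
    # Expected one-step reward of taking action a in state s
    r_sa = np.einsum('asn,asn->as', P, R)
    # Weight by the policy's action probabilities in each state
    return np.einsum('sa,as->s', policy, r_sa)
```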
- pi(P: ndarray[Any, dtype[Any]], policy: ndarray[Any, dtype[Any]], N: int, gamma: float, R: ndarray[Any, dtype[Any]], num_states: int, num_actions: int) Tuple[ndarray[Any, dtype[Any]], ndarray[Any, dtype[Any]], List[float], List[float]] [source]
The policy iteration algorithm interleaves policy evaluation and policy improvement for N iterations. For a finite MDP it is guaranteed to converge to the optimal policy and value function, given sufficiently many iterations.
- Parameters
P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)
policy – the policy (matrix num_states x num_actions)
N – the number of iterations (scalar)
gamma – the discount factor
R – the reward function in the MDP (tensor num_actions x num_states x num_states)
num_states – the number of states
num_actions – the number of actions
- Returns
a tuple of (v, policy) where v is the state values after N iterations and policy is the policy after N iterations.
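A self-contained numpy sketch of the interleaving (illustrative only; per its type signature, the method itself additionally returns two lists of floats collected during the iterations):
```python
import numpy as np

def policy_iteration_sketch(P, policy, N, gamma, R, num_states, num_actions):
    """Alternates exact policy evaluation and greedy policy improvement for N iterations.

    P, R: (num_actions x num_states x num_states); policy: (num_states x num_actions).
    Returns the state values and the (deterministic, one-hot) policy after N iterations.
    """
    v = np.zeros(num_states)
    for _ in range(N):
        # Policy evaluation: transition matrix and reward vector under the current policy
        P_pi = np.einsum('sa,asn->sn', policy, P)
        r_pi = np.einsum('sa,asn,asn->s', policy, P, R)
        v = np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to the Q-values induced by v
        q = np.einsum('asn,asn->sa', P, R) + gamma * np.einsum('asn,n->sa', P, v)
        policy = np.eye(num_actions)[np.argmax(q, axis=1)]
    return v, policy
```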
- policy_evaluation(P: ndarray[Any, dtype[Any]], policy: ndarray[Any, dtype[Any]], R: ndarray[Any, dtype[Any]], gamma: float, num_states: int, num_actions: int) ndarray[Any, dtype[Any]] [source]
Implements the policy evaluation step in the policy iteration dynamic programming algorithm. Uses the linear algebra interpretation of policy evaluation, solving it as a linear system.
- Parameters
P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)
policy – the policy (matrix num_states x num_actions)
R – the reward function in the MDP (tensor num_actions x num_states x num_states)
gamma – the discount factor
num_states – the number of states
num_actions – the number of actions
- Returns
v: the state values, a vector of dimension num_states
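For a fixed policy pi, the Bellman equation v_pi = r_pi + gamma * P_pi * v_pi is linear in v_pi, so v_pi = (I - gamma * P_pi)^{-1} * r_pi. A numpy sketch of this linear-system view (illustrative, not necessarily the exact implementation):
```python
import numpy as np

def policy_evaluation_sketch(P, policy, R, gamma, num_states, num_actions):
    """Solves v = r_pi + gamma * P_pi * v as a linear system.

    P, R: (num_actions x num_states x num_states); policy: (num_states x num_actions).
    """
    # Transition matrix under the policy: P_pi[s, s'] = sum_a policy[s, a] * P[a, s, s']
    P_pi = np.einsum('sa,asn->sn', policy, P)
    # Expected immediate reward per state under the policy
    r_pi = np.einsum('sa,asn,asn->s', policy, P, R)
    # Solve (I - gamma * P_pi) v = r_pi rather than explicitly inverting the matrix
    return np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)
```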
- policy_improvement(P: ndarray[Any, dtype[Any]], R: ndarray[Any, dtype[Any]], gamma: float, v: ndarray[Any, dtype[Any]], pi: ndarray[Any, dtype[Any]], num_states: int, num_actions: int) ndarray[Any, dtype[Any]] [source]
Implements the policy improvement step in the policy iteration dynamic programming algorithm.
- Parameters
P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)
R – the reward function in the MDP (tensor num_actions x num_states x num_states)
gamma – the discount factor
v – the state values (vector of dimension num_states)
pi – the old policy (matrix num_states x num_actions)
num_states – the number of states
num_actions – the number of actions
- Returns
pi_prime: a new updated policy (dimensions num_states x num_actions)
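Greedy improvement selects, in every state, the action that maximizes Q_v(s, a) = sum_s' P(a, s, s') * (R(a, s, s') + gamma * v(s')). A numpy sketch (illustrative):
```python
import numpy as np

def policy_improvement_sketch(P, R, gamma, v, pi, num_states, num_actions):
    """Greedy improvement: pi'(s) = argmax_a sum_s' P(a,s,s') * (R(a,s,s') + gamma * v(s'))."""
    # Q-values of every (state, action) pair under the current value estimate v
    q = np.einsum('asn,asn->sa', P, R) + gamma * np.einsum('asn,n->sa', P, v)
    # New deterministic policy as a one-hot (num_states x num_actions) matrix
    pi_prime = np.eye(num_actions)[np.argmax(q, axis=1)]
    return pi_prime
```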
- policy_iteration(exp_result: ExperimentResult, seed: int) ExperimentResult [source]
Runs the policy iteration algorithm
- Parameters
exp_result – the experiment result object
seed – the random seed
- Returns
the updated experiment result
- train() ExperimentExecution [source]
Runs the policy iteration algorithm to compute V*
- Returns
the results
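A hedged usage sketch of the agent: the simulation and experiment configuration objects are assumed to be constructed or loaded elsewhere (for example from the CSLE metastore), and the keyword arguments shown are taken from the constructor signature above.
```python
from csle_agents.agents.pi.pi_agent import PIAgent

# simulation_env_config (SimulationEnvConfig) and experiment_config (ExperimentConfig)
# are assumed to be created elsewhere, e.g. loaded from the CSLE metastore.
agent = PIAgent(simulation_env_config=simulation_env_config,
                experiment_config=experiment_config,
                save_to_metastore=False)
experiment_execution = agent.train()  # runs policy iteration, returns an ExperimentExecution
```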
- transition_probability_under_policy(P: ndarray[Any, dtype[Any]], policy: ndarray[Any, dtype[Any]], num_states: int) ndarray[Any, dtype[Any]] [source]
Utility function for computing the state transition probabilities under the current policy. Assumes a deterministic policy (probability 1 of selecting an action in a state)
- Parameters
P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)
policy – the policy (matrix num_states x num_actions)
num_states – the number of states
- Returns
P_pi: the transition probabilities in the MDP under the given policy (dimensions num_states x num_states)
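Under a policy pi, P_pi(s, s') = sum_a pi(s, a) * P(a, s, s'); for a deterministic (one-hot) policy this simply selects the transition row of the chosen action. A numpy sketch:
```python
import numpy as np

def transition_probability_under_policy_sketch(P, policy, num_states):
    """P_pi[s, s'] = sum_a policy[s, a] * P[a, s, s'].

    P: (num_actions x num_states x num_states); policy: (num_states x num_actions).
    Returns a (num_states x num_states) matrix.
    """
    return np.einsum('sa,asn->sn', policy, P)
```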