csle_agents.agents.pi package

Submodules

csle_agents.agents.pi.pi_agent module

class csle_agents.agents.pi.pi_agent.PIAgent(simulation_env_config: SimulationEnvConfig, experiment_config: ExperimentConfig, training_job: Optional[TrainingJobConfig] = None, save_to_metastore: bool = True, env: Optional[BaseEnv] = None, create_log_dir: bool = True, env_eval: bool = True, max_eval_length: int = 100, initial_eval_state: int = 0)[source]

Bases: BaseAgent

Policy Iteration Agent

evaluate_policy(policy: ndarray[Any, dtype[Any]], eval_batch_size: int, P: ndarray[Any, dtype[Any]], R: ndarray[Any, dtype[Any]], num_states: int, num_actions: int) float[source]

Evaluates a tabular policy

Parameters
  • policy – the tabular policy to evaluate

  • eval_batch_size – the batch size

  • P – the transition tensor

  • R – the reward tensor

  • num_states – the number of states

  • num_actions – the number of actions

Returns

the average return of the policy
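
One way such an average return can be computed is a Monte Carlo estimate over sampled episodes. The following is a minimal sketch under assumed defaults for the episode horizon and initial state (corresponding to the constructor's max_eval_length and initial_eval_state arguments); the agent's exact rollout logic may differ.

import numpy as np

def evaluate_tabular_policy_sketch(policy, eval_batch_size, P, R, num_states,
                                   num_actions, horizon=100, initial_state=0):
    # Sample eval_batch_size episodes, follow the tabular policy, and
    # average the accumulated rewards (undiscounted here for simplicity).
    returns = []
    for _ in range(eval_batch_size):
        s = initial_state
        episode_return = 0.0
        for _ in range(horizon):
            a = np.random.choice(num_actions, p=policy[s])
            s_prime = np.random.choice(num_states, p=P[a][s])
            episode_return += R[a][s][s_prime]
            s = s_prime
        returns.append(episode_return)
    return float(np.mean(returns))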

expected_reward_under_policy(R: ndarray[Any, dtype[Any]], policy: ndarray[Any, dtype[Any]], num_states: int, num_actions: int) ndarray[Any, dtype[Any]][source]

Utility function for computing the expected immediate reward for each state in the MDP given a policy.

Parameters
  • R – the reward function in the MDP (tensor num_actions x num_states x num_states)

  • policy – the policy (matrix num_states x num_actions)

  • num_states – the number of states

  • num_actions – the number of actions

Returns

r: a vector of dimension num_states with the expected immediate reward for each state.
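
In vector form this is r_pi[s] = sum_a policy[s, a] * sum_s' P[a, s, s'] * R[a, s, s']. The sketch below assumes the transition tensor P is available (it is not part of this method's signature; it appears here only to make the expectation over next states explicit).

import numpy as np

def expected_reward_under_policy_sketch(R, policy, P):
    # r_pi[s] = sum_a policy[s, a] * sum_{s'} P[a, s, s'] * R[a, s, s']
    # R, P: (num_actions, num_states, num_states); policy: (num_states, num_actions)
    return np.einsum('sa,ast,ast->s', policy, P, R)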

hparam_names() List[str][source]

Returns

a list with the hyperparameter names

pi(P: ndarray[Any, dtype[Any]], policy: ndarray[Any, dtype[Any]], N: int, gamma: float, R: ndarray[Any, dtype[Any]], num_states: int, num_actions: int) Tuple[ndarray[Any, dtype[Any]], ndarray[Any, dtype[Any]], List[float], List[float]][source]

The policy iteration algorithm; it interleaves policy evaluation and policy improvement for N iterations and is guaranteed to converge to the optimal policy and value function.

Parameters
  • P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)

  • policy – the policy (matrix num_states x num_actions)

  • N – the number of iterations (scalar)

  • gamma – the discount factor

  • R – the reward function in the MDP (tensor num_actions x num_states x num_states)

  • num_states – the number of states

  • num_actions – the number of actions

Returns

a tuple whose first two elements are v (the state values after N iterations) and policy (the policy after N iterations).
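
The loop itself is a straightforward alternation of the two steps documented below. A sketch in terms of this class's own policy_evaluation and policy_improvement methods (the bookkeeping of evaluation results that the method also produces is omitted):

import numpy as np

def pi_sketch(agent, P, policy, N, gamma, R, num_states, num_actions):
    # Alternate policy evaluation and greedy policy improvement N times.
    v = np.zeros(num_states)
    for _ in range(N):
        v = agent.policy_evaluation(P, policy, R, gamma, num_states, num_actions)
        policy = agent.policy_improvement(P, R, gamma, v, policy, num_states, num_actions)
    return v, policy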

policy_evaluation(P: ndarray[Any, dtype[Any]], policy: ndarray[Any, dtype[Any]], R: ndarray[Any, dtype[Any]], gamma: float, num_states: int, num_actions: int) ndarray[Any, dtype[Any]][source]

Implements the policy evaluation step in the policy iteration dynamic programming algorithm. Uses the linear algebra interpretation of policy evaluation, solving it as a linear system.

Parameters
  • P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)

  • policy – the policy (matrix num_states x num_actions)

  • R – the reward function in the MDP (tensor num_actions x num_states x num_states)

  • gamma – the discount factor

  • num_states – the number of states

  • num_actions – the number of actions

Returns

v: the state values, a vector of dimension num_states
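
Concretely, the linear-system view solves (I - gamma * P_pi) v = r_pi, where P_pi and r_pi are the transition matrix and expected immediate rewards induced by the policy. A minimal numpy sketch of this standard formulation (not necessarily line for line the agent's code):

import numpy as np

def policy_evaluation_sketch(P, policy, R, gamma, num_states, num_actions):
    # P_pi[s, s']: transition probabilities induced by the policy
    P_pi = np.einsum('sa,ast->st', policy, P)
    # r_pi[s]: expected immediate reward in state s under the policy
    r_pi = np.einsum('sa,ast,ast->s', policy, P, R)
    # Solve the linear system (I - gamma * P_pi) v = r_pi
    return np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)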

policy_improvement(P: ndarray[Any, dtype[Any]], R: ndarray[Any, dtype[Any]], gamma: float, v: ndarray[Any, dtype[Any]], pi: ndarray[Any, dtype[Any]], num_states: int, num_actions: int) ndarray[Any, dtype[Any]][source]

Implements the policy improvement step in the policy iteration dynamic programming algorithm.

Parameters
  • P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)

  • R – the reward function in the MDP (tensor num_actions x num_states x num_states)

  • gamma – the discount factor

  • v – the state values (vector of dimension num_states)

  • pi – the old policy (matrix num_states x num_actions)

  • num_states – the number of states

  • num_actions – the number of actions

Returns

pi_prime: a new updated policy (dimensions num_states x num_actions)
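
The improvement step is the usual greedy update with respect to the action values implied by v. A sketch, assuming the updated policy is deterministic (one-hot rows):

import numpy as np

def policy_improvement_sketch(P, R, gamma, v, num_states, num_actions):
    # Q[s, a] = sum_{s'} P[a, s, s'] * (R[a, s, s'] + gamma * v[s'])
    Q = np.einsum('ast,ast->sa', P, R) + gamma * np.einsum('ast,t->sa', P, v)
    # Greedy, deterministic policy: probability 1 on the argmax action
    pi_prime = np.zeros((num_states, num_actions))
    pi_prime[np.arange(num_states), np.argmax(Q, axis=1)] = 1.0
    return pi_prime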

policy_iteration(exp_result: ExperimentResult, seed: int) ExperimentResult[source]

Runs the policy iteration algorithm

Parameters
  • exp_result – the experiment result object

  • seed – the random seed

Returns

the updated experiment result

train() ExperimentExecution[source]

Runs the policy iteration algorithm to compute V*

Returns

the experiment execution with the results
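
A usage sketch for running the agent end to end; the simulation and experiment configurations are assumed to have been constructed or loaded elsewhere (e.g. fetched from the metastore), which is deployment specific.

from csle_agents.agents.pi.pi_agent import PIAgent

# simulation_env_config and experiment_config are assumed to exist already
agent = PIAgent(simulation_env_config=simulation_env_config,
                experiment_config=experiment_config,
                save_to_metastore=False)
experiment_execution = agent.train()  # runs policy iteration and returns the results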

transition_probability_under_policy(P: ndarray[Any, dtype[Any]], policy: ndarray[Any, dtype[Any]], num_states: int) ndarray[Any, dtype[Any]][source]

Utility function for computing the state transition probabilities under the current policy. Assumes a deterministic policy (probability 1 of selecting a single action in each state).

Parameters
  • P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)

  • policy – the policy (matrix num_states x num_actions)

  • num_states – the number of states

Returns

P_pi: the transition probabilities in the MDP under the given policy (dimensions num_states x num_states)
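
For a deterministic policy, each row of P_pi is simply the row of P that corresponds to the action selected in that state. A minimal sketch:

import numpy as np

def transition_probability_under_policy_sketch(P, policy, num_states):
    # Row s of P_pi is the transition distribution of the action the
    # (deterministic) policy selects in state s.
    P_pi = np.zeros((num_states, num_states))
    for s in range(num_states):
        a = int(np.argmax(policy[s]))
        P_pi[s] = P[a][s]
    return P_pi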

Module contents