csle_agents.agents.pi package

Submodules

csle_agents.agents.pi.pi_agent module

class csle_agents.agents.pi.pi_agent.PIAgent(simulation_env_config: csle_common.dao.simulation_config.simulation_env_config.SimulationEnvConfig, experiment_config: csle_common.dao.training.experiment_config.ExperimentConfig, training_job: Optional[csle_common.dao.jobs.training_job_config.TrainingJobConfig] = None, save_to_metastore: bool = True, env: Optional[csle_common.dao.simulation_config.base_env.BaseEnv] = None)[source]

Bases: csle_agents.agents.base.base_agent.BaseAgent

Policy Iteration Agent

evaluate_policy(policy: numpy.ndarray[Any, numpy.dtype[Any]], eval_batch_size: int) float[source]

Evaluates a tabular policy

Parameters
  • policy – the tabular policy to evaluate

  • eval_batch_size – the batch size

Returns

the average evaluation reward of the policy
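
A minimal sketch of how such a tabular-policy evaluation can be done. It assumes a gym-style environment whose reset() returns a discrete state index and whose step() returns (state, reward, done, info); the environment and the helper name are illustrative assumptions, not the library's implementation:

import numpy as np

def evaluate_tabular_policy_sketch(env, policy: np.ndarray, eval_batch_size: int) -> float:
    """Monte Carlo estimate of the average episode return of a tabular policy (hypothetical helper)."""
    returns = []
    for _ in range(eval_batch_size):
        s = env.reset()                       # assumed: reset() returns a discrete state index
        done, episode_return = False, 0.0
        while not done:
            a = int(np.argmax(policy[s]))     # deterministic tabular policy: pick the greedy action
            s, r, done, _ = env.step(a)       # assumed gym-style step() signature
            episode_return += r
        returns.append(episode_return)
    return float(np.mean(returns))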

expected_reward_under_policy(P: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], policy: numpy.ndarray[Any, numpy.dtype[Any]], num_states: int, num_actions: int) numpy.ndarray[Any, numpy.dtype[Any]][source]

Utility function for computing the expected immediate reward for each state in the MDP given a policy.

Parameters
  • P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)

  • R – the reward function in the MDP (tensor num_actions x num_states x num_states)

  • policy – the policy (matrix num_states x num_actions)

  • num_states – the number of states

  • num_actions – the number of actions

Returns

r: a vector of dimension <num_states> with the expected immediate reward for each state.
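
A small NumPy sketch of this computation, r[s] = sum_a policy[s, a] * sum_s' P[a, s, s'] * R[a, s, s'] (an illustration of the documented interface, not the library's source):

import numpy as np

def expected_reward_under_policy_sketch(P: np.ndarray, R: np.ndarray, policy: np.ndarray,
                                        num_states: int, num_actions: int) -> np.ndarray:
    """Expected immediate reward per state under the given policy."""
    r = np.zeros(num_states)
    for s in range(num_states):
        for a in range(num_actions):
            # reward of action a in state s, averaged over next states and weighted by the policy
            r[s] += policy[s, a] * np.sum(P[a, s, :] * R[a, s, :])
    return r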

hparam_names() List[str][source]

Returns

a list with the hyperparameter names

pi(P: numpy.ndarray[Any, numpy.dtype[Any]], policy: numpy.ndarray[Any, numpy.dtype[Any]], N: int, gamma: float, R: numpy.ndarray[Any, numpy.dtype[Any]], num_states: int, num_actions: int) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]], List[float], List[float]][source]

The policy iteration algorithm, interleaves policy evaluation and policy improvement for N iterations. Guaranteed to converge to the optimal policy and value function.

Parameters
  • P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)

  • policy – the policy (matrix num_states x num_actions)

  • N – the number of iterations (scalar)

  • gamma – the discount factor

  • R – the reward function in the MDP (tensor num_actions x num_states x num_states)

  • num_states – the number of states

  • num_actions – the number of actions

Returns

a tuple of (v, policy) where v is the state values after N iterations and policy is the policy after N iterations.
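
A self-contained NumPy sketch of the interleaved evaluation/improvement loop using the tensor shapes documented above; it illustrates the algorithm rather than the library's implementation and returns only the value function and policy:

import numpy as np
from typing import Tuple

def policy_iteration_sketch(P: np.ndarray, policy: np.ndarray, N: int, gamma: float, R: np.ndarray,
                            num_states: int, num_actions: int) -> Tuple[np.ndarray, np.ndarray]:
    """Alternate exact policy evaluation and greedy policy improvement for N iterations."""
    v = np.zeros(num_states)
    for _ in range(N):
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi
        P_pi = np.einsum('sa,ast->st', policy, P)            # num_states x num_states
        r_pi = np.einsum('sa,ast,ast->s', policy, P, R)      # expected immediate reward per state
        v = np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to Q(s, a)
        q = np.einsum('ast,ast->sa', P, R) + gamma * np.einsum('ast,t->sa', P, v)
        policy = np.eye(num_actions)[np.argmax(q, axis=1)]   # one-hot (deterministic) greedy policy
    return v, policy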

policy_evaluation(P: numpy.ndarray[Any, numpy.dtype[Any]], policy: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, num_states: int, num_actions: int) numpy.ndarray[Any, numpy.dtype[Any]][source]

Implements the policy evaluation step in the policy iteration dynamic programming algorithm. Uses the linear algebra interpretation of policy evaluation, solving it as a linear system.

Parameters
  • P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)

  • policy – the policy (matrix num_states x num_actions)

  • R – the reward function in the MDP (tensor num_actions x num_states x num_states)

  • gamma – the discount factor

  • num_states – the number of states

  • num_actions – the number of actions

Returns

v: the state values, a vector of dimension NUM_STATES
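
In the linear-algebra formulation, the value function satisfies v = r_pi + gamma * P_pi v, so it can be obtained by solving the linear system (I - gamma * P_pi) v = r_pi. A minimal sketch under the documented tensor shapes (illustrative, not the library's source):

import numpy as np

def policy_evaluation_sketch(P: np.ndarray, policy: np.ndarray, R: np.ndarray, gamma: float,
                             num_states: int, num_actions: int) -> np.ndarray:
    """Exact policy evaluation by solving (I - gamma * P_pi) v = r_pi."""
    P_pi = np.zeros((num_states, num_states))   # transition matrix under the policy
    r_pi = np.zeros(num_states)                 # expected immediate reward under the policy
    for s in range(num_states):
        for a in range(num_actions):
            P_pi[s, :] += policy[s, a] * P[a, s, :]
            r_pi[s] += policy[s, a] * np.sum(P[a, s, :] * R[a, s, :])
    return np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)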

policy_improvement(P: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, v: numpy.ndarray[Any, numpy.dtype[Any]], pi: numpy.ndarray[Any, numpy.dtype[Any]], num_states: int, num_actions: int) numpy.ndarray[Any, numpy.dtype[Any]][source]

Implements the policy improvement step in the policy iteration dynamic programming algorithm.

Parameters
  • P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)

  • R – the reward function in the MDP (tensor num_actions x num_states x num_states)

  • gamma – the discount factor

  • v – the state values (dimension NUM_STATES)

  • pi – the old policy (matrix num_states x num_actions)

  • num_states – the number of states

  • num_actions – the number of actions

Returns

pi_prime: a new updated policy (dimensions num_states x num_actions)
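
The improvement step selects, in every state, the action that is greedy with respect to the one-step lookahead value Q(s, a) = sum_s' P[a, s, s'] * (R[a, s, s'] + gamma * v[s']). A minimal sketch (illustrative, not the library's source):

import numpy as np

def policy_improvement_sketch(P: np.ndarray, R: np.ndarray, gamma: float, v: np.ndarray,
                              num_states: int, num_actions: int) -> np.ndarray:
    """Greedy policy with respect to the one-step lookahead action values."""
    pi_prime = np.zeros((num_states, num_actions))
    for s in range(num_states):
        q = np.array([np.sum(P[a, s, :] * (R[a, s, :] + gamma * v)) for a in range(num_actions)])
        pi_prime[s, np.argmax(q)] = 1.0   # deterministic greedy policy
    return pi_prime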

policy_iteration(exp_result: csle_common.dao.training.experiment_result.ExperimentResult, seed: int) csle_common.dao.training.experiment_result.ExperimentResult[source]

Runs the policy iteration algorithm

Parameters
  • exp_result – the experiment result object

  • seed – the random seed

Returns

the updated experiment result

train() csle_common.dao.training.experiment_execution.ExperimentExecution[source]

Runs the policy iteration algorithm to compute V*

Returns

the experiment execution with the results
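
A hypothetical usage sketch of the agent. It assumes that simulation_env_config and experiment_config objects have already been constructed or loaded (not shown here) and only exercises the constructor and train() signatures documented above:

from csle_agents.agents.pi.pi_agent import PIAgent

# simulation_env_config and experiment_config are assumed to be available already,
# e.g. loaded from the metastore or built programmatically (placeholders here)
agent = PIAgent(simulation_env_config=simulation_env_config,
                experiment_config=experiment_config,
                save_to_metastore=False)
experiment_execution = agent.train()   # runs policy iteration and returns the results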

transition_probability_under_policy(P: numpy.ndarray[Any, numpy.dtype[Any]], policy: numpy.ndarray[Any, numpy.dtype[Any]], num_states: int) numpy.ndarray[Any, numpy.dtype[Any]][source]

Utility function for computing the state transition probabilities under the current policy. Assumes a deterministic policy (probability 1 of selecting an action in a state)

Parameters
  • P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)

  • policy – the policy (matrix num_states x num_actions)

  • num_states – the number of states

Returns

P_pi: the transition probabilities in the MDP under the given policy (dimensions num_states x num_states)
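
A minimal sketch for a deterministic policy (illustrative, not the library's source): the row of P selected by the policy's action in each state becomes the corresponding row of P_pi.

import numpy as np

def transition_probability_under_policy_sketch(P: np.ndarray, policy: np.ndarray,
                                               num_states: int) -> np.ndarray:
    """P_pi[s, :] = P[a_s, s, :], where a_s is the action the deterministic policy selects in state s."""
    P_pi = np.zeros((num_states, num_states))
    for s in range(num_states):
        a = int(np.argmax(policy[s]))   # deterministic policy: probability 1 on a single action
        P_pi[s, :] = P[a, s, :]
    return P_pi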

Module contents