csle_agents.agents.pi package

Submodules

csle_agents.agents.pi.pi_agent module

class csle_agents.agents.pi.pi_agent.PIAgent(simulation_env_config: SimulationEnvConfig, experiment_config: ExperimentConfig, training_job: Optional[TrainingJobConfig] = None, save_to_metastore: bool = True, env: Optional[BaseEnv] = None, create_log_dir: bool = True, env_eval: bool = True, max_eval_length: int = 100, initial_eval_state: int = 0)[source]

Bases: BaseAgent

Policy Iteration Agent

evaluate_policy(policy: ndarray[Any, dtype[Any]], eval_batch_size: int, P: ndarray[Any, dtype[Any]], R: ndarray[Any, dtype[Any]], num_states: int, num_actions: int) float[source]

Evaluates a tabular policy

Parameters
  • policy – the tabular policy to evaluate

  • eval_batch_size – the batch size

  • P – the transition tensor

  • R – the reward tensor

  • num_states – the number of states

  • num_actions – the number of actions

Returns

the average return of the policy
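
One way such an average return can be computed is a Monte Carlo estimate over sampled episodes. The following is a minimal sketch under assumed defaults for the episode horizon and initial state (corresponding to the constructor's max_eval_length and initial_eval_state arguments); the agent's exact rollout logic may differ.

import numpy as np

def evaluate_tabular_policy_sketch(policy, eval_batch_size, P, R, num_states,
                                   num_actions, horizon=100, initial_state=0):
    # Sample eval_batch_size episodes, follow the tabular policy, and
    # average the accumulated rewards (undiscounted here for simplicity).
    returns = []
    for _ in range(eval_batch_size):
        s = initial_state
        episode_return = 0.0
        for _ in range(horizon):
            a = np.random.choice(num_actions, p=policy[s])
            s_prime = np.random.choice(num_states, p=P[a][s])
            episode_return += R[a][s][s_prime]
            s = s_prime
        returns.append(episode_return)
    return float(np.mean(returns))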

expected_reward_under_policy(R: ndarray[Any, dtype[Any]], policy: ndarray[Any, dtype[Any]], num_states: int, num_actions: int) ndarray[Any, dtype[Any]][source]

Utility function for computing the expected immediate reward for each state in the MDP given a policy.

Parameters
  • R – the reward function in the MDP (tensor num_actions x num_states x num_states)

  • policy – the policy (matrix num_states x num_actions)

  • num_states – the number of states

  • num_actions – the number of actions

Returns

r: a vector of dimension num_states with the expected immediate reward for each state.
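
In vector form this is r_pi[s] = sum_a policy[s, a] * sum_s' P[a, s, s'] * R[a, s, s']. The sketch below assumes the transition tensor P is available (it is not part of this method's signature; it appears here only to make the expectation over next states explicit).

import numpy as np

def expected_reward_under_policy_sketch(R, policy, P):
    # r_pi[s] = sum_a policy[s, a] * sum_{s'} P[a, s, s'] * R[a, s, s']
    # R, P: (num_actions, num_states, num_states); policy: (num_states, num_actions)
    return np.einsum('sa,ast,ast->s', policy, P, R)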

hparam_names() List[str][source]

Returns

a list with the hyperparameter names

pi(P: ndarray[Any, dtype[Any]], policy: ndarray[Any, dtype[Any]], N: int, gamma: float, R: ndarray[Any, dtype[Any]], num_states: int, num_actions: int) Tuple[ndarray[Any, dtype[Any]], ndarray[Any, dtype[Any]], List[float], List[float]][source]

The policy iteration algorithm; it interleaves policy evaluation and policy improvement for N iterations and is guaranteed to converge to the optimal policy and value function.

Parameters
  • P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)

  • policy – the policy (matrix num_states x num_actions)

  • N – the number of iterations (scalar)

  • gamma – the discount factor

  • R – the reward function in the MDP (tensor num_actions x num_states x num_states)

  • num_states – the number of states

  • num_actions – the number of actions

Returns

a tuple whose first two elements are v (the state values after N iterations) and policy (the policy after N iterations).
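
The loop itself is a straightforward alternation of the two steps documented below. A sketch in terms of this class's own policy_evaluation and policy_improvement methods (the bookkeeping of evaluation results that the method also produces is omitted):

import numpy as np

def pi_sketch(agent, P, policy, N, gamma, R, num_states, num_actions):
    # Alternate policy evaluation and greedy policy improvement N times.
    v = np.zeros(num_states)
    for _ in range(N):
        v = agent.policy_evaluation(P, policy, R, gamma, num_states, num_actions)
        policy = agent.policy_improvement(P, R, gamma, v, policy, num_states, num_actions)
    return v, policy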

policy_evaluation(P: ndarray[Any, dtype[Any]], policy: ndarray[Any, dtype[Any]], R: ndarray[Any, dtype[Any]], gamma: float, num_states: int, num_actions: int) ndarray[Any, dtype[Any]][source]

Implements the policy evaluation step in the policy iteration dynamic programming algorithm. Uses the linear algebra interpretation of policy evaluation, solving it as a linear system.

Parameters
  • P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)

  • policy – the policy (matrix num_states x num_actions)

  • R – the reward function in the MDP (tensor num_actions x num_states x num_states)

  • gamma – the discount factor

  • num_states – the number of states

  • num_actions – the number of actions

Returns

v: the state values, a vector of dimension num_states
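
Concretely, the linear-system view solves (I - gamma * P_pi) v = r_pi, where P_pi and r_pi are the transition matrix and expected immediate rewards induced by the policy. A minimal numpy sketch of this standard formulation (not necessarily line for line the agent's code):

import numpy as np

def policy_evaluation_sketch(P, policy, R, gamma, num_states, num_actions):
    # P_pi[s, s']: transition probabilities induced by the policy
    P_pi = np.einsum('sa,ast->st', policy, P)
    # r_pi[s]: expected immediate reward in state s under the policy
    r_pi = np.einsum('sa,ast,ast->s', policy, P, R)
    # Solve the linear system (I - gamma * P_pi) v = r_pi
    return np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)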

policy_improvement(P: ndarray[Any, dtype[Any]], R: ndarray[Any, dtype[Any]], gamma: float, v: ndarray[Any, dtype[Any]], pi: ndarray[Any, dtype[Any]], num_states: int, num_actions: int) ndarray[Any, dtype[Any]][source]

Implements the policy improvement step in the policy iteration dynamic programming algorithm.

Parameters
  • P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)

  • R – the reward function in the MDP (tensor num_actions x num_states x num_states)

  • gamma – the discount factor

  • v – the state values (vector of dimension num_states)

  • pi – the old policy (matrix num_states x num_actions)

  • num_states – the number of states

  • num_actions – the number of actions

Returns

pi_prime: a new updated policy (dimensions num_states x num_actions)
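
The improvement step is the usual greedy update with respect to the action values implied by v. A sketch, assuming the updated policy is deterministic (one-hot rows):

import numpy as np

def policy_improvement_sketch(P, R, gamma, v, num_states, num_actions):
    # Q[s, a] = sum_{s'} P[a, s, s'] * (R[a, s, s'] + gamma * v[s'])
    Q = np.einsum('ast,ast->sa', P, R) + gamma * np.einsum('ast,t->sa', P, v)
    # Greedy, deterministic policy: probability 1 on the argmax action
    pi_prime = np.zeros((num_states, num_actions))
    pi_prime[np.arange(num_states), np.argmax(Q, axis=1)] = 1.0
    return pi_prime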

policy_iteration(exp_result: ExperimentResult, seed: int) ExperimentResult[source]

Runs the policy iteration algorithm

Parameters
  • exp_result – the experiment result object

  • seed – the random seed

Returns

the updated experiment result

train() ExperimentExecution[source]

Runs the policy iteration algorithm to compute V*

Returns

the experiment execution with the results
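
A usage sketch for running the agent end to end; the simulation and experiment configurations are assumed to have been constructed or loaded elsewhere (e.g. fetched from the metastore), which is deployment specific.

from csle_agents.agents.pi.pi_agent import PIAgent

# simulation_env_config and experiment_config are assumed to exist already
agent = PIAgent(simulation_env_config=simulation_env_config,
                experiment_config=experiment_config,
                save_to_metastore=False)
experiment_execution = agent.train()  # runs policy iteration and returns the results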

transition_probability_under_policy(P: ndarray[Any, dtype[Any]], policy: ndarray[Any, dtype[Any]], num_states: int) ndarray[Any, dtype[Any]][source]

Utility function for computing the state transition probabilities under the current policy. Assumes a deterministic policy (probability 1 of selecting a single action in each state).

Parameters
  • P – the state transition probabilities for all actions in the MDP (tensor num_actions x num_states x num_states)

  • policy – the policy (matrix num_states x num_actions)

  • num_states – the number of states

Returns

P_pi: the transition probabilities in the MDP under the given policy (dimensions num_states x num_states)
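
For a deterministic policy, each row of P_pi is simply the row of P that corresponds to the action selected in that state. A minimal sketch:

import numpy as np

def transition_probability_under_policy_sketch(P, policy, num_states):
    # Row s of P_pi is the transition distribution of the action the
    # (deterministic) policy selects in state s.
    P_pi = np.zeros((num_states, num_states))
    for s in range(num_states):
        a = int(np.argmax(policy[s]))
        P_pi[s] = P[a][s]
    return P_pi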

Module contents