csle_agents.agents.hsvi package

Submodules

csle_agents.agents.hsvi.hsvi_agent module

class csle_agents.agents.hsvi.hsvi_agent.HSVIAgent(simulation_env_config: csle_common.dao.simulation_config.simulation_env_config.SimulationEnvConfig, experiment_config: csle_common.dao.training.experiment_config.ExperimentConfig, training_job: Optional[csle_common.dao.jobs.training_job_config.TrainingJobConfig] = None, save_to_metastore: bool = True, env: Optional[csle_common.dao.simulation_config.base_env.BaseEnv] = None)[source]

Bases: csle_agents.agents.base.base_agent.BaseAgent

Agent implementing Heuristic Search Value Iteration (HSVI) for POMDPs (Trey Smith and Reid Simmons, 2004)

approximate_projection_sawtooth(upper_bound: Tuple[List[Any], List[Any]], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

Reference: (Hauskrecht 2000)

Performs an approximate projection of the belief onto the convex hull of the upper bound to compute the upper bound value of the belief

Parameters
  • upper_bound – the upper bound

  • b – the belief point

  • S – the set of states

Returns

the value of the belief point
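
For reference, a minimal sketch of the sawtooth approximation that this method is based on, written as a free function. It assumes the upper bound is a pair (corner_values, interior_points), where corner_values[s] is the bound value at the simplex corner for state s and interior_points is a list of (belief, value) pairs; the actual data layout in csle may differ:

    import numpy as np

    def sawtooth_value(upper_bound, b, S):
        corner_values, interior_points = upper_bound  # assumed layout
        v_corner = float(np.dot(b, corner_values))    # interpolation from the corner points only
        v = v_corner
        for b_i, v_i in interior_points:
            # how much of b can be explained by the interior point b_i
            c = min(b[s] / b_i[s] for s in S if b_i[s] > 0)
            v = min(v, v_corner + c * (v_i - float(np.dot(b_i, corner_values))))
        return v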

bayes_filter(s_prime: int, o: int, a: int, b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

A Bayesian filter to compute the belief of being in s_prime when observing o after taking action a in belief b

Parameters
  • s_prime – the state to compute the belief of

  • o – the observation

  • a – the action

  • b – the current belief point

  • S – the set of states

  • Z – the observation tensor

  • T – the transition tensor

Returns

b_prime(s_prime)
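
For illustration, a minimal sketch of the belief-update formula that this method implements, assuming the tensors are indexed as Z[a][s'][o] and T[a][s][s'] (the actual index conventions in csle may differ):

    def bayes_filter(s_prime, o, a, b, S, Z, T):
        # P[o | b, a] = sum_{s'} Z[a][s'][o] * sum_{s} T[a][s][s'] * b[s]
        norm = sum(Z[a][sp][o] * sum(T[a][s][sp] * b[s] for s in S) for sp in S)
        if norm == 0:
            return 0.0
        # b'(s') = Z[a][s'][o] * sum_{s} T[a][s][s'] * b[s] / P[o | b, a]
        return Z[a][s_prime][o] * sum(T[a][s][s_prime] * b[s] for s in S) / norm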

excess(lower_bound: List[Any], upper_bound: Tuple[List[Any], List[Any]], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], epsilon: float, gamma: float, t: int, lp: bool) float[source]

Computes the excess uncertainty (Trey Smith and Reid Simmons, 2004)

Parameters
  • lower_bound – the lower bound

  • upper_bound – the upper bound

  • b – the current belief point

  • S – the set of states

  • epsilon – the epsilon accuracy parameter

  • gamma – the discount factor

  • t – the current exploration depth

  • lp – whether to use LP or not to compute upper bound belief values

Returns

the excess uncertainty
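
As a sketch, the excess uncertainty from Smith and Simmons (2004) is the bound width at b minus the depth-adjusted accuracy threshold; the helpers below refer to the lower_bound_value and upper_bound_value methods documented in this module:

    def excess(lower_bound, upper_bound, b, S, epsilon, gamma, t, lp):
        # width(b) = V_upper(b) - V_lower(b); exploration stops at depth t once
        # width(b) <= epsilon * gamma^(-t)
        w = upper_bound_value(upper_bound, b, S, lp=lp) - lower_bound_value(lower_bound, b, S)
        return w - epsilon * gamma ** (-t)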

explore(b: numpy.ndarray[Any, numpy.dtype[Any]], epsilon: float, t: int, lower_bound: List[int], upper_bound: Tuple[List[int], List[int]], gamma: float, S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], lp: bool) Tuple[List[int], Tuple[List[int], List[int]]][source]

Explores the POMDP tree

Parameters
  • b – current belief

  • epsilon – accuracy parameter

  • t – the current depth of the exploration

  • lower_bound – the lower bound on the value function

  • upper_bound – the upper bound on the value function

  • gamma – discount factor

  • S – set of states

  • O – set of observations

  • Z – observation tensor

  • R – reward tensor

  • T – transition tensor

  • A – set of actions

  • lp – whether to use LP to compute upper bound values

Returns

new lower and upper bounds
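
A condensed sketch of the exploration heuristic from Smith and Simmons (2004) that this method follows, written as a free function in terms of the other methods documented in this module; the exact ordering of the local updates in csle may differ:

    def explore(b, epsilon, t, lower_bound, upper_bound, gamma, S, O, Z, R, T, A, lp):
        # stop when the bound gap at b is already small enough for this depth
        if width(lower_bound, upper_bound, b, S, lp) <= epsilon * gamma ** (-t):
            return lower_bound, upper_bound
        # IE-MAX heuristic: pick the action with the highest upper-bound Q-value
        a_star = max(A, key=lambda a: q(b, a, lower_bound, upper_bound, S, O, Z, R,
                                        gamma, T, upper=True, lp=lp))
        # weighted-excess heuristic: pick the observation whose successor belief
        # contributes the most excess uncertainty
        o_star = max(O, key=lambda o: p_o_given_b_a(o, b, a_star, S, Z, T) *
                     excess(lower_bound, upper_bound, next_belief(o, a_star, b, S, Z, T),
                            S, epsilon, gamma, t + 1, lp))
        lower_bound, upper_bound = explore(next_belief(o_star, a_star, b, S, Z, T), epsilon, t + 1,
                                           lower_bound, upper_bound, gamma, S, O, Z, R, T, A, lp)
        # tighten the bounds at b on the way back up
        return local_updates(lower_bound, upper_bound, b, A, S, O, R, T, gamma, Z, lp)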

generate_corner_belief(s: int, S: numpy.ndarray[Any, numpy.dtype[Any]])[source]

Generates the corner of the belief simplex that corresponds to being in state s with probability 1

Parameters
  • s – the state

  • S – the set of states

Returns

the corner belief corresponding to state s

hparam_names() List[str][source]
Returns

a list with the hyperparameter names

hsvi(O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, b0: numpy.ndarray[Any, numpy.dtype[Any]], epsilon: float, lp: bool = False, prune_frequency: int = 10, simulation_frequency: int = 10, simulate_horizon: int = 10, number_of_simulations: int = 10) Tuple[List[List[float]], List[float], List[float], List[float], List[int], List[int], List[float]][source]

Heuristic Search Value Iteration for POMDPs (Trey Smith and Reid Simmons, 2004)

Parameters
  • O – set of observations of the POMDP

  • Z – observation tensor of the POMDP

  • R – reward tensor of the POMDP

  • T – transition tensor of the POMDP

  • A – action set of the POMDP

  • S – state set of the POMDP

  • gamma – discount factor

  • b0 – initial belief point

  • epsilon – accuracy parameter

  • lp – whether to use LP or the sawtooth approximation to compute upper bound values

  • prune_frequency – how often to prune the upper and lower bounds

  • simulation_frequency – how frequently to simulate the POMDP to compute the rewards of the current policy

  • simulate_horizon – length of simulations to compute rewards

  • number_of_simulations – number of simulations to estimate reward

Returns

the computed lower and upper bounds together with statistics collected during the run
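
At a high level the solver repeats heuristic-search explorations from the initial belief until the bound gap at b0 falls below epsilon. A minimal sketch in terms of the other documented methods, omitting the pruning and simulation bookkeeping that the real method also performs:

    def hsvi(O, Z, R, T, A, S, gamma, b0, epsilon, lp=False):
        lower_bound = initialize_lower_bound(R, S, A, gamma)
        upper_bound = initialize_upper_bound(T, A, S, gamma, R)
        while width(lower_bound, upper_bound, b0, S, lp) > epsilon:
            lower_bound, upper_bound = explore(b0, epsilon, 0, lower_bound, upper_bound,
                                               gamma, S, O, Z, R, T, A, lp)
        return lower_bound, upper_bound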

hsvi_algorithm(exp_result: csle_common.dao.training.experiment_result.ExperimentResult, seed: int) csle_common.dao.training.experiment_result.ExperimentResult[source]

Runs the HSVI algorithm with the given random seed

Parameters
  • exp_result – the experiment result object

  • seed – the random seed

Returns

the updated experiment result

initialize_lower_bound(R: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float) List[Any][source]

Initializes the lower bound

Parameters
  • R – reward tensor

  • S – set of states

  • A – set of actions

  • gamma – discount factor

Returns

the initialized lower bound
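
One common initialization, shown here as a hedged sketch, is the "blind policy" bound: one alpha vector per action whose entries are the discounted worst-case reward of repeating that action forever (assuming R is indexed R[a][s]); the scheme used in csle may differ:

    import numpy as np

    def initialize_lower_bound(R, S, A, gamma):
        lower_bound = []
        for a in A:
            worst = min(R[a][s] for s in S)
            # repeating action a forever is worth at least worst / (1 - gamma) in every state
            lower_bound.append(np.full(len(S), worst / (1 - gamma)))
        return lower_bound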

initialize_upper_bound(T: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, R: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[List[Any], List[Any]][source]

Initializes the upper bound

Parameters
  • T – the transition tensor

  • A – the set of actions

  • S – the set of states

  • R – the reward tensor

  • gamma – the discount factor

Returns

the initialized upper bound
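
A standard initialization, sketched here under the same assumed (corner_values, interior_points) layout as above, is to solve the fully observable MDP with the vi method documented below and use its optimal values at the simplex corners; the exact layout used in csle may differ:

    def initialize_upper_bound(T, A, S, gamma, R):
        # the optimal MDP values upper-bound the POMDP value function at the corners
        _, V_mdp = vi(T, num_states=len(S), num_actions=len(A), R=R, discount_factor=gamma)
        corner_values = [float(V_mdp[s]) for s in S]
        interior_points = []  # filled in later by upper_bound_backup during the search
        return corner_values, interior_points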

interior_point_belief_val(interior_point: Tuple[numpy.ndarray[Any, numpy.dtype[Any]], float], b: numpy.ndarray[Any, numpy.dtype[Any]], alpha_corner: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

Computes the value induced on the belief point b projected onto the convex hull by a given interior belief point

Parameters
  • interior_point – the interior point

  • b – the belief point

  • alpha_corner – the alpha vector corresponding to the corners of the belief simplex

  • S – the set of states

Returns

the value of the belief point induced by the interior point

local_lower_bound_update(lower_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float) List[Any][source]

Performs a local update to the lower bound given a belief point in the heuristic search

Parameters
  • lower_bound – the current lower bound

  • b – the current belief point

  • A – the set of actions

  • O – the set of observations

  • Z – the observation tensor

  • S – the set of states

  • T – the transition tensor

  • R – the reward tensor

  • gamma – the discount factor

Returns

the updated lower bound

local_updates(lower_bound: List[Any], upper_bound: Tuple[List[Any], List[Any]], b: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, Z: numpy.ndarray[Any, numpy.dtype[Any]], lp: bool) Tuple[List[Any], Tuple[List[Any], List[Any]]][source]

Perform local updates to the upper and lower bounds for the given belief in the heuristic-search-exploration

Parameters
  • lower_bound – the lower bound on V

  • upper_bound – the upper bound on V

  • b – the current belief point

  • A – the set of actions

  • S – the set of states

  • O – the set of observations

  • R – the reward tensor

  • T – the transition tensor

  • gamma – the discount factor

  • Z – the observation tensor

  • lp – a boolean flag indicating whether to use LP to compute upper bound belief values

Returns

The updated lower and upper bounds

local_upper_bound_update(upper_bound: Tuple[List[Any], List[Any]], b: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, Z: numpy.ndarray[Any, numpy.dtype[Any]], lp: bool) Tuple[List[Any], List[Any]][source]

Performs a local update to the upper bound during the heuristic-search exploration

Parameters
  • upper_bound – the upper bound to update

  • b – the current belief point

  • A – the set of actions

  • S – the set of states

  • O – the set of observations

  • R – the reward tensor

  • T – the transition tensor

  • gamma – the discount factor

  • Z – the observation tensor

  • lp – whether or not to use LP to compute upper bound beliefs

Returns

the updated upper bound

lower_bound_backup(lower_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float) numpy.ndarray[Any, numpy.dtype[Any]][source]

Generates a new alpha-vector for the lower bound

Parameters
  • lower_bound – the current lower bound

  • b – the current belief point

  • A – the set of actions

  • O – the set of observations

  • Z – the observation tensor

  • S – the set of states

  • T – the transition tensor

  • R – the reward tensor

  • gamma – the discount factor

Returns

the new alpha vector
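
This is the standard point-based backup at belief b. A sketch, assuming R[a][s], T[a][s][s'] and Z[a][s'][o] index conventions:

    import numpy as np

    def lower_bound_backup(lower_bound, b, A, O, Z, S, T, R, gamma):
        best_alpha, best_val = None, float("-inf")
        for a in A:
            alpha_a = np.array([R[a][s] for s in S], dtype=float)
            for o in O:
                # back-project every existing alpha vector through (a, o) and keep
                # the one that is best at the current belief b
                g = [np.array([gamma * sum(Z[a][sp][o] * T[a][s][sp] * alpha[sp] for sp in S)
                               for s in S]) for alpha in lower_bound]
                alpha_a = alpha_a + max(g, key=lambda v: float(np.dot(v, b)))
            val = float(np.dot(alpha_a, b))
            if val > best_val:
                best_alpha, best_val = alpha_a, val
        return best_alpha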

lower_bound_value(lower_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

Computes the lower bound value of a given belief point

Parameters
  • lower_bound – the lower bound

  • b – the belief point

  • S – the set of states

Returns

the lower bound value
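
Since the lower bound is a set of alpha vectors, its value at a belief point is simply the best linear function evaluated at that point; a minimal sketch:

    import numpy as np

    def lower_bound_value(lower_bound, b, S):
        return max(float(np.dot(alpha, b)) for alpha in lower_bound)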

lp_convex_hull_projection_lp(upper_bound: Tuple[List[Any], List[Any]], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

Reference: (Hauskreht 2000)

Computes the upper bound belief by performing a projection onto the convex hull of the upper bound, it is computed by solving an LP

Parameters
  • upper_bound – the upper bound

  • b – the belief point to compute the value for

  • S – the set of states

Returns

the upper bound value of the belief point
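
A hedged sketch of the LP from Hauskrecht (2000) using scipy.optimize.linprog, under the same assumed (corner_values, interior_points) layout as above: the query belief is expressed as a convex combination of the bound points with the smallest attainable value:

    import numpy as np
    from scipy.optimize import linprog

    def lp_convex_hull_projection(upper_bound, b, S):
        corner_values, interior_points = upper_bound  # assumed layout
        points = [(generate_corner_belief(s, S), corner_values[s]) for s in S] + list(interior_points)
        beliefs = np.array([p[0] for p in points])  # one belief per row
        values = np.array([p[1] for p in points])
        # minimize sum_i lambda_i * v_i  s.t.  sum_i lambda_i * b_i = b,  lambda >= 0
        # (sum_i lambda_i = 1 is implied because every b_i and b sums to 1)
        res = linprog(c=values, A_eq=beliefs.T, b_eq=b, bounds=[(0, None)] * len(points))
        return float(res.fun)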

next_belief(o: int, a: int, b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) numpy.ndarray[Any, numpy.dtype[Any]][source]

Computes the next belief using a Bayesian filter

Parameters
  • o – the latest observation

  • a – the latest action

  • b – the current belief

  • S – the set of states

  • Z – the observation tensor

  • T – the transition tensor

Returns

the new belief
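
Since bayes_filter already normalizes by P[o | b, a], the next belief is obtained by stacking it over all states; a minimal sketch:

    import numpy as np

    def next_belief(o, a, b, S, Z, T):
        return np.array([bayes_filter(s_prime, o, a, b, S, Z, T) for s_prime in S])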

observation_possible(o, b, Z, T, S, a) bool[source]

Checks if a given observation is possible (has non-zero probability) when taking a given action in a given belief

Parameters
  • o – the observation to check

  • b – the belief to check

  • Z – the observation tensor

  • T – the transition tensor

  • S – the state space

  • a – the action to check

Returns

True if the observation is possible, otherwise False

one_step_lookahead(state, V, num_actions, num_states, T, discount_factor, R) numpy.ndarray[Any, numpy.dtype[Any]][source]

Performs a one-step lookahead for value iteration

Parameters
  • state – the current state

  • V – the current value function

  • num_actions – the number of actions

  • num_states – the number of states

  • T – the transition kernel

  • discount_factor – the discount factor

  • R – the table with rewards

Returns

an array with lookahead values

p_o_given_b_a(o: int, b: numpy.ndarray[Any, numpy.dtype[Any]], a: int, S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

Computes P[o|a,b]

Parameters
  • o – the observation

  • b – the belief point

  • a – the action

  • S – the set of states

  • Z – the observation tensor

  • T – the transition tensor

Returns

the probability of observing o when taking action a in belief point b

prune_upper_bound(upper_bound: Tuple[List[Any], List[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], lp: bool) Tuple[List[Any], List[Any]][source]

Prunes the points in the upper bound

Parameters
  • upper_bound – the current upper bound

  • S – the set of states

  • lp – boolean flag that decides whether to use LP to compute upper bound belief values

Returns

the pruned upper bound

q(b: numpy.ndarray[Any, numpy.dtype[Any]], a: int, lower_bound: List[int], upper_bound: Tuple[List[int], List[int]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, T: numpy.ndarray[Any, numpy.dtype[Any]], upper: bool = True, lp: bool = False) float[source]

Applies the Bellman equation to compute Q values

Parameters
  • b – the belief point

  • a – the action

  • lower_bound – the lower bound

  • upper_bound – the upper bound

  • S – the set of states

  • O – the set of observations

  • Z – the observation tensor

  • R – the reward tensor

  • gamma – the discount factor

  • T – the transition tensor

  • upper – boolean flag that decides whether to use the upper bound or lower bound on V to compute the Q-value

  • lp – boolean flag that decides whether to use LP to compute upper bound belief values

Returns

the Q-value
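
A sketch of the Bellman backup over beliefs, in terms of the other methods documented in this module and assuming R is indexed R[a][s]:

    def q(b, a, lower_bound, upper_bound, S, O, Z, R, gamma, T, upper=True, lp=False):
        # Q(b, a) = E[r | b, a] + gamma * sum_o P[o | b, a] * V(next_belief(o, a, b))
        immediate = sum(R[a][s] * b[s] for s in S)
        expected_future = 0.0
        for o in O:
            p_o = p_o_given_b_a(o, b, a, S, Z, T)
            if p_o > 0:
                b_prime = next_belief(o, a, b, S, Z, T)
                v = (upper_bound_value(upper_bound, b_prime, S, lp=lp) if upper
                     else lower_bound_value(lower_bound, b_prime, S))
                expected_future += p_o * v
        return immediate + gamma * expected_future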

q_hat_interval(b: numpy.ndarray[Any, numpy.dtype[Any]], a: int, S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, lower_bound: List[int], upper_bound: Tuple[List[Any], List[Any]], lp: bool) List[float][source]

Computes the upper and lower bound interval of the Q-value (Trey Smith and Reid Simmons, 2004)

Parameters
  • b – the current belief point

  • a – the action

  • S – the set of states

  • O – the set of observations

  • Z – the observation tensor

  • R – the reward tensor

  • T – the transition tensor

  • gamma – the discount factor

  • lower_bound – the lower bound

  • upper_bound – the upper bound

  • lp – boolean flag that decides whether to use LP to compute upper bound belief values

Returns

the interval

simulate(horizon: int, b0: numpy.ndarray[Any, numpy.dtype[Any]], lower_bound: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, T: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

Simulates the POMDP to estimate the reward of the greedy policy with respect to the value function represented by the lower bound

Parameters
  • horizon – the horizon for the simulation

  • b0 – the initial belief

  • lower_bound – the lower bound which represents the value function

  • Z – the observation tensor

  • R – the reward tensor

  • gamma – the discount factor

  • T – the transition operator

  • A – the action set

  • O – the observation set

  • S – the set of states

Returns

the cumulative discounted reward
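
A hedged sketch of such a rollout, again assuming R[a][s], T[a][s][s'] and Z[a][s'][o] index conventions and a greedy policy derived from the lower bound via the q method (with upper=False):

    import numpy as np

    def simulate(horizon, b0, lower_bound, Z, R, gamma, T, A, O, S):
        b, s, total = np.array(b0), np.random.choice(S, p=b0), 0.0
        for t in range(horizon):
            # act greedily with respect to the lower bound value function
            a = max(A, key=lambda act: q(b, act, lower_bound, None, S, O, Z, R, gamma, T,
                                         upper=False))
            total += (gamma ** t) * R[a][s]
            s_prime = np.random.choice(S, p=T[a][s])
            o = np.random.choice(O, p=Z[a][s_prime])
            b = next_belief(o, a, b, S, Z, T)
            s = s_prime
        return total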

train() csle_common.dao.training.experiment_execution.ExperimentExecution[source]

Runs the HSVI algorithm to compute V*

Returns

the results

update_corner_points(corner_points: List[Any], new_point: Tuple[numpy.ndarray[Any, numpy.dtype[Any]], float]) List[Any][source]

(Maybe) update the corner points of the upper bound

Parameters
  • corner_points – the current set of corner points

  • new_point – the new point to add to the upper bound

Returns

the new set of corner points

upper_bound_backup(upper_bound: Tuple[List[Any], List[Any]], b: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, lp: bool) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], float][source]

Adds a point to the upper bound

Parameters
  • upper_bound – the current upper bound

  • b – the current belief point

  • A – the set of actions

  • S – the set of states

  • O – the set of observations

  • Z – the observation tensor

  • R – the reward tensor

  • T – the transition tensor

  • gamma – the discount factor

  • lp – a boolean flag whether to use LP to compute the upper bound belief

Returns

the new point

upper_bound_value(upper_bound: Tuple[List[Any], List[Any]], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], lp: bool = False) float[source]

Computes the upper bound value of a given belief point

Parameters
  • upper_bound – the upper bound

  • b – the belief point

  • S – the set of states

  • lp – boolean flag that decides whether to use LP to compute the upper bound value or not

Returns

the upper bound value

vi(T: numpy.ndarray[Any, numpy.dtype[Any]], num_states: int, num_actions: int, R: numpy.ndarray[Any, numpy.dtype[Any]], theta=0.0001, discount_factor=1.0) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]][source]

An implementation of the value iteration algorithm

Parameters
  • T – the transition kernel T

  • num_states – the number of states

  • num_actions – the number of actions

  • R – the table with rewards

  • theta – the convergence threshold

  • discount_factor – the discount factor

Returns

(greedy policy, value function)
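
For reference, a self-contained sketch of value iteration with the same interface, assuming T[a][s][s'] transition probabilities and R[a][s] rewards:

    import numpy as np

    def vi(T, num_states, num_actions, R, theta=0.0001, discount_factor=1.0):
        V = np.zeros(num_states)
        while True:
            delta = 0.0
            for s in range(num_states):
                q_values = [R[a][s] + discount_factor * np.dot(T[a][s], V)
                            for a in range(num_actions)]
                best = max(q_values)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < theta:
                break
        policy = np.array([int(np.argmax([R[a][s] + discount_factor * np.dot(T[a][s], V)
                                          for a in range(num_actions)]))
                           for s in range(num_states)])
        return policy, V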

width(lower_bound: List[Any], upper_bound: Tuple[List[Any], List[Any]], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], lp: bool) float[source]

Computes the bounds width (Trey Smith and Reid Simmons, 2004)

Parameters
  • lower_bound – the current lower bound

  • upper_bound – the current upper bound

  • b – the current belief point

  • S – the set of states

  • lp – boolean flag that decides whether to use LP to compute upper bound belief values

Returns

the width of the bounds

Module contents