csle_agents.agents.hsvi_os_posg package

Submodules

csle_agents.agents.hsvi_os_posg.hsvi_os_posg_agent module

class csle_agents.agents.hsvi_os_posg.hsvi_os_posg_agent.HSVIOSPOSGAgent(simulation_env_config: csle_common.dao.simulation_config.simulation_env_config.SimulationEnvConfig, experiment_config: csle_common.dao.training.experiment_config.ExperimentConfig, training_job: Optional[csle_common.dao.jobs.training_job_config.TrainingJobConfig] = None, save_to_metastore: bool = True)[source]

Bases: csle_agents.agents.base.base_agent.BaseAgent

HSVI for OS-POSGs agent

auxillary_game(V: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, S: numpy.ndarray[Any, numpy.dtype[Any]], s: int, A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) numpy.ndarray[Any, numpy.dtype[Any]][source]

Creates an auxiliary matrix game based on the value function V

Parameters
  • V – the value function

  • gamma – the discount factor

  • S – the set of states

  • s – the state s

  • A1 – the set of actions of player 1

  • A2 – the set of actions of player 2

  • R – the reward tensor

  • T – the transition tensor

Returns

the auxiliary matrix game
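
As an illustration of what this construction computes, the sketch below builds the stage game matrix M with M[a1, a2] = R[a1, a2, s] + gamma * sum over s' of T[a1, a2, s, s'] * V[s']. The index conventions R[a1, a2, s] and T[a1, a2, s, s'] are assumptions about the tensor layout, not something the signature specifies:

    import numpy as np

    def auxiliary_game_sketch(V, gamma, S, s, A1, A2, R, T):
        # M[a1, a2] = immediate reward in s plus discounted expected continuation value
        M = np.zeros((len(A1), len(A2)))
        for a1 in range(len(A1)):
            for a2 in range(len(A2)):
                M[a1][a2] = R[a1][a2][s] + gamma * np.dot(T[a1][a2][s], V)
        return M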

bayes_filter(s_prime: int, o: int, a1: int, b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], pi_2: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

A Bayesian filter to compute the belief of player 1 of being in s_prime when observing o after taking action a1 in belief b, given that the opponent follows strategy pi_2

Parameters
  • s_prime – the state to compute the belief of

  • o – the observation

  • a1 – the action of player 1

  • b – the current belief point

  • S – the set of states

  • Z – the observation tensor

  • T – the transition tensor

  • pi_2 – the policy of player 2

  • A2 – the action set of player 2

Returns

b_prime(s_prime)
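
A minimal sketch of the filtering step, assuming the index conventions pi_2[s][a2], T[a1][a2][s][s_prime] and Z[a1][a2][s_prime][o]; the actual implementation may differ:

    import numpy as np

    def bayes_filter_sketch(s_prime, o, a1, b, S, Z, T, pi_2, A2):
        # unnormalized belief: probability of moving to s_prime and observing o under (a1, pi_2)
        num = sum(b[s] * pi_2[s][a2] * T[a1][a2][s][s_prime] * Z[a1][a2][s_prime][o]
                  for s in range(len(S)) for a2 in range(len(A2)))
        # normalizing constant: the same quantity summed over all possible next states
        norm = sum(b[s] * pi_2[s][a2] * T[a1][a2][s][s_p] * Z[a1][a2][s_p][o]
                   for s in range(len(S)) for a2 in range(len(A2)) for s_p in range(len(S)))
        return num / norm if norm > 0 else 0.0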

choose_a_o_for_exploration(A1: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], t: int, b: numpy.ndarray[Any, numpy.dtype[Any]], pi_1_upper_bound: numpy.ndarray[Any, numpy.dtype[Any]], pi_2_lower_bound: numpy.ndarray[Any, numpy.dtype[Any]], lower_bound: List[Any], upper_bound: List[Any], gamma: float, epsilon: float, delta: float, D: float, S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[int, int, float, float, float, numpy.ndarray[Any, numpy.dtype[Any]]][source]

Selects the action a1* and observation o* for exploration according to the heuristic:

(a1*, o*) = argmax_(a1,o) [P[a1,o] * excess(b_1(a1,o))]

(Horák, Bosansky, Kovařík, Kiekintveld, 2020)

Parameters
  • A1 – the action set of player 1

  • O – the set of observations in the OS-POSG

  • t – the time-step t

  • b – the belief point

  • pi_1_upper_bound – equilibrium strategy of player 1 in the stage game constructed from the upper bound V*

  • pi_2_lower_bound – equilibrium strategy of player 2 in the stage game constructed from the lower bound V*

  • lower_bound – the lower bound

  • upper_bound – the upper bound

  • gamma – the discount factor

  • epsilon – the epsilon accuracy parameter

  • delta – the Lipschitz-continuity parameter

  • D – the neighborhood parameter

  • S – the set of states

  • Z – the observation tensor

  • A2 – the set of actions of player 2

  • T – the transition tensor

Returns

a*,o*,weighted_excess(a*,o*), excess(a*,o*), width(a*,o*), b_prime(a*,o*)

combine_weights_and_pure_strategies_into_mixed_strategy(weights: numpy.ndarray[Any, numpy.dtype[Any]], strategies: numpy.ndarray[Any, numpy.dtype[Any]])[source]

Uses a set of mixture weights and strategies to compute a mixed strategy

Parameters
  • weights – the mixture weights

  • strategies – the set of strategies

Returns

the mixed strategy
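
For intuition, combining mixture weights and pure strategies into a mixed strategy is a weighted sum; a minimal sketch, assuming strategies is an array of shape (num_strategies, num_actions):

    import numpy as np

    def combine_sketch(weights, strategies):
        # mixed[a] = sum_i weights[i] * strategies[i][a]
        return np.einsum('i,ia->a', weights, strategies)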

compute_delta(S: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, R: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

The optimal value function V* of an OS-POSG is delta-Lipschitz continuous. To prove convergence of HSVI, we require that V_UB and V_LB are delta-Lipschitz continuous as well. This function computes the delta value according to (Horák, Bosansky, Kovařík, Kiekintveld, 2020). Specifically, delta = (U-L)/2, where L and U are the lower and upper bounds on the game values, respectively.

Parameters
  • S – the set of states of the OS-POSG

  • A1 – the set of actions of player 1

  • A2 – the set of actions of player 2

  • gamma – the discount factor

  • R – the reward tensor

Returns

the delta value
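
Following the formula in the description, a sketch of the computation, assuming L and U are taken as the smallest and largest discounted cumulative rewards:

    import numpy as np

    def compute_delta_sketch(R, gamma):
        # L and U: lower and upper bounds on the game value; delta = (U - L) / 2
        L = np.min(R) / (1 - gamma)
        U = np.max(R) / (1 - gamma)
        return (U - L) / 2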

compute_equilibrium_strategies_in_matrix_game(A: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]][source]

Computes equilibrium strategies in a matrix game

Parameters
  • A – the matrix game

  • A1 – the action set of player 1 (the maximizer)

  • A2 – the action set of player 2 (the minimizer)

Returns

the equilibrium strategy profile

compute_matrix_game_value(A: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], maximizer: bool = True) Tuple[Any, numpy.ndarray[Any, numpy.dtype[Any]]][source]

Uses LP to compute the value of a matrix game; also computes the maximin or minimax strategy

Parameters
  • A – the matrix game

  • A1 – the action set of player 1

  • A2 – the action set of player 2

  • maximizer – boolean flag whether to compute the maximin strategy or minimax strategy

Returns

(val(A), maximin or minimax strategy)
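
The standard maximin LP for a zero-sum matrix game is a reasonable way to picture what this method solves. The sketch below uses scipy.optimize.linprog and is an assumption about the formulation, not the library's exact code:

    import numpy as np
    from scipy.optimize import linprog

    def matrix_game_value_sketch(A):
        # maximin value and mixed strategy of the row player in the zero-sum matrix game A
        m, n = A.shape
        # decision variables: x_0..x_{m-1} (mixed strategy) and v (game value); minimize -v
        c = np.zeros(m + 1)
        c[-1] = -1.0
        # inequality constraints: v - sum_i x_i * A[i, j] <= 0 for every column j
        A_ub = np.hstack([-A.T, np.ones((n, 1))])
        b_ub = np.zeros(n)
        # equality constraint: the strategy is a probability distribution
        A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
        b_eq = np.array([1.0])
        bounds = [(0, 1)] * m + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[-1], res.x[:m]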

delta_lipschitz_envelope_of_upper_bound_value(upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], delta: float, S: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

This function computes the delta-Lipschitz envelope of the upper bound value at a given belief point b.

Parameters
  • upper_bound – the upper bound

  • b – the belief point

  • delta – the delta parameter for Lipschitz-continuity

  • S – the set of states

Returns

the belief value

excess(lower_bound: List[Any], upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], epsilon: float, gamma: float, t: int, delta: float, D: float) Tuple[float, float][source]

Computes the excess gap and width (Horak, Bosansky, Pechoucek, 2017)

Parameters
  • lower_bound – the lower bound

  • upper_bound – the upper bound

  • D – the neighborhood parameter

  • delta – the Lipschitz-continuity parameter

  • b – the current belief point

  • S – the set of states

  • epsilon – the epsilon accuracy parameter

  • gamma – the discount factor

  • t – the current exploration depth

Returns

the excess gap and gap width

explore(b: numpy.ndarray[Any, numpy.dtype[Any]], epsilon: float, t: int, lower_bound: List[Any], upper_bound: List[Any], gamma: float, S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], delta: float, D: float) Tuple[List[Any], List[Any]][source]

Explores the OS-POSG tree

Parameters
  • b – current belief

  • epsilon – accuracy parameter

  • t – the current depth of the exploration

  • lower_bound – the lower bound on the value function

  • upper_bound – the upper bound on the value function

  • gamma – discount factor

  • S – set of states in the OS-POSG

  • O – set of observations in the OS-POSG

  • Z – observation tensor in the OS-POSG

  • R – reward tensor in the OS-POSG

  • T – transition tensor in the OS-POSG

  • A1 – set of actions of P1 in the OS-POSG

  • A2 – set of actions of P2 in the OS-POSG

  • delta – the delta parameter in Lipschitz-continuity

  • D – the neighborhood parameter

Returns

new lower and upper bounds

generate_corner_belief(s: int, S: numpy.ndarray[Any, numpy.dtype[Any]])[source]

Generates the corner of the belief simplex that corresponds to being in state s with probability 1

Parameters
  • s – the state

  • S – the set of states

Returns

the corner belief corresponding to state s
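
A minimal sketch of what a corner belief looks like:

    import numpy as np

    def generate_corner_belief_sketch(s, S):
        # probability 1 on state s and 0 on every other state
        b = np.zeros(len(S))
        b[s] = 1.0
        return b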

hparam_names() List[str][source]

Returns

a list with the hyperparameter names

hsvi(O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, b0: numpy.ndarray[Any, numpy.dtype[Any]], epsilon: float, prune_frequency: int = 10, D: float = - 1.0)[source]

Heuristic Search Value Iteration for zero-sum OS-POSGs (Horak, Bosansky, Pechoucek, 2017)

The value function represents the utility player 1 can achieve in each possible initial belief of the game.

Parameters
  • O – set of observations of the OS-POSG

  • Z – observation tensor of the OS-POSG

  • R – reward tensor of the OS-POSG

  • T – transition tensor of the OS-POSG

  • A1 – action set of P1 in the OS-POSG

  • A2 – action set of P2 in the OS-POSG

  • S – state set of the OS-POSG

  • gamma – discount factor

  • b0 – initial belief point

  • epsilon – accuracy parameter

  • prune_frequency – how often to prune the upper and lower bounds

  • D – neighborhood parameter

Returns

None
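
At a high level, the documented methods fit together roughly as sketched below. This is a paraphrase of the HSVI loop built from the methods listed on this page, not the exact implementation; agent denotes an HSVIOSPOSGAgent instance and O, Z, R, T, A1, A2, S, b0, gamma, epsilon and prune_frequency are assumed to be given:

    delta = agent.compute_delta(S, A1, A2, gamma, R)
    D = agent.sample_D(gamma, epsilon, delta)
    lower_bound = agent.initialize_lower_bound(S, A1, A2, gamma, b0, R, T)
    upper_bound = agent.initialize_upper_bound(T, R, A1, A2, S, gamma)
    iteration = 0
    # iterate until the gap between the bounds at the initial belief is below epsilon
    while agent.width(lower_bound, upper_bound, b0, S, delta) > epsilon:
        lower_bound, upper_bound = agent.explore(
            b0, epsilon, 0, lower_bound, upper_bound, gamma,
            S, O, Z, R, T, A1, A2, delta, D)
        iteration += 1
        if iteration % prune_frequency == 0:
            upper_bound = agent.prune_upper_bound(upper_bound, delta, S)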

hsvi_os_posg(exp_result: csle_common.dao.training.experiment_result.ExperimentResult, seed: int) csle_common.dao.training.experiment_result.ExperimentResult[source]

Runs the HSVI algorithm for OS-POSGs

Parameters
  • exp_result – the experiment result object

  • seed – the random seed

Returns

the updated experiment result

initialize_lower_bound(S: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, b0: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) List[Any][source]

Initializes the lower bound by computing the state-values of the POMDP induced by player 1 playing a uniform strategy

Parameters
  • S – the set of states

  • A1 – the set of actions of player 1

  • A2 – the set of actions of player 2

  • gamma – the discount factor

  • b0 – the initial belief point

  • R – the reward tensor

  • T – the transition tensor

Returns

the lower bound (singleton set with an alpha vector)

initialize_upper_bound(T: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float) List[Any][source]

Initializes the upper bound by computing the values of the fully observed version of the OS-POSG using Shapley iteration.

Parameters
  • T – the transition tensor of the OS-POSG

  • R – the reward tensor of the OS-POSG

  • A1 – the action set of player 1 in the OS-POSG

  • A2 – the action set of player 2 in the OS-POSG

  • S – the set of states of the OS-POSG

  • gamma – the discount factor

Returns

the initial upper bound

local_lower_bound_update(lower_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float) numpy.ndarray[Any, numpy.dtype[Any]][source]

Performs a local update to the lower bound given a belief point in the heuristic search.

The lower bound update preserves the lower bound property and the Delta-Lipschitz continuity property.

Parameters
  • lower_bound – the current lower bound

  • b – the current belief point

  • A1 – the set of actions of player 1 in the OS-POSG

  • A2 – the set of actions of player 2 in the OS-POSG

  • O – the set of observations in the OS-POSG

  • Z – the observation tensor in the OS-POSG

  • S – the set of states in the OS-POSG

  • T – the transition tensor in the OS-POSG

  • R – the reward tensor in the OS-POSG

  • gamma – the discount factor in the OS-POSG

Returns

the updated lower bound

local_updates(lower_bound: List[Any], upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, Z: numpy.ndarray[Any, numpy.dtype[Any]], delta: float) Tuple[List[Any], List[Any]][source]

Perform local updates to the upper and lower bounds for the given belief in the heuristic-search-exploration

Parameters
  • lower_bound – the lower bound on V

  • upper_bound – the upper bound on V

  • b – the current belief point

  • A1 – the set of actions of player 1 in the OS-POSG

  • A2 – the set of actions of player 2 in the OS-POSG

  • S – the set of states in the OS-POSG

  • O – the set of observations in the OS-POSG

  • R – the reward tensor in the OS-POSG

  • T – the transition tensor in the OS-POSG

  • gamma – the discount factor

  • Z – the observation tensor in the OS-POSG

  • delta – the Lipschitz-Delta parameter

Returns

The updated lower and upper bounds

local_upper_bound_update(upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, Z: numpy.ndarray[Any, numpy.dtype[Any]], delta: float) numpy.ndarray[Any, numpy.dtype[Any]][source]

Performs a local update to the upper bound during the heuristic-search exploration

Parameters
  • upper_bound – the upper bound to update

  • b – the current belief point

  • A1 – the set of actions of player 1 in the OS-POSG

  • A2 – the set of actions of player 2 in the OS-POSG

  • S – the set of states in the OS-POSG

  • O – the set of observations in the OS-POSG

  • R – the reward tensor in the OS-POSG

  • T – the transition tensor in the OS-POSG

  • gamma – the discount factor in the OS-POSG

  • Z – the observation tensor in the OS-POSG

  • delta – the Lipschitz-Delta parameter

Returns

the updated upper bound

lower_bound_backup(lower_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, A2: numpy.ndarray[Any, numpy.dtype[Any]]) numpy.ndarray[Any, numpy.dtype[Any]][source]

Generates a new alpha-vector for the lower bound

Parameters
  • lower_bound – the current lower bound

  • b – the current belief point

  • A1 – the set of actions of player 1 in the OS-POSG

  • A2 – the set of actions of player 2 in the OS-POSG

  • O – the set of observations in the OS-POSG

  • Z – the observation tensor in the OS-POSG

  • S – the set of states in the OS-POSG

  • T – the transition tensor in the OS-POSG

  • R – the reward tensor in the OS-POSG

  • gamma – the discount factor

Returns

the new alpha vector

lower_bound_value(lower_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

Computes the lower bound value of a given belief point

Parameters
  • lower_bound – the lower bound

  • b – the belief point

  • S – the set of states

Returns

the lower bound value

maxcomp_shapley_bellman_operator(Gamma: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, b: numpy.ndarray[Any, numpy.dtype[Any]], R, T, Z) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]][source]

A dear child with many names: the Maxcomp/Shapley/Bellman operator that computes [HV](b), where V is represented by the pointwise maximum over the convex hull of a set of alpha vectors Gamma. By representing the set of vectors as its convex hull, the solution to the operator can be found with an LP; i.e., the whole purpose of using the convex hull of Gamma is to be able to compute the operator through an LP.

The operator is represented as the solution to a linear program given in (Karel Horák PhD thesis, 2019).

That is, the operator performs a backup to update the lower bound in HSVI or to perform iterations in exact value iteration for OS-POSGs.

Parameters
  • Gamma – the set of alpha vectors representing the value function

  • A1 – the action space of player 1

  • S – the set of states in the OS-POSG

  • O – the set of observations in the OS-POSG

  • A2 – the action space of player 2 in the OS-POSG

  • gamma – the discount factor

  • b – the belief point

  • R – the reward tensor of the OS-POSG

  • T – the transition tensor of the OS-POSG

  • Z – the observation tensor of the OS-POSG

Returns

the “optimal” stage strategy of player 1 and the set of alpha-vectors representing the values of the behavioral strategies followed after (a1, o), for any combination of a1 and o.

mdp_reward_matrix_p2(R: numpy.ndarray[Any, numpy.dtype[Any]], P1_strategy: numpy.ndarray[Any, numpy.dtype[Any]], A1: List[Any]) numpy.ndarray[Any, numpy.dtype[Any]][source]

Creates the reward matrix of player 2 induced by a fixed strategy of player 1 (an MDP)

Parameters
  • R – the reward tensor of the OS-POSG

  • P1_strategy – the fixed strategy of player 1

  • A1 – the set of actions of player 1

Returns

a |A2| x |S| matrix

mdp_transition_tensor_p2(T: numpy.ndarray[Any, numpy.dtype[Any]], P1_strategy: numpy.ndarray[Any, numpy.dtype[Any]], A1: List[float]) numpy.ndarray[Any, numpy.dtype[Any]][source]

Creates the transition tensor induced by a fixed strategy of player 1 (an MDP)

Parameters
  • T – the transition tensor of the OS-POSG

  • P1_strategy – the fixed strategy of player 1

  • A1 – the set of actions of player 1

Returns

a |A2| x |S| x |S| tensor
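
For intuition, fixing P1's strategy averages it out of the tensors; a minimal sketch assuming a state-independent strategy vector over A1 and the index conventions R[a1, a2, s] and T[a1, a2, s, s']:

    import numpy as np

    def mdp_reward_matrix_p2_sketch(R, P1_strategy):
        # R2[a2, s] = sum_{a1} pi_1(a1) * R[a1, a2, s]
        return np.einsum('i,ijk->jk', P1_strategy, R)

    def mdp_transition_tensor_p2_sketch(T, P1_strategy):
        # T2[a2, s, s'] = sum_{a1} pi_1(a1) * T[a1, a2, s, s']
        return np.einsum('i,ijkl->jkl', P1_strategy, T)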

next_belief(o: int, a1: int, b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], pi_2: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]]) numpy.ndarray[Any, numpy.dtype[Any]][source]

Computes the next belief using a Bayesian filter

Parameters
  • o – the latest observation

  • a1 – the latest action of player 1

  • b – the current belief

  • S – the set of states

  • Z – the observation tensor

  • T – the transition tensor

  • pi_2 – the policy of player 2

  • A2 – the set of actions of player 2

Returns

the new belief

obtain_equilibrium_strategy_profiles_in_stage_game(lower_bound: List[Any], upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], delta: float, S: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, R: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]][source]

Computes equilibrium strategy profiles in the stage game constructed from the lower and upper bound value functions

Parameters
  • lower_bound – the lower bound of V

  • upper_bound – the upper bound of V

  • b – the belief point

  • delta – the Lipschitz-continuity parameter

  • S – the set of states

  • A1 – the set of actions of player 1

  • A2 – the set of actions of player 2

  • R – the reward tensor

  • O – the set of observations

  • Z – the observation tensor

  • T – the transition tensor

  • gamma – the discount factor

Returns

Equilibrium strategy profiles in the upper and lower bound stage games

one_step_lookahead(state, V, num_actions, num_states, T, discount_factor, R) numpy.ndarray[Any, numpy.dtype[Any]][source]

Performs a one-step lookahead for value iteration

Parameters
  • state – the current state

  • V – the current value function

  • num_actions – the number of actions

  • num_states – the number of states

  • T – the transition kernel

  • discount_factor – the discount factor

  • R – the table with rewards

Returns

an array with lookahead values
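
A minimal sketch of the lookahead, assuming the index conventions T[a, s, s'] and R[a, s]:

    import numpy as np

    def one_step_lookahead_sketch(state, V, num_actions, num_states, T, discount_factor, R):
        # A[a] = immediate reward plus discounted expected value of the next state
        A = np.zeros(num_actions)
        for a in range(num_actions):
            A[a] = R[a][state] + discount_factor * np.dot(T[a][state], V)
        return A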

p_o_given_b_a1_a2(o: int, b: numpy.ndarray[Any, numpy.dtype[Any]], a1: int, a2: int, S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

Computes P[o|b, a1, a2]

Parameters
  • o – the observation

  • b – the belief point

  • a1 – the action of player 1

  • a2 – the action of player 2

  • S – the set of states

  • Z – the observation tensor

  • T – the transition tensor

Returns

the probability of observing o when taking actions a1 and a2 in belief point b
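
The quantity can be written out as a sum over current and next states; a sketch under the same index-convention assumptions as the earlier sketches:

    def p_o_given_b_a1_a2_sketch(o, b, a1, a2, S, Z, T):
        # P[o | b, a1, a2] = sum_s b(s) * sum_{s'} T[a1, a2, s, s'] * Z[a1, a2, s', o]
        return sum(b[s] * T[a1][a2][s][s_prime] * Z[a1][a2][s_prime][o]
                   for s in range(len(S)) for s_prime in range(len(S)))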

p_o_given_b_pi_1_pi_2(o: int, b: numpy.ndarray[Any, numpy.dtype[Any]], pi_1: numpy.ndarray[Any, numpy.dtype[Any]], pi_2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

Computes P[o|b, pi_1, pi_2]

Parameters
  • o – the observation

  • b – the belief point

  • pi_1 – the policy of player 1

  • pi_2 – the policy of player 2

  • S – the set of states

  • Z – the observation tensor

  • A1 – the set of actions of player 1

  • A2 – the set of actions of player 2

  • T – the transition tensor

Returns

the probability of observing o when the players follow the strategy profile (pi_1, pi_2) in belief point b

prune_upper_bound(upper_bound: List[Any], delta: float, S: numpy.ndarray[Any, numpy.dtype[Any]]) List[Any][source]

Prunes the points in the upper bound

Parameters
  • upper_bound – the current upper bound

  • delta – the delta parameter for lipschitz-continuity

  • S – the set of states

Returns

the pruned upper bound

rho(t: int, epsilon: float, gamma: float, delta: float, D: float) float[source]

During the exploration, the HSVI algorithm tries to keep the gap between V_UB and V_LB at most rho(t), which is monotonically increasing and unbounded

rho(0) = epsilon; rho(t+1) = (rho(t) - 2*delta*D)/gamma

Parameters
  • t – the time-step of the exploration

  • epsilon – the epsilon parameter

  • gamma – the discount factor

  • delta – the Lipshitz-continuity parameter

  • D – the neighborhood parameter

Returns

rho(t)
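
The recursion above translates directly into code; a minimal sketch:

    def rho_sketch(t, epsilon, gamma, delta, D):
        # rho(0) = epsilon; rho(t+1) = (rho(t) - 2*delta*D) / gamma
        r = epsilon
        for _ in range(t):
            r = (r - 2 * delta * D) / gamma
        return r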

sample_D(gamma: float, epsilon: float, delta: float) float[source]

Samples the neighborhood parameter in the legal range.

To ensure that the sequence rho is monotonically increasing and unbounded, we need to select the parameter D from the interval (0, (1-gamma)*epsilon/(2*delta))

Parameters
  • gamma – the discount factor

  • epsilon – the epsilon accuracy parameter

  • delta – the Lipschitz-continuity parameter

Returns

the neighborhood parameter D
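
A minimal sketch; how the implementation actually picks a point inside the open interval is an assumption:

    import numpy as np

    def sample_D_sketch(gamma, epsilon, delta):
        # D must lie strictly inside (0, (1 - gamma) * epsilon / (2 * delta))
        upper = (1 - gamma) * epsilon / (2 * delta)
        return float(np.random.uniform(low=0.01 * upper, high=0.99 * upper))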

si(S: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float = 1, max_iterations: int = 500, delta_threshold: float = 0.1, log: bool = False) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]][source]

Shapley Iteration (L. Shapley 1953)

Parameters
  • S – the set of states of the SG

  • A1 – the set of actions of player 1 in the SG

  • A2 – the set of actions of player 2 in the SG

  • R – the reward tensor in the SG

  • T – the transition tensor in the SG

  • gamma – the discount factor

  • max_iterations – the maximum number of iterations

  • delta_threshold – the stopping threshold

  • log – a boolean flag whether to use verbose logging or not

Returns

the value function, the set of maximin strategies for all stage games,

the set of minimax strategies for all stage games, and the stage games themselves
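
A compact sketch of the iteration, reusing matrix_game_value_sketch from the compute_matrix_game_value entry above and the same tensor index conventions; the returned strategy sets and stage games are omitted for brevity:

    import numpy as np

    def shapley_iteration_sketch(S, A1, A2, R, T, gamma, max_iterations, delta_threshold):
        V = np.zeros(len(S))
        for _ in range(max_iterations):
            V_new = np.zeros(len(S))
            for s in range(len(S)):
                # stage game: immediate reward plus discounted continuation values
                M = np.array([[R[a1][a2][s] + gamma * np.dot(T[a1][a2][s], V)
                               for a2 in range(len(A2))] for a1 in range(len(A1))])
                V_new[s], _ = matrix_game_value_sketch(M)
            if np.max(np.abs(V_new - V)) < delta_threshold:
                V = V_new
                break
            V = V_new
        return V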

train() csle_common.dao.training.experiment_execution.ExperimentExecution[source]

Runs the heuristic search value iteration (HSVI) algorithm to compute V*

Returns

the results

upper_bound_backup(upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, delta: float) float[source]

Adds a point to the upper bound

Parameters
  • upper_bound – the current upper bound

  • b – the current belief point

  • A1 – the set of actions of player 1 in the OS-POSG

  • A2 – the set of actions of player 2 in the OS-POSG

  • S – the set of states in the OS-POSG

  • O – the set of observations in the OS-POSG

  • Z – the observation tensor in the OS-POSG

  • R – the reward tensor in the OS-POSG

  • T – the transition tensor in the OS-POSG

  • gamma – the discount factor in the OS-POSG

  • delta – the Lipschitz-delta parameter

Returns

the new point

upper_bound_value(upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], delta: float, S: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

Computes the upper bound value of a given belief point

Parameters
  • upper_bound – the upper bound

  • b – the belief point

  • delta – the delta-parameter for Lipschitz-continuity

  • S – the set of states

Returns

the upper bound value

valcomp(pi_1: numpy.ndarray[Any, numpy.dtype[Any]], alpha_bar: numpy.ndarray[Any, numpy.dtype[Any]], s: int, A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, substituted_alpha: bool = False) float[source]

Computes the value of a compositional strategy of player 1 in a given state s. The compositional strategy consists of the one-stage strategy pi_1 (a probability distribution over A1) together with, for each action a taken according to pi_1 and each subsequent observation o, a behavioral strategy C^(a,o) in (Sigma_1)^(O x A) that player 1 follows thereafter. It is assumed that the value of following strategy C^(a,o) is represented by alpha^(a,o) in alpha_bar. Further, since the game is zero-sum, the value is computed under the assumption that P2 plays a best response against the compositional strategy.

Parameters
  • pi_1 – the one-stage strategy of P1 to evaluate

  • alpha_bar – the set of alpha vectors representing the value of the behavioral strategies in subsequent subgames

  • s – the state to evaluate the value of

  • A1 – the action set of player 1 in the OS-POSG

  • A2 – the action set of player 2 in the OS-POSG

  • O – the set of observations in the OS-POSG

  • S – the set of states in the OS-POSG

  • Z – the observation tensor in the OS-POSG

  • T – the transition tensor in the OS-POSG

  • R – the reward tensor in the OS-POSG

  • gamma – the discount factor

  • substituted_alpha – if True, treat the alpha vectors as already scaled by the stage strategy, i.e., alpha^(a,o)(s) = pi_1[a] * alpha^(a,o)(s)

Returns

the value of the compositional strategy of P1

value_of_p1_strategy_static(S: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, P1_strategy: numpy.ndarray[Any, numpy.dtype[Any]], b0: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], float][source]

Computes the value of P1's strategy P1_strategy, assuming that P1's strategy is static and independent of observations/actions/beliefs in the game, for example a uniform strategy.

The value function is computed by solving the Bellman equation of the induced MDP for P2.

Parameters
  • S – the set of states of the OS-POSG

  • A1 – the set of actions of P1 in the OS-POSG

  • A2 – the set of actions of P2 in the OS-POSG

  • gamma – the discount factor

  • P1_strategy – the static strategy of P1 to evaluate

  • b0 – the initial belief

  • R – the reward tensor

  • T – the transition tensor

Returns

the value vector and the value given the initial belief b0.

vi(T: numpy.ndarray[Any, numpy.dtype[Any]], num_states: int, num_actions: int, R: numpy.ndarray[Any, numpy.dtype[Any]], theta: float = 0.0001, discount_factor: float = 1.0) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]][source]

An implementation of the Value Iteration algorithm

Parameters
  • T – the transition kernel T

  • num_states – the number of states

  • num_actions – the number of actions

  • R – the table with rewards

  • theta – convergence threshold

  • discount_factor – the discount factor

Returns

(greedy policy, value function)
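
A minimal value-iteration sketch, reusing one_step_lookahead_sketch from the entry above and returning (greedy policy, value function) as documented; a sketch under the same index-convention assumptions, not the library's exact code:

    import numpy as np

    def vi_sketch(T, num_states, num_actions, R, theta=1e-4, discount_factor=1.0):
        V = np.zeros(num_states)
        while True:
            # largest change of the value function in this sweep
            change = 0.0
            for s in range(num_states):
                q = one_step_lookahead_sketch(s, V, num_actions, num_states, T, discount_factor, R)
                change = max(change, abs(np.max(q) - V[s]))
                V[s] = np.max(q)
            if change < theta:
                break
        policy = np.zeros((num_states, num_actions))
        for s in range(num_states):
            q = one_step_lookahead_sketch(s, V, num_actions, num_states, T, discount_factor, R)
            policy[s][int(np.argmax(q))] = 1.0
        return policy, V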

weighted_excess_gap(lower_bound: List[Any], upper_bound: List[Any], a1: int, o: int, b: numpy.ndarray[Any, numpy.dtype[Any]], t: int, gamma: float, pi_1_upper_bound, pi_2_lower_bound, epsilon: float, delta: float, D: float, S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[float, float, float, numpy.ndarray[Any, numpy.dtype[Any]]][source]

Computes the weighted excess gap

Parameters
  • lower_bound – the lower bound

  • upper_bound – the upper bound

  • a1 – the action of player 1

  • o – the observation

  • b – the belief point

  • t – the time-step

  • gamma – the discount factor

  • pi_1_upper_bound – a maximin strategy of player 1 in the stage game based on the upper bound of V*

  • pi_2_lower_bound – a minimax strategy of player 2 in the stage game based on the lower bound of V*

  • epsilon – the epsilon accuracy parameter

  • delta – the Lipschitz-continuity parameter

  • D – the neighborhood parameter

  • S – the set of states

  • Z – the observation tensor

  • A1 – the set of actions of player 1

  • A2 – the set of actions of player 2

  • T – the transition tensor

Returns

(weighted excess, excess, width, b_prime)

width(lower_bound: List[Any], upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], delta: float) float[source]

Computes the bounds width (Trey Smith and Reid Simmons, 2004)

Parameters
  • lower_bound – the current lower bound

  • upper_bound – the current upper bound

  • b – the current belief point

  • S – the set of states

  • delta – the delta parameter for Lipschitz-continuity

Returns

the width of the bounds

Module contents