csle_agents.agents.hsvi_os_posg package
Submodules
csle_agents.agents.hsvi_os_posg.hsvi_os_posg_agent module
- class csle_agents.agents.hsvi_os_posg.hsvi_os_posg_agent.HSVIOSPOSGAgent(simulation_env_config: csle_common.dao.simulation_config.simulation_env_config.SimulationEnvConfig, experiment_config: csle_common.dao.training.experiment_config.ExperimentConfig, training_job: Optional[csle_common.dao.jobs.training_job_config.TrainingJobConfig] = None, save_to_metastore: bool = True)[source]
Bases:
csle_agents.agents.base.base_agent.BaseAgent
HSVI for OS-POSGs agent
- auxillary_game(V: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, S: numpy.ndarray[Any, numpy.dtype[Any]], s: int, A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Creates an auxiliary matrix game based on the value function V
- Parameters
V – the value function
gamma – the discount factor
S – the set of states
s – the state s
A1 – the set of actions of player 1
A2 – the set of actions of player 2
R – the reward tensor
T – the transition tensor
- Returns
the auxiliary matrix game
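A minimal sketch of this construction, assuming NumPy arrays indexed as R[a1][a2][s] and T[a1][a2][s][s_prime] (an assumption about the tensor layout, not necessarily the implementation's):

import numpy as np

def auxiliary_game_sketch(V, gamma, S, s, A1, A2, R, T):
    # M[a1, a2] = immediate reward + discounted expected value of the next state
    M = np.zeros((len(A1), len(A2)))
    for a1 in A1:
        for a2 in A2:
            expected_next_value = sum(T[a1][a2][s][s_prime] * V[s_prime] for s_prime in S)
            M[a1][a2] = R[a1][a2][s] + gamma * expected_next_value
    return M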
- bayes_filter(s_prime: int, o: int, a1: int, b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], pi_2: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]]) float [source]
A Bayesian filter that computes player 1's belief of being in s_prime when observing o after taking action a1 in belief b, given that the opponent follows strategy pi_2
- Parameters
s_prime – the state to compute the belief of
o – the observation
a1 – the action of player 1
b – the current belief point
S – the set of states
Z – the observation tensor
T – the transition tensor
pi_2 – the policy of player 2
A2 – the action set of player 2
- Returns
b_prime(s_prime)
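An illustrative sketch of the filter, assuming tensors indexed as T[a1][a2][s][s_prime] and Z[a1][a2][s_prime][o] (layout assumptions, not guaranteed by the implementation):

def bayes_filter_sketch(s_prime, o, a1, b, S, Z, T, pi_2, A2):
    # Unnormalized probability of ending up in s_prime and observing o
    numerator = sum(b[s] * pi_2[a2] * T[a1][a2][s][s_prime] * Z[a1][a2][s_prime][o]
                    for s in S for a2 in A2)
    # Normalizing constant: total probability of observing o under (b, a1, pi_2)
    denominator = sum(b[s] * pi_2[a2] * T[a1][a2][s][sp] * Z[a1][a2][sp][o]
                      for sp in S for s in S for a2 in A2)
    return numerator / denominator if denominator > 0 else 0.0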
- choose_a_o_for_exploration(A1: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], t: int, b: numpy.ndarray[Any, numpy.dtype[Any]], pi_1_upper_bound: numpy.ndarray[Any, numpy.dtype[Any]], pi_2_lower_bound: numpy.ndarray[Any, numpy.dtype[Any]], lower_bound: List[Any], upper_bound: List[Any], gamma: float, epsilon: float, delta: float, D: float, S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[int, int, float, float, float, numpy.ndarray[Any, numpy.dtype[Any]]] [source]
Selects the action a1* and observation o* for exploration according to the heuristic:
(a1*,o*) = argmax_(a1,o)[P[a1,o]*excess(b_1(a_1,o))]
(Horák, Bosansky, Kovařík, Kiekintveld, 2020)
- Parameters
A1 – the action set of player 1
O – the set of observations in the OS-POSG
t – the time-step t
b – the belief point
pi_1_upper_bound – equilibrium strategy of player 1 in the stage game constructed from the upper bound V*
pi_2_lower_bound – equilibrium strategy of player 2 in the stage game constructed from the lower bound V*
lower_bound – the lower bound
upper_bound – the upper bound
gamma – the discount factor
epsilon – the epsilon accuracy parameter
delta – the Lipschitz-continuity parameter
D – the neighborhood parameter
S – the set of states
Z – the observation tensor
A2 – the set of actions of player 2
T – the transition tensor
- Returns
a*,o*,weighted_excess(a*,o*), excess(a*,o*), width(a*,o*), b_prime(a*,o*)
- combine_weights_and_pure_strategies_into_mixed_strategy(weights: numpy.ndarray[Any, numpy.dtype[Any]], strategies: numpy.ndarray[Any, numpy.dtype[Any]])[source]
Uses a set of mixture weights and strategies to compute a mixed strategy
- Parameters
weights – the mixture weights
strategies – the set of strategies
- Returns
the mixed strategy
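For example (a sketch assuming weights and strategies are NumPy arrays, with pure strategies stacked row-wise):

import numpy as np

def combine_weights_and_pure_strategies_sketch(weights, strategies):
    # Mixed strategy = weighted average of the pure strategies
    return np.average(strategies, axis=0, weights=weights)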
- compute_delta(S: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, R: numpy.ndarray[Any, numpy.dtype[Any]]) float [source]
The optimal value function V* of an OS-POSG is delta-Lipschitz continuous. To prove convergence of HSVI, we require that V_UB and V_LB are delta-Lipschitz continuous as well. This function computes the delta value according to (Horák, Bosansky, Kovařík, Kiekintveld, 2020). Specifically, delta = (U-L)/2, where L and U are the lower and upper bounds, respectively, on the game values.
- Parameters
S – the set of states of the OS-POSG
A1 – the set of actions of player 1
A2 – the set of actions of player 2
gamma – the discount factor
R – the reward tensor
- Returns
the delta value
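A minimal sketch of the computation, assuming the bounds L and U on the game value are derived from the extreme discounted rewards (an assumption; the implementation may bound them differently):

import numpy as np

def compute_delta_sketch(R, gamma):
    # L and U: lower and upper bounds on the achievable discounted value
    L = np.min(R) / (1 - gamma)
    U = np.max(R) / (1 - gamma)
    # delta = (U - L) / 2, as stated above
    return (U - L) / 2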
- compute_equilibrium_strategies_in_matrix_game(A: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]] [source]
Computes equilibrium strategies in a matrix game
- Parameters
A – the matrix game
A1 – the action set of player 1 (the maximizer)
A2 – the action set of player 2 (the minimizer)
- Returns
the equilibrium strategy profile
- compute_matrix_game_value(A: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], maximizer: bool = True) Tuple[Any, numpy.ndarray[Any, numpy.dtype[Any]]] [source]
Uses LP to compute the value of a matrix game, and also computes the maximin or minimax strategy
- Parameters
A – the matrix game
A1 – the action set of player 1
A2 – the action set of player 2
maximizer – boolean flag whether to compute the maximin strategy or minimax strategy
- Returns
(val(A), maximin or minimax strategy)
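A sketch of the standard maximin LP for a zero-sum matrix game, here using scipy.optimize.linprog (the actual implementation may use a different LP solver):

import numpy as np
from scipy.optimize import linprog

def matrix_game_value_sketch(A):
    # Maximin value and strategy of the row player in the zero-sum matrix game A
    m, n = A.shape
    # Decision variables: [x_1, ..., x_m, v]; maximize v  <=>  minimize -v
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every column j: v - sum_i x_i * A[i, j] <= 0
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # The strategy x must be a probability distribution
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    value, maximin_strategy = res.x[-1], res.x[:-1]
    return value, maximin_strategy

The minimax strategy of the column player can be obtained analogously by solving the corresponding dual (or by negating and transposing A).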
- delta_lipschitz_envelope_of_upper_bound_value(upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], delta: float, S: numpy.ndarray[Any, numpy.dtype[Any]]) float [source]
This function computes the delta-Lipschitz envelope of the upper bound value at a given belief point b.
- Parameters
upper_bound – the upper bound
b – the belief point
delta – the delta parameter for Lipschitz-continuity
S – the set of states
- Returns
the belief value
- excess(lower_bound: List[Any], upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], epsilon: float, gamma: float, t: int, delta: float, D: float) Tuple[float, float] [source]
Computes the excess gap and width (Horak, Bosansky, Pechoucek, 2017)
- Parameters
lower_bound – the lower bound
upper_bound – the upper bound
D – the neighborhood parameter
delta – the Lipschitz-continuity parameter
b – the current belief point
S – the set of states
epsilon – the epsilon accuracy parameter
gamma – the discount factor
t – the current exploration depth
- Returns
the excess gap and gap width
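A self-contained sketch: the excess is the width of the bounds at b minus the allowed gap rho(t) (rho is described further down on this page); the bound values at b are assumed precomputed, e.g. via lower_bound_value and upper_bound_value:

def excess_sketch(lower_bound_value_at_b, upper_bound_value_at_b, epsilon, gamma, t, delta, D):
    # Width of the bounds at the belief point
    w = upper_bound_value_at_b - lower_bound_value_at_b
    # Allowed gap at exploration depth t: rho(0) = epsilon, rho(t+1) = (rho(t) - 2*delta*D) / gamma
    rho_t = epsilon
    for _ in range(t):
        rho_t = (rho_t - 2 * delta * D) / gamma
    # Excess gap and the width itself
    return w - rho_t, w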
- explore(b: numpy.ndarray[Any, numpy.dtype[Any]], epsilon: float, t: int, lower_bound: List[Any], upper_bound: List[Any], gamma: float, S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], delta: float, D: float) Tuple[List[Any], List[Any]] [source]
Explores the OS-POSG tree
- Parameters
b – current belief
epsilon – accuracy parameter
t – the current depth of the exploration
lower_bound – the lower bound on the value function
upper_bound – the upper bound on the value function
gamma – discount factor
S – set of states in the OS-POSG
O – set of observations in the OS-POSG
Z – observation tensor in the OS-POSG
R – reward tensor in the OS-POSG
T – transition tensor in the OS-POSG
A1 – set of actions of P1 in the OS-POSG
A2 – set of actions of P2 in the OS-POSG
delta – the delta parameter in Lipschitz-continuity
- Returns
new lower and upper bounds
- generate_corner_belief(s: int, S: numpy.ndarray[Any, numpy.dtype[Any]])[source]
Generates the corner of the simplex that corresponds to being in some state with probability 1
- Parameters
s – the state
S – the set of states
- Returns
the corner belief corresponding to state s
- hsvi(O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, b0: numpy.ndarray[Any, numpy.dtype[Any]], epsilon: float, prune_frequency: int = 10, D: float = - 1.0)[source]
Heuristic Search Value Iteration for zero-sum OS-POSGs (Horak, Bosansky, Pechoucek, 2017)
The value function represents the utility player 1 can achieve in each possible initial belief of the game.
- Parameters
O – set of observations of the OS-POSG
Z – observation tensor of the OS-POSG
R – reward tensor of the OS-POSG
T – transition tensor of the OS-POSG
A1 – action set of P1 in the OS-POSG
A2 – action set of P2 in the OS-POSG
S – state set of the OS-POSG
gamma – discount factor
b0 – initial belief point
epsilon – accuracy parameter
prune_frequency – how often to prune the upper and lower bounds
D – neighborhood parameter
- Returns
None
- hsvi_os_posg(exp_result: csle_common.dao.training.experiment_result.ExperimentResult, seed: int) csle_common.dao.training.experiment_result.ExperimentResult [source]
Runs the HSVI algorithm for OS-POSGs
- Parameters
exp_result – the experiment result object
seed – the random seed
- Returns
the updated experiment result
- initialize_lower_bound(S: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, b0: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) List[Any] [source]
Initializes the lower bound by computing the state-values of the POMDP induced by player 1 playing a uniform strategy
- Parameters
S – the set of states
A1 – the set of actions of player 1
A2 – the set of actions of player 2
gamma – the discount factor
b0 – the initial belief point
R – the reward tensor
T – the transition tensor
- Returns
the lower bound (singleton set with an alpha vector)
- initialize_upper_bound(T: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float) List[Any] [source]
Initializes the upper bound by computing the values of the fully observed version of the OS-POSG using Shapley iteration.
- Parameters
T – the transition tensor of the OS-POSG
R – the reward tensor of the OS-POSG
A1 – the action set of player 1 in the OS-POSG
A2 – the action set of player 2 in the OS-POSG
S – the set of states of the OS-POSG
gamma – the discount factor
- Returns
the initial upper bound
- local_lower_bound_update(lower_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Performs a local update to the lower bound given a belief point in the heuristic search.
The lower bound update preserves the lower bound property and the Delta-Lipschitz continuity property.
- Parameters
lower_bound – the current lower bound
b – the current belief point
A1 – the set of actions of player 1 in the OS-POSG
A2 – the set of actions of player 2 in the OS-POSG
O – the set of observations in the OS-POSG
Z – the observation tensor in the OS-POSG
S – the set of states in the OS-POSG
T – the transition tensor in the OS-POSG
R – the reward tensor in the OS-POSG
gamma – the discount factor in the OS-POSG
- Returns
the updated lower bound
- local_updates(lower_bound: List[Any], upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, Z: numpy.ndarray[Any, numpy.dtype[Any]], delta: float) Tuple[List[Any], List[Any]] [source]
Performs local updates to the upper and lower bounds for the given belief in the heuristic-search exploration
- Parameters
lower_bound – the lower bound on V
upper_bound – the upper bound on V
b – the current belief point
A1 – the set of actions of player 1 in the OS-POSG
A2 – the set of actions of player 2 in the OS-POSG
S – the set of states in the OS-POSG
O – the set of observations in the OS-POSG
R – the reward tensor in the OS-POSG
T – the transition tensor in the OS-POSG
gamma – the discount factor
Z – the set of observations in the OS-POSG
delta – the Lipschitz-Delta parameter
- Returns
The updated lower and upper bounds
- local_upper_bound_update(upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, Z: numpy.ndarray[Any, numpy.dtype[Any]], delta: float) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Performs a local update to the upper bound during the heuristic-search exploration
- Parameters
upper_bound – the upper bound to update
b – the current belief point
A1 – the set of actions of player 1 in the OS-POSG
A2 – the set of actions of player 2 in the OS-POSG
S – the set of states in the OS-POSG
O – the set of observations in the OS-POSG
R – the reward tensor in the OS-POSG
T – the transition tensor in the OS-POSG
gamma – the discount factor in the OS-POSG
Z – the set of observations in the OS-POSG
delta – the Lipschitz-Delta parameter
- Returns
the updated upper bound
- lower_bound_backup(lower_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, A2: numpy.ndarray[Any, numpy.dtype[Any]]) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Generates a new alpha-vector for the lower bound
- Parameters
lower_bound – the current lower bound
b – the current belief point
A1 – the set of actions of player 1 in the OS-POSG
A2 – the set of actions of player 2 in the OS-POSG
O – the set of observations in the OS-POSG
Z – the observation tensor in the OS-POSG
S – the set of states in the OS-POSG
T – the transition tensor in the OS-POSG
R – the reward tensor in the OS-POSG
gamma – the discount factor
- Returns
the new alpha vector
- lower_bound_value(lower_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]]) float [source]
Computes the lower bound value of a given belief point
- Parameters
lower_bound – the lower bound
b – the belief point
S – the set of states
- Returns
the lower bound value
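A sketch, assuming the lower bound is a list of alpha vectors (as indicated by initialize_lower_bound above):

def lower_bound_value_sketch(lower_bound, b, S):
    # Pointwise maximum over the alpha vectors evaluated at the belief b
    return max(sum(alpha[s] * b[s] for s in S) for alpha in lower_bound)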
- maxcomp_shapley_bellman_operator(Gamma: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, b: numpy.ndarray[Any, numpy.dtype[Any]], R, T, Z) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]] [source]
A dear child with many names: the Maxcomp/Shapley/Bellman operator that computes [HV](b), where V is represented by the pointwise maximum over the convex hull of a set of alpha vectors Gamma. By representing the set of vectors as the convex hull, the solution to the operator can be found by an LP. I.e., the whole purpose of using the convex hull of Gamma is to be able to compute the operator through LP.
The operator is represented as the solution to a linear program given in (Karel Horák PhD thesis, 2019).
That is, the operator performs a backup to update the lower bound in HSVI or to perform iterations in exact value iteration for OS-POSGs.
- Parameters
Gamma – the set of alpha vectors representing the value function
A1 – the action space of player 1
S – the set of states in the OS-POSG
O – the set of observations in the OS-POSG
A2 – the action space of player 2 in the OS-POSG
gamma – the discount factor
b – the belief point
R – the reward tensor of the OS-POSG
T – the transition tensor of the OS-POSG
Z – the observation tensor of the OS-POSG
- Returns
the “optimal” stage strategy of Player 1 and the set of alpha-vectors representing the values of the behavioral strategy following a1,o for any combination of a1,o.
- mdp_reward_matrix_p2(R: numpy.ndarray[Any, numpy.dtype[Any]], P1_strategy: numpy.ndarray[Any, numpy.dtype[Any]], A1: List[Any]) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Creates the reward matrix of player 2 induced by a fixed policy of player 1 (an MDP)
- Parameters
R – the reward tensor of the OS-POSG
P1_strategy – the fixed strategy of player 1
A1 – the set of actions of player 1
- Returns
a |A2| x |S| matrix
- mdp_transition_tensor_p2(T: numpy.ndarray[Any, numpy.dtype[Any]], P1_strategy: numpy.ndarray[Any, numpy.dtype[Any]], A1: List[float]) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Creates the transition tensor induced by a fixed policy of player 1 (an MDP)
- Parameters
T – the transition tensor of the OS-POSG
P1_strategy – the fixed strategy of player 1
A1 – the set of actions of player 1
- Returns
a |A2| x |S| x |S| tensor
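A sketch of both constructions, assuming R[a1][a2][s] and T[a1][a2][s][s_prime] as NumPy arrays and P1_strategy as a distribution over A1 (layout assumptions, not guaranteed by the implementation):

import numpy as np

def mdp_reward_matrix_p2_sketch(R, P1_strategy):
    # R2[a2][s] = sum_a1 pi_1(a1) * R[a1][a2][s]
    return np.einsum('i,ijk->jk', P1_strategy, R)

def mdp_transition_tensor_p2_sketch(T, P1_strategy):
    # T2[a2][s][s_prime] = sum_a1 pi_1(a1) * T[a1][a2][s][s_prime]
    return np.einsum('i,ijkl->jkl', P1_strategy, T)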
- next_belief(o: int, a1: int, b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], pi_2: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]]) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Computes the next belief using a Bayesian filter
- Parameters
o – the latest observation
a1 – the latest action of player 1
b – the current belief
S – the set of states
Z – the observation tensor
T – the transition tensor
pi_2 – the policy of player 2
A2 – the set of actions of player 2
- Returns
the new belief
- obtain_equilibrium_strategy_profiles_in_stage_game(lower_bound: List[Any], upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], delta: float, S: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, R: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]] [source]
Computes equilibrium strategy profiles in the stage game constructed from the lower and upper bound value functions
- Parameters
lower_bound – the lower bound of V
upper_bound – the upper bound of V
b – the belief point
delta – the Lipschitz-continuity parameter
S – the set of states
A1 – the set of actions of player 1
A2 – the set of actions of player 2
R – the reward tensor
O – the set of observations
Z – the observation tensor
T – the transition tensor
gamma – the discount factor
- Returns
Equilibrium strategy profiles in the upper and lower bound stage games
- one_step_lookahead(state, V, num_actions, num_states, T, discount_factor, R) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Performs a one-step lookahead for value iteration
- Parameters
state – the current state
V – the current value function
num_actions – the number of actions
num_states – the number of states
T – the transition kernel
discount_factor – the discount factor
R – the table with rewards
- Returns
an array with lookahead values
- p_o_given_b_a1_a2(o: int, b: numpy.ndarray[Any, numpy.dtype[Any]], a1: int, a2: int, S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) float [source]
Computes P[o|a,b]
- Parameters
o – the observation
b – the belief point
a1 – the action of player 1
a2 – the action of player 2
S – the set of states
Z – the observation tensor
T – the transition tensor
- Returns
the probability of observing o when taking action a in belief point b
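A sketch of the computation, under the same tensor-layout assumptions as in the Bayes-filter sketch above:

def p_o_given_b_a1_a2_sketch(o, b, a1, a2, S, Z, T):
    # P[o | b, a1, a2] = sum_s b(s) * sum_s' T[a1][a2][s][s'] * Z[a1][a2][s'][o]
    return sum(b[s] * T[a1][a2][s][s_prime] * Z[a1][a2][s_prime][o]
               for s in S for s_prime in S)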
- p_o_given_b_pi_1_pi_2(o: int, b: numpy.ndarray[Any, numpy.dtype[Any]], pi_1: numpy.ndarray[Any, numpy.dtype[Any]], pi_2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) float [source]
Computes P[o|b,pi_1,pi_2]
- Parameters
o – the observation
b – the belief point
pi_1 – the policy of player 1
pi_2 – the policy of player 2
S – the set of states
Z – the observation tensor
T – the transition tensor
- Returns
the probability of observing o when the players follow the strategies pi_1 and pi_2 in belief point b
- prune_upper_bound(upper_bound: List[Any], delta: float, S: numpy.ndarray[Any, numpy.dtype[Any]]) List[Any] [source]
Prunes the points in the upper bound
- Parameters
upper_bound – the current upper bound
delta – the delta parameter for Lipschitz-continuity
S – the set of states
- Returns
the pruned upper bound
- rho(t: int, epsilon: float, gamma: float, delta: float, D: float) float [source]
During the exploration, the HSVI algorithm tries to keep the gap between V_UB and V_LB at most rho(t), which is monotonically increasing and unbounded:
rho(0) = epsilon, rho(t+1) = (rho(t) - 2*delta*D)/gamma
- Parameters
t – the time-step of the exploration
epsilon – the epsilon parameter
gamma – the discount factor
delta – the Lipschitz-continuity parameter
D – the neighborhood parameter
- Returns
rho(t)
- sample_D(gamma: float, epsilon: float, delta: float) float [source]
Samples the neighborhood parameter in the legal range.
To ensure that the sequence rho is monotonically increasing and unbounded, we need to select the parameter D from the interval (0, (1-gamma)*epsilon/(2*delta))
- Parameters
gamma – the discount factor
epsilon – the epsilon accuracy parameter
delta – the Lipschitz-continuity parameter
- Returns
the neighborhood parameter D
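For example (a sketch; the sampling distribution within the legal interval is an assumption):

import numpy as np

def sample_D_sketch(gamma, epsilon, delta):
    # Legal range that keeps the sequence rho monotonically increasing and unbounded
    upper = (1 - gamma) * epsilon / (2 * delta)
    return np.random.uniform(low=0.0, high=upper)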
- si(S: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float = 1, max_iterations: int = 500, delta_threshold: float = 0.1, log: bool = False) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]] [source]
Shapley Iteration (L. Shapley 1953)
- Parameters
S – the set of states of the SG
A1 – the set of actions of player 1 in the SG
A2 – the set of actions of player 2 in the SG
R – the reward tensor in the SG
T – the transition tensor in the SG
gamma – the discount factor
max_iterations – the maximum number of iterations
delta_threshold – the stopping threshold
log – a boolean flag whether to use verbose logging or not
- Returns
the value function, the set of maximin strategies for all stage games,
the set of minimax strategies for all stage games, and the stage games themselves
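An illustrative sketch of the iteration, reusing the auxiliary_game_sketch and matrix_game_value_sketch defined earlier on this page (and therefore sharing their layout assumptions); the real method also returns the maximin/minimax strategies and the stage games:

import numpy as np

def shapley_iteration_sketch(S, A1, A2, R, T, gamma=1.0, max_iterations=500, delta_threshold=0.1):
    # Iterate the Shapley (stage-game) backup until the value change falls below the threshold
    V = np.zeros(len(S))
    for _ in range(max_iterations):
        V_new = np.zeros(len(S))
        for s in S:
            # Build the stage game at s from the current value function and solve it by LP
            M = auxiliary_game_sketch(V, gamma, S, s, A1, A2, R, T)
            V_new[s], _ = matrix_game_value_sketch(M)
        if np.max(np.abs(V_new - V)) <= delta_threshold:
            return V_new
        V = V_new
    return V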
- train() csle_common.dao.training.experiment_execution.ExperimentExecution [source]
Runs the HSVI algorithm for OS-POSGs to compute V*
- Returns
the results
- upper_bound_backup(upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, delta: float) float [source]
Adds a point to the upper bound
- Parameters
upper_bound – the current upper bound
b – the current belief point
A1 – the set of actions of player 1 in the OS-POSG
A2 – the set of actions of player 2 in the OS-POSG
S – the set of states in the OS-POSG
O – the set of observations in the OS-POSG
Z – the observation tensor in the OS-POSG
R – the reward tensor in the OS-POSG
T – the transition tensor in the OS-POSG
gamma – the discount factor in the OS-POSG
delta – the Lipschitz-delta parameter
- Returns
the new point
- upper_bound_value(upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], delta: float, S: numpy.ndarray[Any, numpy.dtype[Any]]) float [source]
Computes the upper bound value of a given belief point
- Parameters
upper_bound – the upper bound
b – the belief point
delta – the delta-parameter for Lipschitz-continuity
S – the set of states
- Returns
the upper bound value
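A simplified sketch, assuming the upper bound is stored as a list of (belief, value) points and that a pointwise delta-Lipschitz envelope suffices; the actual implementation may instead solve an LP over convex combinations of the points:

def upper_bound_value_sketch(upper_bound, b, delta, S):
    # Each stored point (b_i, y_i) bounds the value at b by y_i + delta * ||b - b_i||_1;
    # the envelope takes the tightest such bound
    return min(y_i + delta * sum(abs(b[s] - b_i[s]) for s in S)
               for (b_i, y_i) in upper_bound)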
- valcomp(pi_1: numpy.ndarray[Any, numpy.dtype[Any]], alpha_bar: numpy.ndarray[Any, numpy.dtype[Any]], s: int, A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, substituted_alpha: bool = False) float [source]
Computes the value of a compositional strategy of player 1 in a given state s. The compositional strategy consists of the one-stage strategy pi_1 (a probability distribution over A1) and the expected value when taking action a according to pi_1 and then observing the next observation o and then following a behavioral strategy C^(a,o) in (Sigma_1)^(O x A). It is assumed that the value of following strategy C^(a,o) is represented by alpha^(a,o) in alpha_bar. Further, to compute the value, since the game is zero-sum, we assume that P2 plays a best response against the compositional strategy.
- Parameters
pi_1 – the one-stage strategy of P1 to evaluate
alpha_bar – the set of alpha vectors representing the value of the behavioral strategies in subsequent subgames
s – the state to evaluate the value of
A1 – the action set of player 1 in the OS-POSG
A2 – the action set of player 2 in the OS-POSG
O – the set of observations in the OS-POSG
S – the set of states in the OS-POSG
Z – the observation tensor in the OS-POSG
T – the transition tensor in the OS-POSG
R – the reward tensor in the OS-POSG
substituted_alpha – if True, treat the alpha vectors as alpha^(a,o)(s) = pi_1[a] * alpha^(a,o)(s)
- Returns
the value of the compositional strategy of P1
- value_of_p1_strategy_static(S: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, P1_strategy: numpy.ndarray[Any, numpy.dtype[Any]], b0: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], float] [source]
Computes the value of P1's strategy P1_strategy, assuming that P1's strategy is static and independent of observations/actions/beliefs in the game, for example a uniform strategy.
The value function is computed by solving the Bellman equation of the induced MDP for P2.
- Parameters
S – the set of states of the OS-POSG
A1 – the set of actions of P1 in the OS-POSG
A2 – the set of actions of P2 in the OS-POSG
gamma – the discount factor
P1_strategy – the static strategy of P1 to evaluate
b0 – the initial belief
R – the reward tensor
T – the transition tensor
- Returns
the value vector and the value given the initial belief b0.
- vi(T: numpy.ndarray[Any, numpy.dtype[Any]], num_states: int, num_actions: int, R: numpy.ndarray[Any, numpy.dtype[Any]], theta: float = 0.0001, discount_factor: float = 1.0) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]] [source]
An implementation of the Value Iteration algorithm
- Parameters
T – the transition kernel
num_states – the number of states
num_actions – the number of actions
R – the table with rewards
theta – the convergence threshold
discount_factor – the discount factor
- Returns
(greedy policy, value function)
- weighted_excess_gap(lower_bound: List[Any], upper_bound: List[Any], a1: int, o: int, b: numpy.ndarray[Any, numpy.dtype[Any]], t: int, gamma: float, pi_1_upper_bound, pi_2_lower_bound, epsilon: float, delta: float, D: float, S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[float, float, float, numpy.ndarray[Any, numpy.dtype[Any]]] [source]
Computes the weighted excess gap
- Parameters
lower_bound – the lower bound
upper_bound – the upper bound
a1 – the action of player 1
o – the observation
b – the belief point
t – the time-step
gamma – the discount factor
pi_1_upper_bound – a maximin strategy of player 1 in the stage game based on the upper bound V*
pi_2_lower_bound – a minimax strategy of player 2 in the stage game based on the lower bound V*
epsilon – the epsilon accuracy parameter
delta – the Lipschitz-continuity parameter
D – the neighborhood parameter
S – the set of states
Z – the observation tensor
A1 – the set of actions of player 1
A2 – the set of actions of player 2
T – the transition tensor
- Returns
(weighted excess, excess, width, b_prime)
- width(lower_bound: List[Any], upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], delta: float) float [source]
Computes the bounds width (Trey Smith and Reid Simmons, 2004)
- Parameters
lower_bound – the current lower bound
upper_bound – the current upper bound
b – the current belief point
S – the set of states
delta – the delta parameter for Lipschitz-continuity
- Returns
the width of the bounds