csle_agents.agents.hsvi_os_posg package
Submodules
csle_agents.agents.hsvi_os_posg.hsvi_os_posg_agent module
- class csle_agents.agents.hsvi_os_posg.hsvi_os_posg_agent.HSVIOSPOSGAgent(simulation_env_config: csle_common.dao.simulation_config.simulation_env_config.SimulationEnvConfig, experiment_config: csle_common.dao.training.experiment_config.ExperimentConfig, training_job: Optional[csle_common.dao.jobs.training_job_config.TrainingJobConfig] = None, save_to_metastore: bool = True)[source]
Bases:
csle_agents.agents.base.base_agent.BaseAgent
HSVI for OS-POSGs agent
- auxillary_game(V: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, S: numpy.ndarray[Any, numpy.dtype[Any]], s: int, A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Creates an auxiliary matrix game based on the value function V
- Parameters
V – the value function
gamma – the discount factor
S – the set of states
s – the state s
A1 – the set of actions of player 1
A2 – the set of actions of player 2
R – the reward tensor
T – the transition tensor
- Returns
the auxiliary matrix game
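A minimal sketch of this construction, assuming NumPy arrays indexed as R[a1][a2][s] and T[a1][a2][s][s_prime] (an assumption about the tensor layout, not necessarily the implementation's):

import numpy as np

def auxiliary_game_sketch(V, gamma, S, s, A1, A2, R, T):
    # M[a1, a2] = immediate reward + discounted expected value of the next state
    M = np.zeros((len(A1), len(A2)))
    for a1 in A1:
        for a2 in A2:
            expected_next_value = sum(T[a1][a2][s][s_prime] * V[s_prime] for s_prime in S)
            M[a1][a2] = R[a1][a2][s] + gamma * expected_next_value
    return M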
- bayes_filter(s_prime: int, o: int, a1: int, b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], pi_2: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]]) float [source]
A Bayesian filter that computes player 1's belief of being in s_prime when observing o after taking action a1 in belief b, given that the opponent follows strategy pi_2
- Parameters
s_prime – the state to compute the belief of
o – the observation
a1 – the action of player 1
b – the current belief point
S – the set of states
Z – the observation tensor
T – the transition tensor
pi_2 – the policy of player 2
A2 – the action set of player 2
- Returns
b_prime(s_prime)
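An illustrative sketch of the filter, assuming tensors indexed as T[a1][a2][s][s_prime] and Z[a1][a2][s_prime][o] (layout assumptions, not guaranteed by the implementation):

def bayes_filter_sketch(s_prime, o, a1, b, S, Z, T, pi_2, A2):
    # Unnormalized probability of ending up in s_prime and observing o
    numerator = sum(b[s] * pi_2[a2] * T[a1][a2][s][s_prime] * Z[a1][a2][s_prime][o]
                    for s in S for a2 in A2)
    # Normalizing constant: total probability of observing o under (b, a1, pi_2)
    denominator = sum(b[s] * pi_2[a2] * T[a1][a2][s][sp] * Z[a1][a2][sp][o]
                      for sp in S for s in S for a2 in A2)
    return numerator / denominator if denominator > 0 else 0.0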
- choose_a_o_for_exploration(A1: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], t: int, b: numpy.ndarray[Any, numpy.dtype[Any]], pi_1_upper_bound: numpy.ndarray[Any, numpy.dtype[Any]], pi_2_lower_bound: numpy.ndarray[Any, numpy.dtype[Any]], lower_bound: List[Any], upper_bound: List[Any], gamma: float, epsilon: float, delta: float, D: float, S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[int, int, float, float, float, numpy.ndarray[Any, numpy.dtype[Any]]] [source]
Selects the action a1* and observation o* for exploration according to the heuristic:
(a1*,o*) = argmax_(a1,o)[P[a1,o]*excess(b_1(a_1,o))]
(Horák, Bosansky, Kovařík, Kiekintveld, 2020)
- Parameters
A1 – the action set of player 1
O – the set of observations in the OS-POSG
t – the time-step t
b – the belief point
pi_1_upper_bound – equilibrium strategy of player 1 in the stage game constructed from the upper bound V*
pi_2_lower_bound – equilibrium strategy of player 2 in the stage game constructed from the lower bound V*
lower_bound – the lower bound
upper_bound – the upper bound
gamma – the discount factor
epsilon – the epsilon accuracy parameter
delta – the Lipschitz-continuity parameter
D – the neighborhood parameter
S – the set of states
Z – the observation tensor
A2 – the set of actions of player 2
T – the transition tensor
- Returns
a*,o*,weighted_excess(a*,o*), excess(a*,o*), width(a*,o*), b_prime(a*,o*)
- combine_weights_and_pure_strategies_into_mixed_strategy(weights: numpy.ndarray[Any, numpy.dtype[Any]], strategies: numpy.ndarray[Any, numpy.dtype[Any]])[source]
Uses a set of mixture weights and strategies to compute a mixed strategy
- Parameters
weights – the mixture weights
strategies – the set of strategies
- Returns
the mixed strategy
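For example (a sketch assuming weights and strategies are NumPy arrays, with pure strategies stacked row-wise):

import numpy as np

def combine_weights_and_pure_strategies_sketch(weights, strategies):
    # Mixed strategy = weighted average of the pure strategies
    return np.average(strategies, axis=0, weights=weights)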
- compute_delta(S: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, R: numpy.ndarray[Any, numpy.dtype[Any]]) float [source]
The optimal value function V* of an OS-POSG is delta-Lipschitz continuous. To prove convergence of HSVI, we require that V_UB and V_LB are delta-Lipschitz continuous as well. This function computes the delta value according to (Horák, Bosansky, Kovařík, Kiekintveld, 2020). Specifically, delta = (U-L)/2, where L and U are the lower and upper bounds, respectively, on the game values.
- Parameters
S – the set of states of the OS-POSG
A1 – the set of actions of player 1
A2 – the set of actions of player 2
gamma – the discount factor
R – the reward tensor
- Returns
the delta value
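A minimal sketch of the computation, assuming the bounds L and U on the game value are derived from the extreme discounted rewards (an assumption; the implementation may bound them differently):

import numpy as np

def compute_delta_sketch(R, gamma):
    # L and U: lower and upper bounds on the achievable discounted value
    L = np.min(R) / (1 - gamma)
    U = np.max(R) / (1 - gamma)
    # delta = (U - L) / 2, as stated above
    return (U - L) / 2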
- compute_equilibrium_strategies_in_matrix_game(A: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]] [source]
Computes equilibrium strategies in a matrix game
- Parameters
A – the matrix game
A1 – the action set of player 1 (the maximizer)
A2 – the action set of player 2 (the minimizer)
- Returns
the equilibrium strategy profile
- compute_matrix_game_value(A: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], maximizer: bool = True) Tuple[Any, numpy.ndarray[Any, numpy.dtype[Any]]] [source]
Uses LP to compute the value of a matrix game, and also computes the maximin or minimax strategy
- Parameters
A – the matrix game
A1 – the action set of player 1
A2 – the action set of player 2
maximizer – boolean flag whether to compute the maximin strategy or minimax strategy
- Returns
(val(A), maximin or minimax strategy)
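A sketch of the standard maximin LP for a zero-sum matrix game, here using scipy.optimize.linprog (the actual implementation may use a different LP solver):

import numpy as np
from scipy.optimize import linprog

def matrix_game_value_sketch(A):
    # Maximin value and strategy of the row player in the zero-sum matrix game A
    m, n = A.shape
    # Decision variables: [x_1, ..., x_m, v]; maximize v  <=>  minimize -v
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every column j: v - sum_i x_i * A[i, j] <= 0
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # The strategy x must be a probability distribution
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    value, maximin_strategy = res.x[-1], res.x[:-1]
    return value, maximin_strategy

The minimax strategy of the column player can be obtained analogously by solving the corresponding dual (or by negating and transposing A).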
- delta_lipschitz_envelope_of_upper_bound_value(upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], delta: float, S: numpy.ndarray[Any, numpy.dtype[Any]]) float [source]
This function computes the delta-Lipschitz envelope of the upper bound value at a given belief point b.
- Parameters
upper_bound – the upper bound
b – the belief point
delta – the delta parameter for Lipschitz-continuity
S – the set of states
- Returns
the belief value
- excess(lower_bound: List[Any], upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], epsilon: float, gamma: float, t: int, delta: float, D: float) Tuple[float, float] [source]
Computes the excess gap and width (Horak, Bosansky, Pechoucek, 2017)
- Parameters
lower_bound – the lower bound
upper_bound – the upper bound
D – the neighborhood parameter
delta – the Lipschitz-continuity parameter
b – the current belief point
S – the set of states
epsilon – the epsilon accuracy parameter
gamma – the discount factor
t – the current exploration depth
- Returns
the excess gap and gap width
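A self-contained sketch: the excess is the width of the bounds at b minus the allowed gap rho(t) (rho is described further down on this page); the bound values at b are assumed precomputed, e.g. via lower_bound_value and upper_bound_value:

def excess_sketch(lower_bound_value_at_b, upper_bound_value_at_b, epsilon, gamma, t, delta, D):
    # Width of the bounds at the belief point
    w = upper_bound_value_at_b - lower_bound_value_at_b
    # Allowed gap at exploration depth t: rho(0) = epsilon, rho(t+1) = (rho(t) - 2*delta*D) / gamma
    rho_t = epsilon
    for _ in range(t):
        rho_t = (rho_t - 2 * delta * D) / gamma
    # Excess gap and the width itself
    return w - rho_t, w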
- explore(b: numpy.ndarray[Any, numpy.dtype[Any]], epsilon: float, t: int, lower_bound: List[Any], upper_bound: List[Any], gamma: float, S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], delta: float, D: float) Tuple[List[Any], List[Any]] [source]
Explores the OS-POSG tree
- Parameters
b – current belief
epsilon – accuracy parameter
t – the current depth of the exploration
lower_bound – the lower bound on the value function
upper_bound – the upper bound on the value function
gamma – discount factor
S – set of states in the OS-POSG
O – set of observations in the OS-POSG
Z – observation tensor in the OS-POSG
R – reward tensor in the OS-POSG
T – transition tensor in the OS-POSG
A1 – set of actions of P1 in the OS-POSG
A2 – set of actions of P2 in the OS-POSG
delta – the delta parameter in Lipschitz-continuity
- Returns
new lower and upper bounds
- generate_corner_belief(s: int, S: numpy.ndarray[Any, numpy.dtype[Any]])[source]
Generates the corner of the simplex that corresponds to being in some state with probability 1
- Parameters
s – the state
S – the set of states
- Returns
the corner belief corresponding to state s
- hsvi(O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, b0: numpy.ndarray[Any, numpy.dtype[Any]], epsilon: float, prune_frequency: int = 10, D: float = - 1.0)[source]
Heuristic Search Value Iteration for zero-sum OS-POSGs (Horak, Bosansky, Pechoucek, 2017)
The value function represents the utility player 1 can achieve in each possible initial belief of the game.
- Parameters
O – set of observations of the OS-POSG
Z – observation tensor of the OS-POSG
R – reward tensor of the OS-POSG
T – transition tensor of the OS-POSG
A1 – action set of P1 in the OS-POSG
A2 – action set of P2 in the OS-POSG
S – state set of the OS-POSG
gamma – discount factor
b0 – initial belief point
epsilon – accuracy parameter
prune_frequency – how often to prune the upper and lower bounds
D – neighborhood parameter
- Returns
None
- hsvi_os_posg(exp_result: csle_common.dao.training.experiment_result.ExperimentResult, seed: int) csle_common.dao.training.experiment_result.ExperimentResult [source]
Runs the HSVI algorithm for OS-POSGs
- Parameters
exp_result – the experiment result object
seed – the random seed
- Returns
the updated experiment result
- initialize_lower_bound(S: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, b0: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) List[Any] [source]
Initializes the lower bound by computing the state-values of the POMDP induced by player 1 playing a uniform strategy
- Parameters
S – the set of states
A1 – the set of actions of player 1
A2 – the set of actions of player 2
gamma – the discount factor
b0 – the initial belief point
R – the reward tensor
T – the transition tensor
- Returns
the lower bound (singleton set with an alpha vector)
- initialize_upper_bound(T: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float) List[Any] [source]
Initializes the upper bound by computing the values of the fully observed version of the OS-POSG using Shapley iteration.
- Parameters
T – the transition tensor of the OS-POSG
R – the reward tensor of the OS-POSG
A1 – the action set of player 1 in the OS-POSG
A2 – the action set of player 2 in the OS-POSG
S – the set of states of the OS-POSG
gamma – the discount factor
- Returns
the initial upper bound
- local_lower_bound_update(lower_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Performs a local update to the lower bound given a belief point in the heuristic search.
The lower bound update preserves the lower bound property and the Delta-Lipschitz continuity property.
- Parameters
lower_bound – the current lower bound
b – the current belief point
A1 – the set of actions of player 1 in the OS-POSG
A2 – the set of actions of player 2 in the OS-POSG
O – the set of observations in the OS-POSG
Z – the observation tensor in the OS-POSG
S – the set of states in the OS-POSG
T – the transition tensor in the OS-POSG
R – the reward tensor in the OS-POSG
gamma – the discount factor in the OS-POSG
- Returns
the updated lower bound
- local_updates(lower_bound: List[Any], upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, Z: numpy.ndarray[Any, numpy.dtype[Any]], delta: float) Tuple[List[Any], List[Any]] [source]
Performs local updates to the upper and lower bounds for the given belief in the heuristic-search exploration
- Parameters
lower_bound – the lower bound on V
upper_bound – the upper bound on V
b – the current belief point
A1 – the set of actions of player 1 in the OS-POSG
A2 – the set of actions of player 2 in the OS-POSG
S – the set of states in the OS-POSG
O – the set of observations in the OS-POSG
R – the reward tensor in the OS-POSG
T – the transition tensor in the OS-POSG
gamma – the discount factor
Z – the set of observations in the OS-POSG
delta – the Lipschitz-Delta parameter
- Returns
The updated lower and upper bounds
- local_upper_bound_update(upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, Z: numpy.ndarray[Any, numpy.dtype[Any]], delta: float) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Performs a local update to the upper bound during the heuristic-search exploration
- Parameters
upper_bound – the upper bound to update
b – the current belief point
A1 – the set of actions of player 1 in the OS-POSG
A2 – the set of actions of player 2 in the OS-POSG
S – the set of states in the OS-POSG
O – the set of observations in the OS-POSG
R – the reward tensor in the OS-POSG
T – the transition tensor in the OS-POSG
gamma – the discount factor in the OS-POSG
Z – the set of observations in the OS-POSG
delta – the Lipschitz-Delta parameter
- Returns
the updated upper bound
- lower_bound_backup(lower_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, A2: numpy.ndarray[Any, numpy.dtype[Any]]) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Generates a new alpha-vector for the lower bound
- Parameters
lower_bound – the current lower bound
b – the current belief point
A1 – the set of actions of player 1 in the OS-POSG
A2 – the set of actions of player 2 in the OS-POSG
O – the set of observations in the OS-POSG
Z – the observation tensor in the OS-POSG
S – the set of states in the OS-POSG
T – the transition tensor in the OS-POSG
R – the reward tensor in the OS-POSG
gamma – the discount factor
- Returns
the new alpha vector
- lower_bound_value(lower_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]]) float [source]
Computes the lower bound value of a given belief point
- Parameters
lower_bound – the lower bound
b – the belief point
S – the set of states
- Returns
the lower bound value
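A sketch, assuming the lower bound is a list of alpha vectors (as indicated by initialize_lower_bound above):

def lower_bound_value_sketch(lower_bound, b, S):
    # Pointwise maximum over the alpha vectors evaluated at the belief b
    return max(sum(alpha[s] * b[s] for s in S) for alpha in lower_bound)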
- maxcomp_shapley_bellman_operator(Gamma: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, b: numpy.ndarray[Any, numpy.dtype[Any]], R, T, Z) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]] [source]
A dear child with many names: the Maxcomp/Shapley/Bellman operator that computes [HV](b), where V is represented by the pointwise maximum over the convex hull of a set of alpha vectors Gamma. By representing the set of vectors as the convex hull, the solution to the operator can be found by an LP. I.e., the whole purpose of using the convex hull of Gamma is to be able to compute the operator through LP.
The operator is represented as the solution to a linear program given in (Karel Horák PhD thesis, 2019).
That is, the operator performs a backup to update the lower bound in HSVI or to perform iterations in exact value iteration for OS-POSGs.
- Parameters
Gamma – the set of alpha vectors representing the value function
A1 – the action space of player 1
S – the set of states in the OS-POSG
O – the set of observations in the OS-POSG
A2 – the action space of player 2 in the OS-POSG
gamma – the discount factor
b – the belief point
R – the reward tensor of the OS-POSG
T – the transition tensor of the OS-POSG
Z – the observation tensor of the OS-POSG
- Returns
the “optimal” stage strategy of Player 1 and the set of alpha-vectors representing the values of the behavioral strategy following a1,o for any combination of a1,o.
- mdp_reward_matrix_p2(R: numpy.ndarray[Any, numpy.dtype[Any]], P1_strategy: numpy.ndarray[Any, numpy.dtype[Any]], A1: List[Any]) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Creates the reward matrix of player 2 induced by a fixed policy of player 1 (an MDP)
- Parameters
R – the reward tensor of the OS-POSG
P1_strategy – the fixed strategy of player 1
A1 – the set of actions of player 1
- Returns
a |A2| x |S| matrix
- mdp_transition_tensor_p2(T: numpy.ndarray[Any, numpy.dtype[Any]], P1_strategy: numpy.ndarray[Any, numpy.dtype[Any]], A1: List[float]) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Creates the transition tensor induced by a fixed policy of player 1 (an MDP)
- Parameters
T – the transition tensor of the OS-POSG
P1_strategy – the fixed strategy of player 1
A1 – the set of actions of player 1
- Returns
a |A2| x |S| x |S| tensor
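A sketch of both constructions, assuming R[a1][a2][s] and T[a1][a2][s][s_prime] as NumPy arrays and P1_strategy as a distribution over A1 (layout assumptions, not guaranteed by the implementation):

import numpy as np

def mdp_reward_matrix_p2_sketch(R, P1_strategy):
    # R2[a2][s] = sum_a1 pi_1(a1) * R[a1][a2][s]
    return np.einsum('i,ijk->jk', P1_strategy, R)

def mdp_transition_tensor_p2_sketch(T, P1_strategy):
    # T2[a2][s][s_prime] = sum_a1 pi_1(a1) * T[a1][a2][s][s_prime]
    return np.einsum('i,ijkl->jkl', P1_strategy, T)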
- next_belief(o: int, a1: int, b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], pi_2: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]]) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Computes the next belief using a Bayesian filter
- Parameters
o – the latest observation
a1 – the latest action of player 1
b – the current belief
S – the set of states
Z – the observation tensor
T – the transition tensor
pi_2 – the policy of player 2
A2 – the set of actions of player 2
- Returns
the new belief
- obtain_equilibrium_strategy_profiles_in_stage_game(lower_bound: List[Any], upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], delta: float, S: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, R: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]] [source]
Computes equilibrium strategy profiles in the stage game constructed from the lower and upper bound value functions
- Parameters
lower_bound – the lower bound of V
upper_bound – the upper bound of V
b – the belief point
delta – the Lipschitz-continuity parameter
S – the set of states
A1 – the set of actions of player 1
A2 – the set of actions of player 2
R – the reward tensor
O – the set of observations
Z – the observation tensor
T – the transition tensor
gamma – the discount factor
- Returns
Equilibrium strategy profiles in the upper and lower bound stage games
- one_step_lookahead(state, V, num_actions, num_states, T, discount_factor, R) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Performs a one-step lookahead for value iteration
- Parameters
state – the current state
V – the current value function
num_actions – the number of actions
num_states – the number of states
T – the transition kernel
discount_factor – the discount factor
R – the table with rewards
- Returns
an array with lookahead values
- p_o_given_b_a1_a2(o: int, b: numpy.ndarray[Any, numpy.dtype[Any]], a1: int, a2: int, S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) float [source]
Computes P[o|a,b]
- Parameters
o – the observation
b – the belief point
a1 – the action of player 1
a2 – the action of player 2
S – the set of states
Z – the observation tensor
T – the transition tensor
- Returns
the probability of observing o when taking action a in belief point b
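A sketch of the computation, under the same tensor-layout assumptions as in the Bayes-filter sketch above:

def p_o_given_b_a1_a2_sketch(o, b, a1, a2, S, Z, T):
    # P[o | b, a1, a2] = sum_s b(s) * sum_s' T[a1][a2][s][s'] * Z[a1][a2][s'][o]
    return sum(b[s] * T[a1][a2][s][s_prime] * Z[a1][a2][s_prime][o]
               for s in S for s_prime in S)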
- p_o_given_b_pi_1_pi_2(o: int, b: numpy.ndarray[Any, numpy.dtype[Any]], pi_1: numpy.ndarray[Any, numpy.dtype[Any]], pi_2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) float [source]
Computes P[o|b,pi_1,pi_2]
- Parameters
o – the observation
b – the belief point
pi_1 – the policy of player 1
pi_2 – the policy of player 2
S – the set of states
Z – the observation tensor
T – the transition tensor
- Returns
the probability of observing o when the players follow the strategies pi_1 and pi_2 in belief point b
- prune_upper_bound(upper_bound: List[Any], delta: float, S: numpy.ndarray[Any, numpy.dtype[Any]]) List[Any] [source]
Prunes the points in the upper bound
- Parameters
upper_bound – the current upper bound
delta – the delta parameter for Lipschitz-continuity
S – the set of states
- Returns
the pruned upper bound
- rho(t: int, epsilon: float, gamma: float, delta: float, D: float) float [source]
During the exploration, the HSVI algorithm tries to keep the gap between V_UB and V_LB at most rho(t), which is monotonically increasing and unbounded:
rho(0) = epsilon, rho(t+1) = (rho(t) - 2*delta*D)/gamma
- Parameters
t – the time-step of the exploration
epsilon – the epsilon parameter
gamma – the discount factor
delta – the Lipschitz-continuity parameter
D – the neighborhood parameter
- Returns
rho(t)
- sample_D(gamma: float, epsilon: float, delta: float) float [source]
Samples the neighborhood parameter in the legal range.
To ensure that the sequence rho is monotonically increasing and unbounded, we need to select the parameter D from the interval (0, (1-gamma)*epsilon/(2*delta))
- Parameters
gamma – the discount factor
epsilon – the epsilon accuracy parameter
delta – the Lipschitz-continuity parameter
- Returns
the neighborhood parameter D
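For example (a sketch; the sampling distribution within the legal interval is an assumption):

import numpy as np

def sample_D_sketch(gamma, epsilon, delta):
    # Legal range that keeps the sequence rho monotonically increasing and unbounded
    upper = (1 - gamma) * epsilon / (2 * delta)
    return np.random.uniform(low=0.0, high=upper)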
- si(S: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float = 1, max_iterations: int = 500, delta_threshold: float = 0.1, log: bool = False) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]] [source]
Shapley Iteration (L. Shapley 1953)
- Parameters
S – the set of states of the SG
A1 – the set of actions of player 1 in the SG
A2 – the set of actions of player 2 in the SG
R – the reward tensor in the SG
T – the transition tensor in the SG
gamma – the discount factor
max_iterations – the maximum number of iterations
delta_threshold – the stopping threshold
log – a boolean flag whether to use verbose logging or not
- Returns
the value function, the set of maximin strategies for all stage games,
the set of minimax strategies for all stage games, and the stage games themselves
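An illustrative sketch of the iteration, reusing the auxiliary_game_sketch and matrix_game_value_sketch defined earlier on this page (and therefore sharing their layout assumptions); the real method also returns the maximin/minimax strategies and the stage games:

import numpy as np

def shapley_iteration_sketch(S, A1, A2, R, T, gamma=1.0, max_iterations=500, delta_threshold=0.1):
    # Iterate the Shapley (stage-game) backup until the value change falls below the threshold
    V = np.zeros(len(S))
    for _ in range(max_iterations):
        V_new = np.zeros(len(S))
        for s in S:
            # Build the stage game at s from the current value function and solve it by LP
            M = auxiliary_game_sketch(V, gamma, S, s, A1, A2, R, T)
            V_new[s], _ = matrix_game_value_sketch(M)
        if np.max(np.abs(V_new - V)) <= delta_threshold:
            return V_new
        V = V_new
    return V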
- train() csle_common.dao.training.experiment_execution.ExperimentExecution [source]
Runs the HSVI algorithm for OS-POSGs to compute V*
- Returns
the results
- upper_bound_backup(upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, delta: float) float [source]
Adds a point to the upper bound
- Parameters
upper_bound – the current upper bound
b – the current belief point
A1 – the set of actions of player 1 in the OS-POSG
A2 – the set of actions of player 2 in the OS-POSG
S – the set of states in the OS-POSG
O – the set of observations in the OS-POSG
Z – the observation tensor in the OS-POSG
R – the reward tensor in the OS-POSG
T – the transition tensor in the OS-POSG
gamma – the discount factor in the OS-POSG
delta – the Lipschitz-delta parameter
- Returns
the new point
- upper_bound_value(upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], delta: float, S: numpy.ndarray[Any, numpy.dtype[Any]]) float [source]
Computes the upper bound value of a given belief point
- Parameters
upper_bound – the upper bound
b – the belief point
delta – the delta-parameter for Lipschitz-continuity
S – the set of states
- Returns
the upper bound value
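A simplified sketch, assuming the upper bound is stored as a list of (belief, value) points and that a pointwise delta-Lipschitz envelope suffices; the actual implementation may instead solve an LP over convex combinations of the points:

def upper_bound_value_sketch(upper_bound, b, delta, S):
    # Each stored point (b_i, y_i) bounds the value at b by y_i + delta * ||b - b_i||_1;
    # the envelope takes the tightest such bound
    return min(y_i + delta * sum(abs(b[s] - b_i[s]) for s in S)
               for (b_i, y_i) in upper_bound)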
- valcomp(pi_1: numpy.ndarray[Any, numpy.dtype[Any]], alpha_bar: numpy.ndarray[Any, numpy.dtype[Any]], s: int, A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, substituted_alpha: bool = False) float [source]
Computes the value of a compositional strategy of player 1 in a given state s. The compositional strategy consists of the one-stage strategy pi_1 (a probability distribution over A1) and the expected value when taking action a according to pi_1 and then observing the next observation o and then following a behavioral strategy C^(a,o) in (Sigma_1)^(O x A). It is assumed that the value of following strategy C^(a,o) is represented by alpha^(a,o) in alpha_bar. Further, to compute the value, since the game is zero-sum, we assume that P2 plays a best response against the compositional strategy.
- Parameters
pi_1 – the one-stage strategy of P1 to evaluate
alpha_bar – the set of alpha vectors representing the value of the behavioral strategies in subsequent subgames
s – the state to evaluate the value of
A1 – the action set of player 1 in the OS-POSG
A2 – the action set of player 2 in the OS-POSG
O – the set of observations in the OS-POSG
S – the set of states in the OS-POSG
Z – the observation tensor in the OS-POSG
T – the transition tensor in the OS-POSG
R – the reward tensor in the OS-POSG
substituted_alpha – if True, treat the alpha vectors as alpha^(a,o)(s) = pi_1[a] * alpha^(a,o)(s)
- Returns
the value of the compositional strategy of P1
- value_of_p1_strategy_static(S: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, P1_strategy: numpy.ndarray[Any, numpy.dtype[Any]], b0: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], float] [source]
Computes the value of P1's strategy P1_strategy, assuming that P1's strategy is static and independent of observations/actions/beliefs in the game, for example a uniform strategy.
The value function is computed by solving the Bellman equation of the induced MDP for P2.
- Parameters
S – the set of states of the OS-POSG
A1 – the set of actions of P1 in the OS-POSG
A2 – the set of actions of P2 in the OS-POSG
gamma – the discount factor
P1_strategy – the static strategy of P1 to evaluate
b0 – the initial belief
R – the reward tensor
T – the transition tensor
- Returns
the value vector and the value given the initial belief b0.
- vi(T: numpy.ndarray[Any, numpy.dtype[Any]], num_states: int, num_actions: int, R: numpy.ndarray[Any, numpy.dtype[Any]], theta: float = 0.0001, discount_factor: float = 1.0) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]] [source]
An implementation of the Value Iteration algorithm
- Parameters
T – the transition kernel
num_states – the number of states
num_actions – the number of actions
R – the table with rewards
theta – the convergence threshold
discount_factor – the discount factor
- Returns
(greedy policy, value function)
- weighted_excess_gap(lower_bound: List[Any], upper_bound: List[Any], a1: int, o: int, b: numpy.ndarray[Any, numpy.dtype[Any]], t: int, gamma: float, pi_1_upper_bound, pi_2_lower_bound, epsilon: float, delta: float, D: float, S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], A1: numpy.ndarray[Any, numpy.dtype[Any]], A2: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[float, float, float, numpy.ndarray[Any, numpy.dtype[Any]]] [source]
Computes the weighted excess gap
- Parameters
lower_bound – the lower bound
upper_bound – the upper bound
a1 – the action of player 1
o – the observation
b – the belief point
t – the time-step
gamma – the discount factor
pi_1_upper_bound – a maximin strategy of player 1 in the stage game based on the upper bound V*
pi_2_lower_bound – a minimax strategy of player 2 in the stage game based on the lower bound V*
epsilon – the epsilon accuracy parameter
delta – the Lipschitz-continuity parameter
D – the neighborhood parameter
S – the set of states
Z – the observation tensor
A1 – the set of actions of player 1
A2 – the set of actions of player 2
T – the transition tensor
- Returns
(weighted excess, excess, width, b_prime)
- width(lower_bound: List[Any], upper_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], delta: float) float [source]
Computes the bounds width (Trey Smith and Reid Simmons, 2004)
- Parameters
lower_bound – the current lower bound
upper_bound – the current upper bound
b – the current belief point
S – the set of states
delta – the delta parameter for Lipschitz-continuity
- Returns
the width of the bounds