csle_agents.agents.hsvi package

Submodules

csle_agents.agents.hsvi.hsvi_agent module

class csle_agents.agents.hsvi.hsvi_agent.HSVIAgent(simulation_env_config: csle_common.dao.simulation_config.simulation_env_config.SimulationEnvConfig, experiment_config: csle_common.dao.training.experiment_config.ExperimentConfig, training_job: Optional[csle_common.dao.jobs.training_job_config.TrainingJobConfig] = None, save_to_metastore: bool = True, env: Optional[csle_common.dao.simulation_config.base_env.BaseEnv] = None)[source]

Bases: csle_agents.agents.base.base_agent.BaseAgent

Agent implementing Heuristic Search Value Iteration (HSVI) for POMDPs (Trey Smith and Reid Simmons, 2004)

approximate_projection_sawtooth(upper_bound: Tuple[List[Any], List[Any]], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

Reference: (Hauskrecht 2000)

Performs an approximate projection of the belief onto the convex hull of the upper bound to compute the upper bound value of the belief

Parameters
  • upper_bound – the upper bound

  • b – the belief point

  • S – the set of states

Returns

the value of the belief point
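
For reference, a minimal sketch of the sawtooth approximation that this method is based on, written as a free function. It assumes the upper bound is a pair (corner_values, interior_points), where corner_values[s] is the bound value at the simplex corner for state s and interior_points is a list of (belief, value) pairs; the actual data layout in csle may differ:

    import numpy as np

    def sawtooth_value(upper_bound, b, S):
        corner_values, interior_points = upper_bound  # assumed layout
        v_corner = float(np.dot(b, corner_values))    # interpolation from the corner points only
        v = v_corner
        for b_i, v_i in interior_points:
            # how much of b can be explained by the interior point b_i
            c = min(b[s] / b_i[s] for s in S if b_i[s] > 0)
            v = min(v, v_corner + c * (v_i - float(np.dot(b_i, corner_values))))
        return v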

bayes_filter(s_prime: int, o: int, a: int, b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

A Bayesian filter to compute the belief of being in s_prime when observing o after taking action a in belief b

Parameters
  • s_prime – the state to compute the belief of

  • o – the observation

  • a – the action

  • b – the current belief point

  • S – the set of states

  • Z – the observation tensor

  • T – the transition tensor

Returns

b_prime(s_prime)
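
For illustration, a minimal sketch of the belief-update formula that this method implements, assuming the tensors are indexed as Z[a][s'][o] and T[a][s][s'] (the actual index conventions in csle may differ):

    def bayes_filter(s_prime, o, a, b, S, Z, T):
        # P[o | b, a] = sum_{s'} Z[a][s'][o] * sum_{s} T[a][s][s'] * b[s]
        norm = sum(Z[a][sp][o] * sum(T[a][s][sp] * b[s] for s in S) for sp in S)
        if norm == 0:
            return 0.0
        # b'(s') = Z[a][s'][o] * sum_{s} T[a][s][s'] * b[s] / P[o | b, a]
        return Z[a][s_prime][o] * sum(T[a][s][s_prime] * b[s] for s in S) / norm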

excess(lower_bound: List[Any], upper_bound: Tuple[List[Any], List[Any]], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], epsilon: float, gamma: float, t: int, lp: bool) float[source]

Computes the excess uncertainty (Trey Smith and Reid Simmons, 2004)

Parameters
  • lower_bound – the lower bound

  • upper_bound – the upper bound

  • b – the current belief point

  • S – the set of states

  • epsilon – the epsilon accuracy parameter

  • gamma – the discount factor

  • t – the current exploration depth

  • lp – whether to use LP or not to compute upper bound belief values

Returns

the excess uncertainty
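
As a sketch, the excess uncertainty from Smith and Simmons (2004) is the bound width at b minus the depth-adjusted accuracy threshold; the helpers below refer to the lower_bound_value and upper_bound_value methods documented in this module:

    def excess(lower_bound, upper_bound, b, S, epsilon, gamma, t, lp):
        # width(b) = V_upper(b) - V_lower(b); exploration stops at depth t once
        # width(b) <= epsilon * gamma^(-t)
        w = upper_bound_value(upper_bound, b, S, lp=lp) - lower_bound_value(lower_bound, b, S)
        return w - epsilon * gamma ** (-t)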

explore(b: numpy.ndarray[Any, numpy.dtype[Any]], epsilon: float, t: int, lower_bound: List[int], upper_bound: Tuple[List[int], List[int]], gamma: float, S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], lp: bool) Tuple[List[int], Tuple[List[int], List[int]]][source]

Explores the POMDP tree

Parameters
  • b – current belief

  • epsilon – accuracy parameter

  • t – the current depth of the exploration

  • lower_bound – the lower bound on the value function

  • upper_bound – the upper bound on the value function

  • gamma – discount factor

  • S – set of states

  • O – set of observations

  • Z – observation tensor

  • R – reward tensor

  • T – transition tensor

  • A – set of actions

  • lp – whether to use LP to compute upper bound values

Returns

new lower and upper bounds
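
A condensed sketch of the exploration heuristic from Smith and Simmons (2004) that this method follows, written as a free function in terms of the other methods documented in this module; the exact ordering of the local updates in csle may differ:

    def explore(b, epsilon, t, lower_bound, upper_bound, gamma, S, O, Z, R, T, A, lp):
        # stop when the bound gap at b is already small enough for this depth
        if width(lower_bound, upper_bound, b, S, lp) <= epsilon * gamma ** (-t):
            return lower_bound, upper_bound
        # IE-MAX heuristic: pick the action with the highest upper-bound Q-value
        a_star = max(A, key=lambda a: q(b, a, lower_bound, upper_bound, S, O, Z, R,
                                        gamma, T, upper=True, lp=lp))
        # weighted-excess heuristic: pick the observation whose successor belief
        # contributes the most excess uncertainty
        o_star = max(O, key=lambda o: p_o_given_b_a(o, b, a_star, S, Z, T) *
                     excess(lower_bound, upper_bound, next_belief(o, a_star, b, S, Z, T),
                            S, epsilon, gamma, t + 1, lp))
        lower_bound, upper_bound = explore(next_belief(o_star, a_star, b, S, Z, T), epsilon, t + 1,
                                           lower_bound, upper_bound, gamma, S, O, Z, R, T, A, lp)
        # tighten the bounds at b on the way back up
        return local_updates(lower_bound, upper_bound, b, A, S, O, R, T, gamma, Z, lp)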

generate_corner_belief(s: int, S: numpy.ndarray[Any, numpy.dtype[Any]])[source]

Generates the corner of the belief simplex that corresponds to being in state s with probability 1

Parameters
  • s – the state

  • S – the set of states

Returns

the corner belief corresponding to state s

hparam_names() List[str][source]
Returns

a list with the hyperparameter names

hsvi(O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, b0: numpy.ndarray[Any, numpy.dtype[Any]], epsilon: float, lp: bool = False, prune_frequency: int = 10, simulation_frequency: int = 10, simulate_horizon: int = 10, number_of_simulations: int = 10) Tuple[List[List[float]], List[float], List[float], List[float], List[int], List[int], List[float]][source]

Heuristic Search Value Iteration for POMDPs (Trey Smith and Reid Simmons, 2004)

Parameters
  • O – set of observations of the POMDP

  • Z – observation tensor of the POMDP

  • R – reward tensor of the POMDP

  • T – transition tensor of the POMDP

  • A – action set of the POMDP

  • S – state set of the POMDP

  • gamma – discount factor

  • b0 – initial belief point

  • epsilon – accuracy parameter

  • lp – whether to use LP or the sawtooth approximation to compute upper bound values

  • prune_frequency – how often to prune the upper and lower bounds

  • simulation_frequency – how frequently to simulate the POMDP to compute the rewards of the current policy

  • simulate_horizon – length of simulations to compute rewards

  • number_of_simulations – number of simulations to estimate reward

Returns

the computed lower and upper bounds together with statistics collected during the run
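
At a high level the solver repeats heuristic-search explorations from the initial belief until the bound gap at b0 falls below epsilon. A minimal sketch in terms of the other documented methods, omitting the pruning and simulation bookkeeping that the real method also performs:

    def hsvi(O, Z, R, T, A, S, gamma, b0, epsilon, lp=False):
        lower_bound = initialize_lower_bound(R, S, A, gamma)
        upper_bound = initialize_upper_bound(T, A, S, gamma, R)
        while width(lower_bound, upper_bound, b0, S, lp) > epsilon:
            lower_bound, upper_bound = explore(b0, epsilon, 0, lower_bound, upper_bound,
                                               gamma, S, O, Z, R, T, A, lp)
        return lower_bound, upper_bound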

hsvi_algorithm(exp_result: csle_common.dao.training.experiment_result.ExperimentResult, seed: int) csle_common.dao.training.experiment_result.ExperimentResult[source]

Runs the HSVI algorithm with the given random seed

Parameters
  • exp_result – the experiment result object

  • seed – the random seed

Returns

the updated experiment result

initialize_lower_bound(R: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float) List[Any][source]

Initializes the lower bound

Parameters
  • R – reward tensor

  • S – set of states

  • A – set of actions

  • gamma – discount factor

Returns

the initialized lower bound
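
One common initialization, shown here as a hedged sketch, is the "blind policy" bound: one alpha vector per action whose entries are the discounted worst-case reward of repeating that action forever (assuming R is indexed R[a][s]); the scheme used in csle may differ:

    import numpy as np

    def initialize_lower_bound(R, S, A, gamma):
        lower_bound = []
        for a in A:
            worst = min(R[a][s] for s in S)
            # repeating action a forever is worth at least worst / (1 - gamma) in every state
            lower_bound.append(np.full(len(S), worst / (1 - gamma)))
        return lower_bound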

initialize_upper_bound(T: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, R: numpy.ndarray[Any, numpy.dtype[Any]]) Tuple[List[Any], List[Any]][source]

Initializes the upper bound

Parameters
  • T – the transition tensor

  • A – the set of actions

  • S – the set of states

  • R – the reward tensor

  • gamma – the discount factor

Returns

the initialized upper bound
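
A standard initialization, sketched here under the same assumed (corner_values, interior_points) layout as above, is to solve the fully observable MDP with the vi method documented below and use its optimal values at the simplex corners; the exact layout used in csle may differ:

    def initialize_upper_bound(T, A, S, gamma, R):
        # the optimal MDP values upper-bound the POMDP value function at the corners
        _, V_mdp = vi(T, num_states=len(S), num_actions=len(A), R=R, discount_factor=gamma)
        corner_values = [float(V_mdp[s]) for s in S]
        interior_points = []  # filled in later by upper_bound_backup during the search
        return corner_values, interior_points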

interior_point_belief_val(interior_point: Tuple[numpy.ndarray[Any, numpy.dtype[Any]], float], b: numpy.ndarray[Any, numpy.dtype[Any]], alpha_corner: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

Computes the value induced on the belief point b projected onto the convex hull by a given interior belief point

Parameters
  • interior_point – the interior point

  • b – the belief point

  • alpha_corner – the alpha vector corresponding to the corners of the belief simplex

  • S – the set of states

Returns

the value of the belief point induced by the interior point

local_lower_bound_update(lower_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float) List[Any][source]

Performs a local update to the lower bound given a belief point in the heuristic search

Parameters
  • lower_bound – the current lower bound

  • b – the current belief point

  • A – the set of actions

  • O – the set of observations

  • Z – the observation tensor

  • S – the set of states

  • T – the transition tensor

  • R – the reward tensor

  • gamma – the discount factor

Returns

the updated lower bound

local_updates(lower_bound: List[Any], upper_bound: Tuple[List[Any], List[Any]], b: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, Z: numpy.ndarray[Any, numpy.dtype[Any]], lp: bool) Tuple[List[Any], Tuple[List[Any], List[Any]]][source]

Perform local updates to the upper and lower bounds for the given belief in the heuristic-search-exploration

Parameters
  • lower_bound – the lower bound on V

  • upper_bound – the upper bound on V

  • b – the current belief point

  • A – the set of actions

  • S – the set of states

  • O – the set of observations

  • R – the reward tensor

  • T – the transition tensor

  • gamma – the discount factor

  • Z – the observation tensor

  • lp – a boolean flag indicating whether to use LP to compute upper bound belief values

Returns

The updated lower and upper bounds

local_upper_bound_update(upper_bound: Tuple[List[Any], List[Any]], b: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, Z: numpy.ndarray[Any, numpy.dtype[Any]], lp: bool) Tuple[List[Any], List[Any]][source]

Performs a local update to the upper bound during the heuristic-search exploration

Parameters
  • upper_bound – the upper bound to update

  • b – the current belief point

  • A – the set of actions

  • S – the set of states

  • O – the set of observations

  • R – the reward tensor

  • T – the transition tensor

  • gamma – the discount factor

  • Z – the observation tensor

  • lp – whether or not to use LP to compute upper bound beliefs

Returns

the updated upper bound

lower_bound_backup(lower_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float) numpy.ndarray[Any, numpy.dtype[Any]][source]

Generates a new alpha-vector for the lower bound

Parameters
  • lower_bound – the current lower bound

  • b – the current belief point

  • A – the set of actions

  • O – the set of observations

  • Z – the observation tensor

  • S – the set of states

  • T – the transition tensor

  • R – the reward tensor

  • gamma – the discount factor

Returns

the new alpha vector
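
This is the standard point-based backup at belief b. A sketch, assuming R[a][s], T[a][s][s'] and Z[a][s'][o] index conventions:

    import numpy as np

    def lower_bound_backup(lower_bound, b, A, O, Z, S, T, R, gamma):
        best_alpha, best_val = None, float("-inf")
        for a in A:
            alpha_a = np.array([R[a][s] for s in S], dtype=float)
            for o in O:
                # back-project every existing alpha vector through (a, o) and keep
                # the one that is best at the current belief b
                g = [np.array([gamma * sum(Z[a][sp][o] * T[a][s][sp] * alpha[sp] for sp in S)
                               for s in S]) for alpha in lower_bound]
                alpha_a = alpha_a + max(g, key=lambda v: float(np.dot(v, b)))
            val = float(np.dot(alpha_a, b))
            if val > best_val:
                best_alpha, best_val = alpha_a, val
        return best_alpha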

lower_bound_value(lower_bound: List[Any], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

Computes the lower bound value of a given belief point

Parameters
  • lower_bound – the lower bound

  • b – the belief point

  • S – the set of states

Returns

the lower bound value
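
Since the lower bound is a set of alpha vectors, its value at a belief point is simply the best linear function evaluated at that point; a minimal sketch:

    import numpy as np

    def lower_bound_value(lower_bound, b, S):
        return max(float(np.dot(alpha, b)) for alpha in lower_bound)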

lp_convex_hull_projection_lp(upper_bound: Tuple[List[Any], List[Any]], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

Reference: (Hauskreht 2000)

Computes the upper bound belief by performing a projection onto the convex hull of the upper bound, it is computed by solving an LP

Parameters
  • upper_bound – the upper bound

  • b – the belief point to compute the value for

  • S – the set of states

Returns

the upper bound value of the belief point
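
A hedged sketch of the LP from Hauskrecht (2000) using scipy.optimize.linprog, under the same assumed (corner_values, interior_points) layout as above: the query belief is expressed as a convex combination of the bound points with the smallest attainable value:

    import numpy as np
    from scipy.optimize import linprog

    def lp_convex_hull_projection(upper_bound, b, S):
        corner_values, interior_points = upper_bound  # assumed layout
        points = [(generate_corner_belief(s, S), corner_values[s]) for s in S] + list(interior_points)
        beliefs = np.array([p[0] for p in points])  # one belief per row
        values = np.array([p[1] for p in points])
        # minimize sum_i lambda_i * v_i  s.t.  sum_i lambda_i * b_i = b,  lambda >= 0
        # (sum_i lambda_i = 1 is implied because every b_i and b sums to 1)
        res = linprog(c=values, A_eq=beliefs.T, b_eq=b, bounds=[(0, None)] * len(points))
        return float(res.fun)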

next_belief(o: int, a: int, b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) numpy.ndarray[Any, numpy.dtype[Any]][source]

Computes the next belief using a Bayesian filter

Parameters
  • o – the latest observation

  • a – the latest action

  • b – the current belief

  • S – the set of states

  • Z – the observation tensor

  • T – the transition tensor

Returns

the new belief
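
Since bayes_filter already normalizes by P[o | b, a], the next belief is obtained by stacking it over all states; a minimal sketch:

    import numpy as np

    def next_belief(o, a, b, S, Z, T):
        return np.array([bayes_filter(s_prime, o, a, b, S, Z, T) for s_prime in S])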

observation_possible(o, b, Z, T, S, a) bool[source]

Checks if a given observation is possible (has non-zero probability) when taking a given action in a given belief

Parameters
  • o – the observation to check

  • b – the belief to check

  • Z – the observation tensor

  • T – the transition tensor

  • S – the state space

  • a – the action to check

Returns

True if the observation is possible, otherwise False

one_step_lookahead(state, V, num_actions, num_states, T, discount_factor, R) numpy.ndarray[Any, numpy.dtype[Any]][source]

Performs a one-step lookahead for value iteration

Parameters
  • state – the current state

  • V – the current value function

  • num_actions – the number of actions

  • num_states – the number of states

  • T – the transition kernel

  • discount_factor – the discount factor

  • R – the table with rewards

Returns

an array with lookahead values

p_o_given_b_a(o: int, b: numpy.ndarray[Any, numpy.dtype[Any]], a: int, S: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

Computes P[o|a,b]

Parameters
  • o – the observation

  • b – the belief point

  • a – the action

  • S – the set of states

  • Z – the observation tensor

  • T – the transition tensor

Returns

the probability of observing o when taking action a in belief point b

prune_upper_bound(upper_bound: Tuple[List[Any], List[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], lp: bool) Tuple[List[Any], List[Any]][source]

Prunes the points in the upper bound

Parameters
  • upper_bound – the current upper bound

  • S – the set of states

  • lp – boolean flag that decides whether to use LP to compute upper bound belief values

Returns

the pruned upper bound

q(b: numpy.ndarray[Any, numpy.dtype[Any]], a: int, lower_bound: List[int], upper_bound: Tuple[List[int], List[int]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, T: numpy.ndarray[Any, numpy.dtype[Any]], upper: bool = True, lp: bool = False) float[source]

Applies the Bellman equation to compute Q values

Parameters
  • b – the belief point

  • a – the action

  • lower_bound – the lower bound

  • upper_bound – the upper bound

  • S – the set of states

  • O – the set of observations

  • Z – the observation tensor

  • R – the reward tensor

  • gamma – the discount factor

  • T – the transition tensor

  • upper – boolean flag that decides whether to use the upper bound or lower bound on V to compute the Q-value

  • lp – boolean flag that decides whether to use LP to compute upper bound belief values

Returns

the Q-value
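
A sketch of the Bellman backup over beliefs, in terms of the other methods documented in this module and assuming R is indexed R[a][s]:

    def q(b, a, lower_bound, upper_bound, S, O, Z, R, gamma, T, upper=True, lp=False):
        # Q(b, a) = E[r | b, a] + gamma * sum_o P[o | b, a] * V(next_belief(o, a, b))
        immediate = sum(R[a][s] * b[s] for s in S)
        expected_future = 0.0
        for o in O:
            p_o = p_o_given_b_a(o, b, a, S, Z, T)
            if p_o > 0:
                b_prime = next_belief(o, a, b, S, Z, T)
                v = (upper_bound_value(upper_bound, b_prime, S, lp=lp) if upper
                     else lower_bound_value(lower_bound, b_prime, S))
                expected_future += p_o * v
        return immediate + gamma * expected_future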

q_hat_interval(b: numpy.ndarray[Any, numpy.dtype[Any]], a: int, S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, lower_bound: List[int], upper_bound: Tuple[List[Any], List[Any]], lp: bool) List[float][source]

Computes the upper and lower bound interval of the Q-value (Trey Smith and Reid Simmons, 2004)

Parameters
  • b – the current belief point

  • a – the action

  • S – the set of states

  • O – the set of observations

  • Z – the observation tensor

  • R – the reward tensor

  • T – the transition tensor

  • gamma – the discount factor

  • lower_bound – the lower bound

  • upper_bound – the upper bound

  • lp – boolean flag that decides whether to use LP to compute upper bound belief values

Returns

the interval

simulate(horizon: int, b0: numpy.ndarray[Any, numpy.dtype[Any]], lower_bound: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, T: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]]) float[source]

Simulates the POMDP to estimate the reward of the greedy policy with respect to the value function represented by the lower bound

Parameters
  • horizon – the horizon for the simulation

  • b0 – the initial belief

  • lower_bound – the lower bound which represents the value function

  • Z – the observation tensor

  • R – the reward tensor

  • gamma – the discount factor

  • T – the transition operator

  • A – the action set

  • O – the observation set

  • S – the set of states

Returns

the cumulative discounted reward
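
A hedged sketch of such a rollout, again assuming R[a][s], T[a][s][s'] and Z[a][s'][o] index conventions and a greedy policy derived from the lower bound via the q method (with upper=False):

    import numpy as np

    def simulate(horizon, b0, lower_bound, Z, R, gamma, T, A, O, S):
        b, s, total = np.array(b0), np.random.choice(S, p=b0), 0.0
        for t in range(horizon):
            # act greedily with respect to the lower bound value function
            a = max(A, key=lambda act: q(b, act, lower_bound, None, S, O, Z, R, gamma, T,
                                         upper=False))
            total += (gamma ** t) * R[a][s]
            s_prime = np.random.choice(S, p=T[a][s])
            o = np.random.choice(O, p=Z[a][s_prime])
            b = next_belief(o, a, b, S, Z, T)
            s = s_prime
        return total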

train() csle_common.dao.training.experiment_execution.ExperimentExecution[source]

Runs the HSVI algorithm to compute V*

Returns

the results

update_corner_points(corner_points: List[Any], new_point: Tuple[numpy.ndarray[Any, numpy.dtype[Any]], float]) List[Any][source]

(Maybe) update the corner points of the upper bound

Parameters
  • corner_points – the current set of corner points

  • new_point – the new point to add to the upper bound

Returns

the new set of corner points

upper_bound_backup(upper_bound: Tuple[List[Any], List[Any]], b: numpy.ndarray[Any, numpy.dtype[Any]], A: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], O: numpy.ndarray[Any, numpy.dtype[Any]], Z: numpy.ndarray[Any, numpy.dtype[Any]], R: numpy.ndarray[Any, numpy.dtype[Any]], T: numpy.ndarray[Any, numpy.dtype[Any]], gamma: float, lp: bool) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], float][source]

Adds a point to the upper bound

Parameters
  • upper_bound – the current upper bound

  • b – the current belief point

  • A – the set of actions

  • S – the set of states

  • O – the set of observations

  • Z – the observation tensor

  • R – the reward tensor

  • T – the transition tensor

  • gamma – the discount factor

  • lp – a boolean flag whether to use LP to compute the upper bound belief

Returns

the new point

upper_bound_value(upper_bound: Tuple[List[Any], List[Any]], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], lp: bool = False) float[source]

Computes the upper bound value of a given belief point

Parameters
  • upper_bound – the upper bound

  • b – the belief point

  • S – the set of states

  • lp – boolean flag that decides whether to use LP to compute the upper bound value or not

Returns

the upper bound value

vi(T: numpy.ndarray[Any, numpy.dtype[Any]], num_states: int, num_actions: int, R: numpy.ndarray[Any, numpy.dtype[Any]], theta=0.0001, discount_factor=1.0) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]]][source]

An implementation of the value iteration algorithm

Parameters
  • T – the transition kernel T

  • num_states – the number of states

  • num_actions – the number of actions

  • R – the table with rewards

  • theta – the convergence threshold

  • discount_factor – the discount factor

Returns

(greedy policy, value function)
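
For reference, a self-contained sketch of value iteration with the same interface, assuming T[a][s][s'] transition probabilities and R[a][s] rewards:

    import numpy as np

    def vi(T, num_states, num_actions, R, theta=0.0001, discount_factor=1.0):
        V = np.zeros(num_states)
        while True:
            delta = 0.0
            for s in range(num_states):
                q_values = [R[a][s] + discount_factor * np.dot(T[a][s], V)
                            for a in range(num_actions)]
                best = max(q_values)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < theta:
                break
        policy = np.array([int(np.argmax([R[a][s] + discount_factor * np.dot(T[a][s], V)
                                          for a in range(num_actions)]))
                           for s in range(num_states)])
        return policy, V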

width(lower_bound: List[Any], upper_bound: Tuple[List[Any], List[Any]], b: numpy.ndarray[Any, numpy.dtype[Any]], S: numpy.ndarray[Any, numpy.dtype[Any]], lp: bool) float[source]

Computes the bounds width (Trey Smith and Reid Simmons, 2004)

Parameters
  • lower_bound – the current lower bound

  • upper_bound – the current upper bound

  • b – the current belief point

  • S – the set of states

  • lp – boolean flag that decides whether to use LP to compute upper bound belief values

Returns

the width of the bounds

Module contents