csle_agents.agents.sondik_vi package

Submodules

csle_agents.agents.sondik_vi.sondik_vi_agent module

class csle_agents.agents.sondik_vi.sondik_vi_agent.SondikVIAgent(simulation_env_config: csle_common.dao.simulation_config.simulation_env_config.SimulationEnvConfig, experiment_config: csle_common.dao.training.experiment_config.ExperimentConfig, training_job: Optional[csle_common.dao.jobs.training_job_config.TrainingJobConfig] = None, save_to_metastore: bool = True, env: Optional[csle_common.dao.simulation_config.base_env.BaseEnv] = None)[source]

Bases: csle_agents.agents.base.base_agent.BaseAgent

Sondik’s value iteration for POMDPs (Sondik 1971)

check_duplicate(a, av)[source]

Check whether alpha vector av is already in set a

Parameters
  • a – the set of alpha vectors

  • av – the alpha vector to check for membership in a

Returns

True if av is already in a, otherwise False

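The membership test can be sketched as a numerical comparison over the alpha-vector set. A minimal sketch, assuming the set is a list of NumPy arrays (the tolerance argument is a hypothetical addition, not part of the documented signature):

```python
import numpy as np

def check_duplicate(a, av, tol=1e-9):
    """Return True if alpha vector av is already (numerically) in the set a."""
    return any(np.allclose(existing, av, atol=tol) for existing in a)

alpha_set = [np.array([1.0, 0.0]), np.array([0.5, 0.5])]
print(check_duplicate(alpha_set, np.array([0.5, 0.5])))  # True
print(check_duplicate(alpha_set, np.array([0.2, 0.8])))  # False
```

Using a tolerance-based comparison rather than exact equality avoids keeping near-identical vectors that differ only by floating-point noise.
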
compute_all_conditional_plans_conditioned_on_a_t(n_alpha_vectors_t_plus_one, n_obs)[source]

Compute the number of conditional plans conditioned on an action a. It produces all possible combinations of (observation -> conditional_plan)

Parameters
  • n_alpha_vectors_t_plus_one – Number of alpha-vectors (number of conditional plans) for t+1

  • n_obs – Number of observations

Returns

list of lists, where each list contains n_obs elements, and each element is in [0, n_alpha_vectors-1].

The number of conditional plans is n_alpha_vectors_t_plus_one^n_obs. A plan has the form (o^(1)_i, o^(2)_j, …, o^(n_alpha_vectors_t_plus_one)_k), where o^(1)_i means that if observation o_i is observed, conditional plan 1 should be followed; o^(2)_j means that if observation o_j is observed, conditional plan 2 should be followed; and o^(n_alpha_vectors_t_plus_one)_k means that if observation o_k is observed, conditional plan n_alpha_vectors_t_plus_one should be followed.

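The enumeration above can be sketched with itertools.product: each conditional plan is a tuple of length n_obs whose i-th entry names the next-step plan to follow after observing observation i. A sketch (the function name is hypothetical):

```python
import itertools

def all_conditional_plans(n_alpha_vectors_t_plus_one, n_obs):
    """Enumerate all (observation -> next-step conditional plan) mappings."""
    return list(itertools.product(range(n_alpha_vectors_t_plus_one), repeat=n_obs))

plans = all_conditional_plans(2, 3)
print(len(plans))  # 2^3 = 8 plans
print(plans[0])    # (0, 0, 0): follow plan 0 regardless of the observation
```

The combinatorial blow-up (n_alpha_vectors_t_plus_one^n_obs) is exactly why pruning dominated alpha vectors between iterations matters for exact value iteration.
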
evaluate_policy(policy: csle_common.dao.training.alpha_vectors_policy.AlphaVectorsPolicy, eval_batch_size: int) float[source]

Evaluates an alpha-vectors policy by simulation

Parameters
  • policy – the alpha-vectors policy to evaluate

  • eval_batch_size – the number of evaluation episodes (batch size)

Returns

the average reward over the evaluation batch

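A policy induced by a set of alpha vectors assigns to each belief b the value V(b) = max_α α·b, taking the maximizing vector's associated action. A minimal sketch of the value computation (names are hypothetical, not the AlphaVectorsPolicy API):

```python
import numpy as np

def value_of_belief(alpha_vectors, b):
    """V(b) = max over alpha vectors alpha of the dot product alpha . b."""
    return max(float(np.dot(alpha, b)) for alpha in alpha_vectors)

alphas = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.6, 0.6])]
b = np.array([0.5, 0.5])
print(value_of_belief(alphas, b))  # ~0.6, achieved by the third vector
```

Because each alpha vector is linear in b and V takes the maximum over them, the value function represented this way is piecewise linear and convex, which is the structural property Sondik's algorithm exploits.
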
hparam_names() List[str][source]
Returns

a list with the hyperparameter names

prune(n_states, aleph)[source]

Remove dominated alpha-vectors using Lark’s filtering algorithm

Parameters
  • n_states – the number of states

  • aleph – the set of alpha vectors to prune

Returns

the pruned set of alpha vectors

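Lark's filtering algorithm removes vectors dominated by convex combinations of the others, which requires solving a linear program per vector. A weaker but self-contained criterion is pointwise dominance; the following is only an illustrative sketch of that simpler check, not the LP-based method the class uses:

```python
import numpy as np

def prune_pointwise(aleph):
    """Remove alpha vectors pointwise dominated by another vector in aleph.

    Weaker than Lark's LP-based filtering, which also removes vectors
    dominated by convex combinations of other vectors.
    """
    kept = []
    for i, alpha in enumerate(aleph):
        dominated = any(
            j != i and np.all(other >= alpha) and np.any(other > alpha)
            for j, other in enumerate(aleph)
        )
        if not dominated:
            kept.append(alpha)
    return kept

vectors = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.4, 0.4])]
print(len(prune_pointwise(vectors)))  # [0.4, 0.4] is dominated, so 2 remain
```

Pointwise pruning is cheap (no LP solver needed) but keeps some vectors that never attain the maximum at any belief, so the pruned set is an over-approximation of the minimal representation.
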
sondik_vi(P, Z, R, T, gamma, n_states, n_actions, n_obs, b0, eval_batch_size: int, use_pruning: bool = True) Tuple[List[Any], List[int], List[float], List[float], List[float]][source]

Runs Sondik’s value iteration algorithm for the given POMDP

Parameters
  • P – The transition probability matrix

  • Z – The observation probability matrix

  • R – The immediate rewards matrix

  • T – The planning horizon

  • gamma – The discount factor

  • n_states – The number of states

  • n_actions – The number of actions

  • n_obs – The number of observations

  • b0 – The initial belief

  • eval_batch_size – number of simulations to evaluate the policy induced by the alpha vectors at each iteration

  • use_pruning – whether dominated alpha vectors should be pruned after each iteration

Returns

the computed alpha vectors together with the evaluation statistics recorded at each iteration

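Each backup in Sondik's algorithm relies on the Bayesian belief update b'(s') ∝ Z[a, s', o] · Σ_s P[a, s, s'] · b(s). A self-contained sketch with hypothetical example numbers (the tensor layouts P[a, s, s'] and Z[a, s', o] are assumptions chosen to match the parameter descriptions above):

```python
import numpy as np

def belief_update(P, Z, b, a, o):
    """Bayesian belief update: b'(s') ~ Z[a, s', o] * sum_s P[a, s, s'] * b(s)."""
    b_prime = Z[a, :, o] * (b @ P[a])  # unnormalized posterior over next states
    norm = b_prime.sum()
    if norm == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return b_prime / norm

# Tiny 2-state, 1-action, 2-observation POMDP (hypothetical numbers)
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])   # P[a, s, s']
Z = np.array([[[0.8, 0.2],
               [0.3, 0.7]]])   # Z[a, s', o]
b0 = np.array([0.5, 0.5])
print(belief_update(P, Z, b0, a=0, o=0))
```

The normalizing constant is the probability of observing o after taking a in belief b; it is exactly the weight each observation branch receives when the backup combines next-step alpha vectors.
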
sondik_vi_algorithm(exp_result: csle_common.dao.training.experiment_result.ExperimentResult, seed: int) csle_common.dao.training.experiment_result.ExperimentResult[source]

Runs Sondik’s value iteration algorithm with the given random seed and records the results

Parameters
  • exp_result – the experiment result object

  • seed – the random seed

Returns

the updated experiment result

train() csle_common.dao.training.experiment_execution.ExperimentExecution[source]

Runs the value iteration algorithm to compute V*

Returns

the experiment execution containing the results

Module contents