csle_agents.agents.sondik_vi package

Submodules

csle_agents.agents.sondik_vi.sondik_vi_agent module

class csle_agents.agents.sondik_vi.sondik_vi_agent.SondikVIAgent(simulation_env_config: csle_common.dao.simulation_config.simulation_env_config.SimulationEnvConfig, experiment_config: csle_common.dao.training.experiment_config.ExperimentConfig, training_job: Optional[csle_common.dao.jobs.training_job_config.TrainingJobConfig] = None, save_to_metastore: bool = True, env: Optional[csle_common.dao.simulation_config.base_env.BaseEnv] = None)[source]

Bases: csle_agents.agents.base.base_agent.BaseAgent

Sondik’s value iteration for POMDPs (Sondik 1971)

check_duplicate(a, av)[source]

Check whether alpha vector av is already in set a

Parameters
  • a – the set of alpha vectors

  • av – the alpha vector to check for membership in a

Returns

True if av is already in a, otherwise False

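The membership test can be sketched as a numerical comparison over the alpha-vector set. A minimal sketch, assuming the set is a list of NumPy arrays (the tolerance argument is a hypothetical addition, not part of the documented signature):

```python
import numpy as np

def check_duplicate(a, av, tol=1e-9):
    """Return True if alpha vector av is already (numerically) in the set a."""
    return any(np.allclose(existing, av, atol=tol) for existing in a)

alpha_set = [np.array([1.0, 0.0]), np.array([0.5, 0.5])]
print(check_duplicate(alpha_set, np.array([0.5, 0.5])))  # True
print(check_duplicate(alpha_set, np.array([0.2, 0.8])))  # False
```

Using a tolerance-based comparison rather than exact equality avoids keeping near-identical vectors that differ only by floating-point noise.
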
compute_all_conditional_plans_conditioned_on_a_t(n_alpha_vectors_t_plus_one, n_obs)[source]

Compute the number of conditional plans conditioned on an action a. It produces all possible combinations of (observation -> conditional_plan)

Parameters
  • n_alpha_vectors_t_plus_one – Number of alpha-vectors (number of conditional plans) for t+1

  • n_obs – Number of observations

Returns

list of lists, where each list contains n_obs elements, and each element is in [0, n_alpha_vectors-1].

The number of conditional plans is n_alpha_vectors_t_plus_one^n_obs. A plan has the form (o^(1)_i, o^(2)_j, …, o^(n_alpha_vectors_t_plus_one)_k), where o^(1)_i means that if observation o_i is observed, conditional plan 1 should be followed; o^(2)_j means that if observation o_j is observed, conditional plan 2 should be followed; and o^(n_alpha_vectors_t_plus_one)_k means that if observation o_k is observed, conditional plan n_alpha_vectors_t_plus_one should be followed.

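The enumeration above can be sketched with itertools.product: each conditional plan is a tuple of length n_obs whose i-th entry names the next-step plan to follow after observing observation i. A sketch (the function name is hypothetical):

```python
import itertools

def all_conditional_plans(n_alpha_vectors_t_plus_one, n_obs):
    """Enumerate all (observation -> next-step conditional plan) mappings."""
    return list(itertools.product(range(n_alpha_vectors_t_plus_one), repeat=n_obs))

plans = all_conditional_plans(2, 3)
print(len(plans))  # 2^3 = 8 plans
print(plans[0])    # (0, 0, 0): follow plan 0 regardless of the observation
```

The combinatorial blow-up (n_alpha_vectors_t_plus_one^n_obs) is exactly why pruning dominated alpha vectors between iterations matters for exact value iteration.
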
evaluate_policy(policy: csle_common.dao.training.alpha_vectors_policy.AlphaVectorsPolicy, eval_batch_size: int) float[source]

Evaluates an alpha-vectors policy by simulation

Parameters
  • policy – the alpha-vectors policy to evaluate

  • eval_batch_size – the number of evaluation episodes (batch size)

Returns

the average reward over the evaluation batch

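A policy induced by a set of alpha vectors assigns to each belief b the value V(b) = max_α α·b, taking the maximizing vector's associated action. A minimal sketch of the value computation (names are hypothetical, not the AlphaVectorsPolicy API):

```python
import numpy as np

def value_of_belief(alpha_vectors, b):
    """V(b) = max over alpha vectors alpha of the dot product alpha . b."""
    return max(float(np.dot(alpha, b)) for alpha in alpha_vectors)

alphas = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.6, 0.6])]
b = np.array([0.5, 0.5])
print(value_of_belief(alphas, b))  # ~0.6, achieved by the third vector
```

Because each alpha vector is linear in b and V takes the maximum over them, the value function represented this way is piecewise linear and convex, which is the structural property Sondik's algorithm exploits.
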
hparam_names() List[str][source]
Returns

a list with the hyperparameter names

prune(n_states, aleph)[source]

Remove dominated alpha-vectors using Lark’s filtering algorithm

Parameters
  • n_states – the number of states

  • aleph – the set of alpha vectors to prune

Returns

the pruned set of alpha vectors

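Lark's filtering algorithm removes vectors dominated by convex combinations of the others, which requires solving a linear program per vector. A weaker but self-contained criterion is pointwise dominance; the following is only an illustrative sketch of that simpler check, not the LP-based method the class uses:

```python
import numpy as np

def prune_pointwise(aleph):
    """Remove alpha vectors pointwise dominated by another vector in aleph.

    Weaker than Lark's LP-based filtering, which also removes vectors
    dominated by convex combinations of other vectors.
    """
    kept = []
    for i, alpha in enumerate(aleph):
        dominated = any(
            j != i and np.all(other >= alpha) and np.any(other > alpha)
            for j, other in enumerate(aleph)
        )
        if not dominated:
            kept.append(alpha)
    return kept

vectors = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.4, 0.4])]
print(len(prune_pointwise(vectors)))  # [0.4, 0.4] is dominated, so 2 remain
```

Pointwise pruning is cheap (no LP solver needed) but keeps some vectors that never attain the maximum at any belief, so the pruned set is an over-approximation of the minimal representation.
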
sondik_vi(P, Z, R, T, gamma, n_states, n_actions, n_obs, b0, eval_batch_size: int, use_pruning: bool = True) Tuple[List[Any], List[int], List[float], List[float], List[float]][source]

Runs Sondik’s value iteration algorithm for the given POMDP

Parameters
  • P – The transition probability matrix

  • Z – The observation probability matrix

  • R – The immediate rewards matrix

  • T – The planning horizon

  • gamma – The discount factor

  • n_states – The number of states

  • n_actions – The number of actions

  • n_obs – The number of observations

  • b0 – The initial belief

  • eval_batch_size – number of simulations to evaluate the policy induced by the alpha vectors at each iteration

  • use_pruning – whether dominated alpha vectors should be pruned after each iteration

Returns

the computed alpha vectors together with the evaluation statistics recorded at each iteration

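Each backup in Sondik's algorithm relies on the Bayesian belief update b'(s') ∝ Z[a, s', o] · Σ_s P[a, s, s'] · b(s). A self-contained sketch with hypothetical example numbers (the tensor layouts P[a, s, s'] and Z[a, s', o] are assumptions chosen to match the parameter descriptions above):

```python
import numpy as np

def belief_update(P, Z, b, a, o):
    """Bayesian belief update: b'(s') ~ Z[a, s', o] * sum_s P[a, s, s'] * b(s)."""
    b_prime = Z[a, :, o] * (b @ P[a])  # unnormalized posterior over next states
    norm = b_prime.sum()
    if norm == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return b_prime / norm

# Tiny 2-state, 1-action, 2-observation POMDP (hypothetical numbers)
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])   # P[a, s, s']
Z = np.array([[[0.8, 0.2],
               [0.3, 0.7]]])   # Z[a, s', o]
b0 = np.array([0.5, 0.5])
print(belief_update(P, Z, b0, a=0, o=0))
```

The normalizing constant is the probability of observing o after taking a in belief b; it is exactly the weight each observation branch receives when the backup combines next-step alpha vectors.
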
sondik_vi_algorithm(exp_result: csle_common.dao.training.experiment_result.ExperimentResult, seed: int) csle_common.dao.training.experiment_result.ExperimentResult[source]

Runs Sondik’s value iteration algorithm with the given random seed and records the results

Parameters
  • exp_result – the experiment result object

  • seed – the random seed

Returns

the updated experiment result

train() csle_common.dao.training.experiment_execution.ExperimentExecution[source]

Runs the value iteration algorithm to compute V*

Returns

the experiment execution containing the results

Module contents