csle_agents.agents.sondik_vi package
Submodules
csle_agents.agents.sondik_vi.sondik_vi_agent module
- class csle_agents.agents.sondik_vi.sondik_vi_agent.SondikVIAgent(simulation_env_config: csle_common.dao.simulation_config.simulation_env_config.SimulationEnvConfig, experiment_config: csle_common.dao.training.experiment_config.ExperimentConfig, training_job: Optional[csle_common.dao.jobs.training_job_config.TrainingJobConfig] = None, save_to_metastore: bool = True, env: Optional[csle_common.dao.simulation_config.base_env.BaseEnv] = None)[source]
Bases:
csle_agents.agents.base.base_agent.BaseAgent
Sondik’s value iteration for POMDPs (Sondik 1971)
- check_duplicate(a, av)[source]
Check whether alpha vector av is already in set a
- Parameters
a – the set of alpha vectors
av – the alpha vector to check for membership in a
- Returns
True if av is already in a, otherwise False
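For illustration, a duplicate check of this kind can be sketched with numpy; the helper name and the assumption that alpha vectors are numpy arrays stored in a list are hypothetical:

    import numpy as np
    from typing import List

    def contains_alpha_vector(a: List[np.ndarray], av: np.ndarray) -> bool:
        """Illustrative duplicate check: True if av is numerically equal to a vector already in a"""
        return any(np.allclose(existing, av) for existing in a)

    alpha_set = [np.array([1.0, 0.0]), np.array([0.5, 0.5])]
    print(contains_alpha_vector(alpha_set, np.array([0.5, 0.5])))  # True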
- compute_all_conditional_plans_conditioned_on_a_t(n_alpha_vectors_t_plus_one, n_obs)[source]
Compute all conditional plans at time t conditioned on an action a. It produces all possible combinations of (observation -> conditional plan)
- Parameters
n_alpha_vectors_t_plus_one – Number of alpha-vectors (number of conditional plans) for t+1
n_obs – Number of observations
- Returns
list of lists, where each list contains n_obs elements and each element is in [0, n_alpha_vectors_t_plus_one - 1].
The number of conditional plans is n_alpha_vectors_t_plus_one^n_obs. A plan is of the form (o^(1)_i, o^(2)_j, …, o^(n_alpha_vectors_t_plus_one)_k), where o^(1)_i means that if observation o_i is observed, conditional plan 1 should be followed, o^(2)_j means that if observation o_j is observed, conditional plan 2 should be followed, and o^(n_alpha_vectors_t_plus_one)_k means that if observation o_k is observed, conditional plan n_alpha_vectors_t_plus_one should be followed.
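The enumeration described above can be reproduced with itertools.product; a minimal sketch, assuming each plan is represented as a tuple of alpha-vector indices at t+1 with one entry per observation (the function name is hypothetical):

    import itertools
    from typing import List, Tuple

    def all_conditional_plans(n_alpha_vectors_t_plus_one: int, n_obs: int) -> List[Tuple[int, ...]]:
        """Enumerate every mapping observation -> index of a conditional plan (alpha vector) at t+1"""
        return list(itertools.product(range(n_alpha_vectors_t_plus_one), repeat=n_obs))

    # With 2 alpha vectors at t+1 and 2 observations there are 2^2 = 4 plans:
    print(all_conditional_plans(2, 2))  # [(0, 0), (0, 1), (1, 0), (1, 1)]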
- evaluate_policy(policy: csle_common.dao.training.alpha_vectors_policy.AlphaVectorsPolicy, eval_batch_size: int) float [source]
Evaluates an alpha-vectors policy by simulation
- Parameters
policy – the alpha-vectors policy to evaluate
eval_batch_size – the batch size (number of evaluation episodes)
- Returns
the average reward of the evaluated policy
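A Monte-Carlo evaluation of this kind can be sketched as follows; the gym-style environment interface and the policy's action method are assumptions made for illustration, not the exact interfaces used by the agent:

    import numpy as np

    def evaluate_by_simulation(env, policy, eval_batch_size: int, horizon: int = 100) -> float:
        """Illustrative Monte-Carlo evaluation: average cumulative reward over eval_batch_size episodes"""
        returns = []
        for _ in range(eval_batch_size):
            obs, _ = env.reset()
            cumulative_reward = 0.0
            for _ in range(horizon):
                action = policy.action(obs)  # assumed policy interface
                obs, reward, terminated, truncated, _ = env.step(action)
                cumulative_reward += reward
                if terminated or truncated:
                    break
            returns.append(cumulative_reward)
        return float(np.mean(returns))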
- prune(n_states, aleph)[source]
Remove dominated alpha-vectors using Lark’s filtering algorithm
- Parameters
n_states – the number of states
aleph – the set of alpha vectors to prune
- Returns
the pruned set of alpha vectors
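Lark’s filtering tests each alpha vector with a linear program and keeps it only if some belief strictly prefers it over the other vectors. A simplified sketch of the idea using scipy (the function name and the representation of alpha vectors as numpy arrays are assumptions):

    import numpy as np
    from scipy.optimize import linprog

    def prune_dominated(alpha_vectors, n_states):
        """Illustrative LP-based pruning: keep an alpha vector only if it is strictly best at some belief"""
        kept = []
        for i, alpha in enumerate(alpha_vectors):
            others = [np.array(a) for j, a in enumerate(alpha_vectors) if j != i]
            if not others:
                kept.append(alpha)
                continue
            # Decision variables: belief b (n_states entries) and margin delta; maximize delta
            c = np.zeros(n_states + 1)
            c[-1] = -1.0  # linprog minimizes, so minimize -delta
            # For every other vector a': b . (a' - alpha) + delta <= 0
            A_ub = np.array([np.append(a - np.array(alpha), 1.0) for a in others])
            b_ub = np.zeros(len(others))
            # The belief must lie on the probability simplex: sum(b) = 1, b >= 0
            A_eq = np.append(np.ones(n_states), 0.0).reshape(1, -1)
            b_eq = np.array([1.0])
            bounds = [(0, 1)] * n_states + [(None, None)]
            res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
            if res.success and -res.fun > 1e-9:  # delta > 0: alpha improves on all others at some belief
                kept.append(alpha)
        return kept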
- sondik_vi(P, Z, R, T, gamma, n_states, n_actions, n_obs, b0, eval_batch_size: int, use_pruning: bool = True) Tuple[List[Any], List[int], List[float], List[float], List[float]] [source]
Runs Sondik’s value iteration (Sondik 1971) to compute the alpha vectors of the POMDP over the planning horizon T
- Parameters
P – The transition probability matrix
Z – The observation probability matrix
R – The immediate rewards matrix
T – The planning horizon
gamma – The discount factor
n_states – The number of states
n_actions – The number of actions
n_obs – The number of observations
b0 – The initial belief
eval_batch_size – The number of simulations used to evaluate the policy induced by the alpha vectors at each iteration
use_pruning – Whether dominated alpha vectors should be pruned with Lark’s filtering algorithm
- Returns
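A usage sketch on a toy two-state POMDP is shown below; the matrix shapes (P: actions x states x states, Z: actions x states x observations, R: actions x states) and the agent construction are assumptions and should be adapted to the actual simulation configuration:

    import numpy as np

    # Toy two-state, two-action, two-observation POMDP; the shapes below are assumptions
    n_states, n_actions, n_obs = 2, 2, 2
    P = np.array([[[0.9, 0.1], [0.1, 0.9]],      # P[a, s, s']: transition probabilities
                  [[0.5, 0.5], [0.5, 0.5]]])
    Z = np.array([[[0.85, 0.15], [0.15, 0.85]],  # Z[a, s', o]: observation probabilities
                  [[0.5, 0.5], [0.5, 0.5]]])
    R = np.array([[1.0, -1.0],                   # R[a, s]: immediate rewards
                  [0.0, 0.0]])
    b0 = np.array([0.5, 0.5])                    # uniform initial belief

    # Assuming `agent` is a SondikVIAgent constructed with valid simulation and experiment configs:
    # alpha_vectors, *stats = agent.sondik_vi(P=P, Z=Z, R=R, T=5, gamma=0.95,
    #                                         n_states=n_states, n_actions=n_actions, n_obs=n_obs,
    #                                         b0=b0, eval_batch_size=10, use_pruning=True)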