csle_agents.agents.q_learning package

Submodules

csle_agents.agents.q_learning.q_learning_agent module

class csle_agents.agents.q_learning.q_learning_agent.QLearningAgent(simulation_env_config: csle_common.dao.simulation_config.simulation_env_config.SimulationEnvConfig, experiment_config: csle_common.dao.training.experiment_config.ExperimentConfig, training_job: Optional[csle_common.dao.jobs.training_job_config.TrainingJobConfig] = None, save_to_metastore: bool = True, env: Optional[csle_common.dao.simulation_config.base_env.BaseEnv] = None)[source]

Bases: csle_agents.agents.base.base_agent.BaseAgent

Q-learning Agent

create_policy_from_q_table(num_states: int, num_actions: int, q_table: numpy.ndarray[Any, numpy.dtype[Any]]) numpy.ndarray[Any, numpy.dtype[Any]][source]

Creates a tabular policy from a Q-table

Parameters
  • num_states – the number of states

  • num_actions – the number of actions

  • q_table – the Q-table to derive the policy from

Returns

the tabular policy
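
A minimal sketch of how such a policy could be derived, assuming the Q-table is a NumPy array of shape (num_states, num_actions) and the policy is encoded as a one-hot matrix (the exact encoding used by the agent is an assumption):

    import numpy as np

    def greedy_policy_from_q_table(num_states: int, num_actions: int,
                                   q_table: np.ndarray) -> np.ndarray:
        # One-hot tabular policy: row s puts probability 1 on argmax_a Q(s, a)
        policy = np.zeros((num_states, num_actions))
        policy[np.arange(num_states), np.argmax(q_table, axis=1)] = 1.0
        return policy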

eps_greedy(q_table: numpy.ndarray[Any, numpy.dtype[Any]], A: List[int], s: int, epsilon: float = 0.2) int[source]

Selects an action according to the epsilon-greedy strategy

Parameters
  • q_table – the Q-table

  • A – the action space

  • s – the state

  • epsilon – the exploration epsilon

Returns

the sampled action
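
A minimal sketch of epsilon-greedy selection under these parameter conventions (the standard formulation; not necessarily the exact implementation):

    import random
    import numpy as np

    def eps_greedy(q_table: np.ndarray, A: list, s: int, epsilon: float = 0.2) -> int:
        # Explore: with probability epsilon, pick a uniformly random action
        if random.random() < epsilon:
            return random.choice(A)
        # Exploit: otherwise pick the greedy action argmax_a Q(s, a)
        return int(np.argmax(q_table[s]))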

evaluate_policy(policy: numpy.ndarray[Any, numpy.dtype[Any]], eval_batch_size: int) float[source]

Evaluates a tabular policy

Parameters
  • policy – the tabular policy to evaluate

  • eval_batch_size – the batch size

Returns

the average return of the evaluated policy
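
A plausible Monte Carlo evaluation sketch, assuming an environment that follows the Gymnasium reset/step API and a one-hot tabular policy (both are assumptions):

    import numpy as np

    def evaluate_tabular_policy(env, policy: np.ndarray, eval_batch_size: int) -> float:
        # Run eval_batch_size episodes, acting greedily according to the
        # tabular policy, and average the episode returns
        returns = []
        for _ in range(eval_batch_size):
            s, _ = env.reset()
            done, episode_return = False, 0.0
            while not done:
                a = int(np.argmax(policy[s]))
                s, r, terminated, truncated, _ = env.step(a)
                episode_return += r
                done = terminated or truncated
            returns.append(episode_return)
        return float(np.mean(returns))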

hparam_names() List[str][source]

Gets the names of the agent's hyperparameters

Returns

a list with the hyperparameter names

initialize_count_table(n_states: int = 256, n_actions: int = 5) numpy.ndarray[Any, numpy.dtype[Any]][source]

Initializes the count table

Parameters
  • n_states – the number of states in the MDP

  • n_actions – the number of actions in the MDP

Returns

the initialized count table

initialize_q_table(n_states: int = 256, n_actions: int = 5) numpy.ndarray[Any, numpy.dtype[Any]][source]

Initializes the Q-table

Parameters
  • n_states – the number of states in the MDP

  • n_actions – the number of actions in the MDP

Returns

the initialized Q-table
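
Both tables can plausibly be zero-initialized arrays of shape (n_states, n_actions); the exact initialization scheme is an assumption:

    import numpy as np

    def initialize_tables(n_states: int = 256, n_actions: int = 5):
        # The Q-table holds value estimates Q(s, a); the count table tracks
        # the number of visits to each state-action pair
        q_table = np.zeros((n_states, n_actions))
        count_table = np.zeros((n_states, n_actions))
        return q_table, count_table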

q_learning(exp_result: csle_common.dao.training.experiment_result.ExperimentResult, seed: int) csle_common.dao.training.experiment_result.ExperimentResult[source]

Runs the Q-learning algorithm

Parameters
  • exp_result – the experiment result object

  • seed – the random seed

Returns

the updated experiment result

q_learning_update(q_table: numpy.ndarray[Any, numpy.dtype[Any]], count_table: numpy.ndarray[Any, numpy.dtype[Any]], s: int, a: int, r: float, s_prime: int, gamma: float, done: bool) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]], float][source]

Watkins' Q-learning update

Parameters
  • q_table – the Q-table

  • count_table – the count table (used to determine the stochastic approximation (SA) step sizes)

  • s – the sampled state

  • a – the exploration action

  • r – the reward

  • s_prime – the next sampled state

  • gamma – the discount factor

  • done – boolean flag indicating whether s_prime is terminal

Returns

the updated Q-table, the updated count table, and the updated learning rate
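
A sketch of the Watkins update with a count-based step size. The schedule alpha = 1/n(s, a) is an assumption: it is a common Robbins-Monro choice but may differ from the agent's actual step_size:

    from typing import Tuple
    import numpy as np

    def q_learning_update(q_table: np.ndarray, count_table: np.ndarray,
                          s: int, a: int, r: float, s_prime: int,
                          gamma: float, done: bool) -> Tuple[np.ndarray, np.ndarray, float]:
        # Increment the visit count for (s, a) and derive the SA step size
        count_table[s][a] += 1
        alpha = 1.0 / count_table[s][a]
        # Bootstrap target: r if s_prime is terminal, else r + gamma * max_a' Q(s', a')
        target = r if done else r + gamma * np.max(q_table[s_prime])
        # Move Q(s, a) toward the target by the step size
        q_table[s][a] += alpha * (target - q_table[s][a])
        return q_table, count_table, alpha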

step_size(n: int) float[source]

Calculates the stochastic approximation (SA) step size

Parameters

n – the iteration number

Returns

the step size
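
One standard schedule is alpha_n = 1/n, which satisfies the Robbins-Monro conditions (the alpha_n sum to infinity while their squares sum to a finite value) required for tabular Q-learning convergence; whether this is the schedule used here is an assumption:

    def step_size(n: int) -> float:
        # Robbins-Monro schedule: sum(1/n) = inf, sum(1/n^2) < inf
        return 1.0 / n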

train() csle_common.dao.training.experiment_execution.ExperimentExecution[source]

Runs the Q-learning algorithm to compute Q*

Returns

the experiment execution containing the results

train_q_learning(A: List[int], S: List[int], gamma: float = 0.8, N: int = 10000, epsilon: float = 0.2, epsilon_decay: float = 1.0) Tuple[List[float], List[float], List[float], List[List[float]], List[List[float]]][source]

Runs the Q-learning algorithm

Parameters
  • A – the action space

  • S – the state space

  • gamma – the discount factor

  • N – the number of iterations

  • epsilon – the exploration parameter

  • epsilon_decay – the epsilon decay rate

Returns

the average returns, the running average returns, the initial state values, the Q-table, and the policy
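
A simplified end-to-end training loop combining the sketches above (the Gymnasium-style env and the eps_greedy and q_learning_update helpers are assumptions; the documented return tuple with average and running-average returns is trimmed for brevity):

    import numpy as np

    def train_q_learning(env, A, S, gamma=0.8, N=10000,
                         epsilon=0.2, epsilon_decay=1.0):
        # Tabular Q-learning: interact for N steps, select actions
        # epsilon-greedily, and apply the Watkins update after each step
        q_table = np.zeros((len(S), len(A)))
        count_table = np.zeros((len(S), len(A)))
        s, _ = env.reset()
        for _ in range(N):
            a = eps_greedy(q_table, A, s, epsilon)
            s_prime, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            q_table, count_table, _ = q_learning_update(
                q_table, count_table, s, a, r, s_prime, gamma, done)
            epsilon *= epsilon_decay
            s = env.reset()[0] if done else s_prime
        # Derive the greedy one-hot policy from the learned Q-table
        policy = np.zeros((len(S), len(A)))
        policy[np.arange(len(S)), np.argmax(q_table, axis=1)] = 1.0
        return q_table, policy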

Module contents