csle_agents.agents.q_learning package
Submodules
csle_agents.agents.q_learning.q_learning_agent module
- class csle_agents.agents.q_learning.q_learning_agent.QLearningAgent(simulation_env_config: csle_common.dao.simulation_config.simulation_env_config.SimulationEnvConfig, experiment_config: csle_common.dao.training.experiment_config.ExperimentConfig, training_job: Optional[csle_common.dao.jobs.training_job_config.TrainingJobConfig] = None, save_to_metastore: bool = True, env: Optional[csle_common.dao.simulation_config.base_env.BaseEnv] = None)[source]
Bases: csle_agents.agents.base.base_agent.BaseAgent
Q-learning Agent
- create_policy_from_q_table(num_states: int, num_actions: int, q_table: numpy.ndarray[Any, numpy.dtype[Any]]) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Creates a tabular policy from a Q-table
- Parameters
num_states – the number of states
num_actions – the number of actions
q_table – the Q-table to derive the policy from
- Returns
the tabular policy
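A tabular policy can be derived from a Q-table by taking the greedy (argmax) action in every state and encoding it, for example, as a one-hot row per state. The sketch below is illustrative only; the function name and the one-hot encoding are assumptions, not the library's exact implementation.

```python
import numpy as np


def greedy_policy_from_q_table(num_states: int, num_actions: int, q_table: np.ndarray) -> np.ndarray:
    """Derives a deterministic tabular policy: one one-hot row per state, greedy w.r.t. the Q-table."""
    policy = np.zeros((num_states, num_actions))
    for s in range(num_states):
        best_a = int(np.argmax(q_table[s]))  # greedy action in state s
        policy[s][best_a] = 1.0
    return policy
```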
- eps_greedy(q_table: numpy.ndarray[Any, numpy.dtype[Any]], A: List[int], s: int, epsilon: float = 0.2) int [source]
Selects an action according to the epsilon-greedy strategy
- Parameters
q_table – the Q-table
A – the action space
s – the state
epsilon – the exploration epsilon
- Returns
the sampled action
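The epsilon-greedy rule explores by sampling a uniformly random action from A with probability epsilon and otherwise exploits by taking the action with the highest Q-value in state s. A minimal sketch, assuming a NumPy Q-table indexed as q_table[state][action] (not the agent's exact code):

```python
from typing import List

import numpy as np


def eps_greedy_action(q_table: np.ndarray, A: List[int], s: int, epsilon: float = 0.2) -> int:
    """Samples an action epsilon-greedily with respect to q_table in state s."""
    if np.random.rand() < epsilon:
        return int(np.random.choice(A))            # explore: uniform random action from A
    return int(A[int(np.argmax(q_table[s][A]))])   # exploit: greedy action in state s
```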
- evaluate_policy(policy: numpy.ndarray[Any, numpy.dtype[Any]], eval_batch_size: int) float [source]
Evaluates a tabular policy
- Parameters
policy – the tabular policy to evaluate
eval_batch_size – the batch size (number of evaluation episodes)
- Returns
the average reward of the evaluated policy
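Evaluating a tabular policy typically means running a batch of episodes in the environment, acting according to the policy in each state, and averaging the episode returns. The sketch below assumes a Gymnasium-style env with reset()/step() and a one-hot tabular policy; it illustrates the idea rather than the agent's implementation.

```python
import numpy as np


def evaluate_tabular_policy(env, policy: np.ndarray, eval_batch_size: int) -> float:
    """Estimates the average episode return of a one-hot tabular policy by simulation."""
    episode_returns = []
    for _ in range(eval_batch_size):
        s, _ = env.reset()                        # Gymnasium-style reset -> (obs, info)
        done, episode_return = False, 0.0
        while not done:
            a = int(np.argmax(policy[s]))         # act according to the tabular policy
            s, r, terminated, truncated, _ = env.step(a)
            episode_return += r
            done = terminated or truncated
        episode_returns.append(episode_return)
    return float(np.mean(episode_returns))
```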
- initialize_count_table(n_states: int = 256, n_actions: int = 5) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Initializes the count table
- Parameters
n_states – the number of states in the MDP
n_actions – the number of actions in the MDP
- Returns
the initialized count table
- initialize_q_table(n_states: int = 256, n_actions: int = 5) numpy.ndarray[Any, numpy.dtype[Any]] [source]
Initializes the Q table
- Parameters
n_states – the number of states in the MDP
n_actions – the number of actions in the MDP
- Returns
the initialized Q table
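Both the Q-table and the count table are naturally represented as (n_states, n_actions) arrays initialized to zeros; a minimal sketch of both initializers, assuming NumPy (the combined helper name is an assumption):

```python
import numpy as np


def initialize_tables(n_states: int = 256, n_actions: int = 5):
    """Zero-initialized Q-table and state-action count table, both of shape (n_states, n_actions)."""
    q_table = np.zeros((n_states, n_actions))
    count_table = np.zeros((n_states, n_actions))
    return q_table, count_table
```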
- q_learning(exp_result: csle_common.dao.training.experiment_result.ExperimentResult, seed: int) csle_common.dao.training.experiment_result.ExperimentResult [source]
Runs the Q-learning algorithm
- Parameters
exp_result – the experiment result object
seed – the random seed
- Returns
the updated experiment result
- q_learning_update(q_table: numpy.ndarray[Any, numpy.dtype[Any]], count_table: numpy.ndarray[Any, numpy.dtype[Any]], s: int, a: int, r: float, s_prime: int, gamma: float, done: bool) Tuple[numpy.ndarray[Any, numpy.dtype[Any]], numpy.ndarray[Any, numpy.dtype[Any]], float] [source]
Watkins' Q-learning update
- Parameters
q_table – the Q-table
count_table – the count table (used to determine the stochastic approximation (SA) step sizes)
s – the sampled state
a – the exploration action
r – the reward
s_prime – the next sampled state
gamma – the discount factor
done – boolean flag indicating whether s_prime is terminal
- Returns
the updated Q-table, the updated count table, and the updated learning rate
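The Watkins update increments the visit count of the pair (s, a), derives a stochastic-approximation step size from that count (here 1/n, an assumed schedule), and moves Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a'), dropping the bootstrap term when s' is terminal. A minimal sketch, not the agent's exact implementation:

```python
from typing import Tuple

import numpy as np


def watkins_q_update(q_table: np.ndarray, count_table: np.ndarray, s: int, a: int, r: float,
                     s_prime: int, gamma: float, done: bool) -> Tuple[np.ndarray, np.ndarray, float]:
    """One Watkins Q-learning update with a count-based (1/n) step size."""
    count_table[s][a] += 1
    alpha = 1.0 / count_table[s][a]                            # SA step size for this (s, a) pair
    target = r if done else r + gamma * float(np.max(q_table[s_prime]))
    q_table[s][a] = q_table[s][a] + alpha * (target - q_table[s][a])
    return q_table, count_table, alpha
```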
- step_size(n: int) float [source]
Calculates the stochastic approximation (SA) step size
- Parameters
n – the iteration
- Returns
the step size
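A common stochastic-approximation schedule that satisfies the Robbins-Monro conditions (step sizes sum to infinity while their squares sum to a finite value) is alpha_n = 1/n. The one-liner below illustrates that choice; the exact schedule used by the agent is an assumption.

```python
def sa_step_size(n: int) -> float:
    """Robbins-Monro style step size 1/n (assumed schedule); n is the iteration count (>= 1)."""
    return 1.0 / max(n, 1)
```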
- train() csle_common.dao.training.experiment_execution.ExperimentExecution [source]
Runs the Q-learning algorithm to compute Q*
- Returns
the experiment execution containing the training results
- train_q_learning(A: List[int], S: List[int], gamma: float = 0.8, N: int = 10000, epsilon: float = 0.2, epsilon_decay: float = 1.0) Tuple[List[float], List[float], List[float], List[List[float]], List[List[float]]] [source]
Runs the Q-learning algorithm
- Parameters
A – the action space
S – the state space
gamma – the discount factor
N – the number of iterations
epsilon – the exploration parameter
epsilon_decay – the epsilon decay rate
- Returns
the average returns, the running average returns, the initial state values, the Q-table, and the policy
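Putting the pieces together, tabular Q-learning alternates epsilon-greedy action selection, Watkins updates with count-based step sizes, and epsilon decay over N episodes, and finally extracts the greedy policy from the learned Q-table. The following self-contained sketch assumes a Gymnasium-style env and that the action space A is 0..len(A)-1; it outlines the procedure rather than reproducing the library's train_q_learning.

```python
import numpy as np


def train_q_learning_sketch(env, A, S, gamma=0.8, N=10000, epsilon=0.2, epsilon_decay=1.0):
    """Tabular Q-learning outline: epsilon-greedy exploration, Watkins updates, epsilon decay."""
    q_table = np.zeros((len(S), len(A)))
    count_table = np.zeros((len(S), len(A)))
    average_returns = []
    for _ in range(N):
        s, _ = env.reset()                                    # Gymnasium-style reset
        done, episode_return = False, 0.0
        while not done:
            # epsilon-greedy action selection (assumes A = [0, ..., len(A) - 1])
            if np.random.rand() < epsilon:
                a = int(np.random.choice(A))
            else:
                a = int(np.argmax(q_table[s]))
            s_prime, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Watkins Q-learning update with a 1/n state-action step size
            count_table[s][a] += 1
            alpha = 1.0 / count_table[s][a]
            target = r if done else r + gamma * np.max(q_table[s_prime])
            q_table[s][a] += alpha * (target - q_table[s][a])
            episode_return += r
            s = s_prime
        average_returns.append(episode_return)
        epsilon *= epsilon_decay                              # decay the exploration rate per episode
    policy = np.eye(len(A))[np.argmax(q_table, axis=1)]       # greedy one-hot tabular policy
    return average_returns, q_table, policy
```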