citylearn.agents.q_learning module

class citylearn.agents.q_learning.TabularQLearning(env: CityLearnEnv, epsilon: float = None, minimum_epsilon: float = None, epsilon_decay: float = None, learning_rate: float = None, discount_factor: float = None, q_init_value: float = None, **kwargs: Any)[source]

Bases: Agent

Implementation of the tabular Q-Learning algorithm for discrete observation and action spaces, with epsilon-greedy action selection.

Parameters:

env (CityLearnEnv) – CityLearn environment.
epsilon (float, default: 1.0) – Exploration rate.
minimum_epsilon (float, default: 0.01) – Minimum value the exploration rate can decay to.
epsilon_decay (float, default: 0.0001) – Exponential decay rate of epsilon.
learning_rate (float, default: 0.05) – Defines to what degree new knowledge overrides old knowledge: for learning_rate = 0, no learning happens, while for learning_rate = 1, all prior knowledge is lost.
discount_factor (float, default: 0.90) – Balance between an agent that considers only immediate rewards (discount_factor = 0) and one that strives towards long-term rewards (discount_factor = 1).
q_init_value (float, default: np.nan) – Q-Table initialization value.
**kwargs (Any) – Other keyword arguments used to initialize the citylearn.agents.base.Agent superclass.
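
The interaction between epsilon, minimum_epsilon and epsilon_decay can be illustrated with a minimal sketch. It assumes the common schedule epsilon_init * exp(-epsilon_decay * step), floored at minimum_epsilon. The helper name decayed_epsilon, and the choice of a step counter as the decay variable, are assumptions for illustration and are not taken from the class itself.

```python
import math


def decayed_epsilon(step, epsilon_init=1.0, minimum_epsilon=0.01, epsilon_decay=0.0001):
    """Hypothetical helper: exponential decay of the exploration rate.

    Assumes epsilon follows epsilon_init * exp(-epsilon_decay * step)
    and is never allowed to fall below minimum_epsilon.
    """
    return max(minimum_epsilon, epsilon_init * math.exp(-epsilon_decay * step))


# Early on the agent explores almost always; later the floor takes over.
for step in (0, 10_000, 100_000):
    print(step, decayed_epsilon(step))
```

With the documented defaults, this schedule starts at 1.0 and is clamped to 0.01 once the exponential term drops below the floor.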

predict(observations: List[List[float]], deterministic: bool = None) → List[List[float]][source]

Provide actions for current time step.

If deterministic = True, or if a randomly generated number is greater than epsilon, return the deterministic action from the Q-Table, i.e. the action with the maximum Q-value for the given observations; otherwise, return a randomly sampled action.

Parameters:

observations (List[List[float]]) – Environment observations
deterministic (bool, default: False) – Whether to return purely exploitative, deterministic actions.

Returns:

actions – Action values

Return type:

List[List[float]]
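
The selection rule described for predict can be sketched for a single row of a Q-table as follows. This is an illustrative sketch, not the class's code: the helper name epsilon_greedy, the use of a NumPy random generator, and the fallback to a random action when every Q-value in the row is NaN (plausible given the np.nan default of q_init_value) are all assumptions.

```python
import numpy as np


def epsilon_greedy(q_row, epsilon, rng, deterministic=False):
    """Hypothetical helper: choose an action index from one Q-table row."""
    if deterministic or rng.random() > epsilon:
        try:
            # Exploit: index of the largest non-NaN Q-value.
            return int(np.nanargmax(q_row))
        except ValueError:
            # np.nanargmax raises ValueError for an all-NaN row,
            # so fall through to a random action (assumed behaviour).
            pass
    # Explore: uniformly sampled action index.
    return int(rng.integers(len(q_row)))


rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.7, np.nan])

greedy_action = epsilon_greedy(q_row, epsilon=1.0, rng=rng, deterministic=True)
random_action = epsilon_greedy(q_row, epsilon=1.0, rng=rng)
print(greedy_action, random_action)
```

Because rng.random() draws from [0, 1), an epsilon of 1.0 always explores unless deterministic is set, which matches the documented default starting point of pure exploration.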

update(observations: List[List[float]], actions: List[List[float]], reward: List[float], next_observations: List[List[float]], terminated: bool, truncated: bool)[source]

Update the Q-Table using the Bellman equation.

Parameters:

observations (List[List[float]]) – Previous time step observations.
actions (List[List[float]]) – Previous time step actions.
reward (List[float]) – Current time step reward.
next_observations (List[List[float]]) – Current time step observations.
terminated (bool) – Whether the episode has ended.
truncated (bool) – Whether the episode was truncated due to a time limit or a reason that is not defined as part of the task MDP.
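
As a worked illustration of the Bellman update, the sketch below applies the standard Q-Learning rule Q(s, a) ← Q(s, a) + learning_rate * (reward + discount_factor * max Q(s', ·) − Q(s, a)) to a small table. The helper name q_update, the integer state and action indices, and the zero-initialized table are assumptions made for the example. It ignores terminated and truncated, whose handling by the class is not described here.

```python
import numpy as np


def q_update(q_table, state, action, reward, next_state,
             learning_rate=0.05, discount_factor=0.90):
    """Hypothetical helper: one standard Q-Learning update on a 2-D table."""
    current = q_table[state, action]
    # Best value attainable from the next state under the current estimates.
    best_next = np.max(q_table[next_state])
    target = reward + discount_factor * best_next
    # Move the old estimate a fraction learning_rate towards the target.
    q_table[state, action] = current + learning_rate * (target - current)
    return q_table[state, action]


# Two states, two actions; finite initial values assumed for clarity.
q_table = np.zeros((2, 2))
q_table[1] = [1.0, 2.0]

new_value = q_update(q_table, state=0, action=1, reward=1.0, next_state=1)
print(new_value)
```

Here the target is 1.0 + 0.90 * 2.0 = 2.8, so the entry moves from 0 to 0.05 * 2.8 = 0.14 (up to floating-point rounding). Setting learning_rate to 0 would leave the entry unchanged, and setting it to 1 would replace it with the target outright, matching the parameter description above.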