citylearn.agents.q_learning module

class citylearn.agents.q_learning.TabularQLearning(env: CityLearnEnv, epsilon: float = None, minimum_epsilon: float = None, epsilon_decay: float = None, learning_rate: float = None, discount_factor: float = None, q_init_value: float = None, **kwargs: Any)[source]

Bases: Agent

Implementation of Tabular Q-Learning algorithm for discrete observation and action space and epsilon-greedy action selection.

  • env (CityLearnEnv) – CityLearn environment.

  • epsilon (float, default: 1.0) – Exploration rate.

  • minimum_epsilon (float, default: 0.01) – Minimum value the exploration rate can decay to.

  • epsilon_decay (float, default: 0.0001) – Exponential decay rate for epsilon.

  • learning_rate (float, default: 0.05) – Defines to what degree new knowledge overrides old knowledge: for learning_rate = 0, no learning happens, while for learning_rate = 1, all prior knowledge is lost.

  • discount_factor (float, default: 0.90) – Balance between an agent that considers only immediate rewards (discount_factor = 0) and one that strives towards long-term rewards (discount_factor = 1).

  • q_init_value (float, default: np.nan) – Q-Table initialization value.

  • **kwargs (Any) – Other keyword arguments used to initialize citylearn.agents.base.Agent super class.
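The epsilon, minimum_epsilon, and epsilon_decay parameters together define an exploration schedule. A minimal sketch of one such schedule, exponential decay toward the minimum, is shown below; the exact schedule used internally by the class may differ, and decayed_epsilon is an illustrative helper, not part of the API:

```python
import math

def decayed_epsilon(initial_epsilon: float, minimum_epsilon: float,
                    epsilon_decay: float, step: int) -> float:
    """Exponentially decay epsilon toward its minimum as time steps increase.

    Illustrative sketch using the documented defaults:
    epsilon=1.0, minimum_epsilon=0.01, epsilon_decay=0.0001.
    """
    return max(minimum_epsilon,
               initial_epsilon * math.exp(-epsilon_decay * step))
```

With the defaults, exploration starts fully random (epsilon = 1.0) and asymptotically approaches the 0.01 floor as training progresses.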

predict(observations: List[List[float]], deterministic: bool = None) List[List[float]][source]

Provide actions for current time step.

If deterministic = True, or a randomly generated number is greater than epsilon, return the deterministic action from the Q-Table, i.e. the action with the maximum Q-value for the given observations; otherwise, return a randomly sampled action.

  • observations (List[List[float]]) – Environment observations.

  • deterministic (bool, default: False) – Whether to return purely exploitative deterministic actions.


actions – Action values.

Return type:

List[List[float]]


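The epsilon-greedy selection rule described above can be sketched as follows. This is an illustrative single-dimension helper over a 1-D array of Q-values, not the class's actual method (which operates on nested lists per agent):

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, deterministic=False, rng=None):
    """Select an action index from a 1-D array of Q-values.

    Exploit (argmax over Q-values) when deterministic or when a random
    draw exceeds epsilon; otherwise explore with a uniform random action.
    """
    rng = np.random.default_rng() if rng is None else rng

    if deterministic or rng.random() > epsilon:
        # Exploit: action with the maximum Q-value; NaN entries
        # (unvisited pairs when q_init_value = np.nan) are ignored.
        return int(np.nanargmax(q_values))

    # Explore: uniformly random action.
    return int(rng.integers(len(q_values)))
```

Note that with q_init_value = np.nan, using a NaN-aware argmax keeps unvisited actions from being selected during exploitation.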
update(observations: List[List[float]], actions: List[List[float]], reward: List[float], next_observations: List[List[float]], terminated: bool, truncated: bool)[source]

Update Q-Table using Bellman equation.

  • observations (List[List[float]]) – Previous time step observations.

  • actions (List[List[float]]) – Previous time step actions.

  • reward (List[float]) – Current time step reward.

  • next_observations (List[List[float]]) – Current time step observations.

  • terminated (bool) – Indication that the episode has ended.

  • truncated (bool) – Indication that the episode was truncated due to a time limit or another reason not defined as part of the task MDP.
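The Bellman update applied to the Q-Table can be sketched as below. This is a minimal single-agent, single-dimension sketch assuming discrete observation and action indices; q_update and its variable names are illustrative, not the class's internal attributes:

```python
import numpy as np

def q_update(q_table, obs, action, reward, next_obs, terminated,
             learning_rate=0.05, discount_factor=0.90):
    """Apply the Bellman update to one (observation, action) entry.

    Uses the documented defaults: learning_rate=0.05, discount_factor=0.90.
    """
    # No future value once the episode has terminated.
    best_next = 0.0 if terminated else float(np.nanmax(q_table[next_obs]))
    td_target = reward + discount_factor * best_next
    current = q_table[obs, action]
    # Blend old and new knowledge: learning_rate = 0 keeps the old value,
    # learning_rate = 1 replaces it entirely with the TD target.
    q_table[obs, action] = current + learning_rate * (td_target - current)
    return q_table
```

Setting the future value to zero on termination prevents bootstrapping past the end of the episode; on truncation, by contrast, bootstrapping from next_observations remains valid.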