citylearn.agents.q_learning module

class citylearn.agents.q_learning.TabularQLearning(env: citylearn.citylearn.CityLearnEnv, epsilon: Optional[float] = None, minimum_epsilon: Optional[float] = None, epsilon_decay: Optional[float] = None, learning_rate: Optional[float] = None, discount_factor: Optional[float] = None, q_init_value: Optional[float] = None, **kwargs: Any)[source]

Bases: citylearn.agents.base.Agent

Implementation of Tabular Q-Learning algorithm for discrete observation and action space and epsilon-greedy action selection.

Parameters
  • env (CityLearnEnv) – CityLearn environment.

  • epsilon (float, default: 1.0) – Exploration rate.

  • minimum_epsilon (float, default: 0.01) – Minimum value the exploration rate can decay to.

  • epsilon_decay (float, default: 0.0001) – Exponential decay rate of epsilon.

  • learning_rate (float, default: 0.05) – Defines to what degree new knowledge overrides old knowledge: for learning_rate = 0, no learning happens, while for learning_rate = 1, all prior knowledge is lost.

  • discount_factor (float, default: 0.90) – Balance between an agent that considers only immediate rewards (discount_factor = 0) and one that strives towards long term rewards (discount_factor = 1).

  • q_init_value (float, default: np.nan) – Q-Table initialization value.

  • **kwargs (Any) – Other keyword arguments used to initialize citylearn.agents.base.Agent super class.
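
A minimal usage sketch follows; the schema path is a placeholder, and the learn() call is assumed to be inherited from the citylearn.agents.base.Agent super class rather than defined on this class:

    from citylearn.citylearn import CityLearnEnv
    from citylearn.agents.q_learning import TabularQLearning

    # Placeholder schema path; a schema with discrete observation and
    # action spaces is required for tabular Q-learning.
    env = CityLearnEnv('path/to/schema.json')

    model = TabularQLearning(
        env=env,
        epsilon=1.0,
        minimum_epsilon=0.01,
        epsilon_decay=0.0001,
        learning_rate=0.05,
        discount_factor=0.90,
    )

    # Assumed training entry point inherited from the Agent super class.
    model.learn(episodes=10)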

predict(observations: List[List[float]], deterministic: Optional[bool] = None) List[List[float]][source]

Provide actions for current time step.

If deterministic = True, or a randomly generated number is greater than epsilon, return the deterministic action from the Q-Table, i.e. the action with the maximum Q-value for the given observations; otherwise, return a randomly sampled action (see the sketch below).

Parameters
  • observations (List[List[float]]) – Environment observations.

  • deterministic (bool, default: False) – Whether to return purely exploitative deterministic actions.

Returns

actions – Action values

Return type

List[List[float]]
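
The selection rule above is a standard epsilon-greedy policy. The following standalone sketch illustrates that rule; the function name and Q-table layout are illustrative assumptions and do not mirror the class's internal attributes:

    import random

    import numpy as np

    def epsilon_greedy_action(q_values: np.ndarray, epsilon: float, deterministic: bool = False) -> int:
        # q_values holds the Q-value of each discrete action for one observation.
        if deterministic or random.random() > epsilon:
            # Exploit: return the action index with the maximum Q-value.
            return int(np.nanargmax(q_values))
        else:
            # Explore: return a uniformly sampled action index.
            return random.randrange(len(q_values))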

update(observations: List[List[float]], actions: List[List[float]], reward: List[float], next_observations: List[List[float]], done: bool)[source]

Update the Q-Table using the Bellman equation (see the sketch below).

Parameters
  • observations (List[List[float]]) – Previous time step observations.

  • actions (List[List[float]]) – Previous time step actions.

  • reward (List[float]) – Current time step reward.

  • next_observations (List[List[float]]) – Current time step observations.

  • done (bool) – Indication that episode has ended.
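
The update corresponds to the standard tabular Q-learning rule Q(s, a) ← Q(s, a) + learning_rate * (reward + discount_factor * max_a' Q(s', a') - Q(s, a)). A minimal sketch with illustrative names (assuming a Q-table already initialized to finite values, and scalar observation and action indices) is shown below:

    import numpy as np

    def bellman_update(q_table: np.ndarray, s: int, a: int, reward: float,
                       next_s: int, done: bool, learning_rate: float = 0.05,
                       discount_factor: float = 0.90) -> None:
        # Temporal-difference target: no bootstrapping once the episode has ended.
        next_max_q = 0.0 if done else float(np.max(q_table[next_s]))
        target = reward + discount_factor * next_max_q
        # Blend the old estimate with the new target at the learning rate.
        q_table[s, a] += learning_rate * (target - q_table[s, a])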