citylearn.agents.q_learning module
- class citylearn.agents.q_learning.TabularQLearning(env: CityLearnEnv, epsilon: float = None, minimum_epsilon: float = None, epsilon_decay: float = None, learning_rate: float = None, discount_factor: float = None, q_init_value: float = None, **kwargs: Any)[source]
Bases:
Agent
Implementation of the tabular Q-Learning algorithm for discrete observation and action spaces with epsilon-greedy action selection.
- Parameters:
env (CityLearnEnv) – CityLearn environment.
epsilon (float, default: 1.0) – Exploration rate.
minimum_epsilon (float, default: 0.01) – Minimum value the exploration rate can decay to.
epsilon_decay (float, default: 0.0001) – epsilon exponential decay rate.
learning_rate (float, default: 0.05) – Defines to what degree new knowledge overrides old knowledge: for learning_rate = 0, no learning happens, while for learning_rate = 1, all prior knowledge is lost.
discount_factor (float, default: 0.90) – Balance between an agent that considers only immediate rewards (discount_factor = 0) and one that strives towards long-term rewards (discount_factor = 1).
q_init_value (float, default: np.nan) – Q-Table initialization value.
**kwargs (Any) – Other keyword arguments used to initialize the citylearn.agents.base.Agent super class.
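The snippet below is a minimal usage sketch rather than part of the API reference: the schema path is a placeholder, and it assumes a schema whose observation and action spaces are discretized (a requirement of tabular Q-learning) and the training loop inherited from the citylearn.agents.base.Agent super class.

```python
from citylearn.citylearn import CityLearnEnv
from citylearn.agents.q_learning import TabularQLearning

# Placeholder schema path; tabular Q-learning requires a schema whose
# observation and action spaces are discretized.
env = CityLearnEnv('path/to/schema.json', central_agent=True)

agent = TabularQLearning(
    env,
    epsilon=1.0,           # initial exploration rate
    minimum_epsilon=0.01,  # floor the exploration rate can decay to
    epsilon_decay=0.0001,  # exponential decay rate of epsilon
    learning_rate=0.05,
    discount_factor=0.90,
)

# Training loop provided by the citylearn.agents.base.Agent super class
# (episode count is illustrative).
agent.learn(episodes=10)
```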
- predict(observations: List[List[float]], deterministic: bool = None) List[List[float]] [source]
Provide actions for current time step.
If deterministic = True, or a randomly generated number is greater than epsilon, return the deterministic action from the Q-Table, i.e. the action with the maximum Q-value for the given observations; otherwise, return a randomly sampled action.
- Parameters:
observations (List[List[float]]) – Environment observations.
deterministic (bool, default: False) – Whether to return purely exploitative deterministic actions.
- Returns:
actions – Action values
- Return type:
List[List[float]]
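The epsilon-greedy rule described above can be summarized by the following sketch; the function and its q_row structure are hypothetical illustrations of the selection logic, not the class's internal attributes.

```python
import random

def epsilon_greedy_action(q_row: dict, epsilon: float, deterministic: bool = False) -> int:
    """Select an action from one Q-table row mapping action index -> Q-value."""
    if deterministic or random.random() > epsilon:
        # Exploit: action with the maximum Q-value for the given observation.
        return max(q_row, key=q_row.get)
    # Explore: uniformly sample a random action.
    return random.choice(list(q_row))
```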
- update(observations: List[List[float]], actions: List[List[float]], reward: List[float], next_observations: List[List[float]], terminated: bool, truncated: bool)[source]
Update Q-Table using Bellman equation.
- Parameters:
observations (List[List[float]]) – Previous time step observations.
actions (List[List[float]]) – Previous time step actions.
reward (List[float]) – Current time step reward.
next_observations (List[List[float]]) – Current time step observations.
terminated (bool) – Indication that episode has ended.
truncated (bool) – Indication that the episode was truncated due to a time limit or another reason not defined as part of the task MDP.
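A minimal sketch of the Bellman backup this method applies is shown below. It uses hypothetical discrete indices s and a for the encoded observation and action, and assumes a 2-D NumPy Q-table initialized to finite values (the class default q_init_value is np.nan, so a real implementation would also handle uninitialized entries); it illustrates the standard tabular rule rather than the class's exact internals.

```python
import numpy as np

def bellman_update(q_table, s, a, r, s_next, learning_rate, discount_factor, terminated):
    """One tabular Q-learning backup for the transition (s, a, r, s_next)."""
    # Bootstrap from the best next action unless the episode has terminated.
    target = r if terminated else r + discount_factor * np.max(q_table[s_next])
    # Move the current estimate towards the target by the learning rate.
    q_table[s, a] += learning_rate * (target - q_table[s, a])
```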