Q-learning – Machine Learning with reinforcement learning

Reinforcement learning is a Machine Learning method that enables complex tasks to be performed autonomously. The aim of this article is to help you understand Q-learning and save time when implementing this type of solution.

Recently, this family of algorithms made headlines in e-sports with the release of AlphaStar, an algorithm developed to challenge the best players in the world at StarCraft. These algorithms have great potential, but can be very time-consuming to build and configure.

Definitions: Q-Learning

What is reinforcement learning?

Reinforcement learning is a machine learning method whose objective is to enable an agent (virtual entity: robot, programme, etc.), placed in an interactive environment (its actions modify the state of the environment), to choose actions that maximise quantitative rewards. The agent performs trials and improves its action strategy according to the rewards provided by the environment.
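
To make the agent / environment vocabulary concrete, here is a minimal sketch of the interaction loop. The `Corridor` environment, its `reset` / `step` methods and the reward of 1 at the goal are all assumptions invented for this illustration; the agent here acts purely at random and does not learn yet.

```python
import random

class Corridor:
    """Hypothetical toy environment: positions 0..4, goal at position 4."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Put the agent back at the start and return the initial state."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action (0 = left, 1 = right); walls clamp the position."""
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0  # reward only when the goal is reached
        return self.position, reward, done

def run_random_episode(env, rng, max_steps=1000):
    """Run one episode with a random agent; return (total reward, steps)."""
    state = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = rng.choice([0, 1])             # the agent picks an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps

total, steps = run_random_episode(Corridor(), random.Random(0))
print(f"total reward: {total}, steps taken: {steps}")
```

A learning agent would replace the random choice with a strategy that improves as rewards come in, which is what Q-learning provides.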

What is Q-learning?

There are many reinforcement learning algorithms, categorised into several sub-families. Q-learning is relatively simple, and it also illustrates the learning mechanisms common to many other models.

As an introductory illustration, consider a Q-learning algorithm applied to a basic problem: a maze game. The aim of the game is to teach a robot to get out of the maze as quickly as possible when it is placed at random on one of the white squares. To achieve this, there are three central stages in the learning process:

  1. Knowledge: defining an action-value function Q;
  2. Reinforcing knowledge: updating the Q function;
  3. Taking action: adopting an action strategy (policy) π.

Q-learning is therefore a reinforcement learning algorithm that seeks to find the best action to take given the current state.

It is considered off-policy because the Q-learning function learns from actions that are outside the current policy, such as random actions: the update always assumes that the best action will be taken next, whichever action the agent actually chooses. More precisely, Q-learning seeks to learn a policy that maximises total reward.

The “Q” in Q-learning stands for quality. In this case, quality represents the utility of a given action in obtaining a future reward.

Q-learning: practical implementation

Creating a Q table

Before learning starts, we create a so-called q-table, a matrix of shape [state, action], and initialise its values to zero. We then update and store our q-values during each episode. This table of values becomes a reference table for our agent, which selects the best action based on the values in this matrix.

import numpy as np
state_size, action_size = 16, 4  # illustrative sizes; set them to match your environment
# Initialize q-table values to 0
Q = np.zeros((state_size, action_size))
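
As a small sketch of how the agent might use this table as a reference, the snippet below fills a tiny q-table with made-up values and reads off the best action for each state. The sizes, the numbers and the `best_action` helper are assumptions for illustration only.

```python
import numpy as np

state_size, action_size = 3, 2           # illustrative sizes
Q = np.zeros((state_size, action_size))  # one row per state, one column per action

# Pretend that learning produced these q-values (made-up numbers)
Q[0] = [0.1, 0.5]
Q[1] = [0.7, 0.2]

def best_action(Q, state):
    """Return the index of the action with the highest q-value in `state`."""
    return int(np.argmax(Q[state]))

# State 2 was never updated, so all its values tie at zero and
# np.argmax falls back to the first action (index 0).
greedy_policy = [best_action(Q, s) for s in range(state_size)]
print(greedy_policy)
```

Note that an untouched row tells the agent nothing, which is one reason exploration matters.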


Q-learning and updates

The next step is simply for the agent to interact with the environment and update the state-action pairs in our Q[state, action] array.

- Taking action: Explore or Exploit

An agent interacts with the environment in two ways. The first is to use the Q table as a reference and view all the possible actions for a given state. The agent then selects the action based on the maximum value of these actions. This is called exploitation, as we use the information available to us to make a decision.

The second way is to act randomly. This is called exploration. Instead of selecting actions based on the maximum future reward, we choose an action at random. Acting randomly is important because it allows the agent to explore and discover new states that might not otherwise be selected during the exploitation process.

You can balance exploration / exploitation by using epsilon (ε) and setting the value of how often you want to explore or exploit. Here is some rough code that will depend on how the state and action space are configured.

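import random
# Assumes Q and action_size from the q-table snippet, and an integer `state`
# Set the fraction of steps on which you want to explore
epsilon = 0.2
if random.uniform(0, 1) < epsilon:
    # Explore: select a random action
    action = random.randrange(action_size)
else:
    # Exploit: select the action with max value (future reward)
    action = np.argmax(Q[state, :])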
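
The same idea can be wrapped in a reusable function. This is only a sketch: the `choose_action` name, its signature and the one-state q-table are assumptions, and a seeded `random.Random` is passed in so the run is repeatable.

```python
import random
import numpy as np

def choose_action(Q, state, epsilon, rng):
    """Epsilon-greedy choice: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random
        return rng.randrange(Q.shape[1])
    # Exploit: pick the action with the highest q-value for this state
    return int(np.argmax(Q[state]))

rng = random.Random(0)
Q = np.array([[0.0, 1.0, 0.5]])  # a single state with three actions (made-up values)

actions = [choose_action(Q, 0, 0.2, rng) for _ in range(1000)]
exploit_share = actions.count(1) / len(actions)
print(f"share of steps that picked the best action: {exploit_share:.2f}")
```

With epsilon set to 0 the agent always exploits; with epsilon set to 1 it always explores. A common refinement, not shown here, is to start with a high epsilon and lower it as the q-table fills in.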


- Updating the q-table

Updates take place after each step or action and end when an episode is completed. In this case, “completed” means that the agent has reached a terminal point. For example, an end state might be landing on a payment page or achieving a desired goal. With enough exploration (steps and episodes), the agent will be able to converge and learn the optimal values of q or q-star (Q∗).

Here are the 3 basic steps:

  1. The agent starts in a state (s1), takes an action (a1) and receives a reward (r1).
  2. The agent chooses each action either by taking the one with the highest value (max) in the Q-table or at random (epsilon, ε).
  3. The agent updates the q-values.
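
Put together, the steps above can be sketched as a complete training loop. The corridor environment (five cells, goal in the last one), the `step` and `train` functions and all hyperparameter values are assumptions chosen for this example, not settings taken from a particular project.

```python
import random
import numpy as np

N_STATES, N_ACTIONS = 5, 2  # corridor cells 0..4; actions: 0 = left, 1 = right
GOAL = N_STATES - 1

def step(state, action):
    """Hypothetical dynamics: move, clamp at the walls, reward 1 at the goal."""
    new_state = min(max(state + (1 if action == 1 else -1), 0), GOAL)
    done = new_state == GOAL
    return new_state, (1.0 if done else 0.0), done

def train(episodes=500, lr=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        state, done = 0, False
        while not done:  # an episode ends when the terminal state is reached
            # Choose an action: explore with probability epsilon, else exploit
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)
            else:
                action = int(np.argmax(Q[state]))
            # Take the action, observe the reward and the new state
            new_state, reward, done = step(state, action)
            # Update the q-value of the (state, action) pair just used
            Q[state, action] += lr * (
                reward + gamma * np.max(Q[new_state]) - Q[state, action]
            )
            state = new_state
    return Q

Q = train()
policy = np.argmax(Q[:GOAL], axis=1)  # best action in each non-terminal state
print(policy)
```

The row of the goal state is never updated because no action is taken from it, so its q-values stay at zero and the update for the final step reduces to the reward alone.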

The basic update rule for Q-learning is as follows:

# Update q values
Q[state, action] = Q[state, action] + lr * (reward + gamma * np.max(Q[new_state, :]) - Q[state, action])


In the update above, there are a few variables that we haven’t mentioned yet. What happens here is that we adjust our q-values according to the difference between the discounted new values and the old ones. We discount the new values using gamma and adjust our step size using the learning rate (lr).
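
A single update worked through with numbers may help. All values below (the two-state q-table, the reward, lr and gamma) are made up for illustration; the intermediate names `target` and `td_error` are not from the update rule itself and are introduced only to split it into readable steps.

```python
import numpy as np

lr, gamma = 0.1, 0.9
Q = np.array([[0.5, 0.0],   # q-values in state 0
              [2.0, 1.0]])  # q-values in state 1
state, action, reward, new_state = 0, 0, 1.0, 1

# Discounted estimate of what the move is worth: 1.0 + 0.9 * 2.0 = 2.8
target = reward + gamma * np.max(Q[new_state, :])
# Gap between that estimate and the old q-value: 2.8 - 0.5 = 2.3
td_error = target - Q[state, action]
# Move the old value a fraction lr of the way: 0.5 + 0.1 * 2.3 = 0.73
Q[state, action] = Q[state, action] + lr * td_error

print(round(float(Q[state, action]), 4))
```

The old value of 0.5 moves only a tenth of the way towards 2.8, which is how the learning rate keeps a single experience from overwriting everything learned before.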


  • Learning rate: lr, often called alpha, can be defined as the degree of acceptance of the new value compared to the old one. Above, we take the difference between the new value and the old value, then multiply this value by the learning rate. This value is then added to our previous q-value, moving it in the direction of our last update.
  • Gamma: gamma or γ is a discount factor. It is used to balance immediate and future reward. In our update rule above, you can see that we apply the discount to the future reward. In general, this value can vary between 0.8 and 0.99.
  • Reward: The reward is the value received after performing a certain action in a given state. A reward can occur at any given time step or only at the terminal time step.
  • Max: np.max() is a NumPy function; here it takes the largest q-value available in the new state, that is, the best estimate of future reward, and adds it (discounted by gamma) to the reward for the current action. This has the effect of influencing the current action by the possible future reward. In effect, thanks to Q-learning, we are able to allocate future reward to current actions to help the agent select the most profitable action in any given state.
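
To get a feel for the discount factor, the sketch below computes what a reward of 1 received three steps in the future is worth today for the two ends of the range quoted above. The `discounted_return` helper and the reward sequence are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its delay in steps."""
    total = 0.0
    for delay, reward in enumerate(rewards):
        total += (gamma ** delay) * reward
    return total

rewards = [0.0, 0.0, 0.0, 1.0]  # nothing for three steps, then a reward of 1

patient = discounted_return(rewards, 0.99)   # equals 0.99 ** 3
impatient = discounted_return(rewards, 0.8)  # equals 0.8 ** 3
print(f"gamma=0.99: {patient:.3f}, gamma=0.8: {impatient:.3f}")
```

The closer gamma is to 1, the more a distant reward counts; a lower gamma makes the agent favour rewards it can reach quickly.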
