The Maze: Reinforcement Learning — Part 1

Pouya Mohseni
Mar 27, 2023 · 9 min read


Reinforcement learning (RL) is a field of artificial intelligence concerned with learning how to optimize behavior based on a reward function. In this story, we have an agent learn a maze problem that includes flags and pushable objects.


Reinforcement learning aims to find an optimal policy that maximizes the cumulative discounted reward over time. In other words, the reward that the agent tries to maximize is determined by the state of the agent and the action it decides to take. The policy, in turn, determines what action the agent should take based on its state and its previous knowledge of rewards [1].

The agent–environment interaction in reinforcement learning [1].
Reward function: r(s, a) -> r
Transition function: δ(s, a) -> s'
Policy: π(s) -> a
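
For reference, the quantity the policy tries to maximize is the discounted return, written here in the same plain-text style as the definitions above, with γ as the discount rate:

Discounted return: G = r_0 + γ·r_1 + γ^2·r_2 + ...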

Reinforcement learning differs from supervised learning in that the learning agent does not have access to labeled data. It also differs from unsupervised learning, which typically looks for hidden structure in unlabeled data; in reinforcement learning problems, the agent uncovers the structure of its environment by taking actions and receiving a reward signal. For this reason, RL is considered the third branch of machine learning [1].

Q-Learning

One popular approach to RL is Q-learning, a value-based method that seeks to find the best action to take in a given state by estimating the maximum expected future reward of each action. In this approach, the agent learns from its environment by relying on its previous experiences of the reward function. Given a learning rate α and a discount rate γ, these previous experiences are updated as the agent takes new actions and perceives the resulting rewards [2].

Q(s, a) <- Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]

In other words, a table is generated with one entry per state–action pair; each entry is initialized to zero (for now, just consider the reward function to be non-positive) and is updated throughout the training phase of the algorithm [3].
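
As a minimal sketch of this update (not the maze implementation used later; n_states, n_actions, alpha, and gamma are placeholder names), the Q-table and its update rule could look like the following:

import numpy as np

# Hypothetical tabular Q-learning sketch; n_states and n_actions are placeholders.
n_states, n_actions = 81, 4              # e.g. a 9*9 maze with L/U/R/D actions
Q = np.zeros((n_states, n_actions))      # every state-action entry starts at zero

alpha, gamma = 0.5, 0.75                 # learning rate and discount rate

def q_update(s, a, r, s_next):
    # Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
    best_next = Q[s_next].max()
    Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])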

While this is not intended as a comprehensive introduction to Q-learning, the overview above is sufficient for what follows.

The Maze Problem

The objective of the maze problem (or rat in a maze) is to find a path from the starting point to the target that is not blocked by any obstacle. In the version studied here, there are flags that must be gathered before reaching the target. There may also be objects in the maze that are pushed when the agent moves toward them; some pushes can block a valid path, while others can open one. Careful consideration therefore has to go into devising the reward function. A maze problem is represented below, with the agent indicated by blue, the target by green, and the flags by red; blocks are represented by black squares.

Now, we formalize the problem.

I. States

The states of this problem consist of every position of the agent in the maze environment combined with its flag progress. This means that for an n*m maze with p flags, the state space contains roughly n*m*p states, only one of which is the goal state. Note that pushable objects are excluded from this description (to include them, we could track the positions each object can occupy, so each object would multiply the original space by a term such as o_i).
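
As an illustration (with hypothetical names, not the maze class used below), a state under this first definition can be written as the agent's position combined with its flag progress:

# Hypothetical illustration of the first state definition: position plus flag progress.
def make_state(row, col, flags_collected):
    return (row, col, flags_collected)

# Example: in a 9*9 maze with six flags and the target at [4,4], the unique
# goal state is the target cell with all six flags collected.
goal_state = make_state(4, 4, 6)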

We can reduce the number of states by excluding the flags from the state space and amending the reward function so that the agent is rewarded for each flag it earns and cannot win the game unless all of the flags are earned. With this change, the size of the space shrinks to n*m.

def take_a_step(self):
    # Start with all four moves and drop the ones that would leave the grid.
    action_set = {"L": 0, "U": 0, "R": 0, "D": 0}

    if self.agent[0] == 0: action_set.pop("U", None)
    if self.agent[0] == self.env_0 - 1: action_set.pop("D", None)
    if self.agent[1] == 0: action_set.pop("L", None)
    if self.agent[1] == self.env_1 - 1: action_set.pop("R", None)

    #...

Another way to reduce the number of states is to exclude blocks from the agent’s valid positions. This approach was not found to be effective, since the agent would no longer learn from the presence of blocks or their positions, so this change has not been implemented.

II. The Goal State

As noted, under the first definition of the states, the goal state is the state in which the agent is at the target position and at index p-1 along the third (flag) dimension, meaning that all of the flags have been collected.

Following the second definition of the states, a goal state is one in which the agent is at the target position and all of the flags have been earned. Note how flag collection is now kept separate from the state representation.
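
As a small sketch of this goal test (assuming the maze class stores the agent position, a target position, and the list of remaining flags, as the snippets below suggest):

# Sketch of the goal test under the second state definition; self.target is an
# assumed attribute holding the target cell.
def is_goal(self):
    return self.agent == self.target and len(self.flags) == 0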

The Reward Function

Now that we have formalized the problem, we can define a reward function that allows the agent to learn the optimal path to the goal state while respecting the constraints of the environment. For now, we do not consider objects in the environment; they are covered in Part 2.

The reward function must satisfy two objectives: first, the agent reaches the target after collecting all of the flags, and second, the agent takes the fastest path. We therefore want to prevent the agent from wandering back and forth between a couple of states, so we penalize it for visiting a state it has already visited. However, we do not want to discourage exploration, so revisiting a state a single time is not punished harshly. The relevant part of the reward function is shown below. Note how the penalty for visiting an ordinary state is proportional to the number of previous visits.

def reward(self, action):
    new_agent = self.find_position(action)

    if self.visited[new_agent[0]][new_agent[1]] >= 2:
        return (-0.5 * self.visited[new_agent[0]][new_agent[1]]) / self.env_size
    elif self.visited[new_agent[0]][new_agent[1]] >= 1:
        return 0.9 / self.env_size

Besides, when the agent tries to move toward a block, it is punished by a value so large that it loses the game; this way, the agent never dares to move toward a block under any circumstances.

#...
elif self.env[new_agent[0]][new_agent[1]] == "B":
    return 10.0 * self.env_lose

Additionally, the reward function must consider flags, more precisely the ones not yet collected. When the agent enters a state whose flag has not been collected, the reward function allocates a larger reward than usual. In this way, the agent is pushed to collect flags on its way to the target state, and it can also use those flag states as landmarks in the space when searching for its goal.

#...
elif self.env[new_agent[0]][new_agent[1]] == "F":
    self.achive_flag(new_agent)

Moreover, the agent should be rewarded when it completes the task. Therefore, a large reward is allocated to the agent when the target is reached and all of the flags have been collected. In this way, we hope that the final reward propagates back through the Q-table and that the fastest path is distinguished from the others.

#...
elif self.env[new_agent[0]][new_agent[1]] == "T" and len(self.flags) == 0:
    return 1 * self.env_size

In addition, the game is terminated when the agent reaches the target or when the agent cannot estimate a valuable future reward, where a valuable reward is defined as one greater than -1.

This definition of the reward function guarantees that the agent will not enter block states: if it did, its future reward would be less than -1 and the game would be terminated. It also forces the agent to collect all of the flags before entering the goal state, because it cannot win unless every flag has been collected.
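
A minimal sketch of this termination test, assuming the per-state action_set dictionary of estimated rewards used elsewhere in the code (the exact training loop is not shown in this story):

# Hypothetical episode-termination check: stop when the target is reached with
# all flags collected, or when no action promises a reward greater than -1.
def episode_finished(self, action_set):
    reached_goal = (self.env[self.agent[0]][self.agent[1]] == "T"
                    and len(self.flags) == 0)
    no_valuable_move = max(action_set.values()) <= -1
    return reached_goal or no_valuable_move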

Note also how each reward and penalty is scaled by the size of the environment, so that the Q-table values of any problem have a comparable distribution.

def reward(self, action):
    new_agent = self.find_position(action)

    if self.env[new_agent[0]][new_agent[1]] == "B":
        # Moving into a block loses the game.
        return 10.0 * self.env_lose
    elif self.env[new_agent[0]][new_agent[1]] == "F":
        # Collect the flag and hand out a large, flag-scaled reward.
        self.achive_flag(new_agent)
        return 2 * (self.env_size / self.flag_num)
    elif self.env[new_agent[0]][new_agent[1]] == "T" and len(self.flags) == 0:
        # Reaching the target with all flags collected wins the game.
        return 1 * self.env_size
    elif self.visited[new_agent[0]][new_agent[1]] >= 2:
        # Repeated visits are penalized in proportion to the visit count.
        return (-0.5 * self.visited[new_agent[0]][new_agent[1]]) / self.env_size
    elif self.visited[new_agent[0]][new_agent[1]] >= 1:
        # A single revisit is tolerated with a slightly smaller reward.
        return 0.9 / self.env_size
    elif self.env[new_agent[0]][new_agent[1]] == 'O':
        pass  # ... object handling is covered in Part 2
    else:
        return 1.0 / self.env_size

Choosing an Action

One way for the agent to choose its action is to maximize its discounted future rewards; but then, once the agent finds the goal, it no longer explores the environment in search of better paths. In other words, exploration collapses once the agent focuses on exploitation. For our agent, we therefore introduce some degree of randomness.

max_reward = max(action_set.values())
random.choice([k for (k, v) in action_set.items()
               if v >= 0.8 * max_reward])

In this way, the agent chooses randomly among the actions whose estimated reward is at least 80% of the best available reward.
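
Wrapped into a helper function (a sketch built on the snippet above; choose_action is not a method of the original class), the selection rule reads:

import random

# Pick uniformly at random among actions whose estimated reward is at least
# 80% of the best available estimate.
def choose_action(action_set):
    max_reward = max(action_set.values())
    candidates = [k for (k, v) in action_set.items() if v >= 0.8 * max_reward]
    return random.choice(candidates)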

Training

First, we train the agent in an easy environment. Below is a 9*9 maze with the agent and the target located at [0,0] and [4,4], respectively, and six flags distributed across the board.

In the following, the first maze is solved by the algorithm with the help of the introduced reward function. Four values for alpha and four for lambda are chosen from the set {1, 0.75, 0.5, 0.1}, giving 16 combinations in total, and for each combination the maze is solved over 1000 epochs.

import matplotlib.pyplot as plt

n_epochs = 1000
fig = plt.figure(figsize=(20, 15))
fig.subplots_adjust(hspace=0.3, wspace=0.2)
plot_num = 0
for alpha_ in [1, 0.75, 0.5, 0.1]:
    for lambda_ in [1, 0.75, 0.5, 0.1]:
        plot_num += 1
        # Rebuild the maze for every (alpha, lambda) pair so training starts fresh.
        tmaze = maze([
            ["W","W","W","W","B","W","F","W","W"],
            ["W","W","B","W","W","W","W","W","B"],
            ["F","W","B","B","F","B","B","W","W"],
            ["B","W","W","W","W","W","W","W","W"],
            ["B","W","B","B","W","W","W","W","B"],
            ["W","W","B","W","B","W","W","W","F"],
            ["W","F","W","W","B","W","W","B","W"],
            ["B","W","B","W","W","B","W","W","W"],
            ["B","B","B","B","B","W","W","W","F"]
        ], [0,0], [4,4])

        tmaze.train(n_epochs, alpha_, lambda_)
        ax = fig.add_subplot(4, 4, plot_num)
        ax.set_title("alpha: " + str(alpha_) + ", lambda: " + str(lambda_))
        ax.plot(tmaze.train_curve)

plt.show()

A summarized training output is shown below. For each epoch, a new training episode is run against the Q-table constructed so far. Each pair of printed lines shows the estimated rewards of the agent’s available actions in a state, followed by the action the agent chose based on those estimates. The episode of each epoch, and how the estimated rewards change as actions are taken, can therefore be read from this output.

{'R': 0.004, 'D': 0.004}
>-> D
{'U': 0.0036400000000000004, 'R': 0.004, 'D': 0.004}
>-> R
{'L': 0.0036400000000000004, 'U': 0.004, 'R': -1.0, 'D': 0.004}
>-> U
{'L': 0.0036400000000000004, 'R': 0.004, 'D': 0.0036400000000000004}
>-> R
{'L': 0.0036400000000000004, 'R': 0.004, 'D': -1.0}
>-> R
{'L': 0.0036400000000000004, 'R': -1.0, 'D': 0.004}
>-> D
{'L': -1.0, 'U': 0.0036400000000000004, 'R': 0.004, 'D': -1.0}
>-> R
{'L': 0.0036400000000000004, 'U': -1.0, 'D': 0.004}
>-> D
{'L': -1.0, 'U': 0.0036400000000000004, 'D': 0.004}
>-> D
{'L': 0.004, 'U': 0.0036400000000000004, 'D': 2.5}
>-> D
R , 0.00 |R , 0.00 |R , 0.00 |D , 0.00 |L , 0.00 |
R , 0.00 |U , 0.00 |L , 0.00 |R , 0.00 |D , 0.00 |
U , 0.00 |L , 0.00 |L , 0.00 |L , 0.00 |D , 0.00 |
U , 0.00 |L , 0.00 |L , 0.00 |L , 0.00 |D , 2.50 |
U , 0.00 |L , 0.00 |L , 0.00 |L , 0.00 |L , 0.00 |
------------------------------------------------------------------
.
.
.
------------------------------------------------------------------
R , 0.03 |R , 0.03 |R , 0.03 |D , 0.03 |L , 0.00 |
R , 0.01 |U , 0.01 |L , 0.00 |R , 0.03 |D , 0.03 |
U , 0.00 |L , 0.00 |L , 0.00 |L , 0.00 |D , 0.12 |
U , 0.00 |L , 0.00 |L , 0.00 |L , 0.00 |D , 0.99 |
U , 0.00 |L , 0.00 |L , 0.00 |L , 0.00 |L , 0.00 |

These are the training plots of the algorithm for all 16 combinations of alpha and lambda, trained for 1000 iterations each. Note that the lower the iteration length, the better the algorithm performs, and once the curve stops changing, learning has finished. Additionally, if the output is 0, the agent is losing the game.

As can be seen, the models with lambda=1 failed to find an optimized path, because this parameter controls the trade-off between short-term and long-term rewards in the Q-table update. Lambda=1 means that the agent considers long-term rewards to be exactly as important as short-term ones, which can cause learning to fail for several reasons.

The first problem, and the one we are dealing with here, is the overestimation of Q-values, which results in incorrect action selections. Other problems stemming from this choice are slow convergence and inefficient exploration.

What’s Next?

Another problem will be solved and the constraints of this approach will be discussed. Additionally, objects are introduced that can be moved by the agent when it moves toward them. These objects are a double-edged sword: they can be used to open a path or to block it entirely, so the reward function has to be devised carefully. Proceed to Part 2 from this link.

The updated version of the code can be found in the linked repo.

References

[1] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction

[2] T. Mitchell, Machine Learning

[3] Sathish Kumar, Simple Reinforcement Learning using Q tables
