The Maze: Reinforcement Learning — Part 2

Pouya Mohseni
6 min read · Mar 27, 2023


In the previous part, we formalized the maze problem and discussed its states, the goal state, and the reward function. We designed the reward function so that the agent can learn the optimal path to the goal state while respecting the constraints of the environment. We then trained an agent in such an environment.

Photo by Juli Kosolapova on Unsplash

In this part, we discuss the limitations of this approach, then introduce objects into the environment and adjust the reward function accordingly.

Constraints

We now train the agent in a new environment, illustrated below.

As can be seen, there is a flag at position [4,5]. To solve this problem with a Q-table, the agent must first enter the [4,5] cell through [5,5], the only position from which that niche is accessible, and then exit through [5,5] again to move toward the goal. Therefore, the optimal path needs both 'U' and 'D' at position [5,5]: the first time the agent reaches [5,5] it should go up, and after returning there it should go down and follow the trail to the ultimate goal. A 2D Q-table, however, stores only one preferred action per cell.

In conclusion, the problem cannot be solved with a 2D Q-table, but there are two ways to overcome this. The first is to use a neural network as a more expressive substitute for the Q-table; we solve the problem this way with Deep Q-learning in the linked repo. The second is to expand the state space according to the collected flags, so that the agent effectively switches to a different Q-table after achieving the [4,5] flag.
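As a rough sketch of the second idea, the Q-table can be indexed by the set of collected flags in addition to the agent's position. The shapes, the bitmask encoding, and the helper below are illustrative assumptions rather than the repo's actual API:

import numpy as np

# Hypothetical shapes for the 10x10 maze above, which contains 7 flags.
# Encoding the collected flags as a bitmask gives 2**7 layers, so the same
# cell can prefer different actions depending on the flags already achieved.
n_rows, n_cols, n_flags = 10, 10, 7
actions = ['U', 'D', 'L', 'R']
Q = np.zeros((2**n_flags, n_rows, n_cols, len(actions)))

def greedy_action(flags_mask, row, col):
    # At [5,5] this can be 'U' while the [4,5] flag is still missing
    # and 'D' once that flag has been collected.
    return actions[int(np.argmax(Q[flags_mask, row, col]))]

This removes the contradiction at [5,5], at the cost of a table whose size grows exponentially with the number of flags.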

Even so, it is instructive to look at the learning process of the agent, to see how it struggles and how each iteration takes much longer than in the problem introduced in the first part. The plot below shows the learning curves of Q-learning for 16 different combinations of alpha and lambda, each trained for 1000 iterations.

import matplotlib.pyplot as plt

n_epochs = 1000
fig = plt.figure(figsize=(20, 15))
fig.subplots_adjust(hspace=0.3, wspace=0.2)
plot_num = 0
for alpha_ in [1, 0.75, 0.5, 0.1]:
    for lambda_ in [1, 0.75, 0.5, 0.1]:
        plot_num += 1
        tmaze = maze([
            ["W","B","W","W","W","W","W","W","W","W"],
            ["W","W","W","F","W","B","W","W","W","W"],
            ["F","W","W","W","W","B","F","W","W","W"],
            ["B","B","W","B","B","W","B","W","W","W"],
            ["W","F","B","W","B","F","B","B","B","W"],
            ["W","W","B","W","B","W","W","W","W","W"],
            ["W","W","W","W","W","W","W","W","W","W"],
            ["W","F","W","W","W","W","B","B","B","B"],
            ["W","B","B","B","B","B","W","W","W","W"],
            ["W","W","W","W","W","W","W","B","F","W"]
        ], [0,0], [9,9])

        tmaze.train(n_epochs, alpha_, lambda_)
        ax = fig.add_subplot(4, 4, plot_num)
        ax.set_title("alpha: " + str(alpha_) + ", lambda: " + str(lambda_))
        ax.plot(tmaze.train_curve)

plt.show()

As can be seen, many of the configurations fail to find a decent path at all. Among the others, the combination alpha=0.1 and lambda=0.75 performs best, though the agent still needs around 3000 steps to reach the goal with the flags collected.

Compared with the previous problem, the total number of steps in each iteration is much larger: around 3000 versus 500. And, as discussed, the answer cannot be represented in a 2D Q-table; to solve the problem properly, a 3D Q-table that uses the flags as an additional dimension should be considered.
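For reference, the update that a train() call performs on such a table is the standard tabular Q-learning rule; with the flag status folded into the state, it looks roughly like the sketch below. The names and the flag-augmented indexing follow the sketch above and the alpha_/lambda_ naming of the grid search, but they are illustrative assumptions, and the repo's implementation may differ, for example in its exploration strategy.

def q_update(Q, state, action, reward, next_state, alpha_, lambda_):
    # state and next_state are (flags_bitmask, row, col) tuples; action is
    # an index into the action list. lambda_ plays the role of the discount
    # factor, matching the naming used in the grid search above.
    flags, r, c = state
    nflags, nr, nc = next_state
    target = reward + lambda_ * Q[nflags, nr, nc].max()
    Q[flags, r, c, action] += alpha_ * (target - Q[flags, r, c, action])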

Objects

Defining the reward for the case where the agent pushes an object is different, because a push can open or block a valid path. One option is to check whether any path still connects the agent to the target after the push. However, this is not ideal: besides its computational cost, it requires the reward function to fully observe the environment, and it does not account for the flags along the path.

Therefore, we define the reward based on the object's surroundings: if the number of blocks around the pushed object increases, the agent is punished, and if the number of blocked cells surrounding the object decreases, the agent is rewarded.

#...
elif self.env[new_agent[0]][new_agent[1]] == 'O':
    # find the position of the moved object:
    new_obj = self.new_position(action, new_agent)

    # cases in which the object acts as a block

    # when the object would go off the board
    if new_obj[0] < 0 or new_obj[1] < 0 or new_obj[0] >= self.env_0 or new_obj[1] >= self.env_1:
        return 10.0*self.env_lose

    # when the object cannot move because of a block
    if self.env[new_obj[0]][new_obj[1]] == 'B':
        return 10.0*self.env_lose

    # we do not punish the agent if the object lands on a flag or the target,
    # because the agent may be trying to open a new way

    # now that we are sure the object moves to a valid position,
    # compare the blocks around the object's old and new positions
    # to decide how to reward or punish the agent
    score1 = self.score_block(new_agent)
    score2 = self.score_block(new_obj)

    # if the object is moved to a better position
    if score2 >= score1:
        return (1.0 + (score2-score1)/5)/self.env_size
    elif score2 + 1 == score1:
        return (1.0 + 2*(score2-score1)/5)/self.env_size
    elif score2 + 2 == score1:
        return (1.0 + 4*(score2-score1)/5)/self.env_size
    else:
        # punish heavily if the new position looks terrible
        return (-0.5*(score2-score1)/5)/self.env_size

If the surrounding blocks do not change after the push, the move is counted as an ordinary move. If the number of surrounding blocks decreases, the move is regarded as a good move and the agent is rewarded according to the position's score. However, if the action increases the number of blocks surrounding the object by one or two, the agent is punished, but not harshly, because we want to leave it a chance to move objects around in search of better paths. But if the action looks terrible, the agent is punished more strongly, to discourage the move unless the agent really needs it.

def score_block(self, position):
    # Scores a position by counting the blocked cells around it:
    # board edges, blocks ('B'), and other objects ('O').
    # The four orthogonally adjacent cells are considered here.
    ps = [[position[0]-1, position[1]], [position[0]+1, position[1]],
          [position[0], position[1]-1], [position[0], position[1]+1]]
    score = 0
    for p in ps:
        if p[0] >= self.env_0: score += 1
        elif p[0] < 0: score += 1
        elif p[1] >= self.env_1: score += 1
        elif p[1] < 0: score += 1
        elif self.env[p[0]][p[1]] == 'B': score += 1
        elif self.env[p[0]][p[1]] == 'O': score += 1

    return score

Clearly, if the object cannot be pushed in the direction of the action, the reward function treats the move as moving into a block, so the agent is punished heavily and loses the game.

As an example, for the move below the agent receives 0.8 (before scaling by the environment size); note that an ordinary free move receives 1.0 in this algorithm.

Putting everything together, the complete reward function is as follows.

def reward(self, action):
    new_agent = self.find_position(action)

    if self.env[new_agent[0]][new_agent[1]] == "B":
        return 10.0*self.env_lose
    elif self.env[new_agent[0]][new_agent[1]] == "F":
        self.achive_flag(new_agent)
        return 2*(self.env_size/self.flag_num)
    elif self.env[new_agent[0]][new_agent[1]] == "T" and len(self.flags) == 0:
        return 1*self.env_size
    elif self.visited[new_agent[0]][new_agent[1]] >= 2:
        return (-0.5*self.visited[new_agent[0]][new_agent[1]])/self.env_size
    elif self.visited[new_agent[0]][new_agent[1]] >= 1:
        return 0.9/self.env_size
    elif self.env[new_agent[0]][new_agent[1]] == 'O':
        # find the position of the moved object:
        new_obj = self.new_position(action, new_agent)

        # cases in which the object acts as a block

        # when the object would go off the board
        if new_obj[0] < 0 or new_obj[1] < 0 or new_obj[0] >= self.env_0 or new_obj[1] >= self.env_1:
            return 10.0*self.env_lose

        # when the object cannot move because of a block
        if self.env[new_obj[0]][new_obj[1]] == 'B':
            return 10.0*self.env_lose

        # we do not punish the agent if the object lands on a flag or the target,
        # because the agent may be trying to open a new way

        # now that we are sure the object moves to a valid position,
        # compare the blocks around the object's old and new positions
        # to decide how to reward or punish the agent
        score1 = self.score_block(new_agent)
        score2 = self.score_block(new_obj)

        # if the object is moved to a better position
        if score2 >= score1:
            return (1.0 + (score2-score1)/5)/self.env_size
        elif score2 + 1 == score1:
            return (1.0 + 2*(score2-score1)/5)/self.env_size
        elif score2 + 2 == score1:
            return (1.0 + 4*(score2-score1)/5)/self.env_size
        else:
            # punish heavily if the new position looks terrible
            return (-0.5*(score2-score1)/5)/self.env_size
    else:
        return 1.0/self.env_size
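To get a sense of the scale of these rewards, the quick check below evaluates a few of the branches directly. The constants (a 10x10 board with env_size = 100, 7 flags, and env_lose = -1) are assumptions for illustration and may not match how the repo defines them:

# Illustrative constants, not taken from the repo.
env_size, flag_num, env_lose = 100, 7, -1

print(10.0 * env_lose)            # moving into a block, or an invalid push: -10.0
print(2 * (env_size / flag_num))  # achieving a flag: ~28.6
print(1 * env_size)               # reaching the target with all flags: 100
print(1.0 / env_size)             # stepping onto a fresh cell: 0.01
print(0.9 / env_size)             # stepping onto a cell visited once before: 0.009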

The updated version of the code can be found in the linked repo.

