Reinforcement Learning #2: Agent in the Warehouse

Taraqur Rahman
6 min read · Nov 18, 2020

The last RL walkthrough was a simple one that helped us set up an RL problem: create an environment, create an agent, create states, create a reward system, and create/update a Q-table. This RL walkthrough by Dr. Daniel Soper will help us reinforce (pun intended) what we learned, with emphasis on a negative reward system, exploration, exploitation, and the epsilon greedy algorithm. Dr. Soper did a great job explaining his work in the video and even provides a Colab notebook complementing it.

The scenario for this walkthrough is a simplified version of a real-world problem that occurs in warehouses. A robot picks up some stuff from an aisle in a warehouse and needs to deliver it to the loading dock. In this scenario, the goal of the agent is to start from any location in the warehouse and navigate to the loading dock along the shortest path.

Image from Dr. Daniel Soper

The first step is to create an environment. The environment will be an 11x11 matrix representing the warehouse floor (image above). The black squares are obstacles that the agent cannot walk through. The agent can only follow the white squares to the goal.

The next step is to create the set of actions. For simplicity, the actions for an agent are up, down, left, and right. This means that wherever the agent is in the warehouse, it will have the option to choose one of those four actions.
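A minimal sketch of how the environment dimensions and the action set could be defined; the variable names here are my own, not necessarily the ones in Dr. Soper's notebook:

```python
import numpy as np

# 11x11 grid representing the warehouse floor
environment_rows = 11
environment_columns = 11

# the four actions the agent can take in any state,
# encoded as indices so they can index into the Q-table later
actions = ['up', 'right', 'down', 'left']
```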

Now comes the reward system. The purpose of the reward system is to guide the agent in the most optimal way. Previously we used a positive reward system: if an agent goes in the right direction, give it a reward; if it goes in the wrong direction, take away some reward. In this project, Dr. Soper used a negative reward system. This means that for each step the agent takes, it receives a negative reward (in this case -1).

In hindsight, this is probably the best option. The goal of an agent is to get to its destination. The agent figures this out by maximizing the reward points that it collects. If the reward points are positive, the agent might just walk around collecting points to maximize its reward without any incentive to actually find the goal. With the negative reward system, the agent loses reward for every step that does not reach the goal. This gives the agent an incentive to get to the goal in the fewest number of steps.

To set up the environment for the agent to train in, we will set the rewards for the obstacles to -100, the goal to +100, and every other square to -1. Setting obstacles as a big deduction prevents any chance of the agent trying to go through them.

Dr. Soper started by filling all squares with -100, then set the terminal state (the loading dock) to 100, and then set every walkable state (the aisle locations) to -1.
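Continuing the snippet above, here is a rough sketch of what that reward setup could look like in NumPy. The goal coordinates and the aisle dictionary below are illustrative placeholders; the real positions come from the warehouse diagram, not from anything I have transcribed.

```python
# start by treating every square as an obstacle (-100)
rewards = np.full((environment_rows, environment_columns), -100.)

# terminal state: the loading dock gets a large positive reward
# (goal_row, goal_column are placeholders for the real position in the diagram)
goal_row, goal_column = 0, 5
rewards[goal_row, goal_column] = 100.

# every walkable aisle square gets -1
# (toy subset only -- the full layout would list every aisle row/column from the diagram)
aisle_locations = {1: list(range(1, 10)),
                   2: [1, 5, 9]}
for row_index, column_indexes in aisle_locations.items():
    for column_index in column_indexes:
        rewards[row_index, column_index] = -1.
```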

This setup produces the following reward system:

Image from Dr. Daniel Soper

The next step is to train the agent. As mentioned in the previous blog, the way an agent learns at the beginning is to randomly choose an action (up, down, left, right) in a state (a location in an aisle). After the decision is made, it moves into a new state (a new location in the aisle) and collects the reward in that state. The agent repeats this process until it gets to the goal (the loading dock). Once it reaches the loading dock, the goal is completed and that counts as one iteration. The agent needs to go through many, many iterations to figure out the best path. The number of iterations is a hyperparameter, meaning we have to decide what that number is.

During training, the agent records the reward it gets from choosing an action in each state in a Q-table. That way, in the next iteration, when the agent is in the same state, it can look at the Q-table, see which action gave the most reward, and should take it. This is called exploitation.

The reason I said should is that this is what we want at the end of training. During testing, the agent looks at the Q-table, sees which action gives the highest reward, and takes it. The key words here are “at the end of training.” To build a robust Q-table during training, we cannot always take the best action. We need to let the agent take other actions just to see if those actions lead to more reward. In RL, this is called exploration.

Imagine going to a new restaurant called FOO. You leave your house and take a random path (also imagine there is no such thing as Google Maps) and get to FOO in 10 minutes. You get there and DELICIOUSO! That becomes your favorite restaurant. The next day you want to go to FOO again. This time you wonder if yesterday’s path was the fastest. Would you be able to determine if yesterday’s path was the fastest by taking the same path again? The answer is no. To figure out the fastest path, you need to explore some other path options. That day you take a different path and it takes you 12 minutes to get to FOO. Now you know the first path is better than the second, but there are other paths to take. So for several days, you take different paths to FOO and record the time from door to door. However, on the days you are really hungry, you take the path that you have found to be the fastest so far. After a few iterations of traveling, you will have found the fastest path to FOO. If you never explore your other options, you will always take that first 10-minute path.

This trade-off between exploitation and exploration is very common in RL. In this walkthrough, the decision about how much to exploit and how much to explore is made by the epsilon greedy algorithm. Epsilon determines how much exploitation happens. In this case, epsilon = 0.9, which means 90% of the time the agent will select the best known action. The remaining 10% of the time, the agent will randomly select an action from all possible actions. This gives other actions a chance to be explored. The value of epsilon is another hyperparameter: we have to decide what that value should be.
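Building on the snippets above, a minimal sketch of a Q-table and an epsilon-greedy action picker could look like this (the function name and shape conventions are my assumptions):

```python
# Q-table: one value per (row, column, action) combination, starting at 0
q_values = np.zeros((environment_rows, environment_columns, len(actions)))

def get_next_action(current_row, current_column, epsilon):
    """Epsilon-greedy: exploit the best known action with probability epsilon,
    otherwise explore by picking a random action."""
    if np.random.random() < epsilon:
        return np.argmax(q_values[current_row, current_column])
    return np.random.randint(len(actions))
```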

Now we can actually train the agent. As the agent goes through the iterations, it needs to update the Q-table with the latest rewards it collected. In this walkthrough, the Q-values were updated using an algorithm called Temporal Difference (TD). The difference with TD is that the values get updated after every time step, as opposed to waiting for a whole iteration to finish (more about this in a later blog).

Here is the training process:

Dr. Soper’s code with some of my comments based on my understanding.
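Since the original code is embedded as an image, here is a rough sketch of what such a training loop with a TD (Q-learning style) update could look like, continuing from the snippets above. The helper functions, the discount factor, and the learning rate are my own assumptions, not a transcription of Dr. Soper's notebook.

```python
epsilon = 0.9          # 90% exploit, 10% explore
discount_factor = 0.9  # how much future rewards count (assumed value)
learning_rate = 0.9    # how strongly each TD update moves the Q-value (assumed value)

def is_terminal_state(row, column):
    # any square that is not a -1 "aisle" square ends the episode
    return rewards[row, column] != -1.

def get_starting_location():
    # pick a random non-terminal square to start from
    row = np.random.randint(environment_rows)
    column = np.random.randint(environment_columns)
    while is_terminal_state(row, column):
        row = np.random.randint(environment_rows)
        column = np.random.randint(environment_columns)
    return row, column

def get_next_location(row, column, action_index):
    # move one square in the chosen direction, staying inside the grid
    if actions[action_index] == 'up' and row > 0:
        row -= 1
    elif actions[action_index] == 'right' and column < environment_columns - 1:
        column += 1
    elif actions[action_index] == 'down' and row < environment_rows - 1:
        row += 1
    elif actions[action_index] == 'left' and column > 0:
        column -= 1
    return row, column

for episode in range(1000):
    row, column = get_starting_location()
    while not is_terminal_state(row, column):
        action_index = get_next_action(row, column, epsilon)
        old_row, old_column = row, column
        row, column = get_next_location(row, column, action_index)

        # temporal difference update:
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        reward = rewards[row, column]
        old_q_value = q_values[old_row, old_column, action_index]
        temporal_difference = (reward
                               + discount_factor * np.max(q_values[row, column])
                               - old_q_value)
        q_values[old_row, old_column, action_index] = (old_q_value
                                                       + learning_rate * temporal_difference)
```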

That is it. After training for 1000 iterations, the Q-table should be filled with appropriate values for each action. With the Q-table filled with some experience, all that is left to do is exploit it during the testing phase. Dr. Soper created a function that takes in a location and outputs the shortest path.

Code to find the shortest path given a location (row, column)
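A sketch of how such a function could work with the Q-table and helpers sketched above: it is pure exploitation, always following the highest Q-value until it reaches a terminal state.

```python
def get_shortest_path(start_row, start_column):
    """Follow the greedy (highest Q-value) action from the given start
    location until a terminal state is reached."""
    if is_terminal_state(start_row, start_column):
        return []  # starting on an obstacle or on the goal: nothing to compute
    row, column = start_row, start_column
    shortest_path = [[row, column]]
    while not is_terminal_state(row, column):
        # pure exploitation: always take the best known action
        action_index = np.argmax(q_values[row, column])
        row, column = get_next_location(row, column, action_index)
        shortest_path.append([row, column])
    return shortest_path
```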

Here are some examples of finding the shortest path:

Three examples to show that it is working. You start from the starting location and move to the next coordinate.

Here is the code with my comments

This was a great walkthrough by Dr. Soper. I learned a lot from it, and it definitely makes the material simpler than reading an RL textbook. So far I have done two blog posts on walkthroughs, and they are definitely helping me learn RL. I will continue going through walkthroughs and learning from them.
