A Super Simple Implementation of Reinforcement Learning
I found a very simple implementation of Reinforcement Learning (RL) using a Q-table and thought it would be a good starting point for learning how an RL algorithm works and how it is programmed.
The Goal of the Agent
In this project, the goal of the agent is to start from node 0 and get to node 7 in the fewest steps possible. For example, in the path mapped out below, the agent would take the steps 1 -> 2 -> 7, starting from 0.
Determine the Reward System
The first step is to create a reward matrix that maps out the reward for each action an agent can take in each state. In this scenario, the reward matrix is 8x8 (one row and one column for each of the nodes 0–7). We can think of the reward matrix as a coordinate system. Not every point in the 8x8 grid is part of the path, so we fill in -1 for coordinates that do not correspond to a valid move. This indirectly creates barriers, because an agent will not choose those moves since they would take away reward points. Valid moves along the path get a reward of 0, and a move that reaches the goal node (in this case 7) gets a reward of 100.
What we will do is reward the computer for taking the correct path towards 7. If it doesn't take the correct path, it doesn't get a reward, and if it goes out of bounds, it loses points. Eventually, after many episodes of trial and error, the agent will determine the correct path based on the highest rewards for each step.
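As a rough numpy sketch of that matrix (the edge list below is an assumption for illustration; the walkthrough only guarantees that 0–1, 1–2, and 2–7 are connected and that 7 is the goal):

```python
import numpy as np

# Hypothetical edge list for the 8-node graph; only 0-1, 1-2, and 2-7 are
# confirmed by the walkthrough, the rest is filler so the example runs.
edges = [(0, 1), (1, 2), (2, 7), (1, 5), (5, 6), (5, 4), (2, 3)]
goal = 7

# 8x8 reward matrix: -1 marks moves that are not on the path (barriers),
# 0 marks walkable moves, and 100 marks any move that reaches the goal.
R = np.full((8, 8), -1)
for a, b in edges:
    R[a, b] = 0
    R[b, a] = 0              # the graph is undirected, so both directions are walkable
for node in range(8):
    if R[node, goal] == 0:   # there is an edge from this node into the goal
        R[node, goal] = 100
```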
Create a Q-Table
When an agent is in a particular state, it is provided with a set of available actions, and it will take the action with the highest reward. The information about the rewards of the actions available in each state is stored in a Q-table. A Q-table sounds scary and the equation might look intimidating, but it is just a simple table that an agent uses to look up the reward for an action in a given state.
But wait, how does a Q-table know which action provides the most reward in which state? It learns this through the training process. For this simple project, the Q-table is filled with 0s in the beginning. In each iteration the agent explores: it randomly selects an action and checks what reward it gets. Once it gets a reward, the Q-table records the value of taking that action in that state.
As more iterations are taken, the Q-table adjusts itself to hold better and better estimates of the reward for each action in each state. Once the training process (exploration) is over, the agent can exploit the Q-table to find the optimal path to accomplish its goal.
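In code, the untrained Q-table is nothing more than an 8x8 array of zeros, matching the shape of the reward matrix above:

```python
# One row per state (current node) and one column per action (next node);
# every value starts at 0 and is filled in during training.
Q = np.zeros((8, 8))
```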
To update the values of the Q-table, we will use the Bellman equation: Q(s, a) = r + gamma * max Q(s', a').
This equation can be intimidating at first, but if we break it down, it is straightforward. First, let's go over the variables:
- Q(s, a): the value stored in the Q-table for being in state s and taking action a.
- r: the immediate reward for the current action.
- max Q(s', a'): the maximum Q-value over the actions a' available in the next state s'.
- gamma (the y-looking symbol, γ): how much the future reward influences the current value.
Basically, the equation takes the immediate reward and adds a fraction (gamma) of the highest reward available in the next state. Adding the future reward is key to letting the agent know that it is taking good actions. For example, if there are two actions, one that leads to a future reward and one that does not, the agent will clearly prefer the one with the future reward.
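As a minimal sketch, the update for a single step looks like this (it continues from the R and Q arrays above, and gamma = 0.8 is an assumed value, since the exact discount factor is not stated here):

```python
gamma = 0.8  # assumed discount factor; any value between 0 and 1 works for this example

def update_q(state, action):
    """Bellman update: Q(s, a) = r + gamma * max over a' of Q(s', a')."""
    next_state = action  # in this graph, taking action a simply moves the agent to node a
    Q[state, action] = R[state, action] + gamma * np.max(Q[next_state, :])
```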
Simply put, the Q-table just collects the rewards for each action at each state during training. Once that information is collected, during the test phase the agent looks up the state it is in, sees what the rewards are for each action, and selects the action that gives it the highest reward. That way, after enough training, it ends up following the optimal path.
Training process:
During the training phase, the agent refers to its Q-table and sees it filled with 0’s. Therefore it randomly selects an action and sees what the result is. Whether it is a good action or a bad action, the agent calculates the value using the Bellman equation and jots it down in the Q-table for the next time.
After taking the action, the agent is in a new state. Once again, it refers to its Q-table and still sees zeros, so it chooses an action randomly again, updates the Q-table, and moves to the next state. This process repeats until a predetermined number of trials is reached. In this code, that threshold is 700 (an arbitrary choice), so the agent goes through this process 700 times.
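One way to write that loop, continuing from the snippets above (starting each trial in a random state is an assumption about the original code, not something the walkthrough specifies):

```python
import random

for episode in range(700):                     # the 700 trials mentioned above
    state = random.randint(0, 7)               # begin each trial somewhere on the graph
    while state != goal:
        valid = np.where(R[state] >= 0)[0]     # actions that stay on the path
        action = int(np.random.choice(valid))  # explore: pick one at random
        update_q(state, action)                # record the result via the Bellman update
        state = action                         # taking the action moves the agent there
```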
Once training is complete, the Q-table is filled with the experiences the agent went through. To get the optimal path, we just have to follow the Q-table. A well-trained Q-table has high Q-values for actions that lead closer to the goal, and low (or even negative) values that tell the agent not to take a given action in a given state.
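Following the trained Q-table from node 0 is then just a greedy walk (again continuing from the snippets above):

```python
# Exploitation: from each state, take the action with the highest Q-value.
state = 0
path = [state]
while state != goal:
    state = int(np.argmax(Q[state]))  # best column = best next node
    path.append(state)
print("learned path:", path)          # e.g. [0, 1, 2, 7] with the edge list assumed above
```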
Here is the full code with some commentary:
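A self-contained version that stitches the pieces above together might look like this (the edge list, the value of gamma, and the episode structure are assumptions rather than the author's exact script):

```python
import numpy as np
import random

# --- Graph and reward matrix (edge list assumed for illustration) ---
edges = [(0, 1), (1, 2), (2, 7), (1, 5), (5, 6), (5, 4), (2, 3)]
goal = 7

R = np.full((8, 8), -1)          # -1 = not a valid move (barrier)
for a, b in edges:
    R[a, b] = 0                  # 0 = valid move
    R[b, a] = 0
for node in range(8):
    if R[node, goal] == 0:
        R[node, goal] = 100      # 100 = move that reaches the goal

# --- Q-table, initialised to zeros ---
Q = np.zeros((8, 8))
gamma = 0.8                      # assumed discount factor

def update_q(state, action):
    """Bellman update: Q(s, a) = r + gamma * max_a' Q(s', a')."""
    Q[state, action] = R[state, action] + gamma * np.max(Q[action, :])

# --- Training: 700 trials of random exploration ---
for episode in range(700):
    state = random.randint(0, 7)
    while state != goal:
        valid = np.where(R[state] >= 0)[0]
        action = int(np.random.choice(valid))
        update_q(state, action)
        state = action

# --- Testing: follow the highest Q-value from node 0 to the goal ---
state = 0
path = [state]
while state != goal:
    state = int(np.argmax(Q[state]))
    path.append(state)

print("learned path:", path)     # e.g. [0, 1, 2, 7]
```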
That is basically what Q-learning in Reinforcement Learning is. This is an oversimplified example, but the idea stays the same: there is an agent whose goal is to maximize its rewards. To do that, the agent needs to explore the states and the actions available in each state. Each action yields a reward (positive, negative, or 0), which is converted into a Q-value using the Bellman equation and stored in a Q-table. After a sufficient amount of training, the agent exploits the data collected in the Q-table to accomplish its goal.
This simple implementation was a great first walkthrough for understanding what is happening under the hood of Reinforcement Learning algorithms. I will continue building on this foundation in future posts.