Model-Based and Model-Free Reinforcement Learning – Pytennis Case Study

Reinforcement learning is a field of Artificial Intelligence in which you build an intelligent system that learns from its environment through interaction and evaluates what it learns in real-time. 

Good examples of this are self-driving cars, or the systems DeepMind built, known today as AlphaGo, AlphaStar, and AlphaZero.

AlphaZero is a program built to master the games of chess, shogi, and Go (AlphaGo was the first program to beat a human Go master). AlphaStar plays the video game StarCraft II.

In this article, we’ll compare model-free vs model-based reinforcement learning. Along the way, we will explore:

  1. Fundamental concepts of Reinforcement Learning
    a) Markov decision processes / Q-Value / Q-Learning / Deep Q Network
  2. Difference between model-based and model-free reinforcement learning.
  3. Discrete mathematical approach to playing tennis – model-free reinforcement learning.
  4. Tennis game using Deep Q Network – model-based reinforcement learning.
  5. Comparison/Evaluation
  6. References to learn more



Fundamental concepts of Reinforcement Learning

Any reinforcement learning problem includes the following elements:

  1. Agent – the program controlling the object of concern (for instance, a robot).
  2. Environment – this defines the outside world programmatically. Everything the agent(s) interact with is part of the environment. It’s built so that the agent experiences something like a real-world case, and it’s needed to prove the agent’s performance – that is, whether it will do well once implemented in a real-world application.
  3. Rewards – these give us a score of how the algorithm performs with respect to the environment. A reward is represented as 1 or 0: ‘1’ means the policy network made the right move, ‘0’ means it made the wrong move. In other words, rewards represent gains and losses.
  4. Policy – the algorithm used by the agent to decide its actions. This is the part that can be model-based or model-free.

Every problem that needs an RL solution starts with simulating an environment for the agent. Next, you build a policy network that guides the agent’s actions. The agent can then evaluate the policy by checking whether each of its actions resulted in a gain or a loss.
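To make these four elements concrete, here is a minimal, generic interaction loop. It is only a sketch: the env and agent objects (with their reset, step, act, and observe methods) are hypothetical placeholders, not part of the Pytennis code.

def run_episode(env, agent, max_steps=100):
    # One episode of agent-environment interaction (illustrative only).
    state = env.reset()                         # environment supplies the initial state
    total_reward = 0
    for _ in range(max_steps):
        action = agent.act(state)               # the policy decides the action
        state, reward, done = env.step(action)  # environment reacts and returns a reward
        agent.observe(state, reward)            # the agent evaluates its gain or loss
        total_reward += reward
        if done:
            break
    return total_reward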

The policy is our main discussion point for this article. A policy can be model-based or model-free, and when building one, our concern is how to optimize the policy network via policy gradients (PG).

PG algorithms directly try to optimize the policy to increase rewards. To understand these algorithms, we must take a look at Markov decision processes (MDP).

Markov decision processes / Q-Value / Q-Learning / Deep Q Network

MDP is a process with a fixed number of states, and it randomly evolves from one state to another at each step. The probability for it to evolve from state A to state B is fixed.
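As a toy illustration (not part of the Pytennis code), a three-state MDP can be written as a fixed transition-probability matrix and sampled step by step; the probabilities below are made-up example values.

import numpy as np

# Rows are the current state, columns the next state; each row sums to 1.
transition_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])

state = 0
for _ in range(5):
    state = np.random.choice(3, p=transition_probs[state])
    print("moved to state", state)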

A lot of reinforcement learning problems with discrete actions are modeled as Markov decision processes, with the agent initially having no clue about the next transition state. The agent also has no idea of the rewarding principle, so it has to explore all possible states to begin to figure out how to adjust to a good rewarding system. This leads us to what we call Q-Learning.

The Q-Learning algorithm is adapted from the Q-Value Iteration algorithm, in a situation where the agent has no prior knowledge of preferred states and rewarding principles. Q-Values can be defined as an optimal estimate of a state-action value in an MDP. 

It is often said that Q-Learning doesn’t scale well to large (or even medium) MDPs with many states and actions. The solution is to approximate the Q-Value of any state-action pair (s,a). This is called Approximate Q-Learning. 
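For intuition, the tabular Q-Learning update that Approximate Q-Learning replaces with a function approximator looks roughly like this; the table sizes and hyperparameters below are arbitrary example values, not taken from the Pytennis project.

import numpy as np

n_states, n_actions = 10, 10          # example sizes
alpha, gamma = 0.1, 0.99              # learning rate and discount factor
Q = np.zeros((n_states, n_actions))   # table of Q-Values, one per (state, action) pair

def q_update(s, a, reward, s_next):
    # Move Q(s, a) towards the observed reward plus the discounted
    # best Q-Value achievable from the next state.
    target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

When the state space becomes too large for a table, a neural network takes over the role of Q here, which is exactly what the deep Q-network described next does.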

DeepMind proposed the use of deep neural networks, which work much better, especially for complex problems – without the use of any feature engineering. A deep neural network used to estimate Q-Values is called a deep Q-network (DQN). Using DQN for approximated Q-learning is called Deep Q-Learning.

Difference Between Model-Based and Model-Free Reinforcement Learning

In a model-based RL system, the policy is based on the use of a machine learning model. What defines an RL system is its policy network, and the policy is the algorithm that decides the agent’s actions. When the policy relies on a machine learning model such as a random forest, gradient boosting, or a neural network, the RL system is model-based. A model-free RL system, by contrast, has no policy built on a machine learning model; its policy is guided by non-ML algorithms. For instance, consider balancing a lever so that the effort and the load stay in perfect equilibrium on the fulcrum (see fig 1). A simple rule can be written: when the load is greater than the effort, move the fulcrum leftward; otherwise, move it right. This kind of system is model-free, because it requires no machine learning model to achieve stability.
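A minimal sketch of such a hand-written rule might look like this; the function name and step size are hypothetical, and no learning is involved.

def balance_step(load, effort, fulcrum_x, step=1):
    # If the load side is heavier, shift the fulcrum left; if lighter, shift it right.
    if load > effort:
        return fulcrum_x - step
    elif load < effort:
        return fulcrum_x + step
    return fulcrum_x  # already balanced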

To better understand this, we’ll explain everything with an example. In the example, we’ll build model-free and model-based RL for tennis games. To build the model, we need an environment in which the policy can be implemented. However, we won’t build the environment in this article; we’ll import one to use in our program.

[Figure: model-based vs model-free RL]

Pytennis environment

We’ll use the Pytennis environment to build a model-free and model-based RL system.

A tennis game requires the following:

  1. 2 players, which implies 2 agents.
  2. A tennis lawn – the main environment.
  3. A single tennis ball.
  4. Movement of the agents in the left-right (or right-left) direction.

The Pytennis environment specifications are:

  1. There are 2 agents (2 players) with a ball.
  2. There’s a tennis field of dimension (x, y) – (300, 500).
  3. The ball moves in a straight line: agent A decides a target point between x1 (0) and x2 (300) on side B (agent B’s side), and the ball is then displayed at 50 successive positions at an FPS of 20, so it travels in a straight line from source to destination. The same applies to agent B.
  4. Movement of agent A and agent B is bounded between x1 = 100 and x2 = 600.
  5. Movement of the ball is bounded along the y-axis (y1 = 100 to y2 = 600).
  6. Movement of the ball is bounded along the x-axis (x1 = 100 to x2 = 600).

Pytennis is an environment that mimics real-life tennis situations. As shown below, the image on the left is a model-free Pytennis game, and the one on the right is model-based. 

[Figure: Pytennis environment – model-free (left) and model-based (right)]

Discrete mathematical approach to playing tennis – model-free Reinforcement Learning

Why “discrete mathematical approach to playing tennis”? Because this method is a logical implementation of the Pytennis environment. 

The code below shows us the implementation of the ball movement on the lawn. You can find the source code here. 

import time
import numpy as np
import pygame
import sys

 
from pygame.locals import *
pygame.init()
 
 
class Network:
    def __init__(self, xmin, xmax, ymin, ymax):
        """
        xmin: 150,
        xmax: 450,
        ymin: 100,
        ymax: 600
        """
        # Boundaries of the lawn that every generated trajectory must respect.
        self.StaticDiscipline = {
            'xmin': xmin,
            'xmax': xmax,
            'ymin': ymin,
            'ymax': ymax
        }

    def network(self, xsource, ysource=100, Ynew=600, divisor=50):
        """
        For Network A
        ysource: will always be 100
        xsource: will always be between xmin and xmax (static discipline)
        For Network B
        ysource: will always be 600
        xsource: will always be between xmin and xmax (static discipline)
        """
        while True:
            # Randomly pick a target x position on the opposite side.
            Xnew = np.random.choice([i for i in range(
                self.StaticDiscipline['xmin'], self.StaticDiscipline['xmax'])], 1)

            if xsource == Xnew[0]:
                continue  # avoid a vertical line (division by zero below)

            # Straight line from (xsource, ysource) to (Xnew[0], Ynew): y = slope*x + intercept
            slope = (ysource - Ynew) / (xsource - Xnew[0])
            intercept = ysource - (slope * xsource)
            break

        # Interpolate `divisor` x positions between the source and the target.
        XNewList = [xsource]

        if xsource < Xnew[0]:
            differences = Xnew[0] - xsource
            increment = differences / divisor
            newXval = xsource
            for i in range(divisor):
                newXval += increment
                XNewList.append(int(newXval))
        else:
            differences = xsource - Xnew[0]
            decrement = differences / divisor
            newXval = xsource
            for i in range(divisor):
                newXval -= decrement
                XNewList.append(int(newXval))

        # Compute the matching y position for every x position on the line.
        yNewList = []
        for i in XNewList:
            findy = (slope * i) + intercept
            yNewList.append(int(findy))

        return XNewList, yNewList

Here is how this works once the networks are initialized (Network A for Agent A and Network B for Agent B):

net = Network(150, 450, 100, 600)
NetworkA = net.network(300, ysource=100, Ynew=600)   # trajectory from agent A's side to agent B's side
NetworkB = net.network(200, ysource=600, Ynew=100)   # trajectory from agent B's side to agent A's side

Each network is bounded by the directions of ball movement. Network A represents agent A: it defines the movement of the ball from agent A to any position between 100 and 300 along the x-axis on agent B’s side. The same applies to Network B (agent B).

When the network is started, the .network method discretely generates 50 y-points (between y1 = 100 and y2 = 600) and the corresponding x-points (from x1, the current location of the ball on agent A’s side, to a randomly selected point x2 on agent B’s side) for Network A. The same applies to Network B (agent B).
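For example, assuming the Network class above has been defined, one generated trajectory can be inspected directly (the exact numbers will differ, because the target x2 is chosen at random):

xs, ys = net.network(300, ysource=100, Ynew=600)
print(len(xs), len(ys))        # 51 points each: the source position plus 50 frames
print(list(zip(xs, ys))[:5])   # the first few (x, y) positions of the ball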

To automate the movement of each agent, the opposing agent has to move in a direction corresponding to the ball. This is done by setting the x position of the opposing agent to the x position of the ball, as in the code below.

playerax = ballx   # agent A follows the ball along the x-axis

playerbx = ballx   # agent B follows the ball along the x-axis

Meanwhile, the source agent has to move back from its current position to its default position. The code below illustrates this.

def DefaultToPosition(x1, x2=300, divisor=50):
    # Interpolate `divisor` x positions from the current x (x1) back to the default x (x2).
    XNewList = []
    if x1 < x2:
        differences = x2 - x1
        increment = differences / divisor
        newXval = x1
        for i in range(divisor):
            newXval += increment
            XNewList.append(int(np.floor(newXval)))
    else:
        differences = x1 - x2
        decrement = differences / divisor
        newXval = x1
        for i in range(divisor):
            newXval -= decrement
            XNewList.append(int(np.floor(newXval)))
    return XNewList
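As a quick check of how DefaultToPosition behaves (assuming the default x position of 300 used above):

out = DefaultToPosition(450)
print(len(out))   # 50 intermediate x positions, one per frame
print(out[-1])    # the agent ends up back at the default x position, 300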

Now, to make the agents play with each other recursively, this has to run in a loop. After every 50 counts (50 frames of ball display), the opposing player becomes the next player. The code below puts it all together in a loop.

def main():
    # Note: display(), DISPLAYSURF, PLAYERA, PLAYERB, ball, fpsClock and FPS, as well as
    # the state variables (count, nextplayer, ballx, bally, playerax, playerbx,
    # lastxcoordinate, out), are initialized in the full source during pygame setup.
    while True:
        display()
        if nextplayer == 'A':
            # Agent A serves: generate a new ball trajectory on the first frame.
            if count == 0:
                NetworkA = net.network(
                    lastxcoordinate, ysource=100, Ynew=600)
                out = DefaultToPosition(lastxcoordinate)

                bally = NetworkA[1][count]
                playerax = ballx   # agent A stays with the ball at the serve
                count += 1
            else:
                # Follow the precomputed trajectory frame by frame.
                ballx = NetworkA[0][count]
                bally = NetworkA[1][count]
                playerbx = ballx       # agent B tracks the incoming ball
                playerax = out[count]  # agent A returns to its default position
                count += 1

            # After 50 frames, hand over to agent B.
            if count == 49:
                count = 0
                nextplayer = 'B'
            else:
                nextplayer = 'A'

        else:
            # Agent B serves: generate a new ball trajectory on the first frame.
            if count == 0:
                NetworkB = net.network(
                    lastxcoordinate, ysource=600, Ynew=100)
                out = DefaultToPosition(lastxcoordinate)

                bally = NetworkB[1][count]
                playerbx = ballx   # agent B stays with the ball at the serve
                count += 1
            else:
                ballx = NetworkB[0][count]
                bally = NetworkB[1][count]
                playerbx = out[count]  # agent B returns to its default position
                playerax = ballx       # agent A tracks the incoming ball
                count += 1

            # After 50 frames, hand over to agent A.
            if count == 49:
                count = 0
                nextplayer = 'A'
            else:
                nextplayer = 'B'

        # Draw both players and the ball at their new positions.
        DISPLAYSURF.blit(PLAYERA, (playerax, 50))
        DISPLAYSURF.blit(PLAYERB, (playerbx, 600))
        DISPLAYSURF.blit(ball, (ballx, bally))

        lastxcoordinate = ballx

        pygame.display.update()
        fpsClock.tick(FPS)

        for event in pygame.event.get():
            if event.type == QUIT:
                pygame.quit()
                sys.exit()

And this is basic model-free reinforcement learning. It’s model-free because no form of learning or modelling is needed for the two agents to play simultaneously and accurately.

Tennis game using Deep Q Network – model-based Reinforcement Learning

A typical example of model-based reinforcement learning is the Deep Q Network. The source code for this work is available here.

The code below illustrates the Deep Q Network, which is the model architecture for this work.

from collections import deque

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


class DQN:
    def __init__(self):
        self.learning_rate = 0.001
        self.momentum = 0.95
        self.eps_min = 0.1                 # final exploration rate
        self.eps_max = 1.0                 # initial exploration rate
        self.eps_decay_steps = 2000000     # steps over which epsilon is annealed
        self.replay_memory_size = 500
        self.replay_memory = deque([], maxlen=self.replay_memory_size)
        self.n_steps = 4000000
        self.training_start = 10000
        self.training_interval = 4
        self.save_steps = 1000
        self.copy_steps = 10000
        self.discount_rate = 0.99
        self.skip_start = 90
        self.batch_size = 100
        self.iteration = 0
        self.done = True

        # Build the Q-value network once at construction time.
        self.model = self.DQNmodel()

    def DQNmodel(self):
        # 1 input (the previous state index), 2 hidden layers of 64 units,
        # and 10 outputs (one per state/action).
        model = Sequential()
        model.add(Dense(64, input_shape=(1,), activation='relu'))
        model.add(Dense(64, activation='relu'))
        model.add(Dense(10, activation='softmax'))
        model.compile(loss='categorical_crossentropy',
                      optimizer=Adam(learning_rate=self.learning_rate))
        return model

    def sample_memories(self, batch_size):
        # Sample a random batch of (state, action, reward, next_state, continue)
        # tuples from the replay memory.
        indices = np.random.permutation(len(self.replay_memory))[:batch_size]
        cols = [[], [], [], [], []]
        for idx in indices:
            memory = self.replay_memory[idx]
            for col, value in zip(cols, memory):
                col.append(value)
        cols = [np.array(col) for col in cols]
        return (cols[0], cols[1], cols[2].reshape(-1, 1), cols[3], cols[4].reshape(-1, 1))

    def epsilon_greedy(self, q_values, step):
        # Linearly anneal epsilon from eps_max to eps_min over eps_decay_steps.
        self.epsilon = max(self.eps_min,
                           self.eps_max - (self.eps_max - self.eps_min) * step / self.eps_decay_steps)
        if np.random.rand() < self.epsilon:
            return np.random.randint(10)   # explore: random action
        else:
            return np.argmax(q_values)     # exploit: best-known action
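Assuming the class above is defined and keras/numpy are available, it can be exercised on its own roughly like this; the input is just an arbitrary previous-state index, and the Q-values are only meaningful after training.

agent = DQN()

# Q-values for a single (previous) state index, shaped (1, 1) to match the Dense input.
q_values = agent.model.predict(np.array([[3]]))

# At step 0 epsilon is 1.0, so the returned action is a random integer in 0..9.
action = agent.epsilon_greedy(q_values, step=0)
print(q_values.shape, action)   # (1, 10) and the chosen action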

In this case, we need a policy network to control the movement of each agent as it moves along the x-axis. Since the x values are continuous (from x1 = 100 to x2 = 300), we can’t have a model that predicts or works with 200 distinct states.

To simplify the problem, we can split the range between x1 and x2 into 10 states (and 10 actions), and define an upper and lower bound for each state.

Note that we have 10 actions because, from any state, there are 10 possible target states.

The code below illustrates the definition of both upper and lower bounds for each state.

def evaluate_state_from_last_coordinate(self, c):
       """
       cmax: 450
       cmin: 150
 
        c definitely will be between 150 and 450.
       state0 - (150 - 179)
       state1 - (180 - 209)
       state2 - (210 - 239)
       state3 - (240 - 269)
       state4 - (270 - 299)
       state5 - (300 - 329)
       state6 - (330 - 359)
       state7 - (360 - 389)
       state8 - (390 - 419)
       state9 - (420 - 450)
       """
       if c >= 150 and c <= 179:
           return 0
       elif c >= 180 and c <= 209:
           return 1
       elif c >= 210 and c <= 239:
           return 2
       elif c >= 240 and c <= 269:
           return 3
       elif c >= 270 and c <= 299:
           return 4
       elif c >= 300 and c <= 329:
           return 5
       elif c >= 330 and c <= 359:
           return 6
       elif c >= 360 and c <= 389:
           return 7
       elif c >= 390 and c <= 419:
           return 8
       elif c >= 420 and c <= 450:
           return 9
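As a side note, because every band is exactly 30 pixels wide, the same mapping could be written arithmetically instead of as a chain of comparisons; this is just an equivalent sketch, not what the project code does.

def state_from_coordinate(c, cmin=150, bin_width=30, n_states=10):
    # Map an x coordinate in [150, 450] to a state index in 0..9.
    return min((int(c) - cmin) // bin_width, n_states - 1)

For example, state_from_coordinate(179) returns 0 and state_from_coordinate(420) returns 9, matching the bands listed above.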

The Deep Neural Network (DNN) used experimentally for this work has 1 input (which represents the previous state), 2 hidden layers of 64 neurons each, and an output layer of 10 neurons (a softmax over the 10 possible states). This is shown below:

def DQNmodel(self):
    # 1 input (the previous state index), 2 hidden layers of 64 units,
    # and 10 outputs (one per state/action).
    model = Sequential()
    model.add(Dense(64, input_shape=(1,), activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(10, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(learning_rate=self.learning_rate))
    return model

Now that we have a DQN model that predicts the next state/action, and the Pytennis environment has already sorted out moving the ball in a straight line, let’s write a function that carries out an action for an agent, based on the DQN model’s prediction of its next state.

The code below illustrates how agent A decides where to direct the ball (onto agent B’s side, and vice versa). It also evaluates whether agent B was able to receive the ball.

   def randomVal(self, action):
       """
       cmax: 450
       cmin: 150
 
        c definitely will be between 150 and 450.
       state0 - (150 - 179)
       state1 - (180 - 209)
       state2 - (210 - 239)
       state3 - (240 - 269)
       state4 - (270 - 299)
       state5 - (300 - 329)
       state6 - (330 - 359)
       state7 - (360 - 389)
       state8 - (390 - 419)
       state9 - (420 - 450)
       """
       if action == 0:
           val = np.random.choice([i for i in range(150, 180)])
       elif action == 1:
           val = np.random.choice([i for i in range(180, 210)])
       elif action == 2:
           val = np.random.choice([i for i in range(210, 240)])
       elif action == 3:
           val = np.random.choice([i for i in range(240, 270)])
       elif action == 4:
           val = np.random.choice([i for i in range(270, 300)])
       elif action == 5:
           val = np.random.choice([i for i in range(300, 330)])
       elif action == 6:
           val = np.random.choice([i for i in range(330, 360)])
       elif action == 7:
           val = np.random.choice([i for i in range(360, 390)])
       elif action == 8:
           val = np.random.choice([i for i in range(390, 420)])
       else:
           val = np.random.choice([i for i in range(420, 450)])
       return val
 
    def stepA(self, action, count=0):
        # On the first frame of a rally, generate a new ball trajectory
        # from agent A's side towards agent B's side.
        if count == 0:
            self.NetworkA = self.net.network(
                self.ballx, ysource=100, Ynew=600)
            self.bally = self.NetworkA[1][count]
            self.ballx = self.NetworkA[0][count]

            if self.GeneralReward == True:
                # Aim at a point inside the band chosen by the DQN action.
                self.playerax = self.randomVal(action)
            else:
                self.playerax = self.ballx

        else:
            # On later frames, simply follow the precomputed trajectory.
            self.ballx = self.NetworkA[0][count]
            self.bally = self.NetworkA[1][count]

        obsOne = self.evaluate_state_from_last_coordinate(
            int(self.ballx))       # state of the ball
        obsTwo = self.evaluate_state_from_last_coordinate(
            int(self.playerbx))    # state of agent B
        diff = np.abs(self.ballx - self.playerbx)
        obs = obsTwo
        reward = self.evaluate_action(diff)
        done = True
        info = str(diff)

        return obs, reward, done, info

    def evaluate_action(self, diff):
        # Reward of 1 (True) if the receiving agent is within 30 pixels
        # of the ball, otherwise 0 (False).
        if (int(diff) <= 30):
            return True
        else:
            return False

From the code above, the function stepA gets executed when agent A has to play. While playing, agent A uses the next action predicted by the DQN to pick a target (an x2 position on agent B’s side, reached from the ball’s current position x1 on its own side), and then uses the ball-trajectory network provided by the Pytennis environment to make its move.

Agent A obtains a precise point x2 on agent B’s side by using the function randomVal, shown above, to randomly select an x2 coordinate within the band given by the DQN action.

Finally, stepA evaluates agent B’s response to the target point x2 using the function evaluate_action, which decides whether agent B should be penalized or rewarded. Everything described here for agent A to agent B also applies for agent B to agent A (same code, different variable names).

Now that we have the policy, reward, environment, states and actions correctly defined, we can go ahead and recursively make the two agents play the game with each other. 

The code below shows how the agents take turns after every 50 ball displays. Note that for each ball display, the DQN decides where to toss the ball for the next agent to play. (The full script also imports TensorFlow as tf, which is used below to one-hot encode the training targets.)

        # Inside the game class's training loop; the surrounding method and
        # variable initialization (iteration, iterations, count, stateA, stateB, ...)
        # are in the full source.
        while iteration < iterations:
 
           self.display()
           self.randNumLabelA = self.myFontA.render(
               'A (Win): '+str(self.updateRewardA) + ', A(loss): '+str(self.lossA), 1, self.BLACK)
           self.randNumLabelB = self.myFontB.render(
               'B (Win): '+str(self.updateRewardB) + ', B(loss): ' + str(self.lossB), 1, self.BLACK)
           self.randNumLabelIter = self.myFontIter.render(
               'Iterations: '+str(self.updateIter), 1, self.BLACK)
 
           if nextplayer == 'A':
 
               if count == 0:
                   
                   q_valueA = self.AgentA.model.predict([stateA])
                   actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)
 
                   
                   obsA, rewardA, doneA, infoA = self.stepA(
                       action=actionA, count=count)
                   next_stateA = actionA
 
                   
                   self.AgentA.replay_memory.append(
                       (stateA, actionA, rewardA, next_stateA, 1.0 - doneA))
                   stateA = next_stateA
 
                elif count == 49:  # last frame of the rally: evaluate the outcome and train
 
                   
                   q_valueA = self.AgentA.model.predict([stateA])
                   actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)
                   obsA, rewardA, doneA, infoA = self.stepA(
                       action=actionA, count=count)
                   next_stateA = actionA
 
                   self.updateRewardA += rewardA
                   self.computeLossA(rewardA)
 
                   
                   self.AgentA.replay_memory.append(
                       (stateA, actionA, rewardA, next_stateA, 1.0 - doneA))
 
                   
                    if rewardA == 0:  # agent B missed the ball
                       self.restart = True
                       time.sleep(0.5)
                       nextplayer = 'B'
                       self.GeneralReward = False
                   else:
                       self.restart = False
                       self.GeneralReward = True
 
                   
                    # Sample a batch from replay memory and compute the Q-Learning targets.
                    X_state_val, X_action_val, rewards, X_next_state_val, continues = (
                        self.AgentA.sample_memories(self.AgentA.batch_size))
                   next_q_values = self.AgentA.model.predict(
                       [X_next_state_val])
                   max_next_q_values = np.max(
                       next_q_values, axis=1, keepdims=True)
                   y_val = rewards + continues * self.AgentA.discount_rate * max_next_q_values
 
                   
                   self.AgentA.model.fit(X_state_val, tf.keras.utils.to_categorical(
                       X_next_state_val, num_classes=10), verbose=0)
 
                   nextplayer = 'B'
                   self.updateIter += 1
 
                   count = 0
                   
 
               else:
                   
                   q_valueA = self.AgentA.model.predict([stateA])
                   actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)
 
                   
                   obsA, rewardA, doneA, infoA = self.stepA(
                       action=actionA, count=count)
                   next_stateA = actionA
 
                   
                   self.AgentA.replay_memory.append(
                       (stateA, actionA, rewardA, next_stateA, 1.0 - doneA))
                   stateA = next_stateA
 
               if nextplayer == 'A':
                   count += 1
               else:
                   count = 0
 
           else:
               if count == 0:
                   
                   q_valueB = self.AgentB.model.predict([stateB])
                   actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)
 
                   
                   obsB, rewardB, doneB, infoB = self.stepB(
                       action=actionB, count=count)
                   next_stateB = actionB
 
                   
                   self.AgentB.replay_memory.append(
                       (stateB, actionB, rewardB, next_stateB, 1.0 - doneB))
                   stateB = next_stateB
 
                elif count == 49:  # last frame of the rally: evaluate the outcome and train
 
                   
                   q_valueB = self.AgentB.model.predict([stateB])
                   actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)
 
                   
                    obsB, rewardB, doneB, infoB = self.stepB(
                        action=actionB, count=count)
                   next_stateB = actionB
 
                   
                   self.AgentB.replay_memory.append(
                       (stateB, actionB, rewardB, next_stateB, 1.0 - doneB))
 
                   stateB = next_stateB
                   self.updateRewardB += rewardB
                   self.computeLossB(rewardB)
 
                   
                    if rewardB == 0:  # agent A missed the ball
                       self.restart = True
                       time.sleep(0.5)
                       self.GeneralReward = False
                       nextplayer = 'A'
                   else:
                       self.restart = False
                       self.GeneralReward = True
 
                   
                    # Sample a batch from replay memory and compute the Q-Learning targets.
                    X_state_val, X_action_val, rewards, X_next_state_val, continues = (
                        self.AgentB.sample_memories(self.AgentB.batch_size))
                   next_q_values = self.AgentB.model.predict(
                       [X_next_state_val])
                   max_next_q_values = np.max(
                       next_q_values, axis=1, keepdims=True)
                   y_val = rewards + continues * self.AgentB.discount_rate * max_next_q_values
 
                   
                   self.AgentB.model.fit(X_state_val, tf.keras.utils.to_categorical(
                       X_next_state_val, num_classes=10), verbose=0)
 
                   nextplayer = 'A'
                   self.updateIter += 1
                   
 
               else:
                   
                   q_valueB = self.AgentB.model.predict([stateB])
                   actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)
 
                   
                   obsB, rewardB, doneB, infoB = self.stepB(
                       action=actionB, count=count)
                   next_stateB = actionB
 
                   
                   self.AgentB.replay_memory.append(
                       (stateB, actionB, rewardB, next_stateB, 1.0 - doneB))
                    stateB = next_stateB
 
               if nextplayer == 'B':
                   count += 1
               else:
                   count = 0
 
           iteration += 1

Comparison/Evaluation

Having played this game both model-free and model-based, here are some differences to be aware of:

s/n | Model-free | Model-based
1 | Rewards are not accounted for (since this is automated, reward = 1). | Rewards are accounted for.
2 | No modelling (no decision policy is required). | Modelling is required (policy network).
3 | Doesn't require the use of initial states to predict the next state. | Requires the use of initial states to predict the next state using the policy network.
4 | The rate of missing the ball with respect to time is zero. | The rate of missing the ball with respect to time approaches zero.

If you’re interested, the videos below show these two techniques in action playing tennis games:

1. Model-free

2. Model-based

Conclusion

Tennis might be simple compared to self-driving cars, but hopefully this example showed you a few things about RL that you didn’t know. 

The main difference between model-free and model-based RL is the policy network, which is required for model-based RL and unnecessary in model-free. 

It’s worth noting that model-based RL often takes a massive amount of time for the DNN to learn the states well without getting them wrong.

But every technique has its drawbacks and advantages; choosing the right one depends on what exactly you need your program to do.

Thanks for reading, I left a few additional references for you to follow if you want to explore this topic more.

References

  1. AlphaGo documentary: https://www.youtube.com/watch?v=WXuK6gekU1Y
  2. List of reinforcement learning environments: https://medium.com/@mauriciofadelargerich/reinforcement-learning-environments-cff767bc241f
  3. Create your own reinforcement learning environment: https://towardsdatascience.com/create-your-own-reinforcement-learning-environment-beb12f4151ef
  4. Types of RL Environments: https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781838649777/1/ch01lvl1sec14/types-of-rl-environment
  5. Model-based Deep Q Network: https://github.com/elishatofunmi/pytennis-Deep-Q-Network-DQN
  6. Discrete mathematics approach YouTube video: https://youtu.be/iUYxZ2tYKHw
  7. Deep Q Network approach YouTube video: https://youtu.be/FCwGNRiq9SY
  8. Model-free discrete mathematics implementation: https://github.com/elishatofunmi/pytennis-Discrete-Mathematics-Approach-
  9. Hands-on Machine Learning with scikit-learn and TensorFlow: https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291
