Integrating policy transfer, policy reuse and experience replay in speeding up reinforcement learning of the obstacle avoidance task
Obstacle avoidance is one of the key functionalities necessary for the proper functioning of an autonomous mobile robot. Smart & Kaelbling (2002) showed that the time required to learn this task using reinforcement learning can be infeasibly long, even for simplified versions of the task. Knowledge transfer and experience replay are two techniques that have been suggested to speed up reinforcement learning, but their application to the obstacle avoidance task has been very limited. For instance, Lin (1991) applied experience replay to an obstacle avoidance task in which the robot's environment was bounded by a wall, so the control agent only needed to learn to keep the robot from colliding with that wall. Smart & Kaelbling (2002) applied a form of knowledge transfer known as teaching to the obstacle avoidance task, but with only one obstacle in the environment. Policy reuse and policy transfer are two knowledge transfer techniques that have been shown to improve learning performance when applied to the Keepaway subproblem of robotic soccer (Taylor & Stone, 2009; Fernández et al., 2010). These techniques apply when it is possible to first learn a simpler version of the intended task and then reuse the knowledge acquired there to bootstrap learning in the intended task. The simpler version is called the source task, while the intended task is called the target task. The obstacle avoidance task can be structured in this way, for example by creating a source task containing fewer actions than the intended target task. In this study, we investigated how policy transfer, policy reuse and experience replay can be combined to speed up learning in an obstacle avoidance task, measuring the performance of the techniques both in isolation and in combination.
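One common way to realize policy transfer between tasks that differ only in their action sets is to initialize the target task's value function from the source task's learned values, so that actions shared with the source task start from informed estimates. The following Python sketch illustrates the idea for tabular Q-learning; the function name, table structure and default value are illustrative assumptions, not the implementation used in this study.

```python
def transfer_q_values(source_q, target_actions, default=0.0):
    """Initialize a target-task Q-table from a source-task Q-table.

    source_q maps state -> {action: value} as learned in the source task.
    Actions shared with the source task inherit their learned values;
    actions new to the target task start at a default value.
    (Illustrative sketch only, not the study's implementation.)
    """
    target_q = {}
    for state, action_values in source_q.items():
        target_q[state] = {
            a: action_values.get(a, default) for a in target_actions
        }
    return target_q
```

Learning in the target task then proceeds as normal, but from this bootstrapped table rather than from an uninformed initialization.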
The experiments testing the performance of the techniques were set up in a robotics simulation environment. We used two source-target task pairs: the first comprised two actions in the source task and three actions in the target task, while the second comprised three actions in the source task and six actions in the target task. Performance was measured by the average number of times the robot reached the goal position under the guidance of the learned policy, calculated both at the initial level and at the asymptotic level. We found that in the first task pair, policy transfer outperformed policy reuse in speeding up learning in the target task, by about 30% at the initial level and about 10% at the asymptotic level. Combining policy transfer with policy reuse did not lead to any significant improvement in the first task pair, but it did in the second. The improved performance can be attributed to the six-action task being more difficult and therefore benefiting more from bootstrapping. The combined techniques outperformed policy reuse alone by 10 percentage points in the six-action task, based on the number of episodes that ended at the goal region. The combination also overcame the declining performance observed in policy transfer from the three-action task to the six-action task. We also found that while experience replay led to significant improvement in learning the source task, it offered no improvement when combined with knowledge transfer in the target task; in fact, it degraded performance in most cases. For instance, when experience replay was combined with policy transfer in the six-action task, the initial performance was 89% while the asymptotic performance was 59%.
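Policy reuse, following Fernández et al., is typically realized by probabilistically mixing the source-task policy into the target task's exploration strategy: with some probability the agent follows the old policy, and otherwise it acts (epsilon-)greedily on its current estimates. A minimal sketch of such a π-reuse-style action selector, with hypothetical names and a simple tabular Q representation (not the study's code):

```python
import random

def pi_reuse_action(state, past_policy, q_values, actions, psi, epsilon=0.1):
    """Select an action under a pi-reuse-style exploration strategy.

    With probability psi, follow the policy learned in the source task;
    otherwise act epsilon-greedily on the target task's current Q-values.
    (Illustrative sketch of the policy-reuse idea, not the study's code.)
    """
    if random.random() < psi:
        return past_policy(state)          # exploit the source-task policy
    if random.random() < epsilon:
        return random.choice(actions)      # explore
    return max(actions, key=lambda a: q_values[state].get(a, 0.0))  # greedy
```

In practice, psi is decayed over the course of each episode, so reliance on the source-task policy fades as the target-task estimates improve.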
When policy reuse was combined with experience replay, the initial performance was 68% while the final performance was 59%; the combination thus yielded some initial improvement, but the improvement was not sustained. The main contribution of this work is a reinforcement learning framework that combines experience replay, policy reuse and policy transfer in learning the obstacle avoidance task. We have also shown when each of these techniques is most useful for improving learning performance in reinforcement learning. These results should promote greater adoption and acceptance of reinforcement learning techniques in the development of autonomous mobile robots.
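Experience replay, in the spirit of Lin (1991), stores past transitions and re-presents them to the learner in random order, so that each interaction with the environment can drive several value updates. A minimal sketch of such a buffer follows; the class name, capacity and tuple layout are illustrative assumptions rather than the study's implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) transitions.

    Old transitions are evicted once capacity is reached; sampling returns
    a random batch for replaying to the learning algorithm.
    (Illustrative sketch only, not the study's implementation.)
    """
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

A learner would call `add` after each environment step and periodically update its value estimates from `sample`, in addition to (or instead of) the most recent transition.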