This blog post presents some results presented by myself and my Ph. D. student Javier Garcia-Barcos. To help facilitate discussion of these results, which were originally presented at IJCAI 2019, we start this post with some historical perspective on the topic of robotic planning.
Robot planning and reinforcement learning
Hans Moravec is one of the founding fathers of artificial intelligence and robotics. His research was rooted on the idea that motion and perception are inherent capabilities required to achieve true intelligence. A passive learning system can never be considered a true artificial intelligence.
He also coined Moravec’s paradox, which says that: “it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility”. Although the paradox was defined in the 80s, it is still quite relevant these days. We have seen it many times. There are machines that are able to reach superhuman performance at Chess, or recently, at the game of Go, but any toddler can beat a robot to just lay the board and manipulate the pieces. In fact, neuroscience has confirmed that sensorimotor tasks require more computational power from our brain, than most high level reasoning, but thanks to evolution, we are unaware of the former.
Moravec also developed one of the first planning algorithms that allowed a robot to move autonomously across a room, avoiding obstacles using only onboard sensors and computers. In the 21st century, sensors are much more precise and computing power is several orders of magnitude higher than the computers of Moravec. Furthermore, there has been tremendous research in that area that has culminated in planetary rovers and self-driving cars. Yet, the problem of motion planning for robots is far from being solved.
Classic planning, as the one that Moravec and other robotics researchers did in the 80s and 90s, relies on perfect knowledge of the environment and accurate sensing and localization of the robot. This might be true in certain situations, like automated factories and warehouses, but it is not true for self-driving cars or even service robots like those small autonomous vacuum cleaners that many have wandering in our living room.
Figure 1: A robot is navigating through a room while building a map of the environment. Depending on the trajectory, the sensors perceive the room differently and the quality of the map can change considerably. (© R:SS conference. University of Zaragoza)
A more recent approach considers the planning problem within a Markov decision process (MDP), named after a brilliant Russian mathematician, which allows to consider uncertainty in the motion, perception and environment. This has also become the origin of many modern machine learning successes. Once we are in Markov’s territory, we can take two paths.
First, if we know “the rules of the game”, we can predict the probability of certain outcomes without even having “to play the game”. This has been literally applied to games, and it is the technique used in AlphaGo. But it has also been used in many other applications with tremendous economical impact, such as logistics, fraud detection, chatbots, etc. A second path for MDPs is the popular reinforcement learning, where the robot “learns the rules of the game” by playing it. This has even more potential and we are just seeing the tip of the iceberg, although playing the game without knowing the rules might be risky.
Recently, there has been increased popularity for a middle ground. We use reinforcement learning, but instead of taking actual risks, we use a simulator. But wait! If you can build a simulator, you know the rules of the game. Thus, you should be using the first approach, right? In this case, most engineers and programmers will tell you that it might be easier to define and code a full simulator, than to define some specific behavior or set of rules in a complex environment. As Leslie Kaelbling pointed out in a recent talk, reinforcement learning has become a modern form of compiler, where the engineers define the problem in a language that they understand, like a simulator, and the algorithm translates those specifications to the desired behavior for the robot or agent.
This is the approach that we decided to apply. Below, we detail our strategy for conducting reinforcement learning through policy search, where the desired behavior (policy) is optimized to solve the task. We use our favorite optimization algorithm for the job; however, we also included several tricks.
Active policy search
This is Bayesian optimization meets reinforcement learning in its core. One of the most popular approaches to RL is the set of algorithms following the policy search strategy. In policy search, the desired policy or behavior is found by iteratively trying and optimizing the current policy. For example, in Figure 1, we can look for the optimal parameters of the waypoints that define the robot path.
Most of the time, the policy is optimized using stochastic gradient descent, but it can be highly inefficient. As Gandalf travels through the mines of Moria in The Lord of the Rings he claims to have been in the labyrinthine caves before, but they have a hard time finding an exit because Gandalf claims: “I have no memory of this place”. Similarly, when gradient descent explores the same area that has previously been visited, there is nothing it can do, because it has no memory. Instead, the probabilistic approach of Bayesian optimization provides a map of previously visited places and improves the exploration.
Figure 2: A plane hovering a terrain and mapping certain landmarks. One advantage of our policy search strategy is that it can be easily exchanged and applied in multiple systems. (© University of Zaragoza)
As we have seen, the probabilistic model in Bayesian optimization is great to improve efficiency and reduce the total cost, even by several orders of magnitude (thousands can easily become dozens). However, the probabilistic model is also a double-edged sword. For example, we can draw a map in a piece of paper as roads, but the mines of Moria also have stairs up and down, with multiple levels. If we think we are at the wrong level, the results might be even worse than not having a map. This is especially important in RL applications, where the reward functions might not be as smooth or even continuous as the functions that we are used to work with, in machine learning.
The Spartan kernel is a self-adaptive kernel that is able to provide enough accuracy where it is needed, by combining a global and local approach. The name comes from the famous battle of Thermopylae, where the Spartan army was forced to retreat by the much larger Persian army, which was about to surround them. Instead of retreating all at one, they followed a similar strategy of using a few soldiers in a narrow pass (local), while the rest of the army was retreating along the mountains (global) to avoid being completely surrounded and decimated.
Stochastic acquisition function
One of the key elements in reinforcement learning is the exploration-exploitation trade-off. How much time do we spend finding new strategies vs how much time do we spend refining and fine-tuning the existing behavior. Without exploration, we might get stuck in a poor set of solutions. Without fine-tuning, the behavior might be too coarse. Exploration is also fundamental to remove any initial bias that we might have, especially in terms of modeling.
In all Bayesian optimization algorithms, the next evaluation is computed by maximizing some acquisition function. This greedy strategy corresponds to a pure exploitation setup. This might work beautifully if the probabilistic model is suitable for our problem. But, if there is some initial bias, the results might be really bad. Instead, we use a Boltzmann policy (also called softmax) which gives a probability to any point based on the value of the acquisition function and then, proceeds to select the next point at random. Thus, points with high value will be selected more frequently, but all the points can be selected eventually.
Figure 3: Example of a probabilistic model (blue) on top of the target function (red). The greedy policy selects the maximum of the acquisition function (red dot), but completely ignores the rest of the regions which are almost as valuable. With our stochastic policy, we could sample from all the nodes. (© IJCAI. University of Zaragoza)
Sometimes, the robot is not able to precisely follow the desired behavior. See the simulated example in the figure below. In this case, the desired path might depend on the precision of the robot. If the terrain is bumpy and the robot has trouble following a straight line, you might want to find a path that leaves all the obstacles and troubles at a safe distance. Instead, if the robot is able to precisely follow the ideal path, then we can take a riskier and more efficient route. By using the unscented transformation (a numerical trick inherited from control system theory) we are able to compute the path risk on the fly and adapt accordingly.
This is really important as conditions might change dynamically. For example, a self-driving car on a highway on a sunny day might make different decisions than the same car in a mountain road under heavy rain.
In conclusion, Bayesian optimization is much more than efficient optimization. Under the Bayesian umbrella we can deploy a vast number of tricks, provide all kinds of prior information and take advantage of the information encoded in the models in multiple ways.
Figure 4: A planetary rover traversing a bumpy road. In orange there are patches of tricky gravel paths and slippery terrain that the rover takes a lot of time and power to cross. Meanwhile the red boxes represent rocks blocking the way. Hitting a rock should be avoided. There is no absolute better path. If the robot has good navigation control and the systems are in good shape, the right path might be better because it is shorter and without orange terrain. On the other hand, if the rover is not able to precisely follow the path, because the sensors are covered in dust or the wheels have worn off, the left path is safer as it keeps rocks away. (Rover photo: NASA, public domain)
All of these strategies were combined to produce the results discussed in our IJCAI 2019 article. We have continued to pursue this exciting topic (see our NeurIPS 2019 workshop paper for some additional results) and hope to identify further improvements in future work. For previous work, the following articles can provide some additional perspective on the topic: Active Policy Learning for Robot Planning and Exploration under Uncertainty, Unscented Bayesian Optimization for Safe Robot Grasping, Funneled Bayesian Optimization for Design, Tuning and Control of Autonomous Systems.