*Junhong Xu, Kai Yin, Lantao Liu*

We propose a principled kernel-based policy iteration algorithm to solve continuous-state Markov Decision Processes (MDPs). In contrast to most decision-theoretic planning frameworks, which assume fully known state transition models, we design a method that eliminates such a strong assumption, which is oftentimes extremely difficult to engineer in reality. To achieve this, we first apply a second-order Taylor expansion of the kernelized value function. The Bellman equation is then approximated by a partial differential equation that relies only on the first and second moments of the transition model. Combining this with the kernel representation of the value function, we then design an efficient policy iteration algorithm whose policy evaluation step can be represented as a linear system of equations evaluated at a finite set of supporting states. We have validated the proposed method through extensive simulations on both simplified and realistic planning scenarios, and the experiments show that our proposed approach leads to much superior performance over several baseline methods.
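To make the abstract's pipeline concrete, here is a minimal one-dimensional sketch of the kind of computation it describes: a kernelized value function, a second-order Taylor expansion of the expected next-state value using only the transition mean and variance, and policy evaluation as a single linear solve at the supporting states. The RBF kernel, the toy drift/noise model, and all constants below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Illustrative settings (not from the paper): RBF kernel width,
# discount factor, and a 1-D grid of supporting states.
ell, gamma = 0.15, 0.95
X = np.linspace(0.0, 1.0, 15)

def k(s, x):                       # RBF kernel k(s, x)
    return np.exp(-(s - x) ** 2 / (2 * ell ** 2))

def dk(s, x):                      # dk/ds
    return -(s - x) / ell ** 2 * k(s, x)

def d2k(s, x):                     # d^2k/ds^2
    return ((s - x) ** 2 / ell ** 4 - 1.0 / ell ** 2) * k(s, x)

# The only transition statistics used: mean displacement and variance
# under a fixed policy (a toy drift toward s = 1 with Gaussian noise).
mu = lambda s: 0.05 * (1.0 - s)
sig2 = 0.01

r = -np.abs(X - 1.0)               # reward: approach s = 1

S, Xg = X[:, None], X[None, :]
K = k(S, Xg)
# Second-order Taylor expansion of E[k(s', x_j)] around s:
K_next = K + mu(X)[:, None] * dk(S, Xg) + 0.5 * sig2 * d2k(S, Xg)

# With v = K w, the Bellman equation v = r + gamma * E[v'] becomes
# one regularized linear system at the supporting states:
w = np.linalg.solve(K - gamma * K_next + 1e-6 * np.eye(len(X)), r)
v = K @ w                          # approximate values at the supports
```

The point of the sketch is only that the expectation over next states never appears explicitly: it is replaced by kernel derivatives weighted by the first two moments, which is the property the review discusses below.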

Start Time | End Time
---|---
07/15 15:00 UTC | 07/15 17:00 UTC

Overall, the paper is very well written. The rationale for the proposed approach is well supported and placed within the context of existing literature. The results and accompanying figures do a good job of illustrating the performance of the proposed method. The paper is original and the quality of the presentation is very high with a clear description of the proposed method. The results are compelling and the contrast against other methods is well supported.

The purpose of the paper itself is not clear. One could assume that the authors have a general solution for continuous MDPs, but they define their function approximations without mentioning error bounds or convergence properties. In the second half of the paper it becomes clear that the intended application is path/navigational planning, so the framework and approximation criteria make more sense from this applied perspective. An interesting contribution is that the function approximation is based on only the mean and variance of the probability distribution of action outcomes, so, as the authors suggest, it reduces the dependence on a complete transition model that is typical of planning problems. However, the stated limitations of both MDP planning and RL seem a bit exaggerated: *cumbersome* training trials, *unrealistic* transition models. In contrast, this model is heavily reliant on a prebuilt set of supporting states and is sensitive to at least two hyperparameters, an issue that is only casually addressed as part of the experiments and not as part of the principles. I think the tradeoff is justified for navigational planning tasks, but the paper starts from a different premise. Consider, on the other hand, a paper like Bonet, Blai, "An ε-optimal grid-based algorithm for partially observable Markov decision processes," Proc. of the 19th Int. Conf. on Machine Learning (ICML-02), 2002, which is intended for POMDPs but addresses the general case of how a grid-based approximation affects value functions in large state spaces.

In general, the writing quality could be improved: it is somewhat repetitive (e.g., the first paragraph of Section IV) and overexplained (e.g., item 2 in Section V-A is unnecessary, since average returns are standard practice). There are also several cases of incorrect use of articles (e.g., "grids in vicinity," "characteristics of state space," "the value iteration," etc.) that interrupt the reader. Some of the mathematical notation is also not properly introduced (m and N in Eq. 4, x and y in Section IV-A, etc.). In Eqs. 6a, 6b, 13, and 14, what is the meaning of (s' - s)? It looks like a difference between states; I suppose the authors mean (v(s') - v(s)) or an equivalent function. Also, are both 6a and 6b defined as sums over a *discrete* set of states?

The results show two sets of simulated navigation problems, one with obstacle avoidance and another with punishing Martian terrain. I am not sure why the authors need to say that the first problem is goal oriented; all planning is goal oriented (even if the goal is solving an optimization problem). In the first problem they test four techniques on a varying number of supporting (grid) states, in which their Taylor-based approach performs similarly to the direct kernel-based approach, and both of these obtain higher average returns than a neural-network approach and some unspecified grid-based PI. The proposed Taylor method and the direct kernel method also have very similar runtimes, although the Taylor method requires only the mean and the variance. There are some qualitative evaluations, such as a "reasonable approximation" and an "aggressive policy," that seem largely subjective. An important thing to address here is whether all four methods used the same supporting states. If so, this might constitute a bias that could have affected the real performance of methods that perform their own state abstractions, such as the NN method (maybe) or a more sophisticated grid-based PI; not enough information is provided. In the Mars navigation problem they test three state-sampling methods and show that importance sampling (based on the terrain slope) produces better results than even/uniform sampling, but this should be expected. Notice that this kind of sampling assumes the existence of a full (and accurate) map in advance, which, it could be argued, decreases the benefit of not requiring a full transition model.
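For concreteness, the slope-based importance sampling of supporting states discussed above can be sketched as follows. The terrain map, sample counts, and the choice of sampling probabilities proportional to slope magnitude are all illustrative assumptions; the paper's actual Mars terrain data and sampling procedure are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative terrain height map (a smoothed random walk) and its
# slope magnitude; stands in for the paper's Mars elevation data.
h = rng.standard_normal((50, 50)).cumsum(0).cumsum(1)
gy, gx = np.gradient(h)
slope = np.hypot(gx, gy)

# Draw supporting states with probability proportional to slope,
# concentrating them where the terrain is most punishing.
p = slope.ravel() / slope.sum()
idx = rng.choice(slope.size, size=200, replace=False, p=p)
support = np.column_stack(np.unravel_index(idx, slope.shape))
```

Note that computing `slope` presupposes exactly the full, accurate map whose availability the review questions.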
The main conclusion drawn by the authors themselves is that the performance of their algorithm is mostly determined by the number and distribution/selection of supporting states, but I think this is a general conclusion applicable to all grid-based approximation methods. The proposed method seems appealing for motion and navigational planning tasks, and it appears to perform at least as well as the direct kernel-based method. Whether this Taylor-based method requires less prior knowledge (or fewer assumptions) than competitive approaches is probably domain dependent: it needs only the mean and variance of action outcomes, but it assumes there is sufficient correct information in advance to perform a meaningful selection of supporting states, which turns out to be the most significant factor. I think the paper is somewhat disorganized in both ideas and structure, and the experimental results are not very convincing. It does have an interesting contribution, but the paper quality needs to be substantially improved.