Cristian Bodnar, Adrian Li, Karol Hausman, Peter Pastor, Mrinal Kalakrishnan
The distributional perspective on reinforcement learning (RL) has given rise to a series of successful Q-learning algorithms, resulting in state-of-the-art performance in arcade game environments. However, it has not yet been analyzed how these findings from a discrete setting translate to complex practical applications characterized by noisy, high-dimensional, and continuous state-action spaces. In this work, we propose Quantile QT-Opt (Q2-Opt), a distributional variant of the recently introduced distributed Q-learning algorithm for continuous domains, and examine its behaviour in a series of simulated and real vision-based robotic grasping tasks. The absence of an actor in Q2-Opt allows us to directly draw a parallel to the previous discrete experiments in the literature without the additional complexities induced by an actor-critic architecture. We demonstrate that Q2-Opt achieves a superior vision-based object grasping success rate while also being more sample efficient. The distributional formulation also allows us to experiment with various risk distortion metrics, which give an indication of how robots can concretely manage risk in practice using a deep RL control policy. As an additional contribution, we perform batch RL experiments in our virtual environment and compare them with the latest findings from discrete settings. Surprisingly, we find that the previous batch RL findings from the literature, obtained on arcade game environments, do not generalise to our setup.
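As a point of reference for the risk distortion metrics mentioned above, the snippet below is a minimal, illustrative sketch (not the authors' implementation) of how a set of quantile Q-value estimates can be collapsed into a risk-sensitive scalar, here with a CVaR-style distortion; the function name and the `alpha` parameter are assumptions.

```python
import numpy as np

def risk_distorted_value(quantile_values, alpha=0.25):
    """Collapse N quantile estimates of Z(s, a) into a scalar score.

    quantile_values: array of shape (N,), predicted quantile midpoints of the
                     return distribution for one (state, action) pair.
    alpha: CVaR level; alpha < 1 averages only the lowest quantiles
           (risk-averse), alpha = 1 recovers the risk-neutral mean.
    """
    q = np.sort(quantile_values)
    k = max(1, int(np.ceil(alpha * len(q))))  # number of lowest quantiles to keep
    return q[:k].mean()

# Risk-neutral vs. risk-averse scoring of the same quantile estimates.
z = np.array([-1.0, 0.2, 0.5, 0.8, 1.0])
print(risk_distorted_value(z, alpha=1.0))   # mean of all quantiles
print(risk_distorted_value(z, alpha=0.4))   # mean of the lowest 40% (CVaR)
```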
Start Time | End Time
---|---
07/16 15:00 UTC | 07/16 17:00 UTC
Originality: The authors build upon QT-Opt, which performs Q-learning in continuous action spaces by using the Cross-Entropy Method (CEM) to select maximum-value actions, and on recent advances in distributional reinforcement learning by modeling the distribution of Q-values with quantiles. The originality of the algorithm itself is minimal: it is essentially a previous method, QT-Opt, combined with prior distributional RL methods (such as Implicit Quantile Networks). Many parts of the method have already appeared in prior work; for example, risk distortion metrics were already used in the Implicit Quantile Networks paper. However, the study the authors conduct on the efficacy of distributional RL in a robotic grasping setting is novel and useful. The study itself is also quite thorough: several risk metrics are compared in both simulation and the real world.

Quality: As mentioned above, while there is little to no novelty in the method, there is merit in the evaluation of distributional RL and risk metrics on simulated and real robotic grasping. The experiments in the paper are well motivated, and the results are interesting and useful.

Clarity: The paper is clear and well written. The authors cover the relevant background work and explicitly state the modifications they make to form their algorithm.

Significance: The results the authors present are interesting. In simulation, the authors show that while their method does not lead to a significant asymptotic improvement (around 2%), it is more sample efficient (Figures 3 and 4). Table 3 is also a useful comparison of how different risk metrics affect final performance. The authors also evaluate their algorithm on a real-world grasping setup; Table 4 demonstrates a significant improvement over the QT-Opt baseline. Figure 6 is greatly appreciated: showing how the number of broken gripper fingers roughly corresponds to risk-averse, risk-neutral, and risk-seeking policies is interesting. The qualitative behaviors of the risk-averse policies in the supplementary video are also useful to visualize, as are the live plots of the Q-value distributions. Finally, the results in the batch reinforcement learning setting are interesting in light of recent work in this area. They suggest that continuous control domains and Atari are not equivalent when learning from batch datasets, and that diversity in the batch dataset is critical to achieve good performance.
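To make the combination the reviewer describes concrete (CEM action selection on top of a quantile critic), here is a hedged sketch under assumed hyperparameters; the `score_fn` interface stands in for the critic and is not the paper's actual implementation.

```python
import numpy as np

def cem_select_action(score_fn, action_dim, iterations=2, population=64, elites=6):
    """Cross-Entropy Method over a continuous action space.

    score_fn: maps a batch of actions (population, action_dim) to scalar scores,
              e.g. the mean (or a risk-distorted statistic) of the critic's
              predicted quantiles for each action.
    """
    mean = np.zeros(action_dim)
    std = np.ones(action_dim)
    for _ in range(iterations):
        actions = np.random.randn(population, action_dim) * std + mean
        scores = score_fn(actions)
        elite_idx = np.argsort(scores)[-elites:]    # keep the best-scoring actions
        mean = actions[elite_idx].mean(axis=0)      # refit the sampling Gaussian
        std = actions[elite_idx].std(axis=0) + 1e-6
    return mean

# Example: a stand-in scoring function in place of a real quantile critic.
score = lambda a: -np.linalg.norm(a - 0.3, axis=1)
print(cem_select_action(score, action_dim=4))
```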
Originality: The paper presents an original algorithm that extends [12] to distributional value estimation. This is a considerable step forward in knowledge and understanding.

Quality: The paper is very well written, with few exceptions. Proper experiments and comparisons were performed, and proper analysis is provided. However, no ablation study was performed, and this is particularly missed with respect to the quantile embedding: is it truly useful to pass the entire quantile vector tau jointly through the network, as opposed to each quantile tau_i separately? There is also confusion in the equation in Section IV.D, because i is used to index both quantiles and basis functions.

Clarity: The paper is very clear. Two issues: it would be valuable to have some details on the "multitude of control policies" used to generate the real-world grasping dataset (what were they?), and sorting Tables III and IV would make them easier to compare.
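For context, the quantile-embedding question above refers to the IQN-style cosine embedding, where a basis index j runs over cosine features while a separate index i runs over the sampled quantiles tau_i. The following is a minimal illustrative sketch; sizes and names are assumptions, not the paper's code.

```python
import numpy as np

def cosine_quantile_features(taus, n_basis=64):
    """Cosine features used in IQN-style quantile embeddings.

    taus: array of shape (N,), the sampled quantile fractions tau_i.
    Returns shape (N, n_basis): each tau_i is embedded independently via
    cos(pi * j * tau_i) for basis index j = 0..n_basis-1.  In IQN these
    features are then passed through a learned linear layer and a ReLU.
    Note that the quantile index i and the basis index j play distinct roles.
    """
    j = np.arange(n_basis)                               # basis indices
    return np.cos(np.pi * j[None, :] * taus[:, None])    # (N, n_basis)

print(cosine_quantile_features(np.array([0.1, 0.5, 0.9])).shape)  # (3, 64)
```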
This paper presents a study on distributional RL applied to grasping tasks. The system is built on QT-Opt with two incremental improvements: 1) replacing Q-learning with distributional Q-learning, and 2) maximizing a risk-sensitive score function instead of the expected return. The main contribution of this paper is the empirical study of distributional Q-learning in the context of the grasping task. This study empirically investigates the risk-sensitive score functions previously studied by Dabney et al. [2018]. The results show that the proposed methods with distributional Q-learning outperform QT-Opt. The experimental results also show that the use of a risk-sensitive cost function can improve safety during the training phase. In addition, the empirical study presented in Section V.E offers interesting insights on batch RL with offline training.

I summarize the strong and weak points of the paper:

Strong points:
- The experiments contain interesting insights on distributional Q-learning and batch RL.
- Performing an empirical study at this scale is challenging, and it is worth sharing the results with the research community.

Weak points:
- Some details of the experiments seem to be missing. Please refer to the following comments.

Suggestions for improvement:
- Although this study is focused on variants of Q-learning, the motivation for this design choice is not clear to me. What would be the difficulty in applying actor-critic methods to grasping tasks? Why should we use a variant of Q-learning, which requires running CEM to select actions? It would be beneficial for readers if this were discussed in the related work section.
- For reproducibility, it would be better to provide more information about the implementation and experiments.
-- In Q2R-Opt, N quantile midpoints of the value distribution are learned. What is the value of N in the experiments? How many midpoints are learned?
-- In CEM, how many iterations of sampling were performed, and how many samples were generated in each iteration? How much time is required to select an action with CEM?
-- When training the neural network, is any pre-training used?
-- Some more information on training the neural network would help: batch size, learning rate.
-- It is reported that 500,000 episodes are used. How many time steps does each episode contain? What is the total number of time steps?
-- On page 5, the statement "500,000 episodes, collected over many months using a multitude of control policies" is not precise. How many months did it take to collect the data?
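As background for the question about the N quantile midpoints, the sketch below shows how fixed midpoints tau_i = (2i - 1) / (2N) are typically defined in QR-DQN-style methods and how they enter a quantile Huber loss; N, kappa, and the function names are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def quantile_midpoints(n):
    """Fixed quantile midpoints tau_i = (2i - 1) / (2N), for i = 1..N."""
    return (2 * np.arange(1, n + 1) - 1) / (2.0 * n)

def quantile_huber_loss(pred_quantiles, target_samples, taus, kappa=1.0):
    """Quantile regression loss with a Huber penalty (QR-DQN style sketch).

    pred_quantiles: (N,) predicted quantile values theta_i.
    target_samples: (M,) samples (or target quantiles) of the TD target.
    taus: (N,) quantile midpoints associated with the predictions.
    """
    u = target_samples[None, :] - pred_quantiles[:, None]       # (N, M) TD errors
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    weight = np.abs(taus[:, None] - (u < 0).astype(float))      # asymmetric quantile weighting
    return (weight * huber / kappa).mean()

taus = quantile_midpoints(8)
print(quantile_huber_loss(np.zeros(8), np.array([0.5, -0.2, 1.0]), taus))
```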