*Kyungjae Lee (Seoul National University); Sungyub Kim (KAIST); Sungbin Lim (UNIST); Sungjoon Choi (Disney Research); Mineui Hong (Seoul National University); Jaein Kim (Seoul National University); Yong-Lae Park (Seoul National University); Songhwai Oh (Seoul National University)*

In this paper, we present a new class of Markov decision processes (MDPs), called Tsallis MDPs, with Tsallis entropy maximization, which generalizes existing maximum entropy reinforcement learning (RL). A Tsallis MDP provides a unified framework for the original RL problem and RL with various types of entropy, including the well-known standard Shannon-Gibbs (SG) entropy, using an additional real-valued parameter called the entropic index. By controlling the entropic index, we can generate various types of entropy, including the SG entropy, and a different entropy results in a different class of optimal policy in Tsallis MDPs. We also provide a full mathematical analysis of Tsallis MDPs. Our theoretical result enables us to use any positive entropic index in RL. To handle complex and large-scale problems, such as learning a controller for a soft mobile robot, we also propose a Tsallis actor-critic (TAC). We find that different types of RL problems call for different values of the entropic index, and we empirically show that TAC with a proper entropic index outperforms state-of-the-art actor-critic methods. Furthermore, to alleviate the effort of finding the proper entropic index, we propose a linear scheduling method in which the entropic index increases linearly with the number of interactions. In simulations, linear scheduling shows fast convergence and performance similar to TAC with the optimal entropic index, which is a useful property for real robot applications. We also apply TAC with linear scheduling to learn a feedback controller for a soft mobile robot, where it shows the best performance among existing actor-critic methods in terms of convergence speed and the sum of rewards. Consequently, we empirically show that the proposed method efficiently learns a controller for soft mobile robots.
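As a rough illustration of the entropic index and the linear scheduling described in the abstract, the sketch below uses the standard textbook definition of Tsallis entropy for a discrete distribution, S_q(p) = -Σ p ln_q(p) with ln_q(x) = (x^(q-1) - 1)/(q-1), which reduces to the Shannon-Gibbs entropy as q approaches 1. The function names, the discrete-action setting, and the schedule endpoints (`q_start=1.0`, `q_end=2.0`) are assumptions made for illustration; the paper's exact convention and schedule may differ.

```python
import numpy as np


def q_log(x, q):
    """q-logarithm: (x**(q-1) - 1) / (q-1), with the natural log as the q -> 1 limit."""
    x = np.asarray(x, dtype=float)
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (q - 1.0) - 1.0) / (q - 1.0)


def tsallis_entropy(probs, q):
    """Tsallis entropy of a discrete distribution: -sum_a p(a) * ln_q(p(a))."""
    probs = np.asarray(probs, dtype=float)
    support = probs[probs > 0]  # zero-probability actions contribute nothing
    return float(-np.sum(support * q_log(support, q)))


def linear_entropic_index(step, total_steps, q_start=1.0, q_end=2.0):
    """Hypothetical linear schedule: interpolate the entropic index over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return q_start + frac * (q_end - q_start)


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]
    for step in (0, 50, 100):
        q = linear_entropic_index(step, total_steps=100)
        print(f"step={step:3d}  q={q:.2f}  entropy={tsallis_entropy(uniform, q):.4f}")
```

For a fixed distribution, the entropy value shrinks as q grows in this sketch, which matches the intuition of weaker regularization at larger entropic indices.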

Start Time | End Time |
---|---|
07/15 15:00 UTC | 07/15 17:00 UTC |

Entropy-based methods are very popular in RL due to improved exploration, stability and performance. As a result, improvements to them would have great impact. Changing the form of the entropy term seems like one promising route to improvement. The paper is predominantly a theory paper. The main contribution (from my perspective) is formalizing the Tsallis MDP and proving convergence of the methods. This is a solid contribution. The theory would not be particularly helpful if it couldn't be used in the algorithms and if those algorithms didn't improve performance. Thankfully, Tsallis entropy can be naturally incorporated into methods such as SAC. I see this simplicity of extension as a benefit, as it could widen usage. Similarly, the experiments show improved performance of SAC and other actor-critic methods in almost all domains. They also show robustness to different alpha values, and the scheduled entropy method also performs well. The real robot experiments (and part of the story of the paper) are focused on soft robotics. This is a fine domain, but it is motivated by the need for better exploration due to the properties of soft robots. Sure, but other exploration methods could be used here instead (e.g., Bayesian, curiosity). Therefore, it isn't really clear what the robot experiments add over the simulation results. The paper is generally well written, but there are a number of typos that should be corrected.

1. I find the paper well-written and well-developed. I am excited about the annealing of the Tsallis entropy parameter during training to reduce the entropic regularization in a controlled manner. In this context, I also find the performance bound in Theorem 7 useful. There are potential connections between this idea and proximal algorithms: https://link.springer.com/article/10.1007/s40687-018-0148-y.
2. This paper seems like a direct application of the Tsallis entropy to the existing theory of regularized MDPs (reference 21 in the paper) and RL algorithms. The novelty is therefore marginal.
3. The experiments in Fig. 4 have a very large variance in some cases; how is one to understand their importance? It is also surprising that the TD3 algorithm gets zero returns for Humanoid-v2. It has recently been recognized that entropic regularization may not be effective in these benchmarks, e.g., https://spinningup.openai.com/en/latest/spinningup/bench.html. Can you discuss how the Tsallis entropy-based regularization may be better in practice?

Summary: Many reinforcement learning methods use some kind of entropy regularisation. This usually employs a Shannon-Gibbs entropy term, although the sparse Tsallis entropy has also been used. This work generalises both and employs Tsallis entropy, a family of functionals parametrised by q, of which Shannon-Gibbs (q=1) and sparse Tsallis (q=2) are special cases. The paper finds that by properly tuning the additional parameter, or by defining a curriculum over it that slowly goes from q=1 to q=2, the method can often outperform various variants that employ Shannon-Gibbs entropy. This is demonstrated both on simulated MuJoCo environments and on a hard-to-control real-robot system.

Technical Quality: The paper proposes a technically solid algorithm and shows its qualities both in theoretical proofs and in empirical demonstrations. These seem well executed. The proofs, however, are 11+ pages of dense content separate from the main material of the paper, and as such I cannot review them in detail. Perhaps this indicates that the paper would be more suitable for another venue where the proofs could take the spotlight rather than being relegated to supplementary material.

Some relatively minor remarks on technical quality:
- It doesn’t become very clear why Tsallis entropy works better than SG entropy. The paper discusses stronger / less strong regularization (more or less stochasticity), but if this were the whole story one would imagine that having a curriculum for the “alpha” coefficient or for the minimum entropy (in something like SAC-AEA) should get similar results. It remains an open question what, above and beyond making the regularisation less strong, causes the difference.
- The optimal solution for entropy-regularised learning is attributed to a number of papers from the last couple of years, but this skips the older work from Peters et al., which focuses on the relative entropy (e.g. Peters et al., Relative Entropy Policy Search, AAAI 2010).

Novelty, Significance, Relevance: My main concern about the paper is whether this is really a robotics paper. The topic of efficient reinforcement learning is relevant to the robotics community, and the method is tested on a real robot (which is actually an interesting system, see below). However, the main contributions of the paper seem to be the theoretical proofs on machine learning and the simulation studies. Whenever we add a hyperparameter, we expect that it can be tuned such that it improves performance. What makes this paper stronger is that the curriculum seems to be valid across multiple environments, potentially avoiding an extra tuning step. I would thus consider the results to be somewhat significant. I haven’t seen Tsallis entropy used in reinforcement learning before, so I would consider the method quite novel. The robot task used is also quite interesting, and as the task would be challenging for traditional control, it is a good motivation to use a learning approach.

Clarity: The structure of the manuscript is mostly good. There are a couple of minor grammar errors in the manuscript (see below for some examples) and a couple of sentences that seem quite cryptic (again, some examples are provided below). The list below is not meant to be exhaustive, and a good proofreading pass should be performed. The description of the robot platform in VI.B is brief to the point of being hard to understand. Wouldn’t it be better to refer to a more complete description elsewhere (appendix or another paper)?

Minor issues:
- “the trial and error” -> trial and error
- “of policy” -> of the policy
- “whose element is a probability” -> there seems to be a noun missing?
- “an MDP with the maximum Tsallis entropy” -> I’m guessing the performance of a *policy* that maximises (3) is meant? That is, not just Tsallis entropy but the sum of this entropy with the reward?
- I wasn’t sure what is meant by “Since updating J_phi requires to compute a stochastic gradient, we use a reparametrization trick […] instead of a score function estimation”. Using a score function estimator also results in a stochastic gradient. There are many reasons why you might prefer a reparametrization gradient, but needing a stochastic gradient doesn’t seem to be one of them.
- In Section VI.C, theta_t isn’t defined where it is first used.
- The vertical axes of the plots are slightly different, making it a bit harder than necessary to compare lines between e.g. 3a, 3b, 3c, 3d.
- Some of the equations where the reparametrization trick is used do not seem quite right. For example, in VII.A, consider the equation directly following “the gradient of the Tsallis entropy becomes”. This would be completely correct if the expectation were taken with respect to \epsilon and a were replaced everywhere with f(a; \epsilon), with f indicating the reparametrization. The current notation hides the dependence of a on \epsilon in the gradient term on the right.
- The words “proportional” and “inverse proportional” seem to be used a bit loosely in VII.A (basically stating that \pi^2 is proportional to \pi?).
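
The notational point about the reparametrization trick raised in the minor issues above can be written out generically. This is a sketch under assumed notation, not the paper's own equation: \pi_\phi is the policy, a = f_\phi(\epsilon; s) is a hypothetical differentiable sampler with noise \epsilon drawn from a fixed distribution p(\epsilon), and h_\phi stands in for whatever integrand appears inside the expectation.

```latex
\nabla_\phi \, \mathbb{E}_{a \sim \pi_\phi(\cdot \mid s)}\!\left[ h_\phi(a) \right]
  = \mathbb{E}_{\epsilon \sim p(\epsilon)}\!\left[ \nabla_\phi \, h_\phi\!\big( f_\phi(\epsilon; s) \big) \right]
```

Written this way, the expectation no longer depends on \phi, and the dependence of the action on both \phi and \epsilon is explicit inside the gradient, which is the dependence the reviewer notes is hidden in the current notation.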