*Vincent Pacelli (Princeton University); Anirudha Majumdar (Princeton)*

This paper presents a reinforcement learning approach to synthesizing task-driven control policies for robotic systems equipped with rich sensory modalities (e.g., vision or depth). Standard reinforcement learning algorithms typically produce policies that tightly couple control actions to the entirety of the system's state and rich sensor observations. As a consequence, the resulting policies can often be sensitive to changes in task-irrelevant portions of the state or observations (e.g., changing background colors). In contrast, the approach we present here learns to create a task-driven representation that is used to compute control actions. Formally, this is achieved by deriving a policy gradient-style algorithm that creates an information bottleneck between the states and the task-driven representation; this constrains actions to only depend on task-relevant information. We demonstrate our approach in a thorough set of simulation results on multiple examples including a grasping task that utilizes depth images and a ball-catching task that utilizes RGB images. Comparisons with a standard policy gradient approach demonstrate that the task-driven policies produced by our algorithm are often significantly more robust to sensor noise and task-irrelevant changes in the environment.

Start Time | End Time
---|---
07/16 15:00 UTC | 07/16 17:00 UTC

Originality: The problem formulation is not novel [A]. The motivation that such methods are robust to changes in the environment has also been studied [B]. The method itself uses [4], but the policy gradient formulation is original, including differentiating through the MINE and stabilizing it with an EMA.

Quality: The paper is quite well written.

Issues:

- Despite the recent trend of calling every trade-off involving an information rate an "information bottleneck", the latter refers to a specific trade-off between two information quantities [44]. Eq. (3) instead uses the much earlier concept of rate–distortion [C], and in particular sequential rate–distortion [D], although the approximation that x_t and y_t are independent of phi loses the sequential nature.
- The equation for pi (Section II) is confusing, because the LHS gives the impression that the policy has no memory. It is also inaccurate, because the RHS omits the dependence between \tilde{x}_{t-1} and y_t.
- It is unclear what is gained by Theorem II.1. Is the paper claiming that the RHS of (6) is a good proxy for its LHS? But the LHS is not our objective, because of the very restrictive (5) (which is made increasingly restrictive by minimizing I[x_t; \tilde{x}_t]).
- In what sense is (3) a "first-order approximation" of (6)? They coincide when the beta of (3) is 1 and that of (6) tends to 0, but can otherwise be very different.
- How much is performance improved by having time-variant theta, phi, and psi?
- Prior work solves the problem optimally in the linear–Gaussian case [E]. Since the domain in example IV.A is linear, the paper should compare the proposed method with the optimal solution.
- In all experiments, instead of fixing beta to an arbitrary value (which one?), it would be useful to show a curve of the value as a function of beta. This would also reveal phases of qualitatively different control behaviors.
- The reported standard deviation is presumably over the variability of the domain. No error bounds on the mean estimate are given, making it hard to evaluate statistical significance.
- Presumably the method encourages completely ignoring features that are entirely task-irrelevant. However, in Table II it performs extremely poorly on several backgrounds, which suggests that this is not the case. No explanation of this is provided.

Clarity: The paper is very clear.

[A] N. Tishby and D. Polani, "Information theory of decisions and actions," Perception–Action Cycle, 2011.
[B] J. Rubin et al., "Trading value and information in MDPs," Decision Making with Imperfect Decision Makers, 2012.
[C] T. Cover and J. Thomas, Elements of Information Theory, 2006.
[D] S. Tatikonda et al., "Control of LQG systems under communication constraints," CDC 1998.
[E] R. Fox and N. Tishby, "Minimum-information LQG control part II: Retentive controllers," CDC 2016.
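For context on the MINE-with-EMA stabilization mentioned above: below is a minimal PyTorch-style sketch of the Donsker–Varadhan lower bound with the exponential-moving-average correction for the gradient of the log-partition term. The network architecture, dimensions, and function names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """T_psi(x, xb): scalar critic for the Donsker-Varadhan bound (illustrative)."""
    def __init__(self, dim_x, dim_xb, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_xb, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, xb):
        return self.net(torch.cat([x, xb], dim=-1)).squeeze(-1)

def mine_estimate(T, x, xb, ema_state, alpha=0.01):
    """MINE bound I(x; xb) >= E[T(x, xb)] - log E[exp(T(x, xb'))],
    with an EMA-stabilized gradient for the log-partition term."""
    t_joint = T(x, xb).mean()
    xb_perm = xb[torch.randperm(xb.shape[0])]   # samples from the product of marginals
    mean_exp = T(x, xb_perm).exp().mean()
    # EMA of the partition estimate reduces the bias of the log's gradient
    ema_state = (1 - alpha) * ema_state + alpha * mean_exp.detach()
    # Forward value equals log(mean_exp); backward gradient is grad(mean_exp) / ema_state.
    log_term = mean_exp / ema_state - (mean_exp / ema_state).detach() \
               + mean_exp.detach().log()
    return t_joint - log_term, ema_state
```

Here ema_state is a scalar carried across iterations (initialized to, e.g., 1.0). Maximizing the returned bound with respect to T's parameters tightens the estimate, while minimizing it through xb (i.e., through the filter parameters) is what drives I(x_t; xb_t) down.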

The authors address the important problem of learning policies that generalize to novel environments and conditions. To do so, they propose a reinforcement learning algorithm (TDPG) that learns a policy operating on a state representation that is simultaneously useful for the control task at hand while containing as little information as possible from the sensory input. Specifically, the learned policy consists of two parts: a conditional distribution q(xb_t | xb_{t-1}, y_t) that is used to perform Bayesian filtering over the latent state representation xb as observations y arrive, and the distribution \pi(u_t | xb_t), which defines a distribution over actions given a latent state. Reasoning about the latent state explicitly allows directly minimizing the mutual information between the true state x and the task representation xb. In this way, the resulting policy is robust to changes in task-irrelevant aspects of the input, such as the background color or texture. Furthermore, the learned policies try to accomplish the task while relying minimally on the sensory input. The key technical tool used to accomplish this is to create an information bottleneck by explicitly minimizing the mutual information between the underlying state and the learned state representation while optimizing the policy.

The paper is well written and clear. The proposed approach is very well motivated, and the proposed algorithm seems like a good approach to solving the problem statement. I found the related work section to be quite thorough yet concise and to the point, providing the right level of background necessary to understand the authors' contributions. Below are my main concerns with this work:

Connection to entropic risk:
- I found the presentation of objective (3) from the perspective of mutual information to be clearer than the viewpoint as a first-order approximation of the bound on the expected cost under a different distribution. The notation in (6) is somewhat unclear, but I assume that the expectation is taken under some distribution pc(\tau) = \prod_{t=0}^T pc_t(x_t, xb_t, u_t), where each pc_t satisfies equation (5). My main confusion stems from how I am supposed to interpret (5). The general idea of minimizing the cost under a worst-case choice from a set of possible distributions makes sense for robustness, but the set of distributions defined by (5) is not intuitive. The authors should spend more time explaining why optimizing (6) is intuitively the right thing to do. Right now, that motivation is unclear, and since the true objective is only an approximation to (6), Theorem II.1 and the connection to entropic risk add little to the motivation of the paper.

Scalability:
- The approach requires training a neural network to compute a KL divergence for each time step, all in order to perform a single gradient step on the policy network's parameters \phi and \psi. The algorithm is only tested on problems with limited state-space size (max dimension 5) and limited time horizon (max 25, and only 1 on the grasping problem). Furthermore, training seems to require significant tuning of hyperparameters such as the EMA coefficient \alpha and the number of training epochs of MINE per outer gradient step. These factors lead me to question how easily the suggested algorithm will generalize to more complex control tasks and higher-dimensional systems.
- Furthermore, this requires training with access to the underlying state, which may be hard to obtain in certain domains.

Can this same approach be extended to operate on image observations directly, by minimizing the mutual information between the image observations and the learned representation? It seems to me that this would directly handle the issue of preventing distractors such as background textures from influencing the learned policy. What is the main bottleneck preventing this? Scalability of MINE? The paper would be strengthened by a discussion of this point.
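As a concrete reading of the two-part policy summarized above, here is a minimal PyTorch-style sketch: q produces a stochastic latent update from the previous latent and the current observation, and pi maps the latent to an action distribution. All names, dimensions, and architectures are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TaskDrivenPolicy(nn.Module):
    """Illustrative two-part policy: q(xb_t | xb_{t-1}, y_t) and pi(u_t | xb_t)."""
    def __init__(self, obs_dim, latent_dim, act_dim, hidden=64):
        super().__init__()
        # phi: filter network producing mean/log-std of q(xb_t | xb_{t-1}, y_t)
        self.filter_net = nn.Sequential(
            nn.Linear(latent_dim + obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * latent_dim))
        # psi: policy head producing mean/log-std of pi(u_t | xb_t)
        self.policy_net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * act_dim))

    def step(self, xb_prev, y_t):
        m, log_s = self.filter_net(torch.cat([xb_prev, y_t], dim=-1)).chunk(2, dim=-1)
        xb_t = torch.distributions.Normal(m, log_s.exp()).rsample()  # stochastic latent update
        am, a_log_s = self.policy_net(xb_t).chunk(2, dim=-1)
        u_dist = torch.distributions.Normal(am, a_log_s.exp())
        u_t = u_dist.rsample()
        return xb_t, u_t, u_dist.log_prob(u_t).sum(dim=-1)
```

A rollout carries xb_t forward in time, so the action at time t depends on the observation history only through the latent state; this is what lets a penalty on I(x_t; xb_t) limit how much state information the actions can exploit.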

As mentioned above, I believe that the topic of the paper is very important, and I find the proposed method very interesting. The paper is well written and mostly clear. There are, however, a few issues and open questions in my opinion:

a) For the proposed method, access to the underlying state is needed at training time. However, if we have access to the state, why would one use the observation (e.g., an image) instead of the state directly? It would be good if the paper could give some realistic examples. In fact, it is not clear to me why, for the arguments in this paper, there needs to be a separation between state and observations. Couldn't it just consider fully observable settings where part of the state is irrelevant to the task?

b) Somewhat related, it is not clear to me how having less information about the latent state helps with being robust to changes in the observation (which do not affect the state). E.g., in the ball-catching experiment, the policy is shown to be robust to a changing background. However, the background is not part of the state, so there is no reason the proposed information bottleneck should have encouraged invariance to the background.

c) Related to this, why would one not place the information bottleneck between the policy and the observation, rather than the state? Wouldn't this encourage getting rid of irrelevant information? Wouldn't this have the additional benefit of not requiring access to the hidden state?

d) All the experiments are very small and constructed. It would be very nice to have one more experiment that is a bit more standard or realistic (and where it is well justified to have access to the state at training time but not at test time; see a)).

e) Finally, there are a few minor issues concerning clarity:

e1) The term "task-driven" is used throughout the paper to describe the proposed method. In my opinion this term is not clear and even misleading. I think it would be better if the paper used a term that describes more clearly the main idea of using only relevant information; otherwise the term should at least be defined at the very beginning of the paper.

e2) Theorem II.1 comes somewhat out of the blue and needs more explanation. How exactly does this theorem relate to the paper? It needs to be clear which term in the theorem corresponds to which term in the paper, and why the statement of the theorem is relevant. From what I understand, \check{p} can be any distribution, so it might correspond to an entirely different policy. How does this theorem then tell us something about the robustness of the policy? Also, from (5) it seems that with higher mutual information (RHS), we can tolerate a larger shift in the distribution (LHS). Finally, in the end the paper does not use the objective function suggested by the theorem. The theorem therefore currently seems somewhat decorative to me, so the paper should either establish a clear relationship or remove the theorem.

e3) After equation (15), the paper states that some assumptions are made: x_t, y_t are independent of phi, and x_t, y_t, \tilde{x}_t are independent of psi. It would be good to go into more detail about this. Why is this necessary; couldn't we use a similar trick as in (12), (13) to get the derivative? What error do these assumptions introduce? Why can we expect the error to be small?

e4) Could the paper make computing the gradients of the mutual-information term more explicit? I.e., after (15), the paper should go into more detail.

e5) Experiments: Am I understanding correctly that for the baseline everything is identical, except that the mutual-information term in the cost is missing?
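For concreteness on e5): if the training objective has the rate–distortion form suggested by the abstract and the other reviews, i.e. E[\sum_t c_t] + \beta \sum_t I(x_t; \tilde{x}_t), then the comparison in question is between the full objective and its \beta = 0 ablation. A schematic sketch under that assumption (the function and variable names are hypothetical):

```python
def tdpg_objective(expected_cost, mi_estimates, beta):
    """Assumed form of the task-driven objective:
        E[sum_t c_t] + beta * sum_t I(x_t; xb_t),
    where mi_estimates holds per-timestep MINE estimates of I(x_t; xb_t).
    Setting beta = 0 removes the mutual-information penalty, which should
    recover the standard policy-gradient baseline used in the comparisons."""
    return expected_cost + beta * sum(mi_estimates)
```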