### Learning Task-Driven Control Policies via Information Bottlenecks

Vincent Pacelli (Princeton University); Anirudha Majumdar (Princeton University)

### Live Paper Discussion Information

| Start Time | End Time |
| --- | --- |
| 07/16 15:00 UTC | 07/16 17:00 UTC |

### Paper Reviews

Review 1

Originality: The problem formulation is not novel [A]. The motivation that such methods are robust to changes in the environment has also been studied [B]. The method itself uses [4], but the policy gradient formulation is original, including differentiating through the MINE and stabilizing it with EMA. Quality: The paper is quite well written. Issues: - Despite the recent trend to call every trade off with information rate "information bottleneck", the latter refers to a specific trade off between two information quantities [44]. Eq. (3) uses instead the much earlier concept of rate–distortion [C], and particularly sequential rate–distortion [D], although the approximation that x_t and y_t are independent of phi loses the sequential nature. - The equation for pi (Section II) is confusing, because the LHS gives the impression that the policy has no memory. It is also inaccurate, because the RHS omits the dependence between \tilde{x}_{t-1} and y_t. - It is unclear what is gained by Theorem II.1. Is the paper claiming that the RHS of (6) is a good proxy for its LHS? But the LHS is not our objective, because of the very restrictive (5) (which is made increasingly restrictive by minimizing I[x_t, \tide{x}_t] ). - In what sense is (3) a "first-order approximation" of (6)? They coincide when the beta of (3) is 1 and that of (6) tends to 0, but can otherwise be very different. - How much is performance improved by having time-variant theta, phi, and psi? - Prior work solves the problem optimally in the linear–Gaussian case [E]. Since the domain in example IV.A is linear, the paper should compare the proposed method with the optimal solution. - In all experiments, instead of fixing beta to an arbitrary value (which one?), it would be useful to show a curve of the value as a function of beta. This will also reveal different phases of qualitatively different control behaviors. - The reported standard deviation is presumably over the variability of the domain. 
No error bounds on the mean estimation are given, making it hard to evaluate statistical significance. - Presumably the method encourages completely ignoring features that are completely task-irrelevant. However, in Table II, it performs extremely poorly on several backgrounds, which suggests that this is not the case. No explanation of this is provided. Clarity: The paper is very clear. [A] Information theory of decisions and actions, Tishby and Polani, Perception–Action Cycle, 2011 [B] Trading value and information in MDPs, Rubin et al., Decision Making with Imperfect Decision Makers, 2012 [C] Elements of information theory, Cover and Thomas, 2006 [D] Control of LQG systems under communication constraints, Tatikonda et al., CDC 1998 [E] Minimum-information LQG control part ii: Retentive controllers, Fox and Tishby, CDC 2016
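For readers unfamiliar with the MINE-with-EMA point raised above, the idea can be sketched as follows. This is a minimal numpy illustration of the Donsker–Varadhan bound and the EMA-stabilized gradient from Belghazi et al.'s MINE, not the paper's actual implementation; the quadratic feature critic, the toy Gaussian data, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, y):
    # Quadratic feature map: a critic linear in w over these features
    # can represent the bivariate-Gaussian log-density ratio exactly.
    return np.stack([x, y, x * y, x ** 2, y ** 2, np.ones_like(x)], axis=1)

def mine_step(w, ema, x, y, lr=0.02, ema_rate=0.01):
    """One ascent step on the Donsker-Varadhan bound
        I(X;Y) >= E_{p(x,y)}[T] - log E_{p(x)p(y)}[exp T],
    with the marginal partition function replaced by an exponential
    moving average in the gradient (MINE's bias correction)."""
    y_shuffled = rng.permutation(y)       # approximate samples from p(x)p(y)
    f_joint = features(x, y)
    f_marg = features(x, y_shuffled)
    t_joint, t_marg = f_joint @ w, f_marg @ w
    z = np.exp(t_marg).mean()             # noisy batch partition estimate
    ema = (1 - ema_rate) * ema + ema_rate * z
    # A naive gradient divides by the noisy batch z; dividing by the
    # EMA instead stabilizes the update and reduces its bias.
    grad = f_joint.mean(axis=0) - (np.exp(t_marg)[:, None] * f_marg).mean(axis=0) / ema
    return w + lr * grad, ema, t_joint.mean() - np.log(z)

# Demo on correlated Gaussians, where I(X;Y) = -0.5 * log(1 - rho^2).
n, rho = 4000, 0.8
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)
w, ema = np.zeros(6), 1.0
for _ in range(1500):
    w, ema, mi_est = mine_step(w, ema, x, y)
print(f"estimated MI: {mi_est:.3f}  (true: {-0.5 * np.log(1 - rho ** 2):.3f})")
```

The review's point is that this stabilized estimator, differentiated through as part of a policy gradient, is where the paper's originality lies, rather than in the information-constrained problem formulation itself.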

Review 2