*David Fan (Georgia Institute of Technology); Ali Agha (Jet Propulsion Laboratory); Evangelos Theodorou (Georgia Institute of Technology)*

Learning-based control aims to construct models of a system to use for planning or trajectory optimization, e.g. in model-based reinforcement learning. In order to obtain guarantees of safety in this context, uncertainty must be accurately quantified. This uncertainty may come from errors in learning (due to a lack of data, for example), or may be inherent to the system. Propagating uncertainty forward in learned dynamics models is a difficult problem. In this work we use deep learning to obtain expressive and flexible models of how distributions of trajectories behave, which we then use for nonlinear Model Predictive Control (MPC). We introduce a deep quantile regression framework for control that enforces probabilistic quantile bounds and quantifies epistemic uncertainty. Using our method we explore three different approaches for learning tubes that contain the possible trajectories of the system, and demonstrate how to use each of them in a Tube MPC scheme. We prove these schemes are recursively feasible and satisfy constraints with a desired margin of probability. We present experiments in simulation on a nonlinear quadrotor system, demonstrating the practical efficacy of these ideas.
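To make the quantile regression idea concrete, here is a minimal sketch of the standard quantile (pinball) loss, which is the usual objective for this kind of regression. It is a generic illustration and not the authors' implementation: the function names `pinball_loss` and `best_constant_quantile`, the Gaussian samples, and the search grid are all assumptions introduced for this example.

```python
import numpy as np

def pinball_loss(y, r, alpha):
    """Mean quantile (pinball) loss for targets y, predictions r, level alpha in (0, 1)."""
    diff = np.asarray(y, dtype=float) - np.asarray(r, dtype=float)
    # Under-prediction (diff > 0) costs alpha * diff,
    # over-prediction (diff < 0) costs (1 - alpha) * |diff|.
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1.0) * diff)))

def best_constant_quantile(y, alpha, grid):
    """Return the grid value that minimizes the mean pinball loss over samples y."""
    losses = [pinball_loss(y, r, alpha) for r in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic data (assumption): standard normal samples.
rng = np.random.default_rng(0)
samples = rng.normal(size=5000)
grid = np.linspace(-3.0, 3.0, 601)

# The constant minimizing the pinball loss should sit near the empirical 0.9-quantile.
estimate = best_constant_quantile(samples, 0.9, grid)
empirical = float(np.quantile(samples, 0.9))
```

In a learned model the constant would be replaced by a network output conditioned on the state and input, but the loss keeps the same form.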

Start Time | End Time
---|---
07/16 15:00 UTC | 07/16 17:00 UTC

The paper presents very interesting ideas that combine promising research directions. Nonetheless, the results presented here are not yet convincing, are partly hard to follow, are written in a rather convoluted way, and are overloaded with notation that changes constantly.

1) My main criticism is that there is no comparison to other learning-based (tube) MPC methods. In particular, what happens if the same amount of data is used for naive estimates of the tube? Is this significantly worse? How much more conservative are the tubes if worst-case error estimates of the dynamics are used and propagated through time? As is, it is hard to assess whether this approach improves on the current state of the art.

2) Further, the theoretical analysis hinges on some hidden details that should be elaborated. In particular, the alpha probabilities used are only asymptotically true. When finite data is used, it is not clear how the neural network will behave, and thus there are no finite-sample bounds, which makes the presented results void. Right before Theorem II.1 there is an assumption of i.i.d. data, which is crucial for most convergence results. For dynamical systems this is naturally not the case, and the data is highly correlated.

3) I would also recommend streamlining the structure and removing any unnecessary notation. For example, the paper starts with a standard dynamical system in (1); then the notation of the disturbance changes because a model is used, while the notation for the state stays the same in (2). In (5), there is completely new notation for the same object, which is afterward substituted in the proofs anyway. Finally, in (16) and (17), this changes again, in addition to having a different notation for each parametrization.

4) I do not understand why z is introduced in Sec. 2-A as a potentially lower-dimensional object. In my opinion this should create many theoretical problems; however, this assumption is dropped later anyway.

Minor comments:

- There is no reference to Fig. 1.
- Figures should be at the top or bottom of the page.
- I think something is off with the color scheme in Fig. 2. It is not clear what gray is; further, green and purple are interchanged from top to bottom although the same methods are presented for different systems.
- Eq. (9): why not use the divergence operator "div" instead of defining something new that looks like a gradient?
- Many of the "there are no structural knowledge assumptions" are hidden in Sec. 2-D, in my opinion. This should be double-checked against the claims in the introduction.
- The numerical example on page 4 is in an odd place and should be moved to the experimental section.
- The part after Eqs. (22) and (23) is very vaguely written and should be reworded.
- Spacing around equations and figures is very tight and non-standard, which makes reading the paper unpleasant.
- Language.
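One possible reading of the "naive estimates of the tube" raised in point 1 is a per-timestep empirical quantile of the tracking error, checked on held-out trajectories. The sketch below illustrates that reading only; it is an assumption and is not specified by the reviewer or the authors. The names `naive_tube` and `coverage` are hypothetical, and synthetic random-walk errors stand in for real closed-loop data.

```python
import numpy as np

def naive_tube(errors, alpha):
    """Per-timestep empirical alpha-quantile of |tracking error|.

    errors: array of shape (num_trajectories, horizon).
    Returns a tube half-width for each timestep, shape (horizon,).
    """
    return np.quantile(np.abs(errors), alpha, axis=0)

def coverage(errors, tube):
    """Fraction of (trajectory, timestep) pairs whose |error| lies inside the tube."""
    return float(np.mean(np.abs(errors) <= tube))

rng = np.random.default_rng(1)
horizon = 20

# Synthetic random-walk errors (assumption): the spread grows along the horizon.
train = np.cumsum(rng.normal(scale=0.1, size=(2000, horizon)), axis=1)
test = np.cumsum(rng.normal(scale=0.1, size=(2000, horizon)), axis=1)

tube = naive_tube(train, alpha=0.9)
held_out_coverage = coverage(test, tube)  # expected to be close to alpha
```

A baseline of this kind ignores the state and input dependence that a learned tube could capture, which is why it serves as a lower bar for comparison.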

Overall, I thought this paper was strong. The proposed setting is interesting, and learning quantiles for probabilistic tube MPC is a good approach. However, there are a few issues with the paper that, if addressed, would result in a substantially stronger paper.

First, z is left vague throughout the paper. Concrete examples of what this quantity is, given when it is introduced, would increase reader understanding. Learning of generic quantiles is discussed briefly but never expanded upon or included in experiments. The authors should investigate this case! The authors should also discuss how their monotonicity criterion may be extended to this general quantile model. This model has applications in risk allocation for chance-constrained programming, and thus is useful.

There are a few disconnects between the goals of the paper and the stated theoretical results.

- The authors prove probabilistic results for the case in which the quantile loss is minimized (in expectation). These results are almost certainly not satisfied in practice, and this is only addressed briefly in the experimental results for the triple integrator system. The investigation of this point should be expanded.
- Theorems 3.1 and 3.2 provide probabilistic bounds for one step (assuming Theorem 2.1 holds, which should be stated more clearly). However, this is never connected to chance constraint satisfaction, which could be used for probabilistic safety verification.

I understand that in this field (deep learning-based safe control), it is difficult (or currently impossible) to design a framework that exactly satisfies all the desired safety criteria. However, the authors should aim to expand their current theoretical results as much as possible, and be more forthcoming about the limitations of their approach.

The most substantial flaw of the paper is the experimental results. As mentioned above, the validation of the theoretical results, the investigation of the fundamental claims, and the ablations are only performed for the triple integrator system, and these diagnostic experiments should be expanded to more systems and greater depth. Moreover, the only nontrivial system investigated is a quadrotor with a stabilizing controller. It is unclear whether the dynamics of this system are substantially different from those of a linear system. This latter point highlights a second important shortcoming of the experimental section: there are no baseline models implemented. The authors should include baselines, even if they are simple ablated models. This would provide evidence that the proposed components are providing value.

Overall, this paper is interesting, but would be strengthened substantially by addressing the comments above.

Typos:

- In reference 29, Jonathan How's name is written as Howl. Upon looking on Google Scholar, it appears that it is incorrect there as well. Make sure to check/clean your references!
- On page 2, column 2, paragraph 1, it should be referred to as the "reinforcement learning community", not the "reinforcement community".
- The word "the" is repeated before "projection" in the second-to-last paragraph of page 2, and again before "loss" just below Eq. 7.
- It appears that the curve corresponding to the star markers in Figure 10 is not defined in the caption.

While the paper is in general well presented, some of the figures are not very clear or adequately explained. For example, in Fig. 1 the tube does not contain the tracking trajectory and this fact is not acknowledged. This also occurs in Figs. 8 and 11, which on the other hand provide an explanation for this behavior: the nominal tracking trajectory violates the constraints?

What other baselines have you compared the approach to? Others exist, such as [29], with similar-looking plots to Fig. 11. How does this approach compare? Fig. 11 suggests that the approach works, but how is it better than the state of the art in robust MPC?

The clarity of Fig. 2 would be improved if a different color were chosen for the histogram (blue is already used for the 3-sigma bounds using GP moment matching); on a minor note, the green line is dashed, not dotted. Finally, Fig. 10 should present blue circles instead of blue stars for consistency. Additionally, the meaning of the grey lines shown is not explained.

Furthermore, the paper assumes well-calibrated epistemic uncertainty without adequately delving into its consequences: the authors should explain how a poor choice of beta and W (which are chosen by hand) would impact their methods.

Regarding the results, having hardware experiments instead of just simulations would make the paper stronger, since the direct applicability of the methods presented is not guaranteed. In addition, it would be interesting to see the results of Alg. 1 and 3 for the quadrotor simulation and compare them with the results obtained with Alg. 2.

Are the tube dynamics learned using open-loop or closed-loop data? I assume it is closed-loop, but then aren't the learned tube dynamics also possibly a function of the controller used to get that data?

Minor points:

- What "function" are you referring to just prior to (3)?

Typos include:

- "recieved" on page 1
- "the the" on page 2 and page 3
- "it is propagated in forward in time" on page 4
- "R^{1}8" on page 8
- "more a more" on page 8
- numerous typos in the citations
- the second equation in (7) should be L^alpha(y,r)
- the safe set C in (13g) is only introduced on the next page, in Alg. 1
- the authors should review the rules on the use of "which" vs. "that"