Grounding Language to Non-Markovian Tasks with No Supervision of Task Specifications


Roma Patel (Brown University); Ellie Pavlick (Brown University); Stefanie Tellex (Brown University)

Abstract

Natural language instructions often exhibit sequential constraints rather than being simply goal-oriented, for example "go around the lake and then travel north until the intersection". Existing approaches map these kinds of natural language expressions to Linear Temporal Logic (LTL) expressions, but require an expensive dataset of LTL expressions paired with English sentences. We introduce an approach that learns to map from English to LTL expressions given only pairs of English sentences and trajectories, enabling a robot to understand commands with sequential constraints. We use formal methods of LTL progression to reward the produced logical forms, progressing each LTL logical form against the ground-truth trajectory, represented as a sequence of states, so that no LTL expressions are needed during training. We evaluate in two ways: on the SAIL dataset, a benchmark artificial environment of 3,266 trajectories and language commands, and on 10 newly collected real-world environments of roughly the same size. We show that our model correctly interprets natural language commands with 76.9% accuracy on average. We demonstrate the end-to-end process in real time in simulation: starting with only a natural language instruction and an initial robot state, producing a logical form from the model trained with trajectories, and finding a trajectory that satisfies the sequential constraints with an LTL planner in the environment.
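To make the progression-based reward concrete, here is a minimal sketch of LTL progression over a finite trajectory. This is not the authors' code: the nested-tuple formula encoding, the restriction to a small fragment (propositions, and/or, next, eventually, always), and all names are illustrative assumptions.

```python
# A minimal sketch (not the authors' implementation) of LTL progression
# over a finite trajectory, the idea used to reward predicted logical forms.

TRUE, FALSE = ("true",), ("false",)

def _and(a, b):
    if FALSE in (a, b): return FALSE
    if a == TRUE: return b
    if b == TRUE: return a
    return ("and", a, b)

def _or(a, b):
    if TRUE in (a, b): return TRUE
    if a == FALSE: return b
    if b == FALSE: return a
    return ("or", a, b)

def prog(state, phi):
    """Progress formula phi through one state (a set of true propositions)."""
    op = phi[0]
    if op in ("true", "false"):
        return phi
    if op == "prop":                # atomic proposition, e.g. ("prop", "lake")
        return TRUE if phi[1] in state else FALSE
    if op == "and":
        return _and(prog(state, phi[1]), prog(state, phi[2]))
    if op == "or":
        return _or(prog(state, phi[1]), prog(state, phi[2]))
    if op == "next":                # X phi: the obligation moves to the next state
        return phi[1]
    if op == "eventually":          # F phi  ->  prog(phi) or F phi
        return _or(prog(state, phi[1]), phi)
    if op == "always":              # G phi  ->  prog(phi) and G phi
        return _and(prog(state, phi[1]), phi)
    raise ValueError(f"unknown operator: {op}")

def satisfies(trajectory, phi):
    """Progress phi through every state of the trajectory. Obligations left
    unresolved at the end count as failure (a simplification of
    finite-trace semantics)."""
    for state in trajectory:
        phi = prog(state, phi)
        if phi == TRUE:  return True
        if phi == FALSE: return False
    return phi == TRUE

# "Go around the lake and then travel north until the intersection" might map
# to something like F(lake and F intersection); the reward asks whether the
# ground-truth trajectory, as proposition sets, satisfies the predicted formula.
phi = ("eventually", ("and", ("prop", "lake"),
                             ("eventually", ("prop", "intersection"))))
traj = [set(), {"lake"}, {"north"}, {"intersection"}]
print(satisfies(traj, phi))  # True
```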

Live Paper Discussion Information

Start Time: 07/14 15:00 UTC
End Time: 07/14 17:00 UTC


Paper Reviews

Review 1

This paper presents a solution to learning to understand natural language instructions that describe navigation steps; the problem is thus to map natural language expressions onto trajectories in space. It takes the approach of using a logical form encoded in LTL as an intermediate representation of the task. Given an LTL expression, it is relatively straightforward to generate a trajectory. However, it is nontrivial for humans to generate LTL expressions for training purposes. The method is therefore to provide (language, trajectory) pairs for training and to automatically extract plausible LTL expressions from the training examples. The primary contribution of the paper is thus to learn the intermediate logical form in LTL without ever seeing training examples that contain LTL.

The authors observe that the introduction of an intermediate representation aids learning, makes explicit any temporal ordering encoded in the natural language expression, and permits the use of formal methods to follow the navigation instructions. They cite several works that use intermediate logical forms, especially Artzi and Zettlemoyer. However, this paper's use of logical forms is novel because it is the first to use a temporal logic as an intermediate representation for semantic parsing. The authors raise good points about the value of an intermediate representation, but they somewhat neglect recent work that has moved away from such representations toward the sensorimotor learning paradigm of mapping pixels to controls; Artzi himself has taken a hard swing in this direction ([1] and other recent works). I take no position on this debate here, but I think it important that works in this area fully acknowledge both sides of the debate so that readers can fully appreciate the contribution.

The paper shows that a model can be trained to predict LTL expressions that can in turn reproduce the original trajectory, or one very similar, with high probability. This is the largest strength of the paper. The evaluation, by comparison, does not hold up. This problem is admittedly difficult to benchmark effectively. The authors acknowledge that the bulk of instructions in the SAIL corpus do not have temporal dependencies, so the paper's methodology is not exploited; although the SAIL corpus has long been used to benchmark semantic parsing tasks, those results are not particularly interesting here. The paper introduces a second benchmark comprising a set of OSM maps of the areas surrounding a number of American university campuses. These street maps can easily be used to generate sequential instructions that highlight the strength of the method. Unfortunately, though, the baseline is somewhat of a straw man: it predicts the goal only and then uses a shortest-path planner to generate a path to the goal. The authors include non-shortest paths in the corpus, so it is a fait accompli that their method will perform better than the baseline. A baseline and corpus better suited to the authors' purpose is coincidentally also found in Artzi's more recent work: the LANI corpus [2] gives sequential navigation instructions and trajectories provided by workers on Amazon Mechanical Turk.

The paper presents an interesting concept and methodological contribution. The evaluation is somewhat convincing, but it would benefit from using a newer standard corpus for which the results of competing methods are available. I think the paper would also significantly benefit from a more open discussion of limitations.

For example, the use of LTL requires a continuous space to be discretized into states. This is appropriate in the two environments explored in the evaluation, but it is not an inherent property of the real world. To take the example of the MIT campus shown in Figure 1, people typically say it is located at Kendall Square, although parts of it would more appropriately be described as being in Central Square. Cambridge's squares/neighborhoods are inherently ambiguous in a manner that makes it hard to ascribe states to locations at that level. The LANI dataset is similar in that you can be "at the pumpkin" or "at the lighthouse" or in some vaguely defined in-between position. Rather than taking this limitation as evidence of a flaw in the method, I would be inclined to think of it as a way to better focus the method on where it is best suited. After all, even on the OSM dataset, a robot actually navigating these streets is going to experience the world through a very different set of landmarks (e.g., the tall building, the curve in the road, the new-age sculpture) than a user who is viewing an overhead map. Perhaps the method would most effectively be employed as part of a larger system; if so, then there is an opportunity to sketch out what components this contribution complements.

[1] V. Blukis, Y. Terme, E. Niklasson, R. A. Knepper, and Y. Artzi. "Learning to Map Natural Language Instructions to Physical Quadcopter Control Using Simulated Flight." In Proceedings of the Conference on Robot Learning (CoRL), Osaka, Japan, October 2019.
[2] D. Misra, A. Bennett, V. Blukis, E. Niklasson, M. Shatkin, and Y. Artzi. "Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction." In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.

Review 2

This paper has a bunch of really nice ideas. I appreciate the weakly supervised mapping of natural language (NL) to LTL, since annotating utterances with LTL is not an easy or quick process. I also really appreciate the new complex dataset, which will be a unique resource for the community. That being said, this paper, in its current form, has a few issues that need to be addressed.

Evaluation metrics: I found the evaluation metrics to be weird. The paper is focused on learning LTL formulas that describe trajectories. An LTL formula, unless it is really detailed and contains a lot of safety constraints (which I do not think is the case here), captures a family of trajectories, not just one. The evaluation done in this paper compares either the end state (when comparing to other methods) or the generated path to the ground-truth path. Neither of these evaluations makes sense to me, since they do not evaluate how well the LTL formula captures the original trajectory. To me, a more meaningful evaluation would be to learn the model and then check, on the test data, whether the ground-truth trajectory satisfies the LTL formula that the model predicted. Granted, this does not compare to other techniques, but the other techniques solve a different problem, so I am not sure what insight I am supposed to gain from the current comparison.

Planning: Why is the planning done over an MDP? The map has no probabilities and the LTL has no probabilities, so why not do the usual LTL planning by creating the cross product between the Büchi automaton and the environment graph (see my comments about related work and relevant citations below)? A minimal sketch of this product construction appears after the citation list at the end of this review.

Related work: The paper is missing a lot of relevant work on mapping NL to LTL, on mapping trajectories to temporal logics (not from language, but relevant nonetheless), and on synthesis (planning) with LTL; see the list of citations at the end. The ideas in this paper are novel, but they are not well situated in the relevant work outside the deep-learning flavor.

Other comments: It would be really great to add the data collection process to the video. What is the interface that the Turkers saw? Example trajectories and associated language would be very cool to see. Why use a Voronoi decomposition rather than the road network? That way the work would be relevant to a ground robot and not just an aerial robot.

Progression vs. semantics: The semantics of LTL are not really defined in the paper; instead, the notion of "progression" is defined. It would be good to explain the difference between them, especially since some of the lines in Table 1 are equivalent to the semantics and some are not. Furthermore, shouldn't prog(sigma_i, next phi) be prog(sigma_{i+1}, phi)? And the same for the eventually operator. Writing just phi refers to the next position of the sequence, while the definition of prog(sigma_i, p) refers to the current position, so the definition seems inconsistent. (The standard progression rules are sketched below for reference.) Figure 4 shows an FSA; where does it come from? For the progression of the formula, how do we know 'p' became true? Should the graph on the left be annotated with 'p' and 'q'? Figure 4 is also not referred to in the text; it would be useful to add a description for the figure.

Minor comments: The paper uses the term "trajectories", but really the mapping is to sequences of propositions; it is worth clarifying, since "trajectories" implies continuous (x, y) locations. There are three broken citations (search for '[]').
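For reference, a standard formulation of the progression rules (in the style of Bacchus and Kabanza; presumably what Table 1 intends, though this has not been verified against the paper) over a state sigma_i is as follows. Note that the returned formula is itself progressed against sigma_{i+1} on the next step, which is why the rule for X returns phi unchanged.

```latex
% Standard LTL progression over a state \sigma_i (a set of true propositions).
\begin{align*}
\mathrm{prog}(\sigma_i, p) &= \top \ \text{if } p \in \sigma_i,\ \bot \ \text{otherwise}\\
\mathrm{prog}(\sigma_i, \neg\varphi) &= \neg\,\mathrm{prog}(\sigma_i, \varphi)\\
\mathrm{prog}(\sigma_i, \varphi_1 \wedge \varphi_2) &= \mathrm{prog}(\sigma_i, \varphi_1) \wedge \mathrm{prog}(\sigma_i, \varphi_2)\\
\mathrm{prog}(\sigma_i, \mathsf{X}\,\varphi) &= \varphi\\
\mathrm{prog}(\sigma_i, \varphi_1\,\mathsf{U}\,\varphi_2) &= \mathrm{prog}(\sigma_i, \varphi_2) \vee \bigl(\mathrm{prog}(\sigma_i, \varphi_1) \wedge \varphi_1\,\mathsf{U}\,\varphi_2\bigr)\\
\mathrm{prog}(\sigma_i, \mathsf{F}\,\varphi) &= \mathrm{prog}(\sigma_i, \varphi) \vee \mathsf{F}\,\varphi\\
\mathrm{prog}(\sigma_i, \mathsf{G}\,\varphi) &= \mathrm{prog}(\sigma_i, \varphi) \wedge \mathsf{G}\,\varphi
\end{align*}
```

Under this formulation, prog(sigma_i, X phi) = phi is internally consistent with prog(sigma_i, p): the obligation phi does land on the next state, because the caller then progresses the returned formula against sigma_{i+1}. Whether Table 1 matches these rules is a separate question for the authors to address.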
Some relevant citations:

NL to LTL for robot control:
V. Raman, C. Lignos, C. Finucane, M. Marcus, and H. Kress-Gazit. "Sorry Dave, I'm Afraid I Can't Do That: Explaining Unachievable Robot Tasks Using Natural Language." Robotics: Science and Systems (RSS), 2013.
A. Boteanu, J. Arkin, T. M. Howard, and H. Kress-Gazit. "A Model for Verifiable Grounding and Execution of Complex Language Instructions." IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), October 2016, pp. 2649-2654.

Trajectories to temporal logic:
A. Shah, P. Kamath, S. Li, and J. Shah. "Bayesian Inference of Temporal Task Specifications from Demonstrations." Conference on Neural Information Processing Systems (NeurIPS), 2018.
G. Bombara and C. Belta. "Online Learning of Temporal Logic Formulae for Signal Classification." European Control Conference (ECC), Limassol, Cyprus, 2018.
C. Yoo and C. Belta. "Rich Time Series Classification Using Temporal Logic." Robotics: Science and Systems (RSS), Boston, MA, 2017.

LTL synthesis/planning:
There is a lot of work in the "formal methods for robotics" community; specifically, look at the work of Hadas Kress-Gazit, Calin Belta, Richard Murray, Lydia Kavraki, Jana Tumova, Necmiye Ozay, Dimos Dimarogonas, etc. Here is a fairly recent review on the topic:
H. Kress-Gazit, M. Lahijanian, and V. Raman. "Synthesis for Robots: Guarantees and Feedback for Robot Behavior." Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, 2018.
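To illustrate the automaton-product planning suggested above, here is a minimal sketch, not from the paper, restricted to the co-safe case where reaching an accepting automaton state decides satisfaction. The hand-built DFA, the label function, and the toy graph are assumptions for illustration; a real pipeline would compile the LTL formula into an automaton with a tool such as Spot.

```python
# Minimal sketch: shortest-path planning over the product of an environment
# graph and a task DFA (co-safe LTL). All names and data are illustrative.
from collections import deque

def plan(graph, labels, start, dfa, q0, accepting):
    """BFS over the product of an environment graph and a task DFA.

    graph:   dict node -> list of neighbour nodes
    labels:  dict node -> frozenset of propositions true at that node
    dfa:     dict (state, label) -> next state; unlisted pairs self-loop
    Returns the shortest node path whose label sequence the DFA accepts.
    """
    q_start = dfa.get((q0, labels[start]), q0)   # consume the start label
    frontier = deque([(start, q_start, [start])])
    seen = {(start, q_start)}
    while frontier:
        node, q, path = frontier.popleft()
        if q in accepting:
            return path
        for nxt in graph[node]:
            q_nxt = dfa.get((q, labels[nxt]), q)
            if (nxt, q_nxt) not in seen:
                seen.add((nxt, q_nxt))
                frontier.append((nxt, q_nxt, path + [nxt]))
    return None                                  # task unsatisfiable from start

# Toy task "F(lake and F intersection)": reach the lake, then an intersection.
graph = {"a": ["b"], "b": ["a", "lake"], "lake": ["b", "x"], "x": ["lake"]}
labels = {"a": frozenset(), "b": frozenset(),
          "lake": frozenset({"lake"}), "x": frozenset({"intersection"})}
dfa = {(0, frozenset({"lake"})): 1, (1, frozenset({"intersection"})): 2}
print(plan(graph, labels, "a", dfa, 0, accepting={2}))
# -> ['a', 'b', 'lake', 'x']
```

Because the product state space is just (node, DFA state) pairs, breadth-first search returns a shortest satisfying path without any probabilistic machinery, which is the reviewer's point about not needing an MDP here.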