3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans

Antoni Rosinol, Arjun Gupta, Marcus Abate, Jingnan Shi, Luca Carlone


We present a unified representation for actionable spatial perception: 3D Dynamic Scene Graphs. Scene graphs are directed graphs where nodes represent entities in the scene (e.g., objects, walls, rooms), and edges represent relations (e.g., inclusion, adjacency) among nodes. Dynamic scene graphs (DSGs) extend this notion to represent dynamic scenes with moving agents (e.g., humans, robots), and to include actionable information to support planning and decision-making (e.g., spatio-temporal relations, topology at different levels of abstraction). Our second contribution is to provide the first end-to-end fully automatic Spatial PerceptIon eNgine (SPIN) to build a DSG from visual-inertial data. We integrate state-of-the-art techniques for object and human detection and pose estimation, and we describe how to robustly infer object, robot, and human nodes in crowded scenes. To the best of our knowledge, this is the first paper that reconciles visual-inertial SLAM and dense human mesh tracking. Moreover, we provide algorithms to obtain hierarchical representations of indoor environments (e.g., places, structures, rooms) and their relations. Our third contribution is to demonstrate the proposed spatial perception engine in a photo-realistic Unity-based simulator, where we assess its robustness and expressiveness. Finally, we discuss the implications of our proposal on modern robotics applications. We believe 3D Dynamic Scene Graphs can have a profound impact on planning and decision-making, human-robot interaction, long-term autonomy, and scene prediction.

Live Paper Discussion Information

Start Time End Time
07/16 15:00 UTC 07/16 17:00 UTC

Virtual Conference Presentation

Supplementary Video

Paper Reviews

Review 1

In my opinion, the weakest part of the proposal is the graph of places in Layer 3. A place is simply a 3D position in free space with a bounding box, but what size of bounding box? Looking at the large number of places in figure 2 it seems that the system is assuming a very small flying robot, while a bigger land robot would need a totally different places graph for path-planning. So, the claim that this places graph allows for path planning should be moderated. Also, looking in detail at figure 2, we can observe that the system has found many paths between rooms that traverse crystal walls! Probably, good semantic mapping of this kind of environments would require detecting and explicitly representing doors.

Review 4

Originality: The proposed 3D DSG includes multiple semantic layers mainly targeted for urban, indoor environments. Similar hierarchical representations have been used in indoor navigation and more sophisticated layers on traversability and terrain type analyses are used for outdoor navigation. Having both static and dynamically moving objects such as humans seems to be interesting. How scalable is the proposed approach with respect to the number of static and dynamic objects? Quality: The basic idea of combining static environmental contexts and dynamically moving objects is great. One of the difficulties in doing so is to recognize whether an obstacle is a moving/movable one or not. It seems that the proposed approach resolves this issue using prior knowledge, e.g., humans are moving objects, but what if there is conflicting information from the common object detector and person detector? This question leads to a more general question on how the proposed approach handles uncertainty. Because the evaluation on this paper was done in a simulation setting, we could assume high accuracy in visual perception, however, in robotics, a key challenge is dealing with high-level of uncertainty due to sensor noise and environmental variables. I think that the paper can be strengthened by explaining how the perception uncertainty is represented and gets updated as an environment changes dynamically. Clarity: The paper is written clearly and organized following a common format of the technical paper that includes a survey of related work, the description of the proposed approach, and evaluation results and discussions. Significance: This paper proposes a representation of semantic information and the automatic building of it from sensor data. Here are some suggestions on how this paper can be improved for more significant contributions: 1) describe how the uncertainty can be included, 2) describe how the representation can be extended/modified to accommodate different needs of specific applications, e.g., adding topological and/or metric maps that can interface with path planning algorithms, 3) evaluation of SPIN on other datasets, e.g., Matterport datasets.