Dhiraj Gandhi (Carnegie Mellon University); Abhinav Gupta (Carnegie Mellon University); Lerrel Pinto (NYU/Berkeley)
Truly intelligent agents need to capture the interplay of all their senses to build a rich physical understanding of their world. In robotics, we have seen tremendous progress in using visual and tactile perception; however, we have often ignored a key sense: sound. This is primarily due to the lack of data that captures the interplay of action and sound. In this work, we perform the first large-scale study of the interactions between sound and robotic action. To do this, we create the largest available sound-action-vision dataset with 15,000 interactions on 60 objects using our robotic platform Tilt-Bot. By tilting objects and allowing them to crash into the walls of a robotic tray, we collect rich four-channel audio information. Using this data, we explore the synergies between sound and action, and present three key insights. First, sound is indicative of fine-grained object class information, e.g., sound can differentiate a metal screwdriver from a metal wrench. Second, sound also contains information about the causal effects of an action, i.e., given the sound produced, we can predict what action was applied to the object. Finally, object representations derived from audio embeddings are indicative of implicit physical properties. We demonstrate that on previously unseen objects, audio embeddings generated through interactions can predict forward models 24% better than passive visual embeddings.
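As a rough, hypothetical sketch of the kind of pipeline the abstract describes (turning a four-channel interaction recording into an audio embedding and a 60-way object prediction), one might proceed as follows; the sample rate, clip length, and network architecture here are assumptions, not details taken from the paper.

```python
# Hedged sketch: embedding and classifying one Tilt-Bot interaction's audio.
# Sample rate, clip length, and architecture are assumptions; the paper does
# not specify them.
import torch
import torch.nn as nn
import torchaudio

SAMPLE_RATE = 44_100   # assumed microphone sample rate
CLIP_SECONDS = 4       # assumed length of one interaction clip
NUM_CHANNELS = 4       # four-channel audio from the tray microphones
NUM_OBJECTS = 60       # object classes in the dataset


class AudioEmbedder(nn.Module):
    def __init__(self, embed_dim=128, num_classes=NUM_OBJECTS):
        super().__init__()
        # Log-mel spectrogram per channel, stacked as image channels for a small CNN.
        self.to_mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=SAMPLE_RATE, n_mels=64)
        self.conv = nn.Sequential(
            nn.Conv2d(NUM_CHANNELS, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.embed = nn.Linear(64, embed_dim)        # audio embedding
        self.classify = nn.Linear(embed_dim, num_classes)

    def forward(self, waveform):                     # (B, 4, T) raw audio
        spec = torch.log1p(self.to_mel(waveform))    # (B, 4, n_mels, frames)
        z = self.embed(self.conv(spec).flatten(1))
        return z, self.classify(z)


waveform = torch.randn(8, NUM_CHANNELS, SAMPLE_RATE * CLIP_SECONDS)
embedding, logits = AudioEmbedder()(waveform)
print(embedding.shape, logits.shape)                 # (8, 128) and (8, 60)
```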
Start Time | End Time
---|---
07/14 15:00 UTC | 07/14 17:00 UTC
This paper provides insights on the importance of using sound for object classification and for inverse- and forward-model prediction. The paper first presents the data collection procedure used to build the sound-action-vision dataset, with 15,000 interactions on over 60 objects using a Sawyer robot. This dataset is then used to explore the synergy between sound and action and to gain insight into what sound can be used for. The paper reports a number of evaluations, including 1) object classification, 2) inverse-model learning, 3) multi-task audio embedding learning, 4) few-shot learning, and 5) forward-model learning.

Pros:
* The insights provided in the paper are very useful for the research community.
* Relatively thorough experimental evaluation.
* The authors mentioned that they are planning to open-source the dataset.
* The paper is easy to read.

Cons:
* The distributions of the objects are not described in the paper. Some of the objects used for data collection are shown in Fig 4, but this is an important aspect of the paper that needs to be described in more detail. Specifically, these distributions should be reported for both training and test datasets: 1) object distribution by MATERIAL (e.g. metal, plastic, glass, ...); 2) object distribution by SHAPE (e.g. small, medium, large); 3) object distribution by WEIGHT (e.g. light, medium, heavy); 4) object distribution by HARDNESS (e.g. soft, firm, hard). It would also be interesting to report accuracy broken down by these distributions (see the sketch after this review); this might provide more insight into the effectiveness or ineffectiveness of using sound for each object category.
* Some claims in the paper are too generic, are not well supported, and should be toned down. For example: (1) In Section IV-B: "This shows that audio data contains fine-grained information about objects. Although ..., our results show for the first time (to our knowledge) that audio information generated through action gives instance-level information like screwdriver, scissor, tennis ball etc." Though I believe this claim for the limited object dataset considered in this paper, one should avoid over-generalizing the results. (2) Although the collected dataset is useful for research, it is not representative of real-life scenarios a robot may face. The proposed setup makes a number of exaggerated movements that produce loud noise, and this plays an important role in improving the results obtained with sound. The paper briefly reports an experiment along this line in Section IV-G; however, this section is not well described and is missing critical details such as the objects being used and more thorough quantitative experiments.

Suggestions:
- The paper provides a comparison with a vision-only baseline. I am wondering how much one may gain by using audio-vision vs. force-vision, where the latter refers to using force/torque values from contact to infer information.
- Kids use toys with exaggerated noises to build their sound-action synergy, and the skills built in childhood carry over into adulthood. I was wondering if you could take the knowledge gained by the Tilt-Bot robot and apply it to the pushing experiment.

Further comments:
- Please provide details on the action distribution used to tilt the box.
- Section IV-C: It seems the visual model generalizes better from seen to unseen objects than the audio model. Please elaborate on this in the paper.
- Fig 6: Some of the action predictions in Fig 6 are pretty far off. This needs to be discussed in the paper.
- Fig 2 is missing in the paper.
- What if you tilt the box but the object does not hit the wall (and only moves slightly)? Do you still use this data?
- There is a typo in the caption of Fig 6: "For each for images, ..."
- Table 1: What do the up and down arrows correspond to? I assume they indicate whether a higher or lower value is desired. Please clarify this in the caption.
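As a concrete illustration of the per-attribute accuracy breakdown suggested in the cons above, here is a minimal sketch; the material and hardness annotations are hypothetical and would need to be added to the released dataset.

```python
# Hedged sketch of the per-attribute accuracy breakdown suggested above.
# The material/hardness labels and the results table are placeholders.
import pandas as pd

results = pd.DataFrame({
    "object":   ["screwdriver", "wrench", "tennis_ball", "mug"],
    "material": ["metal",       "metal",  "rubber",      "ceramic"],
    "hardness": ["hard",        "hard",   "soft",        "hard"],
    "correct":  [True,          True,     False,         True],   # classifier hits
})

# Accuracy conditioned on each attribute, e.g. "does sound work worse on soft objects?"
for attribute in ["material", "hardness"]:
    print(results.groupby(attribute)["correct"].mean())
```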
Overall comments: The submission is highly original. I do not know of any other work aiming to provide an open-sourced dataset that includes action, vision and sound. I believe such a dataset would be useful to the robotics community, as well as to other communities more specifically interested in sound. There is room for improvement in the quality of the submission on the experimental side. As a dataset paper, I think it is very strong. However, the authors do not provide convincing experiments to show that sound is a useful perceptual input above vision. I give several specific points of (hopefully constructive) criticism below.

Nitpicky writing points for the introduction:
- "A truly intelligent agent would need to capture the interplay of all the three senses to build a physical understanding of the world." This really is not obviously true. Many truly intelligent humans and animals lack at least one sense, sometimes two, and are still capable of understanding the physical world. Consider removing this sentence; I don't think it is needed to underscore the importance of sound as a sense.
- Consider splitting related work into two sections: one about datasets (and prior datasets), and one about how sound can be successfully leveraged for different tasks. As it currently reads, these two are conflated: previous methods are mentioned that also introduce both datasets and applications, but their differences are only discussed along the dataset axis and not the applications axis. "Multi-modal learning with sound" as a section should only compare to the application axis of previous work; the dataset section can separately compare to datasets in previous work.

Constructive criticism for the experiments:

General criticism: not nearly enough detail is provided in the paper to be able to replicate any of the experiments. Architectures for the models are not described, there is no mention of how the models are trained, and hyperparameters are ignored. As it stands, this paper could not be reproduced by a reader.

Fine-grained audio classification:
- The authors state that they create a dataset by training on 80% of their data and testing on a held-out 20%, for each object. They provide a random-embedding baseline, but not the standard "nearest neighbours in pixel space" baseline (a sketch of such a baseline follows this review). Without this nearest-neighbours baseline, it is difficult to get a sense of how difficult this problem is. This is a common baseline for other perceptual datasets.

Inverse model learning:
- In Figure 6, it would be helpful to see, for these same examples, what the predictions from the vision-only model are. In addition, how much sound is being provided as input? Is it from the full 4 seconds? If so, it seems that a fair comparison might be to provide the full 4-second video from the visual domain, which I would expect could perform better than using the sound.

Multi-task audio embedding learning:
- The authors first say that "training is performed on set A objects, while testing is done on set A held-out interactions and unseen set B objects". However, they then go on to say that they see performance improvements from 73.8% to 78.6% when training on the set A objects only, and 76.1% to 79.5% when training on both set A and set B objects. If you were training on both set A and set B objects, what was being tested?
- For the inverse-model learning, is joint learning performed on both set A and set B? If so, it is not surprising that you see improvement for set B regression but not for set A. This needs to be stated more clearly in the manuscript. Similarly, when comparing to the visual baseline, it seems the authors are just reporting the previous number that did not use joint training on both tasks. If you would like to compare to this baseline, it should also be jointly trained with the classification task.
- Figure 8 could be improved or possibly removed. Perhaps color coding by physical similarity would help the visualization. At the moment, it is not clear that physically different objects are well separated in the embedding space, since the colors do not represent a gradient along any property.

Few-shot learning:
- The authors say they use a nearest-neighbours method to perform few-shot learning. How many neighbours are used?
- The authors show that using a ResNet embedding does better than the audio embeddings by a wide margin. Was any fine-tuning done to the ResNet?

Forward model learning:
- The authors say that the audio embedding is based on a random interaction. Is this random interaction *significantly different* from the action that is then input to the forward model for prediction? The figure suggests these are in the same general direction, and it would be helpful to know how the accuracy varies as a function of the similarity between the "probe" action and the predicted action.
- I am particularly concerned about this because of how badly the ResNet performs, even though it did a *much* better job at few-shot object classification in the previous section. Since the oracle is doing very well, and is defined as using the true "class" label, I would have expected that the method which most accurately captures the class would perform best on this task. However, the ResNet, which performs much better than the audio embeddings at classification, is somehow worse for forward model prediction. Could the authors comment on why this is?

Overall, I think this dataset will be a nice contribution to the community, but the paper as it is currently written does not present clear experiments showing that audio is necessary above and beyond visual information. If the authors can address the criticisms above, I believe the paper could be improved and would help spur the adoption of this dataset within the community.
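A minimal sketch of the "nearest neighbours in pixel space" baseline and the embedding-space k-NN discussed above; the array shapes, placeholder data, and choice of k are illustrative assumptions, not details from the paper.

```python
# Hedged sketch: k-NN baselines in pixel space and in audio-embedding space.
# All data here is random placeholder data standing in for the dataset.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Placeholder: 300 interactions, 64x64 grayscale start images,
# 128-d audio embeddings, 60 object labels.
images = rng.random((300, 64, 64))
audio_embeddings = rng.random((300, 128))
labels = rng.integers(0, 60, size=300)


def knn_accuracy(features, labels, k=1, split=240):
    """Fit k-NN on the first `split` examples and score on the rest."""
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(features[:split], labels[:split])
    return clf.score(features[split:], labels[split:])


print("pixel-space k-NN:", knn_accuracy(images.reshape(len(images), -1), labels))
print("audio-embedding k-NN:", knn_accuracy(audio_embeddings, labels))
```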
This is a very nice, very clear paper. It makes a compelling case for the value of the provided dataset for research, the value of using joint sensory inputs in embeddings, and the claim that achieving some degree of generalization from audio data is actually attainable without a large amount of data, modeling, or engineering. This last point is a bit surprising, and opens up new research directions (how much generalization is possible? can we use 'pretrained' audio embeddings?).