A Motion Taxonomy for Manipulation Embedding

David Paulius, Nicholas Eales, Yu Sun


To represent motions from a mechanical point of view, this paper explores motion embedding using the motion taxonomy. With this taxonomy, manipulations can be described and represented as binary strings called motion codes. Motion codes capture mechanical properties, such as contact type and trajectory, that should be used to define suitable distance metrics between motions or loss functions for deep learning and reinforcement learning. Motion codes can also be used to consolidate aliases or cluster motion types that share similar properties. Using existing data sets as a reference, we discuss how motion codes can be created and assigned to actions that are commonly seen in activities of daily living based on intuition as well as real data. Motion codes are compared to vectors from pre-trained Word2Vec models, and we show that motion codes maintain distances that closely match the reality of manipulation.
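As a minimal sketch of how a distance metric between motion codes might be defined, the Python below treats each code as a bit string and computes a weighted Hamming distance over a contact part and a trajectory part. The bit layout, the 5-bit contact field, the example codes, and the weights are all hypothetical and chosen only for illustration; they are not the encoding defined in the paper.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def weighted_distance(code_a: str, code_b: str, contact_bits: int = 5,
                      w_contact: float = 1.0, w_trajectory: float = 1.0) -> float:
    """Weighted Hamming distance between two motion codes.

    Assumption: the first `contact_bits` bits describe contact and the
    remaining bits describe trajectory. This layout is hypothetical.
    """
    contact = hamming(code_a[:contact_bits], code_b[:contact_bits])
    trajectory = hamming(code_a[contact_bits:], code_b[contact_bits:])
    return w_contact * contact + w_trajectory * trajectory


# Hypothetical 9-bit codes, invented for illustration only.
CUT = "101101000"
SLICE = "101101010"
POUR = "000000111"

print(weighted_distance(CUT, SLICE))                 # small: codes differ in one bit
print(weighted_distance(CUT, POUR))                  # larger: codes differ in most bits
print(weighted_distance(CUT, POUR, w_contact=2.0))   # contact differences count double
```

Raising one weight makes differences in that part of the code dominate the distance, which is one way to prioritize contact over trajectory or vice versa.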

Live Paper Discussion Information

Start Time: 07/15 15:00 UTC
End Time: 07/15 17:00 UTC

Paper Reviews

Review 1

The construction of this encoding scheme and the choices behind it are well explained and documented. However, this work lacks an example (even a simple one) of a task that successfully uses the encoding. While the construction of the encoding seems coherent, it is not clear that it will have a significant effect on a learning task; at the least, the paper should show that it increases learning speed. Creating a link between the geometric, physical, and motion aspects of an action and its semantics is difficult, but it could be done in multiple ways. One option would be to use Siamese networks, exploiting the similarities in the appearance of the actions in the dataset (as in an action recognition task) to create the embedding from the videos. Using both embeddings in another task, to show the significance of the presented encoding scheme and the benefit of using it, would have been appreciated instead of a comparison with Word2Vec, which does not contain motion information about the action.

Review 2

This paper explores motion encoding using a motion taxonomy based on the mechanics of motions, showing how the motion code assignment corresponds to actual data. As a result, a new embedding method has been developed. The new method translates manipulations into a machine language called motion codes according to attributes based on contact and trajectory information, whereas the popular word embedding model Word2Vec is trained on context. To compare motion codes with Word2Vec embeddings, the authors used dimension reduction with PCA and then t-SNE to visualize these embeddings and their relative distances in 2D. Experiments were carried out on both methods, and the results demonstrated that motion codes produce embeddings that provide better metrics for classification than Word2Vec, which is trained on natural language and gives no innate information for comparing two labels from a mechanical point of view.

A number of problems have been solved:
1. Different forms of motion for the same word (e.g. mixing (liquid), mixing (non-liquid)).
2. Synonymous labels being ignored (different labels but similar mechanical motions; multiple synonyms are oversimplified into a single label, e.g. ‘chop’, ‘cut’ and ‘slice’).
3. Multiple meanings of a single word (the noun and the verb form of a word may not share the same meaning, e.g. tap).

In addition, motion codes reduce the number of features needed to label motions and contain more meaningful information about distances between motions. The aim of this research is clearly stated and fully addressed. However, I still have some concerns, as detailed below.

For clarity:
1. The abstract mentions that binary codes establish a road-map to transfer learned skills to unlearned skills that share similar properties. Please explain this more clearly.
2. The introduction states that this taxonomy can consolidate motion aliases. A further explanation of the reasons for consolidation would be helpful.
3. The authors define two weighted values to set the priority of contact or trajectory types when measuring distances. I suggest that the authors elaborate on how this is achieved.
4. At the beginning of Section II, a formal definition of motion feature and motion feature space should be introduced, as their absence may lead to confusion until the whole paper has been read.
5. I suggest that the authors present the experimental results in a more intuitive form, such as circling the motions that belong to the same cluster.
6. Figure 4 illustrates how to extract revolute properties for the motion of loosening a screw. However, one more picture with coordinate axes and arrows indicating the direction of motion would make it easier to understand.
7. In Section IV B, the paper compares motion codes to pre-trained Word2Vec models. Please explain how a motion code vector or a Word2Vec vector is converted to the corresponding point in Figure 5.
8. PCA and t-SNE are used in this paper. I suggest that the authors introduce these methods properly for non-expert readers.
9. The conclusion (Section V) says, "with a suitable model, motion codes can be automatically generated." Please explain how to obtain this model.

For quality:
1. The engagement type (rigid or soft) is connected with the structural outcome (deforming or non-deforming). The classification of these two attributes may be redundant, and a simpler code may exist.
2. As mentioned just before Section III, the proposed motion taxonomy is not the ideal way of representing a motion. Considering this as a drawback, can you please indicate the most important features that form a good motion representation method? Alternatively, could you please evaluate, quantitatively or qualitatively, why this motion taxonomy has this drawback? Are there any other potential drawbacks (since they are not explicitly explained in the paper)?
3. The paper proposes a criterion for the performance of the embeddings. This criterion is intuitive; however, it would be more convincing if the paper explained how to use the motion codes for motion recognition, analysis and generation rather than just comparing them with Word2Vec embeddings.
4. Section IV compares motion codes to Word2Vec embeddings. Please explain why Word2Vec was chosen over other embedding methods.
5. In Section IV B, the paper compares motion codes to pre-trained Word2Vec models from Concept-Net, Google News and Wikipedia. However, the data sets used to train these models were not created for robotics. It would be better to use a data set created for robot manipulation.
6. In the conclusion (Section V) and Section II, the authors say that their future work will be a neural network that can automatically generate codes for manipulations in video sequences. It would be more impressive to explain the relationship between the newly proposed motion taxonomy and this neural network.
7. The definition of contact duration mentions that the duration can be measured visually or physically with sensors. However, no threshold or boundary is given. A better definition of the threshold may be vital for a robot to generate this code automatically.
8. The text needs to be re-checked, since I found missing punctuation marks in the paper.

Other suggestions:
1. I suggest that the authors add the overall framework of the proposed method to the article.
2. It would be interesting to consider actions using double-active tools, e.g. bending a long stick using two hands.

Review 3

* The paper is clearly written and easy to follow.
* Their proposed motion taxonomy, represented as a binary vector, is interesting, as it can compactly represent many different types of motion. However, it is unclear in what situations it would be useful.
* My main qualm with the paper is that the authors do not clearly validate their approach. They claim that the motion codes can be extracted directly from demonstrations, but they do not provide statistics on how frequently the robot extracted the correct motion codes. One possible validation would be for the robot to automatically create motion codes for a number of different demonstrations and then compare how close the automatically generated motion codes are to motion codes hand-coded by an expert.
* I am not sure that the comparison to Word2Vec is the fairest one, as Word2Vec was created for a very different domain (NLP) rather than for describing how a robot or people move.
* In their comparison to Word2Vec, they claim that their taxonomy is better, but they have no statistics backing this claim. The authors do not present any statistics showing that one was more accurate than the other; instead, they compare several selected examples. A better validation would be to have an expert label how similar different motions are, and then measure how far Word2Vec and their proposed method are from the expert labels.
* A strong validation might have been to learn several motion taxonomies and see how well the robot could generalize them to new motions (motions that it had never done before).
* I would have liked to see several specific examples of how this taxonomy would improve current robotic applications.
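The validation suggested in Review 3, comparing automatically generated motion codes against codes hand-coded by an expert, could be scored along the lines of the following Python sketch. The function names, the two metrics, and the 4-bit example codes are assumptions made for illustration; neither the paper nor the review specifies them.

```python
def bit_agreement(predicted: str, expert: str) -> float:
    """Fraction of bit positions where the predicted code matches the expert code."""
    if len(predicted) != len(expert) or not expert:
        raise ValueError("codes must be non-empty and of equal length")
    matches = sum(p == e for p, e in zip(predicted, expert))
    return matches / len(expert)


def evaluate(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Summarize agreement over (predicted, expert) motion code pairs.

    exact_match: share of demonstrations whose whole code is correct.
    mean_bit_agreement: average per-bit agreement across demonstrations.
    """
    if not pairs:
        raise ValueError("at least one (predicted, expert) pair is required")
    exact = sum(p == e for p, e in pairs) / len(pairs)
    mean_bits = sum(bit_agreement(p, e) for p, e in pairs) / len(pairs)
    return {"exact_match": exact, "mean_bit_agreement": mean_bits}


# Hypothetical (predicted, expert) pairs, invented for illustration only.
PAIRS = [
    ("1010", "1010"),
    ("1100", "1000"),
    ("0000", "1111"),
    ("0110", "0110"),
]

print(evaluate(PAIRS))
```

Reporting per-bit agreement alongside exact match distinguishes a generator that is nearly right on most demonstrations from one that is badly wrong on a few.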