Nonparametric Motion Retargeting for Humanoid Robots on Shared Latent Space


Sungjoon Choi (Disney Research); Matthew Pan (Disney Research); Joohyung Kim (University of Illinois Urbana-Champaign)

Abstract

In this work, we present a semi-supervised learning method to transfer human motion data to humanoid robots with varying kinematic configurations while avoiding self-collisions. To this end, we propose a data-driven motion retargeting method named locally weighted latent learning, which combines the benefits of nonparametric regression and deep latent variable modeling. The method can leverage both paired and domain-specific datasets, and it maintains robot motion feasibility owing to the nonparametric regression and graph-based heuristics it uses. The proposed method is evaluated on two different humanoid robots, the Robotis ThorMang and COMAN, in simulation environments with diverse motion capture datasets. Furthermore, online puppeteering of a real humanoid robot is implemented.

Live Paper Discussion Information

Start Time: 07/16 15:00 UTC
End Time: 07/16 17:00 UTC

Paper Reviews

Review 1

There are a few things I do not quite understand in this paper. Once a shared latent space is created, why is it necessary to do a nearest neighbour search? You can simply use the decoder to compute a corresponding pose of the robot - am I missing something?

The subsampling method sounds good, although the distance metric sounds quite naive. It covers only the poses of the two arms. How will it be handled when the legs of the robot are also involved?

The result video appears very noisy and discontinuous. I think a method based on spatial relations would produce far smoother motions than what I see here - maybe the proposed method should be compared with those (see the two references below). Some motions, such as the dual-arm rotations, look very dissimilar to the motion of the human.

Molla, Eray, Henrique Galvan Debarba, and Ronan Boulic. "Egocentric mapping of body surface constraints." IEEE Transactions on Visualization and Computer Graphics 24.7 (2017): 2089-2102.

Jin, Taeil, Meekyoung Kim, and Sung-Hee Lee. "Aura mesh: Motion retargeting to preserve the spatial relationships between skinned characters." Computer Graphics Forum. Vol. 37. No. 2. 2018.

Overall, I think the method sounds fine - the LPP module sounds very useful for producing a good mapping from imbalanced training data. On the other hand, the other parts, such as the nearest neighbour search, sound a bit unclear. The method reads like a hybrid of deep learning and classic approaches, but the justification of the entire pipeline is not satisfactory. I think other approaches, say one based on cycle-GAN, could have produced a better mapping between the two domains.

"Once an encoder/decoder pair is constructed for each domain, we deploy locally weighted regression on the latent space to find a mapping from one domain to the other." - I do not understand this part either. If the pose is in the shared space, why is it necessary to do a locally weighted regression? A regression from which domain to which domain?

Minor typos:
- page 4, right column: "that if when we apply"
- Fig. 4, caption: "Uniform Samplpling"
- Tab. 2: "Sef collision"
- page 7: "better retargeting results *than* the baseline"

Review 2

The paper is well written and structured. The techniques of choice and the assumptions are justified clearly, and the overall approach is sound. The novelty stems from combining Wasserstein autoencoders with locally weighted regression on the embedded space, and from the incorporation of collision handling and subsampling for the retargeting task. The approach is, however, not a simple concatenation of previously presented techniques. The entire pipeline requires the definition of several quantities, such as the divergence, distance function, and losses for the WAE, the local parameter k for the local regression, and DPP as a subset sampling mechanism for more accurate latent space learning. The authors excelled in making sure all the components are connected and justified.

My main criticism concerns the experiments and the comparisons provided. The paper only presents comparisons to one other method [3], and no ablation studies are reported. The paper would benefit from a more detailed evaluation of the various choices. For example, it would be interesting to see the performance of the method with another regression technique instead of LWR, such as Gaussian processes, which can learn the parameters of the kernel directly. With today's ML tools and variational inference, GPs are fast and can scale to very large datasets. How sensitive is the method to different values of k? How does the performance improve with data augmentation of different sizes? And finally, how does it compare to a simple behaviour cloning strategy constrained by collisions? These comparisons and discussions would make the paper significantly more impactful.

Overall, I believe there are sufficient novel ideas, and the quality of presentation is excellent, making the paper a solid contribution to the conference.

Review 3

This paper presents a framework for mapping motions from a robot to another robot. The proposed framework learns the latent space shared by the motion domains of two different robots. For learning the shared latent space, a Wasserstein autoencoder is adapted in this study. The contributions of the paper are 1) a framework for learning the latent space shared by two different robot pose domains, 2) a heuristic to check the feasibility of transitions, and 3) a trick for training neural networks using imbalanced data sets.

Regarding the first contribution, the objective functions in Eqs. (4) and (5) seem similar to those of the style transfer GAN, although that paper is not cited: "Image Style Transfer Using Convolutional Neural Networks", Gatys et al., CVPR 2016. I recommend that the authors cite the style transfer GAN paper and discuss the relation.

I summarize the strong and weak points of the paper.

Strong points:
- The entire algorithm seems to work well, as verified in the experiments. The proposed method reduces self-collisions while keeping the tracking performance comparable to the baseline.
- The heuristic for checking the feasibility of transitions looks practical.
- LA-DPP also looks practical, and I can see from the equations that LA-DPP should be more computationally efficient than the original DPP.

Weak points:
- The paper requires some revisions to improve the presentation. In particular, the way the locally weighted regression is used is not clear. Please refer to the comments below. I suggest putting pseudocode in the method section.
- Regarding the second contribution, the benefit of the feasibility check of the transitions is not explicitly evaluated in the experiment section.
- Regarding the third contribution, the computational efficiency of the proposed LA-DPP over the original DPP is not quantitatively evaluated in the experiments.

Detailed comments on presentation:
- I do not clearly understand how the locally weighted regression is used on the latent space. From the term "locally weighted regression", I think of something like what is presented on this webpage: https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/cohn96a-html/node7.html In that case, "k" can be any positive real number. However, the authors write, "setting k = 1, as the proposed LWL2 becomes a table look-up method." I do not understand this sentence. It is necessary to clarify how the locally weighted regression is used in the proposed framework. In addition, I do not clearly understand why we need the locally weighted regression and why we cannot directly reconstruct the motion using the decoder P(z).
- I do not understand the third paragraph of Section IV.D. Specifically, I do not understand the black squares in Fig. 2.
- In Eqs. (4) and (5), $x^l_i$ is used, but its definition seems to be missing, although $x_i$ is defined. I understand that $x^l_i$ is the $i$th robot pose data point in the domain l, but this should be stated explicitly in the text.
- In the third paragraph of Section III, there are some equations using R(:,3). This programming-language-like expression should be avoided; please use mathematically correct equations. In addition, it seems that "R" is a rotation matrix, although it is defined simply as "orientation" in the text. If necessary, the reason why the capsule representation is computationally efficient can be described in the supplementary material.

Minor comment:
- I suggest that the authors have a look at "AUC optimization", which addresses class imbalance in the context of classification problems. It may be useful for future work.
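To make the "k = 1 becomes a table look-up" remark above concrete, here is a minimal sketch of one possible reading. It assumes that k counts nearest neighbours in the latent space (not a kernel width, as in the linked webpage) and that the regression is a Gaussian-kernel weighted average of the robot poses paired with those neighbours. The function name `locally_weighted_lookup`, the `bandwidth` parameter, and the toy data are all hypothetical; the paper's actual formulation may differ.

```python
import numpy as np


def locally_weighted_lookup(z_query, z_src, x_tgt, k=3, bandwidth=1.0):
    """Kernel-weighted average of the target poses paired with the
    k stored latent codes closest to z_query (illustrative only)."""
    # Euclidean distance from the query to every stored latent code.
    dists = np.linalg.norm(z_src - z_query, axis=1)
    # Indices of the k closest stored codes.
    nearest = np.argsort(dists)[:k]
    # Gaussian kernel on squared distance; subtracting the smallest
    # squared distance keeps the largest weight at 1 and avoids underflow.
    sq = dists[nearest] ** 2
    w = np.exp(-0.5 * (sq - sq.min()) / bandwidth**2)
    w /= w.sum()
    # Convex combination of the paired target-domain poses.
    return w @ x_tgt[nearest]


# Toy paired data: three latent codes and their paired robot poses.
z_src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
x_tgt = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
query = np.array([0.1, 0.0])

# With k = 1 the single weight is 1, so a stored pose is returned unchanged.
pose_k1 = locally_weighted_lookup(query, z_src, x_tgt, k=1)
# With k = 3 the result blends all three stored poses.
pose_k3 = locally_weighted_lookup(query, z_src, x_tgt, k=3)
```

Under this reading, every output is a convex combination of stored robot poses, and with k = 1 it is exactly one stored pose, which is consistent with the look-up description. Whether this matches the authors' intent is the question the reviewer raises.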