Learning Memory-Based Control for Human-Scale Bipedal Locomotion


Jonah Siekmann (Oregon State University); Srikar Valluri (Oregon State University); Jeremy Dao (Oregon State University); Francis Bermillo (Oregon State University); Helei Duan (Oregon State University); Alan Fern (Oregon State University); Jonathan Hurst (Oregon State University)

Abstract

Controlling a non-statically stable biped is a difficult problem, largely due to the complex hybrid dynamics involved. Recent work has demonstrated the effectiveness of reinforcement learning (RL) for simulation-based training of neural network controllers that successfully transfer to real bipeds. The existing work, however, has primarily used simple memoryless network architectures, even though more sophisticated architectures, such as those including memory, often yield superior performance in other RL domains. In this work, we consider recurrent neural networks (RNNs) for sim-to-real biped locomotion, allowing for policies that learn to use internal memory to model important physical properties. We show that while RNNs significantly outperform memoryless policies in simulation, they do not exhibit superior behavior on the real biped unless trained with dynamics randomization to prevent overfitting to the simulation physics; with such training, they achieve consistently better sim-to-real transfer. We also show that RNNs can use their learned memory states to perform online system identification by encoding parameters of the dynamics into memory.
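As a concrete illustration of the training setup described in the abstract, the following is a minimal sketch (not the authors' code) of rolling out a recurrent policy in a simulator whose dynamics parameters are re-sampled every episode. The environment interface (set_dynamics, reset, step) and the parameter names are assumptions made for illustration only.

# Hypothetical sketch: an LSTM policy whose hidden state persists across an
# episode, rolled out under per-episode dynamics randomization.
import torch
import torch.nn as nn

class LSTMPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.mean = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs, hidden=None):
        # obs: (batch, 1, obs_dim); hidden carries memory between timesteps
        out, hidden = self.lstm(obs, hidden)
        return self.mean(out), hidden

def rollout(env, policy, param_ranges):
    # Re-sample dynamics parameters (e.g. link masses, joint damping) once per
    # episode, so the policy can only succeed by inferring them online from
    # its memory. env is an assumed gym-like simulator with a set_dynamics hook.
    params = {name: torch.empty(1).uniform_(lo, hi).item()
              for name, (lo, hi) in param_ranges.items()}
    env.set_dynamics(params)
    obs, hidden, done = env.reset(), None, False
    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32).view(1, 1, -1)
        with torch.no_grad():
            action, hidden = policy(obs_t, hidden)
        obs, reward, done, info = env.step(action.squeeze().numpy())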

Live Paper Discussion Information

Start Time: 07/14 15:00 UTC
End Time: 07/14 17:00 UTC

Paper Reviews

Review 1

I think that the role of memory plus domain randomization in effectively performing online system identification is important and understudied. By themselves these are not new ideas, but the fine points often matter, and it is never fully clear whether proposed algorithmic ideas are as agnostic to the task and hardware as we might like.

The framing of the paper could be improved. Defining and distinguishing between online system identification and disturbance observation would be helpful; it was not clear to this reader whether these are the same thing or not. It would also be nice to frame the work in the same space as work that learns a more explicit model of the dynamics parameters, i.e., ref. 22. Splitting the discussion into the matrix of combinations defined by (FF, LSTM) x (noDR, DR) would also be useful, because that is the core issue of the paper. One could then hypothesize:
- FF, noDR: will overfit to the simulation dynamics.
- FF, DR: will produce a motion that is robust to some parameter variation.
- LSTM, noDR: not clear what the hypothesis is here; it is not clear why this should learn something much different than FF, noDR.
- LSTM, DR: will use the policy memory to do online system identification.

One subtle issue worth thinking about: if we draw an analogy between online adaptation and Kalman filtering, then what defines the Kalman "gain" that must be implicit in the memory-based controller?

Further comments:
- The abstract could jump more directly to the point and instead devote more space to the conclusions.
- The task could be clearly defined, i.e., walk at a range of speeds [a, b], as modeled implicitly by a reference trajectory that is parameterized by speed (in some way).
- The structure of Section IV-A could be improved; the first paragraph merges technical details with some results discussion.
- Fig. 4: the title (first part of the caption) could be "Learning curve without dynamics randomization".
- Adding a 4th column to Table III, i.e., FF DR, would help clarify the structure, even if all the entries are a dash, indicating a failure or poor policy.
- Why not give the disturbance/randomization information to the critic? It will be discarded at run time anyhow. This is an "asymmetric actor-critic" structure (see a paper that has this title). Thus it is not clear that the critic requires the memory. (A sketch of this structure follows this review.)
- The randomization interval for the COM seems excessively large, i.e., [-25, 6] cm. Why is it difficult to know the pelvis COM to within 5 cm? It would have been interesting to see a dynamic alteration made to the COM of the robot and to see that the recovered COM estimate adapted accordingly (or the same for some other parameter that might be easier to change in an online setting).
- "Our learning process makes use of a reference trajectory": perhaps better to say that it learns to imitate a given reference trajectory.
- Fig. 3 could be condensed, or simply summarized in the text.
- The recurrent PPO policy learning is unique, which is both good and bad. How does the learning structure compare to other similar work?
- Tables IV and V are referenced out of order in the text.
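The asymmetric actor-critic structure suggested in the review above could look roughly like the following sketch. This is a hypothetical illustration, not code from the paper; the class and argument names are assumptions.

# Hypothetical sketch of an asymmetric actor-critic: the recurrent actor sees
# only on-board observations, while the critic is additionally conditioned on
# the true randomized dynamics parameters, which are simply discarded at
# deployment because only the actor runs on the robot.
import torch
import torch.nn as nn

class AsymmetricActorCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, dyn_dim, hidden_dim=128):
        super().__init__()
        self.actor = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.actor_head = nn.Linear(hidden_dim, act_dim)
        # The critic gets observation + privileged dynamics parameters, so it
        # need not infer them from memory; a feedforward net may suffice.
        self.critic = nn.Sequential(
            nn.Linear(obs_dim + dyn_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1))

    def act(self, obs, hidden=None):
        # obs: (batch, 1, obs_dim); memory lives only in the actor
        out, hidden = self.actor(obs, hidden)
        return self.actor_head(out), hidden

    def value(self, obs, dyn_params):
        # obs, dyn_params: (batch, obs_dim) and (batch, dyn_dim)
        return self.critic(torch.cat([obs, dyn_params], dim=-1))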

Review 2

The paper investigates the problem of transferring a simulation-trained policy to a real physical robot (Cassie). The core idea is to train a recurrent neural network policy (represented as an LSTM) and randomize the dynamics of the simulation during training. The results seem solid, the paper is well written and easy to follow, and it is great to see that the method works on a real physical robot.

My main concern about the paper is that the approach it takes is essentially the same as the one in Peng et al. 2018, in which they also trained an LSTM policy and used dynamics randomization for sim-to-real transfer. The major differences are that the training algorithm and the robot are different.

In addition to the method, the analysis of which dynamics to randomize is potentially interesting, as it shows that the baseline methods, trained with certain dynamics parameters, can work as well as the proposed method. Though the focus of this paper is on the LSTM policy with dynamics randomization, it is nevertheless interesting to see more details about the selected dynamics sets, as this might provide insight into which parameters are more important for the task. The analysis of predicting the dynamics parameters from the LSTM latent variable is also interesting. However, these analyses do not lead to a significant change from prior methods.

In general, I think the paper has developed an interesting learning system that demonstrates good results on real robots, while the technical contribution is limited due to the similarity to prior work.
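The ablation over which dynamics parameters to randomize, mentioned in the review above, could be organized along these lines. The parameter names, ranges, and helper callables below are illustrative assumptions, not the paper's actual experimental protocol.

# Hypothetical ablation sketch: train with different randomized-parameter
# subsets and compare transfer performance, to probe which parameters matter.
PARAM_SUBSETS = {
    "none":             {},
    "mass_only":        {"pelvis_mass_scale": (0.75, 1.25)},
    "damping_only":     {"joint_damping_scale": (0.5, 1.5)},
    "mass_and_damping": {"pelvis_mass_scale": (0.75, 1.25),
                         "joint_damping_scale": (0.5, 1.5)},
}

def evaluate_subsets(train_fn, eval_fn, n_eval_episodes=50):
    # train_fn(param_ranges) -> policy; eval_fn(policy, n) -> mean return.
    # Both are assumed to be supplied by the surrounding training pipeline.
    results = {}
    for name, ranges in PARAM_SUBSETS.items():
        policy = train_fn(ranges)
        results[name] = eval_fn(policy, n_eval_episodes)
    return results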

Review 3

This paper proposes to use deep reinforcement learning (PPO) and domain randomization to learn a recurrent policy (LSTM) for the Cassie robot. The paper is clearly written and thoroughly evaluated, and the results are compelling. I really appreciate the real robot results. Although none of the individual components of this paper (PPO, reward based on imitation, LSTM policy, domain randomization) is novel, as a researcher in the field of locomotion and learning, I admit that I have learned a lot from this paper, which is summarized below:
1) The clock (phase) input is essential for learning a successful policy.
2) A significant sim-to-real gap does exist for the Cassie robot (it appeared otherwise in the prior work of [Xie et al.]).
3) Combining RNN and domain randomization gives the best sim-to-real performance.
4) The memory may encode the dynamics. Although I have some doubts about this conclusion, given that the Mean Percent Error is still high (~31%) in Table IV, this observation is inspiring and worth further investigation (maybe as future work) because this finding could be paradigm-shifting. If the memory learned to encode dynamics, learning with memory could replace the painful manual system identification process. (See the probe sketch after this review.)

I believe that other researchers in this field would also benefit from reading this paper. It is clearly an important step towards the automatic design of locomotion controllers for legged robots. For this reason, I would recommend accepting the paper.
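The memory-encodes-dynamics analysis referred to in point 4 above is the kind of result one could examine with a simple probe from the policy's hidden states to the true dynamics parameters. The sketch below uses an ordinary linear regression and a mean-percent-error metric as an assumed stand-in for the paper's actual decoding procedure.

# Hypothetical probe sketch: fit a linear decoder from LSTM hidden states
# (collected in simulation) to the randomized dynamics parameters in effect,
# and report mean percent error. Assumes nonzero parameter values; a held-out
# test split would be needed for a fair generalization claim.
import numpy as np
from sklearn.linear_model import LinearRegression

def probe_memory(hidden_states, true_params):
    # hidden_states: (N, hidden_dim) array of LSTM states
    # true_params:   (N, n_params) array of the corresponding dynamics parameters
    decoder = LinearRegression().fit(hidden_states, true_params)
    pred = decoder.predict(hidden_states)
    mpe = np.mean(np.abs((pred - true_params) / true_params)) * 100.0
    return decoder, mpe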