Iterative Repair of Social Robot Programs from Implicit User Feedback via Bayesian Inference

Michael Jae-Yoon Chung, Maya Cakmak


Creating natural and autonomous interactions with social robots requires rich, multi-modal sensory input from the user. Writing interactive robot programs that make use of this input can demand tedious and error-prone tuning of program parameters, such as thresholds on noisy sensory streams for detecting whether the robot's user is engaged or not. This tuning process, which deals with low-level streams and parameters, makes programming social robots time-consuming and inaccessible to the people who could benefit the most from unique use cases of social robots. To address this challenge, we propose the use of iterative program repair, where programmers create an initial program sketch in our new Social Robot Program Transition Sketch Language (SoRTSketch), a domain-specific language that supports expressing uncertainties related to thresholds in transition functions. The program is then iteratively repaired using Bayesian inference based on corrections of interaction traces that are either provided by the programmer or derived from implicit feedback given by the user during the interaction. Based on experiments with a human simulator and with 10 human users, we demonstrate the ease and effectiveness of this approach in improving social robot programming and program outputs that represent three common human-robot interaction patterns. We also show how our approach helps programs adapt to environment changes over time.

Live Paper Discussion Information

Start Time: 07/14 15:00 UTC
End Time: 07/14 17:00 UTC

Paper Reviews

Review 1

Enabling end-users to fine-tune behavior is important for democratizing robot programming. It allows users to personalize behavior without robot programming expertise. The motivation for this work is good, and the paper is relevant to RSS. I appreciate that the authors conducted both simulation and real-world experiments with human subjects. Application of the repair scheme appears to consistently improve the overlap score.

== Areas for Improvement ==

I have concerns about the technical aspects of the paper. Specifically, the Bayesian update Eqn (2) appears incorrect. In the normalizer:
- what is the index j associated with?
- the sum appears to be over the correct data traces; shouldn't the marginalization be over hole variables?

Potentially, this is a typographic error and can be easily corrected. However, if the update is inherently incorrect, the subsequent results would be invalid.

The posterior update doesn't appear to be closed-form. If I understand correctly, the authors perform numerical integration over discretized parameter sets. However, this implies exponential complexity with respect to the number of hole variables. This point should be clarified in the paper since it limits the applicability of the approach to a small number of hole variables.

The input trace doesn't appear to be corrected like the output traces; won't this lead to a "mismatch" between the input and output traces? If so, this could lead to incorrect inference.

I appreciate that experiments were performed with human users. A potential improvement is to perform a comparison to a control group (e.g., an alternative baseline method), with proper statistical tests or Bayesian analysis.

Finally, the paper requires a thorough proofread to correct typographic errors, e.g.,
- "XXX describe how variables are sampled for execution???" is an unfinished sentence.
- "corrections that are user by" -> "corrections that are used by"
- \mu_open^iter4 and \mu_open^iter1 appear to be swapped?
- Table 1 and in-text: "quite" -> "quiet"
- "tunned for" -> "tuned for"

Overall, I find the key idea of iterative Bayesian program repair interesting, but the presented work appears preliminary. I hope that the authors can address the technical and presentation issues above.
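
The two technical points raised in Review 1 (what the normalizer should sum over, and how cost scales with the number of hole variables) can be illustrated with a minimal sketch. This is not the authors' implementation or their Eqn (2): the function names, the single-threshold example, the 0.9/0.1 noise model, and the trace data are all hypothetical and chosen only to show a grid-based Bayesian update in which the normalizer marginalizes over hole-variable values.

```python
import itertools


def grid_posterior(grids, prior, likelihood, traces):
    """Posterior over all combinations of discretized hole values.

    grids: one list of candidate values per hole variable.
    prior(theta) and likelihood(trace, theta) are caller-supplied functions;
    theta is a tuple holding one value per hole.
    """
    points = list(itertools.product(*grids))
    weights = []
    for theta in points:
        w = prior(theta)
        for trace in traces:
            w *= likelihood(trace, theta)
        weights.append(w)
    # Normalizer: marginalize over hole-variable values, not over traces.
    z = sum(weights)
    if z == 0.0:
        raise ValueError("all hole assignments have zero posterior mass")
    return {theta: w / z for theta, w in zip(points, weights)}


def uniform_prior(theta):
    return 1.0  # unnormalized; the normalizer z absorbs the constant


def threshold_likelihood(trace, theta):
    # Hypothetical noise model: a corrected label agrees with the
    # thresholded sensor value with probability 0.9.
    value, corrected_label = trace
    predicted = value > theta[0]
    return 0.9 if predicted == corrected_label else 0.1


grid = [0.0, 0.25, 0.5, 0.75, 1.0]
traces = [(0.6, True), (0.3, False)]  # made-up (sensor value, corrected label)
post = grid_posterior([grid], uniform_prior, threshold_likelihood, traces)
best = max(post, key=post.get)  # (0.5,): the only grid point between 0.3 and 0.6

# Cost grows as k**n: three holes on this 5-point grid need 125 evaluations.
three_hole_points = len(list(itertools.product(grid, grid, grid)))
```

With k grid values per hole and n holes, the loop visits k**n assignments, which is the exponential growth the reviewer asks the paper to state explicitly.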

Review 2

The authors provide a solution to a problem that appears not only in non-expert programming but also when experts intend to program new applications involving social robot behavior. The work is original in the way it attempts to find the unknown variables, sometimes also referred to as magic numbers. The quality of the paper is promising given the number of experiments and examples it incorporates, but in its current state the paper needs more clarification to convey the impact of the work. I had difficulty understanding some parts of the work and its implications. Therefore, I believe it will be important to improve the quality and clarity of the paper to ensure that it is understandable on all points and that its implications are clear to every reader (see details below).

From my point of view, one aspect that could be improved is the motivation and introduction. There is a missing link between the motivation and the approach, which gives me the impression that the approach is not the right solution to the problem given in the motivation. For example, I have difficulty imagining how a non-expert could choose the right distribution for a hole variable when even to experts the range of values might sometimes be unclear (motivation: how non-experts could find new applications). I really think the stated problem is very valuable for the community to solve but, following up on the previous point, the contribution could be more complete if it provided a way to deal with completely unknown hole variables. The need to specify the probability distribution limits the applicability of the proposed method. On the other hand, the method provides an excellent way for experts to find the unknown variables faster and in a more pleasant way.

Even though the description of the simulation experiment seems detailed enough, I have difficulty understanding how the simulations were done. For example, I did not understand how the sensor data, e.g. head angle, is simulated from intentions. Further, I am unsure about the noisy state traces. Was the noise added to the time of transitions or the length?

For me to evaluate the quality of the user study I would need more information on the users, e.g. did they have previous experience with programming. Further, the lack of statistical tests does leave a doubt about the perceived change in fluency or number of interventions. I can see that the authors might have decided not to apply statistical tests due to the small number of participants and the large diversity among them. Still, the reasons for the lack of statistical tests should be stated explicitly, especially since the results seem not to support the authors' hypothesis completely. As 3 out of 10 participants did not manage to create a more fluent interaction, I expect a more critical discussion of this limitation. I think it would be more valuable to the community to find that non-experts have difficulty understanding the "Back" button, as their knowledge of state machines or of how robot sensors work might be limited. Therefore, I do not agree with the conclusion that the experiments prove the feasibility of the approach.

I think a thorough rework of the discussion and results sections would improve the quality of the work immensely without needing to rerun experiments. To make the results even more convincing, I encourage the authors to collect more data about their participants and to rerun the experiments with at least 20 participants to get significant and more reliable results.

Review 3

===Detailed review===

Overall, the paper is well-written and clear. The paper makes a timely and relevant contribution to the field of robotics and human-robot interaction, where an active field of research is the effective and efficient creation and repair of robot programs, lowering the load on experts and shifting more towards non-experts to provide input and feedback. The contributions are clear and the motivation well explained. Overall, the methodology seems sound and the evaluation is thorough, including simulated and human experiments, as well as including changes in environment and noisy input. Please find my detailed comments below.

===Major===

- The paper describes three different "social robot tasks": storytelling, a neck exercise, and open Q&A. Although the three tasks differ in content, they all require the same type of user feedback ("go back" or "next"). It would be great if the paper could discuss this approach for more complex situations and how they affect its effectiveness. For example, what if there is a larger number of potential actions to choose from than three (stop, wait, read), so that "go back" does not necessarily indicate that the alternative to action X is Y? There may be many alternatives, and it may even become infeasible for a human to keep providing feedback until the robot has learned the correct transition, simply because it becomes too time-consuming and the user may be reluctant to provide feedback and/or use the technology. Since the paper focuses on social robot actions specifically, I think this is crucial to at least touch upon. Including a Discussion section that covers drawbacks and potential avenues for future work would improve the paper even more.
- Similarly, it would be interesting to discuss the limitation of incorrect user feedback, whether intentional or not. For example, one may turn the head to the left while the robot said right, but not notice that it was the incorrect side (left and right are often mixed up by people). In this task, maybe the consequences are negligible; however, in a situation where the error may propagate further through the interaction, it may be worth discussing how best to handle this. The case of humans who did not fix incorrect transition errors is briefly described in IV.C.4 but not discussed in depth.
- Section IV.C 4) Results: the before-repair percentage overlaps as reported in the text seem incorrect. I think the values for iter4_open and iter4_neck are accidentally switched, given the numbers shown in Fig. 6 (left). Correct this and maybe (this is minor) the paper could instead report that mean_iter1_neck increased from 0.33 (SD = 0.08) to 0.36 (SD = 0.08), and mean_iter1_open decreased from 0.80 (SD = 0.30) to 0.57 (SD = 0.21), for clarity of reading. The values are a bit tricky to read in the way they are currently presented.
- One concern I have is that the sample of people used in the experiment (N=10) is small, and it is therefore reasonable not to perform statistical tests. However, the results are reported in a strong manner, talking about strict increases and decreases. I think it is fair to say that this trend is observed, but the paper should tone down the results to align them with the actual evidence presented, so that readers do not interpret them overconfidently.
- The conclusion introduces sudden new directions for future work that are not mentioned before, which is not the goal of a conclusion. A conclusion section should summarize what has been discussed in the paper. Maybe it could be called a "conclusion and future work" section, but then it should talk about avenues for future work in more detail.

===Minor===

- In Section III.A, Program Execution, there is an unfinished sentence that needs to be changed: "XXX describe how variables are samples for execution???" Probably due to the unfinished paragraph, the goal of this subsubsection is not clear; maybe it can be included in another section rather than be separate.
- In Fig. 4: it could increase clarity to mention "dotted line" in the figure's caption when talking about "missing transition" errors.
- In Section III.B, the paper mentions that speech commands can be used as an alternative to the "next" and "go back" buttons; however, one must then consider that speech is itself a more error-prone input mechanism (e.g. speech recognition errors).
- Very minor (typo): Section III.B, last paragraph: "To that end the use of interaction repair mechanisms in the state trace need to be converted to state corrections that are user by the repair algorithm": "are user" -> "are used".
- Very minor: sort the reference numbers, for example [35, 10, 2] -> [2, 10, 35].
- In Section III.C the paper describes some details about implementation. I would like to ask for clarification on whether the code will be made available to promote reuse and continuation of this work by other researchers. If the authors choose not to release the code, a motivation for this decision would be appreciated.
- IV.B 1) procedure: "IterativeBayesRepair (Alg. 2)" should be (Alg. 3).
- Same paragraph IV.B 1) procedure: "The noisy state trace was computed by adding uniform noisy of [-2,2] to every state changes in the ground truth …": "noisy of [-2,2]" should be "noise of [-2,2]" and "every state changes" should be "every state change".
- Same paragraph IV.B 1) procedure: "(…) on observing the two first time users" -> "observing two first time users". The paper does not mention them before, so it should not say "the users".
- Paragraph IV.B 2) measures: "… being sensitive to the length of the interaction of which we control..." -> "which we control"; and also "we measured the speed of (…) without break in-between iterations" -> "breaking in-between iterations", "breaks in between iterations", or "a break in between iterations"?
- IV.B 3) results: "algorithbms" typo.
- Why not the storytelling task in the evaluation?
- Why was it set to four repair iterations? Was it clear beforehand that this number was enough?
- Q&A is capitalized in some places and not capitalized in others.
- Section IV.C: "Over the four iterations, The transition parameters (…)": "the" should not be capitalized.
- Section IV.C.5: "(...) we conducted an experiment involving one human user who as a participant (…)" -> "who was a participant" (typo).
- Section IV.C.5: "(…) i.e., the quite room (…)" -> "quiet room" (typo). Also in TABLE I, change "quite" to "quiet".
- Section V: "the goal of our research is motivating by…" -> "motivated" (typo).
- Section V: "(..) exploring the interactive system for keep the programmer (…)" -> "for keeping the programmer" (typo).