### Iterative Repair of Social Robot Programs from Implicit User Feedback via Bayesian Inference

Michael Jae-Yoon Chung, Maya Cakmak

### Abstract

Creating natural and autonomous interactions with social robots requires rich, multi-modal sensory input from the user. Writing interactive robot programs that make use of this input can demand tedious and error-prone tuning of program parameters, such as tuning thresholds on noisy sensory streams for detecting whether the robot's user is engaged or not. This tuning process dealing with low-level streams and parameters makes programming of social robots time-consuming and inaccessible for people who could benefit the most from unique use cases of social robots. To address this challenge, we propose the use of iterative program repair, where programmers create an initial program sketch in our new Social Robot Program Transition Sketch Language (SoRTSketch), a domain-specific language that supports expressing uncertainties related to thresholds in transition functions. The program is then iteratively repaired using Bayesian inference based on corrections of interaction traces that are either provided by the programmer or derived from implicit feedback given by the user during the interaction. Based on experiments with a human simulator and with 10 human users, we demonstrate the ease and effectiveness of this approach in improving social robot programming and program outputs that represent three common human-robot interaction patterns. We also show how our approach helps programs adapt to environment changes over time.
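The repair idea in the abstract can be illustrated with a minimal sketch, assuming a single threshold "hole" on one sensory stream, a discretized grid of candidate values, and a simple noise model in which each corrected state label disagrees with the program's prediction with probability `eps`. All names here (`predict`, `bayes_repair`, `eps`, the example readings) are hypothetical and are not taken from SoRTSketch or from the paper's algorithms.

```python
import math

def predict(signal, theta):
    """Hypothetical transition function: state 1 ('engaged') when a reading exceeds theta."""
    return [1 if x > theta else 0 for x in signal]

def bayes_repair(grid, prior, signal, corrected, eps=0.1):
    """One repair iteration: posterior over candidate thresholds given a corrected trace.

    Assumed likelihood: each corrected label matches the prediction with
    probability 1 - eps and mismatches with probability eps.
    """
    log_post = []
    for theta, p in zip(grid, prior):
        pred = predict(signal, theta)
        matches = sum(1 for a, b in zip(pred, corrected) if a == b)
        misses = len(corrected) - matches
        log_post.append(
            math.log(p) + matches * math.log(1 - eps) + misses * math.log(eps)
        )
    # Normalize over the candidate thresholds (the hole-variable values).
    top = max(log_post)
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]

grid = [i / 10 for i in range(1, 10)]        # candidate thresholds 0.1 .. 0.9
prior = [1 / len(grid)] * len(grid)          # uniform prior over the grid
signal = [0.15, 0.35, 0.45, 0.75, 0.95]      # made-up sensor readings
corrected = [0, 0, 1, 1, 1]                  # made-up corrected state trace

posterior = prior
for _ in range(2):                           # two repair iterations on the same trace
    posterior = bayes_repair(grid, posterior, signal, corrected)

best_theta = grid[max(range(len(grid)), key=posterior.__getitem__)]
```

Feeding the posterior of one iteration back in as the prior of the next is what makes the repair iterative in this sketch; each additional corrected trace concentrates probability mass on thresholds that reproduce the corrections.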

### Live Paper Discussion Information

| Start Time | End Time |
| --- | --- |
| 07/14 15:00 UTC | 07/14 17:00 UTC |

### Paper Reviews

Review 1

Enabling end-users to fine-tune behavior is important for democratizing robot programming. It allows users to personalize behavior without robot programming expertise. The motivation for this work is good, and the paper is relevant to RSS. I appreciate that the authors conducted both simulation and real-world experiments with human subjects. Application of the repair scheme appears to consistently improve the overlap score.

== Areas for Improvement ==

I have concerns about the technical aspects of the paper. Specifically, the Bayesian update Eqn (2) appears incorrect. In the normalizer:

- what is the index j associated with?
- the sum appears to be over the correct data traces; shouldn't the marginalization be over hole variables?

Potentially, this is a typographic error and can be easily corrected. However, if the update is inherently incorrect, the subsequent results would be invalid.

The posterior update doesn't appear to be closed-form. If I understand correctly, the authors perform numerical integration over discretized parameter sets. However, this implies exponential complexity with respect to the number of hole variables. This point should be clarified in the paper since it limits the applicability of the approach to a small number of hole variables.

The input trace doesn't appear to be corrected like the output traces; won't this lead to a "mismatch" between the input and output traces? If so, this could lead to incorrect inference.

I appreciate that experiments were performed with human users. A potential improvement is to perform a comparison to a control group (e.g., an alternative baseline method), with proper statistical tests or Bayesian analysis.

Finally, the paper requires a thorough proofread to correct typographic errors, e.g.,

- "XXX describe how variables are sampled for execution???" is an unfinished sentence.
- "corrections that are user by" -> "corrections that are used by"
- \mu_open^iter4 and \mu_open^iter1 appear to be swapped?
- Table 1 and in-text: "quite" -> "quiet"
- "tunned for" -> "tuned for"

Overall, I find the key idea of iterative Bayesian program repair interesting, but the presented work appears preliminary. I hope that the authors can address the technical and presentation issues above.
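For reference, the reviewer's two technical points (a normalizer that marginalizes over hole variables, and the cost of a discretized posterior) can be written in generic form. This is a textbook Bayes update under assumed notation, not a reconstruction of the paper's Eqn (2): $\theta$ stands for the hole variables, $D$ for the corrected traces, $\Theta$ for the discretized set of hole-variable values, $m$ for the number of grid points per hole variable, and $K$ for the number of hole variables.

```latex
% Assumed notation (not the paper's): \theta = hole variables,
% D = corrected traces, \Theta = discretized set of hole-variable values.
p(\theta \mid D)
  = \frac{p(D \mid \theta)\, p(\theta)}
         {\sum_{\theta' \in \Theta} p(D \mid \theta')\, p(\theta')},
\qquad
|\Theta| = m^{K}
% m grid points per hole variable, K hole variables
```

Under these assumptions the sum in the denominator has $m^{K}$ terms, which is the exponential growth in the number of hole variables that the review refers to.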

Review 2

===Detailed review===

Overall, the paper is well-written and clear. The paper makes a timely and relevant contribution to the field of robotics and human-robot interaction, where an active field of research is the effective and efficient creation and repair of robot programs, lowering the load on experts and shifting more towards non-experts to provide input and feedback. The contributions are clear and the motivation well explained. Overall, the methodology seems sound and the evaluation is thorough, including simulated and human experiments, as well as including changes in environment and noisy input. Please find my detailed comments below.

===Major===

- The paper describes three different ‘social robot tasks’ that consist of storytelling, a neck exercise, and open Q&A. Although the three tasks differ, they all require the same type of user feedback (‘go back’ or ‘next’). It would be great if the paper could discuss this approach for more complex situations and how they affect its effectiveness. For example, what if there are more than three potential actions to choose from (stop, wait, read), and ‘go back’ does not necessarily indicate that the alternative to action X is Y? There may be many alternatives, and it may even become infeasible for a human to keep providing feedback until the robot has learned the correct transition, because it becomes too time-consuming and the user may be reluctant to provide feedback and/or use the technology. Since the paper focuses on social robot actions specifically, I think it is crucial to at least touch upon this in a discussion in the paper. Including a Discussion section that covers any drawbacks or potential avenues for future work would improve the paper even more.
- Similarly, it would be interesting to discuss the limitation of incorrect user feedback, whether intentional or not. For example, one may turn the head to the left while the robot said right, but not notice that it was the incorrect side (left and right are often mixed up by people). In this task, maybe the consequences are negligible; however, in a situation where the error may propagate further through the interaction, it may be worth discussing how to best handle this. The case of humans who did not fix incorrect transition errors is briefly described in IV.C.4 but not discussed in depth.
- Section IV.C 4) Results: the before-repair percentage overlaps as reported in the text seem incorrect. I think the values for iter4_open and iter4_neck are accidentally switched, given the numbers shown in Fig. 6 (left). Correct this, and maybe (this is minor) the paper could instead report that mean_iter1_neck increased from 0.33 (SD = 0.08) to 0.36 (SD = 0.08), and mean_iter1_open decreased from 0.80 (SD = 0.30) to 0.57 (SD = 0.21), for clarity of reading. Currently, the values are a bit tricky to read in the way they are presented.
- One concern I have is that the sample of people used in the experiment (N=10) is small. It is therefore reasonable not to perform statistical tests; however, the results are reported in a strong manner, talking about strict increases and decreases. I think it is fair to say that this trend is observed, but the paper should tone down the results a bit to align them with the evidence actually presented, so that readers do not interpret the results with too much confidence.
- The conclusion introduces sudden new directions for future work that are not mentioned before, which is not the goal of a conclusion. A conclusion section should summarize what has been discussed in the paper. Maybe it could be called a "conclusion and future work" section, but it should then talk about avenues for future work in more detail.

===Minor===

- In Section III.A. Program Execution, there is an unfinished sentence that needs to be changed: "XXX describe how variables are samples for execution???" Probably due to the unfinished paragraph, the goal of this subsubsection is not clear; maybe it can be included in another section rather than be separate.
- In Fig. 4: it could increase clarity to mention ‘dotted line’ in the figure's caption when talking about ‘missing transition’ errors.
- In Section III.B, the paper mentions that speech commands can be used as an alternative to the ‘next’ and ‘go back’ buttons; however, one must then consider that speech is itself a more error-prone input mechanism (e.g., speech recognition errors).
- Very minor (typo): Section III.B, last paragraph: "To that end the use of interaction repair mechanisms in the state trace need to be converted to state corrections that are user by the repair algorithm": "are user" should be "are used".
- Very minor: sort the reference numbers, for example [35, 10, 2] should be [2, 10, 35].
- In Section III.C the paper describes some details about the implementation. I would like to ask for clarification on whether the code will be made available to promote reuse and continuation of this work by other researchers. If the authors choose not to release the code, a motivation for this decision would be appreciated.
- IV.B 1) procedure: "IterativeBayesRepair (Alg. 2)" should be (Alg. 3).
- Same paragraph IV.B 1) procedure: "The noisy state trace was computed by adding uniform noisy of [-2,2] to every state changes in the ground truth …": ‘noisy of [-2,2]’ should use ‘noise’, and ‘every state changes’ should be ‘every state change’.
- Same paragraph IV.B 1) procedure: "(…) on observing the two first time users" should be "observing two first time users". The paper does not mention them before, so it should not say ‘the users’.
- Paragraph IV.B 2) measures: "… being sensitive to the length of the interaction of which we control..." should be "which we control". Also, "we measured the speed of (…) without break in-between iterations": should this be ‘breaking in-between iterations’, ‘breaks in between iterations’, or ‘a break in between iterations’?
- IV.B 3) results: "algorithbms" typo.
- Why was the storytelling task not included in the evaluation?
- Why was the number of repair iterations set to four? Was it clear beforehand that this number was enough?
- Q&A is capitalized in some places but not in others.
- Section IV.C: "Over the four iterations, The transition parameters (…)": ‘The’ should not be capitalized.
- Section IV.C.5: "(...) we conducted an experiment involving one human user who as a participant (…)" should be "who was a participant" (typo).
- Section IV.C.5: "(…) i.e., the quite room (…)" should be "quiet room" (typo). Also in TABLE I, change ‘quite’ to ‘quiet’.
- Section V: "the goal of our research is motivating by…" should be "motivated" (typo).
- Section V: "(..) exploring the interactive system for keep the programmer (…)" should be "for keeping the programmer" (typo).