#46918 | AsPredicted

'Uncertain XAI - Aug 2020'
(AsPredicted #46918)

Created:       08/30/2020 12:33 AM (PT)

This is an anonymized version of the pre-registration.  It was created by the author(s) to use during peer-review.
A non-anonymized version (containing author names) should be made available by the authors when the work it supports is made public.

1) Have any data been collected for this study already?
No, no data have been collected for this study yet.

2) What's the main question being asked or hypothesis being tested in this study?
We will investigate how two techniques for communicating explainable AI (XAI) attribution uncertainty – showing it or suppressing it – affect user trust and decision quality when using the XAI for a decision task. Suppressing involves adjusting attribution scores to reduce their uncertainty.

Participants will be presented with a wine quality inspection scenario in which they estimate the wine quality score as a number rating and decide whether to accept (score >50) or reject each wine. This will be repeated for 15 randomly ordered trials per condition (version of the XAI). Each participant will be exposed to two of the five conditions, randomly allocated within-subjects.

We hypothesize the following orderings of the dependent variables across XAI conditions (lowest to highest):
1. Decision time: Baseline < Suppress < Show < ShowSuppress
2. Decision closeness: Show < Baseline < Suppress <= ShowSuppress
3. Decision quality: Baseline <= Show < Suppress <= ShowSuppress
4. Decision uncertainty: Suppress <= ShowSuppress < Baseline < Show
5. Trust of system: Show < Baseline < Suppress < ShowSuppress
6. Confidence in decision: Show < Baseline < Suppress <= ShowSuppress
7. Helpfulness of system overall: Baseline < Show < Suppress < ShowSuppress

We do not have specific hypotheses regarding individual system features and will compare them in exploratory analyses.

3) Describe the key dependent variable(s) specifying how they will be measured.
Per trial, we will ask:
1. Decision time: measured automatically as the participant rates and decides for each wine trial (analyzed as, e.g., log(decision time)).
(2,3). Wine quality score value: numeric text entry.
(4). Wine quality score uncertainty estimate: numeric text entry.
Measures numbered in brackets are used to compute the dependent variables used in hypothesis testing.

We will compute:
2. Decision closeness: similarity (e.g. exp(-distance)) between the participant’s wine quality score estimate and the XAI system’s score value.
3. Decision quality: similarity (e.g. exp(-distance)) between the participant’s wine quality score estimate and the ground truth score.
4. Decision uncertainty: ratio of participant’s score uncertainty to the XAI system’s score uncertainty.
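The computed dependent variables above can be sketched directly from their definitions. This is a minimal illustration; the exp(-distance) similarity is taken verbatim from the plan, and the unscaled absolute distance is an assumption (the pre-registration does not specify a scaling).

```python
import math

def closeness(participant_score, system_score):
    """Decision closeness: similarity between the participant's score
    estimate and the XAI system's score (exp(-distance), per the plan)."""
    return math.exp(-abs(participant_score - system_score))

def quality(participant_score, ground_truth_score):
    """Decision quality: similarity between the participant's score
    estimate and the ground-truth score."""
    return math.exp(-abs(participant_score - ground_truth_score))

def decision_uncertainty(participant_unc, system_unc):
    """Decision uncertainty: ratio of the participant's score
    uncertainty to the XAI system's score uncertainty."""
    return participant_unc / system_unc
```

Note that exp(-distance) maps a distance of 0 to similarity 1 and decays toward 0 as the estimates diverge, so larger values always mean "closer".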

After each condition, we will ask:
5. Trust of system: self-reported 7-point Likert scale (Strongly Disagree to Strongly Agree).
6. Confidence in decision: self-reported on a 7-point Likert scale (Strongly Disagree to Strongly Agree).
7. Helpfulness of system overall: self-reported 7-point Likert scale (Strongly Disagree to Strongly Agree).
8. Helpfulness of each system feature (6–8 features, depending on condition): self-reported 7-point Likert scale (Strongly Disagree to Strongly Agree).

4) How many and which conditions will participants be assigned to?
Our experiment has 5 conditions:
1. (None) AI system without explanation (non-X).
2. (Baseline) XAI system not showing or suppressing attribution uncertainty.
3. (Show) XAI system showing attribution uncertainty.
4. (Suppress) XAI system suppressing attribution uncertainty, without showing it.
5. (ShowSuppress) XAI system showing the suppressed attribution uncertainty.

Each participant will use two of the five versions (conditions) in a within-subjects experiment design. We will recruit 50 participants per condition; since each participant covers two conditions, this totals 5*50/2 = 125 participants.
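The sample-size arithmetic can be checked directly: the product of conditions and participants per condition counts condition-slots, and each participant fills two of them.

```python
# Sample-size arithmetic for the within-subjects allocation.
conditions = 5
participants_per_condition = 50
conditions_per_participant = 2

# Total condition-slots to fill, divided by the slots each participant covers.
total_participants = (conditions * participants_per_condition
                      // conditions_per_participant)  # 125
```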

5) Specify exactly which analyses you will conduct to examine the main question/hypothesis.
For per-trial dependent variables (decision time, decision quality, decision uncertainty, decision closeness), we will fit a multivariate linear mixed effects model, with XAI technique (5 conditions) as fixed effect, and participant ID and trial ID as random effects.

For per-condition dependent variables (trust, confidence, helpfulness), we will fit a multivariate linear mixed effects model, with XAI technique (5 conditions) as fixed effect, and participant ID as random effect.

We will perform post-hoc contrast t-tests to specifically test our hypotheses and to examine unanticipated differences, applying Bonferroni correction to control Type I error.

6) Describe exactly how outliers will be defined and handled, and your precise rule(s) for excluding observations.
Participant inclusion criteria:
- Amazon Mechanical Turk workers,
- located in the United States,
- with an acceptance rate >97% across at least 5000 HITs,
- who pass >= 5/7 screening comprehension questions after a tutorial session.

Data exclusion:
- Remove per-trial responses with (statistical) outlier completion times (too long/short).
- Remove near-identical responses across trials and questions, which suggest participants rushed through without answering carefully.
- Remove per-condition responses from participants who fail the pre-condition comprehension questions (≤1/3 correct).
- Remove per-trial responses with inconsistent answers (e.g. accepting a wine but giving a score below 50).
- Remove participants who wrote meaningless or nonsensical free text answers.

7) How many observations will be collected or what will determine sample size?
No need to justify decision, but be precise about exactly how the number will be determined.

Sample size: 125 participants who pass the inclusion criteria and whose data are not excluded after cleaning.

8) Anything else you would like to pre-register?
(e.g., secondary analyses, variables collected for exploratory purposes, unusual analyses planned?)

To check for effects due to risk aversion, we will ask self-reported questions about coping with uncertainty (7-point Likert scale on agreement):
1. Uncertainty stops me from having a firm opinion.
2. The smallest doubt can stop me from acting.
3. I must get away from all uncertain situations.
4. When I feel uncertain, I try to take decisive steps to clarify the situation.
5. When I feel uncertain about something, I try to rationally weigh up all the information I have.
6. When uncertain, I act very cautiously until I have more information about the situation.

To check for effects due to uncertainty handling strategies, we will ask self-reported questions about decision making strategy under uncertainty (7-point Likert scale on frequency):
1. Use the currently known most likely estimation.
2. Use the uncertainty to estimate the best and worst cases.
3. Find out more information (by reading more, asking others, etc.) to estimate better or reduce the uncertainty.

To measure the representativeness of the participants, we will ask about demographic details:
1. Age: number text
2. Gender: male, female, or other
3. Education level: multiple choice
4. Ethnicity: multiple choice
5. Employment status: multiple choice
6. Occupation (which industry): multiple choice