#183,994 | AsPredicted

'Investigating the effect of GPT4-generated feedback on learning from videos'


Author(s)
This pre-registration is currently anonymous to enable blind peer-review.
It has 5 authors.
Pre-registered on
2024/07/23 04:17 (PT)

1) Have any data been collected for this study already?
No, no data have been collected for this study yet.

2) What's the main question being asked or hypothesis being tested in this study?
This study investigates the effect of AI-generated summaries and AI-generated questions on learning outcomes in the context of educational videos. Participants will watch a video lecture on chemistry and answer retention and transfer questions. They will be instructed to pause the video at points of comprehension difficulty, where they will be provided with either a GPT-4-generated summary or an open-ended question. The content of the feedback covers the interval between two consecutive pauses.

Based on previous work (Wang et al., 2023), we expect that:
1) GPT summaries will be more effective for learning outcomes in the retention test
2) GPT questions will be more effective for transfer
3a) Mental effort and 3b) perceived task difficulty will be rated higher in the question condition due to the greater amount of generative activity

Additionally, we will examine the role of personal characteristics in pause frequency and in learning outcomes. In our study, pause frequency serves as an indicator of AI usage in an experimental context.
In a collaborative human-ChatGPT writing task, Luther et al. (2024) found positive correlations between technology affinity (TA) and the frequency of prompts for complete texts; that is, AI usage in an experimental context was associated with TA. We expect similar results:
4a) technology affinity predicts pause frequency (higher TA ~ more frequent pausing)
In other studies, willingness to use AI was predicted by people's risk and benefit perceptions (e.g., Schwesig et al., 2023). Based on these findings, we expect that:
4b) risk perception predicts pause frequency (higher RP ~ less frequent pausing)
4c) benefit perception predicts pause frequency (higher BP ~ more frequent pausing)
As an exploratory research question, we will examine whether technology affinity, risk perception, benefit perception and pause frequency predict learning results for retention and/or transfer outcomes.

3) Describe the key dependent variable(s) specifying how they will be measured.
The first four dependent variables relate to the experimental manipulation (questions vs. summaries):
DV 1: Retention-learning score. This variable is the difference between the pre- and post-test scores of a knowledge test consisting of 24 true-false questions (closed-ended).
DV 2: Transfer. Five open-ended transfer questions will be asked after participants have watched the video. Each question can be scored with up to three points, yielding a maximum score of 15 points.
DV 3: Mental effort. This variable will be measured after the video with a subjective rating scale (Paas, 1992; 9-point scale, ranging from 1 (very low) to 9 (very high)).
DV 4: Perceived task difficulty. This variable will be measured after the video with a subjective rating scale (Wang et al., 2023; 9-point scale, ranging from 1 (very easy) to 9 (very difficult)).

Regarding personal characteristics, we will measure three additional constructs:
DV 5: Technology affinity. We will measure this variable with the German version of the Affinity for Technology Interaction (ATI) Scale (Franke, Attig, & Wessel, 2018; 6-point scale, ranging from 1 (completely disagree) to 6 (completely agree)). Participants will be instructed to indicate the degree to which they agree or disagree with nine statements.
DV 6: Risk perception. We will measure this variable with the same six items used by Said et al. (2023), who translated items from the risk perception scale proposed by Walpole and Wilson (2020) into German and adapted them to measure risk perception of AI in specific contexts. The participants in our study will be instructed to rate three questions regarding the affect dimension and three questions regarding the susceptibility dimension of risk perception (thinking about the use of AI in learning contexts). Participants give their answers on a 5-point scale (not at all, slightly, moderately, very, extremely).
DV 7: Benefit perception. We will measure this variable with the same six items used by Said et al. (2023), who translated items from the risk perception scale proposed by Walpole and Wilson (2020) into German and adapted them to measure benefit perception of AI in specific contexts. The participants in our study will be instructed to rate three questions regarding the affect dimension and three questions regarding the susceptibility dimension of benefit perception (thinking about the use of AI in learning contexts). Participants give their answers on a 5-point scale (not at all, slightly, moderately, very, extremely).

4) How many and which conditions will participants be assigned to?
The experiment has one independent variable (type of feedback) with two levels (AI-generated summary vs. AI-generated question), manipulated between subjects. Participants will be randomly assigned to one of the two conditions.

5) Specify exactly which analyses you will conduct to examine the main question/hypothesis.
To test hypotheses 1), 2), 3a) and 3b), we will run four separate t-tests comparing the two conditions on dependent variables 1-4 (retention-learning score, transfer, mental effort and perceived task difficulty).
To test hypothesis 4a) regarding personal characteristics, we will use a simple regression model with pause frequency as criterion and technology affinity as predictor. To test hypotheses 4b) and 4c), we will then successively add risk perception and benefit perception as predictors and test whether the extended multiple regression models explain significantly more variance in pause frequency.
To examine the exploratory research question, we will use two additional regression models, one with retention outcomes and one with transfer outcomes as the criterion. The predictor structure is otherwise the same as described above.
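As an illustration of the planned analyses (not the pre-registered analysis scripts themselves, which will depend on the actual data export), the t-tests and the stepwise regression comparison could be sketched in Python as follows. All data here are simulated and all variable names are hypothetical:

```python
# Sketch of the planned confirmatory analyses on simulated data.
# Column/variable names (retention, pauses, ati, risk, benefit) are illustrative.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(42)
n = 120  # per condition, illustrative only

# One of the four planned t-tests (here H1: retention-learning score)
summary_grp = rng.normal(14, 4, n)   # AI-summary condition
question_grp = rng.normal(12, 4, n)  # AI-question condition
t_stat, p_val = stats.ttest_ind(summary_grp, question_grp)

# Hierarchical regression for H4a-4c: predictors added successively
df = pd.DataFrame({
    "pauses":  rng.poisson(8, 2 * n),
    "ati":     rng.normal(3.5, 0.9, 2 * n),
    "risk":    rng.normal(2.5, 0.8, 2 * n),
    "benefit": rng.normal(3.0, 0.8, 2 * n),
})
m1 = smf.ols("pauses ~ ati", data=df).fit()                   # H4a
m2 = smf.ols("pauses ~ ati + risk", data=df).fit()            # H4b
m3 = smf.ols("pauses ~ ati + risk + benefit", data=df).fit()  # H4c
# F-tests: does each extended model explain significantly more variance?
comparison = anova_lm(m1, m2, m3)
```

The `anova_lm` comparison implements the pre-registered "explains significantly more variance" test as an incremental F-test between nested models.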

6) Describe exactly how outliers will be defined and handled, and your precise rule(s) for excluding observations.
Participants who do not meet the inclusion criteria (German-speaking, over 18 years old, education level high-school diploma or above, job or subject of study not related to chemistry) cannot participate in the study.
Not pausing: The experimental manipulation only works if participants pause the video. Participants who do not press pause at least once will be excluded. Participants who pause more than 50 times will also be excluded, because the summaries become too fragmented for technical reasons.
Not answering any questions in the question condition: Participants in the question condition will be excluded if they do not answer at least one question. This ensures that participants actively engage with the questions. While participants may omit individual questions (e.g., due to a lack of knowledge), an answer only counts as given if it contains at least one word, to ensure genuine engagement.
Attention checks (according to Prolific guidelines): Before watching the video, participants have to indicate their agreement with an absurd statement to ensure that they are attentive. At the end of the experiment, participants will be asked a question that must be read in full to answer correctly. Participants who fail both checks will be excluded.
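The exclusion rules above reduce to a single row filter over the participant data. A minimal sketch, assuming hypothetical column names (the actual export format may differ):

```python
# Sketch of the pre-registered exclusion rules as a pandas filter.
# All column names are illustrative, not the actual data export format.
import pandas as pd

df = pd.DataFrame({
    "n_pauses":       [0, 5, 60, 12],
    "condition":      ["question", "question", "summary", "summary"],
    "n_answered":     [0, 3, 0, 0],  # answers with >= 1 word (question condition)
    "failed_check_1": [False, False, True, True],
    "failed_check_2": [False, False, True, False],
})

keep = (
    df["n_pauses"].between(1, 50)                                   # paused 1-50 times
    & ~((df["condition"] == "question") & (df["n_answered"] == 0))  # engaged with questions
    & ~(df["failed_check_1"] & df["failed_check_2"])                # did not fail both checks
)
included = df[keep]
```

In this toy frame, rows 0 (never paused) and 2 (more than 50 pauses, both checks failed) are dropped; rows 1 and 3 remain.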

7) How many observations will be collected or what will determine sample size?
No need to justify decision, but be precise about exactly how the number will be determined.

Experiment 3 of Wang et al. (2023) showed an effect size of Cohen's d = 0.83. A power analysis conducted by simulating the experiment resulted in 24 participants per group, i.e., 48 participants in total. For the analyses of the hypotheses regarding personal characteristics, we ran a power simulation in R based on data from Luther et al. (2024). In a collaborative human-ChatGPT writing scenario, they found a correlation of r = 0.19 between ATI score (SD = 0.91) and the frequency of prompts for complete texts. Expecting a similar effect size, we need at least 214 participants to reach a power of 80%.
To account for participants who might drop out or be excluded, we plan to stop data collection upon reaching 240 participants.
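The R power simulation itself is not part of this document. As an illustration only, a minimal simulation-based power analysis for a correlation of r = .19 could look like this in Python (function name and defaults are illustrative):

```python
# Sketch of a simulation-based power analysis for detecting r = .19
# (the effect size from Luther et al., 2024) with a Pearson correlation test.
import numpy as np
from scipy import stats

def power_for_r(n, r=0.19, alpha=0.05, n_sims=2000, seed=1):
    """Estimate power by simulating bivariate normal samples of size n."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, r], [r, 1.0]]
    hits = 0
    for _ in range(n_sims):
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        if stats.pearsonr(x, y)[1] < alpha:  # [1] is the two-sided p-value
            hits += 1
    return hits / n_sims

power_214 = power_for_r(214)  # approximately .80 for r = .19
power_50 = power_for_r(50, seed=2)
```

With n = 214 the simulated power lands near the targeted 80%, while a much smaller sample (e.g., n = 50) falls well short, which is the logic behind the stopping rule of 240 participants after accounting for exclusions.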

8) Anything else you would like to pre-register?
(e.g., secondary analyses, variables collected for exploratory purposes, unusual analyses planned?)

For exploratory purposes, we additionally plan to analyse the total time spent on the questions/summaries, the confidence measures (5-point scale), the reasons for stopping the video, the number of words in the questions/summaries, and the timestamps of the pauses in relation to the time intervals that the retention questions refer to. We will also investigate how often GPT-4-generated summaries and questions contain hallucinations and how these impact learning.

References
Franke, T., Attig, C., & Wessel, D. (2018). A personal resource for technology interaction: Development and validation of the Affinity for Technology Interaction (ATI) scale. International Journal of Human-Computer Interaction, 35(6), 456–467. https://doi.org/10.1080/10447318.2018.1456150
Luther, T., Kimmerle, J., & Cress, U. (2024). Teaming up with an AI: Exploring human-AI collaboration in a writing scenario with ChatGPT. OSF Preprints. https://doi.org/10.31219/osf.io/extmc
Paas, F. G. W. C. (1992). Training strategies for attaining transfer of problem-solving skill in statistics: A cognitive-load approach. Journal of Educational Psychology, 84(4), 429–434. https://doi.org/10.1037/0022-0663.84.4.429
Said, N., Potinteu, A. E., Brich, I., Buder, J., Schumm, H., & Huff, M. (2023). An artificial intelligence perspective: How knowledge and confidence shape risk and benefit perception. Computers in Human Behavior, 149, 107855. https://doi.org/10.1016/j.chb.2023.107855
Schwesig, R., Brich, I., Buder, J., Huff, M., & Said, N. (2023). Using artificial intelligence (AI)? Risk and opportunity perception of AI predict people's willingness to use AI. Journal of Risk Research, 26(10), 1053–1084. https://doi.org/10.1080/13669877.2023.2249927
Walpole, H. D., & Wilson, R. S. (2020). Extending a broadly applicable measure of risk perception: The case for susceptibility. Journal of Risk Research, 24(2), 135–147. https://doi.org/10.1080/13669877.2020.1749874
Wang, Y., Wang, F., Mayer, R. E., Hu, X., & Gong, S. (2023). Benefits of prompting students to generate summaries during pauses in segmented multimedia lessons. Journal of Computer Assisted Learning, 39(4), 1259–1273. https://doi.org/10.1111/jcal.12797

Version of AsPredicted Questions: 2.00