
Model. Anal. Inform. Sist., 2025, Volume 32, Number 4, Pages 396–416 (Mi mais858)

Artificial intelligence

The impact of different prompt types on the quality of automatic assessment of student answers by artificial intelligence models

I. A. Meshcheryakov, N. S. Lagutina

P.G. Demidov Yaroslavl State University, Yaroslavl, Russia

Abstract: Artificial intelligence (AI) models can fully or partially automate the assessment of student assignments, making assessment methods more accurate and objective. The performance of such models depends not only on the underlying algorithms and training data but also on how effectively the queries (prompts) submitted to them are formulated. The aim of this work is to investigate the possibility of using openly available AI models to evaluate students' answers for compliance with a teacher's reference answer, and to improve the quality of this task through prompt engineering. Quality was measured via statistical characteristics of the results of classifying answer texts into four categories (correct, partially correct, incorrect, off-topic) by generative AI (GAI) models under the following prompt variants: a simple prompt, a role prompt, a chain-of-thought prompt, and a prompt generated by an AI model. Openly available models were selected for the study: ChatGPT o3-mini, DeepSeek V3, Mistral-Small-3.1-24B-Instruct-2503-IQ4_XS, and Grok 3. The models were tested on a corpus of student texts collected by teachers of P.G. Demidov Yaroslavl State University, consisting of 507 answers to 8 questions. The best assessment quality was shown by the ChatGPT o3-mini model with the prompt it generated itself: accuracy was 0.82, the mean squared error (MSE) was 0.2, and the F-score reached 0.8, demonstrating the potential of GAI not only as an assessment tool but also as a means of automatically generating instructions. Fleiss' kappa was used to assess the consistency of the model's responses across 10 identical queries; for this model-prompt pair it ranged from 0.48 for complex questions to 0.69 for simple questions.
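
The evaluation scheme in the abstract (four prompt variants, plus a consistency check via Fleiss' kappa over 10 repeated gradings) can be sketched in Python as follows. This is a minimal illustration only; the prompt texts, names, and toy data below are assumptions and are not taken from the paper.

from collections import Counter

# Four answer categories used in the study, per the abstract.
CATEGORIES = ["correct", "partially correct", "incorrect", "off-topic"]

# Hypothetical stand-ins for the four prompt variants compared in the paper.
PROMPTS = {
    "simple": "Grade the student answer against the reference answer.",
    "role": "You are an experienced university teacher. Grade the answer.",
    "chain_of_thought": "Compare the answer with the reference step by step, "
                        "then assign a category.",
    "ai_generated": "<instruction text produced by the model itself>",
}

def fleiss_kappa(ratings):
    """Fleiss' kappa for N items, each rated n times (here n = 10
    identical queries to one model, treated as independent raters)."""
    n = len(ratings[0])                          # ratings per item
    counts = [Counter(item) for item in ratings]
    # Per-item agreement P_i = (sum_j n_ij^2 - n) / (n * (n - 1)).
    P = [(sum(c * c for c in cnt.values()) - n) / (n * (n - 1))
         for cnt in counts]
    # Overall proportion p_j of each category across all ratings.
    total = len(ratings) * n
    p = [sum(cnt[cat] for cnt in counts) / total for cat in CATEGORIES]
    P_bar = sum(P) / len(P)                      # mean observed agreement
    P_e = sum(pj * pj for pj in p)               # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Toy data: three answers, each graded 10 times by the same model.
runs = [
    ["correct"] * 9 + ["partially correct"],
    ["incorrect"] * 10,
    ["partially correct"] * 6 + ["correct"] * 4,
]
print(f"Fleiss' kappa = {fleiss_kappa(runs):.2f}")   # ~0.62 on this toy data

Here the 10 repeated queries to one model play the role of 10 independent raters; the kappa values of 0.48 to 0.69 reported in the abstract can be read in this sense.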

Keywords: artificial intelligence, prompt engineering, automatic short answer grading, ChatGPT o3-mini, DeepSeek V3, Mistral-Small-3.1-24B-Instruct-2503-IQ4_XS, zero-shot prompting, neural networks, NLP, chain-of-thought, role prompting.

UDC: 004.891.3

MSC: 68T50

Received: 30.09.2025
Revised: 02.11.2025
Accepted: 18.11.2025

DOI: 10.18255/1818-1015-2025-4-396-416


