Abstract:
The paper analyses the significance of layers of large language models in a question-answering task. The study is conducted using the instruct versions of LLaMA-2-7B-Chat-GPTQ and Vicuna-7B-v1.5-GPTQ, the base version of Mistral-7B-v0.1-GPTQ, and the Russian-language question-answering dataset MuSeRC. The models are fine-tuned using the QLoRA method, which is based on adding adapters to different layers. The GPT-4o model, which showed high agreement with the annotator scores, is used to assess the quality of the answers. The results show that for the instruct models LLaMA and Vicuna the last four layers are significant, while for the base model Mistral the first four layers are significant. Moreover, models fine-tuned only on the last layer (LLaMA, Vicuna) or only on the first layer (Mistral) achieve the second-highest mean GPT-4o score among all tested model variants. Adding an adapter only to the last layer achieves higher quality than adding adapters to all 32 layers of the LLaMA and Vicuna models. Adding an adapter only to the first layer of the Mistral model shows the second-best result after the model with adapters on all layers.
Keywords: interpretation, role of layers in learning, large language models, question-answering task, MuSeRC, LLM-as-a-Judge.
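To make the layer-restricted fine-tuning setup concrete, the following is a minimal sketch of attaching LoRA adapters only to selected decoder layers (e.g., the last four of a 32-layer model) using the Hugging Face PEFT library. The model checkpoint name, rank, and target modules are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: LoRA adapters restricted to specific layers via PEFT.
# Checkpoint name and hyperparameters are illustrative, not the paper's setup.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GPTQ",  # hypothetical checkpoint id
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    layers_to_transform=[28, 29, 30, 31],  # only the last four of 32 layers
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only the selected layers are trainable
```

Changing `layers_to_transform` to `[0, 1, 2, 3]` (or `[0]` / `[31]`) would reproduce the first-layers and single-layer variants described in the abstract.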