Abstract:
Transformer-based models, especially BERT-like architectures, have become the standard for solving Natural Language Processing (NLP) tasks such as text classification, summarisation, and question answering. While their high performance is beyond doubt, interpretability remains a key challenge. Understanding the reasons behind model decisions is critical for building trust, detecting bias, and complying with ethical and legal standards. Existing explanation methods focus on identifying individual important tokens or interactions between adjacent tokens or token pairs, ignoring the global context. This limits their usefulness, because such explanations often fail to capture the decision-making logic at a level humans can understand. To bridge this gap, we introduce an approach that translates model predictions into natural language explanations. The algorithm is fitted by clustering transformer-layer representations; cluster labels are extracted, and indices are built to select close examples. The retrieved examples are fed into a Large Language Model (LLM) to identify their key common features in natural language. Frequency analysis of these features across the examples then forms the basis of evidence with an associated probability. In a case study on detecting machine-generated text, our approach reveals how classifiers may rely on stylistic cues or structural anomalies.
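The sketch below is an illustrative reading of the pipeline summarised in the abstract, not the paper's reference implementation. It assumes pooled transformer-layer representations (`hidden_states`), the corresponding input `texts`, and a hypothetical `query_llm` helper that returns a list of feature strings for a prompt; KMeans clustering and a nearest-neighbour index stand in for the clustering and example-selection steps.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors


def explain_clusters(hidden_states, texts, query_llm, n_clusters=8, k=5):
    """Build per-cluster natural-language explanations from layer representations."""
    # Step 1: cluster the pooled transformer-layer representations.
    km = KMeans(n_clusters=n_clusters, random_state=0).fit(hidden_states)
    explanations = {}
    for cluster_id in range(n_clusters):
        member_idx = np.where(km.labels_ == cluster_id)[0]
        # Step 2: index the cluster members and select the examples
        # closest to the cluster centroid.
        nn = NearestNeighbors(n_neighbors=min(k, len(member_idx)))
        nn.fit(hidden_states[member_idx])
        _, local = nn.kneighbors(km.cluster_centers_[cluster_id].reshape(1, -1))
        close_examples = [texts[i] for i in member_idx[local[0]]]
        # Step 3: ask the LLM to name the key features of each retrieved example
        # (query_llm is a placeholder for any LLM API wrapper).
        feature_lists = [
            query_llm("List the key stylistic and structural features of this text:\n" + t)
            for t in close_examples
        ]
        # Step 4: frequency analysis — a feature mentioned for m of the
        # retrieved examples receives probability m / len(close_examples).
        counts = Counter(f for feats in feature_lists for f in set(feats))
        explanations[cluster_id] = {
            feat: count / len(close_examples) for feat, count in counts.most_common()
        }
    return explanations
```

In use, `hidden_states` might come from mean-pooling a chosen BERT layer over an evaluation set, and the returned feature-to-probability dictionaries correspond to what the abstract calls evidence with an associated probability.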