Mat. Model., 2025 Volume 37, Number 2, Pages 185–198 (Mi mm4606)

Model of statistical dependence of sample distributions using nearest neighbor graphs and Kullback–Leibler divergence

A. A. Kislitsyn

Keldysh Institute of Applied Mathematics of RAS

Abstract: The paper presents the results of numerical modeling of the distribution of nearest neighbor graphs by the number of connected components, and of vertices by degree, for the case in which the distances between objects are not symmetric. The objects are discrete probability distributions, and the distances between them are computed as Kullback–Leibler divergences. Estimates of the probability that a set of probability distributions is formed by statistically dependent objects are obtained numerically. The numerical algorithm for constructing a benchmark for the probabilities of realizing a particular structure of the nearest neighbor graph relies on the fact that, up to isomorphism, this structure does not depend on the distribution of distances between objects. An algorithm is described for collecting sample statistics of nearest neighbor graphs for arbitrary random asymmetric matrices whose elements are treated as distances. As an example, the paper analyzes big-data distributions obtained by automatic processing of a corpus of more than 100 thousand literary texts in Russian by 8.5 thousand authors. The corpus is structured by author in the form of $n$-grams of letter combinations. The example is of interest because the $n$-gram distributions make it possible to identify the author of a single text accurately, so the reference author distributions can be regarded as a basis. In the exact sense of linear algebra, the vectors of the authors' reference distributions are linearly independent. At the same time, these vectors turned out to be statistically dependent with probability almost equal to 1, which allows additional structuring of the data array. A comparison was also made with results obtained in the L1 histogram norm for the same distributions; it showed that, in this example, the benchmark for asymmetric distances yields an answer at a higher confidence level.
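The core construction the abstract describes, a directed nearest neighbor graph built from an asymmetric Kullback–Leibler distance matrix, with statistics over connected components and vertex degrees, can be sketched as follows. This is a minimal illustration of the general technique, not the author's implementation; the function names and the epsilon smoothing of zero probabilities are assumptions of this sketch.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) for discrete distributions.
    Asymmetric in (p, q); eps guards against zero probabilities (an
    assumption of this sketch, not specified in the paper)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def nearest_neighbor_graph(dist):
    """Each vertex i gets one directed edge to argmin_{j != i} dist[i, j].
    Because dist may be asymmetric, i -> j does not imply j -> i."""
    n = dist.shape[0]
    nn = np.empty(n, dtype=int)
    for i in range(n):
        row = dist[i].astype(float).copy()
        row[i] = np.inf          # exclude the self-distance
        nn[i] = int(np.argmin(row))
    return nn

def component_count(nn):
    """Number of weakly connected components of the NN graph
    (each directed edge i -> nn[i] is treated as undirected),
    via a small union-find."""
    n = len(nn)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i in range(n):
        ri, rj = find(i), find(int(nn[i]))
        if ri != rj:
            parent[ri] = rj
    return len({find(i) for i in range(n)})

def degree_distribution(nn):
    """In-degrees of the NN graph: how often each vertex is chosen
    as someone's nearest neighbor."""
    return np.bincount(nn, minlength=len(nn))
```

Given $m$ sampled distributions, one would fill an $m \times m$ matrix with `kl_divergence(p_i, p_j)`, pass it to `nearest_neighbor_graph`, and accumulate the component and degree statistics over many random samples to build the benchmark the abstract refers to.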

Keywords: nearest neighbor graph, vertex degree distribution, clustering, Kullback–Leibler divergence, data dependence criterion, text author identification.

Received: 09.09.2024
Revised: 09.09.2024
Accepted: 14.10.2024

DOI: 10.20948/mm-2025-02-14



© Steklov Math. Inst. of RAS, 2026