a PJSC Sberbank, 32 Kutuzovskiy ave., Moscow, 121170, Russia
b MY.GAMES, 39/79 Leningradskiy ave., Moscow, 125167, Russia
Abstract:
This paper presents the results of an experimental study of several structural issues concerning the practical use of
methods for overcoming catastrophic forgetting in neural networks. Two current effective methods, EWC (Elastic Weight
Consolidation) and WVA (Weight Velocity Attenuation), are compared, and their advantages and disadvantages are
considered. It is shown that EWC is better suited to settings where the learned skills must be fully retained on every
task in the training queue, while WVA is more suitable for sequences of tasks under severely limited computational
resources, or when the goal is reuse of representations and acceleration of learning from task to task rather than exact
retention of skills. It is further shown that the attenuation in the WVA method must be applied to the optimization step,
i.e., to the increments of the neural network weights, rather than to the gradient of the loss function itself, and that
this holds for any gradient-based optimization method other than the simplest stochastic gradient descent (SGD).
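For plain SGD the two choices coincide, since the step is simply the gradient scaled by the learning rate; for momentum-based or adaptive optimizers such as Adam, the step is a nonlinear function of the gradient history, so attenuating the step is not equivalent to attenuating the gradient. Below is a minimal PyTorch sketch of step-level attenuation; the names (importance, lam) and the hyperbolic attenuation factor are illustrative assumptions rather than the paper's reference implementation.

    import torch

    def attenuated_step(params, optimizer, importance, lam=1.0):
        # Sketch: let the optimizer compute its usual step, then scale each
        # weight increment by an attenuation factor that shrinks with the
        # accumulated importance of that weight (assumed hyperbolic form).
        old = [p.detach().clone() for p in params]
        optimizer.step()  # for Adam etc., the step is not just -lr * grad
        with torch.no_grad():
            for p, p_old, imp in zip(params, old, importance):
                delta = p - p_old                  # the actual weight increment
                factor = 1.0 / (1.0 + lam * imp)   # elementwise attenuation
                p.copy_(p_old + factor * delta)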
The choice of the optimal weight attenuation function, between a hyperbolic function and an exponential one, is also
considered. Hyperbolic attenuation is shown to be preferable: although the two functions yield comparable quality at the
optimal value of the WVA hyperparameter (which balances the preservation of old skills against the learning of a new
skill), the hyperbolic function is more robust to deviations of this hyperparameter from its optimal value.
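For concreteness, the two candidate attenuation factors can be sketched as follows (the exact functional forms here are assumptions for illustration; the reported finding concerns their relative robustness to a mistuned hyperparameter):

    import torch

    def hyperbolic_attenuation(importance, lam):
        # Decays polynomially with accumulated importance, so the factor
        # changes gradually as lam moves away from its optimal value.
        return 1.0 / (1.0 + lam * importance)

    def exponential_attenuation(importance, lam):
        # Decays exponentially with accumulated importance; small changes in
        # lam can sharply change how strongly important weights are frozen.
        return torch.exp(-lam * importance)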
Finally, empirical observations are presented that support the hypothesis that the optimal value of this hyperparameter
does not depend on the number of tasks in the sequential learning queue; consequently, the hyperparameter can be tuned
on a small number of tasks and then reused on longer sequences.