Abstract:
Generating high-quality synthetic data is crucial for many data science tasks. A generated dataset can reduce the cost of augmenting existing data with additional instances, for example in physics, or help protect privacy, for instance in banking. However, generating a tabular dataset is challenging because the data contains both numerical and categorical features. In this paper, we investigate modern approaches to tabular data generation and evaluate whether several modifications of the state-of-the-art model affect the quality of the synthesized datasets. The modifications include using Gaussian diffusion models for both numerical and categorical features and adding Gaussian noise as regularization during training. Comprehensive experiments on five publicly available datasets, evaluated with tabular data generation quality metrics, show that the modified model retains synthesized-data quality similar to that of the original model while requiring less time to generate synthetic samples.