Optics and Spectroscopy. Laser Physics
Effect of low concentrations of hyaluronic acid on the structure of whey protein isolate during conjugation: Development and optimization of machine learning models based on adaptive boosting for spectroscopic data analysis
S. A. Shevtsova,
M. S. Saveleva,
O. A. Mayorova,
E. S. Prikhozhdenko
Saratov State University
Abstract:
Background and Objectives: Multicomponent mixtures with bioactive compounds, such as hyaluronic acid (HA) in protein matrices, are critical in pharmaceuticals, nutraceuticals, and cosmetics. However, detecting low-concentration additives (e.g., 0.1–0.5 wt.% HA in whey protein isolate, WPI) remains challenging due to signal interference and matrix complexity. Raman spectroscopy (RS) is a powerful tool for such analyses, but interpreting spectral data requires advanced computational methods. This study uses adaptive boosting (AdaBoost), an ensemble machine learning (ML) algorithm, to (1) classify WPI-HA mixtures by HA concentration, (2) quantify HA content via regression, and (3) determine the minimal training dataset size needed for robust predictions.
Materials and Methods: WPI (5 wt.%) was mixed with HA (0.1, 0.25, 0.5 wt.%) in saline, dialyzed, and dried into thin films. A Renishaw inVia spectrometer equipped with a 532 nm laser was used to collect 600 spectra per sample ($20 \times 30$-point maps). Preprocessing included cosmic-ray removal, baseline correction, and $L_2$ normalization. AdaBoost models (scikit-learn) were optimized via GridSearchCV (hyperparameters: DecisionTree max_depth, 1–3; n_estimators, 50–350). Performance was tested across training set sizes (50–500 spectra per sample). Metrics included accuracy (classification) and $R^2$/RMSE (regression).
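The tuning step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the spectra are synthetic (one fixed band plus one band that scales with HA content, both hypothetical), the helper names `make_synthetic_spectra`, `l2_normalize`, and `tune_adaboost` are invented for the example, cosmic-ray removal and baseline correction are omitted, and the `n_estimators` grid is cut down from the stated 50–350 range so that the sketch runs quickly.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier


def make_synthetic_spectra(n_per_class=60, n_points=50, seed=0):
    """Toy stand-in for Raman maps (assumption: not real WPI-HA data)."""
    rng = np.random.default_rng(seed)
    axis = np.arange(n_points)
    matrix_band = np.exp(-0.5 * ((axis - 35) / 3.0) ** 2)  # hypothetical fixed band
    ha_band = np.exp(-0.5 * ((axis - 15) / 3.0) ** 2)      # hypothetical HA-sensitive band
    concentrations = [0.0, 0.1, 0.25, 0.5]                 # wt.% HA, as in the abstract
    spectra, labels = [], []
    for label, conc in enumerate(concentrations):
        noise = rng.normal(0.0, 0.01, size=(n_per_class, n_points))
        spectra.append(matrix_band + conc * ha_band + noise)
        labels.append(np.full(n_per_class, label))
    return np.vstack(spectra), np.concatenate(labels)


def l2_normalize(spectra):
    """Scale each spectrum (row) to unit Euclidean norm."""
    return spectra / np.linalg.norm(spectra, axis=1, keepdims=True)


def tune_adaboost(X_train, y_train):
    """Grid-search tree depth and ensemble size for an AdaBoost classifier."""
    model = AdaBoostClassifier(DecisionTreeClassifier(), random_state=0)
    # The base-learner parameter is called "estimator" in recent scikit-learn
    # releases and "base_estimator" in older ones, so the name is looked up.
    prefix = "estimator" if "estimator" in model.get_params() else "base_estimator"
    param_grid = {
        prefix + "__max_depth": [1, 2, 3],  # depth range stated in the abstract
        "n_estimators": [50, 100],          # reduced from 50-350 for speed
    }
    search = GridSearchCV(model, param_grid, cv=3, scoring="accuracy")
    search.fit(X_train, y_train)
    return search


X, y = make_synthetic_spectra()
X = l2_normalize(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
search = tune_adaboost(X_train, y_train)
test_accuracy = search.score(X_test, y_test)  # accuracy of the refitted best model
print(search.best_params_, round(test_accuracy, 3))
```

The number of cross-validation folds (`cv=3`) and the 75/25 split are assumptions; the abstract does not state either. Widening `n_estimators` to the full 50–350 range only requires changing the grid list.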
Results: Optimization: the best AdaBoost hyperparameters were 325 decision trees with max_depth = 3. Classification: training on 50 spectra per sample gave 94.5% accuracy; 200 and 300 spectra per sample raised this to 97.9% and 98.3%, respectively. The models reliably distinguished WPI + 0.1% HA from WPI (>96% accuracy). Regression: 300 spectra per sample yielded the best results ($R^2$ = 0.910, RMSE = 0.061%). Larger datasets (400–500 spectra) reduced performance ($R^2$ = 0.894), suggesting overfitting. Key bands for analysis: 763 cm$^{-1}$ (tryptophan), 1003 cm$^{-1}$ (phenylalanine), and 1240 cm$^{-1}$ (amide III). Bands at 1450–1667 cm$^{-1}$ (C–H/amide I/II) showed negligible importance, indicating minimal HA-induced changes.
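As an illustration of how the regression metrics and band importances above might be obtained, the following sketch fits an AdaBoost regressor to synthetic spectra and reports $R^2$, RMSE, and the wavenumber with the highest feature importance. Everything about the data is assumed for the example: the wavenumber axis, the noise level, and a single band placed at 1003 cm$^{-1}$ whose height scales with HA content. The function names `simulate` and `evaluate` are hypothetical, and reading band relevance from `feature_importances_` is one plausible approach, not necessarily the authors'.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

WAVENUMBERS = np.arange(600.0, 1801.0, 10.0)  # cm^-1, hypothetical axis
CONCENTRATIONS = (0.0, 0.1, 0.25, 0.5)        # wt.% HA, as in the abstract


def simulate(n_per_conc, rng):
    """Synthetic spectra: one band near 1003 cm^-1 scaling with HA (assumption)."""
    band = np.exp(-0.5 * ((WAVENUMBERS - 1003.0) / 15.0) ** 2)
    X, y = [], []
    for conc in CONCENTRATIONS:
        noise = rng.normal(0.0, 0.03, size=(n_per_conc, WAVENUMBERS.size))
        X.append(conc * band + noise)
        y.append(np.full(n_per_conc, conc))
    return np.vstack(X), np.concatenate(y)


def evaluate(n_train_per_conc, n_test_per_conc=50, seed=0):
    """Train on a given number of spectra per concentration and score the model."""
    rng = np.random.default_rng(seed)
    X_train, y_train = simulate(n_train_per_conc, rng)
    X_test, y_test = simulate(n_test_per_conc, rng)
    model = AdaBoostRegressor(
        DecisionTreeRegressor(max_depth=3), n_estimators=50, random_state=0
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    r2 = r2_score(y_test, pred)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))  # same units as y (wt.%)
    # Wavenumber whose intensity contributed most to the ensemble's splits.
    top_band = float(WAVENUMBERS[np.argmax(model.feature_importances_)])
    return r2, rmse, top_band


r2, rmse, top_band = evaluate(n_train_per_conc=50)
print(f"R2={r2:.3f}  RMSE={rmse:.3f} wt.%  top band={top_band:.0f} cm^-1")
```

Calling `evaluate` in a loop over training sizes (for example 50 to 500 spectra per concentration) would mimic the training-set-size study; the synthetic numbers are not expected to reproduce the reported values.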
Conclusion: AdaBoost models efficiently analyze trace HA in WPI with small training datasets (200 spectra for classification, 300 for regression). The method's precision and speed make it well suited to industrial applications, while the identified spectral markers deepen the understanding of HA–protein interactions. Future work could extend this framework to other multicomponent systems with low analyte concentrations.
Keywords:
hyaluronic acid, whey protein isolate, Raman spectroscopy, adaptive boosting, machine learning, GridSearchCV, classification, regression.
UDC:
519.688:543.424.2
Received: 02.05.2025
Revised: 29.08.2025
Accepted: 12.06.2025
DOI:
10.18500/1817-3020-2025-25-3-305-315