ISSN: 2663-2187

Optimizing Density Functional Theory Calculations Using a Gradient Boosting Machine for Enhanced Predictive Accuracy


Y.P. Arul Teen, C. Justin Dhanaraj
DOI: 10.48047/AFJBS.6.13.2024.6875-6893

Abstract

Density Functional Theory (DFT) calculations are essential for understanding molecular properties and predicting biological activities based on quantum mechanical principles. Predictive performance is limited by the inability of conventional machine learning (ML) models, such as random forests and linear regression, to adequately represent the complex, nonlinear relationships present in DFT data. These methods often fail to account for the intricate dependencies between molecular descriptors and target variables, resulting in suboptimal accuracy. Additionally, traditional models are difficult to interpret in the context of DFT calculations, hindering the elucidation of structure-property relationships. This research proposes the application of Gradient Boosting Machines (GBM) for the predictive modelling of DFT calculations. GBM is an ensemble learning technique that enhances overall accuracy by combining the predictive power of several weak learners, such as decision trees. By iteratively fitting new models to the residuals of prior models, GBM captures complex nonlinear interactions in the data, making it well suited for analysing DFT calculations and predicting material properties with high accuracy. The research utilizes two datasets: tmQM, containing information on transition metal complexes, and ECD-cubic, focusing on the electronic charge density of inorganic materials with cubic structures. The GBM model is trained iteratively, with each new tree focusing on correcting the errors made by previous trees. The optimal tree size and learning rate are determined through grid search optimization. The model's performance is assessed using mean squared error (MSE), mean absolute error (MAE), and R-squared (R²) metrics. The GBM model demonstrates high accuracy and low error metrics, indicating its robust performance in capturing the complex relationships inherent in DFT data. For the tmQM dataset, the GBM model achieves a lower MSE and MAE on the testing set (0.018 and 0.092, respectively) than on the training set (0.021 and 0.105). Similarly, for the ECD-cubic dataset, the model exhibits a lower MSE and MAE on the testing set (0.029 and 0.118) than on the training set (0.035 and 0.132). The high R² values (0.995 for tmQM and 0.996 for ECD-cubic) indicate that the model explains a large proportion of the variance in the target properties, demonstrating its predictive power.
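As a minimal sketch of the workflow described in the abstract, the listing below trains a gradient boosting regressor, tunes it by grid search, and reports MSE, MAE, and R². It uses scikit-learn's GradientBoostingRegressor rather than the authors' exact implementation, and the feature matrix X (molecular descriptors) and target y (the DFT-computed property) are synthetic placeholders; the actual tmQM/ECD-cubic descriptors and the hyperparameter grid used in the study are not specified here, so the grid over the number of trees and the learning rate is an illustrative assumption.

    # Illustrative sketch only: placeholder data and hyperparameter grid,
    # not the study's actual descriptors or settings.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

    # Synthetic stand-in for descriptors extracted from tmQM or ECD-cubic.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))                                    # molecular/structural descriptors
    y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.1 * rng.normal(size=1000)   # hypothetical target property

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Grid search over the number of trees and the learning rate (assumed grid values).
    param_grid = {
        "n_estimators": [100, 300, 500],
        "learning_rate": [0.01, 0.05, 0.1],
    }
    search = GridSearchCV(
        GradientBoostingRegressor(random_state=0),
        param_grid,
        scoring="neg_mean_squared_error",
        cv=5,
    )
    search.fit(X_train, y_train)
    best_model = search.best_estimator_

    # Evaluate with the metrics reported in the study: MSE, MAE, and R².
    y_pred = best_model.predict(X_test)
    print("Best parameters:", search.best_params_)
    print("MSE:", mean_squared_error(y_test, y_pred))
    print("MAE:", mean_absolute_error(y_test, y_pred))
    print("R²:", r2_score(y_test, y_pred))

Because each new tree is fitted to the residuals of the current ensemble, a smaller learning rate generally needs more trees to reach the same accuracy, which is why the two hyperparameters are tuned jointly in the grid search.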
