BENCHMARK DATA GENERATION AND MACHINE LEARNING FOR MOLECULES AND MATERIALS
In this thesis, I discuss six projects that I participated in during my Ph.D. studies, with an emphasis on constructing benchmark databases and designing novel molecular descriptors for machine learning. These research projects are organized into the following six chapters: Chapter 1: Quantum mechanical static dipole polarizabilities in the QM7b and AlphaML showcase databases; Chapter 2: Accurate molecular polarizabilities with coupled-cluster theory and machine learning; Chapter 3: Predicting molecular dipole moments by combining atomic partial charges and atomic dipoles; Chapter 4: Machine learning molecular conformational energies using semi-local density fingerprints; Chapter 5: Range-Separated Hybrid Functional Pseudopotentials; and Chapter 6: On the Convergence Properties of a Separable Series Expansion of the Short-Range Coulomb Potential. In Chapter 1, we computed benchmark molecular properties using quantum chemical (QC) and density functional theory (DFT) methodologies based on the molecular geometries in the QM7b and AlphaML showcase databases. To compute molecular properties including the static dipole polarizability (α) tensor, dipole and quadrupole moments, and total energy components, we performed QC calculations with linear-response coupled-cluster theory including single and double excitations (LR-CCSD) and the d-aug-cc-pVDZ basis set. This allows us to properly treat electron correlation and mitigate basis set incompleteness error. For comparison, we also performed DFT calculations using the B3LYP and SCAN0 hybrid functionals, in conjunction with the d-aug-cc-pVDZ (B3LYP and SCAN0) and d-aug-cc-pVTZ (B3LYP) basis sets. The finite-field method was used to compute the α tensor in the DFT calculations. This work provides accurate and reliable molecular properties (especially the fundamental response property, α) for use in the development (and assessment) of next-generation force fields, density functionals, and quantum chemical methodologies, as well as machine-learning based approaches.
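As a rough illustration of the finite-field idea mentioned for the DFT calculations in Chapter 1, the sketch below estimates a polarizability tensor from central finite differences of the energy with respect to a uniform electric field, using α_ij = -∂²E/∂F_i∂F_j. It is a minimal sketch under stated assumptions: the function names, the step size, and the quadratic toy energy model (MODEL_ALPHA, MODEL_MU, model_energy) are all hypothetical stand-ins for an actual electronic structure calculation, and the thesis may differentiate dipoles rather than energies.

```python
def finite_field_polarizability(energy, step=1e-3):
    """Estimate the 3x3 static polarizability tensor from energies in a field.

    `energy` is any callable mapping a field [Fx, Fy, Fz] (atomic units) to a
    total energy. Uses alpha_ij = -d^2 E / dF_i dF_j with central differences.
    """
    def shifted(i, si, j=None, sj=0.0):
        # Energy with the field displaced by si*step along i (and sj*step along j).
        field = [0.0, 0.0, 0.0]
        field[i] += si * step
        if j is not None:
            field[j] += sj * step
        return energy(field)

    e0 = energy([0.0, 0.0, 0.0])
    alpha = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(i, 3):
            if i == j:
                # Second derivative along a single axis.
                d2 = (shifted(i, +1) - 2.0 * e0 + shifted(i, -1)) / step**2
            else:
                # Mixed second derivative from four displaced fields.
                d2 = (shifted(i, +1, j, +1) - shifted(i, +1, j, -1)
                      - shifted(i, -1, j, +1) + shifted(i, -1, j, -1)) / (4.0 * step**2)
            alpha[i][j] = alpha[j][i] = -d2
    return alpha


# Hypothetical toy model standing in for a real calculation:
# E(F) = E0 - mu.F - 1/2 F.alpha.F
MODEL_ALPHA = [[10.0, 0.5, 0.0], [0.5, 8.0, -0.3], [0.0, -0.3, 12.0]]
MODEL_MU = [0.1, 0.0, -0.2]


def model_energy(field):
    e = -40.0
    for i in range(3):
        e -= MODEL_MU[i] * field[i]
        for j in range(3):
            e -= 0.5 * MODEL_ALPHA[i][j] * field[i] * field[j]
    return e


alpha = finite_field_polarizability(model_energy)
```

For a real molecule the energy is not exactly quadratic in the field, so the step size has to balance truncation error from higher-order response against numerical noise in the computed energies.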
In Chapter 2, by using a symmetry-adapted machine-learning approach, we demonstrated that it is possible to predict the LR-CCSD molecular polarizabilities of the small molecules in the QM7b database with an error that is an order of magnitude smaller than that of hybrid density functional theory (DFT), at a negligible computational cost. The resultant model is robust and transferable, yielding molecular polarizabilities for a diverse set of 52 larger molecules from the AlphaML showcase database (including challenging conjugated systems, carbohydrates, small drugs, amino acids, nucleobases, and hydrocarbon isomers) at an accuracy that exceeds that of hybrid DFT. The atom-centered decomposition implicit in our machine-learning approach offers some insight into the shortcomings of DFT in the prediction of this fundamental quantity of interest. In Chapter 3, we represented molecular dipole moments (µ) with a physically inspired machine-learning (ML) model that captures two distinct physical effects: local atomic polarization is captured within the symmetry-adapted Gaussian process regression (SA-GPR) framework, which assigns a (vector) dipole moment to each atom, while movement of charge across the entire molecule is captured by assigning a partial (scalar) charge to each atom. The resulting "MuML" models are fitted together to reproduce molecular µ computed using high-level coupled-cluster theory (CCSD) and DFT on the QM7b dataset, achieving more accurate results due to the physics-based combination of these complementary terms. The combined model shows excellent transferability when applied to a showcase dataset of larger and more complex molecules, approaching the accuracy of DFT at a small fraction of the computational cost. We also demonstrated that the uncertainty in the predictions can be estimated reliably using a calibrated committee model.
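The two contributions described for Chapter 3 can be pictured with a small sketch: the total dipole is taken as the sum of a charge-separation term, Σ q_i r_i, and a sum of atom-centered dipoles, Σ µ_i. This is only an illustration of how such terms combine once per-atom quantities are available. It is not the MuML fitting procedure, and the function name and the two-atom example values are hypothetical.

```python
def molecular_dipole(positions, charges, atomic_dipoles):
    """Combine partial charges and atomic dipoles into a molecular dipole.

    positions      : list of (x, y, z) atomic coordinates
    charges        : list of scalar partial charges q_i
    atomic_dipoles : list of (x, y, z) atom-centered dipole vectors mu_i
    Returns mu = sum_i q_i * r_i + sum_i mu_i as a 3-element list.
    """
    total = [0.0, 0.0, 0.0]
    for r, q, mu in zip(positions, charges, atomic_dipoles):
        for k in range(3):
            total[k] += q * r[k] + mu[k]
    return total


# Hypothetical two-atom example (arbitrary consistent units).
positions = [(0.0, 0.0, 0.5), (0.0, 0.0, -0.5)]
charges = [0.4, -0.4]                       # sums to zero (neutral molecule)
atomic_dipoles = [(0.0, 0.0, 0.1), (0.0, 0.0, 0.1)]

mu = molecular_dipole(positions, charges, atomic_dipoles)
```

When the partial charges sum to zero, the charge term does not depend on the choice of origin, so the combined dipole is unchanged if all positions are translated together.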
However, the ultimate performance of the models (and the optimal weighting of their combination) depends on the details of the system at hand, with the scalar model being clearly superior when describing large molecules whose dipole is almost entirely generated by charge separation. These observations point to the importance of simultaneously accounting for the local and non-local effects that contribute to µ; further, they define a challenging task to benchmark future models, particularly those aimed at the description of condensed phases. In Chapter 4, we present a molecular descriptor, the Semi-Local Density Fingerprint (SLDF), for use in machine learning (ML) to predict accurate quantum chemical (QC) properties of molecules. The SLDF employs B-splines (bell-shaped spline functions with compact support) to represent the characteristic information in the electron density of a given chemical system. The SLDF is capable of describing various complex chemical systems, including many-body non-covalent complexes and chemical reactions. The design of the SLDF holds several advantages when used with ML: 1) the SLDF accounts for molecular symmetry by construction (and is invariant to translations and rotations); 2) the SLDF provides a compact (system-size-independent) representation for each molecule; and 3) the SLDF describes molecules uniquely with at least DFT accuracy. As a proof of principle, we focus on using the SLDF to predict accurate conformational energies of molecules and compare its performance with that of the recently published ML-HK method on an identical dataset. Afterwards, three different training schemes (i.e., test-molecule specific, test-molecule inclusive, and test-molecule exclusive) are designed and employed to test the transferability of the SLDF. Lastly, the SLDF is shown to predict the correct fine structure of the potential energy surface (PES) of the oxirene molecule, which is often described qualitatively incorrectly by density functional theory (DFT) calculations.
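To make the phrase "bell-shaped spline functions with compact support" concrete, the sketch below evaluates a centered cardinal cubic B-spline on uniform unit-spaced knots. The choice of cubic order and uniform knots is an assumption made for illustration; the abstract does not state which B-spline order or knot layout the SLDF uses, and the function name is hypothetical.

```python
def cubic_bspline(x):
    """Centered cardinal cubic B-spline on unit-spaced knots.

    Bell-shaped, peaks at x = 0 with value 2/3, and is exactly zero
    for |x| >= 2 (compact support).
    """
    ax = abs(x)
    if ax < 1.0:
        return 2.0 / 3.0 - ax * ax + 0.5 * ax**3
    if ax < 2.0:
        return (2.0 - ax) ** 3 / 6.0
    return 0.0


def shifted_sum(x, k_min=-3, k_max=3):
    """Sum of integer-shifted copies of the spline evaluated at x."""
    return sum(cubic_bspline(x - k) for k in range(k_min, k_max + 1))


peak = cubic_bspline(0.0)        # maximum of the bell shape
outside = cubic_bspline(2.5)     # outside the support, exactly zero
unity = shifted_sum(0.3)         # shifted copies sum to one
```

Because each basis function vanishes outside a short interval, only a few of them contribute at any given point, which keeps a density expansion in such a basis local.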
As such, we conclude that the SLDF is a promising, efficient, and transferable molecular descriptor for ML to predict chemical properties accurately. In Chapter 5, we present a general scheme for constructing pseudopotentials with range-separated hybrid (RSH) exchange-correlation (xc) functionals. This scheme maintains consistency between the xc functional used during pseudopotential construction and the one used in planewave-based electronic structure calculations, which is important for an accurate and reliable description of the structure and properties of condensed-phase systems. Constructing such pseudopotentials entails solving the radial integro-differential equation for a spherical atomic configuration. As a proof of principle, we demonstrate pseudopotential construction with the PBE, PBE0, HSE, and sRSH functionals for a select set of atoms, and then investigate the importance of pseudopotential consistency when computing atomization energies, bulk moduli, and band gaps of several solid-state systems. In doing so, we find that pseudopotential consistency errors tend to be systematic and can be as large as 1.4% when computing these properties. In Chapter 6, we explore the convergence properties of a separable series expansion of the short-range Coulomb operator, erfc(µ|r-R|)/|r-R|, provided by Ángyán et al. (J. Phys. A 39, 8613 (2006)), and demonstrate that this expansion is an asymptotic series which requires additional precision (beyond standard double precision) during its numerical evaluation for large values of ξ=µr and Ξ=µR. In doing so, we show that this separable series expansion reproduces the analytical (non-separable) expression for the short-range Coulomb potential with high fidelity, and can therefore be used in the generation of pseudopotentials and Poisson-like boundary conditions in planewave-based implementations of range-separated hybrid density functional theory.