Choice of the basis set and the auxiliary basis set

As so-called post-SCF approaches, correlated wave function methods including Moller-Plesset (MP) perturbation theory, the random-phase approximation (RPA) and coupled-cluster (CC) theory require a mean-field theoretical method as a starting point, for which in most cases the Hartree-Fock (HF) method is used

In contrast to HF theory and most density functional approximations, such advanced methods do not only depend on the set of occupied single-particle states but do also require the inclusion of unoccupied, or virtual states as well. Due to the generally different shape and spacial extent of unoccupied states, this circumstance makes the development of special basis sets necessary. As a consequence, calculations involving these post-SCF methods are recommended to be performed with so called valence-correlation consistent (VCC) basis sets instead of the conventional light, intermediate and tight basis sets, which have been developed primarly for use in HF and DFT applications. The NAO-VCC-nZ (n = 2,3,4,5) basis sets, which have originally been developed for MP2 and RPA applications but have since also been employed in CC calculations, are constructed analogously to the polarized VCC Gaussian Type Orbital (GTO) basis sets developed by Dunning, commonly abbreviated as cc-pVnZ, which are widely used in quantum chemistry applications.

The major benefit of using VCC basis sets in combination with MP2, RPA and CC methods comes from the generally observed painfully slow convergence of these methods with respect to the basis set size. Dunning-type basis sets are designed in such a way, that the basis set incompleteness error (BSIE) of the correlation energy decays cubically with respect to n, such that the slow convergence of the BSIE can be circumvented by performing a calculation with different values of n and extrapolating the correlation energy to the so-called complete basis set (CBS) limit. The most widely used technique of that kind is the two-point extrapolation, for which the (MP2, RPA, CC, etc.) correlation energy is computed using the n and n+1 basis set and subsequently extrapolated using the following equation

\[ CBS[n,n+1] = \frac{E_{\mathrm{corr}}(n)^3 - E_{\mathrm{corr}}(n+1)^3}{n^3 - (n+1)^3}\mathrm{.} \]

In this tutorial we will explore the reliability and accuracy of this kind of extrapolation for the paracetamol molecule.

In addition to a special treatment of the atomic basis set, the consideration of an auxiliary basis set is indispensible. All correlated wave function methods require the computation of Coulomb integrals (also often reffered to as 4-center-2-electron integrals or electron repulsion integrals). Without the use of a resolution-of-the-identity (RI) technique, which allows to decompose the tensor of Coulomb integrals into less memory intensive lower-rank tensors, these correlated wave function methods would become computationally intractible very quickly. The probably most common flavor of the RI technique, RI-V, is deemed to be the most accurate one to represent the Coulomb integrals and is available in FHI-aims for molecular calculations. For bigger system sizes, however, the unfavorable scaling of the RI-V approach with respect to system size becomes impractical. A localized version of RI-V, RI-LVL, which is implemented in FHI-aims, offers a more affordable, albeit less accurate approximation. As has been shown in previous studies, the lack of accuracy in the RI-LVL scheme can be significantly improved by manually adding basis functions to the auxiliary basis set.

In this tutorial, we will make use of the memory-efficient RI-LVL approximation to determine the correlation energy of the paracetamol molecule on the MP2 level of theory using FHI-aims in combination with CC4S. The CC4S code is a pure post-SCF code, which relies on another electronic structure package (here: FHI-aims) to perform the initial HF calculation. At the end of the HF calculation, the single-particle energies, the single-particle wave functions and the RI-related quantities need to be post-processes before being fed to CC4S. This job is done by the CC-aims interface. Upon completion of the HF calculation, CC-aims computes a low-rank representation of the Coulomb integrals tensor, the so called Coulomb vertex, which is required by CC4S. In the simplest kind of calculation, CC-aims writes the Coulomb vertex, the single-particle energies and a some additional information to files, that CC4S can process. We will go through the steps one has to perform to obtain the electronic correlation energy on the MP2 level. Note, that the procedure presented here does equally apply for other ground state methods that are implemented in CC4S, which includes, RPA+SOSEX, CCSD and CCSD(T).

The FHI-aims input

To perform any kind of FHI-aims calculation, a minimum of two input files is required: a control.in file, which contains the specifications of the calculation, and the structure of the molecule or material given by the geometry.in file.

The `control.in` file

The control.in file consists of two parts, the calculation settings and the basis set. In terms of calculation settings, we want to perform a HF calculation (xc hf). To reduce the computational cost of the MP2 calculation, one common approach is to only include the outermost shell of valence electrons for the post-SCF calculation and freeze the lower-lying ones via frozen_core_postSCF 1. As noted before, while the RI-V approach is generally the most accurate one, it also becomes computationally impractical for bigger systems, so that we will utilize the RI-LVL method, which in the control.in file can be specified via RI_method LVL. Currently, however, the RI-LVL scheme for post-SCF methods is only available under periodic boundary conditions. While slightly inconvenient for molecular applications, this is not a problem if the molecule is put in a huge simulation cell. Since FHI-aims employs localized basis functions, the addition of vacuum around the molecules comes at virtually zero computational overhead, so that we can easily add 100 angstrom of vacuum or more without any additional cost. In the control.in files this is reflected by specifying k_grid 1 1 1 and symmetry_reduced_k_grid .false.. To invoke CC-aims after the HF calculation is completed, one has to add output cc4s to the control.in file and one has to ensure that the ScaLAPACK version of FHI-aims is employed via KS_method parallel. After additionally disabling relativistic corrections (relativistic none) and ensuring charge neutrality of the paracetamol molecule (charge 0), the minimal control.in file to generate the necessary files for a CC4S calculation looks like this

=== "control.in"

xc                       hf
frozen_core_postSCF      1
RI_method                LVL

k_grid                   1 1 1
symmetry_reduced_k_grid  .false.

output cc4s
KS_method                parallel

relativistic             none
charge                   0

The NAO-VCC-nZ basis sets that we are going to use grow very quickly in size with increasing n. It can happen, particularly for n=4,5, that the basis set becomes slightly ill-conditioned. By specifying override_illconditioning .true., one can avoid that FHI-aims aborts the calculation. However, make sure that the ill-conditioning is indeed a result of the basis set size and not of any other pathological computational parameter choice. Finally, as we intend to use the computed HF quantities in a subsequent MP2 calculation, we want to ensure that sufficiently tight convergence criteria are met, for which we add a few more parameter specifications. As a result the complete set of parameters in the control.in file are

=== "control.in"

xc                       hf
frozen_core_postSCF      1
RI_method                LVL

k_grid                   1 1 1
symmetry_reduced_k_grid  .false.

output cc4s
KS_method                parallel

relativistic             none
charge                   0

override_illconditioning .true.

sc_accuracy_rho          1E-6
sc_accuracy_eev          1E-4
sc_accuracy_etot         1E-6

The second half of the control.in file contains the basis set information. As noted before, we are going to use the NAO-VCC-nZ basis sets which can be found in the species_defaults subdirectory of the FHI-aims code. In the case of paracetamol, which consists of C, H, O and N, the NAO-VCC-2Z basis set can be added to the control.in file via

cat control.in ~/FHIaims_fixed_elsi/FHIaims/species_defaults/NAO-VCC-nZ/NAO-VCC-2Z/06_C_default ~/FHIaims_fixed_elsi/FHIaims/species_defaults/NAO-VCC-nZ/NAO-VCC-2Z/01_H_default  ~/FHIaims_fixed_elsi/FHIaims/species_defaults/NAO-VCC-nZ/NAO-VCC-2Z/08_O_default ~/FHIaims_fixed_elsi/FHIaims/species_defaults/NAO-VCC-nZ/NAO-VCC-2Z/07_N_default > control_CHON.in

mv control_CHON.in control.in

Finally, as noted in the beginning, utilization of the RI-LVL scheme in combination with correlated wave function methods requires manual addition of auxiliary basis functions. We have found that the following auxiliary functions generally suffice for main-group elements, to reach auxiliary basis set completeness:

 for_aux hydro 4 f 1.0
 for_aux hydro 5 g 1.0
 for_aux hydro 6 h 1.0

To add them to your calculation, just copy-paste these three functions at the end of the basis set specification of each chemical element.

The `geometry.in` file

The geometry.in file we are going to use looks like this

=== "geometry.in"

lattice_vector    100.00000000      0.00000000      0.00000000
lattice_vector      0.00000000    100.00000000      0.00000000
lattice_vector      0.00000000      0.00000000    100.00000000
atom_frac      -0.04903565      0.01241169     -0.00008674 C
atom_frac      -0.04348102     -0.00033115      0.00011141 C
atom_frac      -0.04074974      0.02367077     -0.00032200 C
atom_frac      -0.02691010      0.02187815     -0.00035479 C
atom_frac      -0.02135859      0.00910356     -0.00015637 C
atom_frac      -0.02955862     -0.00230022      0.00008209 C
atom_frac      -0.05989439      0.01355379     -0.00005679 H
atom_frac      -0.02044152      0.03063831     -0.00053906 H
atom_frac      -0.01054603      0.00826149     -0.00019262 H
atom_frac      -0.05010819     -0.00899975      0.00029391 H
atom_frac      -0.04555253      0.03652521     -0.00052445 O
atom_frac      -0.05524927      0.03625030     -0.00047415 H
atom_frac      -0.02514637     -0.01567650      0.00029638 N
atom_frac      -0.01292920     -0.02221467      0.00035969 C
atom_frac      -0.03266090     -0.02253960      0.00044532 H
atom_frac      -0.00009404     -0.01425119      0.00017140 C
atom_frac       0.00062493     -0.00800063     -0.00883349 H
atom_frac       0.00819244     -0.02137940      0.00026640 H
atom_frac       0.00069120     -0.00768504      0.00894304 H
atom_frac      -0.01289340     -0.03450510      0.00057082 O

In addition to the atomic coordinates of the paracetamol molecule, our geometry.in file also contains the lattice vectors to enclose the molecule in a 100Ax100Ax100A box, which needs to be done to use the RI-LVL scheme in combination with a post-SCF method.

Running your first MP2 calculation

Note, that due to the increased scaling of the MP2 method with respect to system, most calculation performed in this tutorial require access to a computing cluster, particularly for those calculations involving bigger basis sets.

To maintain some order, you should create a directory called paracetamol_mp2, which should contain a subdirectory 2Z, where our first MP2 calculation will be performed. The calculation will consist of two parts: We will start with a FHI-aims calculation to generate the necessary input files for CC4S, which will be done in the aims subsubdirectory. Subsequently the CC4S calculation will be performed in a separate CC4S subsubdirectory. Start by moving the control.in and geometry.in file assembled in the previous two sections into the aims directory.

mkdir -p paracetamol_mp2/2Z/aims
cp control.in paracetamol_mp2/2Z/aims
cp geometry.in paracetamol_mp2/2Z/aims
cd paracetamol_mp2/2Z/aims

To run the calculation execute the following command

mpiexec -np 4 /path/to/fhi-aims/executable/aims.xxx.scalapack.mpi.x

This calculation will perform a HF SCF cycle and subsequently invoke CC-aims to generate the necessary input files for CC4S. After the calculation finishes, you should find the following additional files in your calculation directory:

EigenEnergies.yaml
EigenEnergies.elements
CoulombVertex.yaml
CoulombVertex.elements
AuxiliaryField.yaml
State.yaml

Now, everything is set to perform an MP2 calculation (or any other method implemented in CC4S for that matter). For that leave the aims directory and create a CC4S directory

cd ..
mkdir CC4S
cd CC4S

Copy or create links to the yaml and elements files created in the aims directory.

ln -s ../aims/*yaml .
ln -s ../aims/*elements .

The last file you need to run the CC4S calculation, is one more yaml file that specifies which method of CC4S shall be used for the calculation. For an MP2 calculation the file looks like this

=== "mp2.yaml"

- name: Read
  in:
    fileName: "EigenEnergies.yaml"
  out:
    destination: EigenEnergies

- name: Read
  in:
    fileName: "CoulombVertex.yaml"
  out:
    destination: CoulombVertex


- name: DefineHolesAndParticles
  in:
    eigenEnergies: EigenEnergies
  out:
    slicedEigenEnergies: EigenEnergies

- name: SliceOperator
  in:
    slicedEigenEnergies: EigenEnergies
    operator: CoulombVertex
  out:
    slicedOperator: CoulombVertex

- name: VertexCoulombIntegrals
  in:
    slicedCoulombVertex: CoulombVertex
  out:
    coulombIntegrals: CoulombIntegrals

- name: SecondOrderPerturbationTheory
  in:
    slicedEigenEnergies: EigenEnergies
    coulombIntegrals: CoulombIntegrals
  out:
    energy: Mp2Energy

To perform the MP2 calculation, execute the following command

mpiexec -np N /path/to/CC4S/Cc4s -i mp2.yaml

where N specifies the number of cores to be used in parallel. The calculation evaluated below was performed on 2 nodes on the raven MPCDF cluster, which equates to N=144.

After the CC4S calculation is completed, the MP2 correlation energy will be printed to the standard output (albeit with a limited precision)

step: 6, SecondOrderPerturbationTheory
Contracting second order energy...
correlation energy: -1.73466

and - as a double precision real number to cc4s.out.yaml

=== "cc4s.out.yaml"

    name: SecondOrderPerturbationTheory
    out:
      energy:
        correlation: -1.7346634404468899

in Hartree energy units.

To obtain the total MP2 energy, the total HF energy needs to be added to the MP2 correlation energy. The total HF energy can be found in the output file of the previous FHI-aims calculation either at the end of the final SCF step

| Total energy                  :        -512.54583727 Ha      -13947.08185176 eV

in both Hartree and eV units

or at the end of the calculation (only in eV units)

------------------------------------------------------------------------------
  Final output of selected total energy values:

  The following output summarizes some interesting total energy values
  at the end of a run (AFTER all relaxation, molecular dynamics, etc.).

  | Total energy of the DFT / Hartree-Fock s.c.f. calculation      :         -13947.081851759 eV
  | Final zero-broadening corrected energy (caution - metals only) :         -13947.081851759 eV
  | For reference only, the value of 1 Hartree used in FHI-aims is :             27.211384500 eV

Hence, we find that for the paracetamol molecule using the NAO-VCC-2Z basis set, the MP2 correlation energy is -1.73466 Ha, while the total MP2 energy is -512.54584 + (-1.73466) = -514.2805 Ha.

Reducing memory overhead - the optimized auxiliary field vertex

If you look at the files that have been generated by the CC-aims interface, the biggest one by far is the CoulombVertex.elements file, which in the case of the previous calculation takes up almost 9 GB. Indeed, for calculations involving bigger basis sets the memory demand of the Coulomb vertex grows very quickly.

To reduce this memory overhead, CC-aims can truncate the Coulomb vertex by essentially performing a principal component analysis. The resulting quantity is called the optimized auxiliary field Coulomb vertex. To invoke this truncation one of two key-value pairs need to be added to the control.in file. One can either specify the smallest singular value <thr> , below which the Coulomb vertex is discarded via

cc4s_screen_thresh <thr>

or one can explicitely state how many of the biggest singular values are to be kept via

cc4s_screen_num <num>

,where <thr> is a double precision real number and <num> is an integer.

If one now adds cc4s_screen_thresh 1.0d-3 to the previous control.in file and repeats the MP2 calculation, one ends up with an MP2 correlation energy of

    name: SecondOrderPerturbationTheory
    out:
      energy:
        correlation: -1.7346596876159019

which is almost identical to the previous value obtained with the full Coulomb vertex. The size of the Coulomb vertex, however, is reduced to 1.6 GB.

In a practical calculation, one can determine the maximal value for <thr> using a small basis set like the NAO-VCC-2Z basis and determine the biggest value of <thr> for which the quantity of interest (here: MP2 correlation energy) changes sufficiently little. In this tutorial, we keep this threshold at a conservative value of 1d-3. Note, however, that while in the present case we only compute the MP2 correlation energy of a single system, oftentimes one is interested in the difference of two energies (e.g reaction energies, adsorption energies, phase transition energies, etc.). Due to error compensation, it is expected that in such cases, the <thr> value can be chosen even bigger without a significant loss in accuracy. Testing the convergence of the correlation energy (difference) as a function of <thr> can be done quickly and can substantially decrease the memory and computational overhead of the calculation.

Basis set convergence and extrapolation

As has been noted in the beginning of this tutorial, correlated wave function methods including MP2, RPA and CC are known to converge very slowly with basis set size. To demonstrate that, we will repeat the previous MP2 calculation with the NAO-VCC-3Z, NAO-VCC-4Z and NAO-VCC-5Z basis set. For each of these calculations you should create a new subdirectory in paracetamol_mp2

mkdir -p paracetamol_mp2/3Z/
mkdir -p paracetamol_mp2/4Z/
mkdir -p paracetamol_mp2/5Z/

and repeat the remaining calculation procedure for the three bigger basis set by including the correct basis set file from the species_defaults directory in the control.in file (and remember to include the manual addition of the three for_aux entries for each element). Be aware, that the FHI-aims calculations for these basis sets will be more demanding and will require more computational resources. A summary of the required resources and the resulting MP2 correlation energies in given in the table below

Basis set	MP2 corr. energy (mHa/atom)	Coulomb vertex size (GB)	#Cores	Time for FHI-aims (min)
NAO-VCC-2Z	-86.733	1.6	144	3.2
NAO-VCC-3Z	-99.368	14	288	5.1
NAO-VCC-4Z	-104.386	83	288	20.0
NAO-VCC-5Z	-106.783	407	576	46.25

These results show very clearly that even going from the NAO-VCC-4Z to the NAO-VCC-5Z basis set still changes the MP2 correlation energy by almost 2.4 mHa/atom (=65 meV/atom). Thanks to the correlation-consistency of the NAO-VCC-nZ basis sets, we can make use of the two-point extrapolation technique introduced in the beginning of this tutorial. Since we have computed the results for n=2,3,4,5, we can compute the CBS extrpolation for CBS[23], CBS[34] and CBS[45], the latter of which we will use as our reference.

Extrapolation	MP2 corr. energy (mHa/atom)
CBS[23]	-104.688
CBS[34]	-108.047
CBS[45]	-109.298

These results show that the extrapolated values of the MP2 correlation energy converge faster, so that the remaining difference between the CBS[34] and the CBS[45] estimate is 1.25 mHa/atom (= 34 meV/atom). It is very important to emphasize again, that in practical applications where one is generally interested in the difference between two energies, the basis set incompleteness error will be even smaller due to error compensation.

One also has to realize, that in many practical applications (even more so for CC calculations), the NAO-VCC-5Z will be unfeasible due to the computational overhead. In these cases, the CBS[34] estimate constitutes the best compromise between accuracy and affordability.