Introduction

HQS Molecules is a tool that can be used in conjunction with other HQStage modules by HQS Quantum Simulations GmbH. HQS Molecules is a user-friendly Python package designed to streamline computational chemistry workflows. It offers functionalities for converting 2D chemical structures to 3D coordinates, accessing the PubChem database, interfacing with the xTB and CREST programs for energy calculations and conformer searches, and handling XYZ files. These functionalities enhance efficiency, accuracy, and ease of use, making it a valuable resource for chemists and researchers.

Associated Functionality

HQS Molecules is providing the following functionalities to the users:

  • Converting chemical structures from two-dimensional connectivity representations (SMILES strings or Molfiles) to three-dimensional coordinates. This is achieved using the RDKit and Open Babel packages.
  • API access to the PubChem database, permitting name-to-structure and structure-to-name searches directly via Python scripts.
  • Providing a Python interface for important functionalities of the semiempirical xTB program, such as single-point energy calculations, geometry optimizations and frequency calculations.
  • Performing conformer searches through a Python interface for the CREST program. New: HQS Molecules includes a procedure to refine conformer ensembles obtained via CREST by reassigning equivalent structures and completing the rotamer ensemble, combined with a screening for saddle points.
  • Reading and writing XYZ files.

Functionality in this package makes extensive use of Pydantic to define classes used for input and output.

Why it is Useful

  • For quantum-chemical calculations, a 3D structure is essential, providing the spatial coordinates of all atoms. The larger the molecule, the more likely it is to have multiple possible conformations that are in thermodynamic equilibrium. HQS Molecules allows chemists and researchers in other fields to easily transition from simple 2D representations of molecules to more complex and informative 3D structures. This is crucial for understanding molecular geometry, predicting behavior and properties, and performing computational chemistry simulations.

  • Direct access to the PubChem database enables users to quickly retrieve chemical information and structures by simply providing a chemical name or SMILES string. This streamlines the process of gathering data for research and development, saving time and reducing the potential for errors in manual searches.

  • Integrating xTB functionalities into Python scripts allows for automated and efficient computational chemistry workflows. Users can perform energy calculations, optimize molecular geometries, and conduct frequency analyses seamlessly within their Python environment, enhancing productivity and enabling more complex simulations.

  • Conformer searches are essential for identifying the most stable molecular conformations and understanding the conformational flexibility of molecules. By providing a Python interface for the CREST program with in-house developed post-processing, HQS Molecules simplifies the process of conformer searching, making it more accessible and easier to integrate into larger computational workflows.

  • XYZ files are a common format for representing molecular structures. The ability to read and write these files ensures compatibility with a wide range of computational chemistry tools and facilitates the exchange of molecular data between different software packages.

  • Pydantic provides data validation and parsing using Python type annotations. By leveraging Pydantic, the package ensures that input and output data are correctly formatted and validated, reducing the likelihood of errors and improving the robustness of the software.

Summary

With HQS Molecules, we utilize multiple existing packages to cover various configurations with a single tool, allowing users to work flexibly without having to choose between different programs. HQS Molecules provides a unified input for all available packages and outputs data in a format that can be used for further processing steps in computational chemistry simulations.

Sample Use Cases

HQS Molecules provides an easy-to-use Python interface to generate three-dimensional structures, optimize geometries, calculate molecular energies or vibration frequencies, and to sample molecular conformations. These functionalities are crucial building blocks in molecular simulation workflows for the chemical and pharmaceutical industries.

Example use cases:

  • The input for quantum chemistry calculations, such as density functional theory (DFT), is rarely identical with the content of chemical in-house databases. With functionality as implemented in this package, HQS computed thermodynamic properties for a customer using DFT, starting from a text file containing thousands of molecule names.
  • Converting chemical structures from 2D to 3D representations, finding and comparing geometries, sampling conformers: computational chemistry workflows require many steps before any DFT calculation can even start. HQS Molecules combines the established expertise of open-source packages such as RDKit, Open Babel, xTB or CREST behind a Python interface, while improving critical aspects. HQS Molecules can be used as part of your own workflow to provide the input for a quantum chemistry software, which simulates chemical reactions or calculates spectroscopic parameters.
  • Quantum algorithms for molecular electronic structure problems normally require a three-dimensional structure, for example in XYZ format, as an input. Obtaining such an XYZ file can be a nuisance for specialists and non-specialists alike. With HQS Molecules, the name of a molecule is all you need to generate three-dimensional geometries for a vast number of catalogued compounds.
  • As many other packages, the open-source Active Space Finder by HQS requires a molecular geometry as part of its input. HQS Molecules provides starting geometries for further calculations with PySCF and the Active Space Finder.

Further Materials

In addition to this manual, please check the example notebooks for a practical demonstrations of the utilities in this module.

Installation

For general information on installing the module and downloading the example notebooks, please refer to the HQStage documentation.

Please note that HQS Molecules is currently only supported on Linux.

Parts of this module require three third-party software packages that are not available for installation via pip: Open Babel, xTB and CREST.

  1. Installation of Open Babel:
    For the user to benefit from the full scope of the 2D-to-3D structure conversion functionality, we recommend installing Open Babel. Open Babel can be installed via the package managers of various Linux distributions (e.g., apt on Debian 12) or alternatively from the conda-forge channel. Note: The 2D-to-3D structure conversion functionality uses the Open Babel package as a fallback option if the structure conversion of a molecule via RDKit fails. It is still possible to use HQS Molecules without Open Babel. In that case, the functionality will only be able to use RDKit for structure conversion.

  2. Installation of xTB: To make use of the interface to the xTB program for semiempirical calculations, the xtb binary must be provided by the user in the PATH variable. We strongly recommend to download the binary release 6.7.1 from GitHub, since Hessian calculations with the GFN-FF method are known to fail with version 6.7.0. Besides that, an installation from conda-forge is possible, in principle, but it is known to fail for specific calculations (like e.g. Hessian calculations for linear molecules).

  3. Installation of CREST: To make use of the interface to the CREST program for conformer search, the crest binary must be provided by the user in the PATH variable. We recommend downloading the binary release 3.0.2 from GitHub.

Molecular Name Lookup

HQS Molecules provides API access to the PubChem database, permitting name-to-structure and structure-to-name searches directly via Python scripts.

IMPORTANT: If you use the PubChem interface, your data will be sent over the internet to the PubChem servers.

Requests to PubChem are made via the PubChem data class, which stores the data retrieved after completion of each request. In the following example, a request is made using the molecule name:

>>> from hqs_molecules import PubChem
>>> pc = PubChem.from_name("2-methylprop-1-ene")
>>> print(pc.name)
Isobutylene
>>> print(pc.cid)
8255
>>> print(pc.formula)
C4H8
>>> print(pc.smiles)
CC(=C)C
>>> 

As shown above, the data retrieved is stored in the attributes

  • name for the compound name,
  • smiles for the SMILES string,
  • cid for the PubChem compound ID, and
  • formula for the molecular formula (as a MolecularFormula object).

Note: the name stored in the name attribute may differ from the argument supplied for the request: the most entries for a compound on PubChem list multiple synonymous names, but the name attribute contains the representative title name selected on PubChem.

Likewise, a reverse search can be performed using a SMILES string:

>>> pc = PubChem.from_smiles("COC1=C(C=CC(=C1)C=O)O")
>>> print(pc.name)
Vanillin
>>> 

Search by PubChem Compound Identifier

Finally, requests can be made using the PubChem compound identifier:

>>> pc = PubChem.from_cid(145742)
>>> print(pc.name)
Proline
>>> print(pc.smiles)
C1C[C@H](NC1)C(=O)O
>>> 

An important feature of the PubChem class is that it verifies the depositor of individual data items retrieved. Data is only made available if PubChem is listed as the data source.

As shown in the section on 2D to 3D structure conversion, a SMILES string can be used to generate a three-dimensional representation.

Structure Conversion

An important functionality of HQS Molecules is the conversion of two-dimensional molecular structure representations into three-dimensional geometries, as shown here for the example of the glucose molecule.

Simple 2D to 3D Conversion

Input: SMILES strings and Molfiles are supported as input for the functions smiles_to_molecule and molfile_to_molecule, respectively.

Output: Both functions return a Molecule object containing three-dimensional atomic coordinates. In addition, the returned object contains the overall molecular charge, which is always included in a Molfile or a SMILES string. The multiplicity field is set to None, as it cannot be derived unambiguously from the input.

The structure conversion employs the respective functionality of RDKit as the first choice, and uses Open Babel as a backup in case that structure conversion with RDKit fails. After generating a three-dimensional structure, a bonding graph is determined using distance criteria and used to verify the generated structure against the input. If the composition and the bonding graphs do not match, then the structure is rejected.

If the input structure is stored in a molfile named "my_molecule.mol", then the conversion is performed by calling:

>>> from hqs_molecules import molfile_to_molecule
>>> mol = molfile_to_molecule("my_molecule.mol")

Likewise, it is possible to perform a conversion of a molecular structure with a SMILES string - for example, one obtained from PubChem:

>>> from hqs_molecules import PubChem, smiles_to_molecule
>>> pc = PubChem.from_name("propane")
>>> mol = smiles_to_molecule(pc.smiles)

An optional check can be carried out with either of the conversion functions by supplying a molecular formula as an argument. The conversion fails if the input does not match the provided formula. In that case, the formula needs to be represented as a MolecularFormula object.

>>> from hqs_molecules import MolecularFormula, PubChem, smiles_to_molecule
>>> pc = PubChem.from_name("propane")
>>> # succeeds
>>> mol = smiles_to_molecule(pc.smiles, formula=pc.formula)
>>> # raises an exception
>>> mol = smiles_to_molecule(pc.smiles, formula=MolecularFormula.from_str("C3H7-"))

Utilities for RDKit

The HQS Molecules module includes convenience functions to create RDKit Mol objects from SMILES strings or Molfiles. These objects represent molecular information within the RDKit package.

Both the smiles_to_rdkit and molfile_to_rdkit functions accept an argument addHs. By default, it is set to True, causing explicit hydrogen atoms to be added in the generated object. Setting addHs = False suppresses the addition of explicit hydrogens; only hydrogens that were already explicitly represented within a Molfile are retained.

>>> from hqs_molecules import smiles_to_rdkit
>>> # The object generated contains 11 atoms.
>>> rdkit_mol = smiles_to_rdkit("CCC")
>>> # The object generated contains 3 atoms.
>>> rdkit_mol = smiles_to_rdkit("CCC", addHs=False)

Expert Usage

The functionalities described in the remainder of this section are only intended for expert usage.

An RDKit Mol object can be converted to a three-dimensional structure by passing it to the function rdkit_to_molecule. It is a low-level function that calls RDKit without resorting to Open Babel as a backup. Nonetheless, it performs a consistency check for the generated structure. If the RDKit Mol object was created without the addition of explicit hydrogens (addHs = False), this conversion may fail due to a composition mismatch.

The low-level functions to perform structure conversion using only Open Babel are available via smiles_to_molecule_obabel and molfile_to_molecule_obabel. These functions require a SMILES string or a Molfile as their input, respectively. A separate consistency check of the generated structure is also performed here.

XYZ File Handling

HQS Molecules provides the functionality for reading and writing XYZ files. XYZ files are a common format for representing molecular structures. The ability to read and write these files ensures compatibility with a wide range of computational chemistry tools and facilitates the exchange of molecular data between different software packages.

Reading and Writing XYZ Files

XYZ files are read using the function xyz.read. It can be called either with a Path object or a string representing a file path; alternatively, it may be provided with a file-like object.

>>> from hqs_molecules import xyz
>>> # providing a file name
>>> molgeom = xyz.read("molecule.xyz")
>>> # providing a file-like object
>>> with open("molecule.xyz", "r") as f:
...     molgeom = xyz.read(f)
... 

XYZ files contain atomic positions and chemical element symbols, but no information on charge and spin multiplicity. Therefore, the function returns a MolecularGeometry object with element symbols and atomic positions. It can be easily converted to a Molecule object by providing information on the charge and, optionally, on the multiplicity:

>>> mol = molgeom.to_molecule(charge=0)
>>> mol = molgeom.to_molecule(charge=0, multiplicity=1)

XYZ files can be written to disk using the function xyz.write by providing either a MolecularGeometry or a Molecule object (the latter is a subclass of the former). The function can either create a file at the specified path, or write to a file-like object.

Converting Strings with XYZ Information

In some cases, the content of an XYZ file may be present in a string, or may need to be converted to a string. Such cases can be handled generally using the aforementioned functions xyz.read and xyz.write operating on a io.StringIO object (instead of a Path object or a string). However, the HQS Moleculesmodule also provides convenience functions xyz.parse_string and xyz.get_string that read and format strings with XYZ content directly:

>>> # creates a string from a MolecularGeometry or Molecule object
>>> s = xyz.get_string(mol)
>>> # create a MolecularGeometry object from a string with XYZ content
>>> molgeom = xyz.parse_string(s)

Working with Comments in XYZ Files

The XYZ file format permits one single line with an arbitrary comment inside a molecular geometry definition. Some programs use this comment line to store specific information. When importing an XYZ file in HQS Molecules the comment can be requested by setting the argument return_comment to True. This causes the function to return a tuple with the molecular geometry and the comment line.

>>> from hqs_molecules import xyz
>>> molgeom, comment = xyz.read("molecule.xyz", return_comment=True)

Likewise, a comment line can be exported to an XYZ file by providing it in the comment argument.

>>> xyz.write("output.xyz", molgeom, comment="comment line ...")

Information on charge and spin multiplicity can be written into the comment line of an XYZ file using the function xyz.write_with_info. Unlike the read and write functions, it requires a Molecule object, as the MolecularGeometry class lacks information on charge and multiplicity.

XYZ Trajectory Files

Trajectories may be stored in XYZ files simply by concatenating XYZ geometries. Such files can be read using the function xyz.read_geometries. File paths (represented by a Path object or a string) as well as file-like objects are supported as its input argument. By default, the function returns a list of molecular geometries. Setting return_comments to True causes it to return a list of tuples instead, each tuple containing a geometry and its associated comment.

>>> # Returns a list of geometries.
>>> geom_list = xyz.read_geometries("trajectory.xyz")
>>> # Return of tuples containing geometries and comments.
>>> tuple_list = xyz.read_geometries("trajectory.xyz", return_comments=True)

In order to write a trajectory file, the function xyz.write is used (as for individual geometries), but with a list of geometries instead of an individual geometry.

>>> xyz.write("output.xyz", geom_list)

If individual comments need to be provided for each geometry, this is best done by working with a file stream, as shown in the example below:

>>> with open("output.xyz", "w") as f:
...     for step, geometry in enumerate(geom_list):
...         xyz.write(f, geometry, comment=f"step number {step}")
... 

Semiempirical Calculations with xTB

This HQS Moleculesmodule contains a Python interface to perform calculations with the xTB and CREST software packages.

The xTB (extended tight binding) program package is developed by the research group of Prof. Stefan Grimme at Universität Bonn. It contains implementations of the semi-empirical quantum-mechanical methods GFNn-xTB (n = 0–2) as well as of the GFN-FF force field ("GFN" stands for Geometries, Frequencies, and Non-covalent interactions, for which it has been parametrized). HQS Molecules contains wrapper functions to perform single-point energy calculations, geometry optimizations, and frequency calculations.

In this section the user will find further details on the associated functions:

  • Single-point energy calculations,
  • Structure optimizations,
  • Gradient calculations,
  • Frequency calculations,
  • Stable optimizations including Hessian calculations,
  • Biased Hessian, and
  • Conformer/Rotamer search.

Single-point energy calculations

The energy of any meaningful set of atomic coordinates with a specific charge and multiplicity can be calculated using the single_point function. Calculations are performed with the GFN2-xTB method by default, or alternatively with the GFN-FF method.

Input

The first mandatory argument of the function is a Molecule object, as the charge and multiplicity need to be known to perform a calculation. Note: the functions for reading an XYZ file return MolecularGeometry objects, which contain neither charge nor spin information. A MolecularGeometry objects can be easily converted to a Molecule object by providing information on the charge and, optionally, on the multiplicity, as described in the XYZ File Handling chapter. On the other hand, the 2D-to-3D structure conversion functions return Molecule objects with charge but without spin information.

Keyword arguments

It is possible to further refine the calculation with the following keyword arguments:

  • force_field: Use GFN-FF force-field (defaults to False, in which case the GFN2-xTB method is used).
  • solvent: By default, calculations are performed in vacuum. If the name of a solvent is provided, the implicit solvation model ALPB (analytical linearized Poisson-Boltzmann) will be used in the calculation. The available solvents are listed here.
  • create_tmpdir: Run the calculation in a temporary folder that is deleted with completion of the calculation (the default is set to True).

Output

The single_point function returns a float with the single-point energy in Hartree.

Example

An example is provided here:

from hqs_molecules import single_point

energy = single_point(molecule, solvent="Water", force_field=False)

Structure optimization

xTB can vary the position of atoms in the system to find the equilibrium structure with the lowest energy. This functionality is embedded into HQS Molecules in the optimization function.

Extra arguments

The function works similarly to the single_point function but it has two extra arguments:

  • opt_level with the possible optimization levels as defined by xTB. In this wrapper, the levels are defined in the XTBOptLevel class. By default, the very tight criterion is employed (XTBOptLevel.VTight), whereas the xTB software itself usesXTBOptLevel.Normal. The options are:

    • Crude
    • Sloppy
    • Loose
    • Lax
    • Normal
    • Tight
    • VTight
    • Extreme
  • max_cycles is the maximum number of optimization iterations. By default, it is set to None, which means that max_cycles is automatically determined by the xTB software depending on the size of the molecule (check above link).

Output: Trajectory

Calls to optimization function return a Trajectory object containing the optimization trajectory (structures) as a list of Molecule objects and the energies (in Hartree) for each step (energies) as a list of floats.

Most of the time, the quantities of interest are the optimized geometry and its energy. To access these, a Trajectory object contains two types of attributes: Trajectory.last and Trajectory.last_energy contain the last geometry and its energy, respectively; Trajectory.lowest_energy and Trajectory.lowest contain the lowest energy and the corresponding geometry. Trajectory.lowest_step returns the index of the energetically lowest geometry. In many cases, the last step may also be the one with the lowest energy.

The trajectory can be written to an XYZ file using the write (or write_with_info) functions of the xyz module.

Example

The following box illustrates some general ideas behind executing a geometry optimization, starting from a Molecule object.

from hqs_molecules import optimization, xyz

# perform optimization
opt_data = optimization(molecule, solvent="water")

# Write the trajectory into an XYZ file
xyz.write(file="trajectory.xyz", mol=opt_data.structures)

# Optimized structure: here it is assumed to be in the last step.
opt_molecule = opt_data.last  # Molecule object
opt_energy = opt_data.last_energy  # float

Gradient calculation

The nuclear gradient (first derivative of the energy with respect to the atomic coordinates) is calculated at each step of an optimization procedure. A necessary condition for an optimization to converge is typically to reduce the (norm of the) gradient to almost zero.

A single-point gradient can be calculated using the gradient function with the same arguments as a single_point procedure. It returns a nested list of floating point numbers, representing an N × 3 matrix (where N is the number of atoms in the molecule). Each value in the list corresponds to the derivative of an atomic position with respect to the x, y or z coordinates, represented in units of Hartree/Bohr.

Frequency calculations

The Hessian matrix (a matrix of second derivatives of the energy with respect to the atomic coordinates) is used to determine the nature of a stationary point, vibrational frequencies, molecular vibrational spectra, and thermochemical properties. In xTB, the Hessian matrix is calculated numerically.

Extra argument

HQS molecules provides the frequencies wrapper function to calculate vibrational frequencies and thermochemical corrections; note that it does not return the Hessian matrix itself. Besides the arguments already mentioned in the single_point wrapper, this function also accepts a temperature argument, which allows the user to define one or multiple temperatures for the calculation of thermochemical corrections. By default, the calculation will be performed at 298.15 K, but other temperature or a list of temperatures can also be specified.

Stationary point

If frequency calculations are meant to be performed for a stationary point (an optimized equilibrium structure or a saddle point), it is important to perform the calculations with the same method as the optimization itself; the potential energy surfaces of different methods have differing stationary points.

Frequency calculations are often used to ascertain if a stationary point is a minimum or a saddle point. A molecule of N atoms has 3N degrees of freedom: three translational degrees, three rotational degrees (two for linear molecules), and 3N − 6 vibrational degrees (or 3N − 5 for linear molecules). For a minimum, those frequencies need to be positive; for a transition state, exactly one of them needs to be imaginary.

Output

The return value is a MolecularFrequencies object, which contains three important fields:

  • total_energy: the total electronic energy of the system (in Hartree).
  • frequencies: a list with the harmonic frequencies (in cm−1). This list contains 3N elements, where 6 values should be zero (or 5 for linear molecules). By convention, imaginary frequencies are represented as negative numbers.
  • thermochem: a list of objects with temperature-dependent thermodynamic properties (BasicThermochemistry). Each of these BasicThermochemistry objects contains the temperature (in K), the total enthalpy and the total gibbs_energy (both in Hartree), and the entropy (in Hartree / K). Note that the electronic energy is included in the enthalpy and in the Gibbs energy; however, the solvation contribution currently is not.

Example

from hqs_molecules import frequencies

thermo_data = frequencies(opt_geom, temperature=[100, 125], solvent="water")

print("Single point energy:", thermo_data.total_energy)
print("Frequencies:", thermo_data.frequencies)
for res in thermo_data.thermochem:
    print(
        f"At {res.temperature} K: "
        f"the enthalpy is {res.enthalpy}, "
        f"the entropy is {res.entropy}, and "
        f"the Gibbs energy is {res.gibbs_energy} (in Hartree)"
    )

Stable optimization with Hessian calculations

With a simple geometry optimization, there is no guarantee of obtaining a minimum on the potential energy surface; a converged structure can also be a saddle point. Vibrational frequencies need to be calculated for a full characterization. Only if all of them are non-negative, a minimum has been found.

If imaginary frequencies are found, a displacement along the associated normal mode is a good starting point for a new optimization. In difficult cases, optimizations followed by frequency calculations may need to be repeated multiple times in order to find a genuine local minimum.

Such a composite calculation is provided by HQS Molecules with the opt_with_hess function. It repeats the Hessian and optimization calculations until no imaginary frequencies are found. If the maximum number of restarts (indicated by the max_restarts keyword, 4 by default) is reached without converging to a minimum, an exception is raised.

Input

Since the function combines optimizations with frequencies, it requires the same arguments as the optimization function, with an option to provide a temperature argument (for the thermodynamic properties) and max_restarts.

Output

The output of optimization with Hessian calculation is a tuple with a Trajectory object (as for an optimization) and a MolecularFrequencies object (as for a frequency calculation). The trajectory concatenates the geometries and energies of all optimizations performed within the procedure. On the other hand, the MolecularFrequencies object contains information for the converged geometry, determined by the overall lowest energy.

Example

A simple call to this procedure can be made as exemplified below:

from hqs_molecules import opt_with_hess, xyz
molecule = xyz.read("input_geometry.xyz").to_molecule(charge=0, multiplicity=1)
trajectory, frequency_data = opt_with_hess(molecule)

Biased Hessian

Hessian calculations for vibrational frequencies or thermochemical contributions are normally performed at stationary points of the potential energy surface. Therefore, the geometry optimization and the Hessian calculation need to be performed with the same method.

However, Spicher and Grimme proposed a method ("single-point Hessian") to calculate the Hessian for stationary points that were optimized with a different method. Therefore, the geometry can be optimized with a more expensive method, while xTB is used to obtain a quantity resembling the Hessian. In a nutshell, it employs the following steps:

  1. The input structure is re-optimized under application of a biasing potential, which constrains the atoms to remain close to their original positions.
  2. The Hessian is calculated at the new, slightly distorted geometry under the application of the biasing potential.
  3. Finally, the frequencies are re-scaled to reduce the impact of the biasing potential.

HQS Moleculesprovides the function biased_hessian to interface the described method in the xTB package. Its usage is analogous to the function frequencies described above.

Conformer/Rotamer Search

Many molecules, especially in organic chemistry, have a large degree of conformational freedom. At relevant temperatures, they are not represented by a single minimum on the potential energy surface, but rather by an equilibrium between rapidly interconverting minimum structures. Thus, comparing experimental measurements with theoretical data often requires consideration not only of a single structure, but of an entire thermodynamically accessible conformational ensemble.

CREST (Conformer–Rotamer Ensemble Sampling Tool) is a program used to perform conformer searches using the methods implemented in the xTB package. The wrapper function find_conformers provided by HQS Molecules interfaces CREST (which uses xTB as its engine) to search for low-energy conformers and rotamers of a given molecule. Afterwards, a post-processing is performed to refine the conformer ensemble.

Conformers and Rotamers

Conformers are distinct minima on the potential energy surface that are separated by sufficiently low barriers, such that they can interconvert at a given temperature. Often, this occurs through torsional rotations around single bonds.

In this context, rotamers are defined as interconverting structures that can be represented as permutations of the atoms in a conformer [1]. A simple example of this structures related by the rotation of a methyl group (–CH3) in an organic molecule. Somewhat less intuitively, the two chair structures of cyclohexane are also regarded as rotamers in this context, as their structures can be overlayed.

Input

The find_conformers wrapper function performs three successive steps:

  1. Conformer/rotamer ensemble search using CREST.
  2. Reassignment of the structural ensemble and completion of the rotamer sets.
  3. Calculation of vibrational frequencies and discarding structures corresponding to saddle points.

It employs the following arguments:

  • molecule: mandatory specification of the input molecule as a Molecule object.
  • solvent: optional specification of the solvent as a string. By default, the calculation is performed in vacuum.
  • create_tmpdir: set to True by default, so that the calculation is performed in a temporary directory that is deleted upon finishing the calculation.
  • force_field: it is set to True by default; set to False to perform the conformer search with GFN2-xTB instead.
  • freq_threshold: the frequency threshold, in cm−1 below which conformers are considered saddle points and thus discarded.
  • allow_mirror_rotamers: it is set to False by default, such that mirror images are considered to be different structures and not equivalent rotamers.
  • similarity_tolerances: the numerical tolerances for bond distances, bond angles, and dihedral angles for structures to be identified as equivalent.

The last argument is of type SimilarityTolerances, which is a data class containing default values for the absolute and relative tolerances for bond distances, bond angles, and dihedral angles to be considered equivalent in the structure-matching procedure. Please note that bond and dihedral angles are compared via the absolute value of the complex exponential of the respective angle difference, and the tolerances refer to those values.

It is important to emphasize that the function uses GFN-FF by default, as conformer searches are very expensive; in contrast, the wrapper functions for the xTB package use GFN2-xTB by default.

Output

The output is an instance of the ConformerEnsemble class.

  • Important methods of the class provide access to flat lists of all conformer structures and energies (get_conformer_structures, get_conformer_energies),

  • as well as the number of rotamers per conformer (get_degeneracies).

  • Detailed information on conformers with their individual rotamers is contained in the items of the conformers list. Each element of this list is of the Conformer type.

    • Apart from having the structure
    • and energy attributes,
    • it also has a list of atom_mappings dictionaries, which represent the atom permutations of the reference structure to the different rotamers.

Note that the outcome of calculations with CREST may differ somewhat, even if otherwise identical calculations are repeated. Moreover, the starting geometry affects the conformer search. However, the conformer refinement procedure vastly reduces this uncertainty.

Algorithm

Different workflows have been implemented for CREST over the years. It is worth noting that find_conformers calls CREST with the static metadynamics simulations algorithm, instead of the computationally less demanding extensive metadynamic sampling utility that is the default when running CREST directly.

Note also that the grouping of conformer and rotamer structures as determined in the CREST calculation is ignored, and all the structures are regrouped by our own procedure as follows:

  • First, a plain list of structures is generated from all conformers and rotamers obtained from CREST.

  • Then, the structures are all compared among each other, grouping them by conformers purely based on their 3D structural parameters (bond lengths, bond angles, and dihedral angles).

  • If they are the same within a given numerical threshold, they are considered to be equivalent structures (rotamers), and the atomic mapping (or permutation) relative to the representative conformer is stored for each of the rotamers instead of all the geometries themselves.

  • Successively, the set of rotamer mappings is checked for completeness by multiplying all permutation matrices and verifying if the result is already present in the set of permutations. That way, a full set of rotamers is determined for each conformer as long as the initial set of structures is sufficiently representative.

  • As a last step of the procedure, structures corresponding to energetic saddle points are discarded.

Example

An example of a conformer/rotamer search, starting from a molecule name:

from hqs_molecules import PubChem, find_conformers, smiles_to_molecule

pc = PubChem.from_name("1-butene")
molecule = smiles_to_molecule(pc.smiles)
conf_data = find_conformers(molecule)

print(f"Found {len(conf_data)} distinct conformers.")  # or conf_data.num_conformers
print(f"Conformer 0 has {conf_data.get_degeneracies()[0]} rotamers.")

Important Data Structures

This section provides a short description of important data structures used for input and output in functions provided by HQS Molecules. Many of these classes are implemented as Pydantic models. Pydantic provides data validation and parsing using Python type annotations. By leveraging Pydantic, the package ensures that input and output data are correctly formatted and validated, reducing the likelihood of errors and improving the robustness of the software.

The objects described in this section are:

  • MolecularGeometry and Molecule representing molecules in 3D,
  • Molecular formulas MolecularFormula,
  • PubChem dataclass for the data from PubChem,
  • Trajectory and MolecularFrequencies representing output of quantum-mechanical calculations that cannot be stored in elementary data types.
  • ConformerEnsemble storing results of a conformer search: it is obtained by combining an initial search performed by CREST with a subsequent refinement of the conformer ensemble using techniques developed at HQS.

Representing Molecules in 3D

Representing Atomic Positions

The MolecularGeometry class contains atomic positions (in Å) and chemical element symbols. Objects of this type are commonly generated in HQS Molecules by reading an XYZ file. However, it lacks information on charge and spin multiplicity, which are typically needed for quantum-chemical calculations.

Important attributes of MolecularGeometry objects are

  • natoms (representing the number of atoms),
  • symbols (returning a list of chemical element symbols),
  • and positions (returning an N × 3 array of atomic positions).

Inspection of the class reveals further methods to update atomic positions and create copies of molecules, possibly with updated positions.

>>> from hqs_molecules import MolecularGeometry
>>> help(MolecularGeometry)

Internally, atoms are represented by a list of Atom objects. These are defined as named tuples containing the element symbol and the position, one tuple per atom. Note that this feature permits the atoms attribute to be used directly as input for PySCF calculations, as shown in the example below.

>>> from hqs_molecules import smiles_to_molecule
>>> from pyscf.gto import Mole
>>> hqs_mol = smiles_to_molecule("C=C")
>>> pyscf_mol = Mole(atom=hqs_mol.atoms)

Molecules with Charge and Spin

Molecule is one of the most important classes in the HQS Molecules package. It is implemented as a subclass of MolecularGeometry, with the addition of charge and multiplicity fields. Objects of Molecule type are commonly returned by functions performing 2D to 3D structure conversion. An additional attribute is nelectrons, containing the number of electrons corresponding to the molecular composition and charge.

Molecular formulas (such as H2O or OH) and molecular structure representations (such as SMILES strings or Molfiles) always contain the total molecular charge, explicitly or implicitly. Therefore, it is vital to preserve the total charge together with three-dimensional representations of molecular structures.

In addition to the charge, quantum-chemical calculations usually also require a specification of the spin multiplicity. Unlike the charge, it is not necessarily straightforward to infer from a molecular structure. Therefore, None is permitted as a value for the field. Indeed, functions such as smiles_to_molecule or molfile_to_molecule never set the field to an integer value themselves. Knowing the value of the spin multiplicity, the value can be set and validated for a Molecule object by using the set_multiplicity method.

>>> from hqs_molecules import smiles_to_molecule
>>> mol = smiles_to_molecule("CCO")
>>> print(mol.multiplicity)
None
>>> mol.set_multiplicity(1)
Molecule(atoms=[...], charge=0, multiplicity=1)
>>> print(mol.multiplicity)
1

Since the set_multiplicity method returns the object itself in addition to modifying it, calls such as mol = smiles_to_molecule("CCO").set_multiplicity(1) are possible.

Objects of type MolecularGeometry can be converted to Molecule instances using the to_molecule method, with the charge being mandatory and the multiplicity optional.

Molecular Formulas

Within HQS Molecules, molecular formulas are represented by MolecularFormula objects containing the elemental composition and the total charge. For example, formulas from PubChem are converted into this format:

>>> from hqs_molecules import PubChem
>>> pc = PubChem.from_name("Bicarbonate")
>>> pc.formula
MolecularFormula(natoms={'C': 1, 'H': 1, 'O': 3}, charge=-1)
>>> 

The class implements __str__ as a conversion of the formula to a string in Hill notation:

>>> f"{pc.formula}"
'CHO3-'

Users can easily create molecular formulas from a string input.

>>> from hqs_molecules import MolecularFormula
>>> formula = MolecularFormula.from_str("MnO4-")
>>> formula
MolecularFormula(natoms={'Mn': 1, 'O': 4}, charge=-1)
>>> 

The from_str constructor can handle some degree of complexity (for example, "CH3COOH" is interpreted equivalently to "C2H4O2"), but it cannot process arbitrarily complicated semi-structural formulas. Note that isomers cannot be distinguished, as they have identical elemental compositions.

Data from PubChem

Results from PubChem queries are stored within instances of the PubChem class. Unlike most other classes described in this section, it is implemented as a dataclass and not as a Pydantic model.

In practical use, instances of this class would normally be created using methods such as from_name or from_smiles. The retrieved data is stored in the fields of the class. A description can be found by executing:

>>> from hqs_molecules import PubChem
>>> help(PubChem)

Output of Quantum-Mechanical Calculations

Molecular Trajectories

Instances of the Trajectory class, as returned by geometry optimizations with xTB, contain two fields:

  • a list of Molecule objects that is labeled structures,
  • and the energies of each structure in a list labeled energies.

Convenience attributes are implemented for the following properties:

  • Obtaining the number of structures through the length attribute.
  • Obtaining the last structure and its energy via the attributes last and last_energy, respectively.
  • Identifying the structure with the lowest energy and accessing the structure, its energy and its position in the trajectory with the attributes lowest, lowest_energy and lowest_step, respectively.

Vibrational Frequencies

A generic representation of computed vibrational frequencies and basic thermochemical properties is contained within the class MolecularFrequencies. Instances contain

  • the total electronic energy (in the total_energy field)
  • and a list of vibrational frequencies (in the field frequencies).

Note that the latter are represented as (real) floating-point numbers; by convention, imaginary frequencies are represented as negative numbers.

Additionally, the MolecularFrequencies class defines

  • a field thermochem with a list of BasicThermochemistry objects.

In addition to vibrational frequencies, programs such as xTB can calculate thermodynamic contributions via a rigid rotor and harmonic oscillator approximation. These contributions are temperature-dependent (while harmonic frequencies are not). Therefore, thermochemical corrections are stored in a list with one item per temperature value. Each BasicThermochemistry object contains fields enthalpy, entropy, gibbs_energy, and temperature (representing the temperature used to evaluate the aforementioned properties). Since these quantities are interdependent, only enthalpy, entropy and temperature are stored explicitly, while the Gibbs energy is recomputed upon being accessed.

Note that the Hessian matrix itself is not represented in the MolecularFrequencies class.

Conformer Search Results

Structures and energies of conformers determined via CREST are stored by HQS Molecules in a class ConformerEnsemble, which contains a list of Conformer objects.

Note that the grouping of conformer and rotamer structures as determined in the CREST calculation is ignored, and all the structures are regrouped by our own procedure, as described in the section on conformer search.

Further information on the attributes of the respective classes can be accessed from within Python:

>>> from hqs_molecules import Conformer, ConformerEnsemble
>>> help(Conformer)
>>> help(ConformerEnsemble)

List of Available Solvents for xTB

By default, quantum-chemical calculations are performed in vacuum. If the name of a solvent is provided, the implicit solvation model will be used in the calculation. The xTB interface uses the ALPB (analytical linearized Poisson-Boltzmann) method for implicit solvation. The reference academic publication can be found here.

ALPB has been parametrized for the following solvents:

  • "Vacuum"
  • "Acetone"
  • "Acetonitrile"
  • "Aniline"
  • "Benzaldehyde"
  • "Benzene"
  • "Dioxane"
  • "Ethylacetate"
  • "Furane"
  • "Hexadecane"
  • "Nitromethane"
  • "Octanol"
  • "Octanol (wet)"
  • "Phenol"
  • "Dichlormethane"
  • "Chloroform"
  • "CS2"
  • "DMSO"
  • "Ether"
  • "Water"
  • "Methanol"
  • "THF"
  • "Toluene"
  • "DMF"
  • "Ethanol"
  • "Hexane"

The solvent names as given in quotation marks can be provided in the solvent argument of the xTB/CREST wrapper functions implemented in HQS Molecules.

Background Information on Structure Conversion

This section provides further information on:

  • Background,
  • Distance geometry,
  • Metric matrix method,
  • Distance bounds matrix,
  • RDKit Variants.

Background

The given list of references provides a summary of more technical aspects of the 3D structure conversion. More details can be found in the following sources:

  • J. M. Blaney, J. S. Nixon, Distance Geometry in Molecular Modeling. Rev. Comput. Chem. 1994, 5, 299–335 (DOI:10.1002/9780470125823.ch6).
  • S. Riniker, G. A. Landrum, Better Informed Distance Geometry: Using What We Know To Improve Conformation Generation. J. Chem. Inf. Model. 2015, 55, 2562–2574 (DOI:10.1021/acs.jcim.5b00654).
  • S. Wang, J. Witek, G. A. Landrum, S. Riniker, Improving Conformer Generation for Small Rings and Macrocycles Based on Distance Geometry and Experimental Torsional-Angle Preferences. J. Chem. Inf. Model. 2020, 60, 2044–2058 (DOI:10.1021/acs.jcim.0c00025).
  • RDKit documentation

Distance Geometry

The structure conversion routine of HQS Molecules makes use of the open-source cheminformatics software RDKit, whose source code is available on GitHub. Given the SMILES string or Molfile, an RDKit molecule object is created and hydrogen atoms are added explicitly if needed. Assuming a valid SMILES string is provided in the variable smiles, the most basic steps are shown in the following code snipped:

from rdkit.Chem import AllChem

mol = AllChem.MolFromSmiles(smiles)
molH = AllChem.AddHs(mol)

status = AllChem.EmbedMolecule(molH)

As can be seen in the code above, in a first step a molecule object is created (MolFromSmiles function), which contains 2D information about the molecule, such as the connectivity. Since most hydrogen atoms are implicit in skeletal formulas and not included in the 2D structure, but are indispensable for a 3D representation of the molecule, they need to be added in a second step (AddHs function). However, the central function for generating 3D structures is the EmbedMolecule, which returns 0 if the structure was generated successfully, and -1 otherwise. This function computes a "3D embedding" based on variants of a stochastic approach referred to as distance geometry (DG). DG differs from standard Monte Carlo or molecular dynamics techniques in that it directly generates structures to satisfy the geometric constraints of the model instead of searching (almost) the entire conformational space to find appropriate structures. As such, it can rapidly find one or more possible solutions but shares with other random methods the inability to guarantee finding all of them.

Metric Matrix Method

Distance geometry (DG) approach described above is based on the assumption that purely geometric constraints can describe all possible conformations of a molecule, for which upper and lower distance bounds for all pairs of atoms are determined and represented in a distance bounds matrix. However, the key element of the DG technique is the metric matrix, \(\mathbf{G}\), whose elements \(g_{ij}\) can be calculated from the vector product of the coordinates of atoms \(i\) and \(j\), \[ \mathbf{G} = \mathbf{XX}^\text{T} , \] where \(\mathbf{X}\) is a matrix containing the Cartesian coordinates of the atoms. \(\mathbf{G}\) is a square symmetric matrix, and as such it can be decomposed as \[ \mathbf{G} = \mathbf{VL}^2\mathbf{V}^\text{T} , \] where the elements of the diagonal matrix \(\mathbf{L}^2\) are the eigenvalues and the columns of \(\mathbf{V}\) are the eigenvectors of \(\mathbf{G}\). From the last two equations and the fact that \(\mathbf{L}\) has only diagonal entries it can be seen that \[ \mathbf{X} = \mathbf{VL} . \] This means that the Cartesian coordinates \(\mathbf{X}\) can be regenerated by multiplying the square root \(\mathbf{L}\) of the eigenvalues of \(\mathbf{G}\) with the eigenvectors \(\mathbf{V}\). It has thus been shown that it is possible to convert three-dimensional atomic coordinates to the metric matrix \(\mathbf{G}\) and then regenerate them from it.

A crucial aspect in DG is that \(\mathbf{G}\) can also be derived directly from the distance matrix, \(\mathbf{D}\), in which each component \(d_{ij}\) is the distance between atoms \(i\) and \(j\). To this end, \(\mathbf{D}\) is first converted to \(\mathbf{D}_o\), where each element \(d_{io}\) is the distance between atom \( i \) and the center of mass \(o\), \[ d_{io}^2 = \frac1N \sum_{j=1}^{N} d_{ij}^2 - \frac{1}{N^2} \sum_{j=2}^{N} \sum_{k=1}^{j-1} d_{jk}^2 , \] where \(N\) is the number of atoms. \(\mathbf{G}\) can then be calculated using the cosine rule, \( g_{ij} = ( d_{io}^2 + d_{jo}^2 - d_{ij}^2 ) / 2 \), and as seen before, the Cartesian coordinates \(\mathbf{X}\) can be obtained from \(\mathbf{G}\).

Distance Bounds Matrix

Therefore, in order to get conformations of a molecule, it is necessary to provide a distance matrix, \(\mathbf{D}\). Its components can be derived from the distance bounds matrix, which, as stated previously, contains upper and lower distance bounds for all pairs of atoms in a given molecule, determining the geometric constraints of the DG technique. But how can one define those bounds?

For example, imagine a square-shaped molecule with bond lengths of 1.0 Å. The upper and lower bounds for the neighboring atoms can thus be set to 1.0. Assuming that the atoms have van der Waals (vdW) radii of 0.25 Å, then all other lower bounds can be set to 0.5 (the default lower bound between any two atoms is the sum of of their vdW radii). However, the remaining upper bounds are not known, so a default of 100 is commonly chosen. This completes the entries of the distance bounds matrix, and a distance matrix \(\mathbf{D}\) can be generated from it by choosing distances randomly between the corresponding lower and upper bounds and entering them at appropriate places in \(\mathbf{D}\).

However, the upper bound of 100 is problematic since most distances between 0.5 and 100 Å are geometrically impossible given the other distances. Here, one can make use of the triangle inequality relationship, which states that for three points A, B, and C, distance AC can at most be AB + BC, meaning that the length of one side of a triangle must be less than or equal to the sum of the lengths of the other two sides. The inverse triangle inequality states that the minimum distance AC must be at least |ABBC|.

While these relationships are valid for exact distances, very similar ones hold for upper and lower bounds. For example, applying the triangle inequality relationship to the three atoms A, B, and D of the square molecule indicates that the upper bound between D and B can be at most 2.0 Å, so the default value of 100 is lowered significantly. In general, the triangle inequality lowers some upper bounds, and the inverse triangle inequality raises some lower bounds.

RDKit Variants

The RDKit variants of the Distance Geometry (DG) approach are referred to as "experimental-torsion distance geometry" (ETDG), which includes experimental torsional-angle preferences, and "ETKDG," with additional "basic knowledge" (K) terms such as flat aromatic rings or linear triple bonds. ETKDG performs the following steps:

  1. Build a distance bounds matrix based on the topology, including 1–5 distances, but without van der Waals scaling.
  2. Triangle smooth this bounds matrix.
  3. If step 2 fails, repeat step 1, this time without 1–5 bounds but with vdW scaling, then repeat step 2.
  4. Pick a distance matrix at random using the bounds matrix.
  5. Compute initial coordinates from the distance matrix.
  6. Repeat steps 3 and 4 until the maximum number of iterations is reached or embedding is successful.
  7. Adjust the initial coordinates by minimizing a "distance violation" error function.

If the molecule consists of multiple fragments, they are embedded separately by default, which means that they are likely to occupy the same region of space.