Important Data Structures

This section briefly describes the important data structures used for input and output by the functions in HQS Molecules. Many of these classes are implemented as Pydantic models, which parse and validate data using Python type annotations. As a result, input and output data are checked for correct types and formatting when objects are created, so malformed data is caught early.

The objects described in this section are:

  • MolecularGeometry and Molecule representing molecules in 3D,
  • MolecularFormula representing molecular formulas,
  • the PubChem dataclass for data retrieved from PubChem,
  • Trajectory and VibrationalAnalysis representing output of quantum-mechanical calculations that cannot be stored in elementary data types,
  • and ConformerEnsemble storing the results of a conformer search, which combines an initial search performed by CREST with a subsequent refinement of the conformer ensemble using techniques developed at HQS.

Representing Molecules in 3D

Representing Atomic Positions

The MolecularGeometry class contains atomic positions (in Å) and chemical element symbols. Objects of this type are commonly generated in HQS Molecules by reading an XYZ file. However, the class holds no information on charge or spin multiplicity, both of which are typically needed for quantum-chemical calculations.

Important attributes of MolecularGeometry objects are

  • natoms (representing the number of atoms),
  • symbols (returning a list of chemical element symbols),
  • and positions (returning an N × 3 array of atomic positions).

Inspection of the class reveals further methods to update atomic positions and create copies of molecules, possibly with updated positions.

from hqs_molecules import MolecularGeometry
help(MolecularGeometry)

Internally, atoms are represented by a list of Atom objects. These are defined as named tuples containing the element symbol and the position, one tuple per atom. Note that this feature permits the atoms attribute to be used directly as input for PySCF calculations, as shown in the example below.

from hqs_molecules import smiles_to_molecule
from pyscf.gto import Mole
hqs_mol = smiles_to_molecule("C=C")
pyscf_mol = Mole(atom=hqs_mol.atoms)
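To make the layout of such a list concrete, here is a minimal, library-independent sketch that turns XYZ-format text into (symbol, position) named tuples and derives the number of atoms, the symbols and the positions from them. The Atom class and the parse_xyz helper are illustrative stand-ins written for this example; they are not the HQS Molecules implementation, and the water coordinates are arbitrary sample values.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    """Illustrative stand-in: element symbol plus Cartesian position in Å."""

    symbol: str
    position: tuple[float, float, float]


def parse_xyz(text: str) -> list[Atom]:
    """Parse XYZ text: atom count, comment line, then one atom per line."""
    lines = text.strip().splitlines()
    natoms = int(lines[0])
    atoms = []
    # Line 0 is the atom count and line 1 the comment, so atoms start at line 2.
    for line in lines[2 : 2 + natoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append(Atom(symbol, (float(x), float(y), float(z))))
    if len(atoms) != natoms:
        raise ValueError("fewer atom lines than announced in the header")
    return atoms


WATER_XYZ = """3
water (sample coordinates)
O 0.000 0.000 0.117
H 0.000 0.757 -0.470
H 0.000 -0.757 -0.470
"""

atoms = parse_xyz(WATER_XYZ)
natoms = len(atoms)
symbols = [atom.symbol for atom in atoms]
positions = [atom.position for atom in atoms]
print(natoms, symbols)
```

Because each entry is a plain (symbol, (x, y, z)) tuple, a list of this shape matches the atom specification that PySCF accepts.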

Molecules with Charge and Spin

Molecule is one of the most important classes in the HQS Molecules package. It is implemented as a subclass of MolecularGeometry, with the addition of charge and multiplicity fields. Objects of Molecule type are commonly returned by functions performing 2D to 3D structure conversion. An additional attribute is nelectrons, containing the number of electrons corresponding to the molecular composition and charge.

Molecular formulas (such as H2O or OH-) and molecular structure representations (such as SMILES strings or Molfiles) always contain the total molecular charge, explicitly or implicitly. Therefore, it is vital to preserve the total charge together with three-dimensional representations of molecular structures.

In addition to the charge, quantum-chemical calculations usually also require the spin multiplicity to be specified. Unlike the charge, the multiplicity cannot always be inferred from a molecular structure. Therefore, None is permitted as a value for the multiplicity field. Indeed, functions such as smiles_to_molecule or molfile_to_molecule never set the field to an integer value themselves. Once the spin multiplicity is known, it can be set and validated for a Molecule object using the set_multiplicity method.

from hqs_molecules import smiles_to_molecule
mol = smiles_to_molecule("CCO")
print(mol.multiplicity)
# None
mol.set_multiplicity(1)
print(mol.multiplicity)
# 1

Since the set_multiplicity method returns the object itself in addition to modifying it, calls such as mol = smiles_to_molecule("CCO").set_multiplicity(1) are possible.
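The relationship between composition, charge, electron count and admissible multiplicities can be sketched without the library. The names ATOMIC_NUMBERS, count_electrons and is_valid_multiplicity below are hypothetical, and the parity rule is a standard consistency check that is assumed here for illustration; the source does not state which checks set_multiplicity actually performs.

```python
# Atomic numbers for the few elements used in this example only.
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8}


def count_electrons(symbols: list[str], charge: int) -> int:
    """Electrons = sum of nuclear charges minus the total molecular charge."""
    return sum(ATOMIC_NUMBERS[symbol] for symbol in symbols) - charge


def is_valid_multiplicity(nelectrons: int, multiplicity: int) -> bool:
    """Check that 2S + 1 is compatible with the number of electrons.

    A multiplicity M implies M - 1 unpaired electrons; the remaining
    electrons must be able to form pairs.
    """
    unpaired = multiplicity - 1
    if multiplicity < 1 or unpaired > nelectrons:
        return False
    return (nelectrons - unpaired) % 2 == 0


# Ethanol, C2H6O, neutral: 2*6 + 6*1 + 8 = 26 electrons.
ethanol = ["C", "C", "O"] + ["H"] * 6
nelectrons = count_electrons(ethanol, charge=0)
print(nelectrons, is_valid_multiplicity(nelectrons, 1))
```

Under this rule, a closed-shell singlet is admissible for ethanol, whereas a doublet is not, since 26 electrons cannot leave exactly one electron unpaired.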

Objects of type MolecularGeometry can be converted to Molecule instances using the to_molecule method, with the charge being mandatory and the multiplicity optional.

Molecular Formulas

Within HQS Molecules, molecular formulas are represented by MolecularFormula objects containing the elemental composition and the total charge. For example, formulas from PubChem are converted into this format:

from hqs_molecules import PubChem
pc = PubChem.from_name("Bicarbonate")
print(pc.formula.model_dump())
# {'natoms': {'C': 1, 'H': 1, 'O': 3}, 'charge': -1}

The class implements __str__ as a conversion of the formula to a string in Hill notation:

print(pc.formula)
# CHO3-
#
# equivalent to:
print(str(pc.formula))
print(f"{pc.formula}")
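For readers unfamiliar with Hill notation, the ordering rule can be sketched as follows: if carbon is present, carbon comes first, hydrogen second, and all other elements follow alphabetically; without carbon, all elements are sorted alphabetically. The hill_string function is a hypothetical stand-in, not the library's __str__ implementation, and writing the charge as repeated sign characters is an assumption of this sketch.

```python
def hill_string(natoms: dict[str, int], charge: int = 0) -> str:
    """Format an elemental composition in Hill order with a charge suffix."""
    if "C" in natoms:
        # Carbon first, then hydrogen (if present), then the rest A to Z.
        rest = sorted(s for s in natoms if s not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in natoms else []) + rest
    else:
        # No carbon: strictly alphabetical, hydrogen included.
        order = sorted(natoms)
    body = "".join(
        symbol + (str(natoms[symbol]) if natoms[symbol] > 1 else "")
        for symbol in order
    )
    # Assumption of this sketch: one sign character per unit of charge.
    sign = "+" if charge > 0 else "-"
    return body + sign * abs(charge)


bicarbonate = hill_string({"C": 1, "H": 1, "O": 3}, charge=-1)
ethanol = hill_string({"H": 6, "C": 2, "O": 1})
water = hill_string({"O": 1, "H": 2})
print(bicarbonate, ethanol, water)
```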

Users can easily create molecular formulas from a string input.

from hqs_molecules import MolecularFormula
formula = MolecularFormula.from_str("MnO4-")
print(formula.model_dump())
# {'natoms': {'Mn': 1, 'O': 4}, 'charge': -1}

The from_str constructor can handle some degree of complexity (for example, "CH3COOH" is interpreted equivalently to "C2H4O2"), but it cannot process arbitrarily complicated semi-structural formulas. Note that isomers cannot be distinguished, as they have identical elemental compositions.
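The kind of parsing involved can be illustrated with a small regular-expression sketch that sums repeated element symbols, which is why "CH3COOH" and "C2H4O2" yield the same composition. The parse_formula function is a hypothetical illustration and not the from_str implementation. It makes two simplifying assumptions: the charge is written as trailing sign characters only (so that the "4" in "MnO4-" is read as a count of oxygen atoms), and parentheses are not supported.

```python
import re
from collections import Counter


def parse_formula(text: str) -> tuple[dict[str, int], int]:
    """Return (elemental composition, charge) for a simple formula string."""
    # Split off trailing sign characters: "MnO4-" -> body "MnO4", signs "-".
    body = text.rstrip("+-")
    signs = text[len(body):]
    charge = signs.count("+") - signs.count("-")

    # The body must consist solely of element symbols with optional counts.
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", body):
        raise ValueError(f"unsupported formula: {text!r}")

    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", body):
        # A missing count means one atom; repeated symbols are summed.
        counts[symbol] += int(number) if number else 1
    return dict(sorted(counts.items())), charge


acetic_acid = parse_formula("CH3COOH")
permanganate = parse_formula("MnO4-")
print(acetic_acid, permanganate)
```

Since only element counts survive, structural information is discarded at this step, which is the reason isomers become indistinguishable.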

Once a Molecule object has been created, for instance using the smiles_to_molecule function described above, its molecular formula can be obtained using the MolecularFormula.from_mol class method.

Data from PubChem

Results from PubChem queries are stored within instances of the PubChem class. Unlike most other classes described in this section, it is implemented as a dataclass and not as a Pydantic model.

In practice, instances of this class are normally created using methods such as from_name or from_smiles. The retrieved data is stored in the fields of the class; a description of the fields can be displayed by executing:

from hqs_molecules import PubChem
help(PubChem)

Output of Quantum-Mechanical Calculations

Molecular Trajectories

Instances of the Trajectory class, as returned by geometry optimizations with xTB, contain two fields:

  • structures, a list of Molecule objects,
  • and energies, a list containing the energy of each structure.

Convenience attributes are implemented for the following properties:

  • Obtaining the number of structures through the length attribute.
  • Obtaining the last structure and its energy via the attributes last and last_energy, respectively.
  • Identifying the structure with the lowest energy and accessing the structure, its energy and its position in the trajectory with the attributes lowest, lowest_energy and lowest_step, respectively.
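The convenience attributes listed above can be pictured with a small stand-in class. TrajectorySketch is hypothetical and only mirrors the attribute names given in this section; plain strings take the place of Molecule objects, the energies are arbitrary sample values, and counting steps from zero is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class TrajectorySketch:
    """Illustrative stand-in with two parallel lists of equal length."""

    structures: list
    energies: list

    @property
    def length(self) -> int:
        return len(self.structures)

    @property
    def last(self):
        return self.structures[-1]

    @property
    def last_energy(self) -> float:
        return self.energies[-1]

    @property
    def lowest_step(self) -> int:
        # Index of the minimum energy (zero-based in this sketch).
        return min(range(len(self.energies)), key=self.energies.__getitem__)

    @property
    def lowest(self):
        return self.structures[self.lowest_step]

    @property
    def lowest_energy(self) -> float:
        return self.energies[self.lowest_step]


traj = TrajectorySketch(
    structures=["step0", "step1", "step2"],
    energies=[-5.05, -5.07, -5.06],
)
print(traj.length, traj.lowest_step, traj.lowest, traj.last_energy)
```

The example also shows why last and lowest are distinct: the final structure of an optimization is not necessarily the one with the lowest energy.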

Vibrational Frequencies

A generic representation of computed vibrational modes and basic thermochemical properties is contained within the class VibrationalAnalysis. Instances contain

  • a list of vibrational modes (in the modes field)
  • and the nuclear Hessian (in the hessian field). The latter may be an empty list if the Hessian matrix is missing.

Please note that the normal modes and frequencies stored in the object amount to 3N − 6 (or 3N − 5 for linear molecules, where N is the number of atoms), rather than 3N.
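As a quick check of the expected number of entries, the count can be written as a small helper. The function n_vibrational_modes is a hypothetical illustration and not part of HQS Molecules; it treats a single atom as having no vibrational modes and any diatomic molecule as linear.

```python
def n_vibrational_modes(natoms: int, is_linear: bool = False) -> int:
    """Number of vibrational modes left after removing translations and rotations."""
    if natoms < 1:
        raise ValueError("at least one atom is required")
    if natoms == 1:
        # A single atom has translational degrees of freedom only.
        return 0
    # Two atoms always lie on a line, whatever the caller passed in.
    linear = is_linear or natoms == 2
    return 3 * natoms - (5 if linear else 6)


water_modes = n_vibrational_modes(3)                  # bent triatomic
co2_modes = n_vibrational_modes(3, is_linear=True)    # linear triatomic
print(water_modes, co2_modes)
```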

Each entry in the modes field is of type VibrationalMode, which contains fields for

  • the vibrational frequency,
  • the Cartesian normal-mode displacements,
  • the reduced mass associated with the normal mode in reduced_mass,
  • and the intensity of a normal mode excitation (in the ir_intensity field).

The values of the fields reduced_mass and ir_intensity may be None, which is appropriate if the respective values are not available as part of the vibrational analysis.

Convenience properties of the VibrationalAnalysis class give access to

  • a list of vibrational frequencies,
  • a list of all Cartesian displacements,
  • a list of the reduced masses of the normal modes (reduced_masses), or an empty list if the values are undefined,
  • a list of all infrared intensities (ir_intensities), or an empty list if the values are undefined,
  • the is_linear flag, which indicates whether a molecule is linear and thus has 3N − 5 vibrational degrees of freedom instead of 3N − 6,
  • and the number of atoms (natoms).

Note that the frequencies are represented as (real) floating-point numbers; by convention, imaginary frequencies are represented as negative numbers. An empty modes list is assumed to imply a system with one atom, while no special provisions are made for an empty system with zero atoms.
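Two of the conventions above lend themselves to a short sketch: recovering the number of atoms from the number of stored modes, and picking out imaginary frequencies by their negative sign. Both helper functions are hypothetical and show one way in which such properties could be derived; they are not the library's implementation, and the frequency values are arbitrary.

```python
def natoms_from_modes(nmodes: int, is_linear: bool) -> int:
    """Invert nmodes = 3N - 6 (or 3N - 5 for linear molecules) to obtain N."""
    if nmodes == 0:
        # Convention from the text: no modes implies a single atom.
        return 1
    total = nmodes + (5 if is_linear else 6)
    if total % 3 != 0:
        raise ValueError("mode count is inconsistent with the linearity flag")
    return total // 3


def imaginary_frequencies(frequencies: list[float]) -> list[float]:
    """Select imaginary frequencies, which are stored as negative numbers."""
    return [value for value in frequencies if value < 0]


# Sample values resembling a transition state of a bent triatomic molecule.
ts_frequencies = [-512.3, 310.0, 1650.2]
n_atoms = natoms_from_modes(len(ts_frequencies), is_linear=False)
imaginary = imaginary_frequencies(ts_frequencies)
print(n_atoms, imaginary)
```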

In addition to vibrational frequencies, programs such as xTB can calculate thermodynamic contributions using the rigid-rotor and harmonic-oscillator approximation. These contributions are temperature-dependent, whereas the harmonic frequencies are not. Therefore, thermochemical corrections are stored in a Thermochemistry object, which provides enthalpy, entropy, gibbs_energy, and temperature (the temperature at which the other properties were evaluated). Since these quantities are interdependent, only enthalpy, entropy and temperature are stored explicitly, while the Gibbs energy is recomputed whenever it is accessed.

Although it does not depend on temperature, the electronic energy is also stored, in the energy field. The update_energy method replaces the electronic energy and recomputes the thermodynamic properties accordingly. This is useful for combining a high-level single-point energy with a lower-level frequency calculation.
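The interplay of stored and derived quantities can be sketched with a small stand-in class that uses the relation G = H − T·S. ThermochemistrySketch is hypothetical. It assumes that the stored enthalpy includes the electronic energy, so that updating the energy shifts the enthalpy by the same amount; the source does not spell out this detail, nor the units, and the numbers below are arbitrary sample values in Hartree and Hartree per Kelvin.

```python
from dataclasses import dataclass


@dataclass
class ThermochemistrySketch:
    """Illustrative stand-in: the Gibbs energy is derived, not stored."""

    energy: float       # electronic energy
    enthalpy: float     # assumed to include the electronic energy
    entropy: float
    temperature: float  # temperature at which H and S were evaluated

    @property
    def gibbs_energy(self) -> float:
        # Recomputed on every access from the stored quantities.
        return self.enthalpy - self.temperature * self.entropy

    def update_energy(self, new_energy: float) -> None:
        # Keep the thermal correction (H - E) and swap the electronic part.
        self.enthalpy += new_energy - self.energy
        self.energy = new_energy


thermo = ThermochemistrySketch(
    energy=-76.40, enthalpy=-76.375, entropy=7.2e-5, temperature=298.15
)
gibbs_before = thermo.gibbs_energy

# Replace the electronic energy with a (hypothetical) higher-level value.
thermo.update_energy(-76.43)
gibbs_after = thermo.gibbs_energy
print(gibbs_before, gibbs_after)
```

Because the Gibbs energy is a property rather than a stored field, it cannot become inconsistent with the enthalpy, entropy and temperature after such an update.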

Conformer Search Results

Structures and energies of conformers determined via CREST are stored by HQS Molecules in a class ConformerEnsemble, which contains a list of Conformer objects.

Note that the grouping of conformer and rotamer structures as determined in the CREST calculation is ignored, and all the structures are regrouped by our own procedure, as described in the section on conformer search.

Further information on the attributes of the respective classes can be accessed from within Python:

from hqs_molecules import Conformer, ConformerEnsemble
help(Conformer)
help(ConformerEnsemble)