Representation in Python of molecular data structures

Inside the hqs_nmr_parameters package, the Pydantic MolecularData class has been implemented to describe a molecule. It contains the following attributes:

  • name: The name of the molecule. When name is not provided in the YAML input, the name of the YAML file is used.
  • isotopes: List containing pairs of an atom index and the associated isotope. Atom indices are associated with the order in which atoms appear in the chemical structure representations, starting by index 0.
  • shifts: List containing pairs of an atom index and the associated chemical shift.
  • j_couplings: List containing pairs of atomic indices and the associated J-coupling values. Note that atom index pairs are unique: if a value is provided for an atom pair (k, l), then no value is provided for pair (l, k).
  • structures: Contains a dictionary with chemical structure representations. It accepts entries for a SMILES string ("SMILES"), a Molfile ("Molfile"), and an XYZ file ("XYZ"). Each value will correspond to a ChemicalStructure object. See below a full explanation of the chemical structure representations and the atom numeration.
  • formula: The molecular formula of the molecule.
  • temperature: An optional temperature definition.
  • solvent: Name of the solvent. An empty string represents an unknown or undefined solvent, or the absence of a solvent.
  • description: Optional further information.
  • method_json: Stores a JSON serialization of computational method settings. An empty string indicates that the field is not applicable. Creating and interpreting the content is the responsibility of the user of the model.

This is best illustrated by an example from one of our available data sets:

from hqs_nmr_parameters.examples import molecules

parameters = molecules["C2H5OH"]

print(type(parameters))
print("Chemical name:", parameters.name)
print("Molecular Formula:", parameters.formula)
print("Type(s) of representations:", parameters.structures.keys())
print("Information about solvent:", parameters.solvent)
print("Shifts:", parameters.shifts)
print("###")
print(parameters.description)
<class 'hqs_nmr_parameters.code.data_classes.MolecularData'>
Chemical name: Ethanol
Molecular Formula: C2H6O
Type(s) of representations: dict_keys(['SMILES'])
Information about solvent: chloroform
Shifts: [(3, 1.25), (4, 1.25), (5, 1.25), (6, 3.72), (7, 3.72), (8, 1.32)]
###
1H parameters for ethanol in CDCl3.
Shifts from: https://doi.org/10.1021/om100106e
J-couplings estimated.

In this case, we are getting information for the ethanol molecule, for which we have stored a SMILES representation (check below for details) and according to the description experimental 1H-NMR parameters in chloroform.

NMR parameters subset

The best way to check the NMR data is getting only the information necessary to perform an NMR calculation, i.e., a subset of parameters with the attributes isotopes, shifts, and j_couplings. Those can be obtained using the spin_system function of MolecularData:

from pprint import pprint
from hqs_nmr_parameters.examples import molecules

parameters = molecules["C2H5OH"]
nmr_parameters = parameters.spin_system()

print(type(nmr_parameters))
pprint(nmr_parameters.model_dump())
<class 'hqs_nmr_parameters.code.data_classes.NMRParameters'>
{'isotopes': [(1, 'H'), (1, 'H'), (1, 'H'), (1, 'H'), (1, 'H'), (1, 'H')],
 'j_couplings': [((0, 3), 7.0),
                 ((0, 4), 7.0),
                 ((1, 3), 7.0),
                 ((1, 4), 7.0),
                 ((2, 3), 7.0),
                 ((2, 4), 7.0)],
 'shifts': [1.25, 1.25, 1.25, 3.72, 3.72, 1.32]}

As we can see, nmr_parameters is an instance of the Pydantic NMRParameters class, where the active nuclei have been renumbered starting from 0. It is used as input for several functions to calculate an NMR spectrum. To perform an NMR simulation, we can do the following:

from hqs_nmr_parameters.examples import molecules
from hqs_nmr import calculate_spectrum

parameters = molecules["C2H5OH"]
nmr_parameters = parameters.spin_system()
print(type(nmr_parameters))  # NMRParameters
spectrum = calculate_spectrum(molecule_parms=nmr_parameters, frequency=400.0)
print(type(spectrum)) # Spectrum
<class 'hqs_nmr_parameters.code.data_classes.NMRParameters'>
<class 'hqs_nmr.datatypes.Spectrum'>

For getting a better insight into NMRParameters and the output object Spectrum have a look here.

Isotopes/atoms selection

MolecularData objects can contain data for several isotopes. However, we are often interested in creating a spin system for a given nuclei only. For example, if we want to simulate a 1H-NMR spectrum but we also have 13C-NMR parameters in our data, it is possible either to select only 1H using keep_nuclei(isotopes=["1H"]) or to drop 13C using drop_nuclei(isotopes=["13C"]). Let us take a look at this in more detail:

from pprint import pprint
from hqs_nmr_parameters.examples import molecules

parameters = molecules["CH3Cl_13C"]
pprint(parameters.isotopes)

drop_C = parameters.drop_nuclei(isotopes=["13C"])
keep_H = parameters.keep_nuclei(isotopes=["1H"])
print(drop_C == keep_H)

nmr_parameters_1H = drop_C.spin_system()
pprint(nmr_parameters_1H.model_dump())
[(0, Isotope(mass_number=13, symbol='C')),
 (2, Isotope(mass_number=1, symbol='H')),
 (3, Isotope(mass_number=1, symbol='H')),
 (4, Isotope(mass_number=1, symbol='H'))]
True
{'isotopes': [(1, 'H'), (1, 'H'), (1, 'H')],
 'j_couplings': [((0, 1), -10.8), ((0, 2), -10.8), ((1, 2), -10.8)],
 'shifts': [3.05, 3.05, 3.05]}

We should take into account that the dropping/keeping of isotopes needs to be done carefully. The 13C nucleus has a nuclear spin of 1/2 (as 1H) and is NMR-activate but has a very low natural abundance, and the 13C-1H coupling pattern is only barely visible in 1H-NMR spectra. Therefore, it is common to simulate only 13 C-decoupled 1H-NMR spectra. However, the coupling of 1H with other isotopes such as 19F or 31P should not be ignored in order to get the correct peak pattern.

keep_nuclei can also take an atoms argument, which accepts a list of atom indices for which the NMR parameters are to be kept. In the case of drop_nuclei, specified nuclei can be dropped via atoms, but a keep_atoms argument prevents certain atoms from being dropped (useful in combination with isotopes).

Structure representations

As shown above a MolecularData object has the attribute structures, which is a dictionary that can accept three entries (keys): "SMILES", "Molfile" and "XYZ". The value of each of these entries is a ChemicalStructure object that stores chemical structure representations of the type defined in the key:

  • SMILES (Simplified Molecular-Input Line-Entry System): Strings representing the connectivity of all non-hydrogen atoms in a molecule. They become laborious to understand for complex molecules.
  • Molfiles: Text files with a two-dimensional (2D) structure of a molecule (skeletal representation). It could omit hydrogens. The number of hydrogen atoms is inferred from the atomic valencies, of the heavy atoms.
  • XYZ files: Text files containing as first line an integer with the total number of atoms, followed by a comment line, and then atomic positions and chemical element symbols for all the atoms explicitly.

Connectivity and charge information can be extracted from SMILES strings or Molfiles, but not from XYZ files. More information about Molfiles and Strings can be found here.

The attributes of the ChemicalStructure class are:

  • representation: Type of chemical structure representation present in the content attribute. It could be "SMILES", "Molfile" or "XYZ".
  • content: It corresponds to the raw structure representation (as an string), i.e., a SMILES string or the content of an XYZ file or a Molfile.
  • charge: Charge of the molecule.
  • symbols: List of chemical element symbols for the full set of atoms.
  • atom_map: As we have seen, 2D representations can omit hydrogens, this attribute contains information about how the full representation would map into this reduced representation. It is a list of the length of the molecule (full set of atoms), where the atom counting starts from zero and takes into account:
    • Each non-hydrogen atom maps onto the respective index in the reduced representation.
    • Each explicit hydrogen maps onto the index of its explicitly represented counterpart.
    • Each implicit hydrogen maps onto the reduced index of the respective backbone atom.

In the case of ethanol, a full structures representation (dictionary saved here in the variable ethanol_representation) may look like:

from hqs_nmr_parameters import ChemicalStructure

ethanol_representation = {
"XYZ": ChemicalStructure(representation="XYZ", content="9\n\nC       -0.88708900     0.17506400    -0.01253500\nC        0.46048900    -0.51551600    -0.04653500\nO        1.44296500     0.30726700     0.56557200\nH       -0.84747800     1.12776800    -0.55081700\nH       -1.65878200    -0.45332700    -0.46584200\nH       -1.17694400     0.40367600     1.01830600\nH        0.76871200    -0.72432800    -1.07546000\nH        0.41948600    -1.46207300     0.50017700\nH        1.47864000     1.14146800     0.06713500\n", charge=0, symbols=["C", "C", "O", "H", "H", "H", "H", "H", "H"], atom_map=[0, 1, 2, 3, 4, 5, 6, 7, 8]),
"Molfile": ChemicalStructure(representation="Molfile", content="\n     RDKit          2D\n\n  0  0  0  0  0  0  0  0  0  0999 V3000\nM  V30 BEGIN CTAB\nM  V30 COUNTS 3 2 0 0 0\nM  V30 BEGIN ATOM\nM  V30 1 C -1.299038 -0.250000 0.000000 0\nM  V30 2 C 0.000000 0.500000 0.000000 0\nM  V30 3 O 1.299038 -0.250000 0.000000 0\nM  V30 END ATOM\nM  V30 BEGIN BOND\nM  V30 1 1 1 2 CFG=3\nM  V30 2 1 2 3 CFG=3\nM  V30 END BOND\nM  V30 END CTAB\nM  END\n", charge=0, symbols=["C", "C", "O", "H", "H", "H", "H", "H", "H"], atom_map=[0, 1, 2, 0, 0, 0, 1, 1, 2]),
"SMILES": ChemicalStructure(representation="SMILES", content="CCO", charge=0, symbols=["C", "C", "O", "H", "H", "H", "H", "H", "H"], atom_map=[0, 1, 2, 0, 0, 0, 1, 1, 2])
}

Each of the entries corresponds to a ChemicalStructure object:

from pprint import pprint

smiles_representation = ethanol_representation["SMILES"]
print(type(smiles_representation)) # ChemicalStructure
pprint(smiles_representation.model_dump())
<class 'hqs_nmr_parameters.code.data_classes.ChemicalStructure'>
{'atom_map': [0, 1, 2, 0, 0, 0, 1, 1, 2],
 'charge': 0,
 'content': 'CCO',
 'representation': 'SMILES',
 'symbols': ['C', 'C', 'O', 'H', 'H', 'H', 'H', 'H', 'H']}

ethanol_representation["Molfile"].atom_map and ethanol_representation["SMILES"].atom_map will return:

[0, 1, 2, 0, 0, 0, 1, 1, 2]

In both cases, only the carbon and the oxygen atoms have been indicated explicitly. Therefore, the first three atoms are listed expressly (carbon atoms 0 and 1, and oxygen atom 2, i.e. the atom counting starts from zero) and the six following hydrogens are assigned to one of those backbone atoms. The full atomic indices associated with each reduced index can be obtained using the inverted_map method:

from pprint import pprint

smiles_representation = ethanol_representation["SMILES"]
print(smiles_representation.atom_map)
print(smiles_representation.inverted_map)
[0, 1, 2, 0, 0, 0, 1, 1, 2]
[{0, 3, 4, 5}, {1, 6, 7}, {8, 2}]

Hence, C (index = 0) is connected to H atoms 3, 4, 5, while C (index = 1) is connected to H atoms 6, 7, and the O (index = 2), to H 8.

However, ethanol_representation["XYZ"].atom_map will return:

[0, 1, 2, 3, 4, 5, 6, 7, 8]

Since all the atoms are explicitly provided in an XYZ format.

The ChemicalStructure class can also yield the chemical formula in Hill notation using the formula method. smiles_representation.formula will return:

'C2H6O'

Serialization (saving in a JSON file) of a MolecularData object and deserialization

We can save a MolecularData instance in a JSON file for future uses. To do so, employ the write_file method of the MolecularData class:

from hqs_nmr_parameters.examples import molecules

parameters = molecules["C2H5OH"]
parameters.write_file("etoh_molecular_data.json")

This JSON file can be easily loaded and validated against the MolecularData model using the read_file method:

from hqs_nmr_parameters import MolecularData

loaded_parameters = MolecularData.read_file("etoh_molecular_data.json")
print(type(loaded_parameters))
<class 'hqs_nmr_parameters.code.data_classes.MolecularData'>