Representation in Python of molecular data structures

Molecular data

Inside the hqs-nmr-parameters package, the MolecularData class, which is derived from Pydantic's BaseModel, has been implemented to describe a molecule. It contains the following attributes:

name: The name of the molecule (just a simple string).
isotopes: List containing pairs of an atom index and the associated isotope. Atom indices are associated with the order in which atoms appear in the chemical structure representations, starting by index 0.
shifts: List containing pairs of an atom index and the associated chemical shift in ppm.
j_couplings: List containing pairs of atomic indices and the associated J-coupling values in Hz. Note that atom index pairs are unique: if a value is provided for an atom pair (k, l), then no value is provided for pair (l, k).
structures: Contains a dictionary with chemical structure representations. It accepts entries for a SMILES string ("SMILES"), a Molfile ("Molfile"), and an XYZ file ("XYZ"). Each value will correspond to a ChemicalStructure object. See below for a full explanation of the chemical structure representations and the atom numeration.
formula: The molecular formula of the corresponding molecule.
temperature: An optional temperature in K for which the parameters have been calculated or at which the experiment has been performed.
solvent: Name of the solvent. An empty string represents an unknown or undefined solvent, or the absence of a solvent.
description: Optional further information.
method_json: Stores a JSON serialization of computational method settings. An empty string indicates that the field is not applicable. Creating and interpreting the content is the responsibility of the user of the model.

This is best illustrated by an example from one of our available datasets:

# Import a MolecularDataSet
from hqs_nmr_parameters.examples import molecules

# Choose ethanol as an example from the available molecules
parameters = molecules["C2H5OH"]

print(type(parameters))
print("Chemical name:", parameters.name)
print("Molecular Formula:", parameters.formula)
print("Type(s) of representations:", parameters.structures.keys())
print("Information about solvent:", parameters.solvent)
print("Shifts:", parameters.shifts)
print("###")
print(parameters.description)

<class 'hqs_nmr_parameters.code.data_classes.MolecularData'>
Chemical name: Ethanol
Molecular Formula: C2H6O
Type(s) of representations: dict_keys(['SMILES'])
Information about solvent: chloroform
Shifts: [(3, 1.25), (4, 1.25), (5, 1.25), (6, 3.72), (7, 3.72), (8, 1.32)]
###
1H parameters for ethanol in CDCl3.
Shifts from: https://doi.org/10.1021/om100106e
J-couplings estimated.

In this case, we are getting information for the ethanol molecule, for which we have stored a SMILES representation (check below for details) and according to the description experimental ¹H NMR parameters in chloroform.

Note that the number of H atoms in a molecule is not necessarily the same as the number of spins in its ¹H NMR spectrum simulation. For instance, H atoms attached to electronegative elements such as N, O, or S tend to exchange rapidly with the deuterium atoms in protic solvents such as deuterated water (D₂O), which results in an extreme line broadening in the NMR spectrum or even makes the signal disappear entirely. That means, if the ethanol entry shown above was recorded in water instead of chloroform, the chemical shift value for the hydroxyl proton (OH, index 8) would probably be missing.

NMR parameters subset

The easiest way to check the NMR data is getting only the information necessary to perform an NMR calculation, i.e., a subset of the parameters object containing only the attributes isotopes, shifts, and j_couplings. Those can be obtained using the spin_system method of MolecularData, which returns an object of type NMRParameters:

from pprint import pprint
from hqs_nmr_parameters.examples import molecules

parameters = molecules["C2H5OH"]
nmr_parameters = parameters.spin_system()

print(type(nmr_parameters))
pprint(nmr_parameters.model_dump())

<class 'hqs_nmr_parameters.code.data_classes.NMRParameters'>
{'isotopes': [(1, 'H'), (1, 'H'), (1, 'H'), (1, 'H'), (1, 'H'), (1, 'H')],
 'j_couplings': [((0, 3), 7.0),
                 ((0, 4), 7.0),
                 ((1, 3), 7.0),
                 ((1, 4), 7.0),
                 ((2, 3), 7.0),
                 ((2, 4), 7.0)],
 'shifts': [1.25, 1.25, 1.25, 3.72, 3.72, 1.32]}

Here, nmr_parameters is an instance of the Pydantic NMRParameters class, where the atomic indices always start from 0. The total number of spins, determined from the number of chemical shifts, can be accessed with the nspins property. The NMRParameters class can be used as input for several functions to calculate an NMR spectrum, which is explained here in detail. The following code block shows how to perform a ¹H NMR simulation (default case), providing the molecular data either as NMRParameters or directly as MolecularData object:

from hqs_nmr_parameters.examples import molecules
from hqs_nmr import calculate_spectrum, NMRCalculationParameters

parameters = molecules["C2H5OH"]
nmr_parameters = parameters.spin_system()
print(type(nmr_parameters))
print(nmr_parameters.nspins)
calculation_parameters = NMRCalculationParameters(field_T=9.395)

# Using parameters instead of nmr_parameters would also be valid
spectrum = calculate_spectrum(nmr_parameters, calculation_parameters)
print(type(spectrum))

<class 'hqs_nmr_parameters.code.data_classes.NMRParameters'>
6
<class 'hqs_nmr.datatypes.NMRResultSpectrum1D'>

For information on the output class NMRResultSpectrum1D have a look here.

Isotopes/atoms selection

MolecularData objects can contain data for several isotopes. However, we are often interested in creating a spin system for a given nucleus only. For example, if we want to simulate a ¹H NMR spectrum but we also have ¹³C NMR parameters in our data, it is possible either to select only ¹H using keep_nuclei(isotopes=["1H"]) or to drop ¹³C using drop_nuclei(isotopes=["13C"]). Let us take a look at this in more detail:

from pprint import pprint
from hqs_nmr_parameters.examples import molecules

parameters = molecules["CH3Cl_13C"]
pprint(parameters.isotopes)

drop_C = parameters.drop_nuclei(isotopes=["13C"])
keep_H = parameters.keep_nuclei(isotopes=["1H"])
print(drop_C == keep_H)

nmr_parameters_1H = drop_C.spin_system()
pprint(nmr_parameters_1H.model_dump())

[(0, Isotope(mass_number=13, symbol='C')),
 (2, Isotope(mass_number=1, symbol='H')),
 (3, Isotope(mass_number=1, symbol='H')),
 (4, Isotope(mass_number=1, symbol='H'))]
True
{'isotopes': [(1, 'H'), (1, 'H'), (1, 'H')],
 'j_couplings': [((0, 1), -10.8), ((0, 2), -10.8), ((1, 2), -10.8)],
 'shifts': [3.05, 3.05, 3.05]}

We should take into account that the dropping/keeping of isotopes needs to be done carefully. The ¹³C nucleus has a nuclear spin of 1/2 (like ¹H) and is NMR-activate but has a very low natural abundance, and the ¹³C–¹H coupling pattern is only barely visible in ¹H NMR spectra. Therefore, it is common to simulate only ¹³C-decoupled ¹H NMR spectra. However, the coupling of ¹H with other isotopes such as ¹⁹F or ³¹P should not be ignored in order to get the correct peak pattern.

keep_nuclei can also take an atoms argument, which accepts a list of atom indices for which the NMR parameters are to be kept. In the case of drop_nuclei, specified nuclei can be dropped via atoms, but a keep_atoms argument prevents certain atoms from being dropped (useful in combination with isotopes).

Note that many datasets provided by hqs-nmr-parameters already only contain the data that is needed to simulate ¹H NMR spectra. More information on each dataset can be found here.

Molecular structure representations

As shown above a MolecularData object has the attribute structures, which is a dictionary that can accept three entries (keys): "SMILES", "Molfile" and "XYZ". The value of each of these entries is a ChemicalStructure object that stores chemical structure representations of the type defined in the key:

SMILES (Simplified Molecular-Input Line-Entry System): Strings representing the connectivity of all non-hydrogen atoms in a molecule. They become laborious to understand for complex molecules.
Molfiles: Text files with a two-dimensional (2D) structure of a molecule (skeletal representation). It may omit hydrogen atoms. The number of hydrogen atoms is inferred from the atomic valencies of the heavy atoms.
XYZ files: Text files containing an integer with the total number of atoms in the first line, followed by a comment line, and then the chemical element symbols and three-dimensional atomic positions for all the atoms explicitly.

Connectivity and charge information can be extracted from SMILES strings or Molfiles, but not from XYZ files. More information about Molfiles and SMILES can be found here.

The attributes of the ChemicalStructure class are:

representation: Type of chemical structure representation present in the content attribute. It can be "SMILES", "Molfile" or "XYZ".
content: The raw structure representation (as a string), i.e., a SMILES string or the content of an XYZ file or a Molfile.
charge: Total net electric charge of the molecule.
symbols: List of chemical element symbols for the full set of atoms.
atom_map: As we have seen, 2D representations can omit hydrogen atoms, this attribute contains information about how the full representation would map into this reduced representation. It is a list of the length of the molecule (full set of atoms), where the atom counting starts from zero and takes into account:
- Each non-hydrogen atom maps onto the respective index in the reduced representation.
- Each explicit hydrogen maps onto the index of its explicitly represented counterpart.
- Each implicit hydrogen maps onto the reduced index of the respective backbone atom.

In the case of ethanol, a full structures representation may look like the following. It might look a bit crowded as the content attribute contains the full content of the corresponding files (or SMILES string):

from hqs_nmr_parameters import ChemicalStructure

ethanol_representation = {
"XYZ": ChemicalStructure(representation="XYZ", content="9\n\nC       -0.88708900     0.17506400    -0.01253500\nC        0.46048900    -0.51551600    -0.04653500\nO        1.44296500     0.30726700     0.56557200\nH       -0.84747800     1.12776800    -0.55081700\nH       -1.65878200    -0.45332700    -0.46584200\nH       -1.17694400     0.40367600     1.01830600\nH        0.76871200    -0.72432800    -1.07546000\nH        0.41948600    -1.46207300     0.50017700\nH        1.47864000     1.14146800     0.06713500\n", charge=0, symbols=["C", "C", "O", "H", "H", "H", "H", "H", "H"], atom_map=[0, 1, 2, 3, 4, 5, 6, 7, 8]),
"Molfile": ChemicalStructure(representation="Molfile", content="\n     RDKit          2D\n\n  0  0  0  0  0  0  0  0  0  0999 V3000\nM  V30 BEGIN CTAB\nM  V30 COUNTS 3 2 0 0 0\nM  V30 BEGIN ATOM\nM  V30 1 C -1.299038 -0.250000 0.000000 0\nM  V30 2 C 0.000000 0.500000 0.000000 0\nM  V30 3 O 1.299038 -0.250000 0.000000 0\nM  V30 END ATOM\nM  V30 BEGIN BOND\nM  V30 1 1 1 2 CFG=3\nM  V30 2 1 2 3 CFG=3\nM  V30 END BOND\nM  V30 END CTAB\nM  END\n", charge=0, symbols=["C", "C", "O", "H", "H", "H", "H", "H", "H"], atom_map=[0, 1, 2, 0, 0, 0, 1, 1, 2]),
"SMILES": ChemicalStructure(representation="SMILES", content="CCO", charge=0, symbols=["C", "C", "O", "H", "H", "H", "H", "H", "H"], atom_map=[0, 1, 2, 0, 0, 0, 1, 1, 2])
}

Each of the entries corresponds to a ChemicalStructure object:

from pprint import pprint

smiles_representation = ethanol_representation["SMILES"]
print(type(smiles_representation))
pprint(smiles_representation.model_dump())

<class 'hqs_nmr_parameters.code.data_classes.ChemicalStructure'>
{'atom_map': [0, 1, 2, 0, 0, 0, 1, 1, 2],
 'charge': 0,
 'content': 'CCO',
 'representation': 'SMILES',
 'symbols': ['C', 'C', 'O', 'H', 'H', 'H', 'H', 'H', 'H']}

ethanol_representation["Molfile"].atom_map and ethanol_representation["SMILES"].atom_map will return:

[0, 1, 2, 0, 0, 0, 1, 1, 2]

In both cases, only the carbon and the oxygen atoms have been indicated explicitly. Therefore, the first three atoms are listed expressly (carbon atoms 0 and 1, and oxygen atom 2, i.e., the atom counting starts from zero) and the six following hydrogens are assigned to one of those backbone atoms. The full atomic indices associated with each reduced index can be obtained using the inverted_map method:

from pprint import pprint

smiles_representation = ethanol_representation["SMILES"]
print(smiles_representation.atom_map)
print(smiles_representation.inverted_map)

[0, 1, 2, 0, 0, 0, 1, 1, 2]
[{0, 3, 4, 5}, {1, 6, 7}, {8, 2}]

Hence, C (index = 0) is connected to H atoms 3, 4, 5, while C (index = 1) is connected to H atoms 6, 7, and the O (index = 2), to H 8.

However, ethanol_representation["XYZ"].atom_map will return:

[0, 1, 2, 3, 4, 5, 6, 7, 8]

Since all the atoms are explicitly provided in an XYZ format.

The ChemicalStructure class can also yield the chemical formula in Hill notation using the formula method. smiles_representation.formula will return:

'C2H6O'

Serialization (saving in a JSON file) of a `MolecularData` object and deserialization

We can save a MolecularData instance in a JSON file for future uses. To do so, employ the write_file method of the MolecularData class:

from hqs_nmr_parameters.examples import molecules

parameters = molecules["C2H5OH"]
parameters.write_file("etoh_molecular_data.json")

This JSON file can easily be loaded and validated against the MolecularData model using the read_file method:

from hqs_nmr_parameters import MolecularData

loaded_parameters = MolecularData.read_file("etoh_molecular_data.json")
print(type(loaded_parameters))

<class 'hqs_nmr_parameters.code.data_classes.MolecularData'>

Keyboard shortcuts