Representation in Python of molecular data structures
Molecular data
Inside the hqs-nmr-parameters
package, the MolecularData
class, which is derived from Pydantic's BaseModel
, has been implemented to
describe a molecule. It contains the following attributes:
name
: The name of the molecule (just a simple string).isotopes
: List containing pairs of an atom index and the associated isotope. Atom indices are associated with the order in which atoms appear in the chemical structure representations, starting by index 0.shifts
: List containing pairs of an atom index and the associated chemical shift in ppm.j_couplings
: List containing pairs of atomic indices and the associated J-coupling values in Hz. Note that atom index pairs are unique: if a value is provided for an atom pair (k, l), then no value is provided for pair (l, k).structures
: Contains a dictionary with chemical structure representations. It accepts entries for a SMILES string ("SMILES"), a Molfile ("Molfile"), and an XYZ file ("XYZ"). Each value will correspond to aChemicalStructure
object. See below for a full explanation of the chemical structure representations and the atom numeration.formula
: The molecular formula of the corresponding molecule.temperature
: An optional temperature in K for which the parameters have been calculated or at which the experiment has been performed.solvent
: Name of the solvent. An empty string represents an unknown or undefined solvent, or the absence of a solvent.description
: Optional further information.method_json
: Stores a JSON serialization of computational method settings. An empty string indicates that the field is not applicable. Creating and interpreting the content is the responsibility of the user of the model.
This is best illustrated by an example from one of our available datasets:
# Import a MolecularDataSet
from hqs_nmr_parameters.examples import molecules
# Choose ethanol as an example from the available molecules
parameters = molecules["C2H5OH"]
print(type(parameters))
print("Chemical name:", parameters.name)
print("Molecular Formula:", parameters.formula)
print("Type(s) of representations:", parameters.structures.keys())
print("Information about solvent:", parameters.solvent)
print("Shifts:", parameters.shifts)
print("###")
print(parameters.description)
<class 'hqs_nmr_parameters.code.data_classes.MolecularData'>
Chemical name: Ethanol
Molecular Formula: C2H6O
Type(s) of representations: dict_keys(['SMILES'])
Information about solvent: chloroform
Shifts: [(3, 1.25), (4, 1.25), (5, 1.25), (6, 3.72), (7, 3.72), (8, 1.32)]
###
1H parameters for ethanol in CDCl3.
Shifts from: https://doi.org/10.1021/om100106e
J-couplings estimated.
In this case, we are getting information for the ethanol molecule, for which we have stored a SMILES representation (check below for details) and according to the description experimental 1H NMR parameters in chloroform.
Note that the number of H atoms in a molecule is not necessarily the same as the number of spins in its 1H NMR spectrum simulation. For instance, H atoms attached to electronegative elements such as N, O, or S tend to exchange rapidly with the deuterium atoms in protic solvents such as deuterated water (D2O), which results in an extreme line broadening in the NMR spectrum or even makes the signal disappear entirely. That means, if the ethanol entry shown above was recorded in water instead of chloroform, the chemical shift value for the hydroxyl proton (OH, index 8) would probably be missing.
NMR parameters subset
The easiest way to check the NMR data is getting only the information necessary to perform an NMR calculation, i.e., a subset of the parameters
object containing only the attributes isotopes
, shifts
, and j_couplings
. Those can be obtained using the spin_system
method of MolecularData
, which returns an object of type NMRParameters
:
from pprint import pprint
from hqs_nmr_parameters.examples import molecules
parameters = molecules["C2H5OH"]
nmr_parameters = parameters.spin_system()
print(type(nmr_parameters))
pprint(nmr_parameters.model_dump())
<class 'hqs_nmr_parameters.code.data_classes.NMRParameters'>
{'isotopes': [(1, 'H'), (1, 'H'), (1, 'H'), (1, 'H'), (1, 'H'), (1, 'H')],
'j_couplings': [((0, 3), 7.0),
((0, 4), 7.0),
((1, 3), 7.0),
((1, 4), 7.0),
((2, 3), 7.0),
((2, 4), 7.0)],
'shifts': [1.25, 1.25, 1.25, 3.72, 3.72, 1.32]}
Here, nmr_parameters
is an instance of the Pydantic NMRParameters
class, where the atomic indices always start from 0.
The total number of spins, determined from the number of chemical shifts, can be accessed with the nspins
property. The NMRParameters
class can be used as input for several functions to calculate an NMR spectrum, which is explained here in detail. The following code block shows how to perform a 1H NMR simulation (default case), providing the molecular data either as NMRParameters
or directly as MolecularData
object:
from hqs_nmr_parameters.examples import molecules
from hqs_nmr import calculate_spectrum, NMRCalculationParameters
parameters = molecules["C2H5OH"]
nmr_parameters = parameters.spin_system()
print(type(nmr_parameters))
print(nmr_parameters.nspins)
calculation_parameters = NMRCalculationParameters(field_T=9.395)
# Using parameters instead of nmr_parameters would also be valid
spectrum = calculate_spectrum(nmr_parameters, calculation_parameters)
print(type(spectrum))
<class 'hqs_nmr_parameters.code.data_classes.NMRParameters'>
6
<class 'hqs_nmr.datatypes.NMRResultSpectrum1D'>
For information on the output class NMRResultSpectrum1D
have a look here.
Isotopes/atoms selection
MolecularData
objects can contain data for several isotopes. However, we are often interested in creating a spin system for a given nucleus only. For example, if we want to simulate a 1H NMR spectrum but we also have 13C NMR parameters in our data, it is possible either to select only 1H using keep_nuclei(isotopes=["1H"])
or to drop 13C using drop_nuclei(isotopes=["13C"])
. Let us take a look at this in more detail:
from pprint import pprint
from hqs_nmr_parameters.examples import molecules
parameters = molecules["CH3Cl_13C"]
pprint(parameters.isotopes)
drop_C = parameters.drop_nuclei(isotopes=["13C"])
keep_H = parameters.keep_nuclei(isotopes=["1H"])
print(drop_C == keep_H)
nmr_parameters_1H = drop_C.spin_system()
pprint(nmr_parameters_1H.model_dump())
[(0, Isotope(mass_number=13, symbol='C')),
(2, Isotope(mass_number=1, symbol='H')),
(3, Isotope(mass_number=1, symbol='H')),
(4, Isotope(mass_number=1, symbol='H'))]
True
{'isotopes': [(1, 'H'), (1, 'H'), (1, 'H')],
'j_couplings': [((0, 1), -10.8), ((0, 2), -10.8), ((1, 2), -10.8)],
'shifts': [3.05, 3.05, 3.05]}
We should take into account that the dropping/keeping of isotopes needs to be done carefully. The 13C nucleus has a nuclear spin of 1/2 (like 1H) and is NMR-activate but has a very low natural abundance, and the 13Cā1H coupling pattern is only barely visible in 1H NMR spectra. Therefore, it is common to simulate only 13 C-decoupled 1H NMR spectra. However, the coupling of 1H with other isotopes such as 19F or 31P should not be ignored in order to get the correct peak pattern.
keep_nuclei
can also take an atoms
argument, which accepts a list of atom indices for which the NMR parameters are to be kept.
In the case of drop_nuclei
, specified nuclei can be dropped via atoms
, but a keep_atoms
argument prevents certain atoms from being dropped (useful in combination with isotopes
).
Note that many datasets provided by hqs-nmr-parameters
already only contain the data that is needed to simulate 1H NMR spectra. More information on each dataset can be found here.
Molecular structure representations
As shown above a MolecularData
object has the attribute structures
, which is a dictionary that can accept three entries
(keys): "SMILES", "Molfile" and "XYZ". The value of each of these entries is a ChemicalStructure
object that stores chemical structure representations of the type defined in the key:
- SMILES (Simplified Molecular-Input Line-Entry System): Strings representing the connectivity of all non-hydrogen atoms in a molecule. They become laborious to understand for complex molecules.
- Molfiles: Text files with a two-dimensional (2D) structure of a molecule (skeletal representation). It may omit hydrogen atoms. The number of hydrogen atoms is inferred from the atomic valencies of the heavy atoms.
- XYZ files: Text files containing an integer with the total number of atoms in the first line, followed by a comment line, and then the chemical element symbols and three-dimensional atomic positions for all the atoms explicitly.
Connectivity and charge information can be extracted from SMILES strings or Molfiles, but not from XYZ files. More information about Molfiles and SMILES can be found here.
The attributes of the ChemicalStructure
class are:
representation
: Type of chemical structure representation present in thecontent
attribute. It can be "SMILES", "Molfile" or "XYZ".content
: The raw structure representation (as a string), i.e., a SMILES string or the content of an XYZ file or a Molfile.charge
: Total net electric charge of the molecule.symbols
: List of chemical element symbols for the full set of atoms.atom_map
: As we have seen, 2D representations can omit hydrogen atoms, this attribute contains information about how the full representation would map into this reduced representation. It is a list of the length of the molecule (full set of atoms), where the atom counting starts from zero and takes into account:- Each non-hydrogen atom maps onto the respective index in the reduced representation.
- Each explicit hydrogen maps onto the index of its explicitly represented counterpart.
- Each implicit hydrogen maps onto the reduced index of the respective backbone atom.
In the case of ethanol, a full structures
representation may look like the following. It might look a bit crowded as the content
attribute contains the full content of the corresponding files (or SMILES string):
from hqs_nmr_parameters import ChemicalStructure
ethanol_representation = {
"XYZ": ChemicalStructure(representation="XYZ", content="9\n\nC -0.88708900 0.17506400 -0.01253500\nC 0.46048900 -0.51551600 -0.04653500\nO 1.44296500 0.30726700 0.56557200\nH -0.84747800 1.12776800 -0.55081700\nH -1.65878200 -0.45332700 -0.46584200\nH -1.17694400 0.40367600 1.01830600\nH 0.76871200 -0.72432800 -1.07546000\nH 0.41948600 -1.46207300 0.50017700\nH 1.47864000 1.14146800 0.06713500\n", charge=0, symbols=["C", "C", "O", "H", "H", "H", "H", "H", "H"], atom_map=[0, 1, 2, 3, 4, 5, 6, 7, 8]),
"Molfile": ChemicalStructure(representation="Molfile", content="\n RDKit 2D\n\n 0 0 0 0 0 0 0 0 0 0999 V3000\nM V30 BEGIN CTAB\nM V30 COUNTS 3 2 0 0 0\nM V30 BEGIN ATOM\nM V30 1 C -1.299038 -0.250000 0.000000 0\nM V30 2 C 0.000000 0.500000 0.000000 0\nM V30 3 O 1.299038 -0.250000 0.000000 0\nM V30 END ATOM\nM V30 BEGIN BOND\nM V30 1 1 1 2 CFG=3\nM V30 2 1 2 3 CFG=3\nM V30 END BOND\nM V30 END CTAB\nM END\n", charge=0, symbols=["C", "C", "O", "H", "H", "H", "H", "H", "H"], atom_map=[0, 1, 2, 0, 0, 0, 1, 1, 2]),
"SMILES": ChemicalStructure(representation="SMILES", content="CCO", charge=0, symbols=["C", "C", "O", "H", "H", "H", "H", "H", "H"], atom_map=[0, 1, 2, 0, 0, 0, 1, 1, 2])
}
Each of the entries corresponds to a ChemicalStructure
object:
from pprint import pprint
smiles_representation = ethanol_representation["SMILES"]
print(type(smiles_representation))
pprint(smiles_representation.model_dump())
<class 'hqs_nmr_parameters.code.data_classes.ChemicalStructure'>
{'atom_map': [0, 1, 2, 0, 0, 0, 1, 1, 2],
'charge': 0,
'content': 'CCO',
'representation': 'SMILES',
'symbols': ['C', 'C', 'O', 'H', 'H', 'H', 'H', 'H', 'H']}
ethanol_representation["Molfile"].atom_map
and ethanol_representation["SMILES"].atom_map
will return:
[0, 1, 2, 0, 0, 0, 1, 1, 2]
In both cases, only the carbon and the oxygen atoms have been indicated explicitly. Therefore, the first three atoms are listed expressly (carbon atoms 0
and 1
, and oxygen atom 2
, i.e., the atom counting starts from zero) and the six following hydrogens are assigned to one of those backbone atoms. The full atomic indices associated with each reduced index can be obtained using the inverted_map
method:
from pprint import pprint
smiles_representation = ethanol_representation["SMILES"]
print(smiles_representation.atom_map)
print(smiles_representation.inverted_map)
[0, 1, 2, 0, 0, 0, 1, 1, 2]
[{0, 3, 4, 5}, {1, 6, 7}, {8, 2}]
Hence, C (index = 0
) is connected to H atoms 3
, 4
, 5
, while C (index = 1
) is connected to H atoms 6
, 7
, and the O (index = 2
), to H 8
.
However, ethanol_representation["XYZ"].atom_map
will return:
[0, 1, 2, 3, 4, 5, 6, 7, 8]
Since all the atoms are explicitly provided in an XYZ format.
The ChemicalStructure
class can also yield the chemical formula in Hill notation using the formula
method. smiles_representation.formula
will return:
'C2H6O'
Serialization (saving in a JSON file) of a MolecularData
object and deserialization
We can save a MolecularData
instance in a JSON file for future uses. To do so, employ the write_file
method of the
MolecularData
class:
from hqs_nmr_parameters.examples import molecules
parameters = molecules["C2H5OH"]
parameters.write_file("etoh_molecular_data.json")
This JSON file can easily be loaded and validated against the MolecularData
model using the read_file
method:
from hqs_nmr_parameters import MolecularData
loaded_parameters = MolecularData.read_file("etoh_molecular_data.json")
print(type(loaded_parameters))
<class 'hqs_nmr_parameters.code.data_classes.MolecularData'>