Data sets

Molecular data for a group of molecules can be collected using the MolecularDataSet Pydantic class, which contains instances of type MolecularData. It is composed of two attributes:

  • description: Summary of the content of the data set.
  • dataset: Dictionary where the keys are identifiers for the molecules (e.g., molecule names) and the values correspond to a MolecularData object per molecule.

A list with the keys of the dataset can be obtained directly from the MolecularDataSet.keys property. This allows us to conveniently access the molecular data of each molecule using its key as a string.

To get a brief summary of the molecules belonging to a data set, we can use the get_names method. It returns a dictionary where the keys correspond to the keys of the dataset and the values are the chemical names. An equivalent dictionary providing the chemical formulas can be obtained via the get_formulas method.

Similar to keep_nuclei/drop_nuclei in MolecularData, the keep_isotopes and drop_isotopes methods keep or drop selected isotopes for all molecules in a data set. An extra description can be added with updated information about the set; it must include any whitespace needed to separate it from the original description content.

As in MolecularData, it is possible to save or load data sets thanks to the read_file and write_file methods that (de)serialize JSON files.
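To make the structure described above more concrete, the following is a simplified stand-in written with standard-library dataclasses. It is not the package's actual Pydantic implementation: the names ToyMolecule, ToyDataSet, to_json and from_json are hypothetical and only mirror the attributes and methods named in this section.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToyMolecule:
    """Hypothetical, heavily reduced stand-in for MolecularData."""

    name: str
    formula: str


@dataclass
class ToyDataSet:
    """Hypothetical stand-in for MolecularDataSet (not the real class)."""

    description: str
    dataset: dict[str, ToyMolecule] = field(default_factory=dict)

    @property
    def keys(self) -> list[str]:
        # List of identifiers, analogous to MolecularDataSet.keys.
        return list(self.dataset)

    def __getitem__(self, key: str) -> ToyMolecule:
        # Access one molecule through its string key.
        return self.dataset[key]

    def get_names(self) -> dict[str, str]:
        return {key: mol.name for key, mol in self.dataset.items()}

    def get_formulas(self) -> dict[str, str]:
        return {key: mol.formula for key, mol in self.dataset.items()}

    def to_json(self) -> str:
        # Serialize the whole set (description and molecules) to JSON text.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "ToyDataSet":
        raw = json.loads(text)
        molecules = {k: ToyMolecule(**v) for k, v in raw["dataset"].items()}
        return cls(description=raw["description"], dataset=molecules)


toy_set = ToyDataSet(
    description="Two small example molecules.",
    dataset={
        "C2H3CN": ToyMolecule(name="Acrylonitrile", formula="C3H3N"),
        "C6H6": ToyMolecule(name="Benzene", formula="C6H6"),
    },
)

print(toy_set.keys)
print(toy_set.get_names())

# Round trip through JSON, in the spirit of write_file followed by read_file.
restored = ToyDataSet.from_json(toy_set.to_json())
print(restored == toy_set)
```

The real classes carry many more fields (shifts, J-couplings, structures, and so on), but the access pattern is the same: a description plus a dictionary of per-molecule objects indexed by string keys.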

The data sets implemented in the hqs_nmr_parameters package are described in the following sections.

Examples module

The first data set worth mentioning is provided by the examples module, which contains a set of molecule definitions encapsulated in the molecules MolecularDataSet object. This set can be accessed via:

from pprint import pprint
from hqs_nmr_parameters.examples import molecules

print(type(molecules)) # MolecularDataSet
# Keys of the data set:
print(molecules.keys)
<class 'hqs_nmr_parameters.code.data_classes.MolecularDataSet'>
['CH3Cl', 'limonene_DFT', '1,2,4-trichlorobenzene', 'Anethole', 'Artemisinin_exp', 'endo-dicyclopentadiene_DFT', 'CH3Cl_13C', 'C2H3CN', 'Artemisinin', 'camphor_DFT', 'C6H6', 'C10H8', 'Triphenylphosphine_oxide', 'H2CCF2', 'C2H5Cl', 'Androstenedione', 'Cinnamaldehyde', 'CHCl3_13C', 'C6H5NO2', 'C2H5OH', 'C2H6', 'C10H7Br', 'CHCl3', 'camphor_exp', 'exo-dicyclopentadiene_DFT', '1,2-di-tert-butyl-diphosphane', 'C2H3NC', 'cyclopentadiene_DFT', 'cis-3-chloroacrylic_acid_exp', 'C3H8']

Note that the content of this list of keys is just an example and might appear in a different order or with different entries depending on the installed version of hqs_nmr_parameters. The same holds for the dictionary of molecule names which can be obtained as follows:

pprint(molecules.get_names())
{'1,2,4-trichlorobenzene': '1,2,4-Trichlorobenzene',
 '1,2-di-tert-butyl-diphosphane': 'tert-Butyl(tert-butylphosphanyl)phosphane',
 'Androstenedione': 'Androstenedione',
 'Anethole': 'Anethole',
 'Artemisinin': 'Artemisinin',
 'Artemisinin_exp': 'Artemisinin',
 'C10H7Br': '2-Bromonaphthalene',
 'C10H8': 'Naphthalene',
 'C2H3CN': 'Acrylonitrile',
 'C2H3NC': 'Vinyl isocyanide',
 'C2H5Cl': 'Chloroethane',
 'C2H5OH': 'Ethanol',
 'C2H6': 'Ethane',
 'C3H8': 'Propane',
 'C6H5NO2': 'Nitrobenzene',
 'C6H6': 'Benzene',
 'CH3Cl': 'Chloromethane',
 'CH3Cl_13C': 'Chloromethane',
 'CHCl3': 'Chloroform',
 'CHCl3_13C': 'Chloroform',
 'Cinnamaldehyde': 'Cinnamaldehyde',
 'H2CCF2': '1,1-Difluoroethene',
 'Triphenylphosphine_oxide': 'Triphenylphosphine oxide',
 'camphor_DFT': 'Camphor',
 'camphor_exp': 'Camphor',
 'cis-3-chloroacrylic_acid_exp': 'cis-3-Chloroacrylic acid',
 'cyclopentadiene_DFT': 'Cyclopentadiene',
 'endo-dicyclopentadiene_DFT': 'endo-Dicyclopentadiene',
 'exo-dicyclopentadiene_DFT': 'exo-Dicyclopentadiene',
 'limonene_DFT': 'Limonene'}

In addition, to get a feeling for the size of the molecules in the set, we can print their formulas using the get_formulas method.
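As a rough illustration of how formulas convey molecule size, the sketch below counts the atoms in simple formula strings. It assumes formulas written as element symbols with optional integer counts (for example 'C3H3N' for acrylonitrile); the helper count_atoms and the small formulas dictionary are hypothetical and only stand in for the dictionary returned by get_formulas.

```python
import re
from collections import Counter


def count_atoms(formula: str) -> Counter:
    """Count atoms in a simple formula string such as 'C3H3N'.

    Assumes element symbols followed by optional integer counts, with no
    parentheses, charges, or isotope labels.
    """
    counts: Counter = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


# Illustrative values only; in practice this dictionary would come from
# the get_formulas method of the data set.
formulas = {"C2H3CN": "C3H3N", "C6H6": "C6H6", "CHCl3": "CHCl3"}

for key, formula in formulas.items():
    atoms = count_atoms(formula)
    print(key, dict(atoms), "total atoms:", sum(atoms.values()))
```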

The full molecular definition for a given molecule can be loaded using its string key. Each entry of this data set includes a 2D representation (Molfile or SMILES string) of the molecule. Let us consider an example:

from pprint import pprint
from hqs_nmr_parameters.examples import molecules

# Obtain the MolecularData object for acrylonitrile
parameters = molecules["C2H3CN"]
# Print parameters
pprint(parameters.model_dump())
{'description': '1H parameters for acrylonitrile.\n'
                "Values were obtained from Hans Reich's Collection, NMR "
                'Spectroscopy.\n'
                'https://organicchemistrydata.org\n',
 'formula': 'C3H3N',
 'isotopes': [(3, (1, 'H')), (4, (1, 'H')), (5, (1, 'H'))],
 'j_couplings': [((3, 4), 0.9), ((3, 5), 11.8), ((4, 5), 17.9)],
 'method_json': '',
 'name': 'Acrylonitrile',
 'shifts': [(3, 5.79), (4, 5.97), (5, 5.48)],
 'solvent': '',
 'structures': {'Molfile': {'atom_map': [0, 1, 2, 3, 4, 5, 6],
                            'charge': 0,
                            'content': '\n'
                                       'JME 2022-02-26 Wed Sep 07 15:54:28 '
                                       'GMT+200 2022\n'
                                       '\n'
                                       '  0  0  0  0  0  0  0  0  0  0999 '
                                       'V3000\n'
                                       'M  V30 BEGIN CTAB\n'
                                       'M  V30 COUNTS 7 6 0 0 0\n'
                                       'M  V30 BEGIN ATOM\n'
                                       'M  V30 1 C 2.4249 2.1000 0.0000 0\n'
                                       'M  V30 2 C 3.6373 1.4000 0.0000 0\n'
                                       'M  V30 3 C 1.2124 1.4000 0.0000 0\n'
                                       'M  V30 4 H 0.0000 2.1000 0.0000 0\n'
                                       'M  V30 5 H 1.2124 0.0000 0.0000 0\n'
                                       'M  V30 6 H 2.4249 3.5000 0.0000 0\n'
                                       'M  V30 7 N 4.8497 0.7000 0.0000 0\n'
                                       'M  V30 END ATOM\n'
                                       'M  V30 BEGIN BOND\n'
                                       'M  V30 1 1 1 2\n'
                                       'M  V30 2 2 1 3\n'
                                       'M  V30 3 1 3 4\n'
                                       'M  V30 4 1 3 5\n'
                                       'M  V30 5 1 1 6\n'
                                       'M  V30 6 3 2 7\n'
                                       'M  V30 END BOND\n'
                                       'M  V30 END CTAB\n'
                                       'M  END\n',
                            'representation': 'Molfile',
                            'symbols': ['C', 'C', 'C', 'H', 'H', 'H', 'N']}},
 'temperature': None}

As we can see, the data for setting up a 1H-NMR spectrum of the acrylonitrile molecule is stored together with its Molfile.

To set up an NMR calculation we are only interested in some of the previous data. To retrieve it, use the spin_system method:

nmr_parameters = parameters.spin_system()
pprint(nmr_parameters.model_dump())
{'isotopes': [(1, 'H'), (1, 'H'), (1, 'H')],
 'j_couplings': [((0, 1), 0.9), ((0, 2), 11.8), ((1, 2), 17.9)],
 'shifts': [5.79, 5.97, 5.48]}
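Comparing the two outputs above, the spin system appears to renumber the NMR-active atoms (indices 3, 4, 5 in the full entry) to consecutive spin indices (0, 1, 2). The following plain-Python sketch reproduces that renumbering from the printed values. It is only an illustration of the relationship between the two outputs, not the package's implementation of spin_system.

```python
# Values copied from the full acrylonitrile entry printed above.
isotopes = [(3, (1, "H")), (4, (1, "H")), (5, (1, "H"))]
shifts = [(3, 5.79), (4, 5.97), (5, 5.48)]
j_couplings = [((3, 4), 0.9), ((3, 5), 11.8), ((4, 5), 17.9)]

# Map each atom index to its position in the isotopes list (0-based).
position = {atom: i for i, (atom, _) in enumerate(isotopes)}

# Drop the atom indices from the isotopes and order the shifts accordingly.
spin_isotopes = [isotope for _, isotope in isotopes]
shift_by_atom = dict(shifts)
spin_shifts = [shift_by_atom[atom] for atom, _ in isotopes]

# Re-express each coupling in terms of the new spin indices.
spin_couplings = [
    ((position[a], position[b]), value) for (a, b), value in j_couplings
]

print(spin_isotopes)
print(spin_shifts)
print(spin_couplings)
```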

CHESHIRE module

In the cheshire module, one can find molecular data for molecules belonging to the CHESHIRE database. Five data sets (MolecularDataSet objects) have been created from this database depending on the NMR data:

  • experimental_shifts_only: It includes the experimental shifts (for 13C and 1H) of 105 molecules.
  • calculated_full: It has theoretical NMR data for the 65 rigid molecules (molecules with only one conformer) of the previous set. Details of the calculations can be found under the description attribute of each item (see below).
  • combined_full: It contains experimental shifts and theoretical J-couplings for the rigid molecules (only 60 molecules in this case, due to an incomplete number of shifts in experimental_shifts_only).
  • The calculated and combined data sets are the reduced versions of the aforementioned sets and contain only the NMR data required for simulating 1H-NMR spectra.

In addition, for non-expert users, we have included the alias molecules, which returns the combined set, i.e., 1H-NMR data for the rigid molecules, with experimental shifts and calculated J-couplings.

These data sets can be imported from hqs_nmr_parameters.cheshire as: experimental_shifts_only, calculated_full, combined_full, calculated, and combined. Here, we will focus on the molecules set, which is imported as cheshire_molecules to avoid confusion with the examples module.

from hqs_nmr_parameters.cheshire import molecules as cheshire_molecules

In the following, we retrieve some useful information from the set, starting with its brief description:

from pprint import pprint

pprint(cheshire_molecules.description)
('Experimental shifts and theoretical J-couplings for the rigid molecules of '
 "the Cheshire set, except for ['Cyclopropanone', 'Bicyclobutane', "
 "'Cyclopentanone', 'Fluorobenzene', 'Indole'] due to incompatible data.\n"
 "Shifts and couplings only for nuclei ['1H', '19F', '31P', '29Si'].")

We can also list the keys that give access to the molecular data. For simplicity, they are string representations of integers from 1 to 105. In the molecules set, where only rigid molecules are included, some numbers are missing:

print(cheshire_molecules.keys)
['1', '2', '4', '5', '6', '9', '10', '11', '12', '13', '15', '16', '17', '18', '20', '23', '29', '30', '32', '33', '34', '36', '39', '41', '42', '44', '46', '47', '48', '49', '50', '51', '52', '54', '55', '59', '60', '61', '66', '68', '71', '73', '74', '75', '76', '77', '81', '84', '85', '86', '87', '88', '91', '92', '93', '95', '96', '99', '100', '105']
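To see which numbers are absent, we can compare the keys with the full range from 1 to 105. The sketch below works on the list copied from the output above, so it runs without the package; with a different installed version the list, and therefore the result, may differ.

```python
# Keys copied from the output above.
keys = [
    '1', '2', '4', '5', '6', '9', '10', '11', '12', '13', '15', '16', '17',
    '18', '20', '23', '29', '30', '32', '33', '34', '36', '39', '41', '42',
    '44', '46', '47', '48', '49', '50', '51', '52', '54', '55', '59', '60',
    '61', '66', '68', '71', '73', '74', '75', '76', '77', '81', '84', '85',
    '86', '87', '88', '91', '92', '93', '95', '96', '99', '100', '105',
]

# The keys are strings, so convert them to integers before comparing.
present = {int(key) for key in keys}
missing = [n for n in range(1, 106) if n not in present]

print(len(keys), "entries present,", len(missing), "numbers missing")
print("First missing numbers:", missing[:5])
```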

As before, the molecule names can be obtained using the get_names method. Here, however, we will focus on a single entry:

print(cheshire_molecules["1"].name)
Dichloromethane

In the description of each entry we find important information about how the NMR parameters were obtained.

print(cheshire_molecules["1"].description)
Geometries in chloroform at B97-3c.
Experimental shifts from CHESHIRE: http://cheshirenmr.info/.
J-couplings (gas-phase) at PBE/pcJ-3.
Parameters averaged over rotamers using permutations.

Each entry in the data set includes both a 2D (Molfile) and a 3D (XYZ) representation of the molecule.

pprint(cheshire_molecules["1"].structures)
{'XYZ': ChemicalStructure(representation='XYZ', content='5\n\nC       -0.00000000     0.00000000     0.77788868\nCl       0.00000000     1.49340832    -0.21847658\nCl       0.00000000    -1.49340833    -0.21847658\nH       -0.89835401     0.00000000     1.37677523\nH        0.89835401     0.00000000     1.37677523\n', charge=0, symbols=['C', 'Cl', 'Cl', 'H', 'H'], atom_map=[0, 1, 2, 3, 4]),
 'Molfile': ChemicalStructure(representation='Molfile', content='\n     RDKit          2D\n\n  0  0  0  0  0  0  0  0  0  0999 V3000\nM  V30 BEGIN CTAB\nM  V30 COUNTS 5 4 0 0 0\nM  V30 BEGIN ATOM\nM  V30 1 C 0.000000 -0.000000 0.000000 0\nM  V30 2 Cl 0.000000 1.500000 0.000000 0\nM  V30 3 Cl -0.000000 -1.500000 0.000000 0\nM  V30 4 H 1.500000 -0.000000 0.000000 0\nM  V30 5 H -1.500000 0.000000 0.000000 0\nM  V30 END ATOM\nM  V30 BEGIN BOND\nM  V30 1 1 2 1\nM  V30 2 1 3 1\nM  V30 3 1 1 4 CFG=3\nM  V30 4 1 5 1\nM  V30 END BOND\nM  V30 END CTAB\nM  END\n', charge=0, symbols=['C', 'Cl', 'Cl', 'H', 'H'], atom_map=[0, 1, 2, 3, 4])}
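The content of the XYZ structure is plain text in the usual XYZ layout (atom count, comment line, then one atom per line), so it can be processed with a few lines of Python. The sketch below parses the string shown above and computes a C-Cl distance. The helper parse_xyz is hypothetical, and the coordinates are assumed to be in angstrom, as is conventional for XYZ files.

```python
import math

# XYZ content copied from the dichloromethane entry above; with the package
# installed it corresponds to the content attribute of the 'XYZ' structure.
xyz_content = (
    "5\n\n"
    "C       -0.00000000     0.00000000     0.77788868\n"
    "Cl       0.00000000     1.49340832    -0.21847658\n"
    "Cl       0.00000000    -1.49340833    -0.21847658\n"
    "H       -0.89835401     0.00000000     1.37677523\n"
    "H        0.89835401     0.00000000     1.37677523\n"
)


def parse_xyz(content: str) -> list[tuple[str, tuple[float, float, float]]]:
    """Parse XYZ text into (symbol, (x, y, z)) tuples."""
    lines = content.splitlines()
    n_atoms = int(lines[0])
    atoms = []
    # Line 0 holds the atom count, line 1 the (here empty) comment.
    for line in lines[2 : 2 + n_atoms]:
        symbol, x, y, z = line.split()
        atoms.append((symbol, (float(x), float(y), float(z))))
    return atoms


atoms = parse_xyz(xyz_content)
symbols = [symbol for symbol, _ in atoms]
print(symbols)

# Distance between the carbon (atom 0) and the first chlorine (atom 1).
c_cl = math.dist(atoms[0][1], atoms[1][1])
print(f"C-Cl distance: {c_cl:.3f}")
```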

To access the NMR data:

pprint(cheshire_molecules["1"].spin_system().model_dump())
{'isotopes': [(1, 'H'), (1, 'H')],
 'j_couplings': [((0, 1), -5.171)],
 'shifts': [5.28, 5.28]}
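Simulation codes often expect the J-couplings as a symmetric matrix rather than as a list of index pairs. As a minimal sketch, the following converts the dictionary printed above into such a matrix and groups spins that share the same shift. The helper coupling_matrix is hypothetical, and the values are copied from the output, so the snippet runs without the package.

```python
# Spin-system values copied from the dichloromethane output above.
spin_system = {
    "isotopes": [(1, "H"), (1, "H")],
    "j_couplings": [((0, 1), -5.171)],
    "shifts": [5.28, 5.28],
}


def coupling_matrix(n_spins, j_couplings):
    """Return a symmetric n x n list of lists with J values off the diagonal."""
    matrix = [[0.0] * n_spins for _ in range(n_spins)]
    for (i, j), value in j_couplings:
        matrix[i][j] = value
        matrix[j][i] = value
    return matrix


n_spins = len(spin_system["isotopes"])
j_matrix = coupling_matrix(n_spins, spin_system["j_couplings"])
for row in j_matrix:
    print(row)

# Group spin indices by chemical shift; both protons of dichloromethane
# share one shift value in this entry.
groups = {}
for index, shift in enumerate(spin_system["shifts"]):
    groups.setdefault(shift, []).append(index)
print(groups)
```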

Assignments module

The assignments module contains example data for other, more complex molecules.

Patchoulol

The patchoulol data set contains two molecules, the originally proposed structure for patchouli alcohol and the correct structure. We can access it as:

from hqs_nmr_parameters.assignments import patchoulol

To get an overview of the set, we could print its description:

print(patchoulol.description)
Theoretical <sup>1</sup>H-NMR data (shifts and J-couplings) for patchouli alcohol (patchoulol).
This set contains molecular data (with NMR parameters) related to the experimentally confirmed (correct) structure of patchoulol as well as for the structure initially (erroneous) attributed to patchoulol (see Scheme 1 of https://doi.org/10.1002/anie.200460864 for details).
The experimental <sup>1</sup>H-NMR spectrum of correct patchoulol is available at https://doi.org/10.13018/BMSE001312.

Only the two molecules mentioned above are present in the set. We can access them via their keys:

print(patchoulol.keys)
print(patchoulol["correct"].name,"\n",patchoulol["erroneous"].name)
['correct', 'erroneous']
Patchouli alcohol 
 4,10,11,11-Tetramethyltricyclo[5.3.1.01,5]undecan-10-ol

With this data, we can now use the HQS NMR Tool to simulate both spectra and see the differences between these two similar molecules as well as compare with the experimental spectrum.

Menthol isomers

The menthol_isomers data set is a collection of the four possible diastereomers of menthol (5-methyl-2-(propan-2-yl)cyclohexan-1-ol). With three chiral centers at positions 1, 2, and 5 (in IUPAC convention), there are the following eight possible structures:

  • Menthol:

    • (+)-enantiomer, with stereocenters 1S, 2R, 5S.
    • (-)-enantiomer, with stereocenters 1R, 2S, 5R.
  • Neomenthol:

    • (+)-enantiomer, with stereocenters 1S, 2S, 5R.
    • (-)-enantiomer, with stereocenters 1R, 2R, 5S.
  • Isomenthol:

    • (+)-enantiomer, with stereocenters 1S, 2R, 5R.
    • (-)-enantiomer, with stereocenters 1R, 2S, 5S.
  • Neoisomenthol:

    • (+)-enantiomer, with stereocenters 1R, 2R, 5R.
    • (-)-enantiomer, with stereocenters 1S, 2S, 5S.

Since enantiomers are not distinguishable by conventional NMR spectroscopy, there are four different possible NMR spectra. The given data set contains NMR parameters calculated with density functional theory (DFT) for one enantiomer of each pair and can be imported from the assignments module as menthol_isomers_full for 1H- and 13C-NMR parameters or as menthol_isomers for only 1H-NMR data.

The description gives an overview of the data set:

from hqs_nmr_parameters.assignments import menthol_isomers

print(menthol_isomers.description)
Calculated NMR parameters (only 1H) for all four stereoisomers of menthol.
Structures are (absolute stereochemistry indicated by chiral centers at positions 1, 2, and 5 in IUPAC convention):
(1S,2S,5R)-(+)-Neomenthol (SSR): data averaged over 3 conformers.
(1R,2S,5R)-(-)-Menthol (RSR): data averaged over 3 conformers.
(1S,2R,5R)-(+)-Isomenthol (SRR): data averaged over 6 conformers.
(1S,2S,5S)-(-)-Neoisomenthol (SSS): data averaged over 8 conformers.
Various experimental NMR spectra of (1S,2S,5R)-(+)-neomenthol are available at https://doi.org/10.13018/BMSE000498.

The keys of the structures in the data set can be listed as:

for key in menthol_isomers.keys:
    print(f"{key}: {menthol_isomers[key].name}")
SSR: (+)-Neomenthol (SSR)
RSR: (-)-Menthol (RSR)
SRR: (+)-Isomenthol (SRR)
SSS: (-)-Neoisomenthol (SSS)
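The relationship between the eight stereoisomers listed earlier and the four keys of the data set can be checked with a short script: inverting every stereocenter (R to S and vice versa) yields the enantiomer, so the eight labels collapse into four pairs. The dictionary below is transcribed from the list above, and the helper mirror is hypothetical.

```python
# Stereodescriptors at positions 1, 2, 5 for the eight structures listed above.
stereoisomers = {
    "SRS": "(+)-Menthol",
    "RSR": "(-)-Menthol",
    "SSR": "(+)-Neomenthol",
    "RRS": "(-)-Neomenthol",
    "SRR": "(+)-Isomenthol",
    "RSS": "(-)-Isomenthol",
    "RRR": "(+)-Neoisomenthol",
    "SSS": "(-)-Neoisomenthol",
}

# Keys of the menthol_isomers data set as printed above.
dataset_keys = ["SSR", "RSR", "SRR", "SSS"]


def mirror(label: str) -> str:
    """Invert every stereocenter (R <-> S) to get the enantiomer's label."""
    return label.translate(str.maketrans("RS", "SR"))


# Each enantiomer pair is stored as an unordered set of two labels.
pairs = {frozenset((label, mirror(label))) for label in stereoisomers}
print(len(pairs), "enantiomer pairs")

for key in dataset_keys:
    partner = mirror(key)
    print(f"{key} {stereoisomers[key]} <-> {partner} {stereoisomers[partner]}")

# The data set holds exactly one representative of each pair.
covered = {frozenset((key, mirror(key))) for key in dataset_keys}
print(covered == pairs)
```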

For more information on the applied computational level of theory, please inspect the description attribute of the individual entries.

This data can be used to simulate the NMR spectra of all diastereomers as explained earlier and compare them to experimental ones, e.g., to that of neomenthol available at https://doi.org/10.13018/BMSE000498. Due to the limited accuracy of DFT calculations, it is not always straightforward to identify the correct isomer if the exact structure of the measured compound is unknown, but the comparison with all four possibilities will provide valuable insights for structure elucidation.