Data sets
Molecular data for a group of molecules can be collected using the MolecularDataSet Pydantic class, which contains instances of type MolecularData. It is composed of two attributes:
- description: Summary of the content of the data set.
- dataset: Dictionary where the keys are identifiers for the molecules (e.g., molecule names) and the values correspond to a MolecularData object per molecule.
A list with the keys of the dataset can be obtained directly from the MolecularDataSet.keys property. This allows us to conveniently access the molecular data of each molecule using its key as a string.
To get a brief summary of the molecules belonging to a data set, we can use the get_names method. It returns a dictionary where the keys correspond to the keys of the dataset and the values are the chemical names. An equivalent dictionary providing the chemical formulas can be obtained via the get_formulas method.
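To make the relationship between dataset, keys, get_names, and get_formulas concrete, the following is a simplified stand-in built from plain dataclasses. It is not the package's Pydantic implementation: the class names ToyMolecule and ToyDataSet are hypothetical, and only the attribute and method names mentioned above are mirrored.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMolecule:
    """Hypothetical stand-in for MolecularData (only two fields)."""

    name: str
    formula: str


@dataclass
class ToyDataSet:
    """Hypothetical stand-in for MolecularDataSet."""

    description: str
    dataset: dict[str, ToyMolecule] = field(default_factory=dict)

    @property
    def keys(self) -> list[str]:
        # Keys of the underlying dictionary, in insertion order.
        return list(self.dataset)

    def __getitem__(self, key: str) -> ToyMolecule:
        # Access a molecule by its string key.
        return self.dataset[key]

    def get_names(self) -> dict[str, str]:
        # Map each key to the chemical name of its molecule.
        return {key: mol.name for key, mol in self.dataset.items()}

    def get_formulas(self) -> dict[str, str]:
        # Map each key to the chemical formula of its molecule.
        return {key: mol.formula for key, mol in self.dataset.items()}


toy_set = ToyDataSet(
    description="Two example molecules.",
    dataset={
        "C2H3CN": ToyMolecule(name="Acrylonitrile", formula="C3H3N"),
        "C6H6": ToyMolecule(name="Benzene", formula="C6H6"),
    },
)

print(toy_set.keys)
print(toy_set.get_names())
print(toy_set["C6H6"].formula)
```

The point of the sketch is that the keys are identifiers chosen for the data set, while names and formulas are stored inside each entry and only collected on request.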
Similar to keep_nuclei/drop_nuclei in MolecularData, the keep_isotopes and drop_isotopes methods keep or drop selected isotopes for all molecules in a data set. An extra description can be added with updated information about the set; any whitespace needed to separate it from the original description content must be included in the added text.
As in MolecularData, data sets can be loaded and saved with the read_file and write_file methods, which deserialize from and serialize to JSON files.
The data sets implemented in the hqs-nmr-parameters package are described in the following sections.
General remarks
The hqs-nmr-parameters
package contains several data sets divided into modules with data from different origins and for different purposes. In general, the MolecularDataSet
object of a specific set can be imported with:
from hqs_nmr_parameters.<dataset> import <variant>
With the exception of assignments
, the data sets have a variant called molecules
which contains the recommended data to be used.
To import the different modules in a single MolecularDataSet
object, one can do:
from hqs_nmr_parameters import molecules
which is equivalent to:
from hqs_nmr_parameters.merged import molecules
Including the one imported above, there are three variants that integrate the data available in hqs-nmr-parameters
in some way:
- molecules: Includes all available molecules with any data, using experimental shifts if possible.
- calculated: Includes all available molecules that contain purely calculated parameters (shifts and J-couplings).
- combined: Includes all available molecules for which a combination of experimental shifts and calculated J-couplings is available.
The data sets used here are described in more detail below along with the most important methods to access their contents. These methods (and properties) are accessible for each data set, including the merged ones presented here.
Examples module
The first data set worth mentioning is the examples module, which provides a set of molecule definitions encapsulated in the MolecularDataSet object molecules. This set can be accessed via:
from pprint import pprint
from hqs_nmr_parameters.examples import molecules
print(type(molecules)) # MolecularDataSet
# Keys of the data set:
print(molecules.keys)
<class 'hqs_nmr_parameters.code.data_classes.MolecularDataSet'>
['CH3Cl', 'limonene_DFT', '1,2,4-trichlorobenzene', 'Anethole', 'Artemisinin_exp', 'endo-dicyclopentadiene_DFT', 'CH3Cl_13C', 'C2H3CN', 'Artemisinin', 'camphor_DFT', 'C6H6', 'C10H8', 'Triphenylphosphine_oxide', 'H2CCF2', 'C2H5Cl', 'Androstenedione', 'Cinnamaldehyde', 'CHCl3_13C', 'C6H5NO2', 'C2H5OH', 'C2H6', 'C10H7Br', 'CHCl3', 'camphor_exp', 'exo-dicyclopentadiene_DFT', '1,2-di-tert-butyl-diphosphane', 'C2H3NC', 'cyclopentadiene_DFT', 'cis-3-chloroacrylic_acid_exp', 'C3H8']
Note that the content of this list of keys is just an example and might appear in a different order or with different entries depending on the installed version of hqs-nmr-parameters
. The same holds for the dictionary of molecule names which can be obtained as follows:
pprint(molecules.get_names())
{'1,2,4-trichlorobenzene': '1,2,4-Trichlorobenzene',
'1,2-di-tert-butyl-diphosphane': 'tert-Butyl(tert-butylphosphanyl)phosphane',
'Androstenedione': 'Androstenedione',
'Anethole': 'Anethole',
'Artemisinin': 'Artemisinin',
'Artemisinin_exp': 'Artemisinin',
'C10H7Br': '2-Bromonaphthalene',
'C10H8': 'Naphthalene',
'C2H3CN': 'Acrylonitrile',
'C2H3NC': 'Vinyl isocyanide',
'C2H5Cl': 'Chloroethane',
'C2H5OH': 'Ethanol',
'C2H6': 'Ethane',
'C3H8': 'Propane',
'C6H5NO2': 'Nitrobenzene',
'C6H6': 'Benzene',
'CH3Cl': 'Chloromethane',
'CH3Cl_13C': 'Chloromethane',
'CHCl3': 'Chloroform',
'CHCl3_13C': 'Chloroform',
'Cinnamaldehyde': 'Cinnamaldehyde',
'H2CCF2': '1,1-Difluoroethene',
'Triphenylphosphine_oxide': 'Triphenylphosphine oxide',
'camphor_DFT': 'Camphor',
'camphor_exp': 'Camphor',
'cis-3-chloroacrylic_acid_exp': 'cis-3-Chloroacrylic acid',
'cyclopentadiene_DFT': 'Cyclopentadiene',
'endo-dicyclopentadiene_DFT': 'endo-Dicyclopentadiene',
'exo-dicyclopentadiene_DFT': 'exo-Dicyclopentadiene',
'limonene_DFT': 'Limonene'}
In addition, to get a feeling for the size of the molecules in the set, we can print their formulas using the get_formulas method.
The full molecular definition for a given molecule can be loaded using its string key. Each entry of this data set includes a 2D representation (Molfile or SMILES string) of the molecule. Let us consider an example:
from pprint import pprint
from hqs_nmr_parameters.examples import molecules
# Obtain the MolecularData object for acrylonitrile
parameters = molecules["C2H3CN"]
# Print parameters
pprint(parameters.model_dump())
{'description': '1H parameters for acrylonitrile.\n'
"Values were obtained from Hans Reich's Collection, NMR "
'Spectroscopy.\n'
'https://organicchemistrydata.org\n',
'formula': 'C3H3N',
'isotopes': [(3, (1, 'H')), (4, (1, 'H')), (5, (1, 'H'))],
'j_couplings': [((3, 4), 0.9), ((3, 5), 11.8), ((4, 5), 17.9)],
'method_json': '',
'name': 'Acrylonitrile',
'shifts': [(3, 5.79), (4, 5.97), (5, 5.48)],
'solvent': '',
'structures': {'Molfile': {'atom_map': [0, 1, 2, 3, 4, 5, 6],
'charge': 0,
'content': '\n'
'JME 2022-02-26 Wed Sep 07 15:54:28 '
'GMT+200 2022\n'
'\n'
' 0 0 0 0 0 0 0 0 0 0999 '
'V3000\n'
'M V30 BEGIN CTAB\n'
'M V30 COUNTS 7 6 0 0 0\n'
'M V30 BEGIN ATOM\n'
'M V30 1 C 2.4249 2.1000 0.0000 0\n'
'M V30 2 C 3.6373 1.4000 0.0000 0\n'
'M V30 3 C 1.2124 1.4000 0.0000 0\n'
'M V30 4 H 0.0000 2.1000 0.0000 0\n'
'M V30 5 H 1.2124 0.0000 0.0000 0\n'
'M V30 6 H 2.4249 3.5000 0.0000 0\n'
'M V30 7 N 4.8497 0.7000 0.0000 0\n'
'M V30 END ATOM\n'
'M V30 BEGIN BOND\n'
'M V30 1 1 1 2\n'
'M V30 2 2 1 3\n'
'M V30 3 1 3 4\n'
'M V30 4 1 3 5\n'
'M V30 5 1 1 6\n'
'M V30 6 3 2 7\n'
'M V30 END BOND\n'
'M V30 END CTAB\n'
'M END\n',
'representation': 'Molfile',
'symbols': ['C', 'C', 'C', 'H', 'H', 'H', 'N']}},
'temperature': None}
As we can see, data for setting up a 1H-NMR spectrum of the acrylonitrile molecule has been stored together with its Molfile.
To set up an NMR calculation we are only interested in some of the previous data. To retrieve it, use the spin_system
method (for more information, see the section on the NMRParameters
class):
nmr_parameters = parameters.spin_system()
pprint(nmr_parameters.model_dump())
{'isotopes': [(1, 'H'), (1, 'H'), (1, 'H')],
'j_couplings': [((0, 1), 0.9), ((0, 2), 11.8), ((1, 2), 17.9)],
'shifts': [5.79, 5.97, 5.48]}
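Comparing the two printouts shows how the data is reorganized: the atom indices 3, 4, and 5 of the full definition become the positions 0, 1, and 2 of the spin system. The following is a minimal sketch of that re-indexing using plain Python containers. The helper name to_spin_system is hypothetical, and this is not the package's implementation of spin_system, only an illustration that reproduces the output above from the values shown earlier.

```python
def to_spin_system(isotopes, shifts, j_couplings):
    """Re-index atom-labelled NMR data to zero-based spin positions."""
    # Map each original atom index to its position in the isotopes list.
    position = {atom: i for i, (atom, _) in enumerate(isotopes)}
    shift_by_atom = dict(shifts)
    return {
        "isotopes": [isotope for _, isotope in isotopes],
        "shifts": [shift_by_atom[atom] for atom, _ in isotopes],
        "j_couplings": [
            ((position[a], position[b]), value) for (a, b), value in j_couplings
        ],
    }


# Values copied from the acrylonitrile entry printed above.
isotopes = [(3, (1, "H")), (4, (1, "H")), (5, (1, "H"))]
shifts = [(3, 5.79), (4, 5.97), (5, 5.48)]
j_couplings = [((3, 4), 0.9), ((3, 5), 11.8), ((4, 5), 17.9)]

spin_system = to_spin_system(isotopes, shifts, j_couplings)
print(spin_system)
```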
CHESHIRE module
In the cheshire
module, one can find molecular data for molecules belonging to the CHESHIRE database. Five data sets (MolecularDataSet
objects) have been created from this database depending on the collected NMR data:
- experimental_shifts_only: It includes the experimental shifts (for 13C and 1H) of all 105 molecules, but no J-coupling values.
- calculated_full: It has theoretical NMR data for a selection of rigid molecules (molecules with only one conformer) of the previous set. Details of the calculations can be found under the description attribute of each item (see below).
- combined_full: It contains experimental shifts and theoretical J-couplings for the rigid molecules (in this case, a few more molecules were sorted out due to an incomplete number of shifts in experimental_shifts_only).
- The calculated and combined data sets are the reduced versions of the aforementioned sets and contain only the NMR data required for simulating 1H-NMR spectra.
In addition, for non-expert users, we have included the alias molecules
, which returns the combined
set, i.e., 1H-NMR data with experimental shifts and calculated J-couplings.
These data sets can be imported as follows (we will focus on the molecules
set that will be imported as
cheshire_molecules
to avoid confusion with the examples
module):
from hqs_nmr_parameters.cheshire import molecules as cheshire_molecules
As a brief illustration of the type of data included in the sets, we will retrieve some important information:
print(cheshire_molecules.description)
Experimental shifts and theoretical J-couplings for the rigid molecules of the Cheshire set, except for ['Cyclopropanone', 'Bicyclobutane', 'Cyclopentanone', 'Fluorobenzene', 'Indole'] due to incompatible data.
Shifts and couplings only for nuclei ['1H', '19F', '31P', '29Si'].
The keys of the molecules give access to the molecular data. For simplicity, they are string representations of integers from 1 to 105. Note that depending on the imported data set, some numbers might be missing. For instance, only rigid molecules are included in the molecules set.
print(cheshire_molecules.keys[:10])
['1', '2', '4', '5', '6', '9', '10', '11', '12', '13']
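Because the keys are strings rather than integers, a plain sort orders them lexicographically, and gaps in the numbering are easy to overlook. The short sketch below uses only the ten keys printed above (no package import) to show numeric sorting and how to list the numbers that are missing up to the largest key shown.

```python
# The first ten keys as printed above.
keys = ["1", "2", "4", "5", "6", "9", "10", "11", "12", "13"]

# Sorting strings compares character by character, so "10" comes before "2".
sorted_as_text = sorted(keys)

# Passing key=int sorts by numeric value while keeping the string keys.
sorted_as_numbers = sorted(keys, key=int)

# Numbers between 1 and the largest key that have no entry in this excerpt.
present = {int(k) for k in keys}
missing = [n for n in range(1, max(present) + 1) if n not in present]

print(sorted_as_text)
print(sorted_as_numbers)
print(missing)
```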
As before, the molecule names can be obtained using the get_names method. Here, however, we will focus on a single entry:
print(cheshire_molecules["1"].name)
Dichloromethane
In the description of each entry we find important information about how the NMR parameters were obtained.
print(cheshire_molecules["1"].description)
Geometries in chloroform at B97-3c.
Experimental shifts from CHESHIRE: http://cheshirenmr.info/.
J-couplings (gas-phase) at PBE/pcJ-3.
Parameters averaged over rotamers using permutations.
Each entry in the data set includes both a 2D (Molfile) and a 3D (XYZ) representation of the molecule.
print(cheshire_molecules["1"].structures.keys())
dict_keys(['XYZ', 'Molfile'])
To access the NMR data, use the spin_system
method:
pprint(cheshire_molecules["1"].spin_system().model_dump())
{'isotopes': [(1, 'H'), (1, 'H')],
'j_couplings': [((0, 1), -5.171)],
'shifts': [5.28, 5.28]}
Phytolab module
The phytolab
module contains some selected molecules from a catalogue by Phytolab. It includes three variants of data:
- calculated_full: All NMR parameters in this set are computed. Details can be obtained from the description of each item in the set. Where 13C data has also been calculated, shifts and couplings are included for each nucleus.
- calculated: This set of computed parameters is intended for the calculation of one-dimensional 1H-NMR spectra, as the parameters for 13C are omitted.
- combined: Where possible, the chemical shifts have been adjusted manually to achieve a better match with experimental 1H-NMR spectra. Therefore, this set contains a combination of adjusted or computed shifts, and computed J-couplings. This set is recommended to simulate 1H-NMR spectra to obtain the closest agreement with experiment.
In addition, the module includes the set molecules
, which is an alias for combined
as described above.
To access these data sets, import them analogously to the other modules:
from hqs_nmr_parameters.phytolab import molecules as phytolab_molecules
print("Dataset content:")
print(phytolab_molecules.description + "\n")
print(f"Entries of the set: {phytolab_molecules.keys}\n")
print("Details on the NMR parameters for Psoralen:")
print(phytolab_molecules["psoralen"].description)
Dataset content:
NMR parameters for selected natural products from a catalogue by Phytolab.
Where possible, chemical shifts have been adjusted to match experimental spectra. The remaining parameters are computed.
For further details, please refer to descriptions of the individual items in the set.
Entries of the set: ['angelicin', 'bakuchicin', 'psoralen', 'friedelin']
Details on the NMR parameters for Psoralen:
Geometry in chloroform at B97-3c.
Shifts manually adjusted to match 1H-NMR spectrum at 80 MHz in CDCl3 provided by Phytolab.
J-couplings (gas-phase) at PBE/pcJ-3.
As for the examples
module, the content of the sets will depend on the installed version of hqs-nmr-parameters
.
Assignments module
The assignments
module contains example data of other complex molecules.
Patchoulol
The patchoulol
data set contains two molecules, the originally proposed structure for patchouli alcohol
and the correct structure. We can access it as:
from hqs_nmr_parameters.assignments import patchoulol
To get an overview of the set, we can access a brief summary with print(patchoulol.description)
.
Only the two molecules mentioned above are present in the set. We can access them via their keys:
for key in patchoulol.keys:
    print(f"{key}: {patchoulol[key].name}")
correct: Patchouli alcohol
erroneous: 4,10,11,11-Tetramethyltricyclo[5.3.1.01,5]undecan-10-ol
With this data, we can now use the HQS NMR Tool to simulate both spectra and see the differences between these two similar molecules as well as compare with the experimental spectrum.
Menthol isomers
The menthol_isomers data set is a collection of the four possible diastereomers of menthol (5-methyl-2-(propan-2-yl)cyclohexan-1-ol). With three chiral centers at positions 1, 2, and 5 (in IUPAC convention), there are eight possible stereoisomers, which form four pairs of enantiomers:
- Menthol:
  - (+)-enantiomer, with stereocenters 1S, 2R, 5S.
  - (−)-enantiomer, with stereocenters 1R, 2S, 5R.
- Neomenthol:
  - (+)-enantiomer, with stereocenters 1S, 2S, 5R.
  - (−)-enantiomer, with stereocenters 1R, 2R, 5S.
- Isomenthol:
  - (+)-enantiomer, with stereocenters 1S, 2R, 5R.
  - (−)-enantiomer, with stereocenters 1R, 2S, 5S.
- Neoisomenthol:
  - (+)-enantiomer, with stereocenters 1R, 2R, 5R.
  - (−)-enantiomer, with stereocenters 1S, 2S, 5S.
Since enantiomers are not distinguishable by conventional NMR spectroscopy, there are four different possible NMR spectra. The given data set contains NMR parameters calculated with density functional theory (DFT) for one enantiomer of each pair and can be imported from the assignments
module as menthol_isomers_full
for 1H- and 13C-NMR parameters or as menthol_isomers
for only 1H-NMR data.
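The counting argument (three stereocenters, eight stereoisomers, four distinguishable spectra) can be checked with a few lines of plain Python. This is only an illustrative sketch: the three-letter labels give the configuration at positions 1, 2, and 5 in that order, and the helper name mirror is hypothetical.

```python
from itertools import product

# Inverting every stereocenter turns a molecule into its enantiomer.
FLIP = {"R": "S", "S": "R"}


def mirror(config: str) -> str:
    """Return the label of the enantiomer, e.g. 'RSR' -> 'SRS'."""
    return "".join(FLIP[center] for center in config)


# All combinations of R/S at the three stereocenters (positions 1, 2, 5).
configs = ["".join(p) for p in product("RS", repeat=3)]

# Group each configuration with its mirror image; the set removes duplicates.
enantiomer_pairs = sorted({tuple(sorted((c, mirror(c)))) for c in configs})

print(len(configs), "stereoisomers")
for pair in enantiomer_pairs:
    print(pair)
```

Each of the four resulting pairs corresponds to one diastereomer in the list above; (−)-menthol, for example, is the RSR member of the RSR/SRS pair.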
For an overview of the data set, just print its description:
from hqs_nmr_parameters.assignments import menthol_isomers
print(menthol_isomers.description)
The molecular keys and names of the structures in the data set can be listed as:
for key in menthol_isomers.keys:
    print(f"{key}: {menthol_isomers[key].name}")
SSR: (+)-Neomenthol (SSR)
RSR: (-)-Menthol (RSR)
SRR: (+)-Isomenthol (SRR)
SSS: (-)-Neoisomenthol (SSS)
For more information on the applied computational level of theory, please inspect the individual descriptions with the description
attribute.
This data can be used to simulate the NMR spectra of all diastereomers as explained earlier and compare them to experimental ones, e.g., to that of neomenthol available here. Due to the limited accuracy of DFT calculations, it is not always straightforward to identify the correct isomer if the exact structure of the experimental measurement is unknown, but the comparison with all four possibilities will provide valuable insights for structure elucidation. Furthermore, the postprocessing
module of the HQS NMR Tool allows the user to modify the simulated spectrum to better match an experimental reference which will help to reduce the number of reasonable candidate structures.