Datasets

This chapter gives an introduction to all datasets provided by hqs-nmr-parameters and the data structure that is used to retrieve them.

Data types

`MolecularDataSet`

Molecular data for a group of molecules can be collected using the MolecularDataSet Pydantic class, which contains instances of type MolecularData. It is composed of two attributes:

description: Summary of the content of the dataset.
dataset: Dictionary where the keys are identifiers for the molecules (e.g., molecule names) and the values correspond to a MolecularData object per molecule.

A list with the keys of the dataset can be obtained directly from the MolecularDataSet.keys property. This allows us to conveniently access the molecular data of each molecule using its key as string.

To get a brief summary of the molecules belonging to a dataset, we can use the get_names method. It returns a dictionary where the keys correspond to the keys of the dataset and the values are the chemical names. An equivalent dictionary providing the chemical formulas can be accessed via the get_formulas method.

Similar to keep_nuclei/drop_nuclei in MolecularData, the keep_isotopes and drop_isotopes methods keep or drop selected isotopes for all molecules in a dataset. An extra description with updated information about the set can be added. It must include any whitespace needed to separate it from the original description content.

As in MolecularData, it is possible to save or load datasets thanks to the read_file and write_file methods that (de)serialize JSON files.
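The structure described above can be pictured with a small plain-Python stand-in. This is only a sketch of the layout (a description plus a dictionary keyed by molecule identifiers), not the Pydantic class itself: the class names MoleculeStandIn and DataSetStandIn are hypothetical, and only the attribute and method names taken from the text above (description, dataset, keys, get_names, get_formulas) mirror the documented interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MoleculeStandIn:
    """Hypothetical stand-in for one MolecularData entry (two fields only)."""

    name: str
    formula: str


@dataclass
class DataSetStandIn:
    """Hypothetical stand-in mirroring the two documented attributes."""

    description: str
    dataset: Dict[str, MoleculeStandIn] = field(default_factory=dict)

    @property
    def keys(self) -> List[str]:
        # The keys are the molecule identifiers of the dataset.
        return list(self.dataset)

    def get_names(self) -> Dict[str, str]:
        # Map each key to the chemical name of its molecule.
        return {key: mol.name for key, mol in self.dataset.items()}

    def get_formulas(self) -> Dict[str, str]:
        # Map each key to the chemical formula of its molecule.
        return {key: mol.formula for key, mol in self.dataset.items()}


example_set = DataSetStandIn(
    description="Two small example molecules.",
    dataset={
        "C2H3CN": MoleculeStandIn(name="Acrylonitrile", formula="C3H3N"),
        "C6H6": MoleculeStandIn(name="Benzene", formula="C6H6"),
    },
)

print(example_set.keys)
print(example_set.get_names())
print(example_set.get_formulas())
```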

`MolecularDataTable`

It is possible to store multiple molecular datasets in one MolecularDataTable object. This is particularly useful if the datasets contain the same or partially the same molecular structures, but different NMR parameters, e.g., from theoretical calculations using different DFT approaches. A MolecularDataTable object has two attributes:

description: Summary of the content of the data table.
content: A dictionary with keys being row labels (e.g., molecular identifiers as in a MolecularDataSet) and the values being objects of type MolecularDataTableRow that store different NMR parameters for a single molecular structure.

The most important properties of this class are row_labels and column_labels providing the respective labels, where the rows correspond to molecular structures and the columns to a specific data source. With the get_column and get_dataset methods, a dictionary or a MolecularDataSet can be obtained from a table column by specifying its label. Other functions that are known from the MolecularDataSet also exist for the MolecularDataTable. These are, for instance, get_names, get_formulas, read_file, and write_file.
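The row and column layout can be sketched with nested dictionaries. The helper functions below are hypothetical stand-ins named after the documented properties and methods, not the library's implementation. The DFT values are illustrative numbers, and skipping rows that lack a column is an assumption about how partially overlapping datasets might be handled.

```python
# Rows are molecular structures, columns are data sources.
# Row "2" has no "DFT" entry to mimic datasets that only partially overlap.
table = {
    "1": {"experimental shifts": [5.28, 5.28], "DFT": [5.31, 5.31]},
    "2": {"experimental shifts": [7.26]},
}


def row_labels(content):
    """Return the row labels (one per molecular structure)."""
    return list(content)


def column_labels(content):
    """Return all column labels in order of first appearance."""
    labels = []
    for row in content.values():
        for label in row:
            if label not in labels:
                labels.append(label)
    return labels


def get_column(content, label):
    """Return one column as a dictionary, skipping rows without that column."""
    return {key: row[label] for key, row in content.items() if label in row}


print(row_labels(table))
print(column_labels(table))
print(get_column(table, "DFT"))
```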

General remarks

The hqs-nmr-parameters package contains several datasets organized in modules with data from different origins and for different purposes. In general, the MolecularDataSet object of a specific set can be imported with:

from hqs_nmr_parameters.<module> import <dataset>

With the exception of assignments, all modules have a dataset called molecules, which is an alias for the dataset that is presumably the most interesting one for the majority of users and is therefore recommended for general use.

To import the different modules in a single MolecularDataSet object, one can do:

from hqs_nmr_parameters import molecules

which is equivalent to:

from hqs_nmr_parameters.merged import molecules

Including the one imported above, there are four datasets that integrate the data available in hqs-nmr-parameters in some way:

molecules: Includes all available molecules with any data, using experimental shifts if possible.
calculated: Includes all available molecules that contain purely calculated parameters (shifts and J-couplings).
combined: Includes all available molecules for which a combination of experimental shifts and calculated J-couplings is available.
free_trial: Includes data available in the free trial version of HQSpectrum (calculated data from CHESHIRE and examples).

The datasets used here are described in more detail below along with the most important methods to access their contents. These methods (and properties) are accessible for each dataset, including the merged ones presented here.

Examples module

The first module worth mentioning is examples, which provides a set of molecule definitions encapsulated in the MolecularDataSet object molecules. This set can be accessed via:

from pprint import pprint
from hqs_nmr_parameters.examples import molecules

print(type(molecules)) # MolecularDataSet
# Keys of the dataset:
print(molecules.keys)

<class 'hqs_nmr_parameters.code.data_classes.MolecularDataSet'>
['CH3Cl', 'limonene_DFT', '1,2,4-trichlorobenzene', 'Anethole', 'Artemisinin_exp', 'endo-dicyclopentadiene_DFT', 'CH3Cl_13C', 'C2H3CN', 'Artemisinin', 'camphor_DFT', 'C6H6', 'C10H8', 'Triphenylphosphine_oxide', 'H2CCF2', 'C2H5Cl', 'Androstenedione', 'Cinnamaldehyde', 'CHCl3_13C', 'C6H5NO2', 'C2H5OH', 'C2H6', 'C10H7Br', 'CHCl3', 'camphor_exp', 'exo-dicyclopentadiene_DFT', '1,2-di-tert-butyl-diphosphane', 'C2H3NC', 'cyclopentadiene_DFT', 'cis-3-chloroacrylic_acid_exp', 'C3H8']

Note that the content of this list of keys is just an example and might appear in a different order or with different entries depending on the installed version of hqs-nmr-parameters. The same holds for the dictionary of molecule names which can be obtained as follows:

pprint(molecules.get_names())

{'1,2,4-trichlorobenzene': '1,2,4-Trichlorobenzene',
 '1,2-di-tert-butyl-diphosphane': 'tert-Butyl(tert-butylphosphanyl)phosphane',
 'Androstenedione': 'Androstenedione',
 'Anethole': 'Anethole',
 'Artemisinin': 'Artemisinin',
 'Artemisinin_exp': 'Artemisinin',
 'C10H7Br': '2-Bromonaphthalene',
 'C10H8': 'Naphthalene',
 'C2H3CN': 'Acrylonitrile',
 'C2H3NC': 'Vinyl isocyanide',
 'C2H5Cl': 'Chloroethane',
 'C2H5OH': 'Ethanol',
 'C2H6': 'Ethane',
 'C3H8': 'Propane',
 'C6H5NO2': 'Nitrobenzene',
 'C6H6': 'Benzene',
 'CH3Cl': 'Chloromethane',
 'CH3Cl_13C': 'Chloromethane',
 'CHCl3': 'Chloroform',
 'CHCl3_13C': 'Chloroform',
 'Cinnamaldehyde': 'Cinnamaldehyde',
 'H2CCF2': '1,1-Difluoroethene',
 'Triphenylphosphine_oxide': 'Triphenylphosphine oxide',
 'camphor_DFT': 'Camphor',
 'camphor_exp': 'Camphor',
 'cis-3-chloroacrylic_acid_exp': 'cis-3-Chloroacrylic acid',
 'cyclopentadiene_DFT': 'Cyclopentadiene',
 'endo-dicyclopentadiene_DFT': 'endo-Dicyclopentadiene',
 'exo-dicyclopentadiene_DFT': 'exo-Dicyclopentadiene',
 'limonene_DFT': 'Limonene'}

In addition, to get a feeling for the size of the molecules in the set, we can print their formulas using the get_formulas method.

The full molecular definition for a given molecule can be loaded using its string key. Each entry of this dataset includes a 2D representation (Molfile or SMILES string) of the molecule. Let us consider an example:

from pprint import pprint
from hqs_nmr_parameters.examples import molecules

# Obtain the MolecularData object for acrylonitrile
parameters = molecules["C2H3CN"]
# Print parameters
pprint(parameters.model_dump())

{'description': '1H parameters for acrylonitrile.\n'
                "Values were obtained from Hans Reich's Collection, NMR "
                'Spectroscopy.\n'
                'https://organicchemistrydata.org\n',
 'formula': 'C3H3N',
 'isotopes': [(3, (1, 'H')), (4, (1, 'H')), (5, (1, 'H'))],
 'j_couplings': [((3, 4), 0.9), ((3, 5), 11.8), ((4, 5), 17.9)],
 'method_json': '',
 'name': 'Acrylonitrile',
 'shifts': [(3, 5.79), (4, 5.97), (5, 5.48)],
 'solvent': '',
 'structures': {'Molfile': {'atom_map': [0, 1, 2, 3, 4, 5, 6],
                            'charge': 0,
                            'content': '\n'
                                       'JME 2022-02-26 Wed Sep 07 15:54:28 '
                                       'GMT+200 2022\n'
                                       '\n'
                                       '  0  0  0  0  0  0  0  0  0  0999 '
                                       'V3000\n'
                                       'M  V30 BEGIN CTAB\n'
                                       'M  V30 COUNTS 7 6 0 0 0\n'
                                       'M  V30 BEGIN ATOM\n'
                                       'M  V30 1 C 2.4249 2.1000 0.0000 0\n'
                                       'M  V30 2 C 3.6373 1.4000 0.0000 0\n'
                                       'M  V30 3 C 1.2124 1.4000 0.0000 0\n'
                                       'M  V30 4 H 0.0000 2.1000 0.0000 0\n'
                                       'M  V30 5 H 1.2124 0.0000 0.0000 0\n'
                                       'M  V30 6 H 2.4249 3.5000 0.0000 0\n'
                                       'M  V30 7 N 4.8497 0.7000 0.0000 0\n'
                                       'M  V30 END ATOM\n'
                                       'M  V30 BEGIN BOND\n'
                                       'M  V30 1 1 1 2\n'
                                       'M  V30 2 2 1 3\n'
                                       'M  V30 3 1 3 4\n'
                                       'M  V30 4 1 3 5\n'
                                       'M  V30 5 1 1 6\n'
                                       'M  V30 6 3 2 7\n'
                                       'M  V30 END BOND\n'
                                       'M  V30 END CTAB\n'
                                       'M  END\n',
                            'representation': 'Molfile',
                            'symbols': ['C', 'C', 'C', 'H', 'H', 'H', 'N']}},
 'temperature': None}

In this MolecularData object, data for setting up a ¹H NMR spectrum of the acrylonitrile molecule has been stored together with its Molfile.

To set up an NMR calculation, only a part of the previous data is needed. To retrieve only the essential values, the spin_system method can be used as follows (for more information, see the section on the NMRParameters class):

nmr_parameters = parameters.spin_system()
pprint(nmr_parameters.model_dump())

{'isotopes': [(1, 'H'), (1, 'H'), (1, 'H')],
 'j_couplings': [((0, 1), 0.9), ((0, 2), 11.8), ((1, 2), 17.9)],
 'shifts': [5.79, 5.97, 5.48]}
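Comparing the two outputs shows that the atom indices of the full molecule (3, 4, 5) are replaced by consecutive spin indices (0, 1, 2). The following sketch reproduces this reindexing for the acrylonitrile data shown above. The function to_spin_system is a hypothetical helper written for illustration and makes no claim about how spin_system is implemented internally.

```python
def to_spin_system(isotopes, shifts, j_couplings):
    """Reindex atom-labelled NMR data to consecutive spin indices."""
    # Spin index = position of the atom in the isotopes list.
    index_of = {atom: i for i, (atom, _) in enumerate(isotopes)}
    shift_by_atom = dict(shifts)
    return {
        "isotopes": [isotope for _, isotope in isotopes],
        "shifts": [shift_by_atom[atom] for atom, _ in isotopes],
        "j_couplings": [
            ((index_of[a], index_of[b]), value) for (a, b), value in j_couplings
        ],
    }


# Values copied from the acrylonitrile entry printed above.
isotopes = [(3, (1, "H")), (4, (1, "H")), (5, (1, "H"))]
shifts = [(3, 5.79), (4, 5.97), (5, 5.48)]
j_couplings = [((3, 4), 0.9), ((3, 5), 11.8), ((4, 5), 17.9)]

spin_system = to_spin_system(isotopes, shifts, j_couplings)
print(spin_system)
```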

CHESHIRE module

In the cheshire module, one can find molecular data for molecules belonging to the CHESHIRE database. Five datasets (MolecularDataSet objects) have been created from this database depending on the collected NMR data:

experimental_shifts_only: It includes the experimental shifts (for ¹³C and ¹H) of all 105 molecules, but no J-coupling values.
calculated_full: It has theoretical NMR data for all molecules. Details of the calculations can be found under the description attribute of each item (see below).
combined_full: It contains experimental shifts and theoretical J-couplings for all molecules in the set.
The calculated and combined datasets are the reduced versions of the aforementioned sets and contain only the NMR data required for simulating ¹H NMR spectra.

In addition, for non-expert users, we have included the alias molecules, which returns the combined set, i.e., ¹H NMR data with experimental shifts and calculated J-couplings.

These datasets can be imported as follows (we will focus on the molecules set that will be imported as cheshire_molecules to avoid confusion with the examples module):

from hqs_nmr_parameters.cheshire import molecules as cheshire_molecules

As a brief illustration of the type of data included in the sets, the most important information can be retrieved from the dataset's description attribute:

print(cheshire_molecules.description)

The keys of the molecules give access to the molecular data. Since chemical structure names are not practical for a large number of compounds, the keys are string representations of numbers from 1 to 106. The key 63 is missing because that entry contains duplicate data. A list of all keys can be accessed with the cheshire_molecules.keys property. Note that some molecule entries may be missing in CHESHIRE datasets. For information on that, please refer to the dataset's description.

As before, the molecule names can be obtained using the get_names function. But here, we will focus on a single entry:

print(cheshire_molecules["1"].name)

'Dichloromethane'

In the description of each entry we find important information about how the NMR parameters were obtained:

print(cheshire_molecules["1"].description)

Each entry in the dataset includes three structure representations of the molecule: a SMILES string, a two-dimensional Molfile, and a three-dimensional XYZ representation.

print(cheshire_molecules["1"].structures.keys())

dict_keys(['XYZ', 'Molfile', 'SMILES'])

To access only the NMR data, use the spin_system method:

pprint(cheshire_molecules["1"].spin_system().model_dump())

{'isotopes': [(1, 'H'), (1, 'H')],
 'j_couplings': [((0, 1), -5.171)],
 'shifts': [5.28, 5.28]}

Benchmark

The CHESHIRE module contains a submodule called benchmark with additional datasets, which can be imported as from hqs_nmr_parameters.cheshire.benchmark import .... These contain the results of different computational predictions of NMR parameters for all or parts of the molecules in the CHESHIRE set. Since the content of this folder may grow and/or change, we refer to the respective dataset and molecule descriptions. Currently, the following additional datasets are available:

experimental_shifts_only: Same dataset as in the cheshire module directly.
calculated_full: Calculated parameters. Geometries optimized at B97-3c, chemical shifts at PBE0/pcSseg-2, and J-coupling constants at PBE/pcJ-2 level of theory. Same dataset as in the cheshire module directly.
calculated: Reduced version of calculated_full with only the NMR data required for simulating ¹H NMR spectra.
predicted: ¹H chemical shifts and J-coupling constants obtained from HQS's empirical prediction method (method still under development).

Furthermore, the benchmark_1h molecular data table can be imported and combines the datasets experimental_shifts_only (without ¹³C NMR data), calculated, and predicted.

GISSMO module

Similar to CHESHIRE, the GISSMO module contains molecules from an external database called GISSMO, which provides experimental ¹H NMR chemical shifts and J-coupling constants for ¹H–¹H, ¹H–¹⁹F, and ¹H–³¹P couplings of more than 1200 organic molecules.

As there is no chemical shift data for any other isotopes than ¹H, it is possible to import the following slightly different datasets:

experimental_dummy_hetero_shifts: Contains all available experimental shift and J-coupling data. The chemical shift values of ¹⁹F and ³¹P are set to dummy values of 0.0 ppm.
experimental_no_hetero_nuclei: Contains only the available ¹H NMR shifts and ¹H–¹H J-coupling constants.

Here, the molecules dataset is an alias for experimental_dummy_hetero_shifts. It can be imported as follows.

from hqs_nmr_parameters.gissmo import molecules as gissmo_molecules

The molecule keys in the GISSMO dataset are either of the form bmseXXXXXX or Maybridge_XX_YXX (where X are numbers and Y are letters). When searching for a specific molecule entry, we recommend accessing the names in the dictionary returned by gissmo_molecules.get_names() or taking a look at the online library, where the same tags are used. For entries with a bmseXXXXXX key, experimental NMR spectra are available at the BMRB website.

Like the other datasets, the description of the GISSMO dataset and the individual molecule descriptions provide the most important information on the stored data. Additional notes in the descriptions point to special features of the molecular data, e.g., the presence of ¹H–¹⁹F J-coupling constant data in the following example entry (p-fluorobenzoic acid). In the output of this code snippet, there will be a ¹⁹F isotope in the isotopes list, which has a shift value of 0.0 ppm but reasonable J-coupling values to other ¹H atoms. The molecule description contains a note on the included values.

example_gissmo = gissmo_molecules["bmse000739"]
print(example_gissmo.isotopes)
print(example_gissmo.shifts)
print(example_gissmo.j_couplings)
print(example_gissmo.description)
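The difference between the two GISSMO variants can be illustrated on a made-up three-spin system with two ¹H nuclei and one ¹⁹F nucleus carrying the dummy shift of 0.0 ppm. The helper drop_hetero_nuclei is hypothetical and all numerical values are placeholders, not data from the GISSMO entry above. It only shows the idea that removing a nucleus also removes every coupling it takes part in.

```python
def drop_hetero_nuclei(isotopes, shifts, j_couplings, keep=(1, "H")):
    """Keep only nuclei of one isotope and the couplings among them."""
    kept_atoms = {atom for atom, isotope in isotopes if isotope == keep}
    return (
        [(atom, isotope) for atom, isotope in isotopes if atom in kept_atoms],
        [(atom, shift) for atom, shift in shifts if atom in kept_atoms],
        [
            ((a, b), value)
            for (a, b), value in j_couplings
            if a in kept_atoms and b in kept_atoms
        ],
    )


# Placeholder data in the same tuple layout as the MolecularData output.
isotopes = [(0, (1, "H")), (1, (1, "H")), (2, (19, "F"))]
shifts = [(0, 7.1), (1, 8.0), (2, 0.0)]  # 0.0 ppm is the dummy 19F shift
j_couplings = [((0, 1), 8.5), ((0, 2), 9.0), ((1, 2), 5.5)]

h_isotopes, h_shifts, h_couplings = drop_hetero_nuclei(isotopes, shifts, j_couplings)
print(h_isotopes)
print(h_shifts)
print(h_couplings)
```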

Phytolab module

The phytolab module contains some selected molecules from a catalogue by Phytolab. It includes three datasets:

calculated_full: All NMR parameters in this set are computed. Details can be obtained from the description of each item in the set. Where ¹³C data has also been calculated, shifts and couplings are included for each nucleus.
calculated: This set of computed parameters is intended for the calculation of one-dimensional ¹H NMR spectra, as the parameters for ¹³C are omitted.
combined: Where possible, the chemical shifts have been adjusted manually to achieve a better match with experimental ¹H NMR spectra. Therefore, this set contains a combination of adjusted or computed shifts, and computed J-couplings. This set is recommended to simulate ¹H NMR spectra to obtain the closest agreement with experiment.

In addition, the module includes the set molecules, which is an alias for combined as described above.

To access these datasets, import them analogously to the other modules:

from hqs_nmr_parameters.phytolab import molecules as phytolab_molecules

print("Dataset content:")
print(phytolab_molecules.description + "\n")
print(f"Entries of the set: {phytolab_molecules.keys}\n")
print("Details on the NMR parameters for Psoralen:")
print(phytolab_molecules["psoralen"].description)

NMR parameters for selected natural products from a catalogue by Phytolab.
Where possible, chemical shifts have been adjusted to match experimental spectra. The remaining parameters are computed.
For further details, please refer to descriptions of the individual items in the set.
Shifts and couplings only for nuclei ['1H'].

Entries of the set: ['angelicin', 'psoralen', 'friedelin']

Details on the NMR parameters for Psoralen:
Geometry in chloroform at B97-3c.
Shifts manually adjusted to match 1H-NMR spectrum at 80 MHz in CDCl3 provided by Phytolab.
J-couplings (gas-phase) at PBE/pcJ-3.

As with all other modules, the content of the sets and especially their descriptions will depend on the installed version of hqs-nmr-parameters.

Assignments module

The assignments module contains example data of other complex molecules.

Patchoulol

The patchoulol dataset contains two molecules, the originally proposed structure for patchouli alcohol and the correct structure. We can access it as:

from hqs_nmr_parameters.assignments import patchoulol

To get an overview of the set, we can access a brief summary with print(patchoulol.description).

Only the two mentioned molecules are present in the set. We can access them via their keys:

for key in patchoulol.keys:
    print(f"{key}: {patchoulol[key].name}")

correct: Patchouli alcohol
erroneous: 4,10,11,11-Tetramethyltricyclo[5.3.1.01,5]undecan-10-ol

With this data, we can now use HQS Spectrum Tools to simulate both spectra and see the differences between these two similar molecules as well as compare with the experimental spectrum.

Menthol isomers

The menthol_isomers dataset is a collection of the four possible diastereomers of menthol (5-methyl-2-(propan-2-yl)cyclohexan-1-ol). With three chiral centers at positions 1, 2, and 5 (in IUPAC convention), there are the following eight possible structures:

Menthol:
- (+)-enantiomer, with stereocenters (1S,2R,5S).
- (−)-enantiomer, with stereocenters (1R,2S,5R).
Neomenthol:
- (+)-enantiomer, with stereocenters (1S,2S,5R).
- (−)-enantiomer, with stereocenters (1R,2R,5S).
Isomenthol:
- (+)-enantiomer, with stereocenters (1S,2R,5R).
- (−)-enantiomer, with stereocenters (1R,2S,5S).
Neoisomenthol:
- (+)-enantiomer, with stereocenters (1R,2R,5R).
- (−)-enantiomer, with stereocenters (1S,2S,5S).

Since enantiomers are not distinguishable by conventional NMR spectroscopy, there are four different possible NMR spectra. The given dataset contains NMR parameters calculated with density functional theory (DFT) for one enantiomer of each pair and can be imported from the assignments module as menthol_isomers_full for ¹H and ¹³C NMR parameters or as menthol_isomers for only ¹H NMR data.
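The counting argument (eight structures, four distinguishable spectra) can be checked with a few lines of Python. The sketch enumerates all R/S combinations for the three stereocenters and groups each structure with its mirror image, which has every descriptor inverted. The names assigned to the pairs are taken from the list above. The helper mirror is introduced here for illustration only.

```python
from itertools import product


def mirror(descriptor):
    """Return the enantiomer's descriptor by inverting every stereocenter."""
    return "".join("S" if center == "R" else "R" for center in descriptor)


# All 2**3 combinations for the stereocenters at positions 1, 2, and 5.
all_structures = ["".join(combo) for combo in product("RS", repeat=3)]

# Group each structure with its mirror image; every pair shares one spectrum.
enantiomer_pairs = sorted(
    {tuple(sorted((s, mirror(s)))) for s in all_structures}
)

# Names according to the list of stereocenters given above.
names = {
    ("RSR", "SRS"): "Menthol",
    ("RRS", "SSR"): "Neomenthol",
    ("RSS", "SRR"): "Isomenthol",
    ("RRR", "SSS"): "Neoisomenthol",
}

print(len(all_structures), "structures,", len(enantiomer_pairs), "enantiomer pairs")
for pair in enantiomer_pairs:
    print(pair, names[pair])
```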

For an overview of the dataset, just print its description:

from hqs_nmr_parameters.assignments import menthol_isomers

print(menthol_isomers.description)

The molecular keys and names of the structures in the dataset can be listed as:

for key in menthol_isomers.keys:
    print(f"{key}: {menthol_isomers[key].name}")

SSR: (+)-Neomenthol (SSR)
RSR: (-)-Menthol (RSR)
SRR: (+)-Isomenthol (SRR)
SSS: (-)-Neoisomenthol (SSS)

For more information on the applied computational level of theory, please inspect the individual descriptions with the description attribute.

This data can be used to simulate the NMR spectra of all diastereomers as explained earlier and compare them to experimental ones, e.g., to that of neomenthol available here. Due to the limited accuracy of DFT calculations, it is not always straightforward to identify the correct isomer if the exact structure of the experimental measurement is unknown, but the comparison with all four possibilities will provide valuable insights for structure elucidation. Furthermore, the postprocessing module of HQS Spectrum Tools allows the user to modify the simulated spectrum to better match an experimental reference which will help to reduce the number of reasonable candidate structures.

Statistical evaluations

If, for the same set of molecular structures, two datasets with probe (e.g., calculations) and reference (e.g., experiments) NMR parameter data are available, the probe data can be evaluated quickly using the evaluate_shifts and evaluate_couplings functions to obtain the errors with respect to the reference data and resulting statistical measures. The functions take two MolecularDataSet objects, the first one being the probe dataset to be evaluated and the second one being the reference dataset to be evaluated against. The following basic example shows how to evaluate the calculated chemical shifts from CHESHIRE against the experimental data.

from hqs_nmr_parameters import evaluate_shifts
from hqs_nmr_parameters.cheshire import calculated, experimental_shifts_only

evaluation = evaluate_shifts(calculated, experimental_shifts_only)

The returned object (here evaluation) is an instance of type BenchmarkEvaluation and contains an attribute errors with a list of errors between the values from the probe and the reference dataset along with various properties providing statistical measures. For convenience, all relevant data from the evaluation can be printed with the print_all function.

evaluation.print_all()
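To give an idea of what such an evaluation involves, the following library-independent sketch computes a list of errors and a few common statistical measures for two short lists of shifts. Which measures BenchmarkEvaluation actually provides is not listed here, so the quantities below (mean error, mean absolute error, root-mean-square error, maximum absolute error) are merely typical choices. The sign convention (probe minus reference) and the numerical values are assumptions made for illustration.

```python
from math import sqrt

# Illustrative shift values in ppm (placeholders, not taken from a dataset).
probe = [5.31, 7.40, 2.05]
reference = [5.28, 7.26, 2.17]

# Assumed sign convention: error = probe value minus reference value.
errors = [p - r for p, r in zip(probe, reference)]

mean_error = sum(errors) / len(errors)
mean_absolute_error = sum(abs(e) for e in errors) / len(errors)
root_mean_square_error = sqrt(sum(e * e for e in errors) / len(errors))
max_absolute_error = max(abs(e) for e in errors)

print(f"Mean error:          {mean_error:.3f} ppm")
print(f"Mean absolute error: {mean_absolute_error:.3f} ppm")
print(f"RMSE:                {root_mean_square_error:.3f} ppm")
print(f"Max. absolute error: {max_absolute_error:.3f} ppm")
```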

The evaluate_couplings function can be used analogously for evaluating J-coupling constants. By default, all data points from both datasets are considered in the evaluation. However, for instance, if chemical shift values are available for more than one isotope, it makes sense to perform the evaluation only for data of a specific isotope. Therefore, the following isotope selection arguments are available:

evaluate_shifts
- isotope: Only data for this specified isotope (string or Isotope object) is considered in the evaluation.
evaluate_couplings
- isotopes: Only data for couplings between atoms of these specified isotopes (list of strings/Isotopes) is considered in the evaluation.
- isotope_pairs: Only data for couplings between these specified isotope pairs (list of tuples of strings/Isotopes) is considered in the evaluation.
- Both arguments can be combined, for example, the call
```
evaluate_couplings(probe_dataset, ref_dataset, isotopes=["1H", "19F"], isotope_pairs=[("1H", "31P")])
```
  performs a J-coupling constant evaluation of the data in probe_dataset against that in ref_dataset for all of the following isotope pairs: ¹H–¹H, ¹H–¹⁹F, ¹⁹F–¹⁹F, and ¹H–³¹P.
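The way the two arguments combine in the example call can be reproduced with a short sketch. The function selected_pairs is a hypothetical helper, not part of the package. It assumes that isotopes selects every pair that can be formed from the listed isotopes (including pairs of the same isotope), that isotope_pairs adds the listed pairs on top, and that the order within a pair does not matter.

```python
from itertools import combinations_with_replacement


def selected_pairs(isotopes=(), isotope_pairs=()):
    """Return the set of unordered isotope pairs covered by both arguments."""
    # All pairs that can be formed from the isotopes list, e.g. 1H-1H, 1H-19F.
    pairs = {frozenset(pair) for pair in combinations_with_replacement(isotopes, 2)}
    # Explicitly requested pairs are added on top.
    pairs |= {frozenset(pair) for pair in isotope_pairs}
    return pairs


pairs = selected_pairs(isotopes=["1H", "19F"], isotope_pairs=[("1H", "31P")])

for pair in sorted(sorted(p) for p in pairs):
    # A same-isotope pair collapses to one element in the frozenset.
    print("-".join(pair if len(pair) == 2 else pair * 2))
```

Under these assumptions, a ¹⁹F–³¹P coupling would not be part of the evaluation because neither argument selects that pair.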

These statistical evaluations can also be performed directly on a MolecularDataTable object. In this case, the functions evaluate_table_shifts and evaluate_table_couplings can be used by specifying the table and the label of the column that shall be used as the reference dataset. They return a dictionary with the column labels of all other columns as keys and the corresponding BenchmarkEvaluation objects as values. The isotope selection options are the same as those shown above. An example for evaluating the calculated and predicted chemical shifts in the CHESHIRE benchmark submodule is shown below:

from hqs_nmr_parameters import evaluate_table_shifts
from hqs_nmr_parameters.cheshire.benchmark import benchmark_1h

# The column labels are: "experimental shifts", "DFT", and "empirical prediction".
print("Column labels:", benchmark_1h.column_labels)
evaluations = evaluate_table_shifts(benchmark_1h, "experimental shifts", isotope="1H")

print("\nStatistics for calculated 1H NMR shifts:")
evaluations["DFT"].print_all()
print("\nStatistics for empirically predicted 1H NMR shifts:")
evaluations["empirical prediction"].print_all()
