MQS Molecules Database¶
Figure: REST API - Author: Seobility - License: CC BY-SA 4.0
Microservices and containers¶
Right from the start when MQS was founded we defined a highly modular and microservice based architecture for our cloud software stack. The reasoning behind this decision was that due to the nature of a start-up company we wanted to remain flexible with our portfolio of algorithms and tools. For that reason we containerize the components of our stack (algorithms, database, API, UI) as much as possible and combine them for the needed pipelines/workflows. The first microservice we are releasing as a beta version is our Search API which connects to our database system. The database holds data of molecules and will be extended over time with additional data sets published under free licenses such as the GPL, MIT or Creative Commons licenses.
We have made our current database freely accessible through the user interface within the MQS Dashboard and on a subscription basis via the MQS Search API.
In the following we are describing how you can make use of the Search API when you have subscribed to the basic plan. You can find the subscription overview at the top right of the user menu in the dashboard.
Authentication to the MQS Search API¶
To authenticate via the Search API you need to make use of your email address which you have used for logging into the dashboard.
In the below code you can see where you have to provide your email address and your password in the credentials dictionaries' key-value pairs "username":"YOUR_EMAIL" and "password": "YOUR_PW".
import requests
credentials = {"username": "YOUR_EMAIL", "password": "YOUR_PW"}
api_url = "https://api.mqs.dk"
token = requests.post(f"{api_url}/token", data=credentials).json()["access_token"]
headers = {"Accept": "application/json", "Authorization": f"Bearer {token}"}
query = "isomeric_smiles:CC"
data = requests.get(f"{api_url}/search", params={"q": query}, headers=headers).json()
print("data:", data)
By running the above Python script a token is retrieved from the API post request. This token is being stored in the headers dictionary which will be utilized for any other request to the API.
The last lines in the above script showcase how you can search for a molecule in the database with an isomeric SMILES string.
When printing the retrieved information with the print() function one can see that the API has sent back data and it is being resolved as a dictionary by the requests package, due to the applied json() function. The data dictionary holds a lot of nested information. You can run the above example in this Python script or Jupyter Notebook which we have compiled covering the examples in this tutorial.
As you will see, the last print() function gives an overview of the complete data set for the ethane (CC) molecule.
On the molecule search page (Figure 1) we provide a documentation of the Search API.
Figure 1: "Search for chemicals (via API) tab on the molecule search page in the MQS Dashboard"
Jupyter notebooks¶
You can find Jupyter notebooks which show how to use the API and SDK in the notebooks folder of the Python SDK: https://gitlab.com/mqsdk/python-sdk/-/tree/main/notebooks
Get individual properties of molecules¶
The next code example shows how one can retrieve specific information of a molecule such as the main title (main compound name) of the molecule, the number of atoms, the geometry optimized xyz coordinates of the individual atoms of the molecule, the vibrational spectra and orbitals information.
Important to acknowledge is that first the molecule id has to be retrieved from the data one has obtained from a previous request. The below code starts where the last code example has ended:
#Use the ID from the last search result to get the molecule's complete datasheet
id = results["response"]["docs"][0]["id"]
#id = "626745bf5297cf60c40d9459"
data = requests.get(f"{api_url}/compound/{id}", headers=headers).json()
#Print information about the molecule
#title
print("title:", data["title"], "\n")
#number of atoms
atom_count = data['pubchemqc_pm6'][0]['atoms']['elements']['atom_count']
#3d coordinates
print("Geometry optimized xyz coordinates:\n", np.asarray(data["pubchemqc_pm6"][0]["atoms"]["coords"]["3d"]).reshape(atom_count,3), "\n")
#vibrational spectra
print("Vibrational spectra data:", data["pubchemqc_pm6"][0]["vibrations"], "\n")
#orbitals
print("Orbitals data:", data["pubchemqc_pm6"][0]['properties']["orbitals"], "\n")
As you might have now noted, we have introduced a new request endpoint: {api_url}/compound
and made use of the id retrieved from the data which we in turn requested from the {api_url}/search
endpoint.
We are interacting with the API as we would do in the user interface: first we search for a compound and then we request further information on a searched molecule.
How to obtain detailed information could be quite confusing in the beginning since one has not yet seen the complete schema of the database. For a start we will go through the five examples depicted in the above code: title, number of atoms, 3d coordinates, vibrational spectra and orbitals.
To access the title we can make a first level look-up: data["title"]. That was easy.
The number of atoms (atom count) is currently stored in a specific data set which the database holds and therefore one needs to first reference the data set and then the state of the molecule (e.g. 0: anion, 1: cation, 2: S0, singlet state 3: T0, triplet state), followed by
['atoms']['elements']['atom_count']`.
It is important to acknowledge that the PubchemQC PM6 data set holds information for various states and sometimes no data for a specific state is available. Therefore one can not just automate going through the index assuming that the indices 0, 1, 2 and 3 hold information about the anion, cation, singlet and triplet state. If for a molecule no information is available for the cation state, then index 1 would hold information about the singlet state and index 2 about the triplet state. In the last section of this tutorial we show how one can easily check and automate how to high-throughput screen through specific state information.
Back to the above code example: data["pubchemqc_pm6"][0]["atoms"]["coords"]["3d"]
let's one access geometry optimized 3d coordinates data which is being stored in a numpy array and reshaped so that one can view the xyz columns element by element (rows) in the output.
Now it should be clear that one needs to have some idea how the data set has been stored in the database and for this the schema is of importance for a user. In the following you can see the full schema of the PubchemQC PM6 data set which helps to request specific data from the API:
pubchemqc_pm6:
[
{
state: string
vibrations: {
frequencies: [number]
intensities: {
IR: [number]
}
}
atoms: {
elements: {
atom_count: number
number: [number]
heavy_atom_count: number
}
coords: {
3d: [number]
}
core_electrons: [number]
}
properties: {
number_of_atoms: number
temperature: number
total_dipole_moment: number
multiplicity: number
energy: {
alpha: {
homo: number
gap: number
lumo: number
}
beta: {
homo: number
gap: number
lumo: number
}
total: number
}
orbitals: {
energies: [[number]]
basis_number: number
MO_number: number
homos: [number]
}
enthalpy: number
charge: number
partial_charges: {
mulliken: [number]
}
}
openbabel_inchi: string
openbabel_canonical_smiles: string
}
]
How the vibrational spectra and orbitals information have been referenced in the above code example can be comprehended by comparing the Python code with the schema.
Other identifiers to search with¶
In the previous example a search was performed with a SMILES string. But there are several other identifiers you can make use of for searching the database. For example, the Pubchem CID, the main name title (main compound name) of the molecule, synonyms of the molecule, InChi and InChi key.
Here a complete overview what one can use as search queries via the API:
pubchemqc_pm6.state and pubchemqc_pm6.openbabel_canonical_smiles
show how you can go directly into a data set to search for example via the available molecule states (e.g. S0, anion, cation) or for the openbabel_canonical_smiles
which have been generated in this data set.
Get sets of molecules information¶
The following Python script example shows how one can high-throughput screen via the API to retrieve sets of molecule information. We apply a list comprehension instead of a for loop to iterate over API requests. The authentication procedure is now defined with a function and the main part shows how the ids of the found compounds are requested from the Search API and stored in the id set list. The queries list is generated by iterating over carbon chains from C2 to C10 and different identifiers are utilized. Isomeric SMILES, CID, InChI key and CAS work the best to receive an exact match with the respective identifier.
Get sets of molecules with different identifiers
import requests
import json
def get_headers(credentials, api_url):
"""
Get an access token which is valid for 2 hours
"""
token = requests.post(f"{api_url}/token", data=credentials).json()["access_token"]
headers = {"Accept": "application/json", "Authorization": f"Bearer {token}"}
return headers
if __name__ == "__main__":
credentials = {
"username": "YOUR_EMAIL",
"password": "YOUR_PW"
}
api_url = "https://api.mqs.dk"
headers = get_headers(credentials, api_url)
queries_list = ["isomeric_smiles:CC", #ethane (C2)
"cid:6334", #propane (C3)
"isomeric_smiles:CCCC", #butane (C4)
"isomeric_smiles:CCCCC", #pentane (C5)
"isomeric_smiles:CCCCCC", #hexane (C6)
"inchi_key:IMNFDUFMRHMDMM-UHFFFAOYSA-N", #heptane (C7)
"isomeric_smiles:CCCCCCCC", #octane (C8)
"cas:111-84-2", #nonane (C9)
"isomeric_smiles:CCCCCCCCCC"] #decane (C10)
results = [requests.get(f"{api_url}/search", params={"q": query}, headers=headers).json() for query in queries_list]
number_of_compounds = range(len(queries_list))
id_set_list = []
for i in number_of_compounds:
print(results[i]['response']['docs'][0], "\n")
id_set_list.append(results[i]["response"]["docs"][0]["id"])
print("id_set_list:", id_set_list)
data_set = []
for id in id_set_list:
data_set.append(requests.get(f"{API_URL}/compound/{id}", headers=headers).json())
vibrational_spectra_data = [data_set[i]["pubchemqc_pm6"][0]["vibrations"] for i in number_of_compounds]
print("vibrational_spectra_data:", vibrational_spectra_data)
Example: State look-up¶
Anion, cation, S0, D0, T0, Q0 states look-up
The PubchemQC PM6 data set indices for the states can change because some compounds do not hold data for all states.
The following code snippet shows how one can read out the data from different states for a set of molecules.
for i in number_of_compounds:
for dict_ in data_set[i]['pubchemqc_pm6']:
if dict_['state'] == "anion":
print("Compound:", data_set[i]['title'], "\n", "Anion data:", dict_, "\n")
if dict_['state'] == "cation":
print("Compound:", data_set[i]['title'], "\n", "Cation data:", dict_, "\n")
if dict_['state'] == "S0":
print("Compound:", data_set[i]['title'], "\n", "S0 data:", dict_, "\n")
if dict_['state'] == "D0":
print("Compound:", data_set[i]['title'], "\n", "D0 data:", dict_, "\n")
if dict_['state'] == "T0":
print("Compound:", data_set[i]['title'], "\n", "T0 data:", dict_, "\n")
if dict_['state'] == "Q0":
print("Compound:", data_set[i]['title'], "\n", "Q0 data:", dict_, "\n")
Before designing a high-throughput pipeline via the MQS Search API and tailored scripts, one can also make use of the MQS Dashboard Search UI to easily check what kind of data one can find in the database.
For example you can use the dashboard search field to screen for all available S0 state data by typing pubchemqc_pm6.state:S0
into the search input field.
```
Overview of API endpoints¶
When interacting with a REST-API we can define sending data via the API with a POST call and receiving data with a GET call to the server. The following POST and GET methods exist currently with the MQS Search API:
POST /token
: This endpoint generates an access token that is required to call the other two endpoints. The token is generated using the user's email and password.GET /search
: This endpoint is used to search the chemical compounds data in the database.GET /compound/{id}
: This endpoint is used to retrieve detailed information of a specific compound.
Authentication is required for accessing the /search
and /compound/{id}
endpoints. The user must provide a valid bearer token in the Authorization HTTP header.
Generating the access token¶
To generate the access token, use the POST /token
endpoint. The following parameters should be included in the request body:
* email
: The email address of the user which has been used to set up an account via the MQS Dashboard.
* password
: The password of the user.
See Tutorial I for a code example.
Searching for compounds¶
To search for compounds, use the GET /search
endpoint. The following parameter should be included in the request URL:
* q
: The search query string.
One can optimize the search query string by making use of the MQS Dashboard search interface to test the query string by seeing which kind of molecules are found and how to further refine the query string.
For example if you would like to search for the chemical class of polychlorinated biphenyls (PCBs) you can start with typing "polychlorinated biphenyls" or "PCB" in the search field.
Retrieving a compound datasheet¶
To retrieve the full datasheet of a single compound, use the GET /compound/{id}
endpoint. The following parameter should be included in the request URL:
* id
: The ID of the compound.
The differentiation between the two endpoints GET /search
and the GET /compound/{id}
can be exemplified with imagining how the search dashboard interface is being used:
The search field element of the dashboard makes use of the GET /search
method while when a user clicks on a specific compound, the GET /compound/{id}
method is applied.
Dashboard search field query¶
The search query field of the dashboard allows to tailor your search queries first before retrieving all the data of the molecules. As with many other search engines one can make use of operators (AND, OR, NOT) and wildcards (*, #, ?, %) to refine your search query to retrieve a search result with only a set of structures of interest.
For example to search for polychlorinated biphenyls (PCBs) one can search for the term:
"PCB"
but will then also retrieve a search result with structures including S, N, or O atoms.
A refinement of the search string to:
"PCB AND C12 NOT formula:S NOT formula:N NOT formula:*O"
allows to only retrieve two rings connected to each other and with Cl atoms bounded with the C-ring atoms in different configurations.
Figure 2: Search query results in MQS Dashboard
Tip: If you are not satisfied with a search because a SMILES string gives you for example only a single atom as the top result or the exact match does not show up at the top, then try to make use of double quotation marks enclosing your SMILES string you are trying to search:
"CC(=O)OC1=CC=CC=C1C(=O)O"
instead of searching for the SMILES without quotation marks:
CC(=O)OC1=CC=CC=C1C(=O)O
try it yourself and see how the results differ.
MQS Database: Currently available data sets¶
The current beta release of the MQS database holds the PubChemQC PM6 and QMugs data sets where the PubchemQC PM6 calculation pipeline has been based on the Pubchem database to screen all available molecular structures whereas the QMugs pipeline makes use of the ChEMBL database.
We refer to both publications of the individual data sets to understand how the data has been generated and what kind of valuable data they hold. Here a comparison table between PubchemQC PM6 and QMugs:
PubchemQC PM6 | QMugs | |
---|---|---|
Data mining source | Pubchem database | ChEMBL database |
Semi-empirical method | PM6 | GFN2-xTB |
Software applied | NWChem | RDKit, xtb, PSI4 |
Availability of conformers data | No | Yes |
Availability of different electronic states | Yes | No |
Number of molecular structures | 221.190.415 | 685.917 |
The Search UI & API allow to search through both data sets at the same time via the following identifiers:
One can also search through each individual data set and zoom further into the schema of the data sets to evaluate what can be searched for.
The PubchemQC PM6 data set holds the following data:
pubchemqc_pm6:
[
{
state: string
vibrations: {
frequencies: [number]
intensities: {
IR: [number]
}
}
atoms: {
elements: {
atom_count: number
number: [number]
heavy_atom_count: number
}
coords: {
3d: [number]
}
core_electrons: [number]
}
properties: {
number_of_atoms: number
temperature: number
total_dipole_moment: number
multiplicity: number
energy: {
alpha: {
homo: number
gap: number
lumo: number
}
beta: {
homo: number
gap: number
lumo: number
}
total: number
}
orbitals: {
energies: [[number]]
basis_number: number
MO_number: number
homos: [number]
}
enthalpy: number
charge: number
partial_charges: {
mulliken: [number]
}
}
openbabel_inchi: string
openbabel_canonical_smiles: string
}
]
The QMugs data set consists of:
qmugs:
[
{
conf: string
atom_count: number
heavy_atom_count: number
hetero_atom_count: number
rotabable_bounds: number
stereocenters: number
rings: number
hbond_acceptors: number
hbond_donors: number
significant_negative_wavenumbers: boolean
nonunique_smiles: boolean
GFN2_TOTAL_ENERGY: number
GFN2_ATOMIC_ENERGY: number
GFN2_FORMATION_ENERGY: number
GFN2_TOTAL_ENTHALPY: number
GFN2_TOTAL_FREE_ENERGY: number
GFN2_DIPOLE_X: number
GFN2_DIPOLE_Y: number
GFN2_DIPOLE_Z: number
GFN2_DIPOLE_TOT: number
GFN2_QUADRUPOLE_XX: number
GFN2_QUADRUPOLE_XY: number
GFN2_QUADRUPOLE_YY: number
GFN2_QUADRUPOLE_XZ: number
GFN2_QUADRUPOLE_yz: number
GFN2_QUADRUPOLE_ZZ: number
GFN2_ROT_CONSTANT_A: number
GFN2_ROT_CONSTANT_B: number
GFN2_ROT_CONSTANT_C: number
GFN2_ENTHALPY_VIB: number
GFN2_ENTHALPY_ROT: number
GFN2_ENTHALPY_TRANSL: number
GFN2_ENTHALPY_TOT: number
GFN2_HEAT_CAPACITY_VIB: number
GFN2_HEAT_CAPACITY_ROT: number
GFN2_HEAT_CAPACITY_TRANSL: number
GFN2_HEAT_CAPACITY_TOT: number
GFN2_ENTROPY_VIB: number
GFN2_ENTROPY_ROT: number
GFN2_ENTROPY_TRANSL: number
GFN2_ENTROPY_TOT: number
GFN2_HOMO_ENERGY: number
GFN2_LUMO_ENERGY: number
GFN2_HOMO_LUMO_GAP: number
GFN2_FERMI_LEVEL: number
GFN2_DISPERSION_COEFFICIENT_MOLECULAR: number
GFN2_POLARIZABILITY_MOLECULAR: number
DFT_TOTAL_ENERGY: number
DFT_ATOMIC_ENERGY: number
DFT_FORMATION_ENERGY: number
DFT_DIPOLE_X: number
DFT_DIPOLE_Y: number
DFT_DIPOLE_Z: number
DFT_DIPOLE_TOT: number
DFT_ROT_CONSTANT_A: number
DFT_ROT_CONSTANT_B: number
DFT_ROT_CONSTANT_C: number
DFT_XC_ENERGY: number
DFT_NUCLEAR_REPULSION_ENERGY: number
DFT_ONE_ELECTRON_ENERGY: number
DFT_TWO_ELECTRON_ENERGY: number
DFT_HOMO_ENERGY: number
DFT_LUMO_ENERGY: number
DFT_HOMO_LUMO_GAP: number
}
]
See also these tables from the QMugs paper to get an overview of the data:
https://www.nature.com/articles/s41597-022-01390-7/tables/3
https://www.nature.com/articles/s41597-022-01390-7/tables/4
The user interface of the MQS Dashboard provides an overview of the QMugs data if available for the specific molecule:
Figure 3: Tabs and table view for QMugs data in the MQS Dashboard
Sometimes you will not find the possibility to select the QMugs data in the dropdown because only PubchemQC PM6 data is available for this specific molecule. Or vice versa.
The available data view for the PubchemQC PM6 dataset can be seen in the following figure:
Figure 4: Tabs and table view for PubchemQC PM6 data in the MQS Dashboard
Example: Get HOMO-LUMO gap data from PubchemQC and QMugs¶
HOMO-LUMO gap data from the PubchemQC and QMugs data set
Similar to PubchemQC PM6, QMugs holds HOMO-LUMO gap data calculated via the gfn2 method and with DFT. In the following code snippet you can see how to retrieve HOMO-LUMO gap data from both data sets. QMugs contains property data for different conformers (local minima on the potential energy surface of the molecule) whereas PubchemQC PM6 holds data for different states.
smiles_list = []
homo_lumo_gap_list_pubchemqc = []
homo_lumo_gap_list_qmugs_gfn2 = []
homo_lumo_gap_list_qmugs_dft = []
for i in range(len(molecule_id_set)):
smiles = data_set[i]['isomeric_smiles']
print("smiles:", smiles)
smiles_list.append(smiles)
try:
homo_lumo_gap_pubchemqc = data_set[i]["pubchemqc_pm6"][0]["properties"]["energy"]["alpha"]["gap"]
homo_lumo_gap_list_pubchemqc.append(homo_lumo_gap_pubchemqc)
except: print("No pubchemqc pm6 data available.")
try:
homo_lumo_gap_qmugs_gfn2 = data_set[i]["qmugs"]["confs"][0]["GFN2_HOMO_LUMO_gap"]
homo_lumo_gap_qmugs_dft = data_set[i]["qmugs"]["confs"][0]["DFT_HOMO_LUMO_gap"]
homo_lumo_gap_list_qmugs_gfn2.append(homo_lumo_gap_qmugs_gfn2)
homo_lumo_gap_list_qmugs_dft.append(homo_lumo_gap_qmugs_dft)
except: print("No qmugs data available.")
print("homo_lumo_gap_list_pubchemqc:", homo_lumo_gap_list_pubchemqc)
print("homo_lumo_gap_list_qmugs_gfn2:", homo_lumo_gap_list_qmugs_gfn2)
print("homo_lumo_gap_list_qmugs_dft:", homo_lumo_gap_list_qmugs_dft)
PubchemQC PM6 holds HOMO-LUMO gap data for different molecule states (anion, cation, singlet state, triplet state) which one can select after having selected the "pubchemqc_pm6" data set. It is important to be aware of that not all molecules hold data for every state. In the original PubchemQC PM6 schema this is reflected by the available entries which can be selected via [0], [1], [2], [3]. But these are not fixed to the individual states. Some molecules might only have entries for [0] and [1], where [0] would hold the singlet state data but [1] could hold any other data (anion or cation or triplet state). In the MQS database we have fixed the states data to non-numeric entries (['singlet state'], ['anion'], ['cation'], ['triplet state']).
The ChEMBL and SureChEMBL identifiers combined with the retrieval of wavefunction files in the QMugs data set¶
The ChEMBL identifier has been added to the search IDs which can be used for screening compounds in the MQS database via the search query field in the dashboard or the API. The current supported ID set for searching through the database is summarized in the following:
For more information about the ChemBL identifier, we recommend the following blog article: http://chembl.blogspot.com/2011/08/chembl-identifiers.html
One has to be careful when searching via for example "chembl" because one will also receive results where molecules in their synonyms list have a SureCHEMBL (SCHEMBL) identifier. Therefore it is best to search directly with chembl_id in the search string: "chembl_id: chembl"
Each molecule in the QMugs dataset has up to three conformer structures and is linked to one ChEMBL identifier in the database. To understand the terms "conformation" and "conformer" we quote from the QMugs paper:
"In chemical terminology, the term “conformation” refers to any arrangement of atoms in space, whereas “conformer” refers to a conformation that is a local minimum on the potential energy surface of the molecule." (https://doi.org/10.1351/pac199668122193)
Here an example to retrieve the total energy [E_h] data from the QMugs data set with the ChEMBL and SChEMBL identifiers:
chembl_list = []
total_energy_list_qmugs = []
for i in range(len(molecule_id_set)):
chembl = data_set[i]['chembl_id']
print("chembl", chembl)
chembl_list.append(smiles)
try:
total_energy_qmugs = data_set[i]["qmugs"]["confs"][0]["GFN2_TOTAL_ENERGY"]
total_energy_qmugs.append(total_energy_qmugs)
except: print("No qmugs data available for", chembl, ".")
print("total_energy_list_qmugs:", total_energy_list_qmugs)
E_h is the total energy in units of Hartree and can be converted to other energy units. Here a conversion overview provided by the National Chiao Tung University (NCTU) in Taiwan which was merged with the National Yang-Ming University to the National Yang Ming Chiao Tung University (NYCU): http://wild.life.nctu.edu.tw/class/common/energy-unit-conv-table-detail.html
The QMugs data set also holds wavefunction related information and are stored as .tar.gz files. In the Dashboard one can download the wavefunction files by clicking on the respective button within the selected tab view:
Figure 5: Download button provided in the MQS Dashboard UI for wavefunctions from the QMugs dataset.
The following three steps can be followed to download the wave function files via the MQS API:
-
Search for a compound and retrieve the compound ID
-
Look up the download URL for the specific compound ID and
-
Download the wavefunction .tar.gz file with the URL, unpack it and look at the content of the file directly in your terminal
The following Python code snippets implement these three steps.
Step 1:
import requests
credentials = {
"username": "YOUR_EMAIL",
"password": "YOUR_PW"
}
api_url = "https://api.mqs.dk"
token = requests.post(f"{api_url}/token", data=credentials).json()["access_token"]
headers = {"Accept": "application/json", "Authorization": f"Bearer {token}"}
query = "isomeric_smiles:CC"
results = requests.get(f"{api_url}/search", params={"q": query}, headers=headers).json()
print("results:", results)
#Use the ID from the last search result to get the molecule's complete datasheet
id = results["response"]["docs"][0]["id"]
print("id:", id)
Step 2:
data = requests.get(f"{api_url}/compound/{id}/datasets/qmugs/wavefunctions_url", headers=headers).json()
download_url = data["url"]
print("download_url:", download_url)
Step 3:
#The urlretrieve function has to be imported: from urllib.request import urlretrieve
filename = str(id) + '_wavefunction.tar.gz'
urlretrieve(download_url, filename)
#The tarfile class has to be imported: import tarfile
file = tarfile.open(filename)
folder_path = "./"
file.extractall(folder_path)
file.close()
# os needs to be imported: import os
for filename in os.listdir(folder_path):
f = os.path.join(directory, filename)
wavefunctions_data = numpy.load(f)
print(wavefunctions_data)
The properties which can be retrieved from the wave function files are listed in Table 3 of the QMugs paper:
https://www.nature.com/articles/s41597-022-01390-7/tables/4
HOMO-LUMO gap, torsional angles, toxicity of PCBs
Over the last decades researchers have developed quantum chemistry models ranging from ab-initio methods such as density functional theory (DFT) to semi-empirical methods such as PMx and GNFx-xTB. These methods can be used to calculate quantum properties of molecules. By using these methods to perform high-throughput screening over hundred thousands to millions of molecules, the results of these calculations have been collected into various datasets in recent years, for example the PubChemQC PM6 1 which mined data from PubChem or QMugs based on the ChEMBL database 2. This blog article presents how these quantum chemistry datasets can be leveraged in combination with machine learning (ML) methods to retrieve valuable insights.
Aryl hydrocarbon receptor (AhR, dioxin receptor)
The Aryl Hydrocarbon Receptor (AhR), commonly referred to as the dioxin receptor, is a vital protein encoded by the AHR gene in humans. It functions as a transcription factor, playing a crucial role in regulating gene expression. AhR belongs to the PAS (Per-Arnt-Sim) superfamily of receptors, which are responsible for responding to various environmental stresses like hypoxia and circadian rhythm. Moreover, they also exert control over fundamental physiological processes such as vascular development, learning, and neurogenesis, highlighting the receptor's multifaceted functions 3.
AhR exhibits a remarkable capacity to integrate diverse signals from the environment, diet, microbiota, and metabolism to orchestrate complex transcriptional programs 4.
This integration occurs in a manner that is specific to the ligand, cell type, and overall context. Consequently, AhR's regulatory activities are highly versatile and tailored to specific biological contexts. However, studying the effects of AhR agonists and other chemicals has traditionally relied on assays involving biomaterials, a process that can be time-consuming with in-vitro experiments.
To better understand the intricate workings of AhR and assess the toxicity of AhR agonists and other chemicals, researchers are actively exploring alternative methods that reduce the need of animal experiments and decrease research duration on biomaterial-based assays 5 6 7 8.
These advances have the potential to accelerate toxicological research, facilitating the development of safer and more environmentally friendly chemical compounds.
Polychlorinated biphenyls (PCBs)
Figure 6: 3D model of a polychlorinated biphenyl.
Polychlorinated biphenyls (PCBs; Figure 1) encompass a collection of synthetic organic chemicals that were once extensively utilized in both industrial and consumer products. However, their ban in the 1970s stemmed from their harmful effects on health and their ability to persist in the environment. The AhR, a protein found in various cell types throughout the body, plays a pivotal role in the metabolism of numerous drugs and environmental toxins, including PCBs. Activation of the AhR by PCBs initiates alterations in gene expression, which can elicit a broad range of physiological effects.
The toxicity of PCBs is contingent upon the arrangement of chlorine atoms substituted on the phenyl rings 5. Within this group, there exists a subset of 12 PCBs called non-ortho- or mono-ortho-chlorinated PCBs, which exhibit AhR-related activities akin to those observed in dioxins.
Toxicity in similar molecules can be inferred by examining the energy gap between the highest occupied molecular orbital (HOMO) and the lowest unoccupied orbital (LUMO) 9.
A larger energy gap implies greater stability of the molecule concerning reactions with biomolecules.
The relative stability of highly chlorinated PCBs, coupled with their increased lipophilicity, leads to their widespread distribution across diverse environments, followed by prolonged accumulation within wildlife and human populations. Despite previous extensive studies, accurately determining the toxicity of PCBs remains a challenge. Consequently, PCBs serve as ideal candidates for testing models aimed at assessing the toxic and bioaccumulative behaviours of AhR agonists.
HOMO-LUMO gap as a toxicity indicator
Molecular orbitals, such as HOMO and LUMO, represent wave functions that describe the distribution of electrons within a molecule. Composed of atomic components derived from constituent atoms, the HOMO and LUMO exhibit distinct energy levels. The energy difference between these two orbitals is referred to as the HOMO-LUMO gap. The HOMO-LUMO gap is expressed in electronvolts (eV), a unit of energy commonly used to quantify small energy amounts. The magnitude of the energy gap can vary among different molecules. A small gap, below a certain threshold, is considered a necessary condition for molecules to possess significant potency.
The assessment of the HOMO-LUMO energy gap allows to determine the toxicity of PCBs. Below the threshold of 4.87 eV are the 12 dioxin-like PCBs, while above the threshold are non-dioxins. In the context of PCBs, the potency of AhR-mediated effects serves as a distinguishing factor between dioxin-like and non-dioxin-like molecules. However, it is important to note that the assessment of toxicity within the dioxin-like subgroup requires consideration of additional factors. While the HOMO-LUMO energy gap provides an overarching indicator of PCBs' toxicity, it does not provide the sole determinant 5.
Torsional angle
The torsional angle (Figure 2) refers to the spatial configuration of atoms in a molecule, specifically the angle formed by the rotation of one group of atoms around a bond axis with respect to another group. The torsional angle, which is crucial in determining the overall shape and stability of a molecule, varies among different molecules depending on their structural arrangement, and play a significant role in understanding and predicting the behaviour of a molecule, as the magnitude of these angles directly affects molecular properties such as conformational stability, reactivity, and intermolecular interactions. Torsional angles arise due to the presence of rotatable bonds and can be quantified using the degree of rotation.
The torsional angle in a PCB molecule can impact their interaction with the aryl hydrocarbon receptor (AhR). The conformational change caused by the torsional angle can modulate the binding affinity and activation of AhR by PCBs, ultimately influencing their toxicological effects 5.
Figure 7: Torsional angle of a PCB molecule
PubchemQC PM6¶
The PubChemQC PM6 dataset is a collection of electronic properties of 221 million molecules with optimized molecular geometries 1. The electronic properties were calculated using the PM6 method. The dataset covers 94.0% of the original PubChem dataset with respect to neutral molecules, making it a valuable resource for researchers in the field. How to access the dataset via the MQS Search UI and API can be read up on in the first Search API tutorial: MQS Search API Part 1 Tutorial.
QMugs dataset¶
The QMugs (Quantum-Mechanical Properties of Drug-like Molecules) dataset contains optimized molecular geometries calculated via density functional theory (DFT), thermodynamic data calculated with the semi-empirical method GFN2-xTB, atomic and molecular properties calculated with both DFT (ω B97X-D/def2-SVP) and GFN2-xTB.
The collection of QMugs consists of over 665,000 molecules that are biologically and pharmacologically significant. These molecules have been obtained from the ChEMBL database 8.
The dataset presented has multiple potential applications as highlighted in the QMugs research paper 8:
- It can offer researchers a larger dataset that can be utilized to predict quantum chemical properties or establish correlations between the GFN2-xTB and ωB97X-D/def2-SVP levels of theory.
- It can aid in the advancement of machine learning techniques for generating molecular conformations and predicting molecular properties.
- It can support the development of deep learning frameworks that can predict quantum mechanical wave functions in atomic orbitals.
- It can facilitate research on quantum featurization in the context of pharmacologically relevant biological data.
The QMugs dataset features DFT-level properties and is found to have larger molecular samples and a larger number of distinct molecules than the QM9 (http://quantum-machine.org/datasets/) and ANI-1 (https://github.com/isayev/ANI1_dataset) datasets. It provides multiple conformers per molecule, enabling the training of QML models that can differentiate between molecular constitution and conformation. QMugs also provides a wide range of properties on multiple levels of theory and includes mostly compounds not previously reported in other DFT data collections.
Important to acknowledge is that the QMugs dataset holds a very limited number of PCBs compared to the PubchemQC PM6 dataset and has therefore implications for the machine learning model which is going to be presented in the next section. The Search API tutorial part 2 also holds some more information about the QMugs dataset: https://blog.mqs.dk/posts/getting_started_search_api_part2/
DelFTa toolbox¶
The DelFTa toolbox 9 is an open-source model that has been trained with the QMugs dataset to predict quantum-mechanical properties of drug-like molecules. The toolbox uses PyTorch modules and libraries, the Open Babel library, Numpy and JSON to define and run various models and tasks. The application provides a command-line interface in order to predict molecular properties from a SMILES string or a molecule file. The output results are provided either as JSON or CSV files.
DelFTa can predict a wide array of quantum observables on the molecular, atomic, and bond levels. Some of the properties that DelFTa can predict are:
- Formation energies: The energy required to form a molecule from its constituent atoms.
- Energies of HOMO and LUMO: Energies of the highest occupied and lowest unoccupied molecular orbital. These are important for understanding the electronic structure of molecules.
- HOMO/LUMO gaps: The energy difference between the HOMO and LUMO orbitals.
- Dipole moments: A measure of the separation of positive and negative charges in a molecule.
- Mulliken partial charges: A way to partition the electron density in a molecule among its constituent atoms.
- Wiberg bond orders: A measure of the strength of covalent and non-covalent chemical bonds.
The DelFTa toolbox applies an equivariant graph neural network with a message-passing layer.
Graph Neural Networks (GNNs) are neural networks specifically designed to operate on graph-structured data. Equivariant Graph Neural Networks (EGNNs) are an extension of GNNs that aim to capture and preserve symmetries present in the input graph data applying equivariant operations. In the context of graphs, symmetry refers to the invariance of the graph structure and its features under transformations such as rotations, translations, or permutations of the nodes. This can lead to several advantages, including better generalization to unseen data, improved robustness to noise and perturbations, and more efficient learning by leveraging the inherent structure and patterns of the data.
A message-passing layer facilitates information exchange and aggregation between connected nodes, enabling nodes to update their representations based on messages received from neighboring nodes. This enabling effective modeling of relational structures and improving the network's ability to learn from graph-structured data.
The current version of the DelFTa toolbox only supports Python 3.8.X and 3.9.X versions of Linux builds.
The following points summarize the recommended installation steps for a conda environment:
1) The DelFTa toolbox can be cloned from its offical GitHub repository.
2) Create the Python environment with required installations using the environment.yml
file.
This can be achieved by using conda env create -f environment.yml
in the downloaded repository directory.
After submitting the command, a conda environment named delfta
will be created. The environment can be activated by using the conda activate delfta
command.
3) The DelFTa toolbox can also be installed on a prepared environment by running the python setup.py install --user
command in the DelFTa repo directory.
4) To get required files that include trained DelFTa model* and XTB utils, run: `python -c "from delfta.download import _download_required; _download_required()".
*To download additional models which were trained on different training set sizes, follow this download link provided by ETH Zurich.
In the following we are now going to compare the HOMO-LUMO gap data obtained from PubchemQC PM6, QMugs and via the DelFTa toolbox.
Example: HOMO-LUMO gap, torsional angles, toxicity of poly-chlorinated biphenyls¶
Get HOMO-LUMO gap data by using the MQS Search API for PCBs and fine-tune DelFTa model
We are going to use the MQS Search API to get HOMO-LUMO gap data of PCBs stored in the PubchemQC PM6 dataset.
We will start with getting our token to be able to reach the database via the Search API.
On the molecule search page we provide a documentation of the Search API.
Now you can prepare a search query to search for the PCBs:
query = "PCB AND C12* NOT formula:*S NOT formula:*N* NOT formula:*O"
data = requests.get(f"{api_url}/search", params={"q": query}, headers=headers).json()
We are storing the SMILES identifiers of the PCBs and make use of the SMILES in the next section of this tutorial:
for pcb in data["response"]["docs"]:
# Molecule ID:
pcb_id = pcb["id"]
# SMILES code of molecule:
SMILES = pcb["isomeric_smiles"]
pcb_props = requests.get(f"{api_url}/compound/{pcb["id"]}", headers=headers).json()
To select the HOMO-LUMO gap, we choose S0
as the molecule state and the alpha
register as the energy property. The 3d coordinates information is also being stored in a new array for later use:
for pcb_state in pcb_props["pubchemqc_pm6"]:
if pcb_state["state"] == "S0":
# Get HOMO−LUMO gap:
HOMO_LUMO_Gap = pcb_state["properties"]["energy"]["alpha"]["gap"]
# Get 3D geometry with shape of (<Number_of_Atoms>, 3):
coords = np.array(pcb_state["atoms"]["coords"]["3d"]).reshape(-1, 3)
We have now accessed specific data from QMugs and this is an example what the data for one PCB holds:
- PCB Name:
2,3,3’,5-Tetrachlorobiphenyl
- SMILES Code:
C1=CC(=CC(=C1)Cl)C2=CC(=CC(=C2Cl)Cl)Cl
- HOMO−LUMO Gap Value:
6.574814855781
- 3D Atom Coordinates:
Array(shape:(Number_of_Atoms, 3))
Predicting the HOMO-LUMO gap by using DelFTa
Now we are going to predict the HOMO-LUMO gap value of a PCB by applying the DelFTa toolbox with the SMILES strings and the 3D atom coordinates data that we have collected.
Create an Open Babel Molecule Object from the SMILES string of the PCB and add hydrogen atoms:
from openbabel.pybel import readstring
mol = readstring("smi", SMILES)
mol.addh() # Adds Hydrogen atoms.
Pass 3D coordinates of atoms to the molecule object:
Predict the HOMO-LUMO gap via DelFTa:
from delfta.calculator import DelftaCalculator
# Prepare DelFTa Calculator:
calc = DelftaCalculator(addh=False)
# Make predictions:
preds = calc.predict(mol)
The DelFTa prediction function will return a Python dictionary of the predicted values:
{'wbo':
[{'0-1': 1.4635928285108784, '0-3': 0.11247402231651216, '0-5': 1.4651485560975337, '0-16':
0.9307198423872055, '1-2': 1.416853712718369, '1-17': 0.931873304102257, '2-3':
1.410260321460608, '2-5': 0.10810251813581974, '2-7': 1.0212926424278403, '3-4':
1.447762294545536, '3-18': 0.9264065040542494, '4-5': 1.4439071195028848, '4-6':
1.2419898643430511, '5-19': 0.9298978451088848, '7-8': 1.4107464260690588, '7-12':
1.3855066583460698, '8-9': 1.4394256891469703, '8-20': 0.924661022763001, '9-10':
1.4439716051888596, '9-15': 1.251288105013964, '10-11': 1.4389150192484086, '10-21':
0.9227660093041355, '11-12': 1.4176788092449024, '11-14': 1.2569783603366196, '12-13':
1.2456382012974325}],
'E_form':
array([-4.0662913], dtype=float32),
'charges':
[array([-0.01280808, -0.00144799, -0.01410628, 0.0281181 , 0.03468447, 0.0072453 ,
-0.08946717, -0.00093684, 0.00319889, 0.04281836, 0.0217519 , 0.0199199 , 0.03606375,
-0.03929779, -0.05472087, -0.07673271, 0.02948804, 0.01668386, 0.01211333, 0.01979109,
0.0190095 , 0.02159348])],
'E_homo':
array([-0.32399112], dtype=float32),
'E_lumo':
array([0.00836667], dtype=float32),
'E_gap':
array([0.33245534], dtype=float32),
'dipole':
array([1.9053881], dtype=float32)}
Convert the predicted HOMO-LUMO gap value from Hartree to Electronvolt (eV) unit (DelFTa returns HOMO-LUMO gap value as Hartree in a dictionary so we need to multiply by Hartree-eV constants):
from scipy.constants import physical_constants
hartree_to_ev = physical_constants['Hartree energy in eV'][0]
HOMO_LUMO_Gap = preds["E_gap"].item() * hartree_to_ev
Calculation of the torsional angle
To calculate the torsional angle of a PCB, we need to apply the following steps:
Create a molecule object from the SMILES string with RDKit:
Create a conformer object and pass atom coordinates to the object:
# RDKit Conformer Object:
conf = Chem.Conformer(mol.GetNumAtoms())
# Pass coordinate information of atoms to molecule conformation:
for indx, coord in enumerate(PCBs[pcb_name]["3DCoords"]):
conf.SetAtomPosition(indx, coord)
To find the torsional angle between two benzene rings of a PCB, we need to first locate the benzene rings in the molecule and determine the bond that connects the two rings. Then we need to identify the bond atoms consisting of two atoms and a neighbor atom on a ring for each bond atom. Finally, we have now defined the bond atoms between the rings and the benzene ring atoms as the two planes to calculate the torsional angle.
# Create a two benzene template:
query = Chem.MolFromSmarts('c1ccccc1-c1ccccc1')
# Find atom indices of planes using template:
atom1, atom2, atom3, atom4 = mol.GetSubstructMatches(query)[0][4:8]
As an example atom1
, atom2
, atom3
, atom4
variables will be equal to atom indices 2, 1, 9, 10 on Figure: Benzene bridge of hexachlorobiphenyl.
from rdkit.Chem import rdMolTransforms
# Calculate Torsional angle between
# atom1, atom2 vector plane and atom3, atom4 vector plane:
angle = rdMolTransforms.GetDihedralDeg(conf, atom1, atom2, atom3, atom4)
As an example, angle
variable will be equal to 59.2896 degrees for the PCB molecule of Hexachlorobiphenyl.
Figure 9: Benzene bridge of Hexachlorobiphenyl
Counting PCB Ortho Configurations
Locating the number of ortho configurations of a PCB we need to match ortho substructures:
# Create RDKit Molecule Object using SMILES code of the PCB:
mol = Chem.MolFromSmiles(SMILES)
# Prepare ortho template:
ortho_template = "[*]c1ccc([*])c([*])c1"
query = Chem.MolFromSmarts(ortho_template)
# Find and count ortho groups:
ortho_groups = mol.GetSubstructMatches(query)
num_orthos = len(ortho_groups)
As an example, the num_orthos
variable (number of orthos) will be equal to 8 for the PCB molecule of Hexachlorobiphenyl.
Next we will call molecules without any ortho group non-ortho structures , molecules that only have one ortho group will be defined as mono-ortho PCBs, and molecules that have two or more ortho group will belong to the class of multi-ortho PCBs.
Comparison and Results
The comparison of HOMO-LUMO gap values from the PubchemQC PM6 dataset and DelFTa predictions, we prepared a figure with 100 different PCBs (Figure 5) and their respective HOMO-LUMO gap values.
Figure 10: Comparison of HOMO-LUMO gap calculations.
On this figure the x-axis represents different PCB molecules, y-axis represents HOMO-LUMO gap values as electronvolt (eV). Red dots represents DelFTa prediction and green dots represents PubchemQC PM6 dataset values of HOMO-LUMO gap of PCB projected on x-axis. Dark black lines between red and green dots represent the difference of HOMO-LUMO gap values of PubchemQC PM6 dataset and DelFTa predictions.
We did also plot HOMO-LUMO gap values of the PubchemQC PM6 dataset and DelFTa predictions for a HOMO-LUMO gap versus torsional angle representation (Figure 6 and 7) which allow to visualize the toxicity threshold together with torsional angle and the HOMO-LUMO gap values. In this way one can identify if a molecule lies below the toxicity threshold or above.
Figure 11: HOMO-LUMO gap - Torsional angle representation of PubchemQC PM6 PCBs with their ortho numbers.
Figure 12: HOMO-LUMO Gap - Torsional angle representation of DelFTa predicted PCBs with their ortho numbers.
On these figures the radius of a quarter circle represents the HOMO-LUMO gap values and the angle of a point to the (0,0) coordinate represents the torsional angle. Blue circles represent multi-ortho, yellow triangles represents mono-ortho, and red squares represent non-ortho PCBs. The small quarter circle represents the Dioxin threshold which is equal to 4.87 eV.
Figure 13: HOMO-LUMO gap outlier molecules with PCB indices.
Figure 8 depicts all outlier molecules and together with Table 1 we can differentiate between the outliers and see that most of the outliers are multi-ortho PCBs (9 out of 14), followed by non-ortho (3 out of 14) and mono-ortho (2 out of 14) PCBs. The outliers have been ordered by declining HOMO-LUMO gap difference values.
The highest HOMO-LUMO gap difference (3.1681) is observed for the multi-ortho PCB (PCB Index = 62) with a torsional angle of -90.12 degrees. This is the only PCB in the outlier set with a torsional angle around ±90 degrees. Most outlier PCBs exhibit torsional angles of around ±120 degrees (8 out of 14), followed by torsional angles of ±60 degrees (5 out of 14).
This analysis is valuable for re-training or fine-tuning purposes of the DelFTa model where more PCB structures with multi-ortho configuration and with a torsional angle of ±120 degrees could be added to the training set.
Table 1: HOMO-LUMO gap difference analysis between Delfta and PubchemQC PM6
PCB Index | HOMO-LUMO gap difference | Torsional angle | Ortho-configurations |
---|---|---|---|
62 | 3.1681 | -90.12 | multi-ortho |
79 | 2.9972 | -120.35 | multi-ortho |
30 | 2.9387 | 58.78 | non-ortho |
31 | 2.9309 | -121.09 | non-ortho |
44 | 2.8868 | 59.52 | mono-ortho |
94 | 2.4335 | 59.80 | multi-ortho |
51 | 2.4069 | -120.34 | multi-ortho |
15 | 2.2279 | 120.77 | multi-ortho |
80 | 2.1173 | -120.86 | non-ortho |
45 | 1.6816 | 59.04 | mono-ortho |
95 | 1.6462 | 59.65 | multi-ortho |
8 | 1.3525 | -120.77 | multi-ortho |
56 | 1.2610 | -120.20 | multi-ortho |
49 | 1.0624 | -120.25 | multi-ortho |
-
Maho Nakata, Tomomi Shimazaki, Masatomo Hashimoto, and Toshiyuki Maeda. Pubchemqc pm6: data sets of 221 million molecules with optimized molecular geometries and electronic properties. Journal of Chemical Information and Modeling, 60(12):5891–5899, 2020. doi:10.1021/acs.jcim.0c00740. ↩↩
-
Clemens Isert, Kenneth Atz, José Jiménez-Luna, and Gisbert Schneider. Qmugs, quantum mechanical properties of drug-like molecules. Scientific Data, 9(1):273, 2022. doi:10.1038/s41597-022-01390-7. ↩
-
J. Marlowe and A. Puga. 2.07 - novel ahr interactions. In Charlene A. McQueen, editor, Comprehensive Toxicology (Second Edition), pages 93–115. Elsevier, Oxford, second edition edition, 2010. doi:https://doi.org/10.1016/B978-0-08-046884-6.00207-4. ↩
-
Veit Rothhammer and Francisco J Quintana. The aryl hydrocarbon receptor: an environmental sensor integrating immune responses in health and disease. Nature Reviews. Immunology, 19(3):184–197, 2019. doi:10.1038/s41577-019-0125-8. ↩
-
Forrest J, Bazylewski P, Bauer R, Hong S, Kim CY, Giesy JP, Khim JS, and Chang GS. A comprehensive model for chemical bioavailability and toxicity of organic chemicals based on first principles. Front. Mar. Sci. 1:31, 2014. doi:10.3389/fmars.2014.00031. ↩↩↩↩
-
S Safe. Toxicology, structure-function relationship, and human and environmental health impacts of polychlorinated biphenyls: progress and problems. Environmental Health Perspectives, 100():259–268, 1993. doi:10.1289/ehp.93100259. ↩
-
Weihua Yang, Xiaohua Liu, Hongling Liu, Yang Wu, John P. Giesy, and Hongxia Yu. Molecular docking and comparative molecular similarity indices analysis of estrogenicity of polybrominated diphenyl ethers and their analogues. Environmental Toxicology and Chemistry, 29(3):660–668, 2010. doi:https://doi.org/10.1002/etc.70. ↩
-
Weihua Yang, Yunsong Mu, John P. Giesy, Aiqian Zhang, and Hongxia Yu. Anti-androgen activity of polybrominated diphenyl ethers determined by comparative molecular similarity indices and molecular docking. Chemosphere, 75(9):1159–1164, 2009. doi:https://doi.org/10.1016/j.chemosphere.2009.02.047. ↩↩↩
-
Mary M. Lynam, Michal Kuty, Jiri Damborsky, Jaroslav Koca, and Peter Adriaens. Molecular orbital calculations to describe microbial reductive dechlorination of polychlorinated dioxins. Environmental Toxicology and Chemistry, 17(6):988–997, 1998. doi:https://doi.org/10.1002/etc.5620170603. ↩↩