Descriptors

Molecular descriptors are n-dimensional numerical representations of molecules that can be used to train regression or classification models.

The Cebule SDK can generate highly informative descriptors from first-principles calculations. Training machine learning models on such descriptors is also termed physics-informed machine learning.

Unlike other descriptor generation software such as Dragon or mordred, Cebule derives its set of high-quality descriptors from quantum chemistry and molecular dynamics calculations run on high-performance compute infrastructure.

This makes it possible to train modern machine learning models, such as graph neural networks, and obtain state-of-the-art predictive performance.

The following sections give an overview of the descriptors which can be generated via Cebule.

Atom order

Contains the ordered list of atoms in the molecule, derived from the SMILES string and sorted by atomic number. This list defines the canonical atom ordering used by the other descriptors and tasks in Cebule: geometry, group contribution, and forces. The atom_order descriptor is therefore important for keeping atom indices consistent across descriptors.

This descriptor dictionary can be generated via the Cebule SDK TaskType: ATOM_ORDER task.

Geometry

Contains the 3D coordinates, in Angstrom, of each atom in the molecule after geometry optimization. The optimization sequence is:

  • Initial force field optimization
  • Semi-empirical/MLIP optimization

To identify which atom in the molecule corresponds to each XYZ coordinate in the geometry list, refer to the atom order (see the atom order descriptor).
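As a minimal sketch of how the atom order and geometry lists line up, the snippet below zips the two together and writes a standard XYZ text block. The variable names, the list shapes, and the water coordinates are illustrative assumptions, not the documented Cebule output schema.

```python
def to_xyz(atom_order, geometry, comment=""):
    """Pair element symbols with coordinates (Angstrom) and format them as XYZ text.

    Assumes atom_order is a list of element symbols and geometry is a list of
    [x, y, z] triples in the same order (hypothetical shapes).
    """
    if len(atom_order) != len(geometry):
        raise ValueError("atom_order and geometry must have the same length")
    lines = [str(len(atom_order)), comment]
    for symbol, (x, y, z) in zip(atom_order, geometry):
        lines.append(f"{symbol:<2} {x:12.6f} {y:12.6f} {z:12.6f}")
    return "\n".join(lines)


# Illustrative values only; real descriptors come from the ATOM_ORDER and GEOMETRY_OPT tasks.
atom_order = ["O", "H", "H"]
geometry = [
    [0.000, 0.000, 0.117],
    [0.000, 0.757, -0.469],
    [0.000, -0.757, -0.469],
]

xyz_text = to_xyz(atom_order, geometry, comment="water, illustrative coordinates")
print(xyz_text)
```

The length check matters in practice: if the two lists come from tasks run on different inputs, a silent mismatch would assign coordinates to the wrong atoms.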

This descriptor dictionary can be generated via the Cebule SDK TaskType: GEOMETRY_OPT task.

Group contribution

Contains UNIFAC group contribution descriptors. These parameters describe the molecule in terms of functional groups:

  • Counts of each group in the molecule, e.g. {"CH3": 1, "CH2": 1}.
  • Atom indices corresponding to each instance of a functional group, e.g. {"CH3": [[3]], "CH2": [[4]]}. To find which atom in the molecule sits at each index, refer to the atom order.
  • Parameters of each group in the molecules: the functional group volume (\(R\)) and surface area (\(Q\)).
  • Group interaction parameters \(\Psi_{nm}\) describing the interaction between functional groups \(n\) and \(m\) (the interaction is asymmetric, so in general \(\Psi_{nm} \neq \Psi_{mn}\)).
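To illustrate how these pieces are typically combined, the sketch below applies the standard UNIFAC relations: the molecular volume and surface parameters are count-weighted sums of the group values, \(r = \sum_k \nu_k R_k\) and \(q = \sum_k \nu_k Q_k\), and the interaction term follows \(\Psi_{nm} = \exp(-a_{nm}/T)\). The dictionary layout is a hypothetical stand-in for the Cebule output, and the numerical \(R\), \(Q\), and \(a_{nm}\) values are commonly tabulated original-UNIFAC values quoted for illustration; the parameter set Cebule actually uses may differ.

```python
import math

# Hypothetical layout; counts reuse the example above.
counts = {"CH3": 1, "CH2": 1}
group_params = {
    "CH3": {"R": 0.9011, "Q": 0.848},
    "CH2": {"R": 0.6744, "Q": 0.540},
}


def molecular_r_q(counts, group_params):
    """Sum group volume (R) and surface area (Q) weighted by group counts."""
    r = sum(n * group_params[g]["R"] for g, n in counts.items())
    q = sum(n * group_params[g]["Q"] for g, n in counts.items())
    return r, q


def psi(a_nm, temperature):
    """Standard UNIFAC interaction term for energy parameter a_nm (K) at temperature (K)."""
    return math.exp(-a_nm / temperature)


r, q = molecular_r_q(counts, group_params)  # r ≈ 1.5755, q ≈ 1.388

# Asymmetry: the CH2 -> OH and OH -> CH2 parameters differ (illustrative values).
psi_ch2_oh = psi(986.5, 298.15)
psi_oh_ch2 = psi(156.4, 298.15)
print(r, q, psi_ch2_oh, psi_oh_ch2)
```

If the descriptor already stores \(\Psi_{nm}\) directly, the `psi` helper is unnecessary; it is included only to show why the two directions of an interaction generally give different values.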

This descriptor dictionary can be generated via the Cebule SDK TaskType: GROUP_CONTRIBUTION task.

Sigma profiles

Sigma profiles can be generated using the COSMO-RS and COSMO-SAC solvation models:

  • Sigma profiles: Probability distributions of surface screening charge density on the molecular surface
  • COSMO-RS: Conductor-like screening model for real solvents
  • COSMO-SAC: Conductor-like screening model for segment activity coefficient

The sigma profiles capture the electrostatic interaction tendencies of molecules via molecular cavity surface segments and are vital for predicting solvation effects and thermodynamic properties for liquid mixtures. The descriptor dictionaries contain the charge bins and charge densities of each bin.

COSMO-RS

The COSMO-RS descriptor dictionary holds the charge bins and charge densities of each bin, as well as the sigma moments. Sigma moments are chemical descriptors derived from the sigma profiles as part of the COSMO-RS method. They are analogous to the moments of a statistical distribution and reduce the high-dimensional information in a sigma profile to a smaller number of descriptors that characterize molecular surface properties:

  • Sigma moments: each moment represents a different physical property of the molecule:
    • M0: Molecular surface area of the compound
    • M1: Negative of total charge of the compound
    • M2: Polarity of the compound
    • M3: Asymmetry of the sigma profile
    • M4-6: Higher-order moments (without well-established physical interpretations)
    • Mhbacc: Quantifies the molecule's hydrogen bond acceptor strength
    • Mhbdon: Quantifies the molecule's hydrogen bond donor strength

These moments are valuable descriptors for training quantitative structure-property relationship (QSPR) and quantitative structure-activity relationship (QSAR) models. The moments efficiently represent the solvent space and have been shown to be better descriptors than the complete sigma profile histogram for generating regression models.
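The analogy with statistical moments can be made concrete with a short sketch: the i-th sigma moment is the profile-weighted sum of \(\sigma^i\) over all bins, \(M_i = \sum_\sigma p(\sigma)\,\sigma^i\). The code assumes the per-bin values are surface-area weights, so that M0 recovers the surface area as described above; a profile normalized to unit area would give M0 = 1 instead. The function name, the input layout, and the synthetic Gaussian profile are illustrative assumptions, and the hydrogen bond moments (Mhbacc, Mhbdon) are omitted because they need an additional threshold parameter.

```python
import numpy as np


def sigma_moments(sigma, area, max_order=6):
    """Compute M0..M{max_order} as sum(area * sigma**i) over the profile bins."""
    sigma = np.asarray(sigma, dtype=float)
    area = np.asarray(area, dtype=float)
    if sigma.shape != area.shape:
        raise ValueError("sigma bins and per-bin values must have the same shape")
    return {f"M{i}": float(np.sum(area * sigma**i)) for i in range(max_order + 1)}


# Synthetic, symmetric profile for illustration only (not real COSMO-RS output).
sigma = np.linspace(-0.025, 0.025, 51)   # charge density bins
area = np.exp(-(sigma / 0.008) ** 2)     # per-bin surface weight

moments = sigma_moments(sigma, area)
print(moments)
```

Because the synthetic profile is symmetric around zero, its odd moments (M1, M3, M5) vanish up to floating-point error, which matches the interpretation of M1 as net charge and M3 as profile asymmetry.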

This descriptor dictionary can be generated via the Cebule SDK TaskType: SIGMA task with cosmo_method = cosmo-rs. See the linked task for input and output descriptions and examples.

COSMO-SAC

The COSMO-SAC descriptor dictionary is separated into the following key-value pairs, one for each of the three distinct sigma profile sections:

  • NHB (Non-Hydrogen Bonding): Represents surface segments that do not participate in hydrogen bonding. These areas generally correspond to hydrophobic or weakly polar regions of the molecule.

  • OH (Hydroxyl): Represents surface segments associated with hydroxyl groups that can form hydrogen bonds. These areas have distinctive charge density distributions that reflect their strong hydrogen bond donor/acceptor capabilities.

  • OT (Other): Represents surface segments that can participate in hydrogen bonding but are not hydroxyl groups. This includes N, F, and O (outside of OH) atoms.

These sigma profile descriptors provide a detailed characterization of the molecule's surface polarity and hydrogen bonding capacity. Both are critical factors in predicting how molecules interact in solution, and they ultimately affect the thermodynamic behaviour of the pure compound or mixture of compounds.
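As a rough sketch of how the three sections can be used together, the snippet below sums the NHB, OH, and OT components bin by bin to recover an overall profile and reports the share of the surface contributed by each section. The key names, the shared `sigma` bin array, and the numbers are hypothetical; check the SIGMA task output for the actual layout.

```python
import numpy as np

# Hypothetical COSMO-SAC descriptor with three bins; values are illustrative.
cosmo_sac = {
    "sigma": [-0.01, 0.0, 0.01],
    "NHB": [1.0, 6.0, 1.0],
    "OH": [0.5, 0.0, 0.5],
    "OT": [0.0, 0.0, 1.0],
}


def combine_profiles(descriptor, keys=("NHB", "OH", "OT")):
    """Return the bin-wise total profile and each section's fraction of the total."""
    parts = {k: np.asarray(descriptor[k], dtype=float) for k in keys}
    total = np.sum(list(parts.values()), axis=0)
    total_area = total.sum()
    fractions = {k: float(v.sum() / total_area) for k, v in parts.items()}
    return total, fractions


total_profile, area_fractions = combine_profiles(cosmo_sac)
print(total_profile, area_fractions)
```

The section fractions are a compact feature in their own right: a molecule with a large OH or OT share has more surface available for hydrogen bonding than one dominated by NHB segments.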

This descriptor dictionary can be generated via the Cebule SDK TaskType: SIGMA task with cosmo_method = cosmo-sac. See the linked task for input and output descriptions and examples.

Molecular dynamics forces

The force field MD descriptor contains average forces (kJ/mol/nm) on each atom from classical MD simulations:

  • Calculated using force field molecular dynamics (FFMD) with the OpenFF SAGE force field and NPT molecular dynamics.
  • Represents the average magnitude of force experienced by each atom of a primary molecule in a user-defined solution environment after equilibration.
  • Forces are averaged across multiple copies of the primary molecule in simulation.

To identify which atom in the molecule corresponds to each force in the force list, refer to the atom order (see the atom order descriptor).
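The sketch below shows one way to use that pairing: it matches each force to its atom through the atom order and averages the forces per element, giving a small fixed-length summary. It assumes the force list holds one scalar magnitude per atom in kJ/mol/nm; the function name, list shapes, and values are illustrative rather than the documented output of the FORCE_FIELD_MD task.

```python
from collections import defaultdict


def mean_force_by_element(atom_order, forces):
    """Average per-atom force magnitudes (kJ/mol/nm) for each element symbol."""
    if len(atom_order) != len(forces):
        raise ValueError("atom_order and forces must have the same length")
    grouped = defaultdict(list)
    for symbol, force in zip(atom_order, forces):
        grouped[symbol].append(float(force))
    return {symbol: sum(values) / len(values) for symbol, values in grouped.items()}


# Illustrative values only.
atom_order = ["O", "H", "H"]
forces = [1200.0, 800.0, 600.0]

per_element = mean_force_by_element(atom_order, forces)
print(per_element)
```

Per-element averages discard positional detail, so models that consume atom-level features, such as graph neural networks, would more likely attach each force to its node directly.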

These forces provide insight into the molecular stability and interactions that influence viscosity in solution.

This descriptor dictionary can be generated via the Cebule SDK TaskType: FORCE_FIELD_MD task.