Photoreactive molecule prediction
This week I looked into the chemical origins of the UV/Vis spectrum. Today I am reading about practical applications of UV/Vis spectroscopy and machine learning.
It turns out that UV/Vis spectroscopy is very relevant in material design, including nanomaterials, pesticides, pharmaceuticals, organic electronics, and more. As I discovered this week, the UV–Vis absorption spectrum is an important characteristic of organic compounds, and it is closely related to their optoelectronic properties and photochemical reactivity.
Designing materials and molecules
When designing new materials or molecules, the UV/Vis spectrum is usually of high interest. Measuring the UV/Vis spectrum of candidate materials and molecules is common practice. It is also common to use computational methods to predict the UV/Vis spectrum of candidate materials and molecules, which can save time and resources in the design process.
The paper I discuss today is titled “Machine learning prediction of UV–Vis spectra features of organic compounds related to photoreactive potential” (Mamede, Pereira, and Aires-de-Sousa 2021).
The paper uses machine learning to predict part of the UV/Vis spectrum of organic compounds, specifically the molar extinction coefficient (MEC) in the 290-700 nm range, which is relevant for photoreactivity assessment in pharmaceuticals.
More about photoreactivity is right below.
Photoreactive Molecules and Phototoxicity in Pharmaceutical Development
In the human body, external compounds can interact with light, leading to photochemical reactions that can cause adverse effects. This phenomenon, known as phototoxicity, is particularly relevant in the context of pharmaceuticals. Many drugs can become photoreactive upon exposure to UV or visible light.
When these compounds absorb light, they can undergo electronic transitions that lead to the formation of reactive species, which can then interact with biological macromolecules such as proteins, lipids, and DNA. This interaction can result in cellular damage, inflammation, and even carcinogenesis.
Clinical Manifestations
Phototoxicity can manifest in various ways, depending on the type of exposure and the specific drug involved. The clinical manifestations can be broadly categorized into acute, chronic, and ocular reactions:
- Acute: Sunburn-like erythema, edema, and pain in sun-exposed areas within hours of exposure
- Chronic: Hyperpigmentation, photoaging, and potential skin cancer risk with repeated exposure
- Ocular: Potential retinal damage if systemically administered drugs accumulate in eye tissues
Cancer development is a significant concern, as phototoxic compounds can induce DNA damage through the generation of reactive oxygen species (ROS) and other photoproducts. This can lead to mutations and ultimately carcinogenesis, particularly in tissues exposed to light. In clinical settings, carcinogenesis cannot be ruled out, as the long-term effects of phototoxic drugs are not always fully understood.
Photoreactive Molecular Features
The key molecular features that confer photoreactivity include:
Chromophores: Extended conjugated systems, aromatic rings, and α,β-unsaturated carbonyls that can absorb light in the relevant wavelength range. These structural elements create delocalized π-electron systems that lower the energy gap between molecular orbitals, enabling absorption of lower-energy (longer wavelength) photons.
Molar Extinction Coefficient (MEC): The intensity of light absorption, with compounds having MEC ≥ 1000 L·mol⁻¹·cm⁻¹ in the 290-700 nm range considered potentially photoreactive according to ICH S10 guidance.
The ICH S10 guidance on photosafety evaluation mandates assessment of photoreactive potential for all pharmaceuticals intended for human use. This guidance establishes:
- Absorption threshold criteria (290-700 nm, MEC ≥ 1000 L·mol⁻¹·cm⁻¹)
- Testing strategies for photosafety assessment
- Risk mitigation approaches
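To make the threshold concrete: the MEC follows from a measured absorbance through the Beer–Lambert law, A = ε·c·l. The sketch below is only an illustration of that arithmetic. The function name and the measurement values are hypothetical and do not come from the paper or from ICH S10.

```python
ICH_S10_MEC_THRESHOLD = 1000.0  # L·mol⁻¹·cm⁻¹, the threshold quoted above


def molar_extinction_coefficient(absorbance, concentration_mol_per_l, path_length_cm=1.0):
    """Beer-Lambert law A = epsilon * c * l, solved for epsilon (L·mol⁻¹·cm⁻¹)."""
    if concentration_mol_per_l <= 0 or path_length_cm <= 0:
        raise ValueError("concentration and path length must be positive")
    return absorbance / (concentration_mol_per_l * path_length_cm)


# Hypothetical measurement: A = 0.25 at 320 nm, 50 µM solution, 1 cm cuvette.
epsilon = molar_extinction_coefficient(0.25, 50e-6)
exceeds_threshold = epsilon >= ICH_S10_MEC_THRESHOLD
print(round(epsilon), exceeds_threshold)
```

With these made-up numbers the compound lands well above the threshold, so it would count as potentially photoreactive under the criterion above.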
Mechanisms of Phototoxicity
Phototoxicity occurs through several interconnected mechanisms:
Type I Reactions: The excited molecule directly interacts with biological targets (proteins, lipids, DNA) through electron or hydrogen atom transfer, creating covalent bonds or causing oxidative damage.
Type II Reactions: The excited molecule transfers energy to molecular oxygen, generating singlet oxygen. Together with other reactive oxygen species (ROS) such as superoxide and hydroxyl radicals, which are usually attributed to Type I electron transfer, it causes widespread cellular damage through lipid peroxidation, protein oxidation, and DNA strand breaks.
Hapten Formation: Some photoreactive compounds form covalent adducts with proteins, creating new antigenic determinants that can trigger allergic photodermatitis upon subsequent exposure.
Drug-Induced Phototoxicity
Phototoxicity represents a significant safety concern in pharmaceutical development, as numerous drug classes have been associated with photosensitivity reactions:
High-Risk Drug Classes:
- Fluoroquinolone antibiotics (ciprofloxacin, levofloxacin): Contain quinolone chromophores
- Tetracyclines: Extended conjugated systems
- NSAIDs (naproxen, ketoprofen): Aromatic structures with UV absorption
- Diuretics (furosemide, hydrochlorothiazide): Sulfonamide and benzothiadiazine chromophores
- Phenothiazine antipsychotics: Tricyclic aromatic systems
- Psoralens: Used in PUVA therapy but highly phototoxic
Machine Learning for Photoreactivity Prediction
Let’s go through the paper (Mamede, Pereira, and Aires-de-Sousa 2021) and see how machine learning can be used to predict photoreactivity in organic compounds.
Here is one of the most interesting quotes of the paper:
Training ML models to predict full UV–Vis spectra requires large databases of spectra obtained under consistent conditions to predict multiple continuous variables. (e.g., the molar extinction coefficients at several wavelengths).
This is a pain point: UV/Vis spectra can vary significantly depending on the conditions under which they are measured, such as solvent, pH, temperature, and concentration. Even the device used to measure the spectra can introduce variations. Machine learning is sensitive to the quality and consistency of the training data, and this is a crucial aspect when it comes to predicting UV/Vis spectra: usually there are no large databases of spectra obtained under consistent conditions available.
Paradoxically, machine learning can deal with variations in data. Think about applications like automatic speech recognition. It recognizes different voices and accents with very few errors.
Automatic speech recognition relies on deep learning models trained on huge amounts of data that cover a lot of variation. For science applications, such data sets are generally not available.
The authors continue:
Differently, here we report the exploration of ML tools to classify organic molecules in terms of their UV–Vis absorption spectrum based on molecular descriptors.
The authors overcome the small dataset problem by not predicting the full UV/Vis spectrum, or so they suggest.
Instead of letting their models predict part of the spectrum, the authors make their models answer a yes/no question: “is this molecule photoreactive (yes/no)?”.
For simplicity, the matter of photoreactivity was reduced to the ICH S10 guidance threshold of MEC ≥ 1000 L·mol⁻¹·cm⁻¹ in the 290-700 nm range, as real-world photoreactive behavior is not readily obtainable for arbitrary molecules.
Data and Labeling
The authors retrieved data on 80,000 molecules from Reaxys, a comprehensive database of chemical compounds and reactions. Each molecule was labeled as photoreactive or not, based on its molar extinction coefficient (MEC) in the 290-700 nm range, with a threshold of MEC ≥ 1000 L·mol⁻¹·cm⁻¹ indicating photoreactivity.
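The labeling rule can be sketched in a few lines. This is my own minimal reading of the rule, assuming each molecule comes with a list of (wavelength, MEC) absorption maxima. The data layout, the function name, and the example values are hypothetical, and the paper's actual processing of the Reaxys records may differ.

```python
MEC_THRESHOLD = 1000.0      # L·mol⁻¹·cm⁻¹
WINDOW_NM = (290.0, 700.0)  # wavelength window used for labeling


def label_molecule(absorption_maxima):
    """Return "POS" if any maximum inside the window reaches the threshold, else "NEG".

    absorption_maxima: iterable of (wavelength_nm, mec) pairs (assumed layout).
    """
    lo, hi = WINDOW_NM
    for wavelength_nm, mec in absorption_maxima:
        if lo <= wavelength_nm <= hi and mec >= MEC_THRESHOLD:
            return "POS"
    return "NEG"


# Hypothetical molecules, described only by their absorption maxima.
examples = {
    "strong band inside window": [(254.0, 12000.0), (340.0, 4500.0)],
    "strong band only below 290 nm": [(260.0, 15000.0)],
    "weak band inside window": [(310.0, 450.0)],
}
labels = {name: label_molecule(peaks) for name, peaks in examples.items()}
print(labels)
```

Note that a molecule with a very strong band below 290 nm still ends up in the NEG class, which matches the NEG rows of the class composition table further down.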
Duplicates and stereoisomers were removed together with charged and radical compounds. The three sets quoted below add up to 74,784 molecules.
As usual in machine learning, training and test sets were created:
The data set was randomly divided into a training set of 72,788 molecules (POS class: 36,036 molecules and NEG class: 36,752 molecules), a test set I of 998 molecules (POS class: 501 molecules and NEG class: 497 molecules), and a test set II of 998 molecules (POS class: 512 molecules and NEG class: 486 molecules).
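A random split with these sizes can be sketched as follows. The helper name and the seed are my own choices, and the total of 74,784 is simply the sum of the three quoted set sizes. The paper does not say how the split was implemented.

```python
import random


def random_split(items, test_sizes, seed=0):
    """Shuffle items once and cut off one test set per entry in test_sizes.

    Returns (training_items, [test_set_1, test_set_2, ...]).
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test_sets, start = [], 0
    for size in test_sizes:
        test_sets.append(shuffled[start:start + size])
        start += size
    return shuffled[start:], test_sets


molecule_ids = range(74_784)  # stand-in identifiers, not real Reaxys IDs
train, (test_1, test_2) = random_split(molecule_ids, test_sizes=(998, 998))
print(len(train), len(test_1), len(test_2))  # 72788 998 998
```

A purely random split like this gives roughly, but not exactly, the same class balance in every set, which is consistent with the slightly different POS/NEG counts quoted above.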
The composition of the two classes by MEC range is shown in the table below (the values appear to be percentages of each class):
POS classᵃ | Training set | Test set I | Test set II |
---|---|---|---|
1000 ≤ MEC ≤ 5000 | 21.3 | 22.4 | 19.5 |
5000 ≤ MEC < 10,000 | 24.0 | 23.3 | 25 |
MEC ≥ 10,000 | 54.7 | 54.3 | 55.5 |
NEG classᵇ | | | |
λ < 290 nm, MEC < 1000 | 10.4 | 10.7 | 10.5 |
λ < 290 nm, MEC ≥ 1000 | 91.1 | 88.9 | 91.6 |
λ > 700 nm, MEC < 1000 | 0.005 | 0 | 0 |
λ > 700 nm, MEC ≥ 1000 | 0.07 | 0.20 | 0 |
290 ≤ λ ≤ 700 nm, MEC ≤ 900 | 6.5 | 8.0 | 5.6 |
290 ≤ λ ≤ 700 nm, MEC > 900 | 0.23 | 0.4 | 0.21 |
Descriptors
A large set of molecular descriptors and fingerprints was calculated for each molecule. The authors used a combination of constitutional, topological, and electronic descriptors, as well as various types of fingerprints, computed with PaDEL-Descriptor and the RDKit cheminformatics toolkit.
Fingerprint Types
MACCS Keys (166 bits) These are expert-curated structural keys representing predefined chemical patterns. Each of the 166 bits corresponds to a specific SMARTS pattern that captures common functional groups, ring systems, and structural motifs. MACCS keys are highly interpretable since each bit has a known chemical meaning, making them valuable for understanding why molecules are considered similar.
Substructure Fingerprints (307 bits) These use SMARTS patterns for Laggner functional group classification, implemented in two variants:
- Sub: Binary presence/absence of each pattern (307 bits)
- SubC: Count-based version tracking how many times each pattern appears
This approach focuses specifically on pharmacologically relevant functional groups, making it particularly useful for drug discovery applications.
PubChem Fingerprints (881 bits) A comprehensive structural fingerprint system with 881 predefined patterns covering element counts, ring systems, atom pairs, functional groups, and complex structural motifs. These provide much broader coverage than MACCS keys and are designed to capture subtle structural variations important for similarity searching and QSAR modeling.
Circular Fingerprints
CDK Circular Fingerprints (1024 bits) Implementation of Extended Connectivity Fingerprints (ECFP) algorithm using the Chemistry Development Kit. These capture local molecular environments by iteratively expanding around each atom, generating unique identifiers for circular substructures that are then folded into a 1024-bit vector.
CDK Extended (1024 bits) Enhanced version of CDK fingerprints that dedicates additional bits specifically to ring features - capturing ring sizes, counts, aromaticity patterns, and ring closure information. This provides better representation of complex ring systems while maintaining the same 1024-bit length.
Morgan Fingerprints (1024 bits) RDKit’s implementation of circular fingerprints, using the Morgan algorithm to generate atom identifiers based on atomic properties and iteratively updating them by incorporating neighbor information. These are widely used due to their robust implementation and excellent performance in similarity searching.
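To get a feeling for what "iteratively updating atom identifiers" means, here is a toy circular fingerprint in plain Python. It is a simplified illustration of the idea only: it ignores bond orders, charges, and the duplicate-substructure handling of real ECFP/Morgan implementations, and it is not RDKit's or CDK's algorithm. The graph encoding and all names are my own assumptions.

```python
import hashlib


def _stable_hash(obj):
    """Deterministic integer hash (Python's built-in hash() is salted per run)."""
    return int.from_bytes(hashlib.sha1(repr(obj).encode()).digest()[:8], "big")


def toy_circular_fingerprint(atoms, bonds, radius=2, n_bits=1024):
    """Return the set of on-bit indices for a molecule given as a heavy-atom graph.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    # Radius 0: each atom is described by its element and its number of neighbors.
    ids = [_stable_hash((sym, len(neighbors[i]))) for i, sym in enumerate(atoms)]
    on_bits = {x % n_bits for x in ids}  # fold identifiers into n_bits positions
    for _ in range(radius):
        # Each round, an atom's identifier absorbs the identifiers of its neighbors,
        # so it describes a one-bond-larger environment.
        ids = [
            _stable_hash((ids[i], tuple(sorted(ids[j] for j in neighbors[i]))))
            for i in range(len(atoms))
        ]
        on_bits.update(x % n_bits for x in ids)
    return on_bits


# Heavy-atom graphs of ethanol (C-C-O) and dimethyl ether (C-O-C).
ethanol = toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
ether = toy_circular_fingerprint(["C", "O", "C"], [(0, 1), (1, 2)])
print(sorted(ethanol), sorted(ether))
```

The two isomers have the same atoms but different environments, so they switch on different bits. In practice one would use the toolkit implementations rather than anything like this toy.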
Molecular Descriptors
1D&2D Descriptors (1443 total) A comprehensive set including:
- Constitutional: Basic molecular properties (molecular weight, atom counts, bond counts)
- Topological: Graph-based descriptors capturing molecular connectivity and shape
- Electronic: Properties related to electron distribution and molecular orbitals
Modified Distance Descriptors (Md) A specialized approach with unique characteristics:
- Connectivity-based: Uses only molecular connectivity, avoiding bond orders and formal charges
- No 3D conformers: Eliminates need for 3D structure generation, aromaticity definitions, or mesomerism standardization
- Modified distances: Incorporates van der Waals radii and Sanderson electronegativity of neighboring atoms
- Parameters: 1010 intervals, 0.017 resolution, distances up to 4 bonds, distance factor of 4
- Function: Counts atom pairs at specific modified distances, providing a more chemically-informed distance metric than simple topological distance
Quantum Descriptors
ML Quantum Descriptors (MLQD) Machine learning-predicted quantum chemical properties:
- Properties: EHOMO (highest occupied molecular orbital energy), ELUMO (lowest unoccupied molecular orbital energy), and GAP (HOMO-LUMO energy gap)
- Implementation: 10 different ML models for each property, providing ensemble predictions
- Training: Models trained on DFT (Density Functional Theory) calculated data
- Advantage: Provides quantum chemical insights without expensive DFT calculations
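Why would a HOMO-LUMO gap say anything about absorption in the 290-700 nm window? As a rough intuition only, a gap can be converted into a wavelength with E = hc/λ. The sketch below does that conversion. Treating the gap as the excitation energy is a crude approximation of mine, the gap values are hypothetical, and this is not how the paper uses the MLQD descriptors.

```python
HC_EV_NM = 1239.84  # Planck constant times speed of light, in eV·nm


def gap_to_wavelength_nm(gap_ev):
    """Wavelength of a photon whose energy equals the given gap (E = h*c / lambda)."""
    if gap_ev <= 0:
        raise ValueError("gap must be positive")
    return HC_EV_NM / gap_ev


def in_photosafety_window(gap_ev, window_nm=(290.0, 700.0)):
    """True if the corresponding wavelength falls inside the 290-700 nm window."""
    lo, hi = window_nm
    return lo <= gap_to_wavelength_nm(gap_ev) <= hi


for gap in (5.0, 3.5, 1.5):  # hypothetical gaps in eV
    print(gap, round(gap_to_wavelength_nm(gap)), in_photosafety_window(gap))
```

In this picture, the 290-700 nm window corresponds to photon energies of roughly 1.8 to 4.3 eV.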
Computational Tools
PaDEL-Descriptor (v2.21): Comprehensive descriptor calculation software that can compute over 1000 molecular descriptors and fingerprints, providing a standardized platform for molecular characterization.
RDKit: Open-source cheminformatics toolkit particularly strong in fingerprint calculation and molecular property prediction, widely used in both academic and industrial settings.
This combination of descriptor types provides complementary views of molecular structure: fingerprints excel at pattern recognition and similarity searching, constitutional/topological descriptors capture fundamental molecular properties, Modified Distance descriptors provide a unique connectivity-based perspective, and ML quantum descriptors add electronic structure information without computational expense.
Machine learning
Several machine learning algorithms were trained, including random forest, support vector machines and CART.
Random forest is used for feature selection. Instead of selecting individual features, the model is trained on groups of features, as shown in the table below.
Descriptor group | MCC |
---|---|
RDKitMorganFP | 0.76 |
ExtCDK | 0.75 |
CDK | 0.74 |
MACCS | 0.74 |
Md | 0.74 |
PubChem | 0.73 |
RDKitFP | 0.73 |
1D&2D | 0.7 |
SubC | 0.67 |
Sub | 0.61 |
ML_QD | 0.51 |
The top groups are close. The authors do not find the results to improve when using combinations of 2 feature groups.
Scores for models other than the random forest are missing. I vaguely remember the authors mentioning somewhere that random forest performs best.
The RF model trained with all RDKitMorgan fingerprints predicted the test set I with accuracy of 0.88 and MCC 0.76.
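For reference, accuracy and the Matthews correlation coefficient (MCC) both follow from the confusion matrix. The counts below are hypothetical: I picked them so that they fit a 998-molecule test set with 501 POS and 497 NEG molecules and land near the reported scores. They are not the paper's confusion matrix.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical confusion matrix: 501 POS (441 found, 60 missed), 497 NEG (437 right, 60 wrong).
counts = dict(tp=441, fn=60, tn=437, fp=60)
acc = accuracy(**counts)
mcc = matthews_corrcoef(**counts)
print(round(acc, 2), round(mcc, 2))  # 0.88 0.76
```

MCC ranges from -1 to 1, with 0 meaning no better than chance, so on a nearly balanced test set an MCC of 0.76 goes together with about 12% misclassified molecules.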
The authors inspected the misclassified molecules and found the hard cut-off of MEC ≥ 1000 L·mol⁻¹·cm⁻¹ to be an important source of misclassification. When the MEC value is close to the cut-off, the model has to be increasingly precise, or the prediction will fall at random on one side or the other.
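The effect of a hard cut-off can be illustrated with a small calculation. Assuming, purely for illustration, that a measured or predicted MEC carries Gaussian noise with a 10% relative standard deviation, the sketch below computes the probability that the noisy value ends up on the wrong side of the threshold. The noise model and all numbers are my assumptions, not the paper's.

```python
import math


def flip_probability(true_mec, relative_sd, threshold=1000.0):
    """Probability that a noisy MEC value falls on the other side of the threshold.

    Assumes Gaussian noise with standard deviation relative_sd * true_mec.
    """
    sd = relative_sd * true_mec
    z = (threshold - true_mec) / sd
    p_below = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # P(noisy value < threshold)
    return p_below if true_mec >= threshold else 1.0 - p_below


for mec in (500, 900, 990, 1010, 1100, 5000):
    print(mec, round(flip_probability(mec, relative_sd=0.10), 3))
```

Far from the threshold the label is stable, but for values within a few percent of 1000 the label approaches a coin flip, which no classifier can be expected to learn.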
All false positives could be accounted for in one way or another: hard cut-offs or data errors in related training examples. The authors surfaced these issues through similarity searches.
For the false negatives, the authors found similar hard threshold issues, some database errors, and an effect that was not properly learned:
additional chlorine substituent in the aromatic ring added a new absorption band at a higher wavelength, and the ML model apparently did not learn that effect
Discussion
The machine learning part of the paper was hard to follow. The authors did not provide enough details on the models, and the results were not presented in a clear way.
The outlier analysis makes this paper strong. Even though the models found may not be optimal, the authors show that:
- Database errors in training lead to errors in the predictions
- Similarity search can be used to find similar molecules in the training set
- Hard cut-offs lead to misclassification
- Some effects may not be learned by the model, leading to misclassification
- Every error in the test set could be accounted for
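The similarity search used in such an error analysis can be sketched with the Tanimoto coefficient on fingerprint bits. The fingerprints, names, and labels below are made up for illustration, and I do not know which similarity measure or fingerprint the authors used for their searches.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def nearest_neighbors(query_bits, training_set, k=3):
    """Return the k most similar training molecules as (similarity, name, label)."""
    scored = [
        (tanimoto(query_bits, bits), name, label)
        for name, (bits, label) in training_set.items()
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)[:k]


# Hypothetical training fingerprints (on-bit indices) with their class labels.
training_set = {
    "train_A": ({1, 4, 9, 16, 25}, "POS"),
    "train_B": ({1, 4, 9, 16, 36}, "NEG"),
    "train_C": ({2, 3, 5, 7, 11}, "NEG"),
}
query = {1, 4, 9, 16, 25, 36}  # a misclassified test molecule, say
hits = nearest_neighbors(query, training_set, k=2)
print(hits)
```

Here the two closest training molecules are equally similar to the query but carry opposite labels, which is the kind of situation that points to a near-threshold case or a database error.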