Photoreactive molecule prediction

Published

June 27, 2025

This week I looked into the chemical origins of the UV/Vis spectrum. Today I am reading about practical applications of UV/Vis spectroscopy and machine learning.

It turns out that UV/Vis spectroscopy is very relevant in material design, including nanomaterials, pesticides, pharmaceuticals, organic electronics, and more. As I discovered this week, the UV–Vis absorption spectrum is an important characteristic of organic compounds, and it is closely related to their optoelectronic properties and photochemical reactivity.

Designing materials and molecules

When designing new materials or molecules, the UV/Vis spectrum is generally of high interest. Measuring the UV/Vis spectrum of candidate materials and molecules is common practice. It is also common to use computational methods to predict the UV/Vis spectrum of candidate materials and molecules, which can save time and resources in the design process.

The paper I discuss today is titled “Machine learning prediction of UV–Vis spectra features of organic compounds related to photoreactive potential” (Mamede, Pereira, and Aires-de-Sousa 2021).

The paper uses machine learning to predict part of the UV/Vis spectrum of organic compounds, specifically the molar extinction coefficient (MEC) in the 290-700 nm range, which is relevant for photoreactivity assessment in pharmaceuticals.

More about photoreactivity is right below.

Photoreactive Molecules and Phototoxicity in Pharmaceutical Development

In the human body, external compounds can interact with light, leading to photochemical reactions that can cause adverse effects. This phenomenon, known as phototoxicity, is particularly relevant in the context of pharmaceuticals. Many drugs can become photoreactive upon exposure to UV or visible light.

When these compounds absorb light, they can undergo electronic transitions that lead to the formation of reactive species, which can then interact with biological macromolecules such as proteins, lipids, and DNA. This interaction can result in cellular damage, inflammation, and even carcinogenesis.

Clinical Manifestations

Phototoxicity can manifest in various ways, depending on the type of exposure and the specific drug involved. The clinical manifestations can be broadly categorized into acute, chronic, and ocular reactions:

Acute: Sunburn-like erythema, edema, and pain in sun-exposed areas within hours of exposure
Chronic: Hyperpigmentation, photoaging, and potential skin cancer risk with repeated exposure
Ocular: Potential retinal damage if systemically administered drugs accumulate in eye tissues

Cancer development is a significant concern, as phototoxic compounds can induce DNA damage through the generation of reactive oxygen species (ROS) and other photoproducts. This can lead to mutations and ultimately carcinogenesis, particularly in tissues exposed to light. In clinical settings carcinogenesis cannot be ruled out, as the long-term effects of phototoxic drugs are not always fully understood.

Photoreactive Molecular Features

The key molecular features that confer photoreactivity include:

Chromophores: Extended conjugated systems, aromatic rings, and α,β-unsaturated carbonyls that can absorb light in the relevant wavelength range. These structural elements create delocalized π-electron systems that lower the energy gap between molecular orbitals, enabling absorption of lower-energy (longer wavelength) photons.

Molar Extinction Coefficient (MEC): The intensity of light absorption, with compounds having MEC ≥ 1000 L·mol⁻¹·cm⁻¹ in the 290-700 nm range considered potentially photoreactive according to ICH S10 guidance.
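The link between wavelength and photon energy mentioned under chromophores can be made concrete with E = hc/λ. The snippet below is my own illustration, not something taken from the paper, and the function name `photon_energy_ev` is made up for this post. It converts the two edges of the 290-700 nm window into electronvolts.

```python
# Photon energy from wavelength: E = h * c / wavelength.
# h * c is expressed in eV*nm so wavelengths can be passed in nanometres.
HC_EV_NM = 1239.841984  # Planck constant times speed of light, in eV*nm


def photon_energy_ev(wavelength_nm: float) -> float:
    """Return the photon energy in electronvolts for a wavelength in nm."""
    if wavelength_nm <= 0:
        raise ValueError("wavelength must be positive")
    return HC_EV_NM / wavelength_nm


for wavelength in (290, 700):
    print(f"{wavelength} nm -> {photon_energy_ev(wavelength):.2f} eV")
# 290 nm -> 4.28 eV
# 700 nm -> 1.77 eV
```

So a molecule needs an electronic transition of roughly 1.8 to 4.3 eV to absorb inside the window, which is the range that extended conjugation tends to reach.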

The ICH S10 guidance on photosafety evaluation recommends assessing the photoreactive potential of new pharmaceuticals intended for human use. This guidance establishes:

Absorption threshold criteria (290-700 nm, MEC ≥ 1000 L·mol⁻¹·cm⁻¹)
Testing strategies for photosafety assessment
Risk mitigation approaches
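To get a feeling for what the MEC threshold means in the lab, here is a small sketch based on the Beer-Lambert law, A = ε·c·l. The measurement values are invented for illustration and the function name is my own; nothing here comes from the ICH text itself.

```python
ICH_S10_MEC_THRESHOLD = 1000.0  # L·mol⁻¹·cm⁻¹, the threshold quoted above


def molar_extinction_coefficient(absorbance: float,
                                 concentration_mol_per_l: float,
                                 path_length_cm: float = 1.0) -> float:
    """Solve the Beer-Lambert law A = epsilon * c * l for epsilon."""
    if concentration_mol_per_l <= 0 or path_length_cm <= 0:
        raise ValueError("concentration and path length must be positive")
    return absorbance / (concentration_mol_per_l * path_length_cm)


# Hypothetical measurement: absorbance 0.5 at 0.1 mM in a 1 cm cuvette.
epsilon = molar_extinction_coefficient(absorbance=0.5,
                                       concentration_mol_per_l=1e-4)
print(round(epsilon))                     # 5000
print(epsilon >= ICH_S10_MEC_THRESHOLD)   # True
```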

Mechanisms of Phototoxicity

Phototoxicity occurs through several interconnected mechanisms:

Type I Reactions: The excited molecule directly interacts with biological targets (proteins, lipids, DNA) through electron or hydrogen atom transfer, creating covalent bonds or causing oxidative damage.

Type II Reactions: The excited molecule transfers energy to molecular oxygen, generating singlet oxygen. Together with the superoxide and hydroxyl radicals that arise from electron transfer (the Type I pathway), these reactive oxygen species (ROS) cause widespread cellular damage through lipid peroxidation, protein oxidation, and DNA strand breaks.

Hapten Formation: Some photoreactive compounds form covalent adducts with proteins, creating new antigenic determinants that can trigger allergic photodermatitis upon subsequent exposure.

Drug-Induced Phototoxicity

Phototoxicity represents a significant safety concern in pharmaceutical development, as numerous drug classes have been associated with photosensitivity reactions:

High-Risk Drug Classes:

Fluoroquinolone antibiotics (ciprofloxacin, levofloxacin): Contain quinolone chromophores
Tetracyclines: Extended conjugated systems
NSAIDs (naproxen, ketoprofen): Aromatic structures with UV absorption
Diuretics (furosemide, hydrochlorothiazide): Sulfonamide and benzothiadiazine chromophores
Phenothiazine antipsychotics: Tricyclic aromatic systems
Psoralens: Used in PUVA therapy but highly phototoxic

Machine Learning for Photoreactivity Prediction

Let’s go through the paper (Mamede, Pereira, and Aires-de-Sousa 2021) and see how machine learning can be used to predict photoreactivity in organic compounds.

Here is one of the most interesting quotes of the paper:

“Training ML models to predict full UV–Vis spectra requires large databases of spectra obtained under consistent conditions to predict multiple continuous variables (e.g., the molar extinction coefficients at several wavelengths).”

This is a pain point: UV/Vis spectra can vary significantly depending on the conditions under which they are measured, such as solvent, pH, temperature, and concentration. Even the device used to measure the spectra can introduce variations. Machine learning is sensitive to the quality and consistency of the training data, and this is a crucial aspect when it comes to predicting UV/Vis spectra: usually no large databases of spectra obtained under consistent conditions are available.

Paradoxically, machine learning can deal with variations in data. Think about applications like automatic speech recognition, which copes with different voices and accents at low error rates.

Automatic speech recognition relies on deep learning models trained on huge amounts of data that cover a lot of variation. For science applications, such data sets are generally not available.

The authors continue:

“Differently, here we report the exploration of ML tools to classify organic molecules in terms of their UV–Vis absorption spectrum based on molecular descriptors.”

The authors overcome the small dataset problem by not predicting the full UV/Vis spectrum, or so they suggest.

Instead of letting their models predict part of the spectrum, the authors make their models answer a yes/no question: “is this molecule photoreactive (yes/no)?”.

For simplicity, the matter of photoreactivity was reduced to the ICH S10 guidance threshold of MEC ≥ 1000 L·mol⁻¹·cm⁻¹ in the 290-700 nm range, as real-world photoreactive behavior is not readily obtainable for arbitrary molecules.

Data and Labeling

The authors retrieved data on 80,000 molecules from Reaxys, a comprehensive database of chemical compounds and reactions. Each molecule was labeled as photoreactive or not, based on its molar extinction coefficient (MEC) in the 290-700 nm range, with a threshold of MEC ≥ 1000 L·mol⁻¹·cm⁻¹ indicating photoreactivity.
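The labeling rule is simple enough to write down. Below is a minimal sketch, assuming each molecule comes with a list of absorption maxima as (wavelength, MEC) pairs. The data layout, the function name `is_photoreactive`, and the example peak lists are my own assumptions, not the authors' code.

```python
from typing import Iterable, Tuple

Peak = Tuple[float, float]  # (wavelength in nm, MEC in L·mol⁻¹·cm⁻¹)


def is_photoreactive(peaks: Iterable[Peak],
                     lambda_min: float = 290.0,
                     lambda_max: float = 700.0,
                     mec_threshold: float = 1000.0) -> bool:
    """POS if any listed peak lies in the window with MEC at or above the threshold."""
    return any(lambda_min <= wavelength <= lambda_max and mec >= mec_threshold
               for wavelength, mec in peaks)


# Hypothetical peak lists for three molecules.
strong_uv_only = [(254.0, 12000.0)]                    # strong, but below 290 nm
absorbs_in_window = [(254.0, 12000.0), (350.0, 4500.0)]
weak_in_window = [(310.0, 400.0)]                      # in the window, but too weak

print(is_photoreactive(strong_uv_only))     # False
print(is_photoreactive(absorbs_in_window))  # True
print(is_photoreactive(weak_in_window))     # False
```

Note that a molecule with a very strong band just below 290 nm ends up in the NEG class under this rule.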

Duplicates and stereoisomers were removed together with charged and radical compounds, leaving 74,784 molecules (the 72,788 + 998 + 998 of the split quoted below).

As usual in machine learning, training and test sets were created:

“The data set was randomly divided into a training set of 72,788 molecules (POS class: 36,036 molecules and NEG class: 36,752 molecules), a test set I of 998 molecules (POS class: 501 molecules and NEG class: 497 molecules), and a test set II of 998 molecules (POS class: 512 molecules and NEG class: 486 molecules).”
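A random division like the one quoted can be sketched with the standard library alone. This is only an illustration of the idea: the seed, the function name `random_split`, and the use of integer indices in place of molecules are assumptions of mine, and the paper does not say how the authors implemented their split.

```python
import random


def random_split(items, test_sizes, seed=0):
    """Shuffle once, carve off test sets of the given sizes, keep the rest for training."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test_sets = []
    start = 0
    for size in test_sizes:
        test_sets.append(shuffled[start:start + size])
        start += size
    return shuffled[start:], test_sets


# Integer indices stand in for the molecules; sizes follow the quote above.
n_total = 72_788 + 998 + 998
train, (test_1, test_2) = random_split(range(n_total), test_sizes=(998, 998))
print(len(train), len(test_1), len(test_2))  # 72788 998 998
```

Because the split is random rather than stratified, the class balance in each test set can drift a little, which matches the slightly different POS/NEG counts in test sets I and II.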


Distribution of UV–Vis absorption features in the data sets, in percent of the molecules in each class. a) Statistics concerning the peak with the highest MEC within the 290–700 nm window; b) statistics concerning any listed peak (Mamede, Pereira, and Aires-de-Sousa 2021).
POS classᵃ	Training set	Test set I	Test set II
1000 ≤ MEC ≤ 5000	21.3	22.4	19.5
5000 ≤ MEC < 10,000	24.0	23.3	25.0
MEC ≥ 10,000	54.7	54.3	55.5
NEG classᵇ
λ < 290 nm, MEC < 1000	10.4	10.7	10.5
λ < 290 nm, MEC ≥ 1000	91.1	88.9	91.6
λ > 700 nm, MEC < 1000	0.005	0	0
λ > 700 nm, MEC ≥ 1000	0.07	0.20	0
290 ≤ λ ≤ 700 nm, MEC ≤ 900	6.5	8.0	5.6
290 ≤ λ ≤ 700 nm, MEC > 900	0.23	0.40	0.21

Descriptors

A large set of molecular descriptors and fingerprints was calculated for each molecule. The authors used a combination of constitutional, topological, and electronic descriptors, as well as various types of fingerprints, computed with PaDEL-Descriptor and the RDKit cheminformatics toolkit.

Fingerprint Types

MACCS Keys (166 bits) These are expert-curated structural keys representing predefined chemical patterns. Each of the 166 bits corresponds to a specific SMARTS pattern that captures common functional groups, ring systems, and structural motifs. MACCS keys are highly interpretable since each bit has a known chemical meaning, making them valuable for understanding why molecules are considered similar.

Substructure Fingerprints (307 bits) These use SMARTS patterns for Laggner functional group classification, implemented in two variants:

Sub: Binary presence/absence of each pattern (307 bits)
SubC: Count-based version tracking how many times each pattern appears

This approach focuses specifically on pharmacologically relevant functional groups, making it particularly useful for drug discovery applications.

PubChem Fingerprints (881 bits) A comprehensive structural fingerprint system with 881 predefined patterns covering element counts, ring systems, atom pairs, functional groups, and complex structural motifs. These provide much broader coverage than MACCS keys and are designed to capture subtle structural variations important for similarity searching and QSAR modeling.

Circular Fingerprints

CDK Fingerprints (1024 bits) The standard hashed fingerprint of the Chemistry Development Kit. As far as I can tell, this one is path-based (Daylight-style) rather than circular: it enumerates linear paths of bonded atoms up to a fixed length and hashes them into a 1024-bit vector.

CDK Extended (1024 bits) Enhanced version of CDK fingerprints that dedicates additional bits specifically to ring features - capturing ring sizes, counts, aromaticity patterns, and ring closure information. This provides better representation of complex ring systems while maintaining the same 1024-bit length.

Morgan Fingerprints (1024 bits) RDKit’s implementation of circular fingerprints, using the Morgan algorithm to generate atom identifiers based on atomic properties and iteratively updating them by incorporating neighbor information. These are widely used due to their robust implementation and excellent performance in similarity searching.
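The iterative neighbor update behind circular fingerprints is easier to see in code. The following is a deliberately simplified toy version in plain Python, not RDKit's implementation: it ignores bond orders, charges, and hydrogens, and uses a CRC32 checksum as a stand-in hash. The function name and the graph input format are my own.

```python
import zlib


def toy_circular_fingerprint(atoms, bonds, radius=2, n_bits=1024):
    """Return the set of 'on' bit positions for a molecule given as a graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs; bond orders are ignored in this toy
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Radius 0: each atom starts with an identifier derived from its element.
    identifiers = [zlib.crc32(symbol.encode()) for symbol in atoms]
    on_bits = {identifier % n_bits for identifier in identifiers}

    # Each round mixes in the neighbors' identifiers, so the described
    # environment grows by one bond per iteration.
    for _ in range(radius):
        updated = []
        for i in range(len(atoms)):
            environment = (identifiers[i],
                           tuple(sorted(identifiers[j] for j in neighbors[i])))
            updated.append(zlib.crc32(repr(environment).encode()))
        identifiers = updated
        # Fold the identifiers into a fixed-length bit vector.
        on_bits.update(identifier % n_bits for identifier in identifiers)
    return on_bits


ethanol = toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_reordered = toy_circular_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
print(ethanol == ethanol_reordered)  # True: atom numbering does not matter
```

Sorting the neighbor identifiers is what makes the result independent of atom order, and the modulo step is the "folding" into a fixed number of bits.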

Molecular Descriptors

1D&2D Descriptors (1443 total) A comprehensive set including:

Constitutional: Basic molecular properties (molecular weight, atom counts, bond counts)
Topological: Graph-based descriptors capturing molecular connectivity and shape
Electronic: Properties related to electron distribution and molecular orbitals

Modified Distance Descriptors (Md) A specialized approach with unique characteristics:

Connectivity-based: Uses only molecular connectivity, avoiding bond orders and formal charges
No 3D conformers: Eliminates need for 3D structure generation, aromaticity definitions, or mesomerism standardization
Modified distances: Incorporates van der Waals radii and Sanderson electronegativity of neighboring atoms
Parameters: 1010 intervals, 0.017 resolution, distances up to 4 bonds, distance factor of 4
Function: Counts atom pairs at specific modified distances, providing a more chemically-informed distance metric than simple topological distance

Quantum Descriptors

ML Quantum Descriptors (MLQD) Machine learning-predicted quantum chemical properties:

Properties: EHOMO (highest occupied molecular orbital energy), ELUMO (lowest unoccupied molecular orbital energy), and GAP (HOMO-LUMO energy gap)
Implementation: 10 different ML models for each property, providing ensemble predictions
Training: Models trained on DFT (Density Functional Theory) calculated data
Advantage: Provides quantum chemical insights without expensive DFT calculations

Computational Tools

PaDEL-Descriptor (v2.21): Comprehensive descriptor calculation software that can compute over 1000 molecular descriptors and fingerprints, providing a standardized platform for molecular characterization.

RDKit: Open-source cheminformatics toolkit particularly strong in fingerprint calculation and molecular property prediction, widely used in both academic and industrial settings.

This combination of descriptor types provides complementary views of molecular structure: fingerprints excel at pattern recognition and similarity searching, constitutional/topological descriptors capture fundamental molecular properties, Modified Distance descriptors provide a unique connectivity-based perspective, and ML quantum descriptors add electronic structure information without the cost of running DFT.

Machine learning

Several machine learning algorithms were trained, including random forest, support vector machines and CART.

Random forest is used for feature selection. Instead of selecting individual features, the model is trained on groups of features, as shown in the table below.

Matthews correlation coefficient (MCC) for random forest models trained on single descriptor groups (Mamede, Pereira, and Aires-de-Sousa 2021).
Descriptor group	MCC
RDKitMorganFP	0.76
ExtCDK	0.75
CDK	0.74
MACCS	0.74
Md	0.74
PubChem	0.73
RDKitFP	0.73
1D&2D	0.70
SubC	0.67
Sub	0.61
ML_QD	0.51

The top groups are close. The authors did not find that combining two feature groups improved the results.

Scores for models other than the random forest are missing. I vaguely remember the authors mentioning somewhere that random forest performs best.

The RF model trained with all RDKitMorgan fingerprints predicted test set I with an accuracy of 0.88 and an MCC of 0.76.
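For reference, this is how accuracy and MCC follow from a confusion matrix. The counts below are hypothetical: I picked them so that they match the class sizes of test set I (501 POS, 497 NEG) and reproduce the reported scores after rounding, but the post does not contain the paper's actual confusion matrix.

```python
import math


def accuracy_and_mcc(tp: int, tn: int, fp: int, fn: int):
    """Return (accuracy, Matthews correlation coefficient) for a binary confusion matrix."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal sum is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, mcc


# Hypothetical counts for a 998-molecule test set with 501 POS and 497 NEG.
accuracy, mcc = accuracy_and_mcc(tp=441, tn=437, fp=60, fn=60)
print(f"accuracy = {accuracy:.2f}, MCC = {mcc:.2f}")
# accuracy = 0.88, MCC = 0.76
```

With nearly balanced classes, as here, MCC and accuracy tell a similar story; MCC becomes more informative when one class dominates.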

The authors inspected the misclassified molecules and found the hard cut-off of MEC ≥ 1000 L·mol⁻¹·cm⁻¹ to be an important source of misclassification. When the MEC value is close to the cutoff, the model has to be increasingly precise, or the prediction will fall at random on one side or the other.
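The same argument applies to the labels themselves: if the measured MEC carries some error, molecules near the cutoff get a label that is partly a coin toss. The small simulation below illustrates this under an assumption that is entirely mine, namely a 10% relative Gaussian measurement error; the paper does not quantify the noise.

```python
import random


def label_flip_rate(true_mec: float,
                    relative_noise: float = 0.10,
                    threshold: float = 1000.0,
                    n_trials: int = 10_000,
                    seed: int = 0) -> float:
    """Fraction of noisy measurements that land on the other side of the threshold."""
    rng = random.Random(seed)
    true_label = true_mec >= threshold
    flips = 0
    for _ in range(n_trials):
        measured = true_mec * (1.0 + rng.gauss(0.0, relative_noise))
        if (measured >= threshold) != true_label:
            flips += 1
    return flips / n_trials


near_cutoff = label_flip_rate(1050.0)      # just above the threshold
far_from_cutoff = label_flip_rate(5000.0)  # well above the threshold
print(f"flip rate at MEC 1050: {near_cutoff:.2f}")
print(f"flip rate at MEC 5000: {far_from_cutoff:.2f}")
```

Under this assumed noise level, a molecule with a true MEC of 1050 gets the wrong label in a sizeable share of measurements, while one at 5000 essentially never does.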

All false positives could be accounted for in one way or another: the hard cutoff or data errors in related training examples. The authors surfaced these issues through similarity searches.
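A similarity search of this kind is typically done with the Tanimoto coefficient on fingerprint bits. Here is a minimal sketch with made-up bit sets; the molecule names, the helper `most_similar`, and the choice of Tanimoto are my assumptions, since the post does not record which similarity measure the authors used.

```python
def tanimoto(bits_a, bits_b) -> float:
    """Tanimoto coefficient: shared 'on' bits divided by all 'on' bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def most_similar(query_bits, training_set, top_k=3):
    """Rank training molecules by similarity to the query, highest first."""
    scored = [(tanimoto(query_bits, bits), name)
              for name, bits in training_set.items()]
    return sorted(scored, reverse=True)[:top_k]


# Hypothetical fingerprints, given as sets of 'on' bit positions.
training_set = {
    "mol_A": {1, 2, 3, 4},
    "mol_B": {3, 4, 5, 6},
    "mol_C": {10, 11},
}
misclassified_query = {2, 3, 4}
print(most_similar(misclassified_query, training_set, top_k=2))
# [(0.75, 'mol_A'), (0.4, 'mol_B')]
```

Looking up the nearest training neighbors of a misclassified molecule shows quickly whether the model was misled by a mislabeled or borderline look-alike.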

For the false negatives, the authors found similar hard threshold issues, some database errors, and an effect that was not properly learned:

“additional chlorine substituent in the aromatic ring added a new absorption band at a higher wavelength [37], and the ML model apparently did not learn that effect”

Discussion

The machine learning part of the paper was hard to follow. The authors did not provide enough details on the models, and the results were not presented in a clear way.

The outlier analysis makes this paper strong. Even though the models found may not be optimal, the authors show that:

Database errors in training lead to errors in the predictions
Similarity search can be used to find similar molecules in the training set
Hard cut-offs lead to misclassification
Some effects may not be learned by the model, leading to misclassification
Every error in the test set could be accounted for

References

Mamede, Rafael, Florbela Pereira, and João Aires-de-Sousa. 2021. “Machine Learning Prediction of UV–Vis Spectra Features of Organic Compounds Related to Photoreactive Potential.” Scientific Reports 11 (1): 23720. https://doi.org/10.1038/s41598-021-03070-9.
