For those who want the quick version, the video explainer is at the bottom of this post.
The ability to predict a specific property or outcome of a chemical and/or its interaction with a system is nothing new to those familiar with quantitative structure-activity relationship (QSAR) modelin. I picked toxicity as the property to predict in this demonstration due to the readily available datasets, but the universe of valuable things to predict about chemicals is much larger. From physical properties like melting temperature and solubility in material sciences to drug activity, half-life, and blood-brain barrier permeability in pharmaceuticals.
While there are some pretty nifty new ways to predict chemical activity using Graph Neural Networks and Message Passing Neural Networks (useful demonstrations in the DeepChem Python packages), this demonstration will focus on using relatively more explainable methods.
Starting With The Data
I took the openly available Ames Mutagenecity data from https://doc.ml.tu-berlin.de/toxbenchmark/. The data has multiple columns but the ones I need to build a predictive model are the target which is 0 = “Not Toxic”, 1 = “Toxic” and the SMILEs representation. The code to pre-process the the data is shown below. We need to numerically represent the chemical structures, which we do by counting atoms, functional groups, rings, and aromaticity. There are many more ways to represent a chemical structure using approaches like fingerprinting, atom-pair counting, and graph based approaches.
For more information on how to get from a SMILEs representation to numerical representation check out ChemmineR (graphic shown below).
df <- read_csv("D://Mutagenicity_N6512.csv") %>% select(Activity, Steroid, Canonical_Smiles)
sdfset <- smiles2sdf(df$Canonical_Smiles[1:500])
propma <- atomcountMA(sdfset, addH=FALSE)
propma <- data.frame(MF=MF(sdfset, addH=FALSE), MW=MW(sdfset, addH=FALSE),
Ncharges=sapply(bonds(sdfset, type="charge"), length),
rings(sdfset, upper=6, type="count", arom=TRUE))
data <- cbind(propma[,-1], target = df$Activity[1:500])
train_data <- data[-(1:5),]
holdout_data <- data[(1:5),]
The train data (which will further be split into train/test) is shown below.
The holdout data is shown below.
Building The Models
I will use the interpretable machine learning app that I’ve coded out in R to train models, evaluate their performance on the test set and then use the best model on the holdout to both make predictions about their odds of being toxic and explain those predictions.
To make things easier I recorded a video showing how this works.