For those who want the quick version, the video explainer is at the bottom of this post.
There are many things worth predicting about DNA and many ways to predict them. Starting with the ways: there are similarity-based methods like BLAST (sequence alignment), where a researcher may conclude from sequence similarity that two sequences have similar functions or 3D shapes. A more mathematically defined way of predicting a property of DNA is to fit a statistical or machine learning model. This post covers the second approach.
Following a demonstration from the rDNAse package for R, we will try to predict whether a stretch of DNA is DNAseI hypersensitive or not (1 = “Hyper Sensitive”, 0 = “Not Hyper Sensitive”).
To prepare the data for modeling I used the following code.
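    library(rDNAse)
    library(dplyr)  # glimpse()
    library(readr)  # write_csv()

    # Read the hypersensitive (positive) and non-hypersensitive (negative) sequences
    pos_hs <- readFASTA(system.file('dnaseq/hs.fasta', package = 'rDNAse'))
    neg_hs <- readFASTA(system.file('dnaseq/non-hs.fasta', package = 'rDNAse'))

    # One row of k-mer counts per sequence
    x1 <- t(sapply(pos_hs, kmer))
    x2 <- t(sapply(neg_hs, kmer))
    x <- rbind(x1, x2)

    # 1 = hypersensitive, 0 = not; kept numeric so cbind() preserves the 0/1 values
    labels <- c(rep(1, length(pos_hs)), rep(0, length(neg_hs)))

    # The first 5 rows (all positives) are held out; the rest are used for training
    full_data <- as.data.frame(cbind(x, target = labels))
    train_data <- full_data[-(1:5), ]
    holdout_data <- full_data[1:5, ]

    glimpse(train_data)
    glimpse(holdout_data)

    write_csv(train_data, "TrainingData.csv")
    write_csv(holdout_data, "HoldoutData.csv")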
Below is the training data, which contains both the “0” and “1” classes along with a simple dinucleotide count of each sequence, used as the model input.
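To make the idea of a dinucleotide count concrete, here is a minimal base R sketch. It assumes overlapping windows of width 2 over an A/C/G/T alphabet; `count_dinucleotides` is a hypothetical helper written for illustration, not the rDNAse implementation.

```r
# Count overlapping dinucleotides in a DNA string (base R only).
# count_dinucleotides is a hypothetical helper, not part of rDNAse.
count_dinucleotides <- function(dna) {
  dna <- toupper(dna)
  bases <- c("A", "C", "G", "T")
  # All 16 possible dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
  kmers <- as.vector(t(outer(bases, bases, paste0)))
  # Overlapping windows of width 2: positions 1-2, 2-3, 3-4, ...
  starts <- seq_len(nchar(dna) - 1)
  windows <- substring(dna, starts, starts + 1)
  # Tabulate against the full set so absent dinucleotides get a 0
  counts <- table(factor(windows, levels = kmers))
  setNames(as.integer(counts), kmers)
}

example_counts <- count_dinucleotides("ACGTACGT")
print(example_counts)
```

Each sequence becomes a fixed-length vector of 16 counts regardless of its length, which is what lets sequences of different sizes share one feature table.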
Below are the 5 holdout sequences, all class “1”, to be checked in the interpretable machine learning app.
Building The Models
I will use the interpretable machine learning app that I coded in R to train models, evaluate their performance on the test set, and then use the best model on the holdout data to predict each sequence's odds of being a DNAseI hypersensitive site.
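For readers who want to see the train, test, and holdout workflow outside the app, here is a rough base R sketch. It is not the app's code: the data are simulated stand-ins for the k-mer table (the column names `AA` and `CG` and all values are made up), and plain logistic regression via `glm` is an assumed substitute for whatever models the app actually fits.

```r
set.seed(42)

# Simulated stand-in for the k-mer table: two made-up count features
# and a 0/1 target. These are NOT the rDNAse sequences.
n <- 200
toy <- data.frame(AA = rpois(n, 5), CG = rpois(n, 3))
toy$target <- rbinom(n, 1, plogis(-1 + 0.4 * toy$CG - 0.1 * toy$AA))

# Hold out the first 5 rows, then split the rest 80/20 into train and test
holdout <- toy[1:5, ]
rest <- toy[-(1:5), ]
train_idx <- sample(nrow(rest), size = floor(0.8 * nrow(rest)))
train <- rest[train_idx, ]
test <- rest[-train_idx, ]

# Fit a logistic regression on the training rows only
fit <- glm(target ~ ., data = train, family = binomial)

# Evaluate on the test set: accuracy at a 0.5 probability cutoff
test_prob <- predict(fit, newdata = test, type = "response")
test_acc <- mean((test_prob > 0.5) == (test$target == 1))

# Score the holdout rows: predicted probability and the matching odds
holdout_prob <- predict(fit, newdata = holdout, type = "response")
holdout_odds <- holdout_prob / (1 - holdout_prob)

print(test_acc)
print(round(holdout_odds, 2))
```

Keeping the holdout rows out of both fitting and model selection is what makes their predictions an honest check of the chosen model.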
To make things easier I recorded a video showing how this works.