From Pairwise to Batch Sequence Alignment in R

Quick Plug For My Newsletter

Alignment. What Is It Good For?

One of the most common and important bioinformatic analyses is the aligning biological sequences like DNA, RNA, and protein against other similar sequences. These alignments can ranges from pairwise (one sequence vs another sequence) to multi-sequence alignment (many sequences vs many sequences).

The alignment can be used to calculate a percent identity (how many base-pairs or amino-acids line up exactly). This in turn can be used to determine how likely these sequences are to have similar biological function, how closely related they are, or simply as a diagnostic to determine if your sample has the right specimen.
There are various open-source options simple one-off pairwise and multiple sequence alignments such as…

https://blast.ncbi.nlm.nih.gov/Blast.cgi
https://www.ebi.ac.uk/Tools/msa/muscle/
you can even download BLAST from NCBI to run locally.

However, for situations where you anticipate having to…

run MANY alignments every day, week, or month
cases where the PRIVACY AND SECURITY of the sequence is a concern

creating a custom app that runs batch alignments on your own computer may be the answer.

I will walk through how to run these alignments in R using the msa package. I will go through levels of job complexity and show how to run big alignment jobs even faster. And finally I will show how to automate this work by creating an app that just requires dragging and dropping your sequences.

The high level structure this article will follow is…

Starting With Sequences: I’ll show code to pull open source flu DNA sequences I use in this example
Aligning Sequences Level 1: One Pairwise Alignment – Simply aligning one sequence against another, a case where you want to demonstrate what you think you already know about two hand-picked sequences.
Aligning Sequences Level 2: Aligning One Sequence Against a Database of Known Sequences in a Pairwise Manner – When you want to know what your sequence is most similar to and have a database of known sequences it could be.
Aligning Sequences Level 3: Aligning Multiple New Sequences Against a Database of Known Sequences in a Pairwise Manner – When you have many unknown sequences and want to know what each sequence is most similar to and have a database of known sequences each input sequence could be.
Aligning Sequences Level 4: Making Sequence Alignment Faster With Parallelization – When you start needing to align more new sequences against a growing database of sequences, parallelization will allow you to try all combinations faster.
Aligning Sequences Level 5: Making an App to Automate Sequence Alignment.
Conclusion: The pros and cons of doing biological sequence alignment on a local computer.

For those in a hurry I will put the video from “Level 5” right here as well as further below.

Starting With Sequences

To get some example sequences to demonstrate a few different ways to run alignments, I will pull the open source flu sequences from http://www.fludb.org using an API call.

library(readr)
library(here)
library(seqinr)
library(msa)
library(XVector)
library(tidyverse)
library(doParallel)

library(httr)
library(seqRFLP)
# FLUdb API URL
url= "https://www.fludb.org/brc/api/sequence?"
# Queries
#change here will change sequences pulled in
query="datatype=genome&family=influenza&fromyear=2020&toyear=2022&completeseq=y&host=swine&segment=4&country=USA&metadata=strainname,country,host,date,segment,subtype&output=fasta)"
# get raw results
raw.result <- httr::GET(paste0(url, query))
# 200 means successful
#httr::http_status(raw.result)
this.raw.content <- rawToChar(raw.result$content)
#convert to a data frame; will spit single column dataframe
df<-as.data.frame(read.table(text = this.raw.content, sep="\n", header=F))

df2<- data.frame(Sequence_ID = df$V1[c(TRUE, FALSE)], sequence = df$V1[c(FALSE, TRUE)])

df2 <- df2 %>% rename(names = Sequence_ID)
r_char_ct_unique <- function(x) map_int(map(x, function(x) unique(strsplit(x, "")[[1]])), length)


df2$count <- r_char_ct_unique(df2$sequence)
df2 <- df2 %>% filter(count == 4) %>% select(-count)

After pulling the sequences with a few constraints there are 229 flu sequences in our example data-set. The current data-frame has an ID column and Sequence column as seen below.

glimpse(df2)

## Rows: 229
## Columns: 2
## $ names    <chr> ">MN999893|A/swine/Indiana/A02478980/2020|USA|Swine|2020-01-0~
## $ sequence <chr> "ATGAAGACTATCATTGCTTTTAGCTGCATTCTATGTCTGATTATCGCCCAAAAACTTCCC~

With a little programming we can strip out some more information from the names and get an idea of the distribution H and N types. We can see most influenza types have about the same amount of sequences in this database except H3N1.

df2 <- cbind(df2, do.call(rbind, strsplit(df2$names, "[|]"))) %>% select(names, sequence, `7`) %>% 
  rename(HN = `7`)

df2 %>% group_by(HN) %>% summarise(count = n())

## # A tibble: 4 x 2
##   HN    count
##   <chr> <int>
## 1 H1N1     84
## 2 H1N2     60
## 3 H3N1      1
## 4 H3N2     84

Aligning Sequences Level 1: One Pairwise Alignment

Simply aligning one sequence against another, a case where you want to demonstrate what you think you already know about two hand-picked sequences. I will narrow down the pool to H1N1 sequences and take the top 2 sequences. This represents a situation where you think you know something about two sequences (they are similar) and just need to prove it with a one-off alignment. We can see the default alignment below.

subset <- df2 %>% filter(HN == "H1N1")
subset <- subset[1:2,]
subset_prepped <- DNAStringSet(x = subset$sequence, use.names = T)
aln <- msaClustalOmega(subset_prepped, order = "input")

aln
## ClustalOmega 1.2.0 
## 
## Call:
##    msaClustalOmega(subset_prepped, order = "input")
## 
## MsaDNAMultipleAlignment with 2 rows and 1701 columns
##     aln 
## [1] ATGAAGGCAGTACTAGTAGTCCTGCTATATACATTT...TCCAATGGGTCGCTACAGTGCAGAATATGCATTTAA
## [2] ATGAAGGCAATACTGGTAGTTCTGCTGTATACATTT...TCTAATGGGTCTCTACAATGTAGAATATGTATTTAA
## Con ATGAAGGCA?TACT?GTAGT?CTGCT?TATACATTT...TC?AATGGGTC?CTACA?TG?AGAATATG?ATTTAA

With a little coding we can change the view of the alignment to flag differences with a “1” and similarities with a “0”.

aln <- msaConvert(aln, type="seqinr::alignment")
z <- as.data.frame(cbind(as.character(s2c(aln$seq[1])), as.character(s2c(aln$seq[2]))))
z$V1 <- as.character(z$V1)
z$V2 <- as.character(z$V2)
z$diff <- ifelse(z[,1] == z[,2], 0, 1)
head(z,30)

##    V1 V2 diff
## 1   A  A    0
## 2   T  T    0
## 3   G  G    0
## 4   A  A    0
## 5   A  A    0
## 6   G  G    0
## 7   G  G    0
## 8   C  C    0
## 9   A  A    0
## 10  G  A    1
## 11  T  T    0
## 12  A  A    0
## 13  C  C    0
## 14  T  T    0
## 15  A  G    1
## 16  G  G    0
## 17  T  T    0
## 18  A  A    0
## 19  G  G    0
## 20  T  T    0
## 21  C  T    1
## 22  C  C    0
## 23  T  T    0
## 24  G  G    0
## 25  C  C    0
## 26  T  T    0
## 27  A  G    1
## 28  T  T    0
## 29  A  A    0
## 30  T  T    0

And by doing a little math on the “1”’s and “0”’s you can calculate the percent identity.

percent_identity <- round((1- sum(z$diff)/nrow(z))*100, 1)
percent_identity
## [1] 88.1

The code below sets up the function for making pairwise alignments against a reference database of multiple sequences. It can be run in serial (slow) or parallel (faster). We will use this function in the coming levels.

  parBlast <- function (j)
      {
        library(shiny)
        library(readr)
        library(here)
        
        
        library(ORFik)
        
        library(Biostrings)
        library(shinyWidgets)
        
        library(seqinr)
        library(msa)
        library(XVector)
        library(tidyverse)
        library(doParallel)
        
        merged <- bind_rows(a, ref[j,])
        new2 <- DNAStringSet(x = merged$seq, use.names = T)
        names(new2) <- as.character(merged$names)
        aln <- msaClustalOmega(new2, order = "input")
        #aln <- msa(new2)
        aln <- msaConvert(aln, type="seqinr::alignment")
        z2 <- as.matrix(dist.alignment(aln))
        
        z <- as.data.frame(cbind(as.character(s2c(aln$seq[1])), as.character(s2c(aln$seq[2]))))
        z$V1 <- as.character(z$V1)
        z$V2 <- as.character(z$V2)
        z$diff <- ifelse(z[,1] == z[,2], 0, 1)
        dist <- round((1- sum(z$diff)/nrow(z))*100, 1)
        dist
        
      }

Aligning Sequences Level 2: Aligning One Sequence Against a Database of Known Sequences in a Pairwise Manner

Now lets say we have one unknown sequence and want to know what sequence it has the highest percent identity to from a database of known sequences. In this example we will borrow a sequence from the database to be our “unknown” sequence so we expect it to match %100 with its reference database version. Further below we see it took about 1 minute to pairwise align this sequence against all 229 sequences and keep the top result based on percent identity.

a2 <- df2[1,] %>% rename(seq = sequence)
glimpse(a2)

## Rows: 1
## Columns: 3
## $ names <chr> ">MN999893|A/swine/Indiana/A02478980/2020|USA|Swine|2020-01-06|4~
## $ seq   <chr> "ATGAAGACTATCATTGCTTTTAGCTGCATTCTATGTCTGATTATCGCCCAAAAACTTCCCGGA~
## $ HN    <chr> "H3N2"

start = Sys.time()
ref <- df2[]%>% rename(seq = sequence)



final <- NULL
      
      for ( z in 1:nrow(a2)) {
        a <- as.data.frame(cbind(a2$seq[z], "input"))
        colnames(a) <- c("seq", "names")
        dist2 <- NULL
        #For each database sequence selected, do an individual allignment, add the allignment to the list, 
        #take the % identity, and at the end sort in descending order based on % identity to new sequence
        dist2 <- foreach(j = 1:nrow(ref), .combine = "rbind") %do%
          #parBlast is a custom function I made. Using %do% makes it run in serial.
          parBlast(j)
        
        
        
        
        row.names(dist2) <- ref$names
       
        
        colnames(dist2) <- "Percent Identity with Input DNA"
        dist2 <- as.data.frame(dist2)

        dist2 <- rownames_to_column(dist2, "Database Sequence")
        # dist2$Protein_ID <- ref$`Protein Type`
        # dist2$Strain <- ref$Strain
        # dist2$Country <- ref$Country
        # dist2$State <- ref$State
        # dist2$Date <- ref$Date
        # dist2$Input_Seq <- a2$name[z]
        
        dist2 <- dist2[order(-dist2[,2]),]
        dist2 <- dist2[1,]
        dist2 = cbind(`Input Sequence` = a2$names[z], dist2)
        
        
        final <- rbind(final, dist2)

      }
end = Sys.time()
start-end
## Time difference of -1.005542 mins

final

##                                                         Input Sequence
## 1 >MN999893|A/swine/Indiana/A02478980/2020|USA|Swine|2020-01-06|4|H3N2
##                                                      Database Sequence
## 1 >MN999893|A/swine/Indiana/A02478980/2020|USA|Swine|2020-01-06|4|H3N2
##   Percent Identity with Input DNA
## 1                             100

Aligning Sequences Level 3: Aligning Multiple New Sequences Against a Database of Known Sequences in a Pairwise Manner

Now lets say we have one unknown sequence and want to know what sequence it has the highest percent identity to from a database of known sequences. In this example we will borrow a sequence from the database to be our “unknown” sequence so we expect it to match %100 with its reference database version. Further below we see it took about 4 minutes to pairwise align these 5 sequences against all 229 sequences and keep the top result based on percent identity when running serially.

a2 <- df2[1:5,] %>% rename(seq = sequence)
glimpse(a2)

## Rows: 5
## Columns: 3
## $ names <chr> ">MN999893|A/swine/Indiana/A02478980/2020|USA|Swine|2020-01-06|4~
## $ seq   <chr> "ATGAAGACTATCATTGCTTTTAGCTGCATTCTATGTCTGATTATCGCCCAAAAACTTCCCGGA~
## $ HN    <chr> "H3N2", "H1N2", "H1N2", "H1N2", "H3N2"

start = Sys.time()
ref <- df2[]%>% rename(seq = sequence)



final <- NULL
      
      for ( z in 1:nrow(a2)) {
        a <- as.data.frame(cbind(a2$seq[z], "input"))
        colnames(a) <- c("seq", "names")
        dist2 <- NULL
        #For each database sequence selected, do an individual allignment, add the allignment to the list, 
        #take the % identity, and at the end sort in descending order based on % identity to new sequence
        dist2 <- foreach(j = 1:nrow(ref), .combine = "rbind") %do%
          #parBlast is a custom function I made. Using %do% makes it run in serial.
          parBlast(j)
        
        
        
        
        row.names(dist2) <- ref$names
       
        
        colnames(dist2) <- "Percent Identity with Input DNA"
        dist2 <- as.data.frame(dist2)

        dist2 <- rownames_to_column(dist2, "Database Sequence")
        # dist2$Protein_ID <- ref$`Protein Type`
        # dist2$Strain <- ref$Strain
        # dist2$Country <- ref$Country
        # dist2$State <- ref$State
        # dist2$Date <- ref$Date
        # dist2$Input_Seq <- a2$name[z]
        
        dist2 <- dist2[order(-dist2[,2]),]
        dist2 <- dist2[1,]
        dist2 = cbind(`Input Sequence` = a2$names[z], dist2)
        
        
        final <- rbind(final, dist2)

      }

end = Sys.time()
start-end
## Time difference of -4.278218 mins
final

##                                                           Input Sequence
## 1   >MN999893|A/swine/Indiana/A02478980/2020|USA|Swine|2020-01-06|4|H3N2
## 2      >MN999909|A/swine/Iowa/A02478968/2020|USA|Swine|2020-01-02|4|H1N2
## 3 >MT020038|A/swine/Minnesota/A02245378/2020|USA|Swine|2020-01-08|4|H1N2
## 4 >MT020040|A/swine/Minnesota/A02245375/2020|USA|Swine|2020-01-07|4|H1N2
## 5    >MT020046|A/swine/Kansas/A02245376/2020|USA|Swine|2020-01-07|4|H3N2
##                                                        Database Sequence
## 1   >MN999893|A/swine/Indiana/A02478980/2020|USA|Swine|2020-01-06|4|H3N2
## 2      >MN999909|A/swine/Iowa/A02478968/2020|USA|Swine|2020-01-02|4|H1N2
## 3 >MT020038|A/swine/Minnesota/A02245378/2020|USA|Swine|2020-01-08|4|H1N2
## 4 >MT020040|A/swine/Minnesota/A02245375/2020|USA|Swine|2020-01-07|4|H1N2
## 5    >MT020046|A/swine/Kansas/A02245376/2020|USA|Swine|2020-01-07|4|H3N2
##   Percent Identity with Input DNA
## 1                             100
## 2                             100
## 3                             100
## 4                             100
## 5                             100

Aligning Sequences Level 4: Making Sequence Alignment Faster With Parallelization

Now lets say we have one unknown sequence and want to know what sequence it has the highest percent identity to from a database of known sequences. In this example we will borrow a sequence from the database to be our “unknown” sequence so we expect it to match %100 with its reference database version. Further below we see it took about 1.5 minutes to pairwise align these 5 sequences against all 229 sequences and keep the top result based on percent identity when running in parallel.

a2 <- df2[1:5,] %>% rename(seq = sequence)
glimpse(a2)

## Rows: 5
## Columns: 3
## $ names <chr> ">MN999893|A/swine/Indiana/A02478980/2020|USA|Swine|2020-01-06|4~
## $ seq   <chr> "ATGAAGACTATCATTGCTTTTAGCTGCATTCTATGTCTGATTATCGCCCAAAAACTTCCCGGA~
## $ HN    <chr> "H3N2", "H1N2", "H1N2", "H1N2", "H3N2"

start = Sys.time()
ref <- df2[]%>% rename(seq = sequence)

      stopImplicitCluster()
      no_cores <- detectCores() 
      cl <- makeCluster(no_cores)
      registerDoParallel(cl)

final <- NULL
      
      for ( z in 1:nrow(a2)) {
        a <- as.data.frame(cbind(a2$seq[z], "input"))
        colnames(a) <- c("seq", "names")
        dist2 <- NULL
        #For each database sequence selected, do an individual allignment, add the allignment to the list, 
        #take the % identity, and at the end sort in descending order based on % identity to new sequence
        dist2 <- foreach(j = 1:nrow(ref), .combine = "rbind") %dopar%
          #parBlast is a custom function I made. Using %do% makes it run in serial.
          parBlast(j)
        
        
        
        
        row.names(dist2) <- ref$names
       
        
        colnames(dist2) <- "Percent Identity with Input DNA"
        dist2 <- as.data.frame(dist2)

        dist2 <- rownames_to_column(dist2, "Database Sequence")
        # dist2$Protein_ID <- ref$`Protein Type`
        # dist2$Strain <- ref$Strain
        # dist2$Country <- ref$Country
        # dist2$State <- ref$State
        # dist2$Date <- ref$Date
        # dist2$Input_Seq <- a2$name[z]
        
        dist2 <- dist2[order(-dist2[,2]),]
        dist2 <- dist2[1,]
        dist2 = cbind(`Input Sequence` = a2$names[z], dist2)
        
        
        final <- rbind(final, dist2)

      }
end = Sys.time()
start-end
## Time difference of -1.415135 mins
final
##                                                           Input Sequence
## 1   >MN999893|A/swine/Indiana/A02478980/2020|USA|Swine|2020-01-06|4|H3N2
## 2      >MN999909|A/swine/Iowa/A02478968/2020|USA|Swine|2020-01-02|4|H1N2
## 3 >MT020038|A/swine/Minnesota/A02245378/2020|USA|Swine|2020-01-08|4|H1N2
## 4 >MT020040|A/swine/Minnesota/A02245375/2020|USA|Swine|2020-01-07|4|H1N2
## 5    >MT020046|A/swine/Kansas/A02245376/2020|USA|Swine|2020-01-07|4|H3N2
##                                                        Database Sequence
## 1   >MN999893|A/swine/Indiana/A02478980/2020|USA|Swine|2020-01-06|4|H3N2
## 2      >MN999909|A/swine/Iowa/A02478968/2020|USA|Swine|2020-01-02|4|H1N2
## 3 >MT020038|A/swine/Minnesota/A02245378/2020|USA|Swine|2020-01-08|4|H1N2
## 4 >MT020040|A/swine/Minnesota/A02245375/2020|USA|Swine|2020-01-07|4|H1N2
## 5    >MT020046|A/swine/Kansas/A02245376/2020|USA|Swine|2020-01-07|4|H3N2
##   Percent Identity with Input DNA
## 1                             100
## 2                             100
## 3                             100
## 4                             100
## 5                             100

Aligning Sequences Level 5: Making an App to Automate Sequence Alignment

Conclusion

If you are in a situation where the volume of alignments you have to do makes using public alignments resources unrealistic, or you don’t want to risk having your sequence information on other peoples servers, making custom alignments with R may be the best option for you.

James Young