\documentclass[a4paper]{article}
\usepackage{haskell}
\begin{document}
\section{Alignment Algorithms}
Support module for the Align module.
This document describes sevaral implementations of shortest edit distance, using
variations over dynamic programming.
It is also a literate haskell module which implements them.
\begin{code}
{-# OPTIONS_GHC -fglasgow-exts #-} -- \end{code}\begin{code}
module Bio.Sequence.AlignBase where
import Data.Array hiding (index)
import Data.ByteString.Char8 (ByteString,pack)
import qualified Data.ByteString.Char8 as B
import Data.List (maximumBy,unfoldr,lookup,elemIndex)
import Data.Maybe (fromJust)
import Test.QuickCheck
import Data.Char
-- | data type for an Alignment -- others are certainly possible
data Edit = Ins Char | Del Char | Repl Char Char deriving (Show,Eq)
type Alignment = [Edit]
showalign a = let (s1,s2) = toStrings a in s1++"\n"++s2
-- | turn an alignment into sequences with '-' representing gaps
-- (for checking, filtering out the '-' characters should return
-- the original sequences, provided '-' isn't part of the sequence
-- alphabet)
toStrings :: Alignment -> (String,String)
toStrings [] = ("","")
toStrings (x:xs) = let (a1',a2') = toStrings xs
in case x of Ins c -> ('-':a1', c:a2')
Del c -> (c:a1', '-':a2')
Repl c1 c2 -> (c1:a1', c2:a2')
\end{code}
\subsection{Scoring alignments}
In order to talk about ``optimal'' alignments, we need to define
an alignment score. Typically, we assign a constant, negative score to gaps,
and a score to replacement that depends on the actual characters being
replaced. The alignment cost is then the sum of costs.
\begin{code}
type ScoreFunc a = Edit -> a
eval_align :: Num a => ScoreFunc a -> Alignment -> a
eval_align f = sum . map f
-- scores for equal characters, gaps, and replacements
simple_eval :: Num a => (a,a,a) -> ScoreFunc a
simple_eval (eq,_,rep) (Repl x y) = if x==y then eq else rep
simple_eval (_,gp,_) _ = gp
-- more advanced scoring could use a similarity matrix to calculate
-- the cost or replacement
type SubstMx a = Char -> Char -> a
gen_eval :: Num a => SubstMx a -> a -> ScoreFunc a
gen_eval mx _ (Repl x y) = mx x y
gen_eval _ gp _ = gp
-- using gen_eval to define an alternative simple_eval
simple_eval2 (eq,gp,rep) = gen_eval rf gp where rf x y = if x==y then eq else rep
-- | Calculating the score of an alignment using one set of parameter values
score_align_1 :: Alignment -> Int
score_align_1 = eval_align (simple_eval (2,-1,-2))
\end{code}
\subsection{QC properties}
\begin{code}
instance Arbitrary Char where
arbitrary = elements "ACGT"
coarbitrary n = variant (ord n)
instance Arbitrary ByteString where
arbitrary = arbitrary >>= return . pack
coarbitrary = undefined
-- generic checking an alignment for containing all the correct characters
qc_align af s1 s2 = let (a1,a2) = toStrings $ af s1 s2
rep = pack . filter (/='-')
in rep a1 == s1 && rep a2 == s2
\end{code}
\section{A straightforward, specific implementations}
Needleman-Wunsch implements global alignment. This is the textbook
implementation, and it populates an $n\times m$ array of partial
alignment scores, and backtracks to determine the actual alignment.
First, we do the direct, exponential time recurrence as an
illustration. Observe that we can construct the partial alignment
of s1[0..i] and s2[0..j] in three ways, ending with either inserting
s2[j], deleting s1[i], or replacing s1[i] with s2[j].
This gives us the following recursion:
\begin{code}
needleman_wunsch_score_0 (eq,gp,rep) s1 s2 = cost (B.length s1,B.length s2)
where cost (0,0) = 0
cost (0,j) = j*gp
cost (i,0) = i*gp
cost (i,j) = maximum [cost (i-1,j) + gp
,cost (i,j-1) + gp
,cost (i-1,j-1)+if B.index s1 (i-1) == B.index s2 (j-1) then eq else rep ]
\end{code}
\section{Dynamic Programming}
To avoid the exponential time of the recurrence, we observe that since $i$ runs from 0 to $n$ and $j$ from
0 to $m$, we actually only calculate $(n+1)(m+1)$ values. We can thus do this in quadratic time by
memoizing the function values -- memoization of a recurrence is, for reasons unknown, called
\emph{dynamic programming}.
\begin{code}
-- Straightforward N-W
needleman_wunsch_score_1 prop s1 s2 = (nwmatrix prop s1 s2)!(B.length s1,B.length s2)
needleman_wunsch_1 (eq,gp,rep) s1 s2 = nwbacktrack (eq,gp,rep) (nwmatrix (eq,gp,rep) s1 s2) s1 s2
-- check recovery
prop_nw1 = qc_align (needleman_wunsch_1 (2,-1,-2))
-- simple DP implementation, recursive definition of the array
nwmatrix (eq,gp,rep) s1 s2 =
let a = array ((0,0),(B.length s1,B.length s2)) [((i,j),score i j) | i <- [0..B.length s1], j <- [0..B.length s2]]
score i j = if i == 0 then j*gp else if j == 0 then i*gp else
maximum [a!(i-1,j-1)+if B.index s1 (i-1) == B.index s2 (j-1) then eq else rep
,a!(i-1,j)+gp, a!(i,j-1)+gp]
in a
-- backtracking does, in essence, the opposite
nwbacktrack (eq,gp,rep) a x y = let (_,(i,j)) = bounds a in Prelude.reverse $ mka' (i,j)
where mka' (0,0) = []
mka' (i,0) = Prelude.map (Del . B.index x) (reverse [0..i-1])
mka' (0,j) = Prelude.map (Ins . B.index y) (reverse [0..j-1])
mka' (i,j) | a!(i,j)-a!(i-1,j-1) == eq && (B.index x (i-1) == B.index y (j-1))
|| a!(i,j)-a!(i-1,j-1) == rep && (B.index x (i-1) /= B.index y (j-1))
= Repl (B.index x (i-1)) (B.index y (j-1)) : mka' (i-1,j-1)
| a!(i,j)-a!(i,j-1) == gp = Ins (B.index y (j-1)): mka' (i,j-1)
| a!(i,j)-a!(i-1,j) == gp = Del (B.index x (i-1)): mka' (i-1,j)
| otherwise = error (show (i,j,a))
-- QuickCheck properties to verify that the scores are correct
-- Compare the score of the alignment to the score in the last position in the DP matrix
prop_NW_1_score s1 s2 = let parms = (2,-1,-2)
in score_align_1 (needleman_wunsch_1 parms s1 s2) == (nwmatrix parms s1 s2)!(B.length s1,B.length s2)
-- Compare against the recurrence -- the exponential version is slow, so we need to cap the string lengths
prop_NW_01_score s1' s2' = let (s1,s2) = (B.take 5 s1', B.take 5 s2')
parms = (2,-1,-2)
in needleman_wunsch_score_0 parms s1 s2 == (nwmatrix parms s1 s2)!(B.length s1,B.length s2)
\end{code}
\section{Use a generalized scoring function}
Scoring functions are often more complex, typically an amino acid is more
likely to be substituted by some candidates than by others. The substitution
scores are collected in substitution matrices (e.g. BLOSUM), but we will
just use a generic function that calculates the score from a number of possible
edits.
For good measure, we'll also provide local alignments (i.e. Smith-Waterman).
This is similar to Needleman-Wunsch, but replaces any negative score with zero,
and starts backtracking from the maximal value in the DP matrix.
\begin{code}
-- | a selector function picks a value from a list of edits
type Selector a = [(a,Edit)] -> a
-- | score is built from a selector and an evaluator
-- note that we assume sum as the total score
score :: Num a => ([a]->a) -> (Edit->a) -> Selector a
score select evaluate = select . Prelude.map (\(a,e) -> evaluate e + a)
-- | Needleman-Wunsch (global) and Smith-Waterman (local) alignment selectors
-- using the 'simple_eval' scoring function
nws, sws :: (Ord a,Num a) => (a,a,a) -> Selector a
nws ps = score (\xs -> if Prelude.null xs then 0 else maximum xs) (simple_eval ps)
sws ps = score (maximum . (0:)) (simple_eval ps)
\end{code}
More generally, we can allow an arbitrary replacement function.
This lets the user select an optimal function representing a
substitution matrix - it could be an alist, an associative map,
or an array.
\begin{code}
mxscore :: (Ord a,Num a) => SubstMx a -> a -> Selector a
mxscore mx g = score maximum (gen_eval mx g)
-- simple_mx is equivalent to regular NW with repl cost -2 and equality reward 2
simple_mx :: (Ord a,Num a) => SubstMx a
simple_mx x y = let cost = [((a,b),if a==b then 2 else -2) | a <- "ACGT", b<-"ACGT"]
in maybe (error "no such character") id . flip Data.List.lookup cost $ (x,y)
-- compare simple_mx subst matrix with regular NW
prop_subst_mx s1 s2 = let mx = mxscore simple_mx (negate 1)
nw = nws (2,-1,-2)
in mkcolumns mx s1 s2 == mkcolumns nw s1 s2
\end{code}
Now we are ready to implement the DP matrix and backtracking, making use of these functions:
\begin{code}
-- building the DP array using the selector function to select the optimal edit
mkarray :: Selector a -> ByteString -> ByteString -> Array (Int,Int) a
mkarray f s1 s2 = let a = array ((0,0),(B.length s1,B.length s2))
[((i,j),score i j) | i <- [0..B.length s1], j <- [0..B.length s2]]
score i j = let (xi,yi) = (B.index s1 (i-1),B.index s2 (j-1))
edits = Prelude.concat [if i>0 then [(a!(i-1,j),Del xi)] else []
,if j>0 then [(a!(i,j-1),Ins yi)] else []
,if i>0 && j>0 then [(a!(i-1,j-1),Repl xi yi)] else []]
in f edits
in a
-- backtracking, again using the selector function to select the correct path
-- also, we allow an optional starting point and termination condition to allow local alignments
backtrackFrom :: (Eq a, Ord a)
=> Maybe ((Int,Int),a->Bool) -> Selector a -> Array (Int,Int) a -> ByteString -> ByteString -> Alignment
backtrackFrom local f a s1 s2 = let (_,(i,j)) = bounds a
in Prelude.reverse $ mka' (maybe (const False) snd local) (maybe (i,j) fst local)
where mka' cr (0,0) = []
mka' cr (i,0) = Prelude.map (Del . B.index s1) (reverse [0..i-1])
mka' cr (0,j) = Prelude.map (Ins . B.index s2) (reverse [0..j-1])
mka' cr (i,j) = if cr (a!(i,j)) then []
else let (xi,yi) = (B.index s1 (i-1),B.index s2 (j-1))
in if a!(i,j) == f [(a!(i-1,j-1),Repl xi yi)] then Repl xi yi : mka' cr (i-1,j-1)
else if a!(i,j) == f [(a!(i,j-1),Ins yi)] then Ins yi : mka' cr (i,j-1)
else if a!(i,j) == f [(a!(i-1,j),Del xi)] then Del xi : mka' cr (i-1,j)
else error ("mka failed at "++show ((i,j),xi,yi))
-- using the generalized functions to do N-W and S-W
needleman_wunsch, smith_waterman :: (Int,Int,Int) -> ByteString -> ByteString -> Alignment
needleman_wunsch ps x y = global_align (nws ps) x y
smith_waterman ps x y = local_align (sws ps) x y
-- check string recovery
prop_nw = qc_align (needleman_wunsch (2,-1,-2))
global_align, local_align :: (Num a,Ord a) => Selector a -> ByteString -> ByteString -> Alignment
-- align from corner to corner
global_align f s1 s2 = let a = mkarray f s1 s2
in backtrackFrom Nothing f a s1 s2
-- align from the highest score, backtrack until score <= 0
local_align f s1 s2 = let a = mkarray f s1 s2
start = Just (fst $ maximumBy (compare `on` snd) $ assocs a,(<=0))
in backtrackFrom start f a s1 s2
on cmp f = \x y -> cmp (f x) (f y)
-- note that this also checks the bias in the backtracking (preference in case of equal scoring paths)
prop_NW1_NW s1 s2 = let parms = (2,-1,-2)
in needleman_wunsch_1 parms s1 s2 == needleman_wunsch parms s1 s2
{-
-- affine NW scoring function
-- this won't work, as backtracking must consider only one of the values!
nwsa :: (Ord a,Num a) => (a,a,a,a) -> Selector (a,a)
nwsa ps = score (\xs -> if null xs then (0,0) else mymax xs) (affinescore ps)
where mymax xs = (maximum $ map fst xs, maximum $ map snd xs)
-- parameters: equal, gap open, gap extend, replacement
affinescore :: Num a => (a,a,a,a) -> [(a,a),Edit] -> (a,a)
affinescore (_,gp,ge,_) ((sg,sc),Del _) = (max (sg+ge) (sc+gp),
affinescore (_,gp,ge,_) ((sg,sc),Ins _) = (sg+ge,sc+gp)
affinescore (eq,_,_,rep) ((sg,sc),Repl x y) = let c = if x==y then eq else rep in (sg+c,sc+c)
-}
\end{code}
\section{Linear space algorithm}
If you're only interested in the actual score, and not the entire alignment,
an O(min(n,m)) space method is possible by storing only the current column of
the alignment matrix.
We take the liberty of producing a list of columns -- this way, we can either
retain the entire list for backtracking, or apply last to keep only the final
column.
We'll add a parameter for functions deconstructing the string, which we'll make
use of later.
\begin{code}
type HeadTail = (ByteString->Char, ByteString->ByteString)
mkcolumns = mkcolumns' (B.head, B.tail)
mkcolumns' :: forall a . (Num a,Ord a) => HeadTail -> Selector a -> ByteString -> ByteString -> [[a]]
mkcolumns' (hd,tl) f s1 s2 =
let c0 = 0 : map f (zipWith (\x y -> [(x,y)]) c0 (map Ins (elts s2)))
elts s = if B.null s then [] else hd s : elts (tl s) -- B.unpack in the forward case
mkcol :: ([a],ByteString) -> Maybe ([a],([a],ByteString))
mkcol (prev,x) = if B.null x then Nothing
else let xi = hd x
ys = elts s2
c = f [(Prelude.head prev,Del xi)] : newcol prev x
newcol prev x' = [f [del,ins,rep] | del <- zip (Prelude.tail prev) $ repeat (Del xi)
| ins <- zip c $ map Ins ys
| rep <- zip prev $ map (Repl xi) ys]
in Just (c,(c,tl x))
in c0 : Data.List.unfoldr mkcol (c0,s1)
-- compare mkcolumns to mkarray
prop_mkcolumns_mkarray x y = concat (mkcolumns (nws (2,-2,-1)) x y) == elems (mkarray (nws (2,-2,-1)) x y)
-- switching the strings transposes the matrix
prop_mkcolumns_transpose s1 s2 = mkcolumns f s1 s2 == transpose (mkcolumns f s2 s1)
where transpose ([]:_) = []
transpose xs = map Prelude.head xs : transpose (map Prelude.tail xs)
f = nws (3,-1,-2)
prop_mkcolumns_score s1 s2 = (last $ last $ mkcolumns f s1 s2) == needleman_wunsch_score_1 (2,-1,-2) s1 s2
where f = nws (2,-1,-2)
-- quis custodet... this always fails when equal chars are aligned
-- prop_fails s1 s2 = mkcolumns (nws (3,-2,-3)) s1 s2 == mkcolumns (nws (2,-2,-3)) s1 s2
\end{code}
We could backtrack over the set of columns, but there is another option.
If we calculate the alignment scores of the first half of s1 against s2
(i.e. the n/2th column), and the alignment score of the reversed second
half of s1 against s2 (i.e. going from the bottom rigth corner instead of
the top left), we can sum the columns, and find the maximum alignment score.
\begin{code}
nw3score f s1 s2 = let n = B.length s1
s1a = B.take (n `div` 2) s1
s1b = B.reverse $ B.drop (n `div` 2) s1
s2r = B.reverse s2
col = zipWith (+) (last $ mkcolumns f s1a s2)
(reverse $ last $ mkcolumns f s1b s2r)
in maximum col
prop_nw3score s1 s2 = nw3score (nws (2,-1,-2)) s1 s2 == needleman_wunsch_score_1 (2,-1,-2) s1 s2
\end{code}
One observation one is tempted to make, is that the optimal global alignment
has to pass through the cell containing the maximum (say it's cell $j$ in the column).
Which means we can then divide and conquer by aligning s1[0..n/2] with s2[0..j-1]
and s1[n/2+1..n-1] with s2[j..m-1], and concatenate the two alignments to construct
the complete global alignments.
Now, all those reverses make the code convoluted as well as inefficient. Especially when we want
to iterate. So let's make use of the HeadTail parameter to mkcolumns' to do the reversals.
\begin{code}
prop_revcols s1 s2 = mkcolumns f (B.reverse s1) (B.reverse s2) == mkcolumns' (B.last,B.init) f s1 s2
where f = nws (2,-1,-2)
nw4score f s1 s2 = let (s1a,s1b) = B.splitAt (B.length s1 `div` 2) s1
col = zipWith (+) (last $ mkcolumns' (B.head,B.tail) f s1a s2)
(reverse $ last $ mkcolumns' (B.last,B.init) f s1b s2)
in maximum col
prop_nw4score s1 s2 = nw4score (nws (2,-1,-2)) s1 s2 == needleman_wunsch_score_1 (2,-1,-2) s1 s2
\end{code}
Iterate to build the alignment.
\begin{code}
nw4 f s1 s2 = case (B.length s1,B.length s2) of
(0,_) -> map Ins (B.unpack s2)
(_,0) -> map Del (B.unpack s1)
(1,n) -> global_align f s1 s2 -- fall back to traditional
(_,_) -> let n = (1+B.length s1) `div` 2
(s1a,s1b) = B.splitAt n s1
(s2a,s2b) = B.splitAt (nw4' f s1a s1b s2) s2
in nw4 f s1a s2a ++ nw4 f s1b s2b
nw4' f s1a s1b s2 = let col = zipWith (+) (last $ mkcolumns' (B.head,B.tail) f s1a s2)
(reverse $ last $ mkcolumns' (B.last,B.init) f s1b s2)
in fromJust $ Data.List.elemIndex (maximum col) col
-- this algorithm has a different bias than the previous algorithms, and will
-- produce different (but equally valid) alignments
prop_nw_nw4 = qc_align (nw4 $ nws (2,-1,-2))
prop_nw_nw4_score s1 s2 = let f = nws (2,-1,-2)
in (score_align_1 $ nw4 f s1 s2) == (nw4score f s1 s2)
\end{code}
Local alignment can be implemented by searching the matrix forward to find the last position
of the alignment, and then searching on the reversed strings to find the starting point. With
the substrings identified, it becomes a simple global alignment.
(Is there a better way?)
\end{document}