\documentclass[a4paper]{article} \usepackage{haskell} \begin{document} \section{Alignment Algorithms} Support module for the Align module. This document describes sevaral implementations of shortest edit distance, using variations over dynamic programming. It is also a literate haskell module which implements them. \begin{code} {-# OPTIONS_GHC -fglasgow-exts #-} -- \end{code}\begin{code} module Bio.Sequence.AlignBase where import Data.Array hiding (index) import Data.ByteString.Char8 (ByteString,pack) import qualified Data.ByteString.Char8 as B import Data.List (maximumBy,unfoldr,lookup,elemIndex) import Data.Maybe (fromJust) import Test.QuickCheck import Data.Char -- | data type for an Alignment -- others are certainly possible data Edit = Ins Char | Del Char | Repl Char Char deriving (Show,Eq) type Alignment = [Edit] showalign a = let (s1,s2) = toStrings a in s1++"\n"++s2 -- | turn an alignment into sequences with '-' representing gaps -- (for checking, filtering out the '-' characters should return -- the original sequences, provided '-' isn't part of the sequence -- alphabet) toStrings :: Alignment -> (String,String) toStrings [] = ("","") toStrings (x:xs) = let (a1',a2') = toStrings xs in case x of Ins c -> ('-':a1', c:a2') Del c -> (c:a1', '-':a2') Repl c1 c2 -> (c1:a1', c2:a2') \end{code} \subsection{Scoring alignments} In order to talk about ``optimal'' alignments, we need to define an alignment score. Typically, we assign a constant, negative score to gaps, and a score to replacement that depends on the actual characters being replaced. The alignment cost is then the sum of costs. \begin{code} type ScoreFunc a = Edit -> a eval_align :: Num a => ScoreFunc a -> Alignment -> a eval_align f = sum . map f -- scores for equal characters, gaps, and replacements simple_eval :: Num a => (a,a,a) -> ScoreFunc a simple_eval (eq,_,rep) (Repl x y) = if x==y then eq else rep simple_eval (_,gp,_) _ = gp -- more advanced scoring could use a similarity matrix to calculate -- the cost or replacement type SubstMx a = Char -> Char -> a gen_eval :: Num a => SubstMx a -> a -> ScoreFunc a gen_eval mx _ (Repl x y) = mx x y gen_eval _ gp _ = gp -- using gen_eval to define an alternative simple_eval simple_eval2 (eq,gp,rep) = gen_eval rf gp where rf x y = if x==y then eq else rep -- | Calculating the score of an alignment using one set of parameter values score_align_1 :: Alignment -> Int score_align_1 = eval_align (simple_eval (2,-1,-2)) \end{code} \subsection{QC properties} \begin{code} instance Arbitrary Char where arbitrary = elements "ACGT" coarbitrary n = variant (ord n) instance Arbitrary ByteString where arbitrary = arbitrary >>= return . pack coarbitrary = undefined -- generic checking an alignment for containing all the correct characters qc_align af s1 s2 = let (a1,a2) = toStrings $ af s1 s2 rep = pack . filter (/='-') in rep a1 == s1 && rep a2 == s2 \end{code} \section{A straightforward, specific implementations} Needleman-Wunsch implements global alignment. This is the textbook implementation, and it populates an $n\times m$ array of partial alignment scores, and backtracks to determine the actual alignment. First, we do the direct, exponential time recurrence as an illustration. Observe that we can construct the partial alignment of s1[0..i] and s2[0..j] in three ways, ending with either inserting s2[j], deleting s1[i], or replacing s1[i] with s2[j]. This gives us the following recursion: \begin{code} needleman_wunsch_score_0 (eq,gp,rep) s1 s2 = cost (B.length s1,B.length s2) where cost (0,0) = 0 cost (0,j) = j*gp cost (i,0) = i*gp cost (i,j) = maximum [cost (i-1,j) + gp ,cost (i,j-1) + gp ,cost (i-1,j-1)+if B.index s1 (i-1) == B.index s2 (j-1) then eq else rep ] \end{code} \section{Dynamic Programming} To avoid the exponential time of the recurrence, we observe that since $i$ runs from 0 to $n$ and $j$ from 0 to $m$, we actually only calculate $(n+1)(m+1)$ values. We can thus do this in quadratic time by memoizing the function values -- memoization of a recurrence is, for reasons unknown, called \emph{dynamic programming}. \begin{code} -- Straightforward N-W needleman_wunsch_score_1 prop s1 s2 = (nwmatrix prop s1 s2)!(B.length s1,B.length s2) needleman_wunsch_1 (eq,gp,rep) s1 s2 = nwbacktrack (eq,gp,rep) (nwmatrix (eq,gp,rep) s1 s2) s1 s2 -- check recovery prop_nw1 = qc_align (needleman_wunsch_1 (2,-1,-2)) -- simple DP implementation, recursive definition of the array nwmatrix (eq,gp,rep) s1 s2 = let a = array ((0,0),(B.length s1,B.length s2)) [((i,j),score i j) | i <- [0..B.length s1], j <- [0..B.length s2]] score i j = if i == 0 then j*gp else if j == 0 then i*gp else maximum [a!(i-1,j-1)+if B.index s1 (i-1) == B.index s2 (j-1) then eq else rep ,a!(i-1,j)+gp, a!(i,j-1)+gp] in a -- backtracking does, in essence, the opposite nwbacktrack (eq,gp,rep) a x y = let (_,(i,j)) = bounds a in Prelude.reverse $ mka' (i,j) where mka' (0,0) = [] mka' (i,0) = Prelude.map (Del . B.index x) (reverse [0..i-1]) mka' (0,j) = Prelude.map (Ins . B.index y) (reverse [0..j-1]) mka' (i,j) | a!(i,j)-a!(i-1,j-1) == eq && (B.index x (i-1) == B.index y (j-1)) || a!(i,j)-a!(i-1,j-1) == rep && (B.index x (i-1) /= B.index y (j-1)) = Repl (B.index x (i-1)) (B.index y (j-1)) : mka' (i-1,j-1) | a!(i,j)-a!(i,j-1) == gp = Ins (B.index y (j-1)): mka' (i,j-1) | a!(i,j)-a!(i-1,j) == gp = Del (B.index x (i-1)): mka' (i-1,j) | otherwise = error (show (i,j,a)) -- QuickCheck properties to verify that the scores are correct -- Compare the score of the alignment to the score in the last position in the DP matrix prop_NW_1_score s1 s2 = let parms = (2,-1,-2) in score_align_1 (needleman_wunsch_1 parms s1 s2) == (nwmatrix parms s1 s2)!(B.length s1,B.length s2) -- Compare against the recurrence -- the exponential version is slow, so we need to cap the string lengths prop_NW_01_score s1' s2' = let (s1,s2) = (B.take 5 s1', B.take 5 s2') parms = (2,-1,-2) in needleman_wunsch_score_0 parms s1 s2 == (nwmatrix parms s1 s2)!(B.length s1,B.length s2) \end{code} \section{Use a generalized scoring function} Scoring functions are often more complex, typically an amino acid is more likely to be substituted by some candidates than by others. The substitution scores are collected in substitution matrices (e.g. BLOSUM), but we will just use a generic function that calculates the score from a number of possible edits. For good measure, we'll also provide local alignments (i.e. Smith-Waterman). This is similar to Needleman-Wunsch, but replaces any negative score with zero, and starts backtracking from the maximal value in the DP matrix. \begin{code} -- | a selector function picks a value from a list of edits type Selector a = [(a,Edit)] -> a -- | score is built from a selector and an evaluator -- note that we assume sum as the total score score :: Num a => ([a]->a) -> (Edit->a) -> Selector a score select evaluate = select . Prelude.map (\(a,e) -> evaluate e + a) -- | Needleman-Wunsch (global) and Smith-Waterman (local) alignment selectors -- using the 'simple_eval' scoring function nws, sws :: (Ord a,Num a) => (a,a,a) -> Selector a nws ps = score (\xs -> if Prelude.null xs then 0 else maximum xs) (simple_eval ps) sws ps = score (maximum . (0:)) (simple_eval ps) \end{code} More generally, we can allow an arbitrary replacement function. This lets the user select an optimal function representing a substitution matrix - it could be an alist, an associative map, or an array. \begin{code} mxscore :: (Ord a,Num a) => SubstMx a -> a -> Selector a mxscore mx g = score maximum (gen_eval mx g) -- simple_mx is equivalent to regular NW with repl cost -2 and equality reward 2 simple_mx :: (Ord a,Num a) => SubstMx a simple_mx x y = let cost = [((a,b),if a==b then 2 else -2) | a <- "ACGT", b<-"ACGT"] in maybe (error "no such character") id . flip Data.List.lookup cost $ (x,y) -- compare simple_mx subst matrix with regular NW prop_subst_mx s1 s2 = let mx = mxscore simple_mx (negate 1) nw = nws (2,-1,-2) in mkcolumns mx s1 s2 == mkcolumns nw s1 s2 \end{code} Now we are ready to implement the DP matrix and backtracking, making use of these functions: \begin{code} -- building the DP array using the selector function to select the optimal edit mkarray :: Selector a -> ByteString -> ByteString -> Array (Int,Int) a mkarray f s1 s2 = let a = array ((0,0),(B.length s1,B.length s2)) [((i,j),score i j) | i <- [0..B.length s1], j <- [0..B.length s2]] score i j = let (xi,yi) = (B.index s1 (i-1),B.index s2 (j-1)) edits = Prelude.concat [if i>0 then [(a!(i-1,j),Del xi)] else [] ,if j>0 then [(a!(i,j-1),Ins yi)] else [] ,if i>0 && j>0 then [(a!(i-1,j-1),Repl xi yi)] else []] in f edits in a -- backtracking, again using the selector function to select the correct path -- also, we allow an optional starting point and termination condition to allow local alignments backtrackFrom :: (Eq a, Ord a) => Maybe ((Int,Int),a->Bool) -> Selector a -> Array (Int,Int) a -> ByteString -> ByteString -> Alignment backtrackFrom local f a s1 s2 = let (_,(i,j)) = bounds a in Prelude.reverse $ mka' (maybe (const False) snd local) (maybe (i,j) fst local) where mka' cr (0,0) = [] mka' cr (i,0) = Prelude.map (Del . B.index s1) (reverse [0..i-1]) mka' cr (0,j) = Prelude.map (Ins . B.index s2) (reverse [0..j-1]) mka' cr (i,j) = if cr (a!(i,j)) then [] else let (xi,yi) = (B.index s1 (i-1),B.index s2 (j-1)) in if a!(i,j) == f [(a!(i-1,j-1),Repl xi yi)] then Repl xi yi : mka' cr (i-1,j-1) else if a!(i,j) == f [(a!(i,j-1),Ins yi)] then Ins yi : mka' cr (i,j-1) else if a!(i,j) == f [(a!(i-1,j),Del xi)] then Del xi : mka' cr (i-1,j) else error ("mka failed at "++show ((i,j),xi,yi)) -- using the generalized functions to do N-W and S-W needleman_wunsch, smith_waterman :: (Int,Int,Int) -> ByteString -> ByteString -> Alignment needleman_wunsch ps x y = global_align (nws ps) x y smith_waterman ps x y = local_align (sws ps) x y -- check string recovery prop_nw = qc_align (needleman_wunsch (2,-1,-2)) global_align, local_align :: (Num a,Ord a) => Selector a -> ByteString -> ByteString -> Alignment -- align from corner to corner global_align f s1 s2 = let a = mkarray f s1 s2 in backtrackFrom Nothing f a s1 s2 -- align from the highest score, backtrack until score <= 0 local_align f s1 s2 = let a = mkarray f s1 s2 start = Just (fst $ maximumBy (compare `on` snd) $ assocs a,(<=0)) in backtrackFrom start f a s1 s2 on cmp f = \x y -> cmp (f x) (f y) -- note that this also checks the bias in the backtracking (preference in case of equal scoring paths) prop_NW1_NW s1 s2 = let parms = (2,-1,-2) in needleman_wunsch_1 parms s1 s2 == needleman_wunsch parms s1 s2 {- -- affine NW scoring function -- this won't work, as backtracking must consider only one of the values! nwsa :: (Ord a,Num a) => (a,a,a,a) -> Selector (a,a) nwsa ps = score (\xs -> if null xs then (0,0) else mymax xs) (affinescore ps) where mymax xs = (maximum $ map fst xs, maximum $ map snd xs) -- parameters: equal, gap open, gap extend, replacement affinescore :: Num a => (a,a,a,a) -> [(a,a),Edit] -> (a,a) affinescore (_,gp,ge,_) ((sg,sc),Del _) = (max (sg+ge) (sc+gp), affinescore (_,gp,ge,_) ((sg,sc),Ins _) = (sg+ge,sc+gp) affinescore (eq,_,_,rep) ((sg,sc),Repl x y) = let c = if x==y then eq else rep in (sg+c,sc+c) -} \end{code} \section{Linear space algorithm} If you're only interested in the actual score, and not the entire alignment, an O(min(n,m)) space method is possible by storing only the current column of the alignment matrix. We take the liberty of producing a list of columns -- this way, we can either retain the entire list for backtracking, or apply last to keep only the final column. We'll add a parameter for functions deconstructing the string, which we'll make use of later. \begin{code} type HeadTail = (ByteString->Char, ByteString->ByteString) mkcolumns = mkcolumns' (B.head, B.tail) mkcolumns' :: forall a . (Num a,Ord a) => HeadTail -> Selector a -> ByteString -> ByteString -> [[a]] mkcolumns' (hd,tl) f s1 s2 = let c0 = 0 : map f (zipWith (\x y -> [(x,y)]) c0 (map Ins (elts s2))) elts s = if B.null s then [] else hd s : elts (tl s) -- B.unpack in the forward case mkcol :: ([a],ByteString) -> Maybe ([a],([a],ByteString)) mkcol (prev,x) = if B.null x then Nothing else let xi = hd x ys = elts s2 c = f [(Prelude.head prev,Del xi)] : newcol prev x newcol prev x' = [f [del,ins,rep] | del <- zip (Prelude.tail prev) $ repeat (Del xi) | ins <- zip c $ map Ins ys | rep <- zip prev $ map (Repl xi) ys] in Just (c,(c,tl x)) in c0 : Data.List.unfoldr mkcol (c0,s1) -- compare mkcolumns to mkarray prop_mkcolumns_mkarray x y = concat (mkcolumns (nws (2,-2,-1)) x y) == elems (mkarray (nws (2,-2,-1)) x y) -- switching the strings transposes the matrix prop_mkcolumns_transpose s1 s2 = mkcolumns f s1 s2 == transpose (mkcolumns f s2 s1) where transpose ([]:_) = [] transpose xs = map Prelude.head xs : transpose (map Prelude.tail xs) f = nws (3,-1,-2) prop_mkcolumns_score s1 s2 = (last $ last $ mkcolumns f s1 s2) == needleman_wunsch_score_1 (2,-1,-2) s1 s2 where f = nws (2,-1,-2) -- quis custodet... this always fails when equal chars are aligned -- prop_fails s1 s2 = mkcolumns (nws (3,-2,-3)) s1 s2 == mkcolumns (nws (2,-2,-3)) s1 s2 \end{code} We could backtrack over the set of columns, but there is another option. If we calculate the alignment scores of the first half of s1 against s2 (i.e. the n/2th column), and the alignment score of the reversed second half of s1 against s2 (i.e. going from the bottom rigth corner instead of the top left), we can sum the columns, and find the maximum alignment score. \begin{code} nw3score f s1 s2 = let n = B.length s1 s1a = B.take (n `div` 2) s1 s1b = B.reverse $ B.drop (n `div` 2) s1 s2r = B.reverse s2 col = zipWith (+) (last $ mkcolumns f s1a s2) (reverse $ last $ mkcolumns f s1b s2r) in maximum col prop_nw3score s1 s2 = nw3score (nws (2,-1,-2)) s1 s2 == needleman_wunsch_score_1 (2,-1,-2) s1 s2 \end{code} One observation one is tempted to make, is that the optimal global alignment has to pass through the cell containing the maximum (say it's cell $j$ in the column). Which means we can then divide and conquer by aligning s1[0..n/2] with s2[0..j-1] and s1[n/2+1..n-1] with s2[j..m-1], and concatenate the two alignments to construct the complete global alignments. Now, all those reverses make the code convoluted as well as inefficient. Especially when we want to iterate. So let's make use of the HeadTail parameter to mkcolumns' to do the reversals. \begin{code} prop_revcols s1 s2 = mkcolumns f (B.reverse s1) (B.reverse s2) == mkcolumns' (B.last,B.init) f s1 s2 where f = nws (2,-1,-2) nw4score f s1 s2 = let (s1a,s1b) = B.splitAt (B.length s1 `div` 2) s1 col = zipWith (+) (last $ mkcolumns' (B.head,B.tail) f s1a s2) (reverse $ last $ mkcolumns' (B.last,B.init) f s1b s2) in maximum col prop_nw4score s1 s2 = nw4score (nws (2,-1,-2)) s1 s2 == needleman_wunsch_score_1 (2,-1,-2) s1 s2 \end{code} Iterate to build the alignment. \begin{code} nw4 f s1 s2 = case (B.length s1,B.length s2) of (0,_) -> map Ins (B.unpack s2) (_,0) -> map Del (B.unpack s1) (1,n) -> global_align f s1 s2 -- fall back to traditional (_,_) -> let n = (1+B.length s1) `div` 2 (s1a,s1b) = B.splitAt n s1 (s2a,s2b) = B.splitAt (nw4' f s1a s1b s2) s2 in nw4 f s1a s2a ++ nw4 f s1b s2b nw4' f s1a s1b s2 = let col = zipWith (+) (last $ mkcolumns' (B.head,B.tail) f s1a s2) (reverse $ last $ mkcolumns' (B.last,B.init) f s1b s2) in fromJust $ Data.List.elemIndex (maximum col) col -- this algorithm has a different bias than the previous algorithms, and will -- produce different (but equally valid) alignments prop_nw_nw4 = qc_align (nw4 $ nws (2,-1,-2)) prop_nw_nw4_score s1 s2 = let f = nws (2,-1,-2) in (score_align_1 $ nw4 f s1 s2) == (nw4score f s1 s2) \end{code} Local alignment can be implemented by searching the matrix forward to find the last position of the alignment, and then searching on the reversed strings to find the starting point. With the substrings identified, it becomes a simple global alignment. (Is there a better way?) \end{document}