Patterns in biosequences

This page gives a very brief introduction to the use of patterns in sequence analysis. First we motivate why patterns are useful when analysing protein sequences and families. Next we briefly discuss how, given a pattern and a sequence/string, how all matches between the pattern and the string can be found. Finally we give a very brief introduction to some methods for discovering patterns in sets of biosequences. All of this is mainly focused on protein sequences, but similar methods may also be applied to DNA/RNA sequences.

Why patterns?

A common problem in protein sequence analysis is to search for common sequence patterns or motifs in groups of functionally related proteins. Such patterns may be the result of common ancestry combined with conservative evolutionary pressure to maintain important resdues at active sites and other functionally important parts of the protein. It is not always possible to identify conserved patterns in protein families. When they do occur, however, they can be very simple and useful tools in helping to identify new members of the families and in trying to understand the relationship between sequence, structure and function (Jonassen et al, 1995).

An example of use of patterns, is the PROSITE database (Bairoch and Bucher, 1994) of protein families. For most of the families in the database, diagnostic patterns are given. A diagnostic pattern is a pattern that matches all members of the family and no other sequences. Not all patterns are perfect diagnostic patterns. A pattern may have false positives (sequences not in the family matching the pattern), and false negatives (members of the family not matching the pattern). When a new sequence is to be analysed, it can be matched against all the patterns in PROSITE to see if it belongs to one of the families. The PROSITE pattern language is a subset of the regular expressions. An example of a PROSITE pattern is (from PROSITE entry PS00518):

ID   ZINC_FINGER_C3HC4; PATTERN.
PA   C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]

This pattern will match any sequence (string) containing a segment starting with C followed by any symbol, followed by an H, followed by any symbol, followed by one of L, I, V, M, F, or Y, followed by a C, followed by any two symbols, followed by a C, followed by one of L, I, F, M, Y, or A.

PROSITE also contains profiles for some families (Bucher and Bairoch, 1994). A profile is a position-specific scoring table (Gribskov et al, 1987). For a protein family it may be possible to construct a profile, and use this to decide for a new seequence whether or not it is member of the family.

Match known pattern against string

Given a pattern and a string, there exist a number of algorithms for finding all matches to the pattern in the string. The matching may be exact or approximate. For some algorithms, see (Myers and Miller, 1989; Manber and Wu, 1992; Wu 1992). There is a number of tools available for matching a new sequence against all patterns in PROSITE, e.g. ScanProsite.

Discovering new patterns.

For a survey on different approaches to the automatic discovery of patterns in biosequences, see (Brazma et al., 1995). Having a set of sequences believed to be related, one may want to find a conserved pattern. There are (at least) two alternative ways of approaching the problem:

Do global multiple sequence alignment (e.g., using Clustal W, Thompson et al. 1994), and then look for conserved blocks.
Use pattern discovery algorithm, e.g. Pratt.

The first appoach will normally work well if the sequences are quite similar, globally. Then the automatic method for multiple alignment will probably get the alignment about right, and conserved patterns will show up as conserved or semi-conserved columns.

A number of methods have been proposed for the second problem. A simple is to look for shared substrings. For example given a set of strings S, find all substrings of length 5 occuring in all sequences in S. This method can find patterns of the form

	P= *ABCDEF*

(* matching an arbitrary number of arbitrary symbols - variable length don't care) where P matches all sequences containing the substring ABCDEF. If we want to be more formal we can define the language L(P) defined by P as the set of strings matching the pattern P.

If several words are shared by all the sequences, and if the words occur in the same order in each sequence, they may be combined into a composite pattern. For example if all strings have occurences of the word ABC and the word DEF, and such that an occurence of ABC is in front of an occurence of DEF, the sequences all match the pattern

	Q= *ABC*DEF*

Approximate matching may be used instead of exact matching. For the pattern P above a sequence s could be said to approximately match P if there exist a string s' such that s' exactly matches P and the distance between s and s' is below a threshold. The distance measure can allow for substitutions, or it may allow for both substitutions and insertions/deletions (indels) of single symbols. Approximate string matching is often called the k-mismatch (k-difference) problem if k substitutions (substitutions/indels) are allowed.

Other approaches are able to discover patterns of the type used in PROSITE. One method (Smith et al., 1989) is able to discover patterns that can be written in the form

 
	 P= A1-x(d1)-A2-x(d2)-A3

where A1, A2, and A3 are amino acid symbols, and d1, d2 are integers. The values of d1 and d2 are limited to be less than some limit, say 15. If the limit 20 is used, there are 20*15*20*15*20= 1,800,00 different patterns. A table is constructed with one entry per pattern, and for each pattern we record all matches in the sequences. Patterns matching all sequences (or an unexpected number of sequences) can be found by scanning this table.

This method was generalised to one finding more general patterns by Neuwand and Green (1994). They can find patterns having more than three positions, and they also allow for some ambiguous symbols (like [IL] matching I or L symbols). They use a block data structure to efficiently find all matches to a pattern, and a depth first search strategy for finding all conserved patterns. They also give algorithms for postprocessing of the identified patterns using profile analysis.

We (Jonassen et al., 1995) generalised this method to also find patterns having flexibility, that is patterns of the form

A(1)-x(i1,j1)-A2-x(i2,j2)-....A(p-1)-x(i(p-1),j(p-1))-A(p)

where the Ak's can be ambiguous symbols, and ik, jk are integers giving the minimum and maximum number of symbols to match the x(ik,jk) wildcard. After the initial search of pattern space using depth first search, the patterns found can be refined using a new pattern refinement algorithm. This is the only algorithm that we are aware of that can find patterns capturing a large subset of the pattern class used in the PROSITE database, allowing for both gaps of variable length and flexibly defined groups of amino-acids.

Survey on methods for Pattern discovery.

Tutorial on Pattern Discovery - given at ISMB-98 by Alvis Brazma and Inge Jonassen.

References.

Brazma A, Jonassen I, Eidhammer I, Gilbert D. (1995) Approaches to the automatic discovery of patterns in biosequences. Journal of Computational Biology 5, 279-305 (1998).
Earlier version: Dept. of Informatics, Univ. of Bergen, Reports in Informatics no 113, Dec. 1995. (Postscript tech report)
Bairoch A, Bucher P. (1994) PROSITE: recent developments. Nucleic Acids Res. 22, 3583-3589.
Bucher P, Bairoch A. (1994) A generalized profile syntax for biomolecular sequence motifs and its fuction in automatic sequence interpretation. Proceedings Second intl. conf. on int. syst. for molec. biol., p.53-56.
Gribskov M, McLachlan A D, Eisenberg D (1987) Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci USA 84, 4355-4358.
Manber U, Wu S. (1992) Approximate pattern matching. BYTE November 1992 p. 281-292.
Myers E W, Miller W. (1989) Approximate marching of regular expressions. Bull. Math. Biol. 51:5-37.
Jonassen I, Collins J F, Higgins D G (1995) Finding flexible patterns in unaligned protein sequences. Protein Science 4, 1587-1595 (see also Pratt home page).
Thompson J D, Higgins D G, Gibson T J. (1994) Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choi. Nucleic Acids Research 22: 4673-4680.
Wu S. (1992) Approximate pattern matching and its applications. Dept. of Computer Science, University of Arizona TR92-21.
See also my list of publications in bioinformatics.

Written by Inge Jonassen.