This page gives a very brief introduction to the use of patterns in sequence analysis. First we motivate why patterns are useful when analysing protein sequences and families. Next we briefly discuss how, given a pattern and a sequence/string, how all matches between the pattern and the string can be found. Finally we give a very brief introduction to some methods for discovering patterns in sets of biosequences. All of this is mainly focused on protein sequences, but similar methods may also be applied to DNA/RNA sequences.
An example of use of patterns, is the
PROSITE
database (Bairoch and Bucher, 1994) of protein families. For most
of the families in the database, diagnostic patterns are given.
A diagnostic pattern is a pattern that matches all members of
the family and no other sequences. Not all patterns are perfect
diagnostic patterns. A pattern may have false positives (sequences
not in the family matching the pattern), and false negatives
(members of the family not matching the pattern).
When a new sequence is to be analysed, it can be matched against all
the patterns in PROSITE to see if it belongs to one of the families.
The PROSITE pattern language is a subset of the regular expressions.
An example of a PROSITE pattern is (from PROSITE entry PS00518):
PROSITE also contains profiles for some families
(Bucher and Bairoch, 1994).
A profile is a position-specific scoring table (Gribskov et al, 1987).
For a protein family it may be possible to construct a profile,
and use this to decide for a new seequence whether or not it is
member of the family.
A number of methods have been proposed for the second problem. A simple
is to look for shared substrings. For example given a set of strings S,
find all substrings of length 5 occuring in all sequences in S. This method
can find patterns of the form
If several words are shared by all the sequences, and if the words occur in
the same order in each sequence, they may be combined into a composite pattern.
For example if all strings have occurences of the word ABC and the word DEF, and
such that an occurence of ABC is in front of an occurence of DEF, the sequences
all match the pattern
Approximate matching may be used instead of exact matching. For
the pattern P above a sequence s could be said to approximately
match P if there exist a string s' such that s' exactly matches P and
the distance between s and s' is below a threshold.
The distance measure can allow for substitutions, or it may allow for
both substitutions and insertions/deletions (indels) of single symbols.
Approximate string matching is often called the k-mismatch (k-difference)
problem if k substitutions (substitutions/indels) are allowed.
Other approaches are able to discover patterns of the type used in PROSITE.
One method (Smith et al., 1989) is able to discover patterns that can
be written in the form
This method was generalised to one finding more general patterns by
Neuwand and Green (1994). They can find patterns having more
than three positions,
and they also allow for some ambiguous symbols (like [IL] matching
I or L symbols). They use a block data structure to efficiently
find all matches to a pattern, and a depth first search strategy for
finding all conserved patterns.
They also give algorithms for postprocessing of the identified patterns
using profile analysis.
We (Jonassen et al., 1995) generalised this method to also find
patterns having flexibility, that is patterns of the form
Written by Inge Jonassen.
Why patterns?
A common problem in protein sequence analysis is to search for common
sequence patterns or motifs in groups of functionally related proteins.
Such patterns may be the result of common ancestry combined with
conservative evolutionary pressure to maintain important resdues at
active sites and other functionally important parts of the protein.
It is not always possible to identify conserved patterns in protein
families. When they do occur, however, they can be very simple and
useful tools in helping to identify new members of the families and
in trying to understand the relationship between sequence, structure
and function (Jonassen et al, 1995).
ID ZINC_FINGER_C3HC4; PATTERN.
PA C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]
This pattern will match any sequence (string) containing a segment
starting with C followed by any symbol, followed by an H, followed
by any symbol, followed by one of L, I, V, M, F, or Y, followed by
a C, followed by any two symbols, followed by a C, followed by
one of L, I, F, M, Y, or A. Match known pattern against string
Given a pattern and a string, there exist a number of algorithms for finding
all matches to the pattern in the string. The matching may be exact or
approximate. For some algorithms, see (Myers and Miller, 1989; Manber and Wu,
1992; Wu 1992). There is a
number of tools
available for matching a new sequence against all patterns in PROSITE,
e.g. ScanProsite.
Discovering new patterns.
For a survey on different approaches to the automatic discovery of patterns
in biosequences, see (Brazma et al., 1995).
Having a set of sequences believed to be related, one may want to find
a conserved pattern. There are (at least) two alternative ways of
approaching the problem:
The first appoach will normally work well if the sequences are quite
similar, globally. Then the automatic method for multiple alignment will
probably get the alignment about right, and conserved patterns will show
up as conserved or semi-conserved columns.
P= *ABCDEF*
(* matching an arbitrary number of arbitrary symbols - variable length don't care)
where P matches all sequences containing the substring ABCDEF.
If we want to be more formal we can define the language L(P) defined by P
as the set of strings matching the pattern P.
Q= *ABC*DEF*
P= A1-x(d1)-A2-x(d2)-A3
where A1, A2, and A3 are amino acid symbols, and d1, d2 are integers.
The values of d1 and d2 are limited to be less than some limit, say 15.
If the limit 20 is used, there are 20*15*20*15*20= 1,800,00 different
patterns. A table is constructed with one entry per pattern, and
for each pattern we record all matches in the sequences. Patterns
matching all sequences (or an unexpected number of sequences) can
be found by scanning this table.
A(1)-x(i1,j1)-A2-x(i2,j2)-....A(p-1)-x(i(p-1),j(p-1))-A(p)
where the Ak's can be ambiguous symbols, and ik, jk are integers
giving the minimum and maximum number of symbols to match the
x(ik,jk) wildcard. After the initial search of pattern space using
depth first search, the patterns found can be refined using a new
pattern refinement algorithm. This is the only algorithm that
we are aware of that can find patterns capturing a large
subset of the pattern class used in the PROSITE database, allowing
for both gaps of variable length and flexibly defined groups of
amino-acids.
Survey on methods for Pattern discovery.
Tutorial on Pattern Discovery - given at ISMB-98 by Alvis Brazma and Inge Jonassen.
Related links:
References.
Earlier version: Dept. of Informatics, Univ. of Bergen, Reports in Informatics no 113, Dec. 1995. (Postscript tech report)