Pattern ranking functions that take sequence diversity into account
Inge Jonassen, Carsten Helgesen, Desmond Higgins.
Unpublished manuscript.
Abstract:
An important problem in sequence analysis is to find patterns matching sets
or subsets of sequences. The functions used to evaluate such patterns have
a major role to play in deciding which patterns, if any, are found. This
paper proposes that an evaluation scheme for measuring pattern quality should
take into account both the strength of the pattern and the diversity of the
matched sequences. A pattern that matches a very divergent set of sequences
should get a higher score, and hence ranking, than an equally strong pattern
matching a set of relatively similar sequences.
Ideal requirements for such a scoring scheme are given.
It is assumed that the strength of a pattern and the diversity of the
sequence set can be evaluated independently, and combined into a
total score for the match. We use a restricted class of PROSITE-like
patterns, and an earlier reported method for evaluating pattern strength.
Two alternative schemes for evaluating the diversity of a set of
sequences are proposed. One uses a dendrogram (an estimated phylogenetic
tree) and the other uses a minimum cost spanning tree. Algorithms
and practical applications are given. The combined measures are
shown to have useful properties for a set of test cases.
Information on the test cases, and on how to download the source code for
the Pratt program used here, is available at
http://www.ii.uib.no/~inge/papers/diversity/.
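The spanning tree idea mentioned in the abstract can be illustrated with a small sketch. This is not the code used in Pratt, and the paper's exact diversity measure may differ: here we simply assume that diversity is the total edge weight of a minimum cost spanning tree over pairwise distances, and we assume a toy distance (the fraction of differing positions between equal-length sequences). The names hamming_fraction and mst_cost are hypothetical.

```python
def hamming_fraction(a, b):
    """Toy distance (an assumption): fraction of differing positions."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mst_cost(seqs, dist=hamming_fraction):
    """Total edge weight of a minimum cost spanning tree (Prim's algorithm)."""
    n = len(seqs)
    if n < 2:
        return 0.0
    # best[i] is the cheapest known edge linking sequence i to the tree.
    best = [dist(seqs[0], s) for s in seqs]
    in_tree = [False] * n
    in_tree[0] = True
    total = 0.0
    for _ in range(n - 1):
        # Pick the sequence outside the tree that is cheapest to attach.
        j = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        total += best[j]
        in_tree[j] = True
        # Attaching j may offer cheaper edges to the remaining sequences.
        for i in range(n):
            if not in_tree[i]:
                best[i] = min(best[i], dist(seqs[j], seqs[i]))
    return total


similar = ["ACDEF", "ACDEF", "ACDEG"]
divergent = ["ACDEF", "WYHKL", "MNPQR"]
print(mst_cost(similar), mst_cost(divergent))
```

Under these assumptions the divergent set gets the larger value, so an equally strong pattern matching it would be ranked higher once strength and diversity are combined.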
Keywords:
pattern discovery, PROSITE, scoring scheme, sequence diversity, pattern strength.
See also
Scoring function for pattern discovery programs taking into account sequence diversity.
Inge Jonassen, Carsten Helgesen, Desmond Higgins.
Dept. of Informatics, Univ. of Bergen, Reports in Informatics no. 116, February 1996.
Full text (postscript).
Test cases:
How to run the programs:
We are planning to install a WWW interface to Pratt which will allow you
to test the program without downloading it and installing it on your own
system. At the moment you need to do the following.
- Download the source code for the Pratt program, version 2.0 (Pratt2.tar),
  from our ftp server.
- Install Pratt by doing
    $ tar xvf Pratt2.tar
    $ make
  This should give you an executable file pratt.
- If you have not installed Clustal W already, you need to do so.
- If you want to analyse a set of sequences in a file seqs, run Clustal W
  with these sequences as input to produce a dendrogram seqs.dnd.
- Run Pratt to search for patterns conserved in some proportion of the
  sequences in the file seqs:
    $ pratt fasta seqs
  if the sequences are in FastA format, or
    $ pratt swissprot seqs
  if they are in Swiss-Prot format.
Use option M to set the minimum number of sequences that a pattern should
match, and use option T to make Pratt use the scoring function described in
the paper, which evaluates sequence diversity from a dendrogram.
Pratt will then search for patterns matching at least the number of
sequences set with option M and rank them according to that function. The
patterns are written to a file.
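To get a feel for what the dendrogram file seqs.dnd contributes, the sketch below sums the branch lengths in a tree written in Newick format, which is the format Clustal W uses for its dendrograms. This is only an illustration: treating the total branch length as a diversity value is an assumption, not necessarily the measure that Pratt computes, and the function name total_branch_length is hypothetical. It also assumes that sequence names do not themselves contain a colon followed by a number.

```python
import re

# A branch length in Newick follows a colon, e.g. "seqA:0.12".
# Negative lengths can occur in neighbour-joining trees, so a sign is allowed.
BRANCH_LENGTH = re.compile(r":\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def total_branch_length(newick_text):
    """Sum all branch lengths found in a Newick tree string."""
    return sum(float(x) for x in BRANCH_LENGTH.findall(newick_text))


# A small made-up tree; in practice the text would be read from seqs.dnd.
example = "(\nseqA:0.1,\n(\nseqB:0.2,\nseqC:0.3)\n:0.05);"
print(total_branch_length(example))
```

A set of closely related sequences gives short branches and a small total, while a divergent set gives a larger one.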
Related pages:
Inge Jonassen's homepage.