Pattern ranking functions that take sequence diversity into account

Inge Jonassen, Carsten Helgesen, Desmond Higgins.

Unpublished manuscript.

Abstract:

An important problem in sequence analysis is to find patterns matching sets or subsets of sequences. The functions used to evaluate such patterns have a major role to play in deciding which patterns, if any, are found. This paper proposes that an evaluation scheme for measuring pattern quality should take into account both the strength of the pattern and the diversity of the matched sequences. A pattern that matches a very divergent set of sequences should get a higher score, and hence ranking, than an equally strong pattern matching a set of relatively similar sequences. Ideal requirements for such a scoring scheme are given. It is assumed that the strength of a pattern and the diversity of the sequence set can be evaluated independently, and combined into a total score for the match. We use a restricted class of PROSITE-like patterns, and an earlier reported method for evaluating pattern strength. Two alternative schemes for evaluating the diversity of a set of sequences are proposed. One uses a dendrogram (an estimated phylogenetic tree) and the other uses a minimum cost spanning tree. Algorithms and practical applications are given. The combined measures are shown to have useful properties for a set of test cases. Information on the test cases, and on how to download source code for the Pratt program used here, is available at http://www.ii.uib.no/~inge/papers/diversity/.

Keywords:
pattern discovery, PROSITE, scoring scheme, sequence diversity, pattern strength.
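
The idea of measuring diversity with a minimum cost spanning tree can be illustrated with a small sketch. It is not taken from the paper or from Pratt: the function name mst_cost, the use of Prim's algorithm, and the example distance matrices are all assumptions made for illustration, and the paper's actual definition of diversity may differ. The sketch only shows that the total cost of a minimum spanning tree over pairwise distances grows as the matched sequences become more divergent.

```python
def mst_cost(dist):
    """Total edge weight of a minimum spanning tree (Prim's algorithm).

    dist is a symmetric matrix (list of lists) of pairwise distances
    between sequences; the values used below are hypothetical.
    """
    n = len(dist)
    if n == 0:
        return 0.0
    in_tree = [False] * n
    # best[v] = cheapest known edge connecting v to the growing tree
    best = [float("inf")] * n
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # pick the cheapest vertex not yet in the tree
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        # relax edges from u to the remaining vertices
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
    return total


# Hypothetical pairwise distances for two sets of three sequences.
similar = [[0.0, 0.1, 0.1],
           [0.1, 0.0, 0.1],
           [0.1, 0.1, 0.0]]
divergent = [[0.0, 0.8, 0.9],
             [0.8, 0.0, 0.7],
             [0.9, 0.7, 0.0]]

# The divergent set yields the larger spanning tree cost.
print(mst_cost(similar), mst_cost(divergent))
```

Under these assumptions, an equally strong pattern matching the divergent set would be paired with the larger diversity value, which is the behaviour the abstract asks of a ranking function.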

See also
Scoring function for pattern discovery programs taking into account sequence diversity.
Inge Jonassen, Carsten Helgesen, Desmond Higgins.
Dept. of Informatics, Univ. of Bergen, Reports in Informatics no 116, Febr. 1996.
Full text (postscript).


Test cases:


How to run the programs:

We are planning to install a WWW interface to Pratt which will allow you to test the program without downloading it and installing it on your own system. At the moment you need to do the following.
  1. Download source code for the Pratt program - version 2.0 - (Pratt2.tar) from our ftp server.
  2. Install Pratt by doing This should give you an executable file pratt.
  3. If you have not installed Clustal W already, you need to do so.
  4. If you want to analyse a set of sequences in a file seqs, run Clustal W with these sequences as input to produce a dendrogram seqs.dnd.
  5. Run Pratt to search for patterns conserved in some proportion of the sequences in the file seqs:
    Use option M to set the minimum number of sequences that a pattern should match, and use option T to instruct Pratt to use the scoring function described in the paper, which evaluates sequence diversity from a dendrogram. Pratt will then search for patterns matching at least the number of sequences set with option M and rank them according to that function. The patterns are output to a file.

Related pages:


Inge Jonassen's homepage.