Discovering patterns and subfamilies in biosequences

Alvis Brazma, Inge Jonassen, Esko Ukkonen, Jaak Vilo

Proceedings of the ISMB-96, p 34-43, AAAI Press 1996.


We consider the problem of automatic discovery of patterns and the corresponding subfamilies in a set of biosequences. The sequences are unaligned and may contain noise of unknown level. The patterns are of the type used in the PROSITE database. In our approach, we discover patterns and the respective subfamilies simultaneously. We develop a theoretically substantiated significance measure for a set of such patterns and an algorithm approximating the best pattern set and the subfamilies. The approach is based on the minimum description length (MDL) principle. We report a computing experiment that correctly finds subfamilies in the family of chromo domains and reveals new strong patterns.

Keywords: pattern discovery, sequence motifs, machine learning, protein subfamilies, PROSITE, clustering, algorithms, Bayesian inference, MDL principle

About the program MDL-Pratt

In order to test the approach described in the paper, we developed a program for finding a collection of patterns and subfamilies. The program uses Pratt (version 2.0) to find patterns shared by subsets of the given set of sequences. A greedy set-cover algorithm is then used to choose patterns approximating the best pattern set (as defined in the paper, using the MDL principle).
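The greedy selection step can be illustrated with a minimal sketch. It assumes that each candidate pattern comes with a cost (standing in here for its description length) and the set of sequences it matches. The function name `greedy_pattern_cover`, the costs, and the example data are hypothetical and are not taken from MDL-Pratt itself; the actual significance measure is the one defined in the paper.

```python
def greedy_pattern_cover(sequences, patterns):
    """Greedy weighted set cover.

    patterns maps a pattern name to (cost, set of matched sequence ids).
    The cost is an assumed stand-in for an MDL-style description length.
    Returns the chosen pattern names and the sequences left uncovered.
    """
    uncovered = set(sequences)
    chosen = []
    while uncovered:
        best_name, best_ratio = None, None
        for name, (cost, matched) in patterns.items():
            gain = len(matched & uncovered)  # newly covered sequences
            if gain == 0:
                continue
            ratio = cost / gain  # cost per newly covered sequence
            if best_ratio is None or ratio < best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            break  # remaining sequences match no pattern (treated as noise)
        chosen.append(best_name)
        uncovered -= patterns[best_name][1]
    return chosen, uncovered


# Hypothetical example: five sequences, four candidate patterns.
sequences = ["s1", "s2", "s3", "s4", "s5"]
patterns = {
    "P1": (4.0, {"s1", "s2", "s3"}),
    "P2": (3.0, {"s3", "s4"}),
    "P3": (5.0, {"s1", "s2", "s3", "s4"}),
    "P4": (1.0, {"s4"}),
}
chosen, noise = greedy_pattern_cover(sequences, patterns)
print(chosen, noise)
```

In this sketch each chosen pattern defines one subfamily (the sequences it matches), and sequences matched by no pattern are left over as noise.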

In order to run MDL-Pratt, you need to:

  1. Download the Perl script MDL-Pratt (one file, see below) and make sure that its executable bit is set.
  2. Download and install Pratt version 2.0 in the same directory.
  3. Put your sequences in a single file <seqs> in FASTA format.
  4. Run MDL-Pratt:
    MDL-Pratt <seqs>
MDL-Pratt will create some temporary files while executing.
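
The steps above could be scripted roughly as follows. This is only a sketch under stated assumptions: a Unix-like system, the script saved as `MDL-Pratt` in the current directory next to the Pratt installation, and a sequence file called `seqs.fasta`. The helper names `make_executable`, `mdl_pratt_command`, and `run_mdl_pratt` are hypothetical and are not part of MDL-Pratt.

```python
import os
import stat
import subprocess


def make_executable(path):
    """Set the executable bit for user, group, and others (like chmod +x)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


def mdl_pratt_command(seq_file, script="./MDL-Pratt"):
    """Build the command line 'MDL-Pratt <seqs>' as an argument list."""
    return [script, seq_file]


def run_mdl_pratt(seq_file, script="./MDL-Pratt"):
    """Make the script executable, then run it on the FASTA file.

    Assumes the current directory holds both the script and Pratt 2.0.
    """
    make_executable(script)
    return subprocess.run(mdl_pratt_command(seq_file, script), check=True)
```

Calling `run_mdl_pratt("seqs.fasta")` from the installation directory would then correspond to step 4.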

Relevant links:


The authors give no guarantee about the reliability of the software or the quality of its output. If problems occur, please email

Page compiled by: Inge Jonassen.
