Discovering patterns and subfamilies in biosequences
Alvis Brazma, Inge Jonassen, Esko Ukkonen, Jaak Vilo
Proceedings of the ISMB-96, p 34-43, AAAI Press 1996.
We consider the problem of automatic discovery of patterns and the
corresponding subfamilies in a set of biosequences. The sequences are
unaligned and may contain noise of unknown level. The patterns are of the
type used in the PROSITE database. In our approach we discover
patterns and the respective subfamilies simultaneously. We develop a
theoretically substantiated significance measure for a set of such
patterns and an algorithm approximating the best pattern set and the
subfamilies. The approach is based on the minimum description length (MDL) principle.
We report a computing experiment correctly finding subfamilies in the family of
chromo domains and revealing new strong patterns.
Keywords: pattern discovery, sequence motifs, machine learning,
protein subfamilies, PROSITE, clustering, algorithms, Bayesian inference
About the program MDL-Pratt
In order to test the approach described in the paper, we developed a program
for finding a collection of patterns and subfamilies. It uses the program
Pratt (version 2.0) to find patterns shared by subsets of the given set of
sequences. A greedy set-cover algorithm is used to choose patterns approximating
the best pattern set (as defined in the paper, using the MDL principle).
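
As a rough illustration of the greedy set-cover step, the following Python sketch repeatedly picks the pattern with the lowest cost per newly covered sequence. It is not the MDL-Pratt code: the function name, the input dictionaries, and the numeric costs are hypothetical stand-ins (in the paper the cost comes from the MDL-based significance measure), and the sketch assumes every sequence must be covered, whereas the approach described here allows for noise.

```python
def greedy_pattern_cover(matches, cost):
    """Greedy weighted set cover (illustrative sketch, not the MDL-Pratt code).

    matches: dict mapping a pattern to the set of sequence ids it matches.
    cost:    dict mapping a pattern to a positive number (hypothetical
             stand-in for an MDL-style description cost).
    Returns a list of (pattern, newly_covered_sequence_ids) in the order chosen.
    """
    # Only sequences matched by at least one pattern can be covered.
    uncovered = set().union(*matches.values()) if matches else set()
    chosen = []
    while uncovered:
        # Pick the pattern with the lowest cost per newly covered sequence.
        best = min(
            (p for p in matches if matches[p] & uncovered),
            key=lambda p: cost[p] / len(matches[p] & uncovered),
        )
        newly_covered = matches[best] & uncovered
        chosen.append((best, sorted(newly_covered)))
        uncovered -= newly_covered
    return chosen


# Hypothetical PROSITE-style patterns and sequence ids, for illustration only.
example_matches = {
    "C-x(2)-H": {"s1", "s2", "s3"},
    "G-[AS]-W": {"s3", "s4"},
    "K-x-P": {"s4"},
}
example_cost = {"C-x(2)-H": 2.4, "G-[AS]-W": 2.0, "K-x-P": 2.5}
example_cover = greedy_pattern_cover(example_matches, example_cost)
```

Each chosen pattern together with the sequences it newly covers can be read as one candidate subfamily. The greedy choice only approximates the best pattern set, which is consistent with the description above.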
In order to run MDL-Pratt, you need to:
download the Perl script MDL-Pratt (one file, see below) and make sure that the file has its executable bit set;
download and install Pratt version 2.0 in the same directory;
put your sequences in a single file <seqs> in Fasta format.
MDL-Pratt will create some temporary files while executing.
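
The preparation steps can be checked with a small Python sketch such as the one below. It is only an assumption about a convenient workflow, not part of MDL-Pratt: the helper names are hypothetical, the script and sequence file paths are whatever you chose when downloading, and the Fasta check is deliberately minimal (header lines starting with ">" followed by sequence lines).

```python
import os
import stat


def make_executable(path):
    """Add the executable bits for user, group and others, keeping other mode bits."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


def read_fasta(path):
    """Return a list of (header, sequence) pairs; raise ValueError on malformed input."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is None:
                raise ValueError("sequence data found before the first '>' header")
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Calling `make_executable` on the downloaded script and `read_fasta` on your sequence file before running the program catches the two most likely setup mistakes: a missing executable bit and a file that is not in Fasta format.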
The authors give no guarantee about the reliability of the
software or the quality of its output. If problems occur, please
Page compiled by: Inge Jonassen.