Methods for finding motifs in sets of related biosequences

Dr. scient thesis


Inge Jonassen,

Dept. of Informatics,
University of Bergen,


The automatic discovery of patterns conserved in groups of related biological sequences is an important problem in molecular biology. This thesis discusses the problem and presents a systematisation of a large number of reported methods. New methods are proposed for the automatic discovery of patterns, and of collections of patterns, in sets of unaligned protein sequences. The methods are able to discover patterns of a quite general type, and are guaranteed to find the conserved patterns that score best according to a defined evaluation function. Both non-heuristic and heuristic search methods are proposed. The problem of evaluating discovered patterns is discussed, and several new evaluation functions are proposed. The new functions are shown to have useful properties on a set of test cases. The methods proposed in this thesis were designed primarily for analysing protein sequences, but they may also be applicable to nucleotide (DNA/RNA) sequences and possibly to other types of sequence data.


bioinformatics, protein sequences, pattern discovery, machine learning, search methods, PROSITE, minimum description length principle

The thesis:

The thesis consists of:

See also the entry in the University's Dissertation Database

Inge's Home page.