Pattern discovery from large data sets

Jaak Vilo, Department of Computer Science, University of Helsinki

We consider the problem of automatically discovering patterns that occur frequently in a string or a set of strings. The patterns belong to different subclasses of regular languages. In the simplest case they are substrings of the original strings; more general patterns can also contain character group positions and wildcards. We show how suffix tree construction algorithms can be used to discover substring patterns and how they can be extended to discover more general pattern classes. Some applications of pattern discovery in the analysis of DNA and protein sequences are demonstrated.
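
As an illustration of the simplest case (substring patterns), the following Python sketch counts how often each substring occurs and keeps the frequent ones. It is not the algorithm presented in the talk: it builds a naive, uncompressed suffix trie truncated at a fixed depth, which takes more time and space than a proper suffix tree construction. The function name frequent_substrings and the parameters min_count and max_len are assumptions made for this example.

```python
def frequent_substrings(strings, min_count, max_len):
    """Return {substring: occurrences} for every substring of length
    1..max_len that occurs at least min_count times in total.

    Assumption: a naive suffix trie truncated at depth max_len stands in
    for a real suffix tree, to keep the sketch short.
    """
    # Each trie node stores how many suffixes pass through it, which is
    # the number of occurrences of the substring spelled by its path.
    root = {"count": 0, "children": {}}
    for s in strings:
        for start in range(len(s)):
            node = root
            for ch in s[start:start + max_len]:
                node = node["children"].setdefault(
                    ch, {"count": 0, "children": {}}
                )
                node["count"] += 1

    # Depth-first walk. Counts never increase with depth, so a subtree
    # whose root is below the threshold can be skipped entirely.
    result = {}
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        for ch, child in node["children"].items():
            if child["count"] >= min_count:
                result[prefix + ch] = child["count"]
                stack.append((prefix + ch, child))
    return result


if __name__ == "__main__":
    # Toy DNA-like input; the sequences are made up for illustration.
    print(frequent_substrings(["ACGTACGT", "TACG"], min_count=3, max_len=3))
```

The pruning step relies on the fact that a pattern cannot occur more often than any of its prefixes, so only branches that are already frequent need to be extended.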
