# INF380 Spring 2006 - Exercise 5

1. Patterns
From Chapter 7 in Eidhammer, Jonassen and Taylor (p. 160): Exercises 1 and 2.
In 2 (a), compute both the probability-based score presented in the lectures and the information-theoretic score from the book.

2. Gene prediction
1. Explain the difference between similarity-based and ab initio gene prediction. Also, explain the difference between signal-based and content-based prediction.
2. Many signals in genes can (at least partly) be modeled as weight matrices. Examples of such signals are promoter elements, translation initiation, splice signals, and transcription termination. A weight matrix of length n for a DNA sequence is an n x 4 matrix W, where $W_{r,a}$ is the score for nucleotide a occurring in position r of the gapless alignment.

Draw a simple Artificial Neural Network (ANN) that implements the scoring of such a weight matrix. The input is a DNA sequence window of length n, using the 4-node-per-base binary encoding as in Pedersen & Nielsen (1997) and Hatzigeorgiou (2002). Label the edges with the corresponding weights. (Choose a small length n for the drawing.) How many hidden nodes do you need in the network?
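As a reminder of the definitions used above, the following minimal sketch scores a window against a weight matrix and builds the binary input encoding. The matrix values, the function names `score_window` and `encode_window`, and the column order A, C, G, T are illustrative assumptions, not taken from the lectures or the cited papers; the encoding is assumed to be one-hot (exactly one of the four nodes per base is set to 1).

```python
# Column order of the weight matrix and of the 4 input nodes per base (assumed).
ALPHABET = "ACGT"

# Hypothetical weight matrix for n = 3: one row per position, one column per base.
W = [
    [1.0, -1.0, 0.5, -0.5],
    [-2.0, 2.0, 0.0, 0.0],
    [0.5, 0.5, -1.0, 1.5],
]


def score_window(window, weights):
    """Sum the entries W[r][a] for the base a found at each position r."""
    if len(window) != len(weights):
        raise ValueError("window length must equal the matrix length n")
    return sum(weights[r][ALPHABET.index(a)] for r, a in enumerate(window))


def encode_window(window):
    """Return 4 binary inputs per base, with a single 1 marking the base."""
    bits = []
    for a in window:
        bits.extend(1 if a == b else 0 for b in ALPHABET)
    return bits


print(score_window("ACT", W))   # 1.0 + 2.0 + 1.5 = 4.5
print(encode_window("ACT"))     # 12 inputs for a window of length 3
```

A window of length n yields 4n binary inputs, which is the input layer the drawing should start from.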

3. How can a more general ANN use information about the pattern that cannot be represented by a weight matrix? What are the drawbacks of using a general ANN?

4. Alternatively, we could use a Hidden Markov Model (HMM) to implement the weight matrix. An HMM works in a probabilistic setting, so instead of using the scores $W_{r,a}$, let $f_{r,a}$ be the probability that a real occurrence of the pattern contains nucleotide a in position r, and let $p_a$ be the background probability of nucleotide a. Design two simple HMMs for patterns and non-patterns, and combine them into a single HMM. Explain how you could use it to find the real patterns in a long sequence.
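To connect the probabilities back to the scores of question 2, one common convention (an assumption here, not necessarily the one used in the lectures) is to take the log-odds log2(f[r][a] / p[a]) as the weight matrix entry. The sketch below uses made-up probabilities and hypothetical function names; it only illustrates that relation and is not a design for the HMM itself.

```python
import math

ALPHABET = "ACGT"  # assumed column order

# Hypothetical pattern probabilities for n = 2; each row sums to 1.
f = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]
# Hypothetical uniform background distribution.
p = [0.25, 0.25, 0.25, 0.25]


def log_odds_matrix(f, p):
    """Convert position-specific and background probabilities to log-odds scores."""
    return [[math.log2(f_ra / p_a) for f_ra, p_a in zip(row, p)] for row in f]


def log_odds_score(window, f, p):
    """Log-odds of the window under the pattern model versus the background."""
    w = log_odds_matrix(f, p)
    return sum(w[r][ALPHABET.index(a)] for r, a in enumerate(window))


print(log_odds_score("AG", f, p))  # positive: the pattern model fits better
print(log_odds_score("CC", f, p))  # negative: the background fits better
```

All entries of `f` must be non-zero for the logarithm to be defined; in practice this is usually ensured with pseudocounts.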

5. How can the HMM be extended to allow insertions and deletions in the signal? (This means that our weight matrix has turned into a profile.) Compare the properties that can be modeled by the HMM and the ANN discussed previously.