Many signals in genes can (at least partly) be modeled as weight matrices. Examples of such signals are promoter elements, translation initiation sites, splice signals, and transcription termination signals. A weight matrix of length n for a DNA sequence is an n x 4 matrix W, where W_{r,a} is the score for nucleotide a occurring at position r of the gapless alignment.
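As an illustration of the definition above, the score of a sequence window is the sum of the entries W_{r,a} selected by the nucleotide at each position. The following is a minimal sketch in Python; the matrix values, the example sequence, and the function names score_window and scan are made up for illustration and are not part of the exercise.

```python
# Column order of the n x 4 matrix; an assumption made for this sketch.
ALPHABET = "ACGT"


def score_window(W, window):
    """Sum W[r][a] over the positions r of a window of length n."""
    if len(window) != len(W):
        raise ValueError("window length must equal the matrix length n")
    return sum(W[r][ALPHABET.index(a)] for r, a in enumerate(window))


def scan(W, seq):
    """Score every window of length n in seq, from left to right."""
    n = len(W)
    return [score_window(W, seq[i:i + n]) for i in range(len(seq) - n + 1)]


# Hypothetical 3 x 4 weight matrix (rows: positions, columns: A, C, G, T).
W = [
    [2, -1, -1, -1],  # position 0 favours A
    [-1, -1, -1, 2],  # position 1 favours T
    [-1, -1, 2, -1],  # position 2 favours G
]

scores = scan(W, "CCATGA")
print(scores)  # [-3, -3, 6, -3]
```

The window ATG, starting at index 2, receives the highest score because it matches the favoured nucleotide at every position.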
Draw a simple Artificial Neural Network (ANN) that implements the scoring of such a weight matrix. The input is a DNA sequence window of length n, using the 4-node-per-base binary encoding as in Pedersen & Nielsen (1997) and Hatzigeorgiou (2002). Label the edges with the corresponding weights. (Choose a small length n for the drawing.) How many hidden nodes do you need in the network?
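One possible way to check a drawing is to compute the network output numerically. The sketch below assumes a single linear output unit without a bias term, connected directly to the 4n binary input nodes; this is only one design that reproduces the weight-matrix score, and the helper names encode and linear_output as well as the matrix values are hypothetical.

```python
ALPHABET = "ACGT"


def encode(window):
    """4-node-per-base binary encoding: one input node per (position, base)."""
    x = []
    for a in window:
        x.extend(1 if a == b else 0 for b in ALPHABET)
    return x


def linear_output(weights, x):
    """Weighted sum computed by a single linear unit (no bias, no squashing)."""
    return sum(w * xi for w, xi in zip(weights, x))


# Hypothetical weight matrix for n = 2 (rows: positions, columns: A, C, G, T).
W = [
    [2, -1, -1, -1],
    [-1, 2, -1, -1],
]

# The edge from input node (r, a) to the output unit carries the weight W[r][a].
edge_weights = [w for row in W for w in row]

x = encode("AC")
y = linear_output(edge_weights, x)
print(x)  # [1, 0, 0, 0, 0, 1, 0, 0]
print(y)  # 4
```

Because exactly one of the four input nodes per position is active, the weighted sum picks out one entry of W per position, which equals the weight-matrix score of the window.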
How can a more general ANN use information about the pattern that cannot be represented by a weight matrix? What are the drawbacks of using a general ANN?
Alternatively, we could use a Hidden Markov Model (HMM) to implement the weight matrix. An HMM works in a probabilistic setting, so instead of using the scores W_{r,a}, let f_{r,a} be the probability that a real occurrence of the pattern contains nucleotide a at position r, and let p_{a} be the background probability of nucleotide a. Design two simple HMMs for patterns and non-patterns, and combine them into a single HMM. Explain how you could use it to find the real patterns in a long sequence.
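To connect the probabilities f_{r,a} and p_{a} to the scores W_{r,a}, one common choice is the log-odds score log(f_{r,a} / p_{a}), so that the score of a window is the log-likelihood ratio of the pattern model against the background model. The sketch below illustrates only this relation, not the combined HMM asked for above; the probability values and function names are made up, and it assumes that all f_{r,a} are non-zero (for example after adding pseudocounts).

```python
import math

ALPHABET = "ACGT"


def log_odds_matrix(f, p):
    """Weight matrix with entries log(f[r][a] / p[a]); assumes f[r][a] > 0."""
    return [[math.log(f_r[a] / p[a]) for a in ALPHABET] for f_r in f]


def log_likelihood_ratio(f, p, window):
    """log P(window | pattern model) - log P(window | background model)."""
    return sum(math.log(f[r][a] / p[a]) for r, a in enumerate(window))


# Hypothetical uniform background and a pattern of length n = 2.
p = {a: 0.25 for a in ALPHABET}
f = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},  # position 0 prefers A
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},  # position 1 prefers T
]

W = log_odds_matrix(f, p)
llr_match = log_likelihood_ratio(f, p, "AT")  # positive: pattern more likely
llr_other = log_likelihood_ratio(f, p, "CG")  # negative: background more likely
print(llr_match, llr_other)
```

A positive value means the window is more probable under the pattern model than under the background, which suggests one possible criterion for reporting a hit when scanning a long sequence.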
How can the HMM be extended to allow insertions and deletions in the signal? (This means that our weight matrix has turned into a profile.) Compare the properties that can be modeled by the HMM and the ANN discussed previously.