# INF380 Spring 2006: Compulsory Assignment

Send your solution by mail to coward@ii.uib.no by March 9. A satisfactory solution is required in order to take the exam.

## Problem 1: Multiple alignment: Using CLUSTALW

This is a simple exercise to demonstrate the use of CLUSTALW. It corresponds to Exercise 11 in Chapter 4 of Eidhammer, Jonassen and Taylor, with some additional explanation. All answers should be brief!

A web-interface to a CLUSTALW server can be found at http://www.ebi.ac.uk/clustalw/.

Two small datasets to be aligned are found here:

(Note slightly different sequence formats: PIR format and FASTA format, respectively; both are accepted as input). Do the following for each sequence set in turn.

1. First, generate a guide tree and a multiple alignment. This is done by selecting "none" as tree type under Phylogenetic tree. How is the guide tree computed, and what is it used for?
2. Then generate a phylogenetic tree from the alignment. This is done by using the alignment file from the first step as input (not the raw sequences), and selecting tree type "nj" (for example). Compare the tree with the guide tree from a. Is there any difference? Which tree is most likely to show the true evolutionary relationship between the sequences? Why?
3. Generate a new tree, selecting the option that ignores gaps. Explain exactly what this option does, and compare the alignment with the one in b. Which of the two datasets gives the most different trees? Why?
4. Do a new multiple alignment after changing the gap open penalty to 1. Comment on the effect on the alignment.

## Problem 2: Multiple alignment: Doing It Yourself

In Eidhammer, Jonassen and Taylor, Chapter 4, do Exercise 9 a, b, and c (p. 98). This problem will give some insight in the principles of multiple alignment by computing a small pair-guided progressive alignment by hand. How does the procedure differ from the CLUSTALW approach?

## Problem 3: Hidden Markov Models

In this problem we are going to use a simple Hidden Markov Model (HMM) for predicting GC-rich regions, like the one presented in the lectures. The model has two states for GC-rich and GC-poor regions, and the symbols emitted are the DNA nucleotides A, C, G, T.

1. Given this HMM structure, it remains to estimate the model parameters. What are they? How many independent parameters are there?
2. We are given some examples of Use these data to estimate the emission probabilities.
3. We are told that in our data, we should expect that 10% of the sequence is in GC-rich regions. The average length of a GC-rich region is 20 bp. Use this to estimate the transition probabilities.
4. From the transition probability matrix, compute the stationary distribution. Does it agree with the data in the previous question? Explain why is it reasonable to set the initial state probabilities equal to the stationary distribution.
5. Optional: Write a computer program that takes a DNA sequence as input and uses the Viterbi algorithm to find the most probable state sequence in this model. Use it to predict the GC-regions in this sequence, which was randomly generated by the HMM in this problem. (You could also check your prediction by writing a program to generate such sequences yourself, and compare your predictions with the known state sequence.)
6. CpG islands are special GC-rich regions where the dinucleotide CG occurs particularly frequently. Explain why our HMM model is not very well designed to detect this feature. Describe a revised HMM with eight states that could be used for this purpose. What are now the emission probabilities for each state? How many independent parameteres are there in the HMM?
7. Optional: How can the data in b be used to estimate some of the parameters in the model? How could they be used to decide whether the revised model is better than the simple two-state model? Finally, how would you estimate the state transitions that we could not obtain from the data in b?

Eivind Coward
Last update: Feb 24, 2006